Channel: User Thomas - Stack Overflow

Tokenize String for words including non-word characters


I want to tokenize Twitter messages, keeping hashtags and cashtags intact. A correct tokenization would look like this:

"Bought $AAPL today,because of the new #iphone".match(...);
>>>> ['Bought', '$AAPL', 'today', 'because', 'of', 'the', 'new', '#iphone']

I tried several regexes for this task, e.g.:

"Bought $AAPL today,because of the new #iphone".match(/\b([\w]+?)\b/g);
>>>> ['Bought', 'AAPL', 'today', 'because', 'of', 'the', 'new', 'iphone']

and

"Bought $AAPL today,because of the new #iphone".match(/\b([\$#\w]+?)\b/g);
>>>> ['Bought', 'AAPL', 'today', 'because', 'of', 'the', 'new', 'iphone']

and

"Bought $AAPL today,because of the new #iphone".match(/[\b^#\$]([\w]+?)\b/g);
>>>> ['$AAPL', '#iphone']

Which regex could I use to include the leading hash or dollar sign in the tokens?
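
For reference, here is a minimal sketch of one pattern that appears to produce the desired tokens for the example message. The attempts above seem to fail because `\b` only matches between a word character and a non-word character, so it never matches between a space and `$` or `#` (both are non-word characters), and inside a character class `\b` means a backspace rather than a word boundary. Making the marker an optional prefix and dropping the leading `\b` avoids both problems. The function name `tokenizeTweet` is hypothetical and used only for illustration.

```javascript
// Hypothetical helper; one possible approach, not the only one.
function tokenizeTweet(text) {
  // [$#]? optionally consumes a leading cashtag or hashtag marker,
  // \w+ then matches the word itself (letters, digits, underscore).
  // match() returns null when nothing matches, so fall back to [].
  return text.match(/[$#]?\w+/g) || [];
}

const tokens = tokenizeTweet("Bought $AAPL today,because of the new #iphone");
console.log(tokens);
// [ 'Bought', '$AAPL', 'today', 'because', 'of', 'the', 'new', '#iphone' ]
```

Note that this sketch also attaches a marker that sits in the middle of a word to the following token (for example, `a$b` is split into `a` and `$b`), and a bare `$` or `#` with no word after it is dropped.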

