Reputation: 2055
I know this question has been asked here and here, but I ran into a small problem when I tried it out:
x<- str_extract("Hello peopllz! My new home is #crazy gr8! #wow", "#\S+")
Error: '\S' is an unrecognized escape in character string starting "#\S"
I changed the regex to "#(.+) ?" and then to "#\\s", but neither extracted the hashtags.
I then tried the gsub way:
x<- gsub("[^#(.+) ?]","","Hello! #London is gr8. #Wow")
It gave: " # . #"
Any ideas where I am going wrong? I'd like my output as a vector/list of all the hashtags in the tweet (without the hashes!)
Edit: I would prefer not tokenizing the tweet, because: 1. I am not tokenizing the tweets for the rest of my program, 2. It would become a very expensive step were I to scale it to handle large volumes of tweets.
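For reference, a base-R sketch (no tokenizing and no extra packages) that returns the hashtags as a character vector without the hashes, using a lookbehind so the `#` itself is never part of the match:

```r
tweet <- "Hello peopllz! My new home is #crazy gr8! #wow"

# (?<=#)\S+ matches a run of non-space characters preceded by "#";
# perl = TRUE enables the lookbehind syntax
tags <- regmatches(tweet, gregexpr("(?<=#)\\S+", tweet, perl = TRUE))[[1]]
tags
# [1] "crazy" "wow"
```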
Upvotes: 6
Views: 8990
Reputation: 14872
Use "#\\S+" instead of "#\S+".
str_extract_all("Hello peopllz! My new home is #crazy gr8! #wow", "#\\S+")
# [[1]]
# [1] "#crazy" "#wow"
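Since the question asks for the tags without the hashes, one simple follow-up (a sketch) is to strip the leading `#` from each match afterwards:

```r
library(stringr)

tags <- str_extract_all("Hello peopllz! My new home is #crazy gr8! #wow", "#\\S+")[[1]]
sub("^#", "", tags)  # drop the leading "#" from each tag
# [1] "crazy" "wow"
```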
There are two levels of parsing going on here. Before the low-level regex engine inside str_extract gets the pattern you want to search for (i.e. "#\S+"), the string is first parsed by R. R does not recognize \S as a valid escape sequence and throws an error. By escaping the backslash as \\ you tell R to pass \ and S through as two ordinary characters to the regex engine, instead of trying to interpret them as one escape sequence.
This can produce rather bizarre expressions. Imagine that you have a list of addresses of computers on a Windows network of the form "\\computer". To search for one you would need to type str_extract(adr, "\\\\\\w+"), which internally becomes the regex "\\\w+" and is then used for the search.
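A quick sketch of that doubling in action (the string below holds one literal backslash; the six backslashes in the source pattern become \\ in the regex, which matches a single literal backslash):

```r
library(stringr)

adr <- "\\computer"           # one literal backslash followed by "computer"
nchar(adr)                    # 9 characters: the backslash counts only once
str_extract(adr, "\\\\\\w+")  # six source backslashes -> regex \\\w+
# [1] "\\computer"
```

Note that R prints the result with the backslash re-escaped; the stored string still contains only one.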
Upvotes: 11
Reputation: 40186
Just chiming in. Depending on how you access the Twitter data, this information may already be parsed for you. For example, if you access the sample stream, the raw JSON format has an entry that parses the references, tags, etc., into an array for you. See the Twitter API documentation here.
Upvotes: 3