Reputation: 2622
How can I use regex in R to extract Twitter usernames from a string of text?
I've tried
library(stringr)
theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'
str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')
But I end up with @foobar, @foo and (@bar, which contains an unwanted parenthesis.
How can I get just @foobar, @foo and @bar as output?
Upvotes: 4
Views: 3562
Reputation: 174
@[a-zA-Z0-9_]{0,15}
Where:
@ matches the character @ literally (case sensitive).
[a-zA-Z0-9_] matches a single character present in the list (a letter, a digit, or an underscore).
{0,15} is a quantifier: it matches between 0 and 15 times, as many times as possible, giving back as needed.
It works well for selecting Twitter usernames from a mixed dataset.
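A quick base R check of the pattern above may help show where it fits. This is only a sketch: the sample text and its e-mail address (user@example.com) are made up for illustration, and the `{1,15}` variant is my own suggestion rather than part of the answer. Two things to watch for: with `{0,15}` a bare `@` also counts as a match, and the pattern does not look at what precedes the `@`, so the domain part of an e-mail address is picked up as well.

```r
# Hypothetical sample text; the e-mail address is an illustrative stand-in.
text <- "@foobar Foobar! and @foo (@bar) but not user@example.com"

# Pattern from the answer: "@" followed by 0 to 15 username characters.
pattern <- "@[a-zA-Z0-9_]{0,15}"
handles <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
print(handles)
# [1] "@foobar"  "@foo"     "@bar"     "@example"

# With {0,15} a lone "@" matches too; {1,15} requires at least one character.
note <- "mail me @ noon, @foo"
loose  <- regmatches(note, gregexpr("@[a-zA-Z0-9_]{0,15}", note, perl = TRUE))[[1]]
strict <- regmatches(note, gregexpr("@[a-zA-Z0-9_]{1,15}", note, perl = TRUE))[[1]]
print(loose)   # [1] "@"    "@foo"
print(strict)  # [1] "@foo"
```

If the data can contain e-mail addresses, the pattern would need an extra check on the character before the `@`.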
Upvotes: 2
Reputation: 71538
Try using a negative lookbehind so that characters are not consumed in your match:
(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)
      ^^^
EDIT: Since it seems lookbehinds don't work in R (I found somewhere here that lookbehinds worked on R, but apparently not...), try this one:
@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)
Edit: double escaped the dot
EDITv3... : Try turning on PCRE:
str_extract_all(string=theString,perl("(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"))
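For what it's worth, the lookbehind idea can also be tried without stringr: base R's regex functions accept lookbehinds when `perl = TRUE` is set. The sketch below assumes a made-up sample string (the e-mail address is a stand-in) and drops the `(?:^|...)` wrapper, since a negative lookbehind already succeeds at the start of the string. Note that, as in the original pattern, a username needs at least two characters to match.

```r
# Hypothetical sample text; the e-mail address is an illustrative stand-in.
text <- "@foobar Foobar! and @foo (@bar) but not user@example.com"

# The negative lookbehind checks the preceding character without consuming it,
# so "(" is not part of the match and "user@example" is rejected.
pattern <- "(?<![-a-zA-Z0-9_])@[A-Za-z]+[A-Za-z0-9_]+"

# perl = TRUE switches to the PCRE engine, which supports lookbehind.
handles <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
print(handles)
# [1] "@foobar" "@foo"    "@bar"
```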
Upvotes: 1
Reputation: 42283
Here's one method that works in R:
theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo" "(@bar)"
If you want to use @Jerry's answer in R:
regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo" "(@bar)"
Both of these methods include the parentheses that you don't want, however.
UPDATE: This will get you from start to finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames):
theString <- '@foobar Foobar! and @fo_o (@bar) but not [email protected]'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
regex2 <- "[^[:alnum:]@_]" # remove all punctuation except _ and @
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = TRUE)])
users
[1] "@foobar" "@fo_o" "@bar"
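One thing to keep in mind with the cleanup step above: it strips punctuation anywhere in the token, so a token such as "@foo's" would come out as "@foos". A possible variation, sketched below, extracts the matches from the whole string with the same regex and then trims only the single character consumed before the `@`. The helper name `extract_handles` and the sample strings are hypothetical, not part of the original answer.

```r
# Hypothetical helper: pull @usernames out of a single string.
extract_handles <- function(x) {
  # Same regex as regex1 above: start of string or a non-word, non-@ character,
  # then "@" and 1 to 15 word characters ending at a word boundary.
  pattern <- "(^|[^@\\w])@(\\w{1,15})\\b"
  m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Each match may begin with the consumed character (e.g. " " or "(");
  # drop everything before the "@".
  sub("^[^@]*", "", m)
}

# Sample text with a stand-in e-mail address.
users <- extract_handles("@foobar Foobar! and @fo_o (@bar) but not user@example.com")
print(users)
# [1] "@foobar" "@fo_o"   "@bar"

print(extract_handles("@foo's"))
# [1] "@foo"
```

This skips the `strsplit` step, at the cost of a second pass with `sub` to remove the leading character.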
Upvotes: 8