Nooblhu

Reputation: 572

Find a Twitter mention using Regex in Java (with a twist)

I need to create a regex pattern to find mentions in a class called Tweets. In this case, the valid characters after the '@' are letters (A-Z or a-z), digits, underscore ("_"), or hyphen ("-"). The difference from classic Twitter usernames is that the pattern should allow @--- or @___ or @00000. In addition, a mention should count as valid whenever the character before the '@' or after the name is not in the valid character list (so not only whitespace).

Strings like:

$$$$$@john$$$$$$ or %%%%@john%%%

should find @john as a valid mention, since $ and % aren't valid name characters.

@@@@@john@@@@@ should also return @john.

Using http://regexr.com/ I created this pattern:

@[a-zA-Z0-9_-]*

which on that page passes most requirements, except for @@@@john@@@ and richard@gmail.com. The latter should be ignored (since it has richard before the '@'), but instead it yields @gmail.
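To make those two failures concrete, here is a minimal sketch of how that first pattern behaves in Java. The class name `FirstPatternDemo`, the helper `firstMatch`, and the address `richard@gmail.com` (a stand-in for the email case) are illustrative assumptions, not part of the actual project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstPatternDemo {
    // The first pattern from above; note the '*', which also allows an empty name.
    static final Pattern FIRST = Pattern.compile("@[a-zA-Z0-9_-]*");

    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = FIRST.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("@@@@john@@@"));
        System.out.println(firstMatch("richard@gmail.com"));
    }
}
```

With `*`, the first input matches a bare `@` at position 0 before it ever reaches `@john`, and with nothing constraining the character before the '@', the second input yields `@gmail`.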

After that I improved the pattern to this:

(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@[A-Za-z0-9_-]+

I tested it on the same page (and also on http://myregexp.com/signedJar.html to verify the results). I'm not sure why the page shows results that my code doesn't. I include my Java implementation just in case:

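// Needs java.util.HashSet, java.util.Set, java.util.regex.Matcher, java.util.regex.Pattern
Set<String> users = new HashSet<>(); // a set, to avoid repeated mentions
// Compile the pattern once, outside the loop
Pattern p = Pattern.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@[A-Za-z0-9_-]+");
for (Tweet t : tweets) {
    Matcher m = p.matcher(t.getText());
    while (m.find()) { // while, not if: a tweet may contain several mentions
        users.add(m.group().toLowerCase());
    }
}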

And these are my test cases (all should return a mention except last two):

@tony
$$@yahoo$$
john @john john
@joules-
@john-cassidy
%%@jake%%
@@@jake@@@
dude$@jake$$
$$$@jack$$$
@@@jake@@@ 
@john4 
@jake2$
@johN3 
@rock-smith
@John 
@gmail.com //should not return but does: wrong
richard@gmail.com //should not find and it doesn't: good

From what I understand, this: (?<=^|(?<=[^a-zA-Z0-9-_\.])) is the *lookbehind*, and I'm lacking the *lookahead*. I'm not sure how to look ahead AFTER the valid chars, and I can't understand the explanation of lookahead at http://www.regular-expressions.info/lookaround.html well enough to allow these chars: [A-Za-z0-9_-] but not the rest (so that @gmail.com is ignored instead of returning @gmail).

Thanks in advance for your help. I have just 6 months of Java experience and this is only the second time I've used regex, so this feels like a complex one.

Upvotes: 1

Views: 968

Answers (1)

user7018603

Adding a positive lookahead to the end of the original regex should help:

(?<=^|(?<=[^a-zA-Z0-9-_\.]))@[A-Za-z0-9_-]+(?=[^a-zA-Z0-9-_\.])
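One caveat: a positive lookahead needs some character to follow the name, so a mention at the very end of the input (such as the `@tony` test case) would likely not match. A possible variant, sketched here under assumptions, uses negative lookarounds instead, which also succeed at the start and end of the string. The class name `MentionFinder` and the method `findMentions` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MentionFinder {
    // Assumption: '.' counts as a blocking neighbour, as in the question,
    // so "@gmail.com" and "richard@gmail.com" yield no mention.
    static final Pattern MENTION = Pattern.compile(
        "(?<![A-Za-z0-9_.-])"     // not preceded by a name character or '.'
        + "@[A-Za-z0-9_-]+"       // '@' plus one or more name characters
        + "(?![A-Za-z0-9_.-])");  // not followed by a name character or '.'

    // Collects every mention in the text, in order of appearance.
    static List<String> findMentions(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {"@tony", "$$@yahoo$$", "@@@jake@@@", "@gmail.com", "richard@gmail.com"};
        for (String s : samples) {
            System.out.println(s + " -> " + findMentions(s));
        }
    }
}
```

A side effect of keeping '.' in the lookahead class is that a mention directly followed by a full stop, as in "hello @john.", is rejected too. If that is not wanted, the '.' would have to be handled separately.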

Upvotes: 2
