Reputation: 572
I need to create a regex pattern to find mentions in a class called Tweets. In this case, the valid characters after the '@' are: letters (A-Z or a-z), digits, underscore ("_"), or hyphen ("-"). The difference from classic Twitter usernames is that the pattern should allow @--- or @___ or @00000. Also, a mention should count as valid whenever the character before the '@' or after the name is not in the valid character list (so not only whitespace).
Strings like:
$$$$$@john$$$$$$ or %%%%@john%%%
should find @john as a valid mention since % isn't a valid name character.
@@@@@john@@@@@ should also return @john.
Using http://regexr.com/ I created this pattern:
@[a-zA-Z0-9_-]*
On that page it passes most requirements except @@@@john@@@ and richard@gmail.com. The latter should be ignored (since it has richard before the '@') but instead is matched as @gmail.
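To see what this first pattern does outside the online tester, here is a minimal Java sketch. The class name NaivePatternDemo and the method allMatches are made up for illustration, and richard@gmail.com is an assumed sample address. Listing every match suggests two causes: `*` lets a bare @ match with an empty name, and nothing checks the character before the @.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaivePatternDemo {
    // The first pattern from the question, unchanged.
    static final Pattern NAIVE = Pattern.compile("@[a-zA-Z0-9_-]*");

    // Collects every match in order, including empty-name matches like "@".
    static List<String> allMatches(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = NAIVE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // '*' accepts zero name characters, so each leading '@' matches alone.
        System.out.println(allMatches("@@@@john@@@"));       // [@, @, @, @john, @, @, @]
        // Nothing inspects the character before '@', so the email's domain matches.
        System.out.println(allMatches("richard@gmail.com")); // [@gmail]
    }
}
```

Taking only the first match of @@@@john@@@ therefore yields a bare @, not @john.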
After that I improved the pattern to:
(?<=^|(?<=[^a-zA-Z0-9-_\.]))@[A-Za-z0-9_-]+
I tested it on the same page (and also on this one to verify the results: http://myregexp.com/signedJar.html). I'm not sure why the page shows results that my code doesn't. I include my Java implementation just in case:
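import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Set<String> users = new HashSet<>(); // a set avoids repeated mentions
// Compile once, outside the loop; "\\." is the Java-escaped form of \.
Pattern p = Pattern.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@[A-Za-z0-9_-]+");
for (Tweet t : tweets) {
    Matcher m = p.matcher(t.getText());
    // while, not if: a single tweet can contain several mentions
    while (m.find()) {
        users.add(m.group().toLowerCase());
    }
}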
And these are my test cases (all should return a mention except the last two):
@tony
$$@yahoo$$
john @john john
@joules-
@john-cassidy
%%@jake%%
@@@jake@@@
dude$@jake$$
$$$@jack$$$
@@@jake@@@
@john4
@jake2$
@johN3
@rock-smith
@John
@gmail.com //should not return but does: wrong
richard@gmail.com //should not find and it doesn't: good
From what I understand, this: (?<=^|(?<=[^a-zA-Z0-9-_\.])) is the *lookbehind*, and I'm lacking the *lookahead* (I'm not sure how to look ahead AFTER the valid chars). I can't follow the explanation of lookahead at http://www.regular-expressions.info/lookaround.html well enough to allow these chars: [A-Za-z0-9_-] but not the rest (so that @gmail.com is ignored instead of returning @gmail).
Thanks in advance for your help. I have only 6 months of Java experience and this is just the second time I've used regex, so this feels like a complex one.
Upvotes: 1
Views: 968
Adding a positive lookahead to the end of the original regex should help:
(?<=^|(?<=[^a-zA-Z0-9-_\.]))@[A-Za-z0-9_-]+(?=[^a-zA-Z0-9-_\.])
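A minimal Java sketch of the idea, with one assumed variation. The positive lookahead above needs an actual character after the name, so a mention at the very end of the text (such as a lone @tony) would not match. The sketch swaps in negative lookarounds, which also succeed at the start and end of the input. The class name MentionFinder and the method findMentions are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MentionFinder {
    // (?<![A-Za-z0-9_.-])  the '@' must not follow a name character or a dot
    // @[A-Za-z0-9_-]+      the mention itself: '@' plus at least one valid character
    // (?![A-Za-z0-9_.-])   the name must not be followed by a name character or a dot
    static final Pattern MENTION =
        Pattern.compile("(?<![A-Za-z0-9_.-])@[A-Za-z0-9_-]+(?![A-Za-z0-9_.-])");

    // Returns the lower-cased mentions found in the text, without duplicates.
    static Set<String> findMentions(String text) {
        Set<String> mentions = new LinkedHashSet<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            mentions.add(m.group().toLowerCase());
        }
        return mentions;
    }

    public static void main(String[] args) {
        String[] samples = {
            "@tony", "$$@yahoo$$", "@@@jake@@@", "@joules-",
            "@gmail.com", "richard@gmail.com"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + findMentions(s));
        }
    }
}
```

Because the dot is in both lookaround classes, a mention directly followed by a period (for example at the end of a sentence) is rejected too; whether that trade-off is acceptable depends on the tweets being parsed.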
Upvotes: 2