Reputation: 5731
I have a regular expression to remove all usernames from tweets. It looks like this:
regexFinder = "(?:\\s|\\A)[@]+([A-Za-z0-9-_]+):";
I'm trying to understand what each component does. So far, I've got:
( Used to begin a “group” element
?: Starts non-capturing group (this means one that will be removed from the final result)
\\s Matches against shorthand characters
| or
\\A Matches at the start of the string and matches a position as opposed to a character
[@] Matches against this symbol (which is used for Twitter usernames)
+ Match the previous followed by
([A-Za-z0-9-_] Match against any capital or small characters and numbers or hyphens
I'm a bit lost with the last bit though. Could somebody tell me what the +): means? I'm assuming the bracket is ending the group, but I don't get the colon or the plus sign.
If I've made any mistakes in my understanding of the regex please feel free to point it out!
Upvotes: 2
Views: 207
Reputation: 19066
The + actually means "one or more" of whatever it follows. In this case, [@]+ means "one or more @ symbols" and [A-Za-z0-9-_]+ means "one or more letters, digits, hyphens, or underscores". The + is one of several quantifiers.
The colon at the end is a literal character: the match must end with a colon.
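To see the effect of the trailing colon and of [@]+, here is a minimal sketch that applies the pattern from the question with replaceAll. The class name UsernameStripper and the sample tweets are made up for illustration; only the regex itself comes from the question.

```java
import java.util.regex.Pattern;

class UsernameStripper {
    // The pattern from the question, unchanged.
    static final Pattern MENTION =
            Pattern.compile("(?:\\s|\\A)[@]+([A-Za-z0-9-_]+):");

    // Removes every "@name:" mention, including the whitespace matched in front of it.
    static String strip(String tweet) {
        return MENTION.matcher(tweet).replaceAll("");
    }

    public static void main(String[] args) {
        // The mention ends with ':', so it is removed.
        System.out.println(strip("RT @alice: hello"));
        // No ':' after the name, so nothing is removed.
        System.out.println(strip("hi @bob there"));
        // [@]+ accepts more than one '@'.
        System.out.println(strip("@@carol: hi"));
    }
}
```

Note that the leading whitespace is part of the match, so it is removed along with the mention.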
Sometimes it helps to see a visualization; a tool such as Debuggex can generate one for this pattern.
Upvotes: 1
Reputation: 70722
Well, we shall see..
[@]+ any character of: '@' (1 or more times)
( group and capture to \1:
[A-Za-z0-9-_]+ any character of: (a-z A-Z), (0-9), '-', '_' (1 or more times)
) end of capture group \1
: look for and match ':'
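The breakdown above can be checked with a small sketch. It assumes the full pattern from the question; the class name GroupDemo, the helper methods, and the sample strings are hypothetical. It shows that the (?:...) group is not numbered, so the username is group 1, while the whole match still includes the leading whitespace and the colon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern MENTION =
            Pattern.compile("(?:\\s|\\A)[@]+([A-Za-z0-9-_]+):");

    // Text captured by group 1 (the username), or null when nothing matches.
    static String username(String tweet) {
        Matcher m = MENTION.matcher(tweet);
        return m.find() ? m.group(1) : null;
    }

    // The entire matched text (group 0), or null when nothing matches.
    static String wholeMatch(String tweet) {
        Matcher m = MENTION.matcher(tweet);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String tweet = "hi @bob: there";
        System.out.println("group 1: [" + username(tweet) + "]");
        System.out.println("group 0: [" + wholeMatch(tweet) + "]");
        // Only the capturing group is counted; (?:...) is not.
        System.out.println("groups:  " + MENTION.matcher("").groupCount());
    }
}
```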
The following quantifiers are recognized:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
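As a rough illustration of the quantifier list, the following sketch tests each quantifier against short strings of the letter a. It uses Pattern.matches, which requires the whole input to match; the class name QuantifierDemo and the helper are invented for this example.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("a*", ""));         // 0 or more: empty input is fine
        System.out.println(matches("a+", ""));         // 1 or more: empty input fails
        System.out.println(matches("a?", "aa"));       // 0 or 1: two is too many
        System.out.println(matches("a{2}", "aa"));     // exactly 2
        System.out.println(matches("a{2,}", "aaaa"));  // at least 2
        System.out.println(matches("a{2,3}", "aaaa")); // 2 to 3: four is too many
    }
}
```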
Upvotes: 1
Reputation: 1722
The plus sign in regular expressions means "one or more occurrences of the previous character or group of characters." Since the second plus sign is within the second set of parentheses, that set of parentheses matches any string composed of at least one lowercase or uppercase letter, digit, hyphen, or underscore.
As for the colon, on its own it has no special meaning in Java's regex syntax, so it simply matches a literal colon.
Upvotes: 1
Reputation: 2998
The + sign means "the previous element can be repeated 1 or more times". This is in contrast to the * symbol, which means "the previous element can be repeated 0 or more times". The colon, as far as I can tell, is literal: it matches a literal : in the string.
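A small sketch of the difference, under the assumption that the leading (?:\\s|\\A) part is dropped for brevity: the second pattern below is a hypothetical variant of the question's regex with * in place of +, written only for comparison. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlusVersusStar {
    // Simplified from the question: requires at least one name character.
    static final Pattern PLUS = Pattern.compile("@([A-Za-z0-9-_]+):");
    // Hypothetical variant: allows zero name characters.
    static final Pattern STAR = Pattern.compile("@([A-Za-z0-9-_]*):");

    static boolean plusFinds(String s) {
        return PLUS.matcher(s).find();
    }

    static boolean starFinds(String s) {
        return STAR.matcher(s).find();
    }

    public static void main(String[] args) {
        // "@:" has no name between '@' and the literal ':'.
        System.out.println(plusFinds("@:")); // + needs at least one character
        System.out.println(starFinds("@:")); // * is satisfied by zero characters

        Matcher m = STAR.matcher("@:");
        if (m.find()) {
            System.out.println("captured: [" + m.group(1) + "]"); // empty capture
        }
    }
}
```

With +, a bare "@:" is not treated as a username, which is presumably why the original pattern uses it.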
Upvotes: 1