Reputation: 265
I'm trying to create a regular expression in order to extract some text from strings. I want to extract text from urls or normal text messages e.g.:
endpoint/?userId=#someuser.id
OR
Hi #someuser.name, how are you?
From both I want to extract exactly #someuser.name from the message and #someuser.id from the URL. There might be many such strings to extract from the URLs and messages.
My regular expression currently looks like this:
(#[^\.]+?\.)([^\W]\w+\b)
It works fine except for one case, which I don't know how to handle.
These strings SHOULD NOT be matched: # .id and #.id
There must be at least one character between # and the dot. One or more spaces between those characters should not be matched.
How can I do it using my current regex?
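A minimal reproduction of the problem, as a sketch: it runs the pattern above against the sample strings from this question. The class name `CurrentRegexCheck` and the helper `matches` are illustrative names, not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CurrentRegexCheck {
    // The pattern from the question, written as a Java string literal.
    static final Pattern CURRENT = Pattern.compile("(#[^\\.]+?\\.)([^\\W]\\w+\\b)");

    // Collects every full match of CURRENT found in the input (hypothetical helper).
    static List<String> matches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = CURRENT.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches("endpoint/?userId=#someuser.id"));   // wanted match
        System.out.println(matches("Hi #someuser.name, how are you?")); // wanted match
        System.out.println(matches("# .id"));                           // unwanted match
        System.out.println(matches("#.id"));                            // no match
    }
}
```

With this pattern, "# .id" is matched because [^\.]+? accepts a lone space, whereas "#.id" is not matched because at least one non-dot character is required after the #.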
Upvotes: 2
Views: 1045
Reputation: 627607
You may use
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
See the regex demo.
Details
# - a # symbol
[^.#]* - zero or more chars other than . and #
[^.#\\s] - any char but ., # and whitespace
[^#.]* - zero or more chars other than . and #
\. - a dot
\w+ - 1+ word chars (letters, digits or _).

String s = "# #.id\nendpoint/?userId=#someuser.id\nHi #someuser.name, how are you?";
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
    System.out.println(matcher.group(0));
}
Output:
#someuser.id
#someuser.name
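To check the edge cases from the question against this pattern, a sketch along the following lines may help. The class name `HashTokenExtractor`, the helper `extract`, and the extra input "#some user.id" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTokenExtractor {
    // Pattern from this answer, written as a Java string literal.
    static final Pattern TOKEN = Pattern.compile("#[^.#]*[^.#\\s][^#.]*\\.\\w+");

    // Returns every full match found in the input (hypothetical helper).
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("# .id"));                           // whitespace only between # and .
        System.out.println(extract("#.id"));                            // nothing between # and .
        System.out.println(extract("Hi #someuser.name, how are you?")); // regular token
        System.out.println(extract("#some user.id"));                   // inner space, illustrative input
    }
}
```

Note that the pattern only requires one non-whitespace character between # and the dot, so a name that contains a space, such as "#some user.id", is still matched as a whole.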
Upvotes: 4
Reputation: 9336
The redefined requirements are:
The format is #A.B
A can be anything except whitespace only, and it may not contain # or .
B can only be regular ASCII letters or digits

Converting those requirements into a (possible) regex:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+
Explanation:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+   # The entire capture for the Java-Matcher:
#                                    # A literal '#' character
 [^.#]+                              # Followed by 1 or more characters which are NOT '.' nor '#'
       (          \\.)               # Followed by a '.' character
        (?<!     )                   # Which is NOT preceded by (negative lookbehind):
            #                        # A literal '#'
             \\s+                    # With 1 or more whitespaces
                      [A-Za-z0-9]+   # Followed by 1 or more alphanumeric characters
                                     # (PS: \\w+ could be used here if '_' is allowed as well)
Test code:
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*#@*(.H(@EH Ok, # some spaces here .but none here #$p€©ï@l.$p€©ï@l that should do it..";
System.out.println("Input: \""+ input + '"');
System.out.println("Outputs: ");
java.util.regex.Matcher matcher = java.util.regex.Pattern.compile("#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+")
    .matcher(input);
while(matcher.find())
    System.out.println('"'+matcher.group()+'"');
Which outputs:
Input: "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*#@*(.H(@EH Ok, # some spaces here .but none here #$p€©ï@l.$p€©ï@l that should do it.."
Outputs:
"#someuser.id"
"#someuser.name"
"#@*(.H"
"# some spaces here .but"
Upvotes: 1
Reputation: 12456
You can try the following regex:
#(\w+)\.(\w+)
Notes:
In a Java string literal the backslash \ has to be escaped, so the pattern is written as #(\\w+)\\.(\\w+)
If id is only made of numbers, you can replace the second \w with [0-9]
If username includes characters other than letters, digits and underscore, you have to change \w into a character class with all the authorised characters defined explicitly.

Code sample:
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id, #.id.";
Matcher m = Pattern.compile("#(\\w+)\\.(\\w+)").matcher(input);
while (m.find()) {
System.out.println(m.group());
}
output:
#someuser.id
#someuser.name
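The two variations mentioned in the notes can be sketched as follows. This is only an illustration: the class name `TokenVariants`, the helper `firstMatch`, and the choice of allowing a hyphen in the username are assumptions, not requirements stated in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenVariants {
    // Variant 1: the part after the dot may only contain digits.
    static final Pattern NUMERIC_ID = Pattern.compile("#(\\w+)\\.([0-9]+)");

    // Variant 2: the username may also contain '-' (assumed extra character).
    static final Pattern WITH_HYPHEN = Pattern.compile("#([\\w-]+)\\.(\\w+)");

    // Returns the first full match of the pattern, or null when there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(NUMERIC_ID, "endpoint/?userId=#someuser.42"));
        System.out.println(firstMatch(NUMERIC_ID, "endpoint/?userId=#someuser.id"));
        System.out.println(firstMatch(WITH_HYPHEN, "Hi #some-user.name!"));
    }
}
```

Listing the allowed characters explicitly keeps the pattern strict: anything not in the class ends the username, so the set should be chosen to fit the actual data.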
Upvotes: 1
Reputation: 101
#(\w+)[.](\w+)
This yields two capture groups, e.g.
endpoint/?userId=#someuser.id -> first group = someuser and second group = id
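When reading the groups with Java's Matcher, group(0) is the whole match and the two captured parts are group(1) and group(2). A short sketch, where the class name `GroupDemo` and the helper `groups` are made up for illustration:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Pattern from this answer, written as a Java string literal.
    static final Pattern TOKEN = Pattern.compile("#(\\w+)[.](\\w+)");

    // Returns {whole match, first group, second group} for the first match, or null.
    static String[] groups(String input) {
        Matcher m = TOKEN.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Prints the whole match followed by the two captured parts.
        System.out.println(Arrays.toString(groups("endpoint/?userId=#someuser.id")));
    }
}
```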
Upvotes: 0