Carath

Reputation: 265

Regex to extract hashtags with two dot-separated parts

I'm trying to create a regular expression to extract some text from strings. The strings can be URLs or ordinary text messages, e.g.:

endpoint/?userId=#someuser.id

OR

Hi #someuser.name, how are you?

From these I want to extract exactly #someuser.name from the message and #someuser.id from the URL. There may be many such strings to extract from the URLs and messages.

My regular expression currently looks like this:

(#[^\.]+?\.)([^\W]\w+\b)

It works fine except for one case, which I don't know how to handle:

These strings SHOULD NOT be matched: # .id, #.id. There must be at least one character between # and the dot, and one or more spaces alone between them should not count as a match.

How can I do it using my current regex?
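For reference, here is a minimal sketch of how the pattern above behaves in Java. The class and method names are illustrative and not part of the original post. It suggests that the wanted hashtag is found, but that `# .id` is accepted as well, because `[^\.]+?` is satisfied by a single space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OriginalRegexCheck {
    // The pattern from the question, escaped for a Java string literal.
    static final Pattern ORIGINAL = Pattern.compile("(#[^\\.]+?\\.)([^\\W]\\w+\\b)");

    // Collects every full match found in the input (hypothetical helper).
    static List<String> findAll(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Hi #someuser.name, how are you?")); // [#someuser.name]
        System.out.println(findAll("# .id"));                           // [# .id]  (unwanted)
        System.out.println(findAll("#.id"));                            // []
    }
}
```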

Upvotes: 2

Views: 1045

Answers (4)

Wiktor Stribiżew

Reputation: 627607

You may use

String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";

See the regex demo.

Details

  • # - a # symbol
  • [^.#]* - zero or more chars other than . and #
  • [^.#\s] - any char but ., # and whitespace
  • [^#.]* - zero or more chars other than . and #
  • \. - a dot
  • \w+ - 1+ word chars (letters, digits or _).

Java demo:

String s = "# #.id\nendpoint/?userId=#someuser.id\nHi #someuser.name, how are you?";
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
    System.out.println(matcher.group(0)); 
} 

Output:

#someuser.id
#someuser.name
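A few boundary cases can be checked with the sketch below (the class name, helper method and extra inputs are assumptions added for illustration). It uses the same pattern and suggests that whitespace inside the first part is tolerated as long as at least one non-whitespace character is present, so `#a b.id` is matched while `# .id` is not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagBoundaryCheck {
    // Same pattern as in the answer above.
    static final Pattern TAG = Pattern.compile("#[^.#]*[^.#\\s][^#.]*\\.\\w+");

    // Collects every full match found in the input (hypothetical helper).
    static List<String> findAll(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("# .id"));                         // []
        System.out.println(findAll("#.id"));                          // []
        System.out.println(findAll("#a b.id"));                       // [#a b.id]
        System.out.println(findAll("endpoint/?userId=#someuser.id")); // [#someuser.id]
    }
}
```

If whitespace should never appear inside the first part, the character classes would need to exclude `\s` as well; that is a stricter rule than the one the question states.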

Upvotes: 4

Kevin Cruijssen

Reputation: 9336

The redefined requirements are:

  • We search for pattern #A.B
  • A can be anything that does not contain # or ., but it may not consist solely of whitespace
  • B can only be regular ASCII letters or digits

Converting those requirements into a (possible) regex:

#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+

Explanation:

#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+  # The entire capture for the Java-Matcher:
#                                   #  A literal '#' character
 [^.#]+                             #  Followed by 1 or more characters which are NOT '.' nor '#'
       (          \\.)              #  Followed by a '.' character
        (?<!     )                  #  Which is NOT preceded by (negative lookbehind):
            #                       #   A literal '#'
             \\s+                   #   With 1 or more whitespaces
                      [A-Za-z0-9]+  #  Followed by 1 or more alphanumeric characters
                                    #  (PS: \\w+ could be used here if '_' is allowed as well)

Test code:

String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*#@*(.H(@EH Ok, # some spaces here .but none here #$p€©ï@l.$p€©ï@l that should do it..";
System.out.println("Input: \""+ input + '"');

System.out.println("Outputs: ");
java.util.regex.Matcher matcher = java.util.regex.Pattern.compile("#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+")
                                                         .matcher(input);
while(matcher.find())
  System.out.println('"'+matcher.group()+'"');

Try it online.

Which outputs:

Input: "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*#@*(.H(@EH Ok, # some spaces here .but none here #$p€©ï@l.$p€©ï@l that should do it.."
Outputs: 
"#someuser.id"
"#someuser.name"
"#@*(.H"
"# some spaces here .but"

Upvotes: 1

Allan

Reputation: 12456

You can try the following regex:

#(\w+)\.(\w+)

demo

Notes:

  • remove the parentheses if you do not need the capture groups.
  • in a Java string literal you need to escape every \
  • this gives #(\\w+)\\.(\\w+)
  • if the id consists only of digits, you can replace the second \w with [0-9]
  • if the username may contain characters other than letters, digits and underscore, you have to replace \w with a character class that explicitly lists all the allowed characters.

Code sample:

String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id, #.id.";
Matcher m = Pattern.compile("#(\\w+)\\.(\\w+)").matcher(input);
while (m.find()) {
    System.out.println(m.group());
}

output:

#someuser.id
#someuser.name
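The note about replacing \w with an explicit character class could look like the sketch below. Allowing a hyphen in the username is purely an assumption for illustration, as are the class name, the helper and the sample input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagCharClass {
    // Pattern from the answer: word characters only.
    static final Pattern PLAIN = Pattern.compile("#(\\w+)\\.(\\w+)");
    // Assumed variant: the username may also contain '-'.
    static final Pattern WITH_HYPHEN = Pattern.compile("#([\\w-]+)\\.(\\w+)");

    // Collects every full match of the given pattern (hypothetical helper).
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Hi #some-user.name, id=#someuser.id, # .id, #.id";
        System.out.println(findAll(PLAIN, input));       // [#someuser.id]
        System.out.println(findAll(WITH_HYPHEN, input)); // [#some-user.name, #someuser.id]
    }
}
```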

Upvotes: 1

Akash

Reputation: 101

#(\w+)[.](\w+)

This yields two capture groups, e.g.

endpoint/?userId=#someuser.id -> group(1)=someuser and group(2)=id (in Java, group(0) is the whole match, #someuser.id)
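In Java's Matcher, group(0) is the whole match and capture groups are numbered from 1. A minimal sketch of reading both parts follows; the class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagGroups {
    static final Pattern TAG = Pattern.compile("#(\\w+)[.](\\w+)");

    // Returns {first part, second part} of the first hashtag found,
    // or null when the input contains none (hypothetical helper).
    static String[] firstParts(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        Matcher m = TAG.matcher("endpoint/?userId=#someuser.id");
        if (m.find()) {
            System.out.println(m.group(0)); // #someuser.id
            System.out.println(m.group(1)); // someuser
            System.out.println(m.group(2)); // id
        }
    }
}
```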

Upvotes: 0
