jshi510

Reputation: 31

Regex to extract name from a string

I'm trying to use a regular expression to extract the name from a string. The name is always followed by a protocol. The protocols are: ssh, folder, http.

Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *
Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *
Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *

The expected output would be:

John
Jake
Steve

Upvotes: 1

Views: 4385

Answers (4)

Emma

Reputation: 27723

Another approach is to use the single lowercase letter and the whitespace right before the name as a left boundary, then capture the name's letters in capturing group $1, for example:

\s+[a-z]\s+([A-Z][a-z]+)

We can also add more boundaries to it if necessary.
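As one possible way to add a boundary, the sketch below appends the protocol list from the question as a right boundary. The helper name `extractBoundedName` and the non-matching test line are made up for illustration; this is an assumption about what "more boundaries" could look like, not part of the original expression.

```javascript
// Left boundary: whitespace + one lowercase letter + whitespace.
// Right boundary (added here as an assumption): whitespace + one of the protocols.
const boundedPattern = /\s[a-z]\s+([A-Z][a-z]+)\s+(?:ssh|folder|http)\b/;

// Hypothetical helper: returns the captured name, or null when nothing matches.
function extractBoundedName(line) {
  const m = boundedPattern.exec(line);
  return m ? m[1] : null;
}

const sampleLines = [
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *",
  "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *",
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *",
];

console.log(sampleLines.map(extractBoundedName)); // [ 'John', 'Jake', 'Steve' ]

// A capitalized word after a single letter, but with no protocol behind it, is rejected.
console.log(extractBoundedName("a b c d Report generated 0 *")); // null
```

Without the right boundary, the last line would yield "Report", which is why the extra constraint can be worth having on less regular input.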


RegEx

If this expression isn't quite what you want, it can be modified and tested on regex101.com.

RegEx Circuit

jex.im can visualize regular expressions as a diagram.

DEMO

Test

const regex = /\s+[a-z]\s+([A-Z][a-z]+)/gm;
const str = `Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *
Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *
Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Upvotes: 0

chatnoir

Reputation: 2293

Try:

\b[A-Za-z]+(?=\s(?=ssh|folder|http))

Regex Demo here.

let regex = /\b[A-Za-z]+(?=\s(?=ssh|folder|http))/g;
let match; // declared once so the destructuring assignments below also work in strict mode

[match] = "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *".match(regex);
console.log(match); //John

[match] = "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *".match(regex);
console.log(match); //Jake

[match] = "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *".match(regex);
console.log(match); //Steve

Regex explanation:

\b asserts a word boundary where the match starts

[A-Za-z] matches any ASCII letter, upper or lower case

+ repeats the previous token one or more times

(?= opens a lookahead (its contents must follow, but are not included in the match)

\s matches a single whitespace character

(?=ssh|folder|http) is a nested lookahead that requires ssh, folder or http to come next

Putting it all together, the regex looks for a word that is followed by a space and then one of the following: ssh, folder, or http.
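If the lines arrive as one multi-line string rather than one at a time, the same expression can be applied once with `String.prototype.matchAll`, which needs the `g` flag. This is a minimal sketch; the variable names `logText` and `namesFromLog` are made up for illustration.

```javascript
// Same expression as above; the g flag is required by matchAll.
const protocolLookahead = /\b[A-Za-z]+(?=\s(?=ssh|folder|http))/g;

// The three sample lines from the question, joined into one string.
const logText = [
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *",
  "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *",
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *",
].join("\n");

// Each match object holds the full match at index 0; the lookahead text is not part of it.
const namesFromLog = Array.from(logText.matchAll(protocolLookahead), (m) => m[0]);

console.log(namesFromLog); // [ 'John', 'Jake', 'Steve' ]
```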

Upvotes: 1

Allan

Reputation: 12438

You can use the following PCRE regex (since you haven't specified which language):

\b[a-zA-Z]+(?=\s+(?:ssh|folder|http))

demo: https://regex101.com/r/t62Ra7/4/

Explanations:

  • \b starts the match at a word boundary
  • [a-zA-Z]+ matches one or more ASCII letters in the a-zA-Z range; you might have to generalise this to accept Unicode letters.
  • (?= opens a lookahead that adds the constraint that the name is followed by one of the protocols
  • \s+ matches one or more whitespace characters
  • (?:ssh|folder|http) is a non-capturing group for the protocols ssh, folder or http
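The Unicode generalisation mentioned above could look like the following JavaScript sketch. It assumes an engine with ES2018 support for `\p{L}` (which needs the `u` flag) and for lookbehind. Because `\b` in JavaScript only treats ASCII word characters as word characters, it is swapped here for a `(?<!\p{L})` lookbehind; that swap is my substitution, not part of the original pattern. The name "Élodie" and the helper `extractUnicodeName` are made-up test data and naming.

```javascript
// \p{L} matches any Unicode letter; (?<!\p{L}) plays the role of \b on the left side.
const unicodeName = /(?<!\p{L})\p{L}+(?=\s+(?:ssh|folder|http))/u;

// Hypothetical helper: returns the matched name, or null when nothing matches.
function extractUnicodeName(line) {
  const m = line.match(unicodeName);
  return m ? m[0] : null;
}

const asciiLine =
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *";
// Invented line with a non-ASCII first letter (\u00C9 is É).
const accentedLine =
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r \u00C9lodie ssh 0 *";

console.log(extractUnicodeName(asciiLine));    // John
console.log(extractUnicodeName(accentedLine)); // Élodie

// For comparison, the ASCII-only pattern drops the accented first letter.
const asciiOnly = /\b[a-zA-Z]+(?=\s+(?:ssh|folder|http))/;
console.log(accentedLine.match(asciiOnly)[0]); // lodie
```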

Upvotes: 2

WJS

Reputation: 40034

Here's how you might do it in Java.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractName {
   public static void main(String[] args) {
      String[] str = {
            "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *    ",
            "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 * ",
            "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *  ",
      };

      // In a Java string literal the backslash must be doubled: "\\w" is the regex \w
      String pat = "(\\w+) (ssh|folder|http)";
      Pattern p = Pattern.compile(pat);
      for (String s : str) {
         Matcher m = p.matcher(s);
         if (m.find()) {
            System.out.println(m.group(1)); // group 1 holds the name
         }
      }
   }
}

The actual pattern is in the string pat and, written with single backslashes, can be used with other regex engines. It matches a word followed by a single space and then one of the protocols, combined with alternation, and it captures the name in the first capture group.
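To illustrate the point about other engines, here is a rough JavaScript sketch of the same capture-group pattern. The function name `parseLine` and the returned object shape are assumptions for illustration; since the second group already captures the protocol, the sketch returns it as well.

```javascript
// Same pattern as the Java string pat, written as a regex literal (single backslash).
const nameAndProtocol = /(\w+) (ssh|folder|http)/;

// Hypothetical helper: group 1 is the name, group 2 is the protocol.
function parseLine(line) {
  const m = nameAndProtocol.exec(line);
  return m ? { name: m[1], protocol: m[2] } : null;
}

const javaSampleLines = [
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o r John ssh 0 *",
  "Thu May 23 22:42:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o i Jake folder 0 *",
  "Thu May 23 22:41:55 2019 19 10.10.10.20 22131676 /mnt/tmp/test.txt b s o t Steve http 0 *",
];

for (const line of javaSampleLines) {
  const parsed = parseLine(line);
  if (parsed !== null) {
    console.log(parsed.name, parsed.protocol);
  }
}
```

Note that \w also accepts digits and underscores, so this is slightly more permissive than the letter-only patterns in the other answers.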

Upvotes: 0
