Zaibi

Reputation: 343

Regex to extract usernames/names from a string

I have strings that contain a name (sometimes a username) followed by a datetime stamp:

GN1RLWFH0546-2020-04-10-18-09-52-563945.txt
JOHN-DOE-2020-04-10-18-09-52-563946t64.txt
DESKTOP-OHK45JO-2020-04-09-02-27-11-451975.txt

I want to extract the usernames from these strings:

GN1RLWFH0546
JOHN-DOE
DESKTOP-OHK45JO

I have tried different regex patterns. The closest I came extracted the following:

GN1RLWFH0546
DESKTOP
JOHN

Using the following regex pattern:

names = re.search(r"\(?([0-9A-Za-z]+)\)?", agent_str)
print(names.group(1))

Upvotes: 5

Views: 914

Answers (4)

user11133653

import re

agent_str = ["GN1RLWFH0546-2020-04-10-18-09-52-563945.txt", "JOHN-DOE-2020-04-10-18-09-52-563946t64.txt", "DESKTOP-OHK45JO-2020-04-09-02-27-11-451975.txt"]

for sub in agent_str:
    names = re.search(r"([A-Za-z]+[A-Za-z0-9]+)(\-[A-Za-z]+[A-Za-z0-9]+)?", sub)
    print(names.group())

Upvotes: 0

simon-pearson

Reputation: 1970

How about the following regex: (.*)-\d{4}-. This matches anything followed by a hyphen, four digits, and another hyphen.

With this regex, the first capture group is the username:

import re
agent_str = 'DESKTOP-OHK45JO-2020-04-09-02-27-11-451975.txt'
names = re.search(r'(.*)-\d{4}-', agent_str)
print(names.group(1)) 
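
As a rough check, the same pattern can be run over all three sample strings from the question. This is only a sketch: `extract_name` is a hypothetical helper name, and returning `None` when nothing matches is an assumption about how a miss should be handled.

```python
import re

# The three sample filenames from the question.
samples = [
    "GN1RLWFH0546-2020-04-10-18-09-52-563945.txt",
    "JOHN-DOE-2020-04-10-18-09-52-563946t64.txt",
    "DESKTOP-OHK45JO-2020-04-09-02-27-11-451975.txt",
]

def extract_name(filename):
    """Hypothetical helper: return the text before '-YYYY-', or None if absent."""
    m = re.search(r'(.*)-\d{4}-', filename)
    return m.group(1) if m else None

results = [extract_name(s) for s in samples]
for name in results:
    print(name)
```

Note that `.*` is greedy, so if a string contained more than one hyphen-delimited four-digit run, the group would extend up to the last one.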

Upvotes: 0

Tim Biegeleisen

Reputation: 520918

I suggest stripping off the trailing content you don't want, leaving behind what you do want:

import re

inp = "GN1RLWFH0546-2020-04-10-18-09-52-563945.txt"
# Remove everything from the "-YYYY-MM-DD" date onward
out = re.sub(r'-\d{4}-\d{2}-\d{2}.*$', '', inp)
print(out)

This prints:

GN1RLWFH0546
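
The same substitution appears to handle the other two samples from the question as well. A minimal sketch, assuming the date always takes the form -YYYY-MM-DD (the names `samples` and `stripped` are just illustrative):

```python
import re

# The three sample filenames from the question.
samples = [
    "GN1RLWFH0546-2020-04-10-18-09-52-563945.txt",
    "JOHN-DOE-2020-04-10-18-09-52-563946t64.txt",
    "DESKTOP-OHK45JO-2020-04-09-02-27-11-451975.txt",
]

# Strip the trailing "-YYYY-MM-DD..." part from each string.
stripped = [re.sub(r'-\d{4}-\d{2}-\d{2}.*$', '', s) for s in samples]
for name in stripped:
    print(name)
```

If a string has no date in that form, `re.sub` leaves it unchanged rather than raising an error, so a non-matching input comes back as-is.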

Upvotes: 1

Wiktor Stribiżew

Reputation: 626689

You can match all text up to the first occurrence of a hyphen, one or more digits, and another hyphen (-digits-):

^.*?(?=-\d+-)

If the number must be exactly 4 digits (say, if it is a year), then replace + with {4}:

^.*?(?=-\d{4}-)

Details

  • ^ - start of string
  • .*? - any 0+ chars other than line break chars, as few as possible
  • (?=-\d+-) - up to the first occurrence of - and 1+ digits (or, if \d{4} is used, exactly four digits) and then - (this part is not added to the match value as the positive lookahead is a non-consuming pattern).
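
The difference between the two variants only shows up when the name itself contains a hyphen-delimited run of digits. A small sketch, using a made-up filename that is not from the question:

```python
import re

# Hypothetical filename (an assumption for illustration): the name part
# "PC-12-ROOM" contains a short digit run between hyphens.
s = "PC-12-ROOM-2020-04-10-18-09-52-563945.txt"

any_digits = re.search(r"^.*?(?=-\d+-)", s).group()     # stops before "-12-"
four_digits = re.search(r"^.*?(?=-\d{4}-)", s).group()  # stops before "-2020-"

print(any_digits)
print(four_digits)
```

Here the `\d+` version stops at the first digit run, while the `\d{4}` version continues until the four-digit year, so the stricter pattern may be the safer choice if names can contain short numeric parts.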

See Python demo:

import re
strs = ["GN1RLWFH0546-2020-04-10-18-09-52-563945.txt", "JOHN-DOE-2020-04-10-18-09-52-563946t64.txt", "DESKTOP-OHK45JO-2020-04-09-02-27-11-451975.txt"]
rx = re.compile(r"^.*?(?=-\d+-)")
for s in strs:
    m = rx.search(s)
    if m:
        print("{} => '{}'".format(s, m.group()))

Output:

GN1RLWFH0546-2020-04-10-18-09-52-563945.txt => 'GN1RLWFH0546'
JOHN-DOE-2020-04-10-18-09-52-563946t64.txt => 'JOHN-DOE'
DESKTOP-OHK45JO-2020-04-09-02-27-11-451975.txt => 'DESKTOP-OHK45JO'

Upvotes: 2
