Felix

Reputation: 323

Regex getting number from string that stops at _ and str after _ underscore

I have several strings that look like this:

str1 = "C:/Users/10MedicineA\20072018_medicineName_00222_01111"
str2 = "C:/Users/MedicineB\21072018_medicineName_03333_01121"

I need to extract the digits after the backslash (which represent the date), the medicineName, and the identifier (the first number series after the medicineName).

So the final result should look like:

['20072018','medicineName','00222']

How can I get everything after the backslash \ up to the underscore _?

I would like to do it with regex. Of course it's easy to filter out the C:/Users/ part, because it's always the same, but that's not true for the rest:

final = re.findall(r'\d+\.*',str1)
['10','20072018','00222','01111']

or

final = re.findall(r'(?=[0-9]).*(?=\_)', str1)

Upvotes: 2

Views: 1987

Answers (3)

Tomerikoo

Reputation: 19432

If you want to stick with regex, you could do something like:

import re

strings = ["C:/Users/10MedicineA/20072018_medicineName_00222_01111",
           "C:/Users/MedicineB/21072018_medicineName_03333_01121"]

for s in strings:
    r = re.search(r"(\d+)_(medicineName)_(\d+)_", s)
    if r:
        print(list(r.groups()))

And this gives:

['20072018', 'medicineName', '00222']
['21072018', 'medicineName', '03333']

If you want to match any name rather than the literal medicineName, change the pattern to:

r"(\d+)_([^_]*)_(\d+)_"
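As a minimal sketch, the generalized pattern could be wrapped in a small helper. The function name extract_fields and the otherName path are made up for illustration and are not from the question:

```python
import re

# Generalized pattern: date digits, any underscore-free name, identifier digits.
FIELDS = re.compile(r"(\d+)_([^_]*)_(\d+)_")

def extract_fields(path):
    """Return [date, name, identifier], or None if the pattern is not found."""
    match = FIELDS.search(path)
    return list(match.groups()) if match else None

print(extract_fields("C:/Users/10MedicineA/20072018_medicineName_00222_01111"))
# ['20072018', 'medicineName', '00222']

# Hypothetical path with a different name in the middle field.
print(extract_fields("C:/Users/MedicineB/21072018_otherName_03333_01121"))
# ['21072018', 'otherName', '03333']

# No digits followed by an underscore, so there is no match.
print(extract_fields("C:/Users/no_match_here"))
# None
```

Returning None instead of raising keeps the caller in charge of what to do with paths that do not follow the naming scheme.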

Considering that your strings are paths, you could also use pathlib for that task:

from pathlib import Path

s = "C:/Users/10MedicineA/20072018_medicineName_00222_01111"

last_part = Path(s).name
print(last_part.split("_")[:3])
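If the real strings contain a backslash before the file name, as in the question, Path only splits on it when running on Windows. A sketch that should behave the same on any platform, assuming the string is written as a raw literal, uses PureWindowsPath, which treats both / and \ as separators:

```python
from pathlib import PureWindowsPath

# Raw string so the backslash is kept as a literal character.
s = r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"

# PureWindowsPath applies Windows path rules regardless of the host OS.
last_part = PureWindowsPath(s).name
print(last_part.split("_")[:3])
# ['20072018', 'medicineName', '00222']
```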

Upvotes: 3

Emma

Reputation: 27743

My guess is that this expression might return the desired output:

.*\\|(.+?)_

The first alternative consumes all characters up to the last \. After that, the capturing group (.+?) returns each desired substring, and the last undesired substring after the final _ is excluded.
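A quick sketch of how this behaves with re.findall, assuming the input is a raw string so that the backslash is literal. Because the first alternative has no capturing group, it contributes an empty string that needs to be filtered out:

```python
import re

s = r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"

# The first alternative swallows everything up to the last backslash and
# leaves group 1 empty; the second captures each run that ends at "_".
raw = re.findall(r".*\\|(.+?)_", s)
print(raw)
# ['', '20072018', 'medicineName', '00222']

# Drop the empty entry produced by the prefix alternative.
fields = [m for m in raw if m]
print(fields)
# ['20072018', 'medicineName', '00222']
```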


If you wish to capture the first three underscore-delimited substrings after the backslash, this expression might work:

\\([^\\_\s]+)_([^\\_\s]+)_([^\\_\s]+)_

Test

import re

regex = r"\\([^\\_\s]+)_([^\\_\s]+)_([^\\_\s]+)_"

test_str = ("C:/Users/10MedicineA\\20072018_medicineName_00222_01111\n"
    "C:/Users/MedicineB\\21072018_medicineName_03333_01121\n"
    "Users/3A Medicine\\\\200726_21-PQmed_00223_07_01110")

print(re.findall(regex, test_str))


Upvotes: 0

Kushan Gunasekera

Reputation: 8576

Try this:

import re

# Raw strings keep the backslash literal; without the r prefix, "\200" is read as an octal escape
str1 = r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"
str2 = r"C:/Users/MedicineB\21072018_medicineName_03333_01121"

pattern = re.compile(r'(\d+)_([^_\s]*)_(\d+)')

print(list(pattern.search(str1).groups()))
# ['20072018', 'medicineName', '00222']

print(list(pattern.search(str2).groups()))
# ['21072018', 'medicineName', '03333']
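A note on the string literals: in a regular (non-raw) Python string, \200 and \210 are octal escapes, so the backslash and the first three digits of the date collapse into a single character. A minimal sketch of the difference, using the question's str1 (the names plain and raw are just for illustration):

```python
# Non-raw literal: "\200" is an octal escape for a single character (code point 128).
plain = "C:/Users/10MedicineA\20072018_medicineName_00222_01111"

# Raw literal: the backslash and all eight date digits are kept as written.
raw = r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"

print(len(raw) - len(plain))  # 3
print("\\" in plain)          # False
print("\\" in raw)            # True
```

If the strings come from a file or from the file system rather than from a literal in source code, the backslash is already a literal character and no r prefix is involved.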

Here is the visualization of my regex pattern: [image]

Upvotes: 1
