Reputation: 323
I have several strings that look like this:
str1 = "C:/Users/10MedicineA\20072018_medicineName_00222_01111"
str2 = "C:/Users/MedicineB\21072018_medicineName_03333_01121"
I need to extract the digits after the backslash (which should be the date), the medicineName,
and the identifier (the first series of digits after "medicineName").
So the final result should look like:
['20072018','medicineName','00222']
How can I get everything after the backslash \ up to the underscore _?
I would like to do it with regex. Of course it's easy to filter out the C:/Users/
part, because it's always the same, but that's not true for the rest:
final = re.findall(r'\d+\.*',str1)
['10','20072018','00222','01111']
or
final = re.findall(r'(?=[0-9]).*(?=\_)', str1)
Upvotes: 2
Views: 1987
Reputation: 19432
If you want to stick with regex, you could do something like:
import re
strings = ["C:/Users/10MedicineA/20072018_medicineName_00222_01111",
"C:/Users/MedicineB/21072018_medicineName_03333_01121"]
for s in strings:
    # Search the current string s, not a fixed variable
    r = re.search(r"(\d+)_(medicineName)_(\d+)_", s)
    if r:
        print(list(r.groups()))
And this gives:
['20072018', 'medicineName', '00222']
['21072018', 'medicineName', '03333']
To cover more general names, change the pattern to:
r"(\d+)_([^_]*)_(\d+)_"
Considering that your strings are paths, you could also use pathlib for that task:
from pathlib import Path
s = "C:/Users/10MedicineA/20072018_medicineName_00222_01111"
last_part = Path(s).name
print(last_part.split("_")[:3])
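The snippet above uses forward slashes throughout, while the strings in the question contain a backslash before the file name. A minimal sketch of one way to handle that, assuming the input is written as a raw string and that PureWindowsPath is acceptable here (it treats both / and \ as separators on any operating system):

```python
from pathlib import PureWindowsPath

# Raw string so the backslash stays a literal character.
s = r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"

# PureWindowsPath splits on both "/" and "\" regardless of the host OS.
last_part = PureWindowsPath(s).name
fields = last_part.split("_")[:3]
print(fields)  # ['20072018', 'medicineName', '00222']
```

On Linux or macOS a plain Path would not split on the backslash, which is why the pure Windows flavour is used in this sketch.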
Upvotes: 3
Reputation: 27743
My guess is that this expression might return the desired output:
.*\\|(.+?)_
The first alternative consumes all characters up to the last \, and the capturing group (.+?) then returns each desired substring, excluding the trailing substring after the final _.
If you wish to find the first three substrings before _, this expression might work:
\\([^\\_\s]+)_([^\\_\s]+)_([^\\_\s]+)_
import re
regex = r"\\([^\\_\s]+)_([^\\_\s]+)_([^\\_\s]+)_"
test_str = ("C:/Users/10MedicineA\\20072018_medicineName_00222_01111\n"
"C:/Users/MedicineB\\21072018_medicineName_03333_01121\n"
"Users/3A Medicine\\\\200726_21-PQmed_00223_07_01110")
print(re.findall(regex, test_str))
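For comparison, here is a rough sketch of how the first expression, .*\\|(.+?)_, behaves with re.findall. The match that consumes the prefix up to the backslash leaves the capturing group empty, so this sketch filters out empty strings afterwards; the variable names are only illustrative.

```python
import re

# Raw string so the backslash stays a literal character.
s = r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"

# The first alternative swallows the prefix (group is empty for that match);
# the second alternative captures each chunk that is followed by "_".
matches = re.findall(r".*\\|(.+?)_", s)

# Drop the empty entry produced by the prefix match.
fields = [m for m in matches if m]
print(fields)  # ['20072018', 'medicineName', '00222']
```

The trailing 01111 is not captured because no underscore follows it.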
Upvotes: 0
Reputation: 8576
Try this,
import re
# Raw strings keep the backslash literal; otherwise "\200" and "\210" are read as octal escapes.
str1 = r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"
str2 = r"C:/Users/MedicineB\21072018_medicineName_03333_01121"
pattern = re.compile(r'(\d+)_([^_\s]*)_(\d+)')
print(list(pattern.search(str1).groups()))
# ['20072018', 'medicineName', '00222']
print(list(pattern.search(str2).groups()))
# ['21072018', 'medicineName', '03333']
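Calling .groups() directly on the result of search raises AttributeError when a string does not match. A small sketch of a guard around the same pattern, where extract_fields is a hypothetical helper name and the inputs are assumed to be raw strings:

```python
import re

# Same pattern as above: digits, an underscore-free name, digits.
PATTERN = re.compile(r"(\d+)_([^_\s]*)_(\d+)")

def extract_fields(path):
    """Return [date, name, identifier], or None if the pattern does not match."""
    m = PATTERN.search(path)
    return list(m.groups()) if m else None

print(extract_fields(r"C:/Users/10MedicineA\20072018_medicineName_00222_01111"))
print(extract_fields("no match here"))
```

Returning None lets the caller decide how to handle unexpected file names instead of crashing midway through a batch.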
Upvotes: 1