Reputation: 7693
I have a list of strings like below
strings = [
"1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]
I would like to do the below
a) Identify ASN, KOR, IND, TWN
keyword and pick whatever comes after it
b) Identify .xlsx
keyword and pick whatever comes before it.
c) The resulting output should not start with _
but it can have underscore in between the keywords of output string (but it should not start with _
)
I tried the below (Based on inspiration from this post
regex_output = re.compile(r"\_[ASN|KOR|TWN|IND]{3}\_([a-zA-Z\_]+)")
for s in strings:
print(regex_output.search(s).group(1))
How can I put a condition that it should be only looking for words before .xlsx
I expect my output to be like as shown below (exclude .xlsx keyword and _ symbol before the start of output string)
ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV
Upvotes: 0
Views: 137
Reputation: 163207
You can repeat your character class 1 or more times in a non greedy way, and then match an optional underscore before matching .xslx
(?:ASN|KOR|IND|TWN)_([a-zA-Z_]+?)_?\.xlsx
Explanation
(?:ASN|KOR|IND|TWN)
Match any of the alternatives in a non capture group_([a-zA-Z_]+?)
Match an underscore, and capture in group 1 1 or more times any of the listed characters, as few as possible using +?
_?
Match an optional underscore\.xlsx
Match `.xslxSee a regex demo.
import re
strings = [
"1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]
regex_output = re.compile(r"(?:ASN|KOR|IND|TWN)_([a-zA-Z_]+?)_?\.xlsx")
for s in strings:
m = regex_output.search(s)
if m:
print(m.group(1))
Output
ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV
Note that this pattern would allow for double underscores like __
in the string.
If you don't want to allow that, you may use a pattern repeating a single underscore after matching 1 or more chars A-Za-z
(?:ASN|KOR|IND|TWN)_((?:[a-zA-Z]+_)*[a-zA-Z]+_?)\.xlsx
See another regex demo.
Upvotes: 1
Reputation: 260300
Assuming a DataFrame/Series, you can use str.extract
with the regex _(?:ASN|KOR|TWN|IND)_([a-zA-Z_]+[a-zA-Z])
:
regex = r"_(?:ASN|KOR|TWN|IND)_([a-zA-Z_]+[a-zA-Z])"
df['out'] = df['col'].str.extract(regex)
output:
col out
0 1234_4534_41247612_2462184_ASN_ABCDEF_GHI.xlsx ABCDEF_GHI
1 1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184... PQRST_GHI
2 12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_... JKLMN_abcd_OPQ
3 1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621... WXY
4 1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621... WXY_TUV
Regex:
_ # underscore
(?:ASN|KOR|TWN|IND) # any of the words
_ # underscore
([a-zA-Z_]+) # capture letters/underscore
Upvotes: 1
Reputation: 195408
Try (regex101):
import re
strings = [
"1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]
pat = re.compile(r"(?:ASN|KOR|IND|TWN)_(.*?)_?\.xlsx")
for s in strings:
m = pat.search(s)
if m:
print(m.group(1))
Prints:
ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV
Upvotes: 1
Reputation: 76
import re
result = [re.split('.xlsx', re.split('ASN|KOR|IND|TWN', i)[-1])[0].replace('_', '', 1) for i in strings]
Upvotes: 1