The Great
The Great

Reputation: 7693

pandas extract keywords after a list of select keywords

I have a list of strings like below

strings = [
    "1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]

I would like to do the below

a) Identify ASN, KOR, IND, TWN keyword and pick whatever comes after it

b) Identify .xlsx keyword and pick whatever comes before it.

c) The resulting output should not start with _ but it can have underscore in between the keywords of output string (but it should not start with _)

I tried the below (Based on inspiration from this post

regex_output = re.compile(r"\_[ASN|KOR|TWN|IND]{3}\_([a-zA-Z\_]+)")

for s in strings:
    print(regex_output.search(s).group(1))

How can I put a condition that it should be only looking for words before .xlsx

I expect my output to be like as shown below (exclude .xlsx keyword and _ symbol before the start of output string)

ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV

Upvotes: 0

Views: 137

Answers (4)

The fourth bird
The fourth bird

Reputation: 163207

You can repeat your character class 1 or more times in a non greedy way, and then match an optional underscore before matching .xslx

(?:ASN|KOR|IND|TWN)_([a-zA-Z_]+?)_?\.xlsx

Explanation

  • (?:ASN|KOR|IND|TWN) Match any of the alternatives in a non capture group
  • _([a-zA-Z_]+?) Match an underscore, and capture in group 1 1 or more times any of the listed characters, as few as possible using +?
  • _? Match an optional underscore
  • \.xlsx Match `.xslx

See a regex demo.

import re

strings = [
    "1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]

regex_output = re.compile(r"(?:ASN|KOR|IND|TWN)_([a-zA-Z_]+?)_?\.xlsx")

for s in strings:
    m = regex_output.search(s)
    if m:
        print(m.group(1))

Output

ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV

Note that this pattern would allow for double underscores like __ in the string.

If you don't want to allow that, you may use a pattern repeating a single underscore after matching 1 or more chars A-Za-z

(?:ASN|KOR|IND|TWN)_((?:[a-zA-Z]+_)*[a-zA-Z]+_?)\.xlsx

See another regex demo.

Upvotes: 1

mozway
mozway

Reputation: 260300

Assuming a DataFrame/Series, you can use str.extract with the regex _(?:ASN|KOR|TWN|IND)_([a-zA-Z_]+[a-zA-Z]):

regex = r"_(?:ASN|KOR|TWN|IND)_([a-zA-Z_]+[a-zA-Z])"
df['out'] = df['col'].str.extract(regex)

output:

                                                 col             out
0     1234_4534_41247612_2462184_ASN_ABCDEF_GHI.xlsx      ABCDEF_GHI
1  1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184...       PQRST_GHI
2  12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_...  JKLMN_abcd_OPQ
3  1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621...             WXY
4  1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_24621...         WXY_TUV

Regex:

_                      # underscore
(?:ASN|KOR|TWN|IND)    # any of the words
_                      # underscore
([a-zA-Z_]+)           # capture letters/underscore

Upvotes: 1

Andrej Kesely
Andrej Kesely

Reputation: 195408

Try (regex101):

import re

strings = [
    "1234_4534_41247612_2462184_ASN_ABCDEF_GHI_.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_KOR_PQRST_GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_TWN_JKLMN_abcd_OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_IND_WXY_TUV.xlsx",
]

pat = re.compile(r"(?:ASN|KOR|IND|TWN)_(.*?)_?\.xlsx")

for s in strings:
    m = pat.search(s)
    if m:
        print(m.group(1))

Prints:

ABCDEF_GHI
PQRST_GHI
JKLMN_abcd_OPQ
WXY
WXY_TUV

Upvotes: 1

pizza_slice
pizza_slice

Reputation: 76

import re
result =  [re.split('.xlsx', re.split('ASN|KOR|IND|TWN', i)[-1])[0].replace('_', '', 1) for i in strings]

Upvotes: 1

Related Questions