BenT
BenT

Reputation: 37

Using Reg Ex to Match Strings in a Data Frame and Replace - python

i have data frame that looks like this

    value
0   A067-M4FL-CAA-020
1   MRF2-050A-TFC,60 ,R-12,HT
2   moreinfo
3   MZF8-050Z-AAB
4   GoCats
5   MZA2-0580-TFD,60 ,R-669,LT

i want to be able to strip ,60 ,R-12,HT using regex and also deletes the moreinfo and GoCats rows from the df.

My expected Results:

     value
0   A067-M4FL-CAA-020
1   MRF2-050A-TFC
2   MZF8-050Z-AAB
3   MZA2-0580-TFD

I first removed the strings

del = ['hello', 'moreinfo']
for i in del:
   df = df[value!= i]

Can somebody suggest a way to use regex to match and delete all case that do match A067-M4FL-CAA-020 or MZF8-050Z-AAB pattern so i don't have to create a list for all possible cases?

I was able to strip a single line like this but i want to be able to strip all matching cases in the dataframe

pattern = r',\w+ \,\w+-\w+\,\w+ *'
line = 'MRF2-050A-TFC,60 ,R-12,HT'
for i in re.findall(pattern, line):
   line = line.replace(i,'')

>>> MRF2-050A-TFC

I tried adjusting my code but it prints out the same output for each row

pattern = r',\w+ \,\w+-\w+\,\w+ *'
for d in df:
   for i in re.findall(pattern, d):
     d = d.replace(i,'')

Any suggestions will be greatly appreciated. Thanks

Upvotes: 2

Views: 78

Answers (3)

YOLO
YOLO

Reputation: 21739

Here's a simpler approach you can try without using regex. pandas has many in-built functions to deal with text data.

# remove unwanted values
df['value'] = df.value.str.replace(r'moreinfo|60|R-.*|HT|GoCats|\,', '')

# drop na
df = df[(df != '')].dropna()

# print
print(df)

    value
0   A067-M4FL-CAA-020
1   MRF2-050A-TFC
3   MZF8-050Z-AAB
5   MZA2-0580-TFD

-----------
# data used
df = pd.read_fwf(StringIO(u'''
    value
0   A067-M4FL-CAA-020
1   MRF2-050A-TFC,60 ,R-12,HT
2   moreinfo
3   MZF8-050Z-AAB
4   GoCats
5   MZA2-0580-TFD,60 ,R-669,LT'''),header=1)

Upvotes: 1

Thm Lee
Thm Lee

Reputation: 1236

You may try this

(?:\w+-){2,}[^,\n]*

Demo

Python scripts may be as follows

ss="""0   A067-M4FL-CAA-020
1   MRF2-050A-TFC,60 ,R-12,HT
2   moreinfo
3   MZF8-050Z-AAB
4   GoCats
5   MZA2-0580-TFD,60 ,R-669,LT"""

import re
regx=re.compile(r'(?:\w+-){2,}[^,\n]*')
m= regx.findall(ss)

for i in range(len(m)):
    print("%d   %s" %(i, m[i]))

and the output is

0   A067-M4FL-CAA-020
1   MRF2-050A-TFC
2   MZF8-050Z-AAB
3   MZA2-0580-TFD

Upvotes: 1

zzxyz
zzxyz

Reputation: 2981

I'd suggest capturing the data you DO want, since it's pretty particular, and the data you do NOT want could be anything.

Your pattern should look something like this:

^\w{4}-\w{4}-\w{3}(?:-\d{3})?

https://regex101.com/r/NtH2Ut/2

I'd recommend being more specific than \w where possible. (Like ^[A-Z]\w{3}) if you know the beginning four character chunk should start with a letter.

edit Sorry, I may not have read your input and output literally enough:

https://regex101.com/r/NtH2Ut/3

^(?:\d+\s+\w{4}-\w{4}-\w{3}(?:-\d{3})?)|^\s+.*

Upvotes: 0

Related Questions