Reputation: 37
i have data frame that looks like this
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT
i want to be able to strip ,60 ,R-12,HT
using regex and also deletes the moreinfo
and GoCats
rows from the df.
My expected Results:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
I first removed the strings
del = ['hello', 'moreinfo']
for i in del:
df = df[value!= i]
Can somebody suggest a way to use regex to match and delete all case that do match A067-M4FL-CAA-020
or MZF8-050Z-AAB
pattern so i don't have to create a list for all possible cases?
I was able to strip a single line like this but i want to be able to strip all matching cases in the dataframe
pattern = r',\w+ \,\w+-\w+\,\w+ *'
line = 'MRF2-050A-TFC,60 ,R-12,HT'
for i in re.findall(pattern, line):
line = line.replace(i,'')
>>> MRF2-050A-TFC
I tried adjusting my code but it prints out the same output for each row
pattern = r',\w+ \,\w+-\w+\,\w+ *'
for d in df:
for i in re.findall(pattern, d):
d = d.replace(i,'')
Any suggestions will be greatly appreciated. Thanks
Upvotes: 2
Views: 78
Reputation: 21739
Here's a simpler approach you can try without using regex. pandas has many in-built functions to deal with text data.
# remove unwanted values
df['value'] = df.value.str.replace(r'moreinfo|60|R-.*|HT|GoCats|\,', '')
# drop na
df = df[(df != '')].dropna()
# print
print(df)
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
3 MZF8-050Z-AAB
5 MZA2-0580-TFD
-----------
# data used
df = pd.read_fwf(StringIO(u'''
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT'''),header=1)
Upvotes: 1
Reputation: 1236
You may try this
(?:\w+-){2,}[^,\n]*
Python scripts may be as follows
ss="""0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT"""
import re
regx=re.compile(r'(?:\w+-){2,}[^,\n]*')
m= regx.findall(ss)
for i in range(len(m)):
print("%d %s" %(i, m[i]))
and the output is
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
Upvotes: 1
Reputation: 2981
I'd suggest capturing the data you DO want, since it's pretty particular, and the data you do NOT want could be anything.
Your pattern should look something like this:
^\w{4}-\w{4}-\w{3}(?:-\d{3})?
https://regex101.com/r/NtH2Ut/2
I'd recommend being more specific than \w
where possible. (Like ^[A-Z]\w{3}
) if you know the beginning four character chunk should start with a letter.
edit Sorry, I may not have read your input and output literally enough:
https://regex101.com/r/NtH2Ut/3
^(?:\d+\s+\w{4}-\w{4}-\w{3}(?:-\d{3})?)|^\s+.*
Upvotes: 0