Reputation: 169
I’m using Jupyter Notebook (Python 3) and trying to extract keywords from a list out of a text column in a pandas DataFrame. The list will contain around 50 keywords.
Example:
import pandas as pd
import re
rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)
pattern2 = "("+'|'.join(rgx_words1)+")"
re_patt2 = re.compile(pattern2)
data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
# Create the pandas DataFrame
mydf = pd.DataFrame(data, columns = ['id', 'text'])
mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt,x['text']),axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2,x['text']),axis=1)
With re_patt I’m extracting exact words and I’m getting correct results: for id 1 my output is algaecide, algaecid, algaecides. With re_patt2 I would also like to match keywords embedded in longer strings such as 'ssssalgaecidllll', where the wanted output is 'algaecid'. However, the output of re_patt2 for id 1 is algaecid, algaecid, algaecid, while my wanted output is algaecide, algaecid, algaecides. I would be grateful for any advice. Thank you in advance.
Upvotes: 0
Views: 82
Reputation: 163362
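You can change pattern2 to optionally match non-whitespace characters other than a comma, [^\s,]*, to the left and to the right of the keyword alternation. Each match is then the whole comma-separated token that contains a keyword.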
pattern2 = r"[^\s,]*(?:" + '|'.join(rgx_words1) + r")[^\s,]*"
The code could look like
import pandas as pd
import re
rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)
# Raw strings keep \s as a regex escape rather than an invalid string escape
pattern2 = r"[^\s,]*(?:" + '|'.join(rgx_words1) + r")[^\s,]*"
re_patt2 = re.compile(pattern2)
data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
mydf = pd.DataFrame(data, columns = ['id', 'text'])
mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt, x['text']), axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2, x['text']), axis=1)
print(mydf)
Output
   id                                               text                            matches                           matches2
0   1  I, will, find, algaecide, dd, algaecid, algaec...  [algaecide, algaecid, algaecides]  [algaecide, algaecid, algaecides]
1   2                       fff, algaecid, dd, algaecide              [algaecid, algaecide]              [algaecid, algaecide]
2   3                       ssssalgaecidllll, algaecides                       [algaecides]     [ssssalgaecidllll, algaecides]
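If the goal is to return only the keyword itself (for example algaecid from ssssalgaecidllll) rather than the whole token, another option may be to keep the plain alternation from the question but sort the keywords longest first. Python's re tries alternatives from left to right and takes the first one that matches, which is why algaecid, listed first, shadowed algaecide and algaecides. The following is a minimal sketch under that assumption, not part of the pattern above; build_substring_pattern and re_patt3 are names made up for illustration, and re.escape is added so that keywords are always treated literally.

```python
import re

# Keyword list from the question.
rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']


def build_substring_pattern(words):
    """Hypothetical helper: compile an alternation that tries longer keywords first."""
    # Longest first, so 'algaecides' is attempted before 'algaecide' and 'algaecid'.
    ordered = sorted(words, key=len, reverse=True)
    # re.escape makes every keyword match literally, whatever characters it contains.
    return re.compile("|".join(re.escape(word) for word in ordered))


re_patt3 = build_substring_pattern(rgx_words1)

# The same sample texts as in the question.
texts = [
    'I, will, find, algaecide, dd, algaecid, algaecides',
    'fff, algaecid, dd, algaecide',
    'ssssalgaecidllll, algaecides',
]

# One list of matched keywords per text.
results = [re_patt3.findall(text) for text in texts]
for text, found in zip(texts, results):
    print(text, '->', found)
```

On the DataFrame this could be applied with mydf['text'].apply(re_patt3.findall).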
Upvotes: 1