Reputation: 60
I am working towards cleaning a pandas DataFrame column. The column contains words that I would like to find and extract from a list.
Below is what I've got, but it does not return multiple matches. Here is an example.
import pandas as pd

data = {'A': ['abc 1 foo', 'def 1,bar', 'abc 2', 'def 2', 'abc 1/def 1 baz', 'abc 1,def 1']}
l = ['abc 1', 'def 1']
df = pd.DataFrame(data)
for idx, row in df.iterrows():
    for x in l:
        if x in row.A:
            df.loc[idx, 'new_col'] = x
Actual output:

A              new_col
abc 1          abc 1
def 1          def 1
abc 2          NaN
def 2          NaN
abc 1/def 1    def 1
abc 1,def 1    def 1
Expected output:

A              new_col
abc 1          abc 1
def 1          def 1
abc 2          NaN
def 2          NaN
abc 1/def 1    abc 1,def 1
abc 1,def 1    abc 1,def 1
Note: the separator in col A could be anything ('/', ';'), but the separator in new_col should be fixed.
Upvotes: 2
Views: 80
Reputation: 862611
Use str.findall with Series.str.join: join the values of the list into one pattern, with | for regex OR and \b for word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in l)
df['new_col'] = df['A'].str.findall(pat).str.join(',')
print (df)
A new_col
0 abc 1 foo abc 1
1 def 1,bar def 1
2 abc 2
3 def 2
4 abc 1/def 1 baz abc 1,def 1
5 abc 1,def 1 abc 1,def 1
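For reference, here is a minimal sketch (reusing l from the question) of what the joined pattern looks like; if the list items could ever contain regex metacharacters, wrapping each one in re.escape before joining would be a sensible extra step:

# Sketch: the alternation pattern built from l
pat = '|'.join(r"\b{}\b".format(x) for x in l)
print(pat)
# \babc 1\b|\bdef 1\b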
If NaNs are needed instead of empty strings, use numpy.where:
import numpy as np

pat = '|'.join(r"\b{}\b".format(x) for x in l)
s = df['A'].str.findall(pat)
df['new_col'] = np.where(s.astype(bool), s.str.join(','), np.nan)
print (df)
A new_col
0 abc 1 foo abc 1
1 def 1,bar def 1
2 abc 2 NaN
3 def 2 NaN
4 abc 1/def 1 baz abc 1,def 1
5 abc 1,def 1 abc 1,def 1
Upvotes: 1