Reputation: 149
I needed to search multiple cells for a specific value and when it is found it should be returned in new column.
I got an answer here; Python: find string in multiple columns and return it in new column yet this line below return the first value found
df['b'] = (df[cols].where(df[cols].stack().str.contains('b')
.unstack(fill_value=False)).ffill(1).iloc[:,-1])
where cols
df = df[['col1', 'col2', 'col3', 'col4']]
I tried the other answers and they all gave me error ValueError: cannot reindex from a duplicate axis
Do any one have an idea how can I get all the matching values in one cell.
the data set
ID col0 col1 col2 col3 col4 col5
1 jack a/h t/m w/n y/h 56
2 sam z/n b/w null b/n 93
3 john b/i y/d b/d null 33
The code I'm using now :
df['b'] = (df[cols].where(df[cols].stack().str.contains('b')
.unstack(fill_value=False)).ffill(1).iloc[:,-1])
And here the output I am getting now
ID col0 col1 col2 col3 col4 col5 b
1 jack a/h t/m w/n y/h 56 -
2 sam z/n b/w null b/n 93 b/w
3 john b/i y/d b/d null 33 b/i
And actually I want the output to look like the data-frame below
ID col0 col1 col2 col3 col4 col5 b
1 jack a/h t/m w/n y/h 56 null
2 sam z/n b/w null b/n 93 b/w - b/n
3 john b/i y/d b/d null 33 b/i - b/d
Upvotes: 1
Views: 467
Reputation: 71707
Use DataFrame.filter
to filter dataframe containing columns col1-col4
and use DataFrame.stack
, then using Series.str.contains
filter the stacked dataframe finally use Series.groupby
on level=0
and aggregate using join
:
s = df.filter(regex=r'col[1-4]').stack()
s = s[s.str.contains('b')].groupby(level=0).agg(' - '.join)
df['b'] = s
Result:
# print(df)
ID col0 col1 col2 col3 col4 col5 b
0 1 jack a/h t/m w/n y/h 56 NaN
1 2 sam z/n b/w NaN b/n 93 b/w - b/n
2 3 john b/i y/d b/d NaN 33 b/i - b/d
Upvotes: 2
Reputation: 12523
Here's a way to do it with melt
:
t = df.melt(id_vars="ID", value_vars = ["col0", "col1", "col2", "col3", "col4"], var_name=[]).dropna()
t = t[t.value.str.contains("b")].groupby("ID")["value"].apply(lambda x: " - ".join(x))
res = pd.concat([df, t], axis=1).rename(columns={"value":"b"})
The output is:
ID col0 col1 col2 col3 col4 col5 b
0 1.0 jack a/h t/m w/n y/h 56.0 NaN
1 2.0 sam z/n b/w NaN b/n 93.0 NaN
2 3.0 john b/i y/d b/d NaN 33.0 b/w - b/n
3 NaN NaN NaN NaN NaN NaN NaN b/i - b/d
Upvotes: 0