Reputation: 309
Suppose I have a dataframe df as:
df = pd.DataFrame({'Index': [1, 2, 3, 4, 5],
'Name': ['A', 'B', 100, 'C', 'D'],
'col1': [np.nan, 'bbby', 'cccy', 'dddy', 'EEEEE'],
'col2': ['water', np.nan, 'WATER', 'soil', 'cold air'],
'col3': ['watermelone', 'hot AIR', 'air conditioner', 'drink', 50000],
'Results': [1000, 2000, 3000, 4000, 5000]})
Out
Index Name col1 col2 col3 Results
1 A NaN water watermelone 1000
2 B bbbY NaN hot AIR 2000
3 100 cccY water air conditioner 3000
4 C dddf soil drink 4000
5 D EEEEE cold air 50000 5000
I have a list: matches = ['wat','air']
How can I choose all rows with col1
or col2
or col3
containing i
in matches
.
Expected output:
Index Name col1 col2 col3 Results
1 A NaN water watermelone 1000
2 B bbbY NaN hot AIR 2000
3 100 cccY water air conditioner 3000
5 D EEEEE cold air 50000 5000
Upvotes: 5
Views: 1248
Reputation: 708
Here is how I would do it:
df.loc[(df.apply(lambda x: any(m in str(v).lower()
for v in x.values
for m in matches), axis=1))]
Here is a efficiency comparison test with @David Erikson answer on a 50,000 row dataframe on a 15 core cpu setup:
test_df = pd.concat([df]*10000)
test_df.reset_index(drop=True, inplace=True)
%timeit result_df = test_df.loc[(test_df.apply(lambda x: any(m in str(v).lower() for v in x.values for m in matches), axis=1))]
455 ms ± 5.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
matchess = '|'.join(matches)
%timeit test_df = test_df.reset_index(drop=True).T.fillna('')
2.97 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df_T = test_df_T.applymap(str)
7.94 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result_df_2 = test_df.T[[test_df[col].str.lower().str.contains(matchess).values.any() for col in test_df.columns]]
28 s ± 1.6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So just the last line of the second method is more than 56 times slower. (all together ~80 times slower)
Upvotes: 3
Reputation: 16683
You can use .T
to transpose the dataframe and str.contains
to check the values column-wise and then transpose back (also str.contains
can have multiple values passed to if separated with |
, which is why I change the list to a string with matches = '|'.join(matches)
).
The benefit of transposing the dataframe is that you can use column-wise pandas method instead of looping through rows or a long lambda x:
list comprehension. This technique should have good performance
compared to a lambda x
with axis=1
answer:
# df = df.set_index('Index')
matches = ['wat','air']
matches = '|'.join(matches)
df = df.reset_index(drop=True).T.fillna('')
df = df.T[[df[col].str.lower().str.contains(matches).values.any() for col in df.columns]]
df
Out[1]:
Name col1 col2 col3
0 A water watermelone
1 B bbbY hot AIR
2 B cccY water air conditioner
4 D EEEEE cold air eat
Upvotes: 0
Reputation: 1614
Try this as well:
df = df[df['col1'].str.contains('|'.join(matches))|df['col2'].str.contains('|'.join(matches))|df['col3'].str.contains('|'.join(matches))]
Prints:
Name col1 col2 col3
1 A aadY water watermelone
2 B bbbY air hot AIR
3 B cccY water air conditioner
5 D EEEEE cold air eat
Upvotes: 0