Reputation: 249
I have a complex, large pandas dataframe with one column, X that can contain either one list or a list of lists. I'm curious if the solution can apply to any content though, so I give a mock example with one element of X being a string as well:
df1 = pd.DataFrame({
'A': [1, 1, 3],
'B': ['a', 'e', 'f'],
'X': ['something', ['hello'], [['something'],['hello']]]}
)
I want to get the subset of that dataframe, df2, for which column X contains the substring "hello", when whatever is in there is read as a string.
>>> df2
A B X
0 1 e [hello]
1 3 f [[something], [hello]]
I have tried extensive combinations of str() and .str.contains, apply, map, .find(), list comprehensions, and nothing seems to work without getting into loops (related questions here and here. What am I missing?
Upvotes: 3
Views: 1224
Reputation: 294556
Borrowing from @wim https://stackoverflow.com/a/49247980/2336654
The most general solution would be to allow for arbitrarily nested lists. Also, We can focus on the string elements being equal rather than containing.
# This import is for Python 3
# for Python 2 use `from collections import Iterable`
from collections.abc import Iterable
def flatten(collection):
for x in collection:
if isinstance(x, Iterable) and not isinstance(x, str):
yield from flatten(x)
else:
yield x
df1[df1.X.map(lambda x: any('hello' == s for s in flatten(x)))]
A B X
1 1 e [hello]
2 3 f [[something], [hello]]
So now if we complicate it
df1 = pd.DataFrame({
'A': [1, 1, 3, 7, 7],
'B': ['a', 'e', 'f', 's', 's'],
'X': [
'something',
['hello'],
[['something'],['hello']],
['hello world'],
[[[[[['hello']]]]]]
]}
)
df1
A B X
0 1 a something
1 1 e [hello]
2 3 f [[something], [hello]]
3 7 s [hello world]
4 7 s [[[[[['hello']]]]]]
Our filter does not grab hello world
and does grab the very nested hello
df1[df1.X.map(lambda x: any('hello' == s for s in flatten(x)))]
A B X
1 1 e [hello]
2 3 f [[something], [hello]]
4 7 s [[[[[['hello']]]]]]
Upvotes: 2
Reputation: 323396
Adding astype
before str.contains
df1[df1.X.astype(str).str.contains('hello')]
Out[538]:
A B X
1 1 e [hello]
2 3 f [[something], [hello]]
Upvotes: 6
Reputation: 38425
You can use np.ravel() to flatten nested list and use in operator
df1[df1['X'].apply(lambda x: 'hello' in np.ravel(x))]
A B X
1 1 e [hello]
2 3 f [[something], [hello]]
Upvotes: 5