Reputation: 314
Let's say I have the following list:
list = ['a', 'b', 'c', 'd']
And a DataFrame like this:
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
Out:
content
0 [a, b, abc]
1 [c, d, xyz]
2 [d, xyz]
I need a function that can remove every element from the 'content' column that is not in 'list', so my output would look like this:
Out:
content
0 [a, b]
1 [c, d]
2 [d]
Please consider that my actual df has about 1m rows and the list about 1k items. I tried by iterating over rows, but that took ages...
Upvotes: 2
Views: 232
Reputation: 6543
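Given that the list of strings we want to check membership against is of length ~1k, any of the answers already posted can be made significantly more efficient by first converting this list to a set. A membership test on a set takes roughly constant time on average, whereas a list has to be scanned element by element.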
In my testing, the fastest method was converting the list to a set and then using the answer posted by W-B:
l = set(l)
df['new'] = [[y for y in x if y in l] for x in df.content]
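For reference, the snippet above can be wrapped in a small helper and run against the sample data from the question. This is only a minimal sketch: the name filter_rows is made up for illustration, and plain Python lists stand in for the 'content' column.

```python
def filter_rows(rows, keep):
    """Return a copy of rows with only the items that appear in keep.

    filter_rows is a hypothetical helper name, not part of pandas.
    """
    keep_set = set(keep)  # build the set once, outside the loop
    return [[item for item in row if item in keep_set] for row in rows]


# Sample data from the question
keep = ['a', 'b', 'c', 'd']
content = [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]

filtered = filter_rows(content, keep)
print(filtered)  # [['a', 'b'], ['c', 'd'], ['d']]
```

With a DataFrame, the same call should work as df['new'] = filter_rows(df['content'], keep), since iterating over a Series yields its values.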
Full testing code and results below. I had to make some assumptions about the exact nature of the real dataset, but I think that my randomly generated lists of strings should be reasonably representative. Note that I excluded the solution from T Burgis as I ran into an error with it - could have been me doing something wrong, but since they had already commented that W-B's solution was faster, I didn't try too hard to figure it out. I should also note that for all solutions I assigned the result to df['new'] regardless of whether or not the original answer did so, for consistency's sake.
import random
import string
import pandas as pd
def initial_setup():
    """
    Returns a 1m row x 1 column DataFrame, and a 992 element list of strings (all unique).
    """
    random.seed(1)
    keep = list(set([''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(1250)]))
    content = [[''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(5)] for j in range(1000000)]
    df = pd.DataFrame({'content': content})
    return df, keep

def jpp(df, L):
    df['new'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]

def wb(df, l):
    df['new'] = [[y for y in x if y in l] for x in df.content]

def jonathon(df, list1):
    df['new'] = [list(filter(lambda x:x in list1,i)) for i in df['content']]
Tests without conversion to set:
In [3]: df, keep = initial_setup()
...: %timeit jpp(df, keep)
...:
16.9 s ± 333 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: df, keep = initial_setup()
...: %timeit wb(df, keep)
1min ± 612 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: df, keep = initial_setup()
...: %timeit jonathon(df, keep)
1min 2s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tests with conversion to set:
In [6]: df, keep = initial_setup()
...: %timeit jpp(df, set(keep))
...:
1.7 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: df, keep = initial_setup()
...: %timeit wb(df, set(keep))
...:
689 ms ± 20.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: df, keep = initial_setup()
...: %timeit jonathon(df, set(keep))
...:
1.26 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
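The %timeit lines above need IPython. A rough way to reproduce the list-versus-set comparison in a plain script is the standard-library timeit module. The sketch below uses a much smaller dataset than the benchmark above (the sizes and the helper names random_word and wb_filter are arbitrary choices for illustration), so absolute timings will not match the figures quoted here.

```python
import random
import string
import timeit


def random_word():
    # Random lowercase string of 1 to 5 characters, as in initial_setup()
    return ''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5)))


def wb_filter(rows, keep):
    # Same list comprehension as wb(), minus the DataFrame assignment
    return [[y for y in x if y in keep] for x in rows]


random.seed(1)
keep_list = list({random_word() for _ in range(1250)})
rows = [[random_word() for _ in range(5)] for _ in range(1000)]  # 1k rows, not 1m
keep_set = set(keep_list)

# Both containers must give the same answer; only the lookup cost differs
assert wb_filter(rows, keep_list) == wb_filter(rows, keep_set)

t_list = timeit.timeit(lambda: wb_filter(rows, keep_list), number=3)
t_set = timeit.timeit(lambda: wb_filter(rows, keep_set), number=3)
print(f"list: {t_list:.3f} s, set: {t_set:.3f} s")
```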
Upvotes: 0
Reputation: 164773
Assuming the lists in your series contain unique values, you can use dict.keys to calculate the intersection. The result of & is a set, so the original order is not guaranteed to survive (see row 1 below):
df['content'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]
print(df)
content
0 [a, b]
1 [d, c]
2 [d]
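If the original order has to be kept, one possible alternative is to iterate over the deduplicated keys and test membership, rather than intersecting with &. A minimal sketch, assuming L holds the values to keep (the duplicate 'd' in the last row is added here only to show the deduplication):

```python
L = {'a', 'b', 'c', 'd'}  # values to keep; a set makes each lookup cheap
rows = [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz', 'd']]

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+);
# iterating over it, rather than using &, preserves that order
ordered = [[k for k in dict.fromkeys(x) if k in L] for x in rows]
print(ordered)  # [['a', 'b'], ['c', 'd'], ['d']]
```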
Upvotes: 2
Reputation: 2991
Another option, using filter:
>>> list1 = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
>>> df['content']=[list(filter(lambda x:x in list1,i)) for i in df['content']]
>>> df
content
0 [a, b]
1 [c, d]
2 [d]
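A small variation on the same idea, offered as a sketch: converting list1 to a set first and passing the set's __contains__ method to filter avoids calling a lambda for every element. The name allowed is introduced here for illustration, and plain lists stand in for the 'content' column.

```python
list1 = ['a', 'b', 'c', 'd']
content = [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]

allowed = set(list1)  # hypothetical name; set lookups avoid scanning list1
# filter keeps the elements for which allowed.__contains__(element) is true
result = [list(filter(allowed.__contains__, row)) for row in content]
print(result)  # [['a', 'b'], ['c', 'd'], ['d']]
```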
Upvotes: 0
Reputation: 323326
IIUC, a list comprehension does this (l is the list of values to keep):
df['new']=[[y for y in x if y in l] for x in df.content]
df
Out[535]:
content new
0 [a, b, abc] [a, b]
1 [c, d, xyz] [c, d]
2 [d, xyz] [d]
Upvotes: 3
Reputation: 1435
One way to do this is with apply:
keep = ['a', 'b', 'c', 'd'] # don't use list as a variable name
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
df['fixed_content'] = df.apply(lambda row: [x for x in row['content'] if x in keep],axis=1)
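Since only one column is involved, a possible refinement is to call apply on that column rather than on every row of the DataFrame, and to look values up in a set. This is a sketch, not a benchmarked recommendation; keep_set is a name introduced here.

```python
import pandas as pd

keep = ['a', 'b', 'c', 'd']
keep_set = set(keep)  # hypothetical name; built once before apply runs
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})

# Series.apply passes each cell (a list) straight to the function,
# so there is no need for axis=1 or row['content']
df['fixed_content'] = df['content'].apply(lambda items: [x for x in items if x in keep_set])
print(df['fixed_content'].tolist())  # [['a', 'b'], ['c', 'd'], ['d']]
```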
Upvotes: 2