Reputation: 314

Elegant way to remove elements from list item in data frame if not contained in another list

Let's say I have the following list:

list = ['a', 'b', 'c', 'd']

And a DataFrame like this:

df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
Out:
       content
0  [a, b, abc]
1  [c, d, xyz]
2     [d, xyz]

I need a function that can remove every element from the 'content' column that is not in 'list', so my output would look like this:

Out:  
  content
0  [a, b]
1  [c, d]
2     [d]

Please note that my actual df has about 1 million rows and the list has about 1,000 items. I tried iterating over the rows, but that took ages...

Upvotes: 2

Answers (5)

sjw

Reputation: 6543

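Given that the list of strings we want to check membership against has ~1k elements, any of the answers already posted can be made significantly more efficient by first converting this list to a set: `in` on a list is a linear scan, whereas on a set it is a hash lookup that takes constant time on average.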

In my testing, the fastest method was converting the list to a set and then using the answer posted by W-B:

l = set(l)
df['new'] = [[y for y in x if y in l] for x in df.content]
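As a minimal sketch on the question's own data (the names keep and keep_set are my choice, so that the built-in list isn't shadowed), the set-based version would look something like this:

```python
import pandas as pd

# `keep` is an assumed name for the question's list of allowed values.
keep = ['a', 'b', 'c', 'd']
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})

# Build the set once, outside the comprehension, so each `in` is a hash lookup.
keep_set = set(keep)
df['new'] = [[y for y in x if y in keep_set] for x in df['content']]

print(df['new'].tolist())  # [['a', 'b'], ['c', 'd'], ['d']]
```

The important part is that set(keep) is built once up front rather than inside the loop.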

Full testing code and results below. I had to make some assumptions about the exact nature of the real dataset, but I think that my randomly generated lists of strings should be reasonably representative. Note that I excluded the solution from T Burgis as I ran into an error with it - could have been me doing something wrong, but since they had already commented that W-B's solution was faster, I didn't try too hard to figure it out. I should also note that for all solutions I assigned the result to df['new'] regardless of whether or not the original answer did so, for consistency's sake.

import random
import string
import pandas as pd


def initial_setup():
    """
    Returns a 1m row x 1 column DataFrame, and a 992 element list of strings (all unique).
    """
    random.seed(1)
    keep = list(set([''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(1250)]))
    content = [[''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(5)] for j in range(1000000)]
    df = pd.DataFrame({'content': content})
    return df, keep


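def jpp(df, L):
    # Intersect each row's keys with L; the result of & is a set, so order may change.
    df['new'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]


def wb(df, l):
    # Nested list comprehension: keep each element that is in l.
    df['new'] = [[y for y in x if y in l] for x in df.content]


def jonathon(df, list1):
    # Same filter expressed with the built-in filter().
    df['new'] = [list(filter(lambda x: x in list1, i)) for i in df['content']]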

Tests without conversion to set:

In [3]: df, keep = initial_setup()
   ...: %timeit jpp(df, keep)
   ...: 
16.9 s ± 333 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: df, keep = initial_setup()
   ...: %timeit wb(df, keep)
1min ± 612 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: df, keep = initial_setup()
   ...: %timeit jonathon(df, keep)
1min 2s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tests with conversion to set:

In [6]: df, keep = initial_setup()
   ...: %timeit jpp(df, set(keep))
   ...: 
1.7 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: df, keep = initial_setup()
   ...: %timeit wb(df, set(keep))
   ...: 
689 ms ± 20.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: df, keep = initial_setup()
   ...: %timeit jonathon(df, set(keep))
   ...: 
1.26 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
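If you're not working in IPython, a rough version of the same comparison can be sketched with the standard timeit module. This is only an illustration on a much smaller sample (2,000 rows rather than 1m), and the helper names rand_word and filter_rows are mine, not from any of the answers; the absolute numbers will differ from those above.

```python
import random
import string
import timeit

random.seed(1)


def rand_word():
    # Random lowercase string of length 1 to 5, as in initial_setup() above.
    return ''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5)))


# Smaller stand-ins for the real data (assumed sizes, chosen to run quickly).
keep = list({rand_word() for _ in range(1250)})
rows = [[rand_word() for _ in range(5)] for _ in range(2000)]
keep_set = set(keep)


def filter_rows(rows, allowed):
    # Same nested comprehension as wb(), but on plain lists.
    return [[y for y in x if y in allowed] for x in rows]


# Both containers must give identical results; only the lookup cost differs.
same = filter_rows(rows, keep) == filter_rows(rows, keep_set)

t_list = timeit.timeit(lambda: filter_rows(rows, keep), number=3)
t_set = timeit.timeit(lambda: filter_rows(rows, keep_set), number=3)
print(f"same result: {same}, list: {t_list:.3f} s, set: {t_set:.3f} s")
```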

Upvotes: 0

jpp

Reputation: 164773

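Assuming the lists in your series contain unique values, you can use dict.keys to calculate the intersection with L, your list of values to keep. Note that `&` on a keys view returns a set, so the original element order is not guaranteed (row 1 below comes back as [d, c]):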

df['content'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]

print(df)

  content
0  [a, b]
1  [d, c]
2     [d]
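If the original order matters, one possible variant (a sketch, not part of the approach above) is to iterate over the dict keys and test membership, instead of using `&`. dict.fromkeys still drops duplicates and, in Python 3.7+, keeps first-seen order. The extra row with a repeated 'c' is made up for illustration.

```python
import pandas as pd

# L is assumed to be a set of allowed values here, for fast lookups.
L = {'a', 'b', 'c', 'd'}
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'c', 'xyz'], ['d', 'xyz']]})

# dict.fromkeys(x) de-duplicates x in first-seen order; the comprehension then filters.
df['content'] = [[k for k in dict.fromkeys(x) if k in L] for x in df['content']]

print(df['content'].tolist())  # [['a', 'b'], ['c', 'd'], ['d']]
```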

Upvotes: 2

Jonathon McMurray

Reputation: 2991

Another option, using filter:

>>> list1 = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
>>> df['content']=[list(filter(lambda x:x in list1,i)) for i in df['content']]
>>> df
  content
0  [a, b]
1  [c, d]
2     [d]
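For larger inputs, the same idea could be sketched with a set and its bound __contains__ method in place of the lambda. The name allowed is an assumption of mine, and I haven't timed this variant against the others.

```python
import pandas as pd

list1 = ['a', 'b', 'c', 'd']
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})

# `allowed` is a hypothetical name; a set makes each membership test a hash lookup.
allowed = set(list1)

# allowed.__contains__(x) is what `x in allowed` calls, so no lambda is needed.
df['content'] = [list(filter(allowed.__contains__, i)) for i in df['content']]

print(df['content'].tolist())  # [['a', 'b'], ['c', 'd'], ['d']]
```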

Upvotes: 0

BENY

Reputation: 323326

IIUC, a nested list comprehension does this (l is your list of allowed values):

df['new']=[[y for y in x if y in l] for x in df.content]
df
Out[535]: 
       content     new
0  [a, b, abc]  [a, b]
1  [c, d, xyz]  [c, d]
2     [d, xyz]     [d]

Upvotes: 3

T Burgis

Reputation: 1435

One way to do this is with apply:

keep = ['a', 'b', 'c', 'd'] # don't use list as a variable name
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})

df['fixed_content'] = df.apply(lambda row: [x for x in row['content'] if x in keep],axis=1)
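A possible alternative, sketched here as an assumption rather than a tested improvement, is to call apply on the 'content' column itself instead of on the whole frame with axis=1, and to look values up in a set (keep_set is a name I've introduced):

```python
import pandas as pd

keep = ['a', 'b', 'c', 'd']
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})

# Hypothetical helper set for fast membership tests.
keep_set = set(keep)

# Series.apply passes each cell (a list) to the function, so no row lookup is needed.
df['fixed_content'] = df['content'].apply(lambda lst: [x for x in lst if x in keep_set])

print(df['fixed_content'].tolist())  # [['a', 'b'], ['c', 'd'], ['d']]
```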

Upvotes: 2
