nerd

Reputation: 493

Remove rows when column values already present as an element of a list in another column

I would like to remove a row entirely when the value in a specific column, such as user, already appears as an element of a list in another column. How can I best accomplish this?

    user          friend
0   jack         [mary, jane, alex]
1   mary         [kate, andrew, jensen]
2   alice        [marina, catherine, howard]
3   andrew       [syp, yuslina, john]
4   catherine    [yute, kelvin]
5   john         [beyond, holand]

Expected Output:

    user                       friend
0   jack           [mary, jane, alex]
2  alice  [marina, catherine, howard]

Upvotes: 1

Views: 73

Answers (2)

mozway

Reputation: 260790

Your example seems incorrect, as either john should be kept (blacklist is made of all previous friends), or andrew should be removed (blacklist is only the previous list of friends).

Here are different options.

Remove the row if the user is present in:

any set of friends

S = set().union(*df['friend'])

mask = ~df['user'].isin(S)
# [True, False, True, False, False, False]

df[mask]

output:

    user                       friend
0   jack           [mary, jane, alex]
2  alice  [marina, catherine, howard]

all previous sets of friends

You can first compute an expanding set of friends, then check whether each user is in the set:

S = set()
# line below uses python ≥ 3.8, if older version use a classical loop
sets = [(S:=S.union(set(x))) for x in df['friend']]

mask = [u not in s for u,s in zip(df['user'], sets)]
# [True, False, True, False, False, False]
out = df[mask]

output:

    user                       friend
0   jack           [mary, jane, alex]
2  alice  [marina, catherine, howard]
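For Python older than 3.8, where the `:=` operator is unavailable, the same expanding check can be sketched as a classical loop. This is a minimal sketch rather than the only way to write it; the names `seen` and `mask` are illustrative. Like the one-liner above, it adds the current row's friends to the set before testing the user.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

seen = set()  # all friends seen so far, current row included
mask = []
for user, friends in zip(df['user'], df['friend']):
    seen.update(friends)           # grow the set with this row's friends
    mask.append(user not in seen)  # keep the row only if the user is not a known friend

out = df[mask]
```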

only previous set of friends

mask = [u not in s for u,s in zip(df['user'], df['friend'].agg(set).shift(fill_value={}))]
# [True, False, True, True, True, True]

out = df[mask]

output:

        user                       friend
0       jack           [mary, jane, alex]
2      alice  [marina, catherine, howard]
3     andrew         [syp, yuslina, john]
4  catherine               [yute, kelvin]
5       john             [beyond, holand]
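The shift-based one-liner above compares each user with the friends of the row directly before it. If that is hard to read, a plain loop is one possible equivalent; this is a sketch with illustrative names (`previous`, `mask`), not a replacement for the vectorized form.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
    'friend': [['mary', 'jane', 'alex'],
               ['kate', 'andrew', 'jensen'],
               ['marina', 'catherine', 'howard'],
               ['syp', 'yuslina', 'john'],
               ['yute', 'kelvin'],
               ['beyond', 'holand']],
})

previous = set()  # friends of the row just above; empty for the first row
mask = []
for user, friends in zip(df['user'], df['friend']):
    mask.append(user not in previous)  # test against the previous row only
    previous = set(friends)            # this row becomes "previous" for the next one

out = df[mask]
```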

Input used:

import pandas as pd

d = {'user': ['jack', 'mary', 'alice', 'andrew', 'catherine', 'john'],
     'friend': [['mary', 'jane', 'alex'],
                ['kate', 'andrew', 'jensen'],
                ['marina', 'catherine', 'howard'],
                ['syp', 'yuslina', 'john'],
                ['yute', 'kelvin'],
                ['beyond', 'holand']]}
df = pd.DataFrame(d)

Upvotes: 2

I'mahdi

Reputation: 24049

You can flatten the friend column into a single list with no nesting by using itertools.chain.from_iterable, then filter the user column with Series.isin.

(andrew appears in [kate, andrew, jensen], so this solution drops his row as well.)

import itertools
df = df[~df['user'].isin(list(itertools.chain.from_iterable(df['friend'])))]

Output:

    user                       friend
0   jack           [mary, jane, alex]
2  alice  [marina, catherine, howard]

Upvotes: 2
