Reputation: 442
So, I have this DataFrame with almost 3,000 rows, which looks something like this:
CITIES
0 ['A','B']
1 ['A','B','C','D']
2 ['A','B','C']
4 ['X']
5 ['X','Y','Z']
... ...
2670 ['Y','Z']
I would like to remove from the DataFrame all rows whose 'CITIES' list is contained in another row's list (the order does not matter). In the example above, I would like to remove rows 0 and 2, since both are contained in row 1, and also rows 4 and 2670, since both are contained in row 5. I tried something, and it kind of worked, but it was really clumsy and took almost 10 minutes to compute. This was it:
indexesToRemove = []
for index, row in entrada.iterrows():
    citiesListFixed = row['CITIES']
    for index2, row2 in entrada.iloc[index+1:].iterrows():
        citiesListCurrent = row2['CITIES']
        if set(citiesListFixed) <= set(citiesListCurrent):
            indexesToRemove.append(index)
            break
Is there a more efficient way to do this?
Upvotes: 4
Views: 1422
Reputation: 59549
First create a DataFrame of dummies; then we can use matrix multiplication to see if one row is a complete subset of another, by checking whether the sum of its elementwise product with another row equals the number of elements in that row. (This is going to be memory intensive.)
import pandas as pd
import numpy as np

df = pd.DataFrame({'Cities': [['A','B'], ['A','B','C','D'], ['A','B','C'],
                              ['X'], ['X','Y','Z'], ['Y','Z']]})

# One-hot encode the city lists. `dtype=int` guarantees 0/1 integers for the
# matrix product (recent pandas returns boolean dummies); on older pandas,
# `.max(level=0)` is equivalent to the groupby.
arr = (pd.get_dummies(df['Cities'].explode())
         .groupby(level=0).max()
         .to_numpy(dtype=int))
#[[1 1 0 0 0 0 0]
# [1 1 1 1 0 0 0]
# [1 1 1 0 0 0 0]
# [0 0 0 0 1 0 0]
# [0 0 0 0 1 1 1]
# [0 0 0 0 0 1 1]]

# subsets[i, j] = size of the intersection of the sets in rows i and j
subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0)  # so the same row doesn't exclude itself

# Row j is contained in some other row i when subsets[i, j] equals the size
# of row j's set; keep only the rows that are contained in no other row.
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)
df[mask]
# Cities
#1 [A, B, C, D]
#4 [X, Y, Z]
As it stands, if you have two rows that tie for the longest subset (i.e. two rows with ['A','B','C','D']), both are dropped. If this is not desired, you can first drop_duplicates on 'Cities' (you will need to convert to a hashable type like frozenset) and then apply the above, as sketched below.
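A minimal sketch of that deduplication step, assuming the same df as above (the helper Series name keys is my own):
# Frozensets are hashable and order-insensitive, so duplicated() can
# detect repeated city lists; keep only the first occurrence of each.
keys = df['Cities'].apply(frozenset)
df = df[~keys.duplicated()].reset_index(drop=True)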
Upvotes: 6
Reputation: 440
A possible and didactic approach would be the following:
import pandas as pd
import time

start = time.process_time()

lst1 = [0, 1, 2, 3, 4, 2670]
lst2 = [['A','B'], ['A','B','C','D'], ['A','B','C'], ['X'], ['X','Y','Z'], ['Y','Z']]
df = pd.DataFrame(list(zip(lst1, lst2)), columns=['id', 'cities'])
df['contained'] = 0

# Flag row i when its set of cities is a subset of some other row j
# (i.e. the intersection of the two sets equals row i's set).
n = df.shape[0]
for i in range(n):
    for j in range(n):
        if i != j:
            if (set(df.loc[i, 'cities']) & set(df.loc[j, 'cities'])) == set(df.loc[i, 'cities']):
                df.loc[i, 'contained'] = 1

print(df)
print("\nTime elapsed:", time.process_time() - start, "seconds")
The time complexity of this solution is O(n²).
You end up with this data frame as a result:
id cities contained
0 0 [A, B] 1
1 1 [A, B, C, D] 0
2 2 [A, B, C] 1
3 3 [X] 1
4 4 [X, Y, Z] 0
5 2670 [Y, Z] 1
Then you just have to exclude the rows where contained == 1.
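For example, a one-liner to keep only the maximal rows (result is just an illustrative name):
result = df[df['contained'] == 0].drop(columns=['contained'])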
Upvotes: 2