Reputation: 442
So, I have this DataFrame with almost 3,000 rows, which looks something like this:
CITIES
0 ['A','B']
1 ['A','B','C','D']
2 ['A','B','C']
4 ['X']
5 ['X','Y','Z']
... ...
2670 ['Y','Z']
I would like to remove from the DataFrame all rows whose 'CITIES' list is contained in another row's list (the order does not matter). In the example above, I would like to remove rows 0 and 2, since both are contained in row 1, and also rows 4 and 2670, since both are contained in row 5. I tried something, and it kind of worked, but it was really clumsy and took almost 10 minutes to compute. This was it:
indexesToRemove = []
for index, row in entrada.iterrows():
    citiesListFixed = row['CITIES']
    for index2, row2 in entrada.iloc[index+1:].iterrows():
        citiesListCurrent = row2['CITIES']
        if set(citiesListFixed) <= set(citiesListCurrent):
            indexesToRemove.append(index)
            break
Is there a more efficient way to do this?
Upvotes: 4
Views: 1422
Reputation: 59549
First create a DataFrame of dummies; then we can use matrix multiplication to see if one row is a complete subset of another, by checking whether the sum of its elementwise product with another row equals the number of elements in that row. (This is going to be memory intensive.)
import pandas as pd
import numpy as np

df = pd.DataFrame({'Cities': [['A','B'], ['A','B','C','D'], ['A','B','C'],
                              ['X'], ['X','Y','Z'], ['Y','Z']]})

# One-hot encode the city lists. `dtype=int` guarantees 0/1 integers for the
# matrix product (recent pandas returns boolean dummies); on older pandas,
# `.max(level=0)` is equivalent to the groupby.
arr = (pd.get_dummies(df['Cities'].explode())
         .groupby(level=0).max()
         .to_numpy(dtype=int))
#[[1 1 0 0 0 0 0]
# [1 1 1 1 0 0 0]
# [1 1 1 0 0 0 0]
# [0 0 0 0 1 0 0]
# [0 0 0 0 1 1 1]
# [0 0 0 0 0 1 1]]

# subsets[i, j] = size of the intersection of the sets in rows i and j
subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0)  # so the same row doesn't exclude itself

# Row j is contained in some other row i when subsets[i, j] equals the size
# of row j's set; keep only the rows that are contained in no other row.
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)
df[mask]
# Cities
#1 [A, B, C, D]
#4 [X, Y, Z]
As it stands, if you have two rows that tie for the longest subset (i.e. two rows with ['A','B','C','D']), both are dropped. If this is not desired, you can first drop_duplicates on 'Cities' (you will need to convert to a hashable type like frozenset) and then apply the above, as sketched below.
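A minimal sketch of that deduplication step, assuming the same df as above (the helper Series name keys is my own):
# Frozensets are hashable and order-insensitive, so duplicated() can
# detect repeated city lists; keep only the first occurrence of each.
keys = df['Cities'].apply(frozenset)
df = df[~keys.duplicated()].reset_index(drop=True)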
Upvotes: 6
Reputation: 440
A possible and didactic approach would be the following:
import pandas as pd
import time

start = time.process_time()

lst1 = [0, 1, 2, 3, 4, 2670]
lst2 = [['A','B'], ['A','B','C','D'], ['A','B','C'], ['X'], ['X','Y','Z'], ['Y','Z']]
df = pd.DataFrame(list(zip(lst1, lst2)), columns=['id', 'cities'])
df['contained'] = 0

# Flag row i when its set of cities is a subset of some other row j
# (i.e. the intersection of the two sets equals row i's set).
n = df.shape[0]
for i in range(n):
    for j in range(n):
        if i != j:
            if (set(df.loc[i, 'cities']) & set(df.loc[j, 'cities'])) == set(df.loc[i, 'cities']):
                df.loc[i, 'contained'] = 1

print(df)
print("\nTime elapsed:", time.process_time() - start, "seconds")
The time complexity of this solution is O(n²).
You end up with this data frame as a result:
id cities contained
0 0 [A, B] 1
1 1 [A, B, C, D] 0
2 2 [A, B, C] 1
3 3 [X] 1
4 4 [X, Y, Z] 0
5 2670 [Y, Z] 1
Then you just have to exclude the rows where contained == 1.
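For example, a one-liner to keep only the maximal rows (result is just an illustrative name):
result = df[df['contained'] == 0].drop(columns=['contained'])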
Upvotes: 2