Reputation: 3411
I have a dataframe with columns A, B, C. I have a list of tuples like [(x1,y1), (x2,y2), ...]. I would like to delete all rows that meet the following condition:
(B=x1 && C=y1) | (B=x2 && C=y2) | ...
How can I do that in pandas? I wanted to use the isin function, but I'm not sure it is possible since my list contains tuples. I could do something like this:
for x, y in tuples:
    df = df.drop(df[(df.B == x) & (df.C == y)].index)
Maybe there is an easier way.
Upvotes: 8
Views: 7135
Reputation: 294516
Use pandas indexing: set B and C as the index, drop the tuples as index labels, then restore the columns.
df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
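A minimal demonstration on small illustrative data (the values here are made up for the sketch):

import pandas as pd

df = pd.DataFrame({'A': [6, 2, 8], 'B': [4, 3, 7], 'C': [4, 4, 8]})
tuples = [(3, 4), (7, 8)]

# B and C become a MultiIndex, so each tuple is an index label that
# drop() can remove; errors='ignore' skips tuples not present in df
result = df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
print(result)  # only the (B=4, C=4) row survives

Note that reset_index() moves B and C back to the front, so the column order becomes B, C, A.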
The three candidate implementations as functions for timing (pirsquared mirrors the one-liner above; divakar is the broadcasting approach from the answer below):

import numpy as np
import pandas as pd

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]

def divakar(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def pirsquared(df, tuples):
    return df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
10 rows, 1 tuple
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (1, 2))]
10,000 rows, 500 tuples
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]
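A minimal harness to time the three functions against either setup (a sketch using the standard timeit module; df, tuples, and the functions are as defined above):

import timeit

for func in (pirsquared, divakar, linear_indexing_based):
    t = timeit.timeit(lambda: func(df, tuples), number=100)
    print(f'{func.__name__:25s} {t / 100 * 1e3:.3f} ms per call')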
Upvotes: 7
Reputation: 221704
Approach #1
Here's a vectorized approach using NumPy's broadcasting -

import numpy as np

def broadcasting_based(df, tuples):
    idx = np.array(tuples)
    # Compare every (B, C) row against every tuple at once;
    # keep the rows that match none of them
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]
Sample run -
In [224]: df
Out[224]:
A B C
0 6 4 4
1 2 0 3
2 8 3 4
3 7 8 3
4 6 7 8
5 3 3 2
6 5 4 2
7 2 4 7
8 6 1 6
9 1 1 1
In [225]: tuples = [(3,4),(7,8),(1,6)]
In [226]: broadcasting_based(df,tuples)
Out[226]:
A B C
0 6 4 4
1 2 0 3
3 7 8 3
5 3 3 2
6 5 4 2
7 2 4 7
9 1 1 1
Approach #2 : To cover a generic number of columns
For a case like this, one could collapse the information from the different columns into a single value per row that represents the row's uniqueness across all columns. This can be achieved by treating each row as an indexing tuple and mapping it to a linear index, so each row becomes one scalar. Similarly, each tuple from the list to be matched reduces to one scalar, giving a 1D array. Finally, we use np.in1d to look for correspondences, get the valid mask, and remove the matching rows from the dataframe. Thus, the implementation would be -
def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B', 'C']].values
    # A shape just large enough that every (B, C) pair gets a unique linear index
    shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)
    BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
    idx_IDs = np.ravel_multi_index(idx.T, shp)
    return df[~np.in1d(BC_IDs, idx_IDs)]
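A small worked check of the collapsing step, reusing df and tuples from the sample run in Approach #1 (a sketch; broadcasting_based as defined there):

import numpy as np

# For the sample data the computed shape is (9, 9), so e.g. the pair
# (3, 4) collapses to the scalar 3*9 + 4 = 31
print(np.ravel_multi_index(np.array([[3, 4]]).T, (9, 9)))  # [31]

# Both approaches remove the same rows
print(linear_indexing_based(df, tuples).equals(broadcasting_based(df, tuples)))  # True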
Upvotes: 4
Reputation: 19124
It will probably be more efficient to use boolean indexing than a bunch of calls to DataFrame.drop. This is because pandas doesn't have to reallocate memory on each loop iteration; the mask is built up first and the matching rows are removed in a single step.
# Build one boolean mask flagging every row that matches any tuple
m = pd.Series(False, index=df.index)
for x, y in tuples:
    m |= (df.B == x) & (df.C == y)

df = df[~m]
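A self-contained run (sample values are illustrative, not from the question):

import pandas as pd

df = pd.DataFrame({'A': [6, 2, 8], 'B': [4, 3, 7], 'C': [4, 4, 8]})
tuples = [(3, 4), (7, 8)]

m = pd.Series(False, index=df.index)
for x, y in tuples:
    m |= (df.B == x) & (df.C == y)

print(df[~m])  # keeps only the row with B=4, C=4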
Upvotes: 0