user4979733

Reputation: 3411

Pandas: Delete rows based on multiple columns values

I have a dataframe with columns A, B, C and a list of tuples like [(x1,y1), (x2,y2), ...]. I would like to delete all rows that meet the following condition: (B=x1 && C=y1) | (B=x2 && C=y2) | ... How can I do that in pandas? I wanted to use the isin function, but I'm not sure whether that is possible since my list contains tuples. I could do something like this:

for x, y in tuples:
    # Element-wise AND on Series needs & with parentheses; && is not valid Python
    df = df.drop(df[(df.B == x) & (df.C == y)].index)

Maybe there is an easier way.

Upvotes: 8

Views: 7135

Answers (3)

piRSquared

Reputation: 294516

Use pandas indexing: set B and C as the index, drop the tuples as index labels, then reset the index.

df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
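As a minimal sketch of what the one-liner above does, here it is on a small made-up frame (the data and the extra (9, 9) tuple are assumptions for illustration, not from the question). It also shows two side effects worth knowing about: B and C end up as the first columns, and the original row labels are replaced by a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical sample data, not from the original post.
df = pd.DataFrame({"A": [6, 2, 8, 7, 6],
                   "B": [4, 0, 3, 8, 7],
                   "C": [4, 3, 4, 3, 8]})
tuples = [(3, 4), (7, 8), (9, 9)]  # (9, 9) matches no row

# Move B and C into a MultiIndex so each row is labelled by its (B, C) pair,
# drop the unwanted labels, then turn B and C back into ordinary columns.
# errors="ignore" skips pairs such as (9, 9) that are absent from the index.
result = df.set_index(["B", "C"]).drop(tuples, errors="ignore").reset_index()

# reset_index puts B and C first and renumbers the rows from 0;
# reselect the columns if the original order matters.
result = result[["A", "B", "C"]]
print(result)
```

If the original row labels need to survive, a boolean mask on the unmodified frame is probably the simpler route.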

Timing

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]

def divakar(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def pirsquared(df, tuples):
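    # errors='ignore' matches the one-liner above and avoids a KeyError for tuples not in df
    return df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()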

10 rows, 1 tuple

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (1, 2))]


10,000 rows, 500 tuples

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]


Upvotes: 7

Divakar

Reputation: 221704

Approach #1

Here's a vectorized approach using NumPy's broadcasting -

def broadcasting_based(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

Sample run -

In [224]: df
Out[224]: 
   A  B  C
0  6  4  4
1  2  0  3
2  8  3  4
3  7  8  3
4  6  7  8
5  3  3  2
6  5  4  2
7  2  4  7
8  6  1  6
9  1  1  1

In [225]: tuples = [(3,4),(7,8),(1,6)]

In [226]: broadcasting_based(df,tuples)
Out[226]: 
   A  B  C
0  6  4  4
1  2  0  3
3  7  8  3
5  3  3  2
6  5  4  2
7  2  4  7
9  1  1  1
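To make the broadcasting step more concrete, here is a small sketch of the intermediate arrays, using plain NumPy arrays in place of df.B.values and df.C.values (the values are made up for illustration). The point is that the comparison builds one row of booleans per tuple, so the intermediate mask has n_tuples x n_rows entries.

```python
import numpy as np

# Hypothetical small inputs chosen to keep the arrays readable.
B = np.array([4, 0, 3, 8, 7])
C = np.array([4, 3, 4, 3, 8])
idx = np.array([(3, 4), (7, 8)])      # shape (n_tuples, 2)

# idx[:, None, 0] has shape (n_tuples, 1); comparing it with B, shape
# (n_rows,), broadcasts to an (n_tuples, n_rows) boolean array.
mask = (B == idx[:, None, 0]) & (C == idx[:, None, 1])
print(mask.shape)          # (2, 5)

# A row is removed if it matches any tuple, so reduce along axis 0.
remove = mask.any(0)
print(remove.tolist())     # [False, False, True, False, True]
```

For 10,000 rows and 500 tuples that intermediate array holds 5 million booleans, which is fine, but it grows with the product of the two sizes.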

Approach #2 : To cover a generic number of columns

For a case like this, one could collapse the values from the different columns of a row into a single number that uniquely identifies that combination. This is done by treating each row as an indexing tuple into a grid and converting it to a linear index, so each row becomes one scalar. The list of tuples to be matched is reduced in the same way to a 1D array, one scalar per tuple. Finally, np.in1d finds the matches, which gives the mask of rows to keep. Note that np.ravel_multi_index requires non-negative integer values. The implementation would be -

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]
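A short worked example may help show what the linear indices look like. This is only a sketch on made-up values; it uses np.isin, which gives the same result as np.in1d for 1D input and is the spelling newer NumPy releases prefer.

```python
import numpy as np

# Hypothetical (B, C) pairs; values must be non-negative integers.
BC_arr = np.array([[4, 4], [0, 3], [3, 4], [8, 3], [7, 8]])
idx = np.array([(3, 4), (7, 8)])

# Grid large enough to hold every value seen in either array: here (9, 9).
shp = np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1)

# Each pair (b, c) maps to the single integer b * shp[1] + c.
BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
idx_IDs = np.ravel_multi_index(idx.T, shp)
print(BC_IDs.tolist())     # [40, 3, 31, 75, 71]
print(idx_IDs.tolist())    # [31, 71]

# Rows whose linear ID appears among the tuple IDs are the ones to remove.
keep = ~np.isin(BC_IDs, idx_IDs)
print(keep.tolist())       # [True, True, False, True, False]
```

Because the mapping is one-to-one inside the grid, two pairs get the same ID only if they are the same pair, so comparing IDs is equivalent to comparing the tuples.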

Upvotes: 4

Alex

Reputation: 19124

It will probably be more efficient to use boolean indexing than a bunch of calls to DataFrame.drop. Each call to drop builds a new DataFrame, whereas here the loop only updates one boolean mask and the frame is filtered once at the end.

m = pd.Series(False, index=df.index)
for x,y in tuples:
    m |= (df.B == x) & (df.C == y)
df = df[~m]
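Since the question asked about isin, here is a loop-free variant of the same boolean-mask idea. It is only a sketch on made-up data, and it swaps the explicit loop for a temporary MultiIndex built from B and C, whose isin method accepts a list of tuples.

```python
import pandas as pd

# Hypothetical sample data, not from the original post.
df = pd.DataFrame({"A": [6, 2, 8, 7, 6],
                   "B": [4, 0, 3, 8, 7],
                   "C": [4, 3, 4, 3, 8]})
tuples = [(3, 4), (7, 8)]

# Build a temporary MultiIndex from the (B, C) pairs and test membership
# against the list of tuples; the result is a boolean NumPy array.
m = pd.MultiIndex.from_frame(df[["B", "C"]]).isin(tuples)

# Keep the rows that match none of the tuples; the index and column order
# of the original frame are preserved.
filtered = df[~m]
print(filtered)
```

I have not timed this against the other approaches in this thread, so treat it as a readability option rather than a performance claim.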

Upvotes: 0
