Mohammad.sh
Mohammad.sh

Reputation: 255

How to remove rows of a data frame when specific amount are not in specific columns?

I have two data frames with four and two columns. For example:

   A  B  C   D
0  4  2 320 700
1  5  7 400 800
2  2  6 300 550
3  4  6 100 300
4  5  2 250 360

and

   A  B  
0  2  4 
1  5  7
2  2  5 

I need to compare the first data frame with the second data frame and if column A and column B in the second data frame was in column A and column B in the first data frame. (order doesn't matter. it means in the first data frame in the first row A is 4, B is 2 and in the second data frame is A is 2 and B is 4 and it's not important but both numbers should be in the columns) keep the whole row in the first data frame; otherwise remove the row. so the output will be :

   A  B  C   D
0  4  2 320 700
1  5  7 400 800
2  5  2 250 360

How can I get this output (my actual data frames are so huge and can't iterate through them so need a fast efficient way)?

Upvotes: 4

Views: 398

Answers (3)

rafaelc
rafaelc

Reputation: 59304

Using np.equal.outer

arr = np.equal.outer(df, df2)
df.loc[arr.any(1).all(-1).any(-1)]

Outputs

A   B   C   D
0   4   2   320 700
1   5   7   400 800
4   5   2   250 360

Upvotes: 0

BENY
BENY

Reputation: 323396

What I will do using frozenset + isin

yourdf=df[df[['A','B']].apply(frozenset,1).isin(df1.apply(frozenset,1))].copy()
   A  B    C    D
0  4  2  320  700
1  5  7  400  800
4  5  2  250  360

Upvotes: 1

cs95
cs95

Reputation: 403278

I would do this by first sorting, then performing an LEFT OUTER JOIN using merge with an indicator to determine which rows to keep. Example,

u = df.loc[:, ['A', 'B']]
u.values.sort()     #  sort columns of `u` 
df2.values.sort()   #  sort columns of `df2`

df[u.merge(df2, how='left', indicator='ind').eval('ind == "both"').values]

   A  B    C    D
0  4  2  320  700
1  5  7  400  800
4  5  2  250  360

More info on joins with indicator can be found in my post: Pandas Merging 101


If you don't care about the final result being sorted or not, you can simplify this to an inner join.

df[['A', 'B']] = np.sort(df[['A', 'B']])
df2[:] = np.sort(df2)

df.merge(df2, on=['A', 'B'])

   A  B    C    D
0  2  4  320  700
1  5  7  400  800
2  2  5  250  360

Upvotes: 2

Related Questions