Reputation: 2333
I have two dataframes:
all_data:
AID VID Freq
0 00016A3E 0127C661 1
1 00016A3E 0C05DA5D 2
2 00016A3E 0C032814 1
3 00016A3E 0BF6C78D 1
4 00016A3E 0A79DFF1 1
5 00016A3E 07BD2FB2 1
6 00016A3E 0790E61B 1
7 00016A3E 0C24ED25 3
8 00016A3E 073630B5 3
9 00016A3E 06613535 1
10 00016A3E 05F809AF 1
11 00016A3E 05C625FF 1
12 00016A3E 04220EA8 4
13 00016A3E 013A29E5 1
14 00016A3E 0761C98A 1
15 00016AE9 0A769475 16
16 00016AE9 0A7DED0A 2
17 00016AE9 0ABF60DF 9
18 00016AE9 0AE3F25A 2
19 00016AE9 0AEFE12F 5
20 00016AE9 0BD8975A 2
21 00016AE9 44DF880B 1
22 00016AE9 43F9E08E 2
23 00016AE9 44EA5E08 2
24 00016AE9 4539ED1E 16
25 00016AE9 8516B55A 4
26 00016AE9 0972AFF2 1
27 00016AE9 0C559B34 1
28 00016AE9 06B5C040 7
29 00016AE9 0B0426FA 1
subset:
AID VID Freq
0 00016A3E 0C24ED25 3
1 00016A3E 0C05DA5D 2
2 00016AE9 0B0426FA 1
3 00016AE9 0AEFE12F 5
I need to create a third dataframe that has all the rows that are in all_data
that DO NOT exist in subset
. Note that all rows in subset
exist in all_data
.
So the new df in this instance should be:
AID VID Freq
0 00016A3E 0127C661 1
2 00016A3E 0C032814 1
3 00016A3E 0BF6C78D 1
4 00016A3E 0A79DFF1 1
5 00016A3E 07BD2FB2 1
6 00016A3E 0790E61B 1
8 00016A3E 073630B5 3
9 00016A3E 06613535 1
10 00016A3E 05F809AF 1
11 00016A3E 05C625FF 1
12 00016A3E 04220EA8 4
13 00016A3E 013A29E5 1
14 00016A3E 0761C98A 1
15 00016AE9 0A769475 16
16 00016AE9 0A7DED0A 2
17 00016AE9 0ABF60DF 9
18 00016AE9 0AE3F25A 2
20 00016AE9 0BD8975A 2
21 00016AE9 44DF880B 1
22 00016AE9 43F9E08E 2
23 00016AE9 44EA5E08 2
24 00016AE9 4539ED1E 16
25 00016AE9 8516B55A 4
26 00016AE9 0972AFF2 1
27 00016AE9 0C559B34 1
28 00016AE9 06B5C040 7
I tried the methods described here: pandas get rows which are NOT in other dataframe but they don't work as the indices in each dataframe don't match.
Upvotes: 3
Views: 6265
Reputation: 32085
If you don't have duplicates in all_data
, here is a way without merge but with drop_duplicates
. keep=False
argument drops all duplicates rows:
pd.concat([all_data, subset]).drop_duplicates(keep=False)
Out[23]:
AID VID Freq
0 00016A3E 0127C661 1
2 00016A3E 0C032814 1
3 00016A3E 0BF6C78D 1
4 00016A3E 0A79DFF1 1
5 00016A3E 07BD2FB2 1
6 00016A3E 0790E61B 1
8 00016A3E 073630B5 3
9 00016A3E 06613535 1
10 00016A3E 05F809AF 1
11 00016A3E 05C625FF 1
12 00016A3E 04220EA8 4
13 00016A3E 013A29E5 1
14 00016A3E 0761C98A 1
15 00016AE9 0A769475 16
16 00016AE9 0A7DED0A 2
17 00016AE9 0ABF60DF 9
18 00016AE9 0AE3F25A 2
20 00016AE9 0BD8975A 2
21 00016AE9 44DF880B 1
22 00016AE9 43F9E08E 2
23 00016AE9 44EA5E08 2
24 00016AE9 4539ED1E 16
25 00016AE9 8516B55A 4
26 00016AE9 0972AFF2 1
27 00016AE9 0C559B34 1
28 00016AE9 06B5C040 7
%timeit pd.concat([all_data, subset]).drop_duplicates(keep=False)
1000 loops, best of 3: 1.53 ms per loop
%%timeit
pd.merge(all_data, subset, how='left', indicator=True) \
.query("_merge == 'left_only'") \
.drop('_merge',1)
100 loops, best of 3: 8.58 ms per loop
Upvotes: 1
Reputation: 210812
you can use merge(..., how='left', indicator=True) together with query() method:
In [38]: pd.merge(all_data, subset, how='left', indicator=True) \
.query("_merge == 'left_only'") \
.drop('_merge',1)
Out[38]:
AID VID Freq
0 00016A3E 0127C661 1
2 00016A3E 0C032814 1
3 00016A3E 0BF6C78D 1
4 00016A3E 0A79DFF1 1
5 00016A3E 07BD2FB2 1
6 00016A3E 0790E61B 1
8 00016A3E 073630B5 3
9 00016A3E 06613535 1
10 00016A3E 05F809AF 1
11 00016A3E 05C625FF 1
12 00016A3E 04220EA8 4
13 00016A3E 013A29E5 1
14 00016A3E 0761C98A 1
15 00016AE9 0A769475 16
16 00016AE9 0A7DED0A 2
17 00016AE9 0ABF60DF 9
18 00016AE9 0AE3F25A 2
20 00016AE9 0BD8975A 2
21 00016AE9 44DF880B 1
22 00016AE9 43F9E08E 2
23 00016AE9 44EA5E08 2
24 00016AE9 4539ED1E 16
25 00016AE9 8516B55A 4
26 00016AE9 0972AFF2 1
27 00016AE9 0C559B34 1
28 00016AE9 06B5C040 7
Step by step:
In [39]: pd.merge(all_data, subset, how='left', indicator=True)
Out[39]:
AID VID Freq _merge
0 00016A3E 0127C661 1 left_only
1 00016A3E 0C05DA5D 2 both
2 00016A3E 0C032814 1 left_only
3 00016A3E 0BF6C78D 1 left_only
4 00016A3E 0A79DFF1 1 left_only
5 00016A3E 07BD2FB2 1 left_only
6 00016A3E 0790E61B 1 left_only
7 00016A3E 0C24ED25 3 both
8 00016A3E 073630B5 3 left_only
9 00016A3E 06613535 1 left_only
10 00016A3E 05F809AF 1 left_only
11 00016A3E 05C625FF 1 left_only
12 00016A3E 04220EA8 4 left_only
13 00016A3E 013A29E5 1 left_only
14 00016A3E 0761C98A 1 left_only
15 00016AE9 0A769475 16 left_only
16 00016AE9 0A7DED0A 2 left_only
17 00016AE9 0ABF60DF 9 left_only
18 00016AE9 0AE3F25A 2 left_only
19 00016AE9 0AEFE12F 5 both
20 00016AE9 0BD8975A 2 left_only
21 00016AE9 44DF880B 1 left_only
22 00016AE9 43F9E08E 2 left_only
23 00016AE9 44EA5E08 2 left_only
24 00016AE9 4539ED1E 16 left_only
25 00016AE9 8516B55A 4 left_only
26 00016AE9 0972AFF2 1 left_only
27 00016AE9 0C559B34 1 left_only
28 00016AE9 06B5C040 7 left_only
29 00016AE9 0B0426FA 1 both
In [40]: pd.merge(all_data, subset, how='left', indicator=True).query("_merge == 'left_only'")
Out[40]:
AID VID Freq _merge
0 00016A3E 0127C661 1 left_only
2 00016A3E 0C032814 1 left_only
3 00016A3E 0BF6C78D 1 left_only
4 00016A3E 0A79DFF1 1 left_only
5 00016A3E 07BD2FB2 1 left_only
6 00016A3E 0790E61B 1 left_only
8 00016A3E 073630B5 3 left_only
9 00016A3E 06613535 1 left_only
10 00016A3E 05F809AF 1 left_only
11 00016A3E 05C625FF 1 left_only
12 00016A3E 04220EA8 4 left_only
13 00016A3E 013A29E5 1 left_only
14 00016A3E 0761C98A 1 left_only
15 00016AE9 0A769475 16 left_only
16 00016AE9 0A7DED0A 2 left_only
17 00016AE9 0ABF60DF 9 left_only
18 00016AE9 0AE3F25A 2 left_only
20 00016AE9 0BD8975A 2 left_only
21 00016AE9 44DF880B 1 left_only
22 00016AE9 43F9E08E 2 left_only
23 00016AE9 44EA5E08 2 left_only
24 00016AE9 4539ED1E 16 left_only
25 00016AE9 8516B55A 4 left_only
26 00016AE9 0972AFF2 1 left_only
27 00016AE9 0C559B34 1 left_only
28 00016AE9 06B5C040 7 left_only
Upvotes: 5
Reputation: 10621
This worked for me:
merged = all_data.merge(subset, how='left', indicator=True)
print(merged[merged['_merge']=='left_only'])
After the merge, you can filter by the _merge
with left_only
value.
Upvotes: 1