Reputation: 181
Consider the following two data frames:
dic_1 = {'intA': {0: 'A', 1: 'B', 2: 'A', 3: 'D', 4: 'A', 5: 'F', 6: 'C', 7: 'H', 8: 'D', 9: 'X', 10: 'A', 11: 'C', 12: 'X'}, 'intB': {0: 'FG', 1: 'GH', 2: 'SD', 3: 'KJ', 4: 'FG', 5: 'WE', 6: 'NB', 7: 'GH', 8: 'AA', 9: 'ZX', 10: 'QQ', 11: 'OI', 12: 'XX'}}
df_1 = pd.DataFrame(dic_1)
df_1
intA intB
0 A FG
1 B GH
2 A SD
3 D KJ
4 A FG
5 F WE
6 C NB
7 H GH
8 D AA
9 X ZX
10 A QQ
11 C OI
12 X XX
dic_2 = {'ref': {0: 'AA',
1: 'GH',
2: 'CD',
3: 'FG',
4: 'XX',
5: 'TY',
6: 'ZX',
7: 'SD',
8: 'KJ',
9: 'IU'}}
df_2 = pd.DataFrame(dic_2)
df_2
ref
0 AA
1 GH
2 CD
3 FG
4 XX
5 TY
6 ZX
7 SD
8 KJ
9 IU
What I want is a resulting data frame that looks like this:
symbol count
0 A 3
1 B 1
2 D 2
3 F 0
4 C 0
5 H 1
6 X 2
Logic: For each item in 'intA' column in df_1, if its corresponding item in 'intB' column in df_1 exists in 'ref' column in df_2, increase the count by one. For example, there are four occurrences of 'A' in 'intA' but only three of its corresponding items in 'intB' column exists in 'ref' in df_2, so the count is 3.
Upvotes: 2
Views: 940
Reputation: 14949
Another option via groupby
:
result = df_1.assign(count=df_1.intB.isin(df_2.ref)).groupby(
'intA', sort=False, as_index=False).agg({'count': sum})
OUTPUT:
intA count
0 A 3
1 B 1
2 D 2
3 F 0
4 C 0
5 H 1
6 X 2
Upvotes: 0
Reputation: 71689
Create a boolean mask with isin
to test whether the elements in intB
column matches with any element from the ref
column, then sum
the True
values in this mask per unique value in intA
df_1.set_index('intA')['intB'].isin(df_2['ref'])\
.sum(level=0).rename_axis('symbol').reset_index(name='count')
symbol count
0 A 3
1 B 1
2 D 2
3 F 0
4 C 0
5 H 1
6 X 2
Upvotes: 5
Reputation: 862661
Use Series.isin
for filter in df_1
, then count values by Series.value_counts
with add missing values by Series.reindex
, last convert Series
to 2 columns Dataframe
:
df = (df_1.loc[df_1['intB'].isin(df_2['ref']), 'intA']
.value_counts()
.reindex(df_1['intA'].unique(), fill_value=0)
.rename_axis('symbol')
.reset_index(name='count'))
print (df)
symbol count
0 A 3
1 B 1
2 D 2
3 F 0
4 C 0
5 H 1
6 X 2
Upvotes: 1