Saad Zaheer
Saad Zaheer

Reputation: 181

how to compare two columns in two different data frames and count the occurrences

Consider the following two data frames:

dic_1 = {'intA': {0: 'A', 1: 'B', 2: 'A', 3: 'D', 4: 'A', 5: 'F', 6: 'C', 7: 'H', 8: 'D', 9: 'X', 10: 'A', 11: 'C', 12: 'X'}, 'intB': {0: 'FG', 1: 'GH', 2: 'SD', 3: 'KJ', 4: 'FG', 5: 'WE', 6: 'NB', 7: 'GH', 8: 'AA', 9: 'ZX', 10: 'QQ', 11: 'OI', 12: 'XX'}}
df_1 = pd.DataFrame(dic_1)
df_1


   intA intB
0   A   FG
1   B   GH
2   A   SD
3   D   KJ
4   A   FG
5   F   WE
6   C   NB
7   H   GH
8   D   AA
9   X   ZX
10  A   QQ
11  C   OI
12  X   XX

dic_2 = {'ref': {0: 'AA',
  1: 'GH',
  2: 'CD',
  3: 'FG',
  4: 'XX',
  5: 'TY',
  6: 'ZX',
  7: 'SD',
  8: 'KJ',
  9: 'IU'}}
df_2 = pd.DataFrame(dic_2)
df_2
    ref
0   AA
1   GH
2   CD
3   FG
4   XX
5   TY
6   ZX
7   SD
8   KJ
9   IU

What I want is a resulting data frame that looks like this:

  symbol    count
0   A   3
1   B   1
2   D   2
3   F   0
4   C   0
5   H   1
6   X   2

Logic: For each item in 'intA' column in df_1, if its corresponding item in 'intB' column in df_1 exists in 'ref' column in df_2, increase the count by one. For example, there are four occurrences of 'A' in 'intA' but only three of its corresponding items in 'intB' column exists in 'ref' in df_2, so the count is 3.

Upvotes: 2

Views: 940

Answers (3)

Nk03
Nk03

Reputation: 14949

Another option via groupby:

result = df_1.assign(count=df_1.intB.isin(df_2.ref)).groupby(
    'intA', sort=False, as_index=False).agg({'count': sum})

OUTPUT:

  intA  count
0    A      3
1    B      1
2    D      2
3    F      0
4    C      0
5    H      1
6    X      2

Upvotes: 0

Shubham Sharma
Shubham Sharma

Reputation: 71689

Create a boolean mask with isin to test whether the elements in intB column matches with any element from the ref column, then sum the True values in this mask per unique value in intA

df_1.set_index('intA')['intB'].isin(df_2['ref'])\
    .sum(level=0).rename_axis('symbol').reset_index(name='count')

  symbol  count
0      A      3
1      B      1
2      D      2
3      F      0
4      C      0
5      H      1
6      X      2

Upvotes: 5

jezrael
jezrael

Reputation: 862661

Use Series.isin for filter in df_1, then count values by Series.value_counts with add missing values by Series.reindex, last convert Series to 2 columns Dataframe:

df = (df_1.loc[df_1['intB'].isin(df_2['ref']), 'intA']
          .value_counts()
          .reindex(df_1['intA'].unique(), fill_value=0)
          .rename_axis('symbol')
          .reset_index(name='count'))
print (df)
  symbol  count
0      A      3
1      B      1
2      D      2
3      F      0
4      C      0
5      H      1
6      X      2

Upvotes: 1

Related Questions