Creating a new column in Pandas based on another dataframe

Question

I need to add a column to an existing pandas dataframe based on an attribute from a second dataframe. I've made a minimal example to illustrate my exact requirements.

I've got two dataframes, one representing pairs of names, and the other representing an interaction between two individuals:

    >>> names
    id_a   id_b
0    ben   jack
1   jack    ben
2   jill   amir
3  wilma   jill
4   amir  wilma

>>> interactions
  individual1 individual2
0        jill        jack
1        jack        jill
2       wilma        jill
3        amir        jill
4        amir        jack
5        jack        amir
6        jill        amir

What I need is essentially this: for each pair of names in names, I need a count of the number of interactions between those two names, so the number of rows in interactions in which names['id_a'] is either interactions['individual1'] or interactions['individual2'] AND names['id_b'] is either interactions['individual1'] or interactions['individual2']. This count needs to be included in a column num_interactions for all rows in names, even if the names are duplicate (i.e. if there is a row in which id_a is ben and id_b is jack AND a row in which those names are reversed (id_a is jack and id_b is ben), the num_interactions should be included for both of those rows)

The resulting dataframe would look like this:

>>> names
    id_a   id_b  num_interactions
0    ben   jack               0.0
1   jack    ben               0.0
2   jill   amir               2.0
3  wilma   jill               1.0
4   amir  wilma               0.0
    enter code here

What I've Done

This works just fine, but it's ugly, hard to read, inefficient, and I know there must be a better way! Maybe with some sort of merge, but I don't really know how that works with complicated criteria...

for i in range(len(names)):
    names.loc[i, 'num_interactions'] = len(
        interactions[((interactions['individual1'] == names.loc[i, 'id_a']) &
                      (interactions['individual2'] == names.loc[i, 'id_b'])) |
                     ((interactions['individual2'] == names.loc[i, 'id_a']) &
                      (interactions['individual1'] == names.loc[i, 'id_b']))
                     ])

To Reproduce my example dataframes

In case you want to play around with this, you can use this to reproduce my dummy dataframes above.

import pandas as pd
names = pd.DataFrame(data={'id_a': ['ben', 'jack', 'jill', 'wilma', 'amir'],
                           'id_b': ['jack', 'ben', 'amir', 'jill', 'wilma']})

interactions = pd.DataFrame(data={'individual1': ['jill', 'jack',
                                                  'wilma', 'amir',
                                                  'amir', 'jack', 'jill'],
                                  'individual2': ['jack', 'jill', 'jill',
                                                  'jill', 'jack', 'amir',
                                                  'amir']})

Thanks in advance!

cs95 · Accepted Answer

Assuming order doesn't matter, you can sort each dataframe by their columns. For the second dataframe, count each group of interactions with groupby + count and then perform a left outer merge on the result and the first dataframe.

i = pd.DataFrame(np.sort(names, axis=1))
j = pd.DataFrame(np.sort(interactions, axis=1))

k = j.groupby(j.columns.tolist())[0].count().reset_index(name='count')

df = i.merge(k, on=[0, 1], how='left')\
      .fillna(0)\
      .rename(columns={0 : 'id_a', 1 : 'id_b'})
df.iloc[:, :2] = names.values

df

   id_a   id_b  count
0   ben   jack    0.0
1   ben   jack    0.0
2  amir   jill    2.0
3  jill  wilma    1.0
4  amir  wilma    0.0

Creating a new column in Pandas based on another dataframe

What I've Done

To Reproduce my example dataframes

Answers (2)

Related Questions