Reputation: 51425
I need to add a column to an existing pandas dataframe based on an attribute from a second dataframe. I've made a minimal example to illustrate my exact requirements.
I've got two dataframes, one representing pairs of names, and the other representing an interaction between two individuals:
>>> names
id_a id_b
0 ben jack
1 jack ben
2 jill amir
3 wilma jill
4 amir wilma
>>> interactions
individual1 individual2
0 jill jack
1 jack jill
2 wilma jill
3 amir jill
4 amir jack
5 jack amir
6 jill amir
What I need is essentially this: for each pair of names in names
, I need a count of the number of interactions between those two names, so the number of rows in interactions
in which names['id_a']
is either interactions['individual1']
or interactions['individual2']
AND names['id_b']
is either interactions['individual1']
or interactions['individual2']
. This count needs to be included in a column num_interactions
for all rows in names, even if the names are duplicate (i.e. if there is a row in which id_a is ben and id_b is jack AND a row in which those names are reversed (id_a is jack and id_b is ben), the num_interactions should be included for both of those rows)
The resulting dataframe would look like this:
>>> names
id_a id_b num_interactions
0 ben jack 0.0
1 jack ben 0.0
2 jill amir 2.0
3 wilma jill 1.0
4 amir wilma 0.0
enter code here
This works just fine, but it's ugly, hard to read, inefficient, and I know there must be a better way! Maybe with some sort of merge, but I don't really know how that works with complicated criteria...
for i in range(len(names)):
names.loc[i, 'num_interactions'] = len(
interactions[((interactions['individual1'] == names.loc[i, 'id_a']) &
(interactions['individual2'] == names.loc[i, 'id_b'])) |
((interactions['individual2'] == names.loc[i, 'id_a']) &
(interactions['individual1'] == names.loc[i, 'id_b']))
])
In case you want to play around with this, you can use this to reproduce my dummy dataframes above.
import pandas as pd
names = pd.DataFrame(data={'id_a': ['ben', 'jack', 'jill', 'wilma', 'amir'],
'id_b': ['jack', 'ben', 'amir', 'jill', 'wilma']})
interactions = pd.DataFrame(data={'individual1': ['jill', 'jack',
'wilma', 'amir',
'amir', 'jack', 'jill'],
'individual2': ['jack', 'jill', 'jill',
'jill', 'jack', 'amir',
'amir']})
Thanks in advance!
Upvotes: 2
Views: 1706
Reputation: 495
Sorry it is a bit ugly how i add new columns etc but you can get the idea and improve it... First I assume that all pairs in names are unique. So I give each pair an ID
names_ids = pd.DataFrame(pd.concat([names.iloc[:, 0] + '-' + names.iloc[:, 1],
names.iloc[:, 1] + '-' + names.iloc[:, 0]],
axis=0),
columns=['pair'])
names_ids['id'] = names_ids.index
names_ids.index = names_ids.pair
Then I join these ids with the interactions where I again flip each pair in interactions
interactions_new = pd.DataFrame(pd.concat([interactions.iloc[:, 0] + '-' + interactions.iloc[:, 1],
interactions.iloc[:, 1] + '-' + interactions.iloc[:, 0]],
axis=0),
columns=['pair'])
interactions_new['count'] = np.ones(len(interactions_new))
count_id = interactions_new.join(names_ids['id'], on='pair', how='left').groupby('id').count().loc[:, ['count']]
count_id['id'] = count_id.index
So in the end I just count each id in interactions:
names_ids.index = names_ids.id
result = count_id.join(names_ids.pair.iloc[:len(names_ids)/2], on='id', how='left')
result['count'] /= 2
print result
Ugly but no for loops and I get:
count id pair
id
2.0 2 2.0 jill-amir
3.0 1 3.0 wilma-jill
Upvotes: 1
Reputation: 403050
Assuming order doesn't matter, you can sort each dataframe by their columns. For the second dataframe, count each group of interactions with groupby
+ count
and then perform a left outer merge
on the result and the first dataframe.
i = pd.DataFrame(np.sort(names, axis=1))
j = pd.DataFrame(np.sort(interactions, axis=1))
k = j.groupby(j.columns.tolist())[0].count().reset_index(name='count')
df = i.merge(k, on=[0, 1], how='left')\
.fillna(0)\
.rename(columns={0 : 'id_a', 1 : 'id_b'})
df.iloc[:, :2] = names.values
df
id_a id_b count
0 ben jack 0.0
1 ben jack 0.0
2 amir jill 2.0
3 jill wilma 1.0
4 amir wilma 0.0
Upvotes: 1