Geinkehdsk

Reputation: 99

How to map multiple datasets based on the same row values of one column?

I have three datasets,

df1
   name              A         B         C         D
0  Jeffrey Ray       0.171130  0.307791 -0.031378  0.366607
1  Crystal Roberts   0.429614  1.674201  0.176098 -1.289623
2  Paul Choute       0.396332  0.879003  1.889926  0.745100
3  Amy Clements      1.452253  0.724694 -0.830645 -0.342138
4  Stanley Coston   -0.135439 -0.293293  0.684627  0.538049

df2
   name              A         B         C         D
0  Jeffrey Ray       0.173430  0.399791 -0.012278  0.888970
1  Sara Flicker      0.635744  1.699001  0.176098 -1.111130
2  Jack Sether       0.323333  0.879003  1.889926  0.983640
3  Amy Clements      1.452253  0.784734 -0.673695 -0.342138
4  Stanley Coston   -0.143222 -0.293293  0.683647  0.638479

df3
   name              A         B         C         D
0  Jeffrey Ray       0.638273  0.687777  0.766666  0.900032
1  Sara Flicker      0.635744  1.699001  0.176098 -1.111130
2  Samantha Runyon   0.323333 -0.879003  1.889926 -0.683640
3  Amy Clements     -1.452253  0.784734 -0.673695 -0.342138
4  Stanley Coston   -0.143222 -0.282222  0.683647 -0.638479

I want to match the name values across the three datasets and, for each name that appears in all of them, get the sum of D:

Jeffrey Ray      0.366607 + 0.888970 + 0.900032
Amy Clements    -0.342138 + (-0.342138) + (-0.342138)
Stanley Coston   0.538049 + 0.638479 + (-0.638479)

I tried pd.merge:

pd.merge(df1, df2, on='name', how='inner').sum().reset_index()

But the output is not what I expected. Any suggestion on how I should change it? Another issue is that pd.merge can only combine two datasets at a time; is there an efficient way to match multiple datasets at once without a for loop?

Thanks

Upvotes: 1

Views: 442

Answers (4)

G.G

Reputation: 765

# Align all three DataFrames on 'name'
df1 = df1.set_index('name')
df2 = df2.set_index('name')
df3 = df3.set_index('name')

# Keep only the names present in all three DataFrames
idx1 = df1.index.intersection(df2.index).intersection(df3.index)

# Element-wise sum of the matching rows
(df1.loc[idx1] + df2.loc[idx1] + df3.loc[idx1]).reset_index()

             name         A         B         C         D
0     Jeffrey Ray  0.982833  1.395359  0.723010  2.155609
1    Amy Clements  1.452253  2.294162 -2.178035 -1.026414
2  Stanley Coston -0.421883 -0.868808  2.051921  0.538049

Upvotes: 1

Corralien

Reputation: 120509

Use DataFrame.add and reduce from functools to add the frames pairwise on their name index:

from functools import reduce

out = reduce(lambda dfL, dfR: dfL.set_index('name').add(dfR.set_index('name')).reset_index(), [df1, df2, df3]) \
            .dropna().set_index('name')['D']
print(out)

# Output:
name
Amy Clements     -1.026414
Jeffrey Ray       2.155609
Stanley Coston    0.538049
dtype: float64
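
If you would rather stick with pd.merge as in the question, a rough merge-based variant is sketched below; the _1/_2/_3 suffixes are just illustrative names, and 'name' is assumed to be a regular column:

from functools import reduce
import pandas as pd

# Suffix the value columns so the three D columns survive the merges
frames = [
    df1.add_suffix('_1').rename(columns={'name_1': 'name'}),
    df2.add_suffix('_2').rename(columns={'name_2': 'name'}),
    df3.add_suffix('_3').rename(columns={'name_3': 'name'}),
]

# Inner merge keeps only the names present in every DataFrame
merged = reduce(lambda left, right: pd.merge(left, right, on='name', how='inner'), frames)

# Sum the three suffixed D columns per name
out = merged.set_index('name').filter(like='D').sum(axis=1)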

Upvotes: 2

user7864386

Reputation:

How about:

# Stack the three DataFrames, then sum D per name,
# keeping only the names that appear in all three frames
dfs = pd.concat([df1, df2, df3])
out = dfs.groupby('name')['D'].sum()[dfs['name'].value_counts() == 3]

Output:

name
Amy Clements     -1.026414
Jeffrey Ray       2.155609
Stanley Coston    0.538049
Name: D, dtype: float64
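
For what it's worth, the same row filter can also be written with groupby.filter; a sketch under the same assumptions:

# Keep only the names that occur in all three stacked DataFrames, then sum D
out = (dfs.groupby('name')
          .filter(lambda g: len(g) == 3)
          .groupby('name')['D']
          .sum())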

Upvotes: 3

mozway

Reputation: 262164

You can concat only the D column, using the names as index. Use dropna to drop the names that are missing from any of the datasets, then sum:

dfs = [df1, df2, df3]
(pd.concat([d.set_index('name')['D'] for d in dfs], axis=1)  # one D column per DataFrame
   .dropna()      # keep only names present in every DataFrame
   .sum(axis=1)   # row-wise sum of the D values
)

Output:

name
Jeffrey Ray       2.155609
Amy Clements     -1.026414
Stanley Coston    0.538049
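
If a labelled column is preferred over an anonymous Series, the result can be renamed and the index reset (D_sum is just an illustrative label):

result = (pd.concat([d.set_index('name')['D'] for d in dfs], axis=1)
            .dropna()
            .sum(axis=1)
            .rename('D_sum')   # illustrative column name
            .reset_index())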

Upvotes: 3
