Reputation: 3529
I'm trying to figure out if Panda's, when adding two series together, automatically matches on index or if it simply adds by element position. If its just by position, is there a way to get it to add on index? I've looked at merging, but I'm not very clear if the key in this case can be the index of both...
For example, if I have do DFs:
df1 = index value
0 10
1 12
2 15
4 20
df2 = index value
0 10
1 10
3 10
4 10
and I want to add df1[total] = df1[value] + df2[value] =
df1 = index value
0 20
1 22
2 15
3 10
4 30
Thanks for your help in advance!
Upvotes: 9
Views: 19229
Reputation: 10223
pd.concat([df1,df2], axis=1).sum(axis=1)
pd.concat
will merge the 2(or more) frames and match based on index. sum(axis=1)
just sums across the rows.
#create the example data
df1 = pd.DataFrame({'index':[0,1,2,4],'value':[10,12,15,20]}).set_index('index')
df2 = pd.DataFrame({'index':[0,1,3,4],'value':[10,10,10,10]}).set_index('index')
The above will give you:
In [7]: pd.concat([df1,df2],axis=1).sum(axis=1)
Out[7]:
index
0 20.0
1 22.0
2 15.0
3 10.0
4 30.0
dtype: float64
A comment wanted to see which solution is faster. tldr, using .add()
is about 2x faster than pd.concat
, but if you use .align
explicitly, it'll be even faster, albeit marginally
# create example dataframe
N=1000000
np.random.seed(666)
df1 = pd.DataFrame({'A':np.random.randint(0,1000, size=N),"B":np.random.randint(0,1000, size=N)}, index=np.arange(0,N,1))
df2 = pd.DataFrame({'A':np.random.randint(0,1000, size=N),"B":np.random.randint(0,1000, size=N)}, index=np.arange(0,N,1))
# punch a few holes to mismatch integers
df1 = df1.loc[np.random.choice(df1.index,int(.8*N), replace=False),:].sort_index()
df2 = df2.loc[np.random.choice(df1.index,int(.8*N), replace=False),:].sort_index()
# define functions
def summatch1(x,y):
return pd.concat([x,y],axis=1).sum(axis=1)
def summatch2(x,y):
return x.add(y,fill_value=0)
def summatch3(x,y):
z = x.merge(y,how='outer',suffixes=['1','2'],left_index=True, right_index=True).fillna(0)
return z.A1+z.A2
def summatch4(x,y):
p1,p2 = x.align(y,join='outer', fill_value=0)
return p1+p2
# time them
%timeit summatch1(df1.A, df2.A)
%timeit summatch2(df1.A, df2.A)
%timeit summatch3(df1.loc[:,['A']], df2.loc[:,['A']])
%timeit summatch4(df1.A, df2.A)
Times are reported in the following order:
pd.concat
df1.add
pd.merge
df1.align
800k rows in df1
and df2
79.7 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
36.6 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
40.2 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
31.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
80k rows in df1
and df2
7.2 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.08 ms ± 90.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.14 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.76 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 4
Reputation: 153500
Because of the intrinsic data alignment in pandas, you can use add
with fill_value=0
and it will sum these two series based on index alignment.
df1.add(df2,fill_value=0)
Input:
df1 = pd.Series([10]*4,index=[0,1,3,4])
df2 = pd.Series([10,12,15,20], index=[0,1,2,4])
df1.add(df2,fill_value=0)
Output:
0 20.0
1 22.0
2 15.0
3 10.0
4 30.0
dtype: float64
Upvotes: 6