keynesiancross
keynesiancross

Reputation: 3529

Pandas - adding columns, matching on index

I'm trying to figure out if Panda's, when adding two series together, automatically matches on index or if it simply adds by element position. If its just by position, is there a way to get it to add on index? I've looked at merging, but I'm not very clear if the key in this case can be the index of both...

For example, if I have do DFs:

df1 = index  value
        0      10
        1      12
        2      15
        4      20


df2 = index  value
        0      10
        1      10
        3      10
        4      10

and I want to add df1[total] = df1[value] + df2[value] =

df1 = index  value
        0      20
        1      22
        2      15
        3      10
        4      30

Thanks for your help in advance!

Upvotes: 9

Views: 19229

Answers (2)

Gene Burinsky
Gene Burinsky

Reputation: 10223

Just do this:

pd.concat([df1,df2], axis=1).sum(axis=1)

pd.concat will merge the 2(or more) frames and match based on index. sum(axis=1) just sums across the rows.

Here's the working example:

#create the example data
df1 = pd.DataFrame({'index':[0,1,2,4],'value':[10,12,15,20]}).set_index('index')
df2 = pd.DataFrame({'index':[0,1,3,4],'value':[10,10,10,10]}).set_index('index')

The above will give you:

In [7]: pd.concat([df1,df2],axis=1).sum(axis=1)
Out[7]:
index
0    20.0
1    22.0
2    15.0
3    10.0
4    30.0
dtype: float64

Timing Update

A comment wanted to see which solution is faster. tldr, using .add() is about 2x faster than pd.concat, but if you use .align explicitly, it'll be even faster, albeit marginally

# create example dataframe
N=1000000
np.random.seed(666)
df1 = pd.DataFrame({'A':np.random.randint(0,1000, size=N),"B":np.random.randint(0,1000, size=N)}, index=np.arange(0,N,1))
df2 = pd.DataFrame({'A':np.random.randint(0,1000, size=N),"B":np.random.randint(0,1000, size=N)}, index=np.arange(0,N,1))

# punch a few holes to mismatch integers
df1 = df1.loc[np.random.choice(df1.index,int(.8*N), replace=False),:].sort_index()
df2 = df2.loc[np.random.choice(df1.index,int(.8*N), replace=False),:].sort_index()

# define functions
def summatch1(x,y):
    return pd.concat([x,y],axis=1).sum(axis=1)

def summatch2(x,y):
    return x.add(y,fill_value=0)

def summatch3(x,y):
    z = x.merge(y,how='outer',suffixes=['1','2'],left_index=True, right_index=True).fillna(0)
    return z.A1+z.A2

def summatch4(x,y):
    p1,p2 = x.align(y,join='outer', fill_value=0)
    return p1+p2

# time them
%timeit summatch1(df1.A, df2.A)
%timeit summatch2(df1.A, df2.A)
%timeit summatch3(df1.loc[:,['A']], df2.loc[:,['A']])
%timeit summatch4(df1.A, df2.A)

Results

Times are reported in the following order:

  1. pd.concat
  2. df1.add
  3. pd.merge
  4. df1.align

800k rows in df1 and df2

79.7 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
36.6 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
40.2 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
31.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

80k rows in df1 and df2

7.2 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.08 ms ± 90.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.14 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.76 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 4

Scott Boston
Scott Boston

Reputation: 153500

Because of the intrinsic data alignment in pandas, you can use add with fill_value=0 and it will sum these two series based on index alignment.

df1.add(df2,fill_value=0)

Input:

df1 = pd.Series([10]*4,index=[0,1,3,4])

df2 = pd.Series([10,12,15,20], index=[0,1,2,4])

df1.add(df2,fill_value=0)

Output:

0    20.0
1    22.0
2    15.0
3    10.0
4    30.0
dtype: float64

Upvotes: 6

Related Questions