pbreach
pbreach

Reputation: 17017

Subtract first row from all rows in Pandas DataFrame

I have a pandas dataframe:

a = pd.DataFrame(rand(5,6)*10, index=pd.DatetimeIndex(start='2005', periods=5, freq='A'))
a.columns = pd.MultiIndex.from_product([('A','B'),('a','b','c')])

I want to subtract the row a['2005'] from a. To do that I've tried this:

In [22]:

a - a.ix['2005']

Out[22]:
    A   B
    a   b   c   a   b   c
2005-12-31  0   0   0   0   0   0
2006-12-31  NaN     NaN     NaN     NaN     NaN     NaN
2007-12-31  NaN     NaN     NaN     NaN     NaN     NaN
2008-12-31  NaN     NaN     NaN     NaN     NaN     NaN
2009-12-31  NaN     NaN     NaN     NaN     NaN     NaN

Which obviously doesn't work because pandas is lining up the index while doing the operation. This works:

In [24]:

pd.DataFrame(a.values - a['2005'].values, index=a.index, columns=a.columns)

Out[24]:
    A   B
    a   b   c   a   b   c
2005-12-31  0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
2006-12-31  -3.326761   -7.164628   8.188518    -0.863177   0.519587    -3.281982
2007-12-31  3.529531    -4.719756   8.444488    1.355366    7.468361    -4.023797
2008-12-31  3.139185    -8.420257   1.465101    -2.942519   1.219060    -5.146019
2009-12-31  -3.459710   0.519435    -1.049617   -2.779370   4.792227    -1.922461

But I don't want to have to form a new DataFrame every time I have to do this kind of operation. I've tried the apply() method like this: a.apply(lambda x: x-a['2005'].values) but I get ValueError: cannot copy sequence with size 6 to array axis with dimension 5 So I'm not really sure how to proceed. Is there a simple way to do this that I am not seeing? I think there should be an easy way to do this in place so you don't have to construct a new dataframe each time. I also tried the sub() method but the subtraction is only applied to the first row whereas I want to subtract the first row from each row in the dataframe.

Upvotes: 9

Views: 20599

Answers (3)

Satyajeet Patil
Satyajeet Patil

Reputation: 31

For timestamp values to calculate how much time passed with respect to the start time use:

df['Time_column'].apply(lambda x: x-df.iloc[[0],[1]])

Where df.iloc[[0],[1]] = the start time

Upvotes: 3

Gourneau
Gourneau

Reputation: 12878

Here is a more verbose simple break down of how to do this.

First make a simple DataFrame to make it easier to understand.

import numpy as np
import pandas as pd
#make a simple DataFrame
df = pd.DataFrame(np.fromfunction(lambda i, j: i+1 , (3, 3), dtype=int))

Which will look like this

# 1 1 1
# 2 2 2
# 3 3 3

Now get the values from the first row

first_row = df.iloc[[0]].values[0]

Now use apply() to subtract the first row from the rest of the rows.

df.apply(lambda row: row - first_row, axis=1)

The result will look like this. See that 1 was subtracted from each row

#  0 0 0
#  1 1 1
#  2 2 2

Upvotes: 4

unutbu
unutbu

Reputation: 880547

Pandas is great for aligning by index. So when you want Pandas to ignore the index, you need to drop the index. You can do that by converting the DataFrame a.loc['2005'] to a 1-dimensional NumPy array:

In [56]: a - a.loc['2005'].values.squeeze()
Out[56]: 
                   A                             B                    
                   a         b         c         a         b         c
2005-12-31  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
2006-12-31  0.325968  1.314776 -0.789328 -0.344669 -2.518857  7.361711
2007-12-31  0.084203  2.234445 -2.838454 -6.176795 -3.645513  8.955443
2008-12-31  3.798700  0.299529  1.303325 -2.770126 -1.284188  3.093806
2009-12-31  1.520930  2.660040  0.846996 -9.437851 -2.886603  6.705391

The squeeze method converts the NumPy array, a.loc['2005'], of shape to (1, 6) to an array of shape (6,). This allows the array to be broadcasted (during the subtraction) as desired.

Upvotes: 13

Related Questions