johnklee
johnklee

Reputation: 2270

Efficiently calculate the difference between two rows in dataframe

Consider I have a dataframe as below:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['f1', 'f2'], index=['r1', 'r2'])
>>> df
    f1  f2
r1   1   2
r2   3   4

How should I efficiently calculate the absolute different between row r1 and r2 and create another row as r3 to keep the result. That is to say the result will look like:

>>> for cn in df.columns:
...     diff_dat.append(abs(df[cn]['r1'] - df[cn]['r2']))
... 
>>> diff_dat
[2, 2]
>>> df.append(pd.DataFrame([diff_dat], index=['r3'], columns=df.columns))
    f1  f2
r1   1   2
r2   3   4
r3   2   2

Upvotes: 1

Views: 1888

Answers (3)

jezrael
jezrael

Reputation: 862751

Use loc for select rows, subtract, get abs and last add new row by setting with enlargement:

df.loc['r3'] = (df.loc['r1'] - df.loc['r2']).abs()
print (df)
    f1  f2
r1   1   2
r2   3   4
r3   2   2

Performance for 1000 columns:

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(2, 1000)), index=['r1', 'r2']).add_prefix('f')-5

#Mayank Porwal solution
In [40]: %timeit df.append(df.diff().dropna().abs())
1.51 ms ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#jezrael solution
In [41]: %timeit df.loc['r3'] = (df.loc['r1'] - df.loc['r2']).abs()
663 µs ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#NaT3z solution
In [42]: %timeit df.loc["r3"] = df.apply(lambda c: abs(c["r1"] - c["r2"]), axis=0)
967 µs ± 80.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For improve performance is possible use numpy:

In [49]: %timeit df.loc['r3'] = np.abs(df.loc['r1'].values - df.loc['r2'].values)
414 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 5

Mayank Porwal
Mayank Porwal

Reputation: 34056

You can do this:

In [576]: df.append(df.diff().dropna().abs())
Out[583]: 
     f1   f2
r1  1.0  2.0
r2  3.0  4.0
r2  2.0  2.0

Upvotes: 5

NaT3z
NaT3z

Reputation: 344

The easiest solution for this is to use the .loc function which takes row indexes.

(Edited to remove code identical to that written by jezrael)

If you aren't familiar with pandas I'd suggest checking out the DataFrame.apply function since it allows broader manipulations of data (both row-wise and columns-wise). A solution would look like this:

df["r3"] = df.apply(lambda c: abs(c["r1"] - c["r2"]), axis=0)

pandas.DataFrame.apply is a powerful tool, letting you apply functions to the rows or columns in your dataset and taking advantage of pandas vectorisation.

Upvotes: 3

Related Questions