Jean-Francois T.
Jean-Francois T.

Reputation: 12940

Dataframe diffs with values from 2 columns

On Pandas Dataframe, we can compute difference between values on different rows of a given column (or different columns of a given rows): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html

How can we compute a series that correspond to the difference of values of 2 different columns in consecutive rows.

For example, if we have a start date and an end date, how can we compute the time between the end date of row N and the start date of row N+1?

>>> df = pd.DataFrame({"Start": pd.date_range("2020-01-01", "2020-01-07", freq="2D"), "End":pd.date_range("2020-01-02", "2020-01-11", freq="3D")})
>>> df
    Start   End
0   2020-01-01  2020-01-02
1   2020-01-03  2020-01-05
2   2020-01-05  2020-01-08
3   2020-01-07  2020-01-11

# and then we would call something like that
>>> delta = df.ndiff(columns=["End", "Start"])
0      NaT
1   1 days
2   0 days
3   -1 days
dtype: timedelta64[ns]

Upvotes: 0

Views: 37

Answers (1)

Jean-Francois T.
Jean-Francois T.

Reputation: 12940

Elaboration on the answer from @wwnde (put as comment)

The idea is to compare rows after a shift:

>>> df['Start'] - df['End'].shift()
0       NaT
1    1 days
2    0 days
3   -1 days
dtype: timedelta64[ns]

We can also use the column names as attributes:
>>> df.Start - df.End.shift()

In other words, the function s.diff(periods) (same for a Dataframe) is equivalent to s - s.shift(periods).

Upvotes: 1

Related Questions