Elena
Elena

Reputation: 33

Reverse rows in time series dataframe

I have a time series that looks like this:

value   date
63.85   2017-01-15
63.95   2017-01-22
63.88   2017-01-29
64.02   2017-02-05
63.84   2017-02-12
62.13   2017-03-05
65.36   2017-03-25
66.45   2017-04-25

And I would like to reverse the order of the rows so they look like this:

value   date
66.45   2000-01-01
65.36   2000-02-01
62.13   2000-02-20
63.84   2000-03-12
64.02   2000-03-19
63.88   2000-03-26
63.95   2000-04-02
63.85   2000-04-09

As you can see, the "value" column requires to simply flip the row values, but for the date column what I would like to do is keep the same "difference in days" between dates. It doesn't really matter what the start date value is as long as the difference in days is flipped correctly too. In the second dataframe of the example, the start date value is 2000-01-01 and the second value is 2020-02-01, which is 31 days later than the first date. This "day difference" of 31 days is the same one as the last (2017-04-25) and penultimate date (2017-03-25) of the first dataframe. And, the same for the second (2000-02-01) and the third value (2000-02-20) of the second dataframe: the "difference in days" is 20 days, the same one between the penultimate date (2017-03-25) and the antepenultimate date (2017-03-05) of the first dataframe. And so on.

I believe that the steps needed to do this would require to first calculate this "day differences", but I would like to know how to do it efficiently. Thank you :)

Upvotes: 2

Views: 2110

Answers (2)

cesar-nascimento
cesar-nascimento

Reputation: 26

Here's how I did it:

Creating the DataFrame:

value = [63.85, 63.95, 63.88, 64.02, 63.84, 62.13, 65.36, 66.45]
date = ["2017-01-15", "2017-01-22", "2017-01-29", "2017-02-05", "2017-02-12", "2017-03-05", "2017-03-25", "2017-04-25",]
df = pd.DataFrame({"value": value, "date": date})

Creating a second DataFrame with the values reversed and converting the date column to datetime

new_df = df.astype({'date': 'datetime64'})
new_df.sort_index(ascending=False, inplace=True, ignore_index=True)
new_df

    value   date
0   66.45   2017-04-25
1   65.36   2017-03-25
2   62.13   2017-03-05
3   63.84   2017-02-12
4   64.02   2017-02-05
5   63.88   2017-01-29
6   63.95   2017-01-22
7   63.85   2017-01-15

I then used pandas.Series.diff to calculate the time delta between each row and converted those values to absolute values.

time_delta_series = new_df['date'].diff().abs()
time_delta_series

0       NaT
1   31 days
2   20 days
3   21 days
4    7 days
5    7 days
6    7 days
7    7 days
Name: date, dtype: timedelta64[ns]

Then you need to convert those values to a cumulative time delta. But to use the cumsum() method you need to first remove the missing values (NaT).

time_delta_series = time_delta_series.fillna(pd.Timedelta(seconds=0)).cumsum()
time_delta_series

0     0 days
1    31 days
2    51 days
3    72 days
4    79 days
5    86 days
6    93 days
7   100 days
Name: date, dtype: timedelta64[ns

Then you can create your starting date and create the date column for the second DataFrame we created before:

from datetime import date
start = date(2000, 1, 1)
new_df['date'] = start
new_df['date'] = new_df['date'] + time_delta_series
new_df

    value   date
0   66.45   2000-01-01
1   65.36   2000-02-01
2   62.13   2000-02-21
3   63.84   2000-03-13
4   64.02   2000-03-20
5   63.88   2000-03-27
6   63.95   2000-04-03
7   63.85   2000-04-10

Upvotes: 1

Jacob K
Jacob K

Reputation: 784

NumPy has support for this via its datetime and timedelta data types.

First you reverse both columns in your time series as follows:

import pandas as pd
import numpy as np
df2 = df
df2 = df2.iloc[::-1]
df2

where df is your original time series data and df2 (shown below) is the reversed time series.

    value   date
7   66.45   2017-04-25
6   65.36   2017-03-25
5   62.13   2017-03-05
4   63.84   2017-02-12
3   64.02   2017-02-05
2   63.88   2017-01-29
1   63.95   2017-01-22
0   63.85   2017-01-15

Next you find the day differences and store them as timedelta objects:

dates_np = np.array(df2.date).astype(np.datetime64)       # Convert dates to np.datetime64 ojects
timeDeltas = np.insert(abs(np.diff(dates_np)), 0, 0)      # np.insert is to account for -1 length during np.diff call

d2 = {'value': df_reversed.value, 'day_diff': timeDeltas}     # Create new dataframe (df3)
df3 = pd.DataFrame(data=d2)
df3

where df3 (the day differences table) looks like this:

    value   day_diff
7   66.45   0 days
6   65.36   31 days
5   62.13   20 days
4   63.84   21 days
3   64.02   7 days
2   63.88   7 days
1   63.95   7 days
0   63.85   7 days

Lastly, to get back to dates accumulating from a start data, you do the following:

startDate = np.datetime64('2000-01-01')         # You can change this if you like
df4 = df2         # Copy coumn data from df2
df4.date = np.array(np.cumsum(df3.day_diff) + startDate   # np.cumsum accumulates the day_diff sum  
df4

where df4 (the start date accumulation) looks like this:

    value   date
7   66.45   2000-01-01
6   65.36   2000-02-01
5   62.13   2000-02-21
4   63.84   2000-03-13
3   64.02   2000-03-20
2   63.88   2000-03-27
1   63.95   2000-04-03
0   63.85   2000-04-10

I noticed there is a 1-day discrepancy with my final table, however this is most likely due to the implementation of timedelta inclusivity/exluclusivity.

Upvotes: 1

Related Questions