vestland
vestland

Reputation: 61154

How to find the difference between hour of day of separate days in a pandas dataframe?

For a dataframe with no missing values, this would be as easy as df.diff(periods=24, axis=0). But how is it possible to connect the calculations to the index values?


Reproducible dataframe - Code:

# Imports
import pandas as pd
import numpy as np

# A dataframe with two variables, random numbers and hourly time series
np.random.seed(123)
rows = 36
rng = pd.date_range('1/1/2017', periods=rows, freq='H')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['A', 'B']) 
df = df.set_index(rng)

Reproducible dataframe - Screenshot:

enter image description here

Desired output - Code:

# Running difference step = 24
df = df.diff(periods=24, axis=0)
df = df.dropna(axis=0, how='all')

Desired output - Screenshot

enter image description here

The real challenge

The problem is that my real-world examples are full of missing values. So I'll have to connect the difference intervals with the index values, and I have no Idea how. I've tried a few solutions with filling in the missing hours in the index first, and then running the differences like before, but it's not very elegant.

Thank you for any suggestions!

Edit - As requested in the comments, here's my best attempt for a bit longer time period:

df_missing = df.drop(df.index[[2,3]])
newIndex = pd.date_range(start = '1/1/2017',  end = '1/3/2017', freq='H')
df_missing = df_missing.reindex(newIndex, fill_value = np.nan)
df_refilled = df_missing.diff(periods=24, axis=0)

Compared to the other suggestions, I would say that this is not very elegant =)

Upvotes: 4

Views: 226

Answers (2)

BENY
BENY

Reputation: 323326

I think maybe you can use groupby

df.groupby(df.index.hour).diff().dropna()
Out[784]: 
                        A     B
2017-01-02 00:00:00  -3.0   3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00  -4.0  -7.0
2017-01-02 03:00:00   3.0 -29.0
2017-01-02 04:00:00  -4.0   3.0
2017-01-02 05:00:00 -17.0  -6.0
2017-01-02 06:00:00 -20.0  35.0
2017-01-02 07:00:00  -2.0 -40.0
2017-01-02 08:00:00  13.0 -21.0
2017-01-02 09:00:00  -9.0 -13.0
2017-01-02 10:00:00   0.0   3.0
2017-01-02 11:00:00 -21.0  -9.0

Upvotes: 5

cs95
cs95

Reputation: 402852

You can snap your dataframe to hourly recordings using asfreq, and then use diff?

df.asfreq('1H').diff(periods=24, axis=0).dropna()

Or, use shift and then subtract (instead of diff),

v = df.asfreq('1h') 
(v - v.shift(periods=24)).dropna()

                        A     B
2017-01-02 00:00:00  -3.0   3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00  -4.0  -7.0
2017-01-02 03:00:00   3.0 -29.0
2017-01-02 04:00:00  -4.0   3.0
2017-01-02 05:00:00 -17.0  -6.0
2017-01-02 06:00:00 -20.0  35.0
2017-01-02 07:00:00  -2.0 -40.0
2017-01-02 08:00:00  13.0 -21.0
2017-01-02 09:00:00  -9.0 -13.0
2017-01-02 10:00:00   0.0   3.0
2017-01-02 11:00:00 -21.0  -9.0

Upvotes: 5

Related Questions