Reputation: 23
I have two dataframes of different sizes in Python. The smaller dataframe has two datetime columns, one for the beginning and one for the end of an interval. The other dataframe is bigger (more rows and columns) and has a single datetime column.
df_A
Date_hour_beginning    Date_hour_end
3/8/2019 18:35         3/8/2019 19:45
4/8/2019 14:22         4/8/2019 14:55

df_B
Date_hour          compression
3/8/2019 18:37     41
3/8/2019 18:55     47
3/8/2019 19:30     55
3/8/2019 19:51     51
4/8/2019 14:10     53
4/8/2019 14:35     48
4/8/2019 14:51     51
4/8/2019 15:02     58
I want to add to df_A the mean and the amplitude (max minus min) of the compression values that fall within each row's datetime range, to get the following result:
df_A
Date_hour_beginning    Date_hour_end      mean_compression    amplitude
3/8/2019 18:35         3/8/2019 19:45     47.66               14
4/8/2019 14:22         4/8/2019 14:55     49.5                3
I tried np.where and groupby, but I got an error about mismatched dataframe shapes.
Upvotes: 2
Views: 100
Reputation: 394
Here is my solution. It is kind of a more verbose (and maybe more readable?) version of eva-vw's. eva-vw's uses the .apply() method, which is the fastest way of looping over the rows of your dataframe. However, it should only make a significant difference in run time if your df_A has really many (many) rows (which does not seem to be the case here).
for i, row in df_A.iterrows():
    start = row['Date_hour_beginning']
    end = row['Date_hour_end']
    # select the compression values of df_B whose timestamp falls inside [start, end]
    mask = (df_B['Date_hour'] >= start) & (df_B['Date_hour'] <= end)
    compression_values = df_B.loc[mask, 'compression']
    df_A.loc[i, 'avg comp'] = compression_values.mean()
    df_A.loc[i, 'amp comp'] = compression_values.max() - compression_values.min()
For completeness, here is how I created the dataframes:
import numpy as np
import pandas as pd
columns = ['Date_hour_beginning', 'Date_hour_end']
times_1 = pd.to_datetime(['3/8/2019 18:35', '3/8/2019 19:45'])
times_2 = pd.to_datetime(['4/8/2019 14:22', '4/8/2019 14:55'])
df_A = pd.DataFrame(data=[times_1, times_2], columns=columns)
data_B = [ ['3/8/2019 18:37', 41],
['3/8/2019 18:55', 47],
['3/8/2019 19:30', 55],
['3/8/2019 19:51', 51],
['4/8/2019 14:10', 53],
['4/8/2019 14:35', 48],
['4/8/2019 14:51', 51],
['4/8/2019 15:02', 58]]
columns_B = ['Date_hour', 'compression']
df_B = pd.DataFrame(data=data_B, columns=columns_B)
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
To go a bit further: to solve your problem, you need to loop over the rows of df_A. This can be done in three main ways: (i) with a plain for loop over the indices of the rows of the dataframe, (ii) with a for loop using the .iterrows() method, or (iii) with the .apply() method.

I ordered them from the slowest to the fastest at runtime. I picked method (ii) and eva-vw picked method (iii). The advantage of .apply() is that it is the fastest, but its disadvantage (to me) is that you have to write everything you want to do with the row in a one-line lambda function.
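For completeness, method (i) could look roughly like this (only a sketch, assuming df_A and df_B are built as above; the column names mean_compression and amplitude are taken from the question):

# (i) plain for loop over the row positions of df_A
mean_compression = []
amplitude = []
for i in range(len(df_A)):
    start = df_A['Date_hour_beginning'].iloc[i]
    end = df_A['Date_hour_end'].iloc[i]
    # compression values of df_B that fall inside [start, end]
    values = df_B.loc[(df_B['Date_hour'] >= start) & (df_B['Date_hour'] <= end), 'compression']
    mean_compression.append(values.mean())
    amplitude.append(values.max() - values.min())
df_A['mean_compression'] = mean_compression
df_A['amplitude'] = amplitude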
Upvotes: 2
Reputation: 670
# create test dataframes
df_A = pd.DataFrame(
{
"Date_hour_beginning": ["3/8/2019 18:35", "4/8/2019 14:22"],
"Date_hour_end": ["3/8/2019 19:45", "4/8/2019 14:55"],
}
)
df_B = pd.DataFrame(
{
"Date_hour": [
"3/8/2019 18:37",
"3/8/2019 18:55",
"3/8/2019 19:30",
"3/8/2019 19:51",
"4/8/2019 14:10",
"4/8/2019 14:35",
"4/8/2019 14:51",
"4/8/2019 15:02",
],
"compression": [41, 47, 55, 51, 53, 48, 51, 58],
}
)
# convert to datetime
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
# accumulate compression values per range
df_A["compression"] = df_A.apply(
lambda row: df_B.loc[
(df_B["Date_hour"] >= row["Date_hour_beginning"])
& (df_B["Date_hour"] <= row["Date_hour_end"]),
"compression",
].values.tolist(),
axis=1,
)
# calculate mean compression and amplitude
df_A['mean_compression'] = df_A['compression'].apply(lambda x: sum(x) / len(x))
df_A['amplitude'] = df_A['compression'].apply(lambda x: max(x) - min(x))
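For reference, this should give roughly the following with the sample data (the numbers match the expected result in the question; exact float formatting depends on the pandas version):

print(df_A[['mean_compression', 'amplitude']])
#    mean_compression  amplitude
# 0         47.666667         14
# 1         49.500000          3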
Upvotes: 2
Reputation: 1055
groupby can accept a Series with the same index as the dataframe, i.e.
df['Date_hour'] = pd.to_datetime(df['Date_hour'])
df_a['begin'] = pd.to_datetime(df_a['begin'])
df_a['end'] = pd.to_datetime(df_a['end'])
# map each row of df to the index of the df_a interval containing its timestamp
# (note: .index[0] assumes every Date_hour falls inside one of the [begin, end] ranges)
selector = df.apply(lambda x: df_a.query(f'begin <= \'{x["Date_hour"]}\' <= end').index[0], axis=1)
for i_gr, gr in df.groupby(selector):
    print(i_gr, gr)
And then go on with your .mean() or .median().
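A minimal sketch of how that last step could be completed, assuming this answer's column names (begin, end, Date_hour, compression); the find_interval helper is my own, and it simply drops timestamps that fall outside every interval:

import pandas as pd

# the question's data, under this answer's column names
df_a = pd.DataFrame({'begin': ['3/8/2019 18:35', '4/8/2019 14:22'],
                     'end':   ['3/8/2019 19:45', '4/8/2019 14:55']})
df = pd.DataFrame({'Date_hour': ['3/8/2019 18:37', '3/8/2019 18:55', '3/8/2019 19:30',
                                 '3/8/2019 19:51', '4/8/2019 14:10', '4/8/2019 14:35',
                                 '4/8/2019 14:51', '4/8/2019 15:02'],
                   'compression': [41, 47, 55, 51, 53, 48, 51, 58]})
df['Date_hour'] = pd.to_datetime(df['Date_hour'])
df_a['begin'] = pd.to_datetime(df_a['begin'])
df_a['end'] = pd.to_datetime(df_a['end'])

def find_interval(ts):
    # index of the df_a interval containing ts, or NaN if ts falls outside all of them
    matches = df_a[(df_a['begin'] <= ts) & (ts <= df_a['end'])].index
    return matches[0] if len(matches) else float('nan')

selector = df['Date_hour'].apply(find_interval)
stats = df.groupby(selector)['compression'].agg(['mean', 'max', 'min'])
stats.index = stats.index.astype(int)  # group keys back to df_a's integer index
df_a['mean_compression'] = stats['mean']
df_a['amplitude'] = stats['max'] - stats['min']
print(df_a)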
Upvotes: 1
Reputation: 4215
Use this:
# parse the datetime columns
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
# cross join df_A and df_B via a constant key, then keep only the rows
# whose Date_hour falls inside the [beginning, end] interval
df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key', axis=1)
df_merge = df_merge.query('Date_hour >= Date_hour_beginning and Date_hour <= Date_hour_end')
# amplitude = max - min of compression per interval, then average per interval
df_merge['amplitude'] = df_merge.groupby(['Date_hour_beginning','Date_hour_end'])['compression'].transform(lambda x: x.max()-x.min())
df_merge = df_merge.groupby(['Date_hour_beginning','Date_hour_end']).mean()
Output:
                                         compression  amplitude
Date_hour_beginning Date_hour_end
2019-03-08 18:35:00 2019-03-08 19:45:00    47.666667       14.0
2019-04-08 14:22:00 2019-04-08 14:55:00    49.500000        3.0
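Side note: on pandas 1.2 or newer the helper key column can be avoided with a cross merge; the remaining steps stay the same. A minimal variant:

# cross join without the dummy 'key' column (requires pandas >= 1.2)
df_merge = pd.merge(df_A, df_B, how='cross')
df_merge = df_merge.query('Date_hour_beginning <= Date_hour <= Date_hour_end')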
Upvotes: 1