Reputation: 23
I have two dataframes of different sizes in Python. The smaller dataframe has two datetime columns, one for the beginning and one for the end of an interval. The other dataframe is bigger (more rows and columns) and has a single datetime column.
df_A
Date_hour_beginning    Date_hour_end
3/8/2019 18:35         3/8/2019 19:45
4/8/2019 14:22         4/8/2019 14:55

df_B
Date_hour          compression
3/8/2019 18:37     41
3/8/2019 18:55     47
3/8/2019 19:30     55
3/8/2019 19:51     51
4/8/2019 14:10     53
4/8/2019 14:35     48
4/8/2019 14:51     51
4/8/2019 15:02     58
I want to add to df_A the mean and the amplitude (max minus min) of the compression values that fall within each row's datetime range, to get the following result:
df_A
Date_hour_beginning    Date_hour_end      mean_compression    amplitude
3/8/2019 18:35         3/8/2019 19:45     47.66               14
4/8/2019 14:22         4/8/2019 14:55     49.5                3
I tried np.where and groupby, but I got an error about mismatched dataframe shapes.
Upvotes: 2
Views: 100
Reputation: 394
Here is my solution. It is kind of a more verbose (and maybe more readable?) version of eva-vw's. eva-vw's uses the .apply() method, which is the fastest way of looping over the rows of your dataframe. However, it should only make a significant difference in run time if your df_A has really many (many) rows (which does not seem to be the case here).
for i, row in df_A.iterrows():
    start = row['Date_hour_beginning']
    end = row['Date_hour_end']
    # select the compression values of df_B whose timestamp falls inside [start, end]
    mask = (df_B['Date_hour'] >= start) & (df_B['Date_hour'] <= end)
    compression_values = df_B.loc[mask, 'compression']
    df_A.loc[i, 'avg comp'] = compression_values.mean()
    df_A.loc[i, 'amp comp'] = compression_values.max() - compression_values.min()
For completeness, here is how I created the dataframes:
import numpy as np
import pandas as pd
columns = ['Date_hour_beginning', 'Date_hour_end']
times_1 = pd.to_datetime(['3/8/2019 18:35', '3/8/2019 19:45'])
times_2 = pd.to_datetime(['4/8/2019 14:22', '4/8/2019 14:55'])
df_A = pd.DataFrame(data=[times_1, times_2], columns=columns)
data_B = [ ['3/8/2019 18:37', 41],
['3/8/2019 18:55', 47],
['3/8/2019 19:30', 55],
['3/8/2019 19:51', 51],
['4/8/2019 14:10', 53],
['4/8/2019 14:35', 48],
['4/8/2019 14:51', 51],
['4/8/2019 15:02', 58]]
columns_B = ['Date_hour', 'compression']
df_B = pd.DataFrame(data=data_B, columns=columns_B)
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
To go a bit further: to solve your problem, you need to loop over the rows of df_A. This can be done in three main ways: (i) with a plain for loop over the indices of the rows of the dataframe, (ii) with a for loop using the .iterrows() method, or (iii) with the .apply() method.

I ordered them from the slowest to the fastest at runtime. I picked method (ii) and eva-vw picked method (iii). The advantage of .apply() is that it is the fastest, but its disadvantage (to me) is that you have to write everything you want to do with the row in a one-line lambda function.
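For completeness, method (i) could look roughly like this (only a sketch, assuming df_A and df_B are built as above; the column names mean_compression and amplitude are taken from the question):

# (i) plain for loop over the row positions of df_A
mean_compression = []
amplitude = []
for i in range(len(df_A)):
    start = df_A['Date_hour_beginning'].iloc[i]
    end = df_A['Date_hour_end'].iloc[i]
    # compression values of df_B that fall inside [start, end]
    values = df_B.loc[(df_B['Date_hour'] >= start) & (df_B['Date_hour'] <= end), 'compression']
    mean_compression.append(values.mean())
    amplitude.append(values.max() - values.min())
df_A['mean_compression'] = mean_compression
df_A['amplitude'] = amplitude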
Upvotes: 2
Reputation: 670
# create test dataframes
df_A = pd.DataFrame(
{
"Date_hour_beginning": ["3/8/2019 18:35", "4/8/2019 14:22"],
"Date_hour_end": ["3/8/2019 19:45", "4/8/2019 14:55"],
}
)
df_B = pd.DataFrame(
{
"Date_hour": [
"3/8/2019 18:37",
"3/8/2019 18:55",
"3/8/2019 19:30",
"3/8/2019 19:51",
"4/8/2019 14:10",
"4/8/2019 14:35",
"4/8/2019 14:51",
"4/8/2019 15:02",
],
"compression": [41, 47, 55, 51, 53, 48, 51, 58],
}
)
# convert to datetime
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
# accumulate compression values per range
df_A["compression"] = df_A.apply(
lambda row: df_B.loc[
(df_B["Date_hour"] >= row["Date_hour_beginning"])
& (df_B["Date_hour"] <= row["Date_hour_end"]),
"compression",
].values.tolist(),
axis=1,
)
# calculate mean compression and amplitude
df_A['mean_compression'] = df_A['compression'].apply(lambda x: sum(x) / len(x))
df_A['amplitude'] = df_A['compression'].apply(lambda x: max(x) - min(x))
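For reference, this should give roughly the following with the sample data (the numbers match the expected result in the question; exact float formatting depends on the pandas version):

print(df_A[['mean_compression', 'amplitude']])
#    mean_compression  amplitude
# 0         47.666667         14
# 1         49.500000          3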
Upvotes: 2
Reputation: 1055
groupby can accept a Series with the same index as the dataframe, i.e.
df['Date_hour'] = pd.to_datetime(df['Date_hour'])
df_a['begin'] = pd.to_datetime(df_a['begin'])
df_a['end'] = pd.to_datetime(df_a['end'])
# map each row of df to the index of the df_a interval containing its timestamp
# (note: .index[0] assumes every Date_hour falls inside one of the [begin, end] ranges)
selector = df.apply(lambda x: df_a.query(f'begin <= \'{x["Date_hour"]}\' <= end').index[0], axis=1)
for i_gr, gr in df.groupby(selector):
    print(i_gr, gr)
And then go on with your .mean() or .median().
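A minimal sketch of how that last step could be completed, assuming this answer's column names (begin, end, Date_hour, compression); the find_interval helper is my own, and it simply drops timestamps that fall outside every interval:

import pandas as pd

# the question's data, under this answer's column names
df_a = pd.DataFrame({'begin': ['3/8/2019 18:35', '4/8/2019 14:22'],
                     'end':   ['3/8/2019 19:45', '4/8/2019 14:55']})
df = pd.DataFrame({'Date_hour': ['3/8/2019 18:37', '3/8/2019 18:55', '3/8/2019 19:30',
                                 '3/8/2019 19:51', '4/8/2019 14:10', '4/8/2019 14:35',
                                 '4/8/2019 14:51', '4/8/2019 15:02'],
                   'compression': [41, 47, 55, 51, 53, 48, 51, 58]})
df['Date_hour'] = pd.to_datetime(df['Date_hour'])
df_a['begin'] = pd.to_datetime(df_a['begin'])
df_a['end'] = pd.to_datetime(df_a['end'])

def find_interval(ts):
    # index of the df_a interval containing ts, or NaN if ts falls outside all of them
    matches = df_a[(df_a['begin'] <= ts) & (ts <= df_a['end'])].index
    return matches[0] if len(matches) else float('nan')

selector = df['Date_hour'].apply(find_interval)
stats = df.groupby(selector)['compression'].agg(['mean', 'max', 'min'])
stats.index = stats.index.astype(int)  # group keys back to df_a's integer index
df_a['mean_compression'] = stats['mean']
df_a['amplitude'] = stats['max'] - stats['min']
print(df_a)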
Upvotes: 1
Reputation: 4215
Use this:
# parse the datetime columns
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
# cross join df_A and df_B via a constant key, then keep only the rows
# whose Date_hour falls inside the [beginning, end] interval
df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key', axis=1)
df_merge = df_merge.query('Date_hour >= Date_hour_beginning and Date_hour <= Date_hour_end')
# amplitude = max - min of compression per interval, then average per interval
df_merge['amplitude'] = df_merge.groupby(['Date_hour_beginning','Date_hour_end'])['compression'].transform(lambda x: x.max()-x.min())
df_merge = df_merge.groupby(['Date_hour_beginning','Date_hour_end']).mean()
Output:
                                         compression  amplitude
Date_hour_beginning Date_hour_end
2019-03-08 18:35:00 2019-03-08 19:45:00    47.666667       14.0
2019-04-08 14:22:00 2019-04-08 14:55:00    49.500000        3.0
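Side note: on pandas 1.2 or newer the helper key column can be avoided with a cross merge; the remaining steps stay the same. A minimal variant:

# cross join without the dummy 'key' column (requires pandas >= 1.2)
df_merge = pd.merge(df_A, df_B, how='cross')
df_merge = df_merge.query('Date_hour_beginning <= Date_hour <= Date_hour_end')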
Upvotes: 1