Reputation: 23
I have two dataframes df1 and df2, where df1 has a datetime index based on days, and df2 has two date columns 'wk start' and 'wk end' that are weekly ranges as well as one data column 'statistic' that stores data corresponding to the week range.
I would like to add to df1 a column for 'statistic' whereby I lookup each date (on a daily basis, i.e. each row) and try to find the corresponding 'statistic' depending on the week that this date falls into.
I believe the answer would require merging df2 into df1 but I'm lost as to how to proceed after that.
df1: (note: I skipped the rows between 2019-06-12 and 2019-06-16 to keep the example short.)
date | age |
---|---|
2019-06-10 | 20 |
2019-06-11 | 21 |
2019-06-17 | 19 |
2019-06-18 | 18 |
df2:
wk start | wk end | statistic |
---|---|---|
2019-06-10 | 2019-06-14 | 102 |
2019-06-17 | 2019-06-21 | 100 |
2019-06-24 | 2019-06-28 | 547 |
2019-07-02 | 2019-07-25 | 268 |
Desired output:
date | age | statistic |
---|---|---|
2019-06-10 | 20 | 102 |
2019-06-11 | 21 | 102 |
2019-06-17 | 19 | 100 |
2019-06-18 | 18 | 100 |
code for the dataframes d1 and d2
import pandas as pd
import datetime
data1 = {
'date': ['2019-06-10', '2019-06-11', '2019-06-17', '2019-06-18'],
'age': [20, 21, 19, 18]
}
data1['date'] = pd.to_datetime(data1['date'])
df1 = pd.DataFrame(data1)
df1.set_index('date', inplace=True)
data2 = {
'wk start': ['2019-06-10', '2019-06-17', '2019-06-24', '2019-07-02'],
'wk end': ['2019-06-14', '2019-06-21', '2019-06-28', '2019-07-05'],
'height': [120, 121, 119, 118]
}
data2['wk start'] = pd.to_datetime(data2['wk start'])
data2['wk end'] = pd.to_datetime(data2['wk end'])
df2 = pd.DataFrame(data2)
Upvotes: 2
Views: 1021
Reputation: 23217
You can firstly reset_index()
on df1
to get the date
row index back to data column. Then, cross join df1
and df2
by .merge()
with how='cross'
and then filter the result by date
field is between wk start
and wk end
using .between()
, as follows;
df_merge = df1.reset_index().merge(df2, how='cross')
df_out = df_merge[df_merge['date'].between(df_merge['wk start'], df_merge['wk end'])]
Or, if your Pandas version is < 1.2.0 (released in December 2020)
df_merge = df1.reset_index().assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
df_out = df_merge[df_merge['date'].between(df_merge['wk start'], df_merge['wk end'])]
Result:
print(df_out)
date age wk start wk end height
0 2019-06-10 20 2019-06-10 2019-06-14 120
4 2019-06-11 21 2019-06-10 2019-06-14 120
9 2019-06-17 19 2019-06-17 2019-06-21 121
13 2019-06-18 18 2019-06-17 2019-06-21 121
You can further remove the 2 columns wk start
wk end
and set column date
as index by:
df_out = df_out.drop(['wk start', 'wk end'], axis=1).set_index('date')
Result:
print(df_out)
age height
date
2019-06-10 20 120
2019-06-11 21 120
2019-06-17 19 121
2019-06-18 18 121
Upvotes: 4
Reputation: 28709
Cartesian product are generally fast (if the data is not so large), but can also consume lots of memory (and in some cases, rather inefficient); an alternative would be intervalIndex :
interval_index = pd.IntervalIndex.from_tuples([*zip(df2['wk start'], df2['wk end'])])
index_position = interval_index.get_indexer(df1.index)
df1.assign(statistic = df2.height[index_position].array)
age statistic
date
2019-06-10 20 120
2019-06-11 21 120
2019-06-17 19 121
2019-06-18 18 121
Upvotes: 0
Reputation: 1
df[df.some_date.between(start_date, end_date)] you can isin method on the date column like that df[df["date"].isin(pd.date_range(start_date, end_date))] follow this Select DataFrame rows between two dates and have a look at this Assign values in one dataframe if date is in date range in another dataframe and projects are equal
Upvotes: 1
Reputation: 31
You could loop through the dataframe and subset the second dataframe as you go.
import pandas as pd
import datetime
data1 = {'date': ['2019-06-10', '2019-06-11', '2019-06-17', '2019-06-18'], 'age': [20, 21, 19, 18]}
data1['date']=pd.to_datetime(data1['date'])
df1 = pd.DataFrame(data1)
df1.set_index('date', inplace=True)
data2 = {'wk start': ['2019-06-10', '2019-06-17', '2019-06-24', '2019-07-02'], 'wk end':[ '2019-06-14', '2019-06-21', '2019-06-28', '2019-07-05'], 'height': [120,121, 119, 118]}
data2['wk start']=pd.to_datetime(data2['wk start'])
data2['wk end']=pd.to_datetime(data2['wk end'])
df2 = pd.DataFrame(data2)
# Loop
list1 = []
for row in df1.iterrows():
subdf = df2[(df2['wk start'] <= index) & (df2['wk end'] >= index)]
list1.append(subdf['height'].tolist()[0])
df1['height'] = list1
print(df1)
The values differ a bit from the table you show above (statistic one) because the code you provided has different values and height rather than statistic, but the principle is the same.
Upvotes: 1