I have two dataframes df1 and df2, where df1 has a datetime index based on days, and df2 has two date columns 'wk start' and 'wk end' that are weekly ranges as well as one data column 'statistic' that stores data corresponding to the week range. I would like to add to df1 a column for 'statistic' whereby I lookup each date (on a daily basis, i.e. each row) and try to find the corresponding 'statistic' depending on the week that this date falls into. I believe the answer would require merging df2 into df1 but I'm lost as to how to proceed after that. df1: (note: I skipped the rows between 2019-06-12 and 2019-06-16 to keep the example short.) date age 2019-06-10 20 2019-06-11 21 2019-06-17 19 2019-06-18 18 df2: wk start wk end statistic 2019-06-10 2019-06-14 102 2019-06-17 2019-06-21 100 2019-06-24 2019-06-28 547 2019-07-02 2019-07-25 268 Desired output: date age statistic 2019-06-10 20 102 2019-06-11 21 102 2019-06-17 19 100 2019-06-18 18 100 code for the dataframes d1 and d2 import pandas as pd import datetime data1 = { 'date': ['2019-06-10', '2019-06-11', '2019-06-17', '2019-06-18'], 'age': [20, 21, 19, 18] } data1['date'] = pd.to_datetime(data1['date']) df1 = pd.DataFrame(data1) df1.set_index('date', inplace=True) data2 = { 'wk start': ['2019-06-10', '2019-06-17', '2019-06-24', '2019-07-02'], 'wk end': ['2019-06-14', '2019-06-21', '2019-06-28', '2019-07-05'], 'height': [120, 121, 119, 118] } data2['wk start'] = pd.to_datetime(data2['wk start']) data2['wk end'] = pd.to_datetime(data2['wk end']) df2 = pd.DataFrame(data2)

Reputation: 23

How to look up data in a separate dataframe (df2) based on date in df1 falling between date range values across two columns in df2

I have two dataframes df1 and df2, where df1 has a datetime index based on days, and df2 has two date columns 'wk start' and 'wk end' that are weekly ranges as well as one data column 'statistic' that stores data corresponding to the week range.

I would like to add to df1 a column for 'statistic' whereby I lookup each date (on a daily basis, i.e. each row) and try to find the corresponding 'statistic' depending on the week that this date falls into.

I believe the answer would require merging df2 into df1 but I'm lost as to how to proceed after that.

df1: (note: I skipped the rows between 2019-06-12 and 2019-06-16 to keep the example short.)

date	age
2019-06-10	20
2019-06-11	21
2019-06-17	19
2019-06-18	18

df2:

wk start	wk end	statistic
2019-06-10	2019-06-14	102
2019-06-17	2019-06-21	100
2019-06-24	2019-06-28	547
2019-07-02	2019-07-25	268

Desired output:

date	age	statistic
2019-06-10	20	102
2019-06-11	21	102
2019-06-17	19	100
2019-06-18	18	100

code for the dataframes d1 and d2

import pandas as pd

import datetime

data1 = {
    'date': ['2019-06-10', '2019-06-11', '2019-06-17', '2019-06-18'],
    'age': [20, 21, 19, 18]
}
data1['date'] = pd.to_datetime(data1['date'])
df1 = pd.DataFrame(data1)
df1.set_index('date', inplace=True)

data2 = {
    'wk start': ['2019-06-10', '2019-06-17', '2019-06-24', '2019-07-02'],
    'wk end': ['2019-06-14', '2019-06-21', '2019-06-28', '2019-07-05'],
    'height': [120, 121, 119, 118]
}
data2['wk start'] = pd.to_datetime(data2['wk start'])
data2['wk end'] = pd.to_datetime(data2['wk end'])
df2 = pd.DataFrame(data2)

Upvotes: 2

Answers (4)

SeaBean

Reputation: 23217

You can firstly reset_index() on df1 to get the date row index back to data column. Then, cross join df1 and df2 by .merge() with how='cross' and then filter the result by date field is between wk start and wk end using .between(), as follows;

df_merge = df1.reset_index().merge(df2, how='cross')   
df_out = df_merge[df_merge['date'].between(df_merge['wk start'], df_merge['wk end'])]

Or, if your Pandas version is < 1.2.0 (released in December 2020)

df_merge = df1.reset_index().assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1) 
df_out = df_merge[df_merge['date'].between(df_merge['wk start'], df_merge['wk end'])]

Result:

print(df_out)


         date  age   wk start     wk end  height
0  2019-06-10   20 2019-06-10 2019-06-14     120
4  2019-06-11   21 2019-06-10 2019-06-14     120
9  2019-06-17   19 2019-06-17 2019-06-21     121
13 2019-06-18   18 2019-06-17 2019-06-21     121

You can further remove the 2 columns wk start wk end and set column date as index by:

df_out = df_out.drop(['wk start', 'wk end'], axis=1).set_index('date')

Result:

print(df_out)

            age  height
date                   
2019-06-10   20     120
2019-06-11   21     120
2019-06-17   19     121
2019-06-18   18     121

Upvotes: 4

sammywemmy

Reputation: 28709

Cartesian product are generally fast (if the data is not so large), but can also consume lots of memory (and in some cases, rather inefficient); an alternative would be intervalIndex :

interval_index = pd.IntervalIndex.from_tuples([*zip(df2['wk start'], df2['wk end'])])

index_position = interval_index.get_indexer(df1.index)

df1.assign(statistic = df2.height[index_position].array)

            age  statistic
date                      
2019-06-10   20        120
2019-06-11   21        120
2019-06-17   19        121
2019-06-18   18        121

Upvotes: 0

Alex

Reputation: 1

df[df.some_date.between(start_date, end_date)] you can isin method on the date column like that df[df["date"].isin(pd.date_range(start_date, end_date))] follow this Select DataFrame rows between two dates and have a look at this Assign values in one dataframe if date is in date range in another dataframe and projects are equal

Upvotes: 1

jacobscgc

Reputation: 31

You could loop through the dataframe and subset the second dataframe as you go.

import pandas as pd

import datetime

data1 = {'date': ['2019-06-10', '2019-06-11', '2019-06-17', '2019-06-18'], 'age': [20, 21, 19, 18]}

data1['date']=pd.to_datetime(data1['date'])

df1 = pd.DataFrame(data1)

df1.set_index('date', inplace=True)

data2 = {'wk start': ['2019-06-10', '2019-06-17', '2019-06-24', '2019-07-02'], 'wk end':[ '2019-06-14', '2019-06-21', '2019-06-28', '2019-07-05'], 'height': [120,121, 119, 118]}

data2['wk start']=pd.to_datetime(data2['wk start'])

data2['wk end']=pd.to_datetime(data2['wk end'])

df2 = pd.DataFrame(data2)

# Loop
list1 = []
for row in df1.iterrows():
    subdf = df2[(df2['wk start'] <= index) & (df2['wk end'] >= index)]
    list1.append(subdf['height'].tolist()[0])
df1['height'] = list1
print(df1)

The values differ a bit from the table you show above (statistic one) because the code you provided has different values and height rather than statistic, but the principle is the same.

Upvotes: 1

How to look up data in a separate dataframe (df2) based on date in df1 falling between date range values across two columns in df2

Answers (4)

Related Questions