Reputation: 29
In this problem, you have two dataframes, one with the last price release, normally the present day. And in the other dataframe, we have all the launches.
The idea is that we can work with these two dataframes, in a way that the result is the result of the difference between the present day and the second most recent day of that price. Repeating the present day and ignoring the penultimate date. And the hardest part is that this difference needs to follow the periodicity pattern. So if the date type is Friday, the difference can only be from previous Fridays.
In a way that the lines are repeated, with the exception of the price which is not available.
First Dataframe:
import pandas as pd
data = {
'Type': ['Product1', 'Product2', 'Product3'],
'State': ['New York', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
df_1
Type State Date Price Name-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product2 Washington 25/03/2022 4.00 Friday (only)
2 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
Second Dataframe:
data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
df_2
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 04/03/2022 4.00 Friday (only)
2 Product1 New York 25/02/2022 4.00 Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 11/03/2022 3.00 Friday (only)
5 Product2 Washington 04/03/2022 2.00 Friday (only)
6 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
7 Product3 Illinois 16/03/2022 3.00 Monday, Wednesday, Friday (only)
8 Product3 Illinois 14/03/2022 4.00 Monday, Wednesday, Friday (only)
Desired Results
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 18/03/2022 NaN Friday (only)
2 Product1 New York 11/03/2022 NaN Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 18/03/2022 NaN Friday (only)
5 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
6 Product3 Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
7 Product3 Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
8 Product3 Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
Upvotes: 0
Views: 79
Reputation: 9379
There's a lot here, which also means there are several possible circumstances that could arise that may or may not be anticipated in this answer. For example, what if the date in df_1 for a given Type is not found in df_2, or there are no entries in df_2 for a given Type, etc.
With that caveat, here is some code that produces the desired results specified in the question:
import pandas as pd
import numpy as np
data = {
'Type': ['Product1', 'Product2', 'Product3'],
'State': ['New York', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
'''
Objective:
Create a dataframe which for each Type contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the Type's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''
dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByType = {}
def setFreqByType(row):
weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
if not weekdays:
raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
days = []
for w in weekdays:
if w not in dayStrToInt:
raise ValueError(f'Bad day-of-week string {w}')
days.append(dayStrToInt[w])
freqByType[row['Type']] = days
import datetime
datePriceListByType = []
def compileDatePriceByType(row):
curType = row['Type']
curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[df_2['Type']==row['Type']]['Date']]
allDateStrs = [dt.strftime('%d/%m/%Y') for dt in allDates]
minDate = min(allDates)
newDates = [curDate]
dt = curDate
days = freqByType[curType]
while dt > minDate:
curWD = dt.weekday()
nextWD = curWD
while nextWD not in days:
nextWD = (nextWD - 1) % 7
iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
if dt in allDates:
break
if dt > minDate:
newDates.append(dt)
datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
datePriceListByType.append(datePrice)
df_1.apply(setFreqByType, axis=1)
df_1.apply(compileDatePriceByType, axis=1)
df_result = df_1
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByType, columns=['Date', 'Price'])
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)
print(df_result)
Output:
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 18/03/2022 NaN Friday (only)
2 Product1 New York 11/03/2022 NaN Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 18/03/2022 NaN Friday (only)
5 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
6 Product3 Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
7 Product3 Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
8 Product3 Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
UPDATE: Having the key be (Type, Region) instead of just Type.
If there is a need to have the Type-Date (i.e., the weekly schedule) vary based on a multi-column key such as (Type, Region), this can be achieved as well. While it is possible to generalize this based on a list of key columns, I will just share an example that hardcodes the two columns Type and Region:
import pandas as pd
import numpy as np
data = {
'Type': ['Product1', 'Product1', 'Product2', 'Product3'],
'Region': ['Northeast', 'Southeast', 'Northwest', 'Midwest'],
'State': ['New York', 'Florida', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.50','4.00','4.00'],
'Type-Date':['Friday (only)','Tuesday, Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
print(f"df_1\n{df_1}")
data = {'Type': ['Product1', 'Product1', 'Product1','Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'Region': ['Northeast', 'Northeast', 'Northeast', 'Southeast', 'Southeast', 'Southeast', 'Northwest', 'Northwest', 'Northwest', 'Midwest', 'Midwest', 'Midwest'],
'State': ['New York', 'New York','New York', 'Florida', 'Florida', 'Florida', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.50','4.25','4.10','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
print(f"df_2\n{df_2}")
'''
Objective:
Create a dataframe which for each (Type, Region) pair contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the (Type, Region) pair's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''
dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByTypeRegion = {}
def setFreqByTypeRegion(row):
weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
if not weekdays:
raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
days = []
for w in weekdays:
if w not in dayStrToInt:
raise ValueError(f'Bad day-of-week string {w}')
days.append(dayStrToInt[w])
freqByTypeRegion[(row['Type'], row['Region'])] = days
import datetime
datePriceListByTypeRegion = []
def compileDatePriceByTypeRegion(row):
curTypeRegion = (row['Type'], row['Region'])
curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[(df_2['Type']==row['Type']) & (df_2['Region']==row['Region'])]['Date']]
allDateStrs = [dt.strftime('%d/%m/%Y') for dt in allDates]
minDate = min(allDates)
newDates = [curDate]
dt = curDate
days = freqByTypeRegion[curTypeRegion]
while dt > minDate:
curWD = dt.weekday()
nextWD = curWD
while nextWD not in days:
nextWD = (nextWD - 1) % 7
iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
if dt in allDates:
break
if dt > minDate:
newDates.append(dt)
datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
datePriceListByTypeRegion.append(datePrice)
df_1.apply(setFreqByTypeRegion, axis=1)
df_1.apply(compileDatePriceByTypeRegion, axis=1)
df_result = df_1
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByTypeRegion, columns=['Date', 'Price'])
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)
print(f"df_result\n{df_result}")
Output:
df_1
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
2 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
3 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
df_2
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Northeast New York 04/03/2022 4.00 Friday (only)
2 Product1 Northeast New York 25/02/2022 4.00 Friday (only)
3 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
4 Product1 Southeast Florida 04/03/2022 4.25 Tuesday, Friday (only)
5 Product1 Southeast Florida 25/02/2022 4.10 Tuesday, Friday (only)
6 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
7 Product2 Northwest Washington 11/03/2022 3.00 Friday (only)
8 Product2 Northwest Washington 04/03/2022 2.00 Friday (only)
9 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
10 Product3 Midwest Illinois 16/03/2022 3.00 Monday, Wednesday, Friday (only)
11 Product3 Midwest Illinois 14/03/2022 4.00 Monday, Wednesday, Friday (only)
df_result
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Northeast New York 18/03/2022 NaN Friday (only)
2 Product1 Northeast New York 11/03/2022 NaN Friday (only)
3 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
4 Product1 Southeast Florida 22/03/2022 NaN Tuesday, Friday (only)
5 Product1 Southeast Florida 18/03/2022 NaN Tuesday, Friday (only)
6 Product1 Southeast Florida 15/03/2022 NaN Tuesday, Friday (only)
7 Product1 Southeast Florida 11/03/2022 NaN Tuesday, Friday (only)
8 Product1 Southeast Florida 08/03/2022 NaN Tuesday, Friday (only)
9 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
10 Product2 Northwest Washington 18/03/2022 NaN Friday (only)
11 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
12 Product3 Midwest Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
13 Product3 Midwest Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
14 Product3 Midwest Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
Upvotes: 1