Reputation: 567
I know there are some similar questions here, but please read on as I have looked and tried to adapt existing solutions without any luck. I have a data frame that pulls data for year and quarter. In the scenario shown below prevYearLeadCount
displays data from Q1 2020. To be clear
prevYearLeadCount
will always display the lead count from the previous year in the same quarter. The below is just an example to show how the data is structured. Also, looking at the data below, since there is data for Q4 of 2019 I would expect 2020 Q4 prevYearLeadCount
to equal 236
[
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2017,
"quarter": 2,
"leadCount": 151,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2018,
"quarter": 2,
"leadCount": 73,
"prevYearLeadCount": 151.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2018,
"quarter": 3,
"leadCount": 271,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2018,
"quarter": 4,
"leadCount": 173,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2019,
"quarter": 1,
"leadCount": 209,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2019,
"quarter": 2,
"leadCount": 274,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2019,
"quarter": 3,
"leadCount": 311,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2019,
"quarter": 4,
"leadCount": 236,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 1,
"leadCount": 245,
"prevYearLeadCount": 209.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 2,
"leadCount": 430,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 3,
"leadCount": 907,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 4,
"leadCount": 657,
"prevYearLeadCount": 0.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2021,
"quarter": 1,
"leadCount": 609,
"prevYearLeadCount": 245.0
}
]
Looking at the data above, I would expect for the year of 2020 to look like this:
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 1,
"leadCount": 209,
"prevYearLeadCount": 209.0
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 2,
"leadCount": 430,
"prevYearLeadCount": 274
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 3,
"leadCount": 907,
"prevYearLeadCount": 311
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2020,
"quarter": 4,
"leadCount": 657,
"prevYearLeadCount": 236
},
{
"salesforceAccountId": 3148,
"accountName": "Account Name",
"year": 2021,
"quarter": 1,
"leadCount": 609,
"prevYearLeadCount": 245.0
}
As seen here I have tried the following:
df['prev_year_lead_count'] = df.groupby("quarter").lead_count.shift()[ (df.year == df.year.shift() + 1) ]
This is close as I get what I would expect in some cases, but not all. In some frames I see 0s where I should data definitely exist in the previous year and quarter. I am trying to do exactly what is seen here but each year is broken into quarters.
One more thing I have tried is combining python and pandas a bit. The idea here is to loop over existing years in the frame, and check the previous year to see if the quarter exist. If it does, do the pandas.
qs = [1, 2, 3, 4]
for year in leads_df["year"].unique():
df = leads_df[leads_df["year"] == year - 1]
for q in qs:
if q in df["quarter"]:
leads_df["prev_year_lead_count"] = leads_df.groupby("quarter")["lead_count"].shift(+1)
leads_df["prev_year_cost"] = leads_df.groupby("quarter")["cost"].shift(+1)
leads_df["prev_year_ga_spent"] = leads_df.groupby("quarter")["ga_spent"].shift(+1)
leads_df["prev_year_fb_spent"] = leads_df.groupby("quarter")["fb_spent"].shift(+1)
leads_df["prev_year_monthly_package_cost"] = leads_df.groupby("quarter")[
"monthly_package_cost"
].shift(+1)
leads_df["prev_year_cpl"] = leads_df.groupby("quarter")["cpl"].shift(+1)
Upvotes: 2
Views: 701
Reputation: 8508
To solve for one year difference, we need to trick groupby. Here's how to do it.
import pandas as pd
df = pd.DataFrame(d)
#find difference between years for each quarter
df['yeardiff'] = df.groupby(['quarter'])['year'].transform(lambda x: x - x.shift())
#create a condition to pick only NaN and difference of 1 year
#this will eliminate 2 years or more
cond = (df['yeardiff'].isnull() | (df['yeardiff'] == 1.0))
#use this condition while doing the groupby
#If condition not met, it will default to NaN
df['newprevYearLeadCount'] = df[cond].groupby(['quarter'])['leadCount'].transform(lambda x: x.shift())
print (df[['year','quarter','leadCount','prevYearLeadCount', 'newprevYearLeadCount']])
Results of this looks like this:
I removed the entry for 2020 quarter 1. So 2021 quarter 1 should be NaN.
year quarter leadCount prevYearLeadCount newprevYearLeadCount
0 2017 2 151 0.0 NaN
1 2018 2 73 151.0 151.0
2 2018 3 271 0.0 NaN
3 2018 4 173 0.0 NaN
4 2019 1 209 0.0 NaN
5 2019 2 274 0.0 73.0
6 2019 3 311 0.0 271.0
7 2019 4 236 0.0 173.0
8 2020 2 430 0.0 274.0
9 2020 3 907 0.0 311.0
10 2020 4 657 0.0 236.0
11 2021 1 609 245.0 NaN #prev year LeadCount ignored
Another example with 2019 Quarter 2 excluded:
year quarter leadCount prevYearLeadCount newprevYearLeadCount
0 2017 2 151 0.0 NaN
1 2018 2 73 151.0 151.0
2 2018 3 271 0.0 NaN
3 2018 4 173 0.0 NaN
4 2019 1 209 0.0 NaN
5 2019 3 311 0.0 271.0
6 2019 4 236 0.0 173.0
7 2020 1 245 209.0 209.0
8 2020 2 430 0.0 NaN #prev year LeadCount ignored
9 2020 3 907 0.0 311.0
10 2020 4 657 0.0 236.0
11 2021 1 609 245.0 245.0
You should be able to groupby(['quarter']
, then do a shift()
to get the results.
import pandas as pd
df = pd.DataFrame(d)
#df.sort_values(by=['quarter','year'],inplace=True)
#df.reset_index(drop=True,inplace=True)
df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift())
print (df[['year','quarter','leadCount','prevYearLeadCount', 'newprevYearLeadCount']])
The output of this will be:
year quarter leadCount prevYearLeadCount newprevYearLeadCount
0 2017 2 151 0.0 NaN
1 2018 2 73 151.0 151.0
2 2018 3 271 0.0 NaN
3 2018 4 173 0.0 NaN
4 2019 1 209 0.0 NaN
5 2019 2 274 0.0 73.0
6 2019 3 311 0.0 271.0
7 2019 4 236 0.0 173.0
8 2020 1 245 209.0 209.0
9 2020 2 430 0.0 274.0
10 2020 3 907 0.0 311.0
11 2020 4 657 0.0 236.0
12 2021 1 609 245.0 245.0
Initially I was going to sort_values by quarter
then year
but groupby takes care of it. So you need only groupby. transform takes care of assigning the value to each row.
If there are no records to pick the previous year Lead Count, then the value is set to NaN. You can decide to fillna(0)
so it will get replaced with 0.0 instead.
If you need to have 0 instead of NaN, then do this:
df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift()).fillna(0)
Upvotes: 2