00robinette
00robinette

Reputation: 567

Pandas shift on quarterly data with missing quarters

I know there are some similar questions here, but please read on as I have looked and tried to adapt existing solutions without any luck. I have a data frame that pulls data for year and quarter. In the scenario shown below prevYearLeadCount displays data from Q1 2020. To be clear prevYearLeadCount will always display the lead count from the previous year in the same quarter. The below is just an example to show how the data is structured. Also, looking at the data below, since there is data for Q4 of 2019 I would expect 2020 Q4 prevYearLeadCount to equal 236

[
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2017,
        "quarter": 2,
        "leadCount": 151,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 2,
        "leadCount": 73,
        "prevYearLeadCount": 151.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 3,
        "leadCount": 271,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 4,
        "leadCount": 173,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 1,
        "leadCount": 209,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 2,
        "leadCount": 274,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 3,
        "leadCount": 311,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 4,
        "leadCount": 236,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 1,
        "leadCount": 245,
        "prevYearLeadCount": 209.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 2,
        "leadCount": 430,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 3,
        "leadCount": 907,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 4,
        "leadCount": 657,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2021,
        "quarter": 1,
        "leadCount": 609,
        "prevYearLeadCount": 245.0
    }
]

Looking at the data above, I would expect for the year of 2020 to look like this:

{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 1,
    "leadCount": 209,
    "prevYearLeadCount": 209.0
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 2,
    "leadCount": 430,
    "prevYearLeadCount": 274
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 3,
    "leadCount": 907,
    "prevYearLeadCount": 311
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 4,
    "leadCount": 657,
    "prevYearLeadCount": 236 
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2021,
    "quarter": 1,
    "leadCount": 609,
    "prevYearLeadCount": 245.0
}

As seen here I have tried the following:

df['prev_year_lead_count'] = df.groupby("quarter").lead_count.shift()[ (df.year == df.year.shift() + 1) ]

This is close as I get what I would expect in some cases, but not all. In some frames I see 0s where I should data definitely exist in the previous year and quarter. I am trying to do exactly what is seen here but each year is broken into quarters.

One more thing I have tried is combining python and pandas a bit. The idea here is to loop over existing years in the frame, and check the previous year to see if the quarter exist. If it does, do the pandas.

qs = [1, 2, 3, 4]
for year in leads_df["year"].unique():
    df = leads_df[leads_df["year"] == year - 1]
    for q in qs:
        if q in df["quarter"]:
            leads_df["prev_year_lead_count"] = leads_df.groupby("quarter")["lead_count"].shift(+1)
            leads_df["prev_year_cost"] = leads_df.groupby("quarter")["cost"].shift(+1)
            leads_df["prev_year_ga_spent"] = leads_df.groupby("quarter")["ga_spent"].shift(+1)
            leads_df["prev_year_fb_spent"] = leads_df.groupby("quarter")["fb_spent"].shift(+1)
            leads_df["prev_year_monthly_package_cost"] = leads_df.groupby("quarter")[
                "monthly_package_cost"
            ].shift(+1)
            leads_df["prev_year_cpl"] = leads_df.groupby("quarter")["cpl"].shift(+1)

Upvotes: 2

Views: 701

Answers (1)

Joe Ferndz
Joe Ferndz

Reputation: 8508

fix to check for prev year only

To solve for one year difference, we need to trick groupby. Here's how to do it.

import pandas as pd

df = pd.DataFrame(d)

#find difference between years for each quarter

df['yeardiff'] = df.groupby(['quarter'])['year'].transform(lambda x: x - x.shift())

#create a condition to pick only NaN and difference of 1 year
#this will eliminate 2 years or more

cond = (df['yeardiff'].isnull() | (df['yeardiff'] == 1.0))

#use this condition while doing the groupby
#If condition not met, it will default to NaN

df['newprevYearLeadCount'] = df[cond].groupby(['quarter'])['leadCount'].transform(lambda x: x.shift())

print (df[['year','quarter','leadCount','prevYearLeadCount', 'newprevYearLeadCount']])

Results of this looks like this:

I removed the entry for 2020 quarter 1. So 2021 quarter 1 should be NaN.

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        2        274                0.0                  73.0
6   2019        3        311                0.0                 271.0
7   2019        4        236                0.0                 173.0
8   2020        2        430                0.0                 274.0
9   2020        3        907                0.0                 311.0
10  2020        4        657                0.0                 236.0
11  2021        1        609              245.0                   NaN  #prev year LeadCount ignored

Another example with 2019 Quarter 2 excluded:

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        3        311                0.0                 271.0
6   2019        4        236                0.0                 173.0
7   2020        1        245              209.0                 209.0
8   2020        2        430                0.0                   NaN  #prev year LeadCount ignored
9   2020        3        907                0.0                 311.0
10  2020        4        657                0.0                 236.0
11  2021        1        609              245.0                 245.0

prev answer

You should be able to groupby(['quarter'], then do a shift() to get the results.

import pandas as pd
df = pd.DataFrame(d)
#df.sort_values(by=['quarter','year'],inplace=True)
#df.reset_index(drop=True,inplace=True)
df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift())
print (df[['year','quarter','leadCount','prevYearLeadCount', 'newprevYearLeadCount']])

The output of this will be:

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        2        274                0.0                  73.0
6   2019        3        311                0.0                 271.0
7   2019        4        236                0.0                 173.0
8   2020        1        245              209.0                 209.0
9   2020        2        430                0.0                 274.0
10  2020        3        907                0.0                 311.0
11  2020        4        657                0.0                 236.0
12  2021        1        609              245.0                 245.0

Initially I was going to sort_values by quarter then year but groupby takes care of it. So you need only groupby. transform takes care of assigning the value to each row.

If there are no records to pick the previous year Lead Count, then the value is set to NaN. You can decide to fillna(0) so it will get replaced with 0.0 instead.

If you need to have 0 instead of NaN, then do this:

df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift()).fillna(0)

Upvotes: 2

Related Questions