stackq
stackq

Reputation: 491

Quarter to date growth

I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below but with many id1s on each day.

| yyyy_mm_dd | id1 | id2  | cost  |
|------------|-----|------|-------|
| 2020-01-01 | 23  | 7253 | 5003  |
| 2020-01-01 | 23  | 7743 | 30340 |
| 2020-01-02 | 23  | 7253 | 450   |
| 2020-01-02 | 23  | 7743 | 4500  |
| ...        | ... | ...  | ...   |
| 2021-01-01 | 23  | 7253 | 5675  |
| 2021-01-01 | 23  | 134  | 1030  |
| 2021-01-01 | 23  | 3445 | 564   |
| 2021-01-01 | 23  | 4534 | 345   |
| ...        | ... | ...  | ...   |

I have grouped and calculated the summed cost like so:

grouped_quarterly = (
    df
    .withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd'))
    .groupby('id1', 'year_quarter')
    .agg(
        F.sum('cost').alias('cost')
    )
)

I am able to then successfully make a quarter over quarter comparison like so:

w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', F.lag(F.col('cost')).over(w))
    .withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value')))
).where(F.col('year_quarter') >= 202101)

I would like to modify this to be quarter to date instead of quarter over quarter. For example, the above would compare April 1st 2020 - June 30th 2020 with April 1st 2020 - April 15th 2021 (or whatever maximum date in df is).

Instead, I would prefer to compare April 1st 2020 - April 15th 2020 with April 1st 2021 - April 15th 2021.

Is it possible to ensure only the same periods are compared within year_quarter?

Edit: Adding sample output:


grouped_quarterly.where(F.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost  |
|-----|--------------|-------|
| 222 | 202001       | 49428 |
| 222 | 202002       | 43292 |
| 222 | 202003       | 73928 |
| 222 | 202004       | 12028 |
| 222 | 202101       | 19382 |
| 222 | 202102       | 4282  |

growth.where(F.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost  | prev_value | diff   | growth |
|-----|--------------|-------|------------|--------|--------|
| 222 | 202101       | 52494 | 49428      | 3066   | 6.20   |
| 222 | 202102       | 4282  | 43292      | -39010 | -90.10 |

The growth calculation from the window is being done correctly. However, since 202102 is in progress, it gets compared to the full 202002. The comparison for 202101 works perfectly as both year_quarters are complete.

Is there anyway to ensure the window function only compares the same period within the year_quarter with the previous year, for incomplete quarters? I hope the sample data makes my question a bit more clear

Upvotes: 8

Views: 322

Answers (2)

werner
werner

Reputation: 14845

The idea is to split the task into two parts:

  1. Calculate the growth for the complete quarters. This logic is completely taken over from the question and then
  2. calculate the growth for the currently running quarter.

First generate some additional test data for 2019Q2, 2020Q2 and 2021Q2:

data = [('2019-04-01', 23, 1), ('2019-04-01', 23, 2), ('2019-04-02', 23, 3), ('2019-04-15', 23, 4),
        ('2019-04-16', 23, 5), ('2019-04-17', 23, 6), ('2019-05-01', 23, 7), ('2019-06-30', 23, 8),
        ('2019-07-01', 23, 9), ('2020-01-01',23,5003),('2020-01-01',23,30340), ('2020-01-02',23,450),
        ('2020-01-02',23,4500), ('2020-04-01', 23, 10), ('2020-04-01', 23, 20), ('2020-04-02', 23, 30),
        ('2020-04-15', 23, 40), ('2020-04-16', 23, 50), ('2020-04-17', 23, 60), ('2020-05-01', 23, 70),
        ('2020-06-30', 23, 80), ('2020-07-01', 23, 90), ('2021-01-01',23,5675), ('2021-01-01',23,1030),
        ('2021-01-01',23,564), ('2021-01-01',23,345), ('2021-04-01', 23, -10), ('2021-04-01', 23, -20),
        ('2021-04-02', 23, -30), ('2021-04-15', 23, -40)]

Calcuate the year_quarter column and cache the result:

df = spark.createDataFrame(data=data, schema = ["yyyy_mm_dd", "id1", "cost"]) \
    .withColumn("yyyy_mm_dd", F.to_date("yyyy_mm_dd", "yyyy-MM-dd")) \
    .withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd')))) \
    .cache()

Get the maximum date and its corresponding quarter:

max_row = df.selectExpr("max(yyyy_mm_dd)", "max_by(year_quarter, yyyy_mm_dd)").head()
cur_date, cur_quarter = max_row[0], max_row[1]

It is not strictly necessary to set cur_date to the maximum date of the data. Instead cur_date and cur_quarter could also be set manually.

For all quarters but the current one apply the logic given in the question:

w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
df_full_quarters = df.filter(f"year_quarter <> {cur_quarter}") \
    .groupby('id1', 'year_quarter') \
    .agg(F.sum('cost').alias('cost')) \
    .withColumn('prev_value', F.lag(F.col('cost')).over(w))

For the current quarter filter out all dates in the previous year that should be ignored:

df_cur_quarter = df.filter(f"year_quarter = {cur_quarter} or (year_quarter = {cur_quarter - 100} and add_months(yyyy_mm_dd, 12) <= '{cur_date}')") \
    .groupby('id1', 'year_quarter') \
    .agg(F.sum('cost').alias('cost')) \
    .withColumn('prev_value', F.lag(F.col('cost')).over(w)) \
    .filter(f"year_quarter = {cur_quarter}")

Finally union the two parts and calculate the diff column:

growth = df_full_quarters.union(df_cur_quarter) \
    .withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value'))) \
    .orderBy("id1", "year_quarter")

The result will be:

+---+------------+-----+----------+------+                                      
|id1|year_quarter| cost|prev_value|  diff|
+---+------------+-----+----------+------+
| 23|      201902|   36|      null|     0|
| 23|      201903|    9|      null|     0|
| 23|      202001|40293|      null|     0|
| 23|      202002|  360|        36|   324|
| 23|      202003|   90|         9|    81|
| 23|      202101| 7614|     40293|-32679|
| 23|      202102| -100|       100|  -200|
+---+------------+-----+----------+------+

In this example, for the comparison of 2021Q2 with the previous year the sum for 2020Q2 is given as 100, but the actual value for the full 2020Q2 is 360.

Upvotes: 1

user7367561
user7367561

Reputation:

If you want quarter to date comparison YoY but quarter is incomplete, then do agg by

dayofmonth(col("input")).alias("dayofmonth")

if current quarter being compared to is equal to current month of current year maybe with .agg(when(col("date_column") condition exp))

Some more insights here

Upvotes: 1

Related Questions