Reputation: 442
How can I sum all columns after unioning two dataframes?
I have this first df with one row per user:
df = sqlContext.createDataFrame(
    [("2022-01-10", 3, 2, "a"), ("2022-01-10", 3, 4, "b"), ("2022-01-10", 1, 3, "c")],
    ["date", "value1", "value2", "userid"]
)
df.show()
+----------+------+------+------+
|      date|value1|value2|userid|
+----------+------+------+------+
|2022-01-10|     3|     2|     a|
|2022-01-10|     3|     4|     b|
|2022-01-10|     1|     3|     c|
+----------+------+------+------+
The date value will always be today's date.
And I have another df, this time with multiple rows per userid, one value for each day:
df2 = sqlContext.createDataFrame(
    [("2022-01-01", 13, 12, "a"), ("2022-01-02", 13, 14, "b"), ("2022-01-03", 11, 13, "c"),
     ("2022-01-04", 3, 2, "a"), ("2022-01-05", 3, 4, "b"), ("2022-01-06", 1, 3, "c"),
     ("2022-01-10", 31, 21, "a"), ("2022-01-07", 31, 41, "b"), ("2022-01-09", 11, 31, "c")],
    ["date", "value3", "value4", "userid"]
)
df2.show()
+----------+------+------+------+
|      date|value3|value4|userid|
+----------+------+------+------+
|2022-01-01|    13|    12|     a|
|2022-01-02|    13|    14|     b|
|2022-01-03|    11|    13|     c|
|2022-01-04|     3|     2|     a|
|2022-01-05|     3|     4|     b|
|2022-01-06|     1|     3|     c|
|2022-01-10|    31|    21|     a|
|2022-01-07|    31|    41|     b|
|2022-01-09|    11|    31|     c|
+----------+------+------+------+
After unioning the two of them with this function, here is what I get:
import pyspark.sql.functions as f

def union_different_tables(df1, df2):
    # Add each dataframe's missing columns as null literals cast to the
    # matching type, then union by column name
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    data_types_df1 = [i.dataType for i in df1.schema.fields]
    data_types_df2 = [i.dataType for i in df2.schema.fields]
    for col, _type in zip(columns_df1, data_types_df1):
        if col not in df2.columns:
            df2 = df2.withColumn(col, f.lit(None).cast(_type))
    for col, _type in zip(columns_df2, data_types_df2):
        if col not in df1.columns:
            df1 = df1.withColumn(col, f.lit(None).cast(_type))
    union = df1.unionByName(df2)
    return union
+----------+------+------+------+------+------+
|      date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10|     3|     2|     a|  null|  null|
|2022-01-10|     3|     4|     b|  null|  null|
|2022-01-10|     1|     3|     c|  null|  null|
|2022-01-01|  null|  null|     a|    13|    12|
|2022-01-02|  null|  null|     b|    13|    14|
|2022-01-03|  null|  null|     c|    11|    13|
|2022-01-04|  null|  null|     a|     3|     2|
|2022-01-05|  null|  null|     b|     3|     4|
|2022-01-06|  null|  null|     c|     1|     3|
|2022-01-10|  null|  null|     a|    31|    21|
|2022-01-07|  null|  null|     b|    31|    41|
|2022-01-09|  null|  null|     c|    11|    31|
+----------+------+------+------+------+------+
What I want is the sum of all columns in df2 (I have 10 of them in the real case) up to today's date for each userid, so one row per user:
+----------+------+------+------+------+------+
|      date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10|     3|     2|     a|    47|    35|
|2022-01-10|     3|     4|     b|    47|    59|
|2022-01-10|     1|     3|     c|    23|    47|
+----------+------+------+------+------+------+
Since I have to do this operation for multiple tables, here is what I tried:
from pyspark.sql.window import Window

user_window = Window.partitionBy(['userid']).orderBy('date')
list_tables = [df2]
list_col_original = df.columns
for table in list_tables:
    df = union_different_tables(df, table)
    # Columns brought in by this table only
    list_column = list(set(table.columns) - set(list_col_original))
    list_col_original.extend(list_column)
    df = df.select('userid',
                   *[f.sum(f.col(col_name)).over(user_window).alias(col_name)
                     for col_name in list_column])
df.show()
+------+------+------+
|userid|value4|value3|
+------+------+------+
|     c|    13|    11|
|     c|    16|    12|
|     c|    47|    23|
|     c|    47|    23|
|     b|    14|    13|
|     b|    18|    16|
|     b|    59|    47|
|     b|    59|    47|
|     a|    12|    13|
|     a|    14|    16|
|     a|    35|    47|
|     a|    35|    47|
+------+------+------+
But that gives me a sort of cumulative sum, and I couldn't find a way to add all the columns to the resulting df.
One constraint: I can't do any joins! My dataframes are very large and any join takes too long to compute.
Do you know how I can fix my code to get the result I want?
Upvotes: 0
Views: 540
Reputation: 32720
After the union of df1 and df2, you can group by userid and sum all columns except date, for which you take the max.
Note that for the union part, you can actually use DataFrame.unionByName if the matching columns have the same data types; only the set of columns can differ:
df = df1.unionByName(df2, allowMissingColumns=True)
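(The allowMissingColumns parameter requires Spark 3.1+; on older versions you would keep a helper like your union_different_tables.)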
Then group by and agg:
import pyspark.sql.functions as F

result = df.groupBy("userid").agg(
    F.max("date").alias("date"),
    *[F.sum(c).alias(c) for c in df.columns if c not in ("date", "userid")]
)
result.show()
#+------+----------+------+------+------+------+
#|userid|      date|value1|value2|value3|value4|
#+------+----------+------+------+------+------+
#|     a|2022-01-10|     3|     2|    47|    35|
#|     b|2022-01-10|     3|     4|    47|    59|
#|     c|2022-01-10|     1|     3|    23|    47|
#+------+----------+------+------+------+------+
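Since you mention doing this for multiple tables, the same pattern extends to a list of dataframes, still with no join. A minimal sketch, assuming all tables share the userid and date columns (the tables list here is an assumption, not from your code):

from functools import reduce
import pyspark.sql.functions as F

# Hypothetical list of dataframes; extend with your other value tables
tables = [df1, df2]

# Chain unionByName across all tables, letting missing columns become null
unioned = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    tables
)

result = unioned.groupBy("userid").agg(
    F.max("date").alias("date"),
    *[F.sum(c).alias(c) for c in unioned.columns if c not in ("date", "userid")]
)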
This assumes df2 contains only dates up to today's date in df1. Otherwise, you'll need to filter df2 before the union.
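A minimal sketch of that filter, assuming the date column holds yyyy-MM-dd strings (to_date and current_date are standard pyspark.sql.functions):

import pyspark.sql.functions as F

# Keep only df2 rows dated up to today before the union
# (assumes "date" is parseable as yyyy-MM-dd)
df2_filtered = df2.filter(F.to_date("date") <= F.current_date())
df = df1.unionByName(df2_filtered, allowMissingColumns=True)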
Upvotes: 1