Reputation: 1155
I have two different PySpark dataframes that need to be merged into one, with some logic that needs to be coded for the merge. One of the dataframes has the schema (id, type, count), and the other one has the schema (id, timestamp, test1, test2, test3).
The first dataframe is created via a SQL "group by" query. There can be duplicate ids, but the type will be different for each duplicate, and there is an associated count for each (id, type) pair.
In the final (merged) schema, there will be a separate column for each type's count. The count data comes from the first dataframe.
An example final schema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)
The way I am doing it now is to use two for loops to build a dictionary; I start with an empty schema and use the dictionary to update it. If I do it this way, I am not really using Spark's features.
schema1: (id, type, count) -- type has the values type1, type2, type3
schema2: (id, timestamp, test1, test2, test3)
finalschema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)
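For example, with hypothetical rows:
schema1: (1, type1, 10), (1, type2, 5), (1, type3, 2)
schema2: (1, 2021-01-01, a, b, c)
finalschema: (1, 2021-01-01, a, b, c, 10, 5, 2)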
Does anyone have any suggestions on how this can be improved?
Thanks much in advance.
Upvotes: 0
Views: 537
Reputation: 873
You can use the PySpark pivot function to pivot the first dataframe before you join it with the second one.
Working example:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [[1, 'type1', 10],
     [1, 'type2', 10],
     [1, 'type3', 10]],
    schema=['id', 'type', 'quantity'])

df = df.groupBy('id').pivot('type').sum('quantity')
df.show()  # use display(df) in a Databricks notebook
You can change the aggregation as needed.
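To get all the way to the final schema from the question, you can join the pivoted frame with the second dataframe and rename the pivoted columns. A minimal sketch, assuming df2 holds the (id, timestamp, test1, test2, test3) data and using the typeNcount names from the question:

# df is the pivoted frame from the snippet above
merged = df2.join(df, on='id', how='left')
for t in ['type1', 'type2', 'type3']:
    merged = merged.withColumnRenamed(t, t + 'count')
merged.show()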
Upvotes: 2
Reputation: 1525
You can join the two dataframes on the id column once the first one has been pivoted (as shown in the answer above); below is a sample code snippet:
df1 schema (after pivoting) is (id, type1count, type2count, type3count).
df2 schema is (id, timestamp, test1, test2, test3).
merged_df = df2.join(df1, on=['id'], how='left_outer')
The merged dataframe then has the final schema (id, timestamp, test1, test2, test3, type1count, type2count, type3count).
Hope this will help.
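Note that ids present in df2 but absent from df1 will end up with null counts after a left outer join. Assuming you want zeros instead (column names are the ones from the question), you can fill them:

merged_df = merged_df.na.fill(0, subset=['type1count', 'type2count', 'type3count'])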
Upvotes: 0