Reputation: 1155
I have two different PySpark dataframes that need to be merged into one, with some logic that needs to be coded for the merge. One of the dataframes has the schema (id, type, count), and the other one has the schema (id, timestamp, test1, test2, test3).
The first dataframe is created via a SQL "group by" query. There can be duplicate ids, but the type will be different for each duplicate, and there is an associated count for each (id, type) pair.
In the final (merged) schema, there will be a separate column for each type's count. The count data comes from the first dataframe.
An example final schema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)
The way I am doing it now is to use two for loops to build a dictionary; I start with an empty schema and use the dictionary to update it. If I do it this way, I am not really using Spark's features.
schema1: (id, type, count) -- type has the values type1, type2, type3
schema2: (id, timestamp, test1, test2, test3)
finalschema: (id, timestamp, test1, test2, test3, type1count, type2count, type3count)
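For example, with hypothetical rows:
schema1: (1, type1, 10), (1, type2, 5), (1, type3, 2)
schema2: (1, 2021-01-01, a, b, c)
finalschema: (1, 2021-01-01, a, b, c, 10, 5, 2)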
Does anyone have any suggestions on how this can be improved?
Thanks much in advance.
Upvotes: 0
Views: 537
Reputation: 873
You can use the PySpark pivot function to pivot the first dataframe before you join it with the second one.
Working example:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [[1, 'type1', 10],
     [1, 'type2', 10],
     [1, 'type3', 10]],
    schema=['id', 'type', 'quantity'])

df = df.groupBy('id').pivot('type').sum('quantity')
df.show()  # use display(df) in a Databricks notebook
You can change the aggregation as needed.
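To get all the way to the final schema from the question, you can join the pivoted frame with the second dataframe and rename the pivoted columns. A minimal sketch, assuming df2 holds the (id, timestamp, test1, test2, test3) data and using the typeNcount names from the question:

# df is the pivoted frame from the snippet above
merged = df2.join(df, on='id', how='left')
for t in ['type1', 'type2', 'type3']:
    merged = merged.withColumnRenamed(t, t + 'count')
merged.show()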
Upvotes: 2
Reputation: 1525
You can join the two dataframes on the id column once the first one has been pivoted (as shown in the answer above); below is a sample code snippet:
df1 schema (after pivoting) is (id, type1count, type2count, type3count).
df2 schema is (id, timestamp, test1, test2, test3).
merged_df = df2.join(df1, on=['id'], how='left_outer')
The merged dataframe then has the final schema (id, timestamp, test1, test2, test3, type1count, type2count, type3count).
Hope this will help.
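Note that ids present in df2 but absent from df1 will end up with null counts after a left outer join. Assuming you want zeros instead (column names are the ones from the question), you can fill them:

merged_df = merged_df.na.fill(0, subset=['type1count', 'type2count', 'type3count'])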
Upvotes: 0