I am working in PySpark and have a transaction table imported as a Spark DataFrame as follows:
User_ID  Date        Product_Name
-------  ----------  ------------
A        2019-11-30  Product 1
B        2019-10-20  Product 2
C        2019-10-01  Product 1
A        2019-12-01  Product 1
What I am trying to do is create a resulting table that, for each unique User_ID, determines whether that user has bought Product 1 more often than Product 2, and returns the string "Product 1" if so, or "Product 2" otherwise, in the second column of this new table.
I am finding this difficult to do in PySpark.
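For reference, a minimal snippet that reconstructs the sample table above as a Spark DataFrame; the variable name df and the SparkSession setup are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample transactions from the table above; the answer below assumes
# this DataFrame is named `df`.
df = spark.createDataFrame(
    [
        ("A", "2019-11-30", "Product 1"),
        ("B", "2019-10-20", "Product 2"),
        ("C", "2019-10-01", "Product 1"),
        ("A", "2019-12-01", "Product 1"),
    ],
    ["User_ID", "Date", "Product_Name"],
)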
Create a pivot table with columns for the counts of Product 1 and Product 2. Then the two columns can be compared.
df.groupBy("User_ID").pivot("Product_Name").agg(F.count("Product_Name")) \
.fillna(0) \
.withColumn("result", F.when(F.col("Product 1") > F.col("Product 2"), "Product 1").otherwise("Product 2")) \
.show()
Output:
+-------+---------+---------+---------+
|User_ID|Product 1|Product 2| result|
+-------+---------+---------+---------+
| B| 0| 1|Product 2|
| C| 1| 0|Product 1|
| A| 2| 0|Product 1|
+-------+---------+---------+---------+
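Note that ties (equal counts of both products) fall into the otherwise branch and return "Product 2"; adjust the condition if ties should be handled differently.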
For a more generic approach to calculating the result column (especially when there are more than two distinct products), please check this answer.
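As a rough sketch of what such a generic approach could look like, one common trick is to take an "argmax" across the pivoted count columns via greatest() over (count, name) structs; the variable names pivoted, product_cols, and result below are illustrative assumptions, not part of the linked answer:

import pyspark.sql.functions as F

# Same pivot as above, but without hard-coding the product columns.
pivoted = df.groupBy("User_ID").pivot("Product_Name") \
    .agg(F.count("Product_Name")).fillna(0)

# Every column except the grouping key is a product count column.
product_cols = [c for c in pivoted.columns if c != "User_ID"]

# greatest() over (count, name) structs returns the struct with the
# highest count; on equal counts the alphabetically later name wins.
result = pivoted.withColumn(
    "result",
    F.greatest(*[F.struct(F.col(c).alias("cnt"), F.lit(c).alias("name"))
                 for c in product_cols])["name"],
)
result.show()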