Reputation: 1616
My goal is to aggregate over the customerID (count), create a new column, and mark customers who often return an article. How can I do that (using Databricks, PySpark)?
train.select("itemID","customerID","returnShipment").show(10)
+------+----------+--------------+
|itemID|customerID|returnShipment|
+------+----------+--------------+
| 186| 794| 0|
| 71| 794| 1|
| 71| 794| 1|
| 32| 850| 1|
| 32| 850| 1|
| 57| 850| 1|
| 2| 850| 1|
| 259| 850| 1|
| 603| 850| 1|
| 259| 850| 1|
+------+----------+--------------+
Upvotes: 0
Views: 40
Reputation: 14845
You can define a threshold value and then compare it to the sum of returnShipment for each customerID:
from pyspark.sql import functions as F

threshold = 5

# Sum the returnShipment flags per customer; any total above the
# threshold marks that customer as a frequent returner
train.groupBy("customerID") \
    .sum("returnShipment") \
    .withColumn("mark", F.col("sum(returnShipment)") > threshold) \
    .show()
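If you would rather keep every original row and just attach the flag, a window function sums returnShipment per customerID without collapsing the DataFrame. A minimal sketch (the threshold of 5 and the total_returns column name are assumptions, not part of the original answer):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

threshold = 5  # assumed cutoff; tune to your data

# Attach the per-customer total of returns to every row
w = Window.partitionBy("customerID")
marked = train.withColumn("total_returns", F.sum("returnShipment").over(w)) \
              .withColumn("mark", F.col("total_returns") > threshold)

marked.show(10)

This keeps the itemID-level rows intact, so the mark column can be used directly as a per-row feature on the training data.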
Upvotes: 1