Reputation: 391
I have a union dataframe from two sources. I want to add a flag column based on 'ID' and 'Source', The rule:
How do I create the flag column? I tried Windows Function, totally WRONG, Is there anyway I can combine Windows Function and When? Or is there a better way to add a flag col? I am stuck.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession as ss
from pyspark.sql.functions import when, col, lit, row_number
from pyspark.sql.window import Window
spark = ss.builder.appname('new').getOrCreate()
data = [(14, 'AA', None), (14, 'AA', None), (15, 'BB', None), (15, 'BB', 2), (16, 'AA', None), (16, 'AA', 1), (16, 'BB', None), (16, 'BB', 2), (17, 'AA', None), (17, 'AA', None), (17, 'BB', None), (17, 'BB', None)]
schema=StructType([StructField('ID', IntegerType(), False), StructField('Source', StringType(), False), StructField('rating', IntegerType(), True)])
df = Spark.createDataFrame(data, schema)
w = Window.partitionby('ID').orderBy('Source', 'rating')
df = df.withColumn('flag', row_number().over(w))
input table:
|ID|Source |rating
|--|-----|------
| 14 | AA |Null
| 14 | AA |NUll
| 15 | BB |Null
| 15 | BB |2
| 16 | AA |Null
| 16 | AA |1
| 16 | BB |Null
| 16 | BB |2
| 17 | AA |Null
| 17 | AA |Null
| 17 | BB |Null
| 17 | BB |Null
Ideally output table:
| ID| Source | rating| flag
| --| -----|------|-----
| 14 | AA |Null |N
| 14 | AA |NUll |N
| 15 | BB |Null |N
| 15 | BB |2 |Y
| 16 | AA |Null |N
| 16 | AA |1 |Y
| 16 | BB |Null |N
| 16 | BB |2 |Y
| 17 | AA |Null |Y
| 17 | AA |Null |Y
| 17 | BB |Null |Y
| 17 | BB |Null |Y
Upvotes: 0
Views: 103