How to subset/create a new PySpark DF based on values from two columns

Question

How do we select the CID values from the following df, keeping only the rows whose combination of ID and TS is the same as that of at least one other row?

df =

ID	CID	TS
A	C1	t1
A	C2	t1
A	C3	t2
B	C4	t2

Required output DF:

CID
C1
C2

Thank you.

mck · Accepted Answer

You can count the rows in each partition of ID and TS with a window function, then keep only the rows whose count is greater than or equal to 2.

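from pyspark.sql import functions as F, Window

# Attach to every row the number of rows sharing its (ID, TS) pair,
# keep the rows whose pair occurs at least twice, and return only CID.
df2 = df.withColumn(
    'cnt',
    F.count('*').over(Window.partitionBy('ID', 'TS'))
).filter('cnt >= 2').select('CID')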

df2.show()
+---+
|CID|
+---+
| C1|
| C2|
+---+
