baatchen

Reputation: 489

PySpark: add a new column flagging a mismatch within a group

I need to create a new column called Check that shows Mismatch when a row's value differs from the rest of its group.

What I have now:

data = [
  ("Category2","File1",2,2),
  ("Category2","File2",2,2),
  ("Category2","File3",2,2),
  ("Category2","File4",5,2),
  ("Category1","File1",4,1),
  ("Category1","File2",4,1),
  ("Category1","File3",4,1),
  ("Category1","File4",4,1),
]

cols = ["Category","Filename","count","DistinctCount"]

df = spark.createDataFrame(data,cols)

df.show()


+---------+--------+-----+-------------+
| Category|Filename|count|DistinctCount|
+---------+--------+-----+-------------+
|Category2|   File1|    2|            2|
|Category2|   File2|    2|            2|
|Category2|   File3|    2|            2|
|Category2|   File4|    5|            2|
|Category1|   File1|    4|            1|
|Category1|   File2|    4|            1|
|Category1|   File3|    4|            1|
|Category1|   File4|    4|            1|
+---------+--------+-----+-------------+


Desired result:


+---------+--------+-----+-------------+---------+
| Category|Filename|count|DistinctCount|    Check|
+---------+--------+-----+-------------+---------+
|Category2|   File1|    2|            2|       OK|
|Category2|   File3|    2|            2|       OK|
|Category2|   File2|    2|            2|       OK|
|Category2|   File4|    5|            2| Mismatch|
|Category1|   File1|    4|            1|       OK|
|Category1|   File4|    4|            1|       OK|
|Category1|   File2|    4|            1|       OK|
|Category1|   File3|    4|            1|       OK|
+---------+--------+-----+-------------+---------+


I'm thinking of using a window function to group the rows by Category, but I'm stuck on how to write the logic for the mismatch.

Thank you!

/B

Upvotes: 3

Views: 781

Answers (2)

wwnde

Reputation: 26686

from pyspark.sql import Window
import pyspark.sql.functions as F

win = Window.partitionBy("count")

(df
    # group by count value and count rows in each group
    .withColumn("UniqueCount", F.count("DistinctCount").over(win))
    # a count value shared by only one row is a mismatch, else ok
    .withColumn("UniqueCount", F.when(F.col("UniqueCount") == 1, "mismatch").otherwise("ok"))
    # sort the dataframe
    .orderBy(F.asc("Category"))
    .show())


+---------+--------+-----+-------------+-----------+
| Category|Filename|count|DistinctCount|UniqueCount|
+---------+--------+-----+-------------+-----------+
|Category1|   File1|    4|            1|         ok|
|Category1|   File2|    4|            1|         ok|
|Category1|   File3|    4|            1|         ok|
|Category1|   File4|    4|            1|         ok|
|Category2|   File1|    2|            2|         ok|
|Category2|   File4|    5|            2|   mismatch|
|Category2|   File2|    2|            2|         ok|
|Category2|   File3|    2|            2|         ok|
+---------+--------+-----+-------------+-----------+

Upvotes: 2

werner

Reputation: 14905

The first step is to calculate per Category which value of count is ok and which value should be mapped to Mismatch.

The data can be grouped by Category, collecting for each distinct count value the number of rows that carry it. The resulting list of (occurrences, count value) pairs is then sorted by occurrences in descending order. If the list has size 1, every row is OK. Otherwise we assume the most frequent value is OK and all other values are not; only if the first and second occurrence counts are equal is no entry OK. This per-element decision is implemented via transform.

import pyspark.sql.functions as F

df_check = df.withColumnRenamed("count", "count_val") \
    .groupBy("Category", "count_val").count() \
    .groupBy("Category").agg(F.sort_array(F.collect_list(F.struct("count", "count_val")),False).alias("count")) \
    .withColumn("counts", F.expr("if(size(count)==1, array((count[0]['count_val'], 'OK')), \
        transform(count, (x,i)-> if( i == 0 and count[i+1]['count'] <> x['count'], \
        (x['count_val'], 'OK'),(x['count_val'],'Mismatch'))))")) \
    .withColumn("counts", F.explode("counts")) \
    .selectExpr("Category", "counts.col1 as count", "counts.col2 as Check")

df_check now contains

+---------+-----+--------+
| Category|count|   Check|
+---------+-----+--------+
|Category2|    2|      OK|
|Category2|    5|Mismatch|
|Category1|    4|      OK|
+---------+-----+--------+

The second step is to join the original df and df_check:

df.join(df_check, on=["Category", "count"], how="left_outer") \
    .orderBy("Category", "Filename") \
    .select("Category", "Filename", "count", "DistinctCount", "Check") \
    .show()

Result:

+---------+--------+-----+-------------+--------+
| Category|Filename|count|DistinctCount|   Check|
+---------+--------+-----+-------------+--------+
|Category1|   File1|    4|            1|      OK|
|Category1|   File2|    4|            1|      OK|
|Category1|   File3|    4|            1|      OK|
|Category1|   File4|    4|            1|      OK|
|Category2|   File1|    2|            2|      OK|
|Category2|   File2|    2|            2|      OK|
|Category2|   File3|    2|            2|      OK|
|Category2|   File4|    5|            2|Mismatch|
+---------+--------+-----+-------------+--------+

Upvotes: 1
