Reputation: 1188
Given a PySpark dataframe, I want to split it into two dataframes based on two of its columns: one where the combination of ColA and ColB is unique, and one where it is non-unique.
So for example:
ColA | ColB | ColCDoesntMatter
1 | 2 | 4
1 | 3 | 3
2 | 2 | 2
1 | 2 | 6
2 | 2 | 8
2 | 3 | 9
The unique A/B pairings are:
1-3
2-3
The non-unique A/B pairings are:
1-2
2-2
So one dataframe would contain all rows with unique A/B values, and the other all rows with non-unique A/B values. ColC doesn't matter for the filter, but it needs to be retained. How can I run this filter?
Edit: Note that I can't use dropDuplicates, because it's the combination of ColA and ColB that needs to be unique, and dropDuplicates would keep one row from each duplicated pair rather than separating the unique rows from the non-unique ones.
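For reference, the sample data above can be reproduced with something like the following. This is a minimal sketch: the local SparkSession setup and the shortened column name ColC (matching the output shown in the answer below) are assumptions.
from pyspark.sql import SparkSession

# assumption: a plain local SparkSession; adjust to your environment
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 2, 4), (1, 3, 3), (2, 2, 2), (1, 2, 6), (2, 2, 8), (2, 3, 9)],
    ["ColA", "ColB", "ColC"],
)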
Upvotes: 3
Views: 2266
Reputation: 215057
# count rows per (ColA, ColB) pair with a window function
dfWithRowCount = df.selectExpr(
    "*", "count(*) over (partition by ColA, ColB) as nrows"
)
# pairs occurring exactly once are unique; the rest are non-unique
uniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows == 1).drop('nrows')
nonUniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows > 1).drop('nrows')
nonUniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 2| 2| 2|
| 2| 2| 8|
| 1| 2| 4|
| 1| 2| 6|
+----+----+----+
uniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 1| 3| 3|
| 2| 3| 9|
+----+----+----+
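Equivalently, the same window count can be written with the DataFrame API instead of selectExpr; a sketch assuming the same df as above:
from pyspark.sql import Window
from pyspark.sql import functions as F

# same logic as the selectExpr version: count rows per (ColA, ColB) pair
w = Window.partitionBy("ColA", "ColB")
dfWithRowCount = df.withColumn("nrows", F.count("*").over(w))

uniqueAB = dfWithRowCount.filter(F.col("nrows") == 1).drop("nrows")
nonUniqueAB = dfWithRowCount.filter(F.col("nrows") > 1).drop("nrows")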
Upvotes: 2