Randomly Split DataFrame by Unique Values in One Column

Question

I have a PySpark DataFrame like the following:

+--------+--------+-----------+
| col1   |  col2  |  groupId  |
+--------+--------+-----------+
| val11  | val21  |   0       |
| val12  | val22  |   1       |
| val13  | val23  |   2       |
| val14  | val24  |   0       |
| val15  | val25  |   1       |
| val16  | val26  |   1       |
+--------+--------+-----------+

Each row has a groupId, and multiple rows can share the same groupId.

I want to randomly split this data into two datasets, but all rows with a given groupId must end up in the same split.

In other words, if d1.groupId = d2.groupId, then d1 and d2 must be in the same split.

For example:

# Split 1:
+--------+--------+-----------+
| col1   |  col2  |  groupId  |
+--------+--------+-----------+
| val11  | val21  |   0       |
| val13  | val23  |   2       |
| val14  | val24  |   0       |
+--------+--------+-----------+

# Split 2:
+--------+--------+-----------+
| col1   |  col2  |  groupId  |
+--------+--------+-----------+
| val12  | val22  |   1       |
| val15  | val25  |   1       |
| val16  | val26  |   1       |
+--------+--------+-----------+

What is a good way to do this in PySpark? Can I use the randomSplit method somehow?


Answers (1)
