Reputation: 317
Is there a way to select random samples based on the distribution of a column using Spark SQL? For example, for the dataframe below, I'd like to select a total of 6 rows: about 2 rows with prod_name = A, 2 rows with prod_name = B, and 2 rows with prod_name = C, because each product accounts for 1/3 of the data. Note that each product won't always account for exactly 1/3; this is just an example. Many thanks for your help.
prod_name | value
----------------------
A | 100
A | 200
A | 300
A | 400
B | 500
B | 600
B | 650
B | 700
C | 500
C | 600
C | 650
C | 700
Upvotes: 1
Views: 946
Reputation: 14845
Using sampleBy should do it. The required fraction for each prod_name can be calculated by dividing the expected number of rows by the actual number of rows; applying the same fraction to every prod_name keeps the sample roughly proportional to the original distribution:
df = ...
prods = df.select('prod_name').distinct().collect()
cnt = df.count()
expected_rows = 6  # change this number to adjust the size of the result
# same fraction for every prod_name, so the original distribution is preserved
fractions = {r['prod_name']: expected_rows / cnt for r in prods}
df.stat.sampleBy('prod_name', fractions).show()
Output:
+---------+-----+
|prod_name|value|
+---------+-----+
| A| 200|
| A| 400|
| B| 500|
| B| 700|
| C| 500|
| C| 700|
+---------+-----+
The size of the result might not exactly match expected_rows, because sampleBy performs an independent random draw for each row, so the row count will vary a bit from run to run. If an exact number of rows per group is required, see the sketch below.
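One way to get an exact count per prod_name (not part of the approach above, just a minimal sketch assuming df is the dataframe from the question and rows_per_group is a name introduced here) is to shuffle the rows within each group with rand() and keep the first few with row_number():
from pyspark.sql import functions as F
from pyspark.sql.window import Window

rows_per_group = 2  # hypothetical per-group target; 3 groups x 2 = 6 rows total

# assign a random order within each prod_name, then keep the first rows_per_group rows of each group
w = Window.partitionBy('prod_name').orderBy(F.rand())
df.withColumn('rn', F.row_number().over(w)) \
  .filter(F.col('rn') <= rows_per_group) \
  .drop('rn') \
  .show()
Unlike sampleBy, this returns exactly rows_per_group rows per product, provided each group has at least that many rows. If reproducibility of sampleBy itself is enough, it also accepts a seed argument, e.g. df.stat.sampleBy('prod_name', fractions, seed=42), though that still doesn't fix the size of the result.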
Upvotes: 1