Reputation: 317
Is there a way to select random samples based on the distribution of a column using Spark SQL? For example, for the dataframe below, I'd like to select a total of 6 rows: about 2 rows with prod_name = A, 2 rows with prod_name = B, and 2 rows with prod_name = C, because each product accounts for 1/3 of the data. Note that each product won't always account for exactly 1/3; this is just an example. Many thanks for your help.
prod_name | value
----------------------
A | 100
A | 200
A | 300
A | 400
B | 500
B | 600
B | 650
B | 700
C | 500
C | 600
C | 650
C | 700
Upvotes: 1
Views: 946
Reputation: 14845
Using sampleBy should do it. The required fraction for each prod_name can be calculated by dividing the expected number of rows by the actual number of rows; applying the same fraction to every prod_name keeps the sample roughly proportional to the original distribution:
df = ...
prods = df.select('prod_name').distinct().collect()
cnt = df.count()
expected_rows = 6  # change this number to adjust the size of the result
# same fraction for every prod_name, so the original distribution is preserved
fractions = {r['prod_name']: expected_rows / cnt for r in prods}
df.stat.sampleBy('prod_name', fractions).show()
Output:
+---------+-----+
|prod_name|value|
+---------+-----+
| A| 200|
| A| 400|
| B| 500|
| B| 700|
| C| 500|
| C| 700|
+---------+-----+
The size of the result might not exactly match expected_rows, because sampleBy performs an independent random draw for each row, so the row count will vary a bit from run to run. If an exact number of rows per group is required, see the sketch below.
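One way to get an exact count per prod_name (not part of the approach above, just a minimal sketch assuming df is the dataframe from the question and rows_per_group is a name introduced here) is to shuffle the rows within each group with rand() and keep the first few with row_number():
from pyspark.sql import functions as F
from pyspark.sql.window import Window

rows_per_group = 2  # hypothetical per-group target; 3 groups x 2 = 6 rows total

# assign a random order within each prod_name, then keep the first rows_per_group rows of each group
w = Window.partitionBy('prod_name').orderBy(F.rand())
df.withColumn('rn', F.row_number().over(w)) \
  .filter(F.col('rn') <= rows_per_group) \
  .drop('rn') \
  .show()
Unlike sampleBy, this returns exactly rows_per_group rows per product, provided each group has at least that many rows. If reproducibility of sampleBy itself is enough, it also accepts a seed argument, e.g. df.stat.sampleBy('prod_name', fractions, seed=42), though that still doesn't fix the size of the result.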
Upvotes: 1