Random stratified sampling in PySpark

Question

I have created a pandas dataframe as follows:

import pandas as pd
import numpy as np

ds = {'col1' : [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4],
      'col2' : [12,3,4,5,4,3,2,3,4,6,7,8,3,3,65,4,3,2,32,1,2,3,4,5,32],
      }

df = pd.DataFrame(data=ds)

The dataframe looks as follows:

print(df)

    col1  col2
0      1    12
1      1     3
2      1     4
3      1     5
4      1     4
5      1     3
6      1     2
7      2     3
8      2     4
9      2     6
10     2     7
11     3     8
12     3     3
13     3     3
14     3    65
15     3     4
16     4     3
17     4     2
18     4    32
19     4     1
20     4     2
21     4     3
22     4     4
23     4     5
24     4    32

Based on the values of column col1, I need to extract the following in PySpark (not pandas):

3 random records where col1 == 1
2 random records where col1 == 2
2 random records where col1 == 3
3 random records where col1 == 4

Can anyone please help me with the PySpark code?

