I have created a pandas dataframe as follows:
import pandas as pd
import numpy as np
ds = {'col1': [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4],
      'col2': [12,3,4,5,4,3,2,3,4,6,7,8,3,3,65,4,3,2,32,1,2,3,4,5,32],
      }
df = pd.DataFrame(data=ds)
The dataframe looks as follows:
print(df)
col1 col2
0 1 12
1 1 3
2 1 4
3 1 5
4 1 4
5 1 3
6 1 2
7 2 3
8 2 4
9 2 6
10 2 7
11 3 8
12 3 3
13 3 3
14 3 65
15 3 4
16 4 3
17 4 2
18 4 32
19 4 1
20 4 2
21 4 3
22 4 4
23 4 5
24 4 32
Based on the values of column col1, I need to extract the following in PySpark
(not pandas):
3 random records where col1 == 1
2 random records where col1 == 2
2 random records where col1 == 3
3 random records where col1 == 4
Can anyone please help me with the PySpark code?