Reputation: 4482
I have a Spark dataframe that looks like this (built here with pandas just to show the data):
import pandas as pd
foo = pd.DataFrame({'id':['abc', 'abc', 'abc', 'abc', 'de', 'de', 'opqrs', 'opqrs', 'opqrs', 'opqrs', 'opqrs'], 'value': [1,2,3,4,5,6,7,8,9,10,11]})
I would like to randomly select 30% of the unique ids, and keep the slice of the original dataframe that contains those ids.
How could I do that in PySpark?
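Just to make the intended result concrete, this is roughly what I would do in plain pandas on the foo dataframe above (the variable names are only illustrative); I am looking for the PySpark equivalent:
# pick 30% of the unique ids at random
sampled_ids = pd.Series(foo['id'].unique()).sample(frac=0.3)
# keep only the rows of foo whose id was drawn
subset = foo[foo['id'].isin(sampled_ids)]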
Upvotes: 1
Views: 866
Reputation: 1712
You can use the distinct and sample functions of PySpark, followed by a join.
tst = sqlContext.createDataFrame([(1,2),(1,3),(1,9),(2,4),(2,10),(3,5),(2,9),(3,6),(3,8),(4,9),(4,5),(5,1)], schema=['a','b'])
# sample roughly 30% of the distinct keys; without replacement so a key cannot be drawn twice
tst_s = tst.select('a').distinct().sample(withReplacement=False, fraction=0.3)
# join back to the original dataframe to keep only the rows with a sampled key
tst_res = tst_s.join(tst, on='a', how='left')
But Spark places a warning on the sample function: "Note: This is not guaranteed to provide exactly the fraction specified of the total count of the given DataFrame." Is this ok, or do you need exactly 30%?
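If you do need exactly 30% of the distinct keys, a minimal sketch (assuming the same tst dataframe; the rand-based shuffle and the leftsemi join are just one way to do it) would be to count the distinct keys and take an exact number of them:
from pyspark.sql import functions as F

# count the distinct keys and work out how many of them make up exactly 30%
distinct_a = tst.select('a').distinct()
n_keys = int(distinct_a.count() * 0.3)

# shuffle the keys randomly and keep exactly n_keys of them
sampled_keys = distinct_a.orderBy(F.rand()).limit(n_keys)

# leftsemi keeps only the rows of tst whose key was sampled, without adding columns
tst_exact = tst.join(sampled_keys, on='a', how='leftsemi')
Note that this triggers an extra count and a shuffle, so the approximate sample above is usually cheaper on large data.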
Upvotes: 1