quant

Reputation: 4482

How to randomly sample 30% of the ids in a Spark dataframe, when each id appears more than once

I have a spark dataframe that looks like this:

import pandas as pd
foo = pd.DataFrame({'id':['abc', 'abc', 'abc', 'abc', 'de', 'de', 'opqrs', 'opqrs', 'opqrs', 'opqrs', 'opqrs'], 'value': [1,2,3,4,5,6,7,8,9,10,11]})
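
For reference, the pandas frame above can be turned into a Spark dataframe like this (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
foo_spark = spark.createDataFrame(foo)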

I would like to randomly select 30% of the unique ids, and keep the slice of the original dataframe that contains these ids.

How could I do that in PySpark?

Upvotes: 1

Views: 866

Answers (1)

Raghu

Reputation: 1712

You can use the distinct and sample functions of PySpark, followed by a join:

tst = sqlContext.createDataFrame([(1,2),(1,3),(1,9),(2,4),(2,10),(3,5),(2,9),(3,6),(3,8),(4,9),(4,5),(5,1)], schema=['a','b'])
# sample roughly 30% of the distinct ids; without replacement, so no id is picked twice
tst_s = tst.select('a').distinct().sample(withReplacement=False, fraction=0.3)
# keep only the rows of the original dataframe whose id was sampled
tst_res = tst_s.join(tst, on='a', how='left')

But note that Spark documents the sample function with a caveat: "This is not guaranteed to provide exactly the fraction specified of the total count of the given DataFrame." Is this OK, or do you need exactly 30%?
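
If you do need exactly 30%, a sketch of one workaround (not part of the original answer, and it forces a count plus a global shuffle of the distinct ids) is to take an exact number of ids under a random ordering:

from pyspark.sql import functions as F

ids = tst.select('a').distinct()
n_exact = int(ids.count() * 0.3)  # exact number of distinct ids to keep
# order the distinct ids randomly, take exactly n_exact of them, then join back
tst_exact = ids.orderBy(F.rand()).limit(n_exact).join(tst, on='a', how='inner')

This is fine when the number of distinct ids is modest, since orderBy(F.rand()) has to sort all of them.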

Upvotes: 1
