Reputation: 1877
I have read this SO post, but I still need a random sample.
I have a dataset like the following:
123456789
23458ef12
ef12345ea
111223345
I want to get some random lines from it, so I wrote the following PySpark code:
rdd = spark_context.textFile('a.tx').takeSample(False, 3)
rdd.saveAsTextFile('b.tx')
Since takeSample returns a list rather than an RDD, this fails with the error:
'list' object has no attribute 'saveAsTextFile'
Upvotes: 7
Views: 13720
Reputation: 35404
takeSample()
returns a plain Python list, so you need to parallelize it back into an RDD before you can save it:
rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3)).saveAsTextFile('b.tx')
But the better way is to use sample()
(here I am taking 30%), which returns an RDD directly:
rdd.sample(False, 0.3).saveAsTextFile('b.tx')
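Note that the two methods have different semantics: takeSample(False, n) gives you exactly n elements as a local list, while sample(False, fraction) keeps each element independently with the given probability, so the result size is only approximately that fraction and can vary run to run. A plain-Python analogy (no Spark needed, using the sample dataset from the question) sketches the difference:

```python
import random

lines = ["123456789", "23458ef12", "ef12345ea", "111223345"]

# takeSample(False, 3) behaves like random.sample:
# exactly 3 elements, returned as a local list
picked = random.sample(lines, 3)
assert len(picked) == 3

# sample(False, 0.3) behaves like a per-element Bernoulli draw:
# each line is kept with probability 0.3, so the result size
# varies around 30% of the input and can even be empty
random.seed(0)  # seed for reproducibility of this sketch
kept = [line for line in lines if random.random() < 0.3]
print(picked, kept)
```

So if you need an exact count, use takeSample and parallelize the result; if an approximate fraction is fine, sample keeps everything distributed.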
Upvotes: 6
Reputation: 1048
If you need to start from a pure Python list, such as the result of calling .collect()
on a PySpark dataframe, you can use the following function:
def write_lists_to_hdfs_textfile(ss, python_list, hdfs_filename):
'''
:param ss : SparkSession Object
:param python_list: simple list in python. Can be a result of .collect() on pyspark dataframe.
:param hdfs_filename : the path of hdfs filename to write
:return: None
'''
# First need to convert the list to parallel RDD
rdd_list = ss.sparkContext.parallelize(python_list)
# Use the map function to write one element per line and write all elements to a single file (coalesce)
rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(hdfs_filename)
return None
E.g.:
write_lists_to_hdfs_textfile(ss,[5,4,1,18],"/test_file.txt")
Upvotes: 1