Reputation: 1877
I have read this SO post, but I still need a random sample.
I have a dataset like the following:
123456789
23458ef12
ef12345ea
111223345
I want to get some random lines from it, so I wrote the following PySpark code:
rdd = spark_context.textFile('a.tx').takeSample(False, 3)
rdd.saveAsTextFile('b.tx')
Since takeSample returns a list rather than an RDD, this fails with the error:
'list' object has no attribute 'saveAsTextFile'
Upvotes: 7
Views: 13720
Reputation: 35404
takeSample()
returns a plain Python list, so you need to parallelize it back into an RDD before you can save it:
rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3)).saveAsTextFile('b.tx')
But the better way is to use sample()
(here I am taking 30%), which returns an RDD directly:
rdd.sample(False, 0.3).saveAsTextFile('b.tx')
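Note that the two methods have different semantics: takeSample(False, n) gives you exactly n elements as a local list, while sample(False, fraction) keeps each element independently with the given probability, so the result size is only approximately that fraction and can vary run to run. A plain-Python analogy (no Spark needed, using the sample dataset from the question) sketches the difference:

```python
import random

lines = ["123456789", "23458ef12", "ef12345ea", "111223345"]

# takeSample(False, 3) behaves like random.sample:
# exactly 3 elements, returned as a local list
picked = random.sample(lines, 3)
assert len(picked) == 3

# sample(False, 0.3) behaves like a per-element Bernoulli draw:
# each line is kept with probability 0.3, so the result size
# varies around 30% of the input and can even be empty
random.seed(0)  # seed for reproducibility of this sketch
kept = [line for line in lines if random.random() < 0.3]
print(picked, kept)
```

So if you need an exact count, use takeSample and parallelize the result; if an approximate fraction is fine, sample keeps everything distributed.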
Upvotes: 6
Reputation: 1048
If you need to start from a pure Python list, such as the result of calling .collect()
on a PySpark dataframe, you can use the following function:
def write_lists_to_hdfs_textfile(ss, python_list, hdfs_filename):
'''
:param ss : SparkSession Object
:param python_list: simple list in python. Can be a result of .collect() on pyspark dataframe.
:param hdfs_filename : the path of hdfs filename to write
:return: None
'''
# First need to convert the list to parallel RDD
rdd_list = ss.sparkContext.parallelize(python_list)
# Use the map function to write one element per line and write all elements to a single file (coalesce)
rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(hdfs_filename)
return None
E.g.:
write_lists_to_hdfs_textfile(ss,[5,4,1,18],"/test_file.txt")
Upvotes: 1