thinkerou

Reputation: 1877

How to save a list to a file in Spark?

I have read this SO post, but I still need to sample random lines.

I have a dataset like the following:

123456789
23458ef12
ef12345ea
111223345

I want to get some random lines from it, so I wrote the following pyspark code:

rdd = spark_context.textFile('a.tx').takeSample(False, 3)
rdd.saveAsTextFile('b.tx')

But takeSample returns a list, so this raises an error:

'list' object has no attribute 'saveAsTextFile'

Upvotes: 7

Views: 13720

Answers (2)

mrsrinivas

Reputation: 35404

takeSample() returns a plain Python list, not an RDD, so you need to parallelize it before you can save it:

rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3)).saveAsTextFile('b.tx')

But the better way is to use sample() (here I am taking 30%), which returns an RDD, so it can be saved directly:

rdd.sample(False, 0.3).saveAsTextFile('b.tx')
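The difference between the two calls matters: takeSample(False, k) draws an exact-size sample without replacement and brings it to the driver as a list, while sample(False, fraction) keeps each element independently with the given probability, so the result size is only approximately fraction of the input. A plain-Python sketch of these semantics (the helper names are invented for illustration; this is not pyspark code):

    import random

    def take_sample(lines, k, seed=None):
        # Mimics RDD.takeSample(False, k): exact-size sample without
        # replacement, returned as a plain Python list on the driver.
        rng = random.Random(seed)
        return rng.sample(lines, k)

    def bernoulli_sample(lines, fraction, seed=None):
        # Mimics RDD.sample(False, fraction): each element is kept
        # independently with probability `fraction`, so the result
        # size is only approximately fraction * len(lines).
        rng = random.Random(seed)
        return [line for line in lines if rng.random() < fraction]

    lines = ['123456789', '23458ef12', 'ef12345ea', '111223345']
    take_sample(lines, 3, seed=42)        # always exactly 3 lines
    bernoulli_sample(lines, 0.3, seed=42) # roughly 30% of the lines

So if you need exactly N lines, takeSample plus parallelize is the way; if an approximate fraction is fine, sample avoids collecting anything to the driver.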

Upvotes: 6

Bikash Gyawali

Reputation: 1048

If you need to begin from a pure Python list, such as the result of calling .collect() on a pyspark dataframe, I have the following function:

def write_lists_to_hdfs_textfile(ss, python_list, hdfs_filename):
    '''
    :param ss : SparkSession Object
    :param python_list: simple list in python. Can be a result of .collect() on pyspark dataframe.
    :param hdfs_filename : the path of hdfs filename to write
    :return: None
    '''

    # First need to convert the list to parallel RDD
    rdd_list = ss.sparkContext.parallelize(python_list)

    # Use the map function to write one element per line and write all elements to a single file (coalesce)
    rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(hdfs_filename)

    return None

Eg:

write_lists_to_hdfs_textfile(ss,[5,4,1,18],"/test_file.txt")
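Note that coalesce(1) funnels everything through one partition so the output is a single part file; that is fine for small lists but defeats Spark's parallelism. If the collected list is small and the target is a local path rather than HDFS, plain Python I/O avoids the Spark job entirely (an alternative sketch, not part of the original answer; the function name is invented):

    def write_list_to_local_textfile(python_list, filename):
        # Write one element per line to a local file, mirroring the
        # one-element-per-line output of saveAsTextFile.
        with open(filename, 'w') as f:
            for item in python_list:
                f.write(str(item) + '\n')

    write_list_to_local_textfile([5, 4, 1, 18], 'test_file.txt')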

Upvotes: 1
