fri6aug

Reputation: 51

Use pyspark to partition a csv file into groups of 100 rows

I'm trying to group the rows of a large csv file (100M+ rows) into batches of 100 to send to a Lambda function. I have a workaround using SparkContext like this:

# sc is an existing SparkContext; collect() pulls every row to the driver
csv_file_rdd = sc.textFile(csv_file).collect()

count = 0
buffer = []
while count < len(csv_file_rdd):
    buffer.append(csv_file_rdd[count])
    count += 1
    if count % 100 == 0 or count == len(csv_file_rdd):
        # Send buffer to process
        print("Send:", buffer)
        # Clear buffer
        buffer = []

but there must be a more elegant solution. I've tried using SparkSession and mapPartitions but I haven't been able to make it work.
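Roughly what I'm aiming for is something like this sketch (it uses foreachPartition, the side-effect counterpart of mapPartitions, and send_batch is just a stand-in for the real Lambda call):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-csv").getOrCreate()
df = spark.read.csv(csv_file)

def send_batch(batch):
    # Stand-in for the real Lambda invocation
    print("Send:", batch)

def process_partition(rows):
    # Batch the rows of a single partition into groups of 100
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == 100:
            send_batch(buffer)
            buffer = []
    if buffer:
        send_batch(buffer)

# Runs the batching on the executors instead of collecting
# everything to the driver
df.rdd.foreachPartition(process_partition)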

Upvotes: 0

Views: 403

Answers (1)

Dioni

Reputation: 118

I suppose that your current data is not partitioned in any way (I mean it's only one file), so iterating over it sequentially is a must. I suggest loading it as a DataFrame with spark.read.csv(csv_file), then repartitioning as in this question and saving to disk. Once it's saved you'll have a large number of files, each containing the specified number of records (100 in your case), that can be used by another program to send to a Lambda (probably with a Pool of workers). See this post to get an idea. It's probably a naive idea, but it gets the job done.
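Something along these lines (a rough sketch; the paths are placeholders, and repartition distributes rows round-robin, so each output file ends up with roughly 100 records rather than exactly 100):

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-csv").getOrCreate()

# Load the whole csv as a DataFrame (path is a placeholder)
df = spark.read.csv("s3://my-bucket/big_file.csv")

# One partition per batch of ~100 records
batch_size = 100
num_partitions = math.ceil(df.count() / batch_size)

# One output file per partition; another program can then pick these
# files up and send each one to the Lambda (e.g. with a Pool of workers)
df.repartition(num_partitions).write.csv("s3://my-bucket/batches/")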

Upvotes: 1
