fri6aug

Reputation: 51

Use pyspark to partition a csv file into groups of 100 rows

I'm trying to group the rows of a large csv file (100M+ rows) into batches of 100 to send to a Lambda function. I have a workaround using SparkContext like this:

# sc is an existing SparkContext; collect() pulls every row to the driver
csv_file_rdd = sc.textFile(csv_file).collect()

count = 0
buffer = []
while count < len(csv_file_rdd):
    buffer.append(csv_file_rdd[count])
    count += 1
    if count % 100 == 0 or count == len(csv_file_rdd):
        # Send buffer to process
        print("Send:", buffer)
        # Clear buffer
        buffer = []

but there must be a more elegant solution. I've tried using SparkSession and mapPartitions but I haven't been able to make it work.
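Roughly what I'm aiming for is something like this sketch (it uses foreachPartition, the side-effect counterpart of mapPartitions, and send_batch is just a stand-in for the real Lambda call):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-csv").getOrCreate()
df = spark.read.csv(csv_file)

def send_batch(batch):
    # Stand-in for the real Lambda invocation
    print("Send:", batch)

def process_partition(rows):
    # Batch the rows of a single partition into groups of 100
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == 100:
            send_batch(buffer)
            buffer = []
    if buffer:
        send_batch(buffer)

# Runs the batching on the executors instead of collecting
# everything to the driver
df.rdd.foreachPartition(process_partition)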

Upvotes: 0

Views: 403

Answers (1)

Dioni

Reputation: 118

I suppose that your current data is not partitioned in any way (I mean it's only one file), so iterating over it sequentially is a must. I suggest loading it as a DataFrame with spark.read.csv(csv_file), then repartitioning as in this question and saving to disk. Once it's saved you'll have a large number of files, each containing the specified number of records (100 in your case), that can be used by another program to send to a Lambda (probably with a Pool of workers). See this post to get an idea. It's probably a naive idea, but it gets the job done.
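Something along these lines (a rough sketch; the paths are placeholders, and repartition distributes rows round-robin, so each output file ends up with roughly 100 records rather than exactly 100):

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-csv").getOrCreate()

# Load the whole csv as a DataFrame (path is a placeholder)
df = spark.read.csv("s3://my-bucket/big_file.csv")

# One partition per batch of ~100 records
batch_size = 100
num_partitions = math.ceil(df.count() / batch_size)

# One output file per partition; another program can then pick these
# files up and send each one to the Lambda (e.g. with a Pool of workers)
df.repartition(num_partitions).write.csv("s3://my-bucket/batches/")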

Upvotes: 1
