Reputation: 51
I'm trying to batch the rows of a large CSV file (100M+ rows) into groups of 100 to send to a Lambda function. I have a workaround using SparkContext:
csv_file_rdd = sc.textFile(csv_file).collect()
count = 0
buffer = []
while count < len(csv_file_rdd):
    buffer.append(csv_file_rdd[count])
    count += 1
    if count % 100 == 0 or count == len(csv_file_rdd):
        # Send buffer to process
        print("Send:", buffer)
        # Clear buffer
        buffer = []
but there must be a more elegant solution. I've tried using SparkSession and mapPartitions, but I haven't been able to make it work.
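For reference, the per-partition batching that mapPartitions would do can be sketched in plain Python (no Spark needed to try it): a generator that chunks any row iterator into fixed-size lists. The function name `batches` and the 100-row size are illustrative choices, not part of any Spark API.

```python
from itertools import islice

def batches(rows, size=100):
    """Yield lists of up to `size` rows from an iterable.

    This is the kind of generator you could pass to
    RDD.mapPartitions / foreachPartition: each partition's row
    iterator gets chunked into fixed-size batches, so nothing is
    collected to the driver.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Plain-Python demo: 250 fake rows -> batches of 100, 100, 50
rows = (f"row-{i}" for i in range(250))
print([len(b) for b in batches(rows, size=100)])  # [100, 100, 50]
```

In Spark this would look roughly like `rdd.foreachPartition(lambda it: [send(b) for b in batches(it)])`, where `send` is whatever pushes a batch to the Lambda.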
Upvotes: 0
Views: 403
Reputation: 118
I suppose your current data is not partitioned in any way (I mean it's only one file), so iterating over it sequentially is a must. I suggest loading it as a DataFrame with spark.read.csv(csv_file), then repartitioning as in this question and saving to disk. Once it's saved you'll have a large number of files, each containing the specified number of records (100 in your case), that another program can send to a Lambda (probably with a Pool of workers). See this post to get an idea. It's probably a naive idea, but it gets the job done.
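The "other program with a Pool of workers" part could be sketched like this. The `send_to_lambda` function and the `part-*.csv` file names are placeholders: in real code the function would read each saved part file and invoke the Lambda (e.g. via boto3), and the paths would come from listing the output directory.

```python
from concurrent.futures import ThreadPoolExecutor

def send_to_lambda(path):
    # Hypothetical sender: a real implementation would read the
    # part file and call boto3.client("lambda").invoke(...) with
    # its contents as the payload.
    return f"sent {path}"

# Stand-ins for the part files Spark wrote after repartitioning
paths = [f"part-{i:05d}.csv" for i in range(4)]

# A small pool of workers sends the files concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(send_to_lambda, paths))

print(results)
```

Threads are a reasonable fit here because sending to Lambda is I/O-bound; a multiprocessing Pool would work the same way if the per-file work were CPU-heavy.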
Upvotes: 1