Reputation: 546
I have an RDD with many columns (e.g. hundreds), and most of my operations are on columns, e.g. I need to create many intermediate variables from different columns.
What is the most efficient way to do this?
I create an RDD from a CSV file:
dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))
For example, this will give me an RDD like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
I need to add a new column, calculatedvalue = 2ndCol + 19thCol, to every row and get a new RDD:
123, 523, 534, ..., 893, calculatedvalue
536, 98, 1623, ..., 98472, calculatedvalue
537, 89, 83640, ..., 9265, calculatedvalue
7297, 98364, 9, ..., 735, calculatedvalue
......
29, 94, 956, ..., 758, calculatedvalue
What is the best way of doing this?
Upvotes: 1
Views: 942
Reputation: 18022
A single map is enough:
rdd = sc.parallelize([(1,2,3,4), (4,5,6,7)])
# replace these indices with your own (they are zero-based)
newrdd = rdd.map(lambda x: x + (x[1] + x[2],))
newrdd.collect() # [(1, 2, 3, 4, 5), (4, 5, 6, 7, 11)]
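Note that in the question each row comes from line.split(","), so it is a list of strings rather than a tuple of numbers. A minimal sketch for that case follows. The helper name add_calculated_column and the choice to parse the values with float are assumptions for illustration, not something from the question, and the sample row is made up so the function can be checked without Spark.

```python
def add_calculated_column(fields, i=1, j=18):
    """Return the row with float(fields[i]) + float(fields[j]) appended.

    `fields` is one row as produced by line.split(","), i.e. a list of strings.
    Indices are zero-based, so the 2nd and 19th columns are 1 and 18.
    The float conversion is an assumption about the data.
    """
    return fields + [float(fields[i]) + float(fields[j])]


# Local check on a plain list, no Spark needed (hypothetical 19-column row).
row = [str(n) for n in range(1, 20)]  # "1", "2", ..., "19"
new_row = add_calculated_column(row)
```

On the RDD from the question this would be applied as newRDD = dataRDD.map(add_calculated_column). If you need many derived columns, computing them all inside one such function keeps it to a single pass over the data.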
Upvotes: 1