Guforu

Reputation: 4023

RDD transformation map, Python

Is it possible, in Spark's map method, to convert all elements of a list to float (double) except the first one, without iterating with a for-loop? Something like this in pseudocode:

input = sc.textFile('file.csv').map(lambda line: line.split(','))  # create an RDD of lists
test = input.map(lambda line: line[0] else float(line))  # convert all elements of the list to float except the first one

Upvotes: 1

Views: 100

Answers (1)

zero323

Reputation: 330073

It is possible, although it is arguably not good practice. An RDD is a homogeneous collection of objects, so if you expect some kind of header it is better to drop it up front than to drag it all the way through. Nevertheless, you can try something like this:

from itertools import islice

# Dummy data
with open("/tmp/foo.txt", "w") as fw:
    fw.writelines(["foo\n", "1.0\n", "2.0\n", "3.0\n"])

def process_part(i, iterator):
    if i == 0:
        # Pass the first element of the first partition through unchanged
        # (we could use enumerate as well)
        for x in islice(iterator, 1):
            yield x
    # Convert everything else to float
    for x in iterator:
        yield float(x)

(sc.textFile("/tmp/foo.txt")
    .mapPartitionsWithIndex(process_part)
    .collect())
## ['foo', 1.0, 2.0, 3.0]

If you expect empty partitions, you can count the elements first:

rdd.mapPartitionsWithIndex(lambda i, iterator: [(i, sum(1 for _ in iterator))]).collect()

and replace 0 with the index of the first non-empty partition.
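A sketch of how that substitution might look, assuming process_part from above is rewritten to take the target partition index as a parameter (the helper name process_part_at is hypothetical):

def process_part_at(target):
    # Hypothetical wrapper: like process_part, but keeps the first element
    # of the given target partition instead of partition 0
    def process(i, iterator):
        if i == target:
            for x in islice(iterator, 1):
                yield x
        for x in iterator:
            yield float(x)
    return process

# Find the first non-empty partition using the counting trick above
counts = dict(rdd.mapPartitionsWithIndex(
    lambda i, iterator: [(i, sum(1 for _ in iterator))]).collect())
first_nonempty = min(i for i, n in counts.items() if n > 0)

rdd.mapPartitionsWithIndex(process_part_at(first_nonempty)).collect()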

Upvotes: 2
