Guforu

Reputation: 4023

RDD transformation map, Python

Is it possible, in Spark's map method, to convert all elements of a list to float (double) except the first one, without iterating with a for-loop? Something like this in pseudocode:

input = sc.textFile('file.csv').map(lambda line: line.split(','))  # create an RDD of lists
test = input.map(lambda line: line[0] else float(line))  # convert all elements of the list to float except the first one

Upvotes: 1

Views: 100

Answers (1)

zero323

Reputation: 330073

It is possible, although it is arguably not good practice. An RDD is a homogeneous collection of objects, so if you expect some kind of header it is better to drop it up front than to drag it all the way through. Nevertheless, you can try something like this:

from itertools import islice

# Dummy data
with open("/tmp/foo.txt", "w") as fw:
    fw.writelines(["foo\n", "1.0\n", "2.0\n", "3.0\n"])

def process_part(i, iterator):
    if i == 0:
        # Pass the first element of the first partition through unchanged
        # (we could use enumerate as well)
        for x in islice(iterator, 1):
            yield x
    # Convert everything else to float
    for x in iterator:
        yield float(x)

(sc.textFile("/tmp/foo.txt")
    .mapPartitionsWithIndex(process_part)
    .collect())
## ['foo', 1.0, 2.0, 3.0]

If you expect empty partitions, you can count the elements first:

rdd.mapPartitionsWithIndex(lambda i, iterator: [(i, sum(1 for _ in iterator))]).collect()

and replace 0 with the index of the first non-empty partition.
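A sketch of how that substitution might look, assuming process_part from above is rewritten to take the target partition index as a parameter (the helper name process_part_at is hypothetical):

def process_part_at(target):
    # Hypothetical wrapper: like process_part, but keeps the first element
    # of the given target partition instead of partition 0
    def process(i, iterator):
        if i == target:
            for x in islice(iterator, 1):
                yield x
        for x in iterator:
            yield float(x)
    return process

# Find the first non-empty partition using the counting trick above
counts = dict(rdd.mapPartitionsWithIndex(
    lambda i, iterator: [(i, sum(1 for _ in iterator))]).collect())
first_nonempty = min(i for i, n in counts.items() if n > 0)

rdd.mapPartitionsWithIndex(process_part_at(first_nonempty)).collect()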

Upvotes: 2
