Generate/insert contiguous numbers for elements in a Spark RDD

Question

Suppose I have an RDD loaded with lines = sc.textFile('/test.txt'), and its contents are ['apple', 'orange', 'banana']. I would like to generate the RDD [(0, 'apple'), (1, 'orange'), (2, 'banana')].

I know this can be done with:

indexed_lines = lines.zipWithIndex().map(lambda xi: (xi[1], xi[0]))

But now I have another RDD, new_lines = ['pineapple', 'blueberry'], and I would like to union these two RDDs (indexed_lines and new_lines) to construct [(0, 'apple'), (1, 'orange'), (2, 'banana'), (3, 'pineapple'), (4, 'blueberry')]. Note that indexed_lines already exists and I don't want to change the data in it.

I tried to zip the new RDD with a range of indices and then union the results:

index = sc.parallelize(range(3, 5))
new_indexed_lines = new_lines.zip(index)

but it fails at the zip transformation.

Any idea why it breaks, and is there a smarter way to do this?

Thanks.

zero323 · Accepted Answer

How about something like this?

offset = lines.count()
new_indexed_lines = (new_lines
  .zipWithIndex()
  .map(lambda xi: (xi[1] + offset, xi[0])))
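
On the "why" part: the error message is not shown, but a likely cause is that RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition, which two independently created RDDs generally do not satisfy. The offset approach above avoids zip entirely. The following is a minimal pure-Python sketch of the idea, not Spark code: partitions are modeled as plain lists, and index_partitions is a hypothetical helper that plays the role of zipWithIndex plus the index/value swap. The two-partition layouts are assumptions made for illustration.

```python
from itertools import accumulate


def index_partitions(partitions, offset=0):
    """Model of zipWithIndex + swap: number items contiguously across partitions.

    `partitions` is a list of lists standing in for an RDD's partitions
    (hypothetical model, not a Spark API).
    """
    sizes = [len(p) for p in partitions]
    # Each partition starts at offset + (number of elements in earlier partitions).
    starts = [offset] + [offset + s for s in accumulate(sizes)][:-1]
    return [
        [(start + i, item) for i, item in enumerate(part)]
        for start, part in zip(starts, partitions)
    ]


# Assumed partition layouts, for illustration only.
lines = [["apple", "orange"], ["banana"]]
new_lines = [["pineapple"], ["blueberry"]]

indexed_lines = index_partitions(lines)

# Plays the role of lines.count() in the answer above.
offset = sum(len(p) for p in lines)
new_indexed_lines = index_partitions(new_lines, offset)

# Plays the role of a union followed by collect().
combined = [pair for part in indexed_lines + new_indexed_lines for pair in part]
print(combined)
```

In PySpark itself, the last step would presumably be indexed_lines.union(new_indexed_lines), which requires indexed_lines to still be an RDD of (index, value) pairs rather than a collected list.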
