Reputation: 127
Suppose I have a RDD loaded by
lines = sc.textFile('/test.txt')
and RDD is like ['apple', 'orange', 'banana']
. Then I would like to generate RDD [(0, 'apple'), (1, 'orange'), (2, 'banana')]
.
I know this could be done by indexed_lines = lines.zipWithIndex().map(lambda (x, y): ','.join([str(y), x])).collect()
But now I have another RDD new_lines = ['pineapple','blueberry']
, I would like to union
these two RDDs(indexed_lines and new_lines) to construct [(0, 'apple'), (1, 'orange'), (2, 'banana'), (3, 'pineapple'), (4, 'blueberry')]
notice that the indexed_lines already exists and I don't want to change the data within it.
I am trying to zip and union
RDDs
index = sc.parallelize(range(3, 5))
new_indexed_lines = new_lines.zip(index)
but it broke in this zip
transform.
Any idea why it is broken and if there is a smarter way to do this?
thanks.
Upvotes: 0
Views: 356
Reputation: 330273
How about something like this?
offset = lines.count()
new_indexed_lines = (new_lines
.zipWithIndex()
.map(lambda xi: (xi[1] + offset, xi[0])))
Upvotes: 2