Reputation: 21739
I have an RDD like this:
rdd = sc.parallelize(['a','b','a','c','d','b','e'])
I want to create a map (dictionary) from each unique value to an index.
The expected output is a (key, value) map like:
{'a':0, 'b':1, 'c':2,'d':3,'e':4}
It's super easy to do in plain Python, but I don't know how to do this in Spark.
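In plain Python it would be something like this (just a sketch of what I mean, with illustrative names):

values = ['a', 'b', 'a', 'c', 'd', 'b', 'e']
# map each distinct value to its position in sorted order
mapping = {v: i for i, v in enumerate(sorted(set(values)))}
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}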
Upvotes: 1
Views: 1410
Reputation: 1676
What you are looking for is zipWithIndex.
For your example (the sortBy is only there so that 'a' ends up as 0, and so on):
rdd = sc.parallelize(['a','b','a','c','d','b','e'])
print(rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap())
# {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3}
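If you then want to apply that mapping back to the RDD, one option (a sketch, assuming the usual SparkContext sc) is to broadcast the collected dict:

mapping = rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap()
index_of = sc.broadcast(mapping)  # ship the small dict to the executors
print(rdd.map(lambda x: index_of.value[x]).collect())
# [0, 1, 0, 2, 3, 1, 4]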
Upvotes: 1
Reputation: 35249
If you can accept gaps in the indices, this should do the trick:
rdd.zipWithIndex().reduceByKey(min).collectAsMap()
# {'b': 1, 'c': 3, 'a': 0, 'e': 6, 'd': 4}
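The gaps appear because zipWithIndex numbers every element, duplicates included, before reduceByKey(min) picks the lowest index per value. Looking at the intermediate result with the same rdd makes this clear:

print(rdd.zipWithIndex().collect())
# [('a', 0), ('b', 1), ('a', 2), ('c', 3), ('d', 4), ('b', 5), ('e', 6)]
# reduceByKey(min) keeps only the first index per value, so 2 and 5 are dropped.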
Otherwise (much more expensive):
(rdd
 .zipWithIndex()           # pair every element with its position
 .reduceByKey(min)         # keep the lowest index per distinct value
 .sortBy(lambda x: x[1])   # order the values by that first index
 .keys()                   # drop the gapped indices
 .zipWithIndex()           # assign fresh, consecutive indices
 .collectAsMap())
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
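Note that this variant numbers values by first occurrence in the RDD rather than alphabetically; with your data the two happen to coincide. A quick sketch with reordered (made-up) input shows the difference:

rdd2 = sc.parallelize(['d', 'b', 'a'])
print(rdd2.zipWithIndex().reduceByKey(min)
          .sortBy(lambda x: x[1]).keys()
          .zipWithIndex().collectAsMap())
# {'d': 0, 'b': 1, 'a': 2}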
Upvotes: 1