Reputation: 21739
I have an RDD like this:
rdd = sc.parallelize(['a','b','a','c','d','b','e'])
I want to create a map (dictionary) from each unique value to an index.
The expected output is a (key, value) map like:
{'a':0, 'b':1, 'c':2,'d':3,'e':4}
It's super easy to do in plain Python, but I don't know how to do this in Spark.
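In plain Python it would be something like this (just a sketch of what I mean, with illustrative names):

values = ['a', 'b', 'a', 'c', 'd', 'b', 'e']
# map each distinct value to its position in sorted order
mapping = {v: i for i, v in enumerate(sorted(set(values)))}
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}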
Upvotes: 1
Views: 1410
Reputation: 1676
What you are looking for is zipWithIndex.
For your example (the sortBy is only there so that 'a' ends up as 0, and so on):
rdd = sc.parallelize(['a','b','a','c','d','b','e'])
print(rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap())
# {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3}
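If you then want to apply that mapping back to the RDD, one option (a sketch, assuming the usual SparkContext sc) is to broadcast the collected dict:

mapping = rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap()
index_of = sc.broadcast(mapping)  # ship the small dict to the executors
print(rdd.map(lambda x: index_of.value[x]).collect())
# [0, 1, 0, 2, 3, 1, 4]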
Upvotes: 1
Reputation: 35249
If you can accept gaps in the indices, this should do the trick:
rdd.zipWithIndex().reduceByKey(min).collectAsMap()
# {'b': 1, 'c': 3, 'a': 0, 'e': 6, 'd': 4}
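The gaps appear because zipWithIndex numbers every element, duplicates included, before reduceByKey(min) picks the lowest index per value. Looking at the intermediate result with the same rdd makes this clear:

print(rdd.zipWithIndex().collect())
# [('a', 0), ('b', 1), ('a', 2), ('c', 3), ('d', 4), ('b', 5), ('e', 6)]
# reduceByKey(min) keeps only the first index per value, so 2 and 5 are dropped.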
Otherwise (much more expensive):
(rdd
 .zipWithIndex()           # pair every element with its position
 .reduceByKey(min)         # keep the lowest index per distinct value
 .sortBy(lambda x: x[1])   # order the values by that first index
 .keys()                   # drop the gapped indices
 .zipWithIndex()           # assign fresh, consecutive indices
 .collectAsMap())
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
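Note that this variant numbers values by first occurrence in the RDD rather than alphabetically; with your data the two happen to coincide. A quick sketch with reordered (made-up) input shows the difference:

rdd2 = sc.parallelize(['d', 'b', 'a'])
print(rdd2.zipWithIndex().reduceByKey(min)
          .sortBy(lambda x: x[1]).keys()
          .zipWithIndex().collectAsMap())
# {'d': 0, 'b': 1, 'a': 2}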
Upvotes: 1