YOLO

Reputation: 21739

Create a dictionary (map) of string to index in PySpark

I have an RDD like this:

rdd = sc.parallelize(['a','b','a','c','d','b','e'])

I want to create a map (dictionary) that maps each unique value to an index.

The output should be a map of (key, value) pairs like:

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

It's super easy to do in plain Python, but I don't know how to do it in Spark.
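
For comparison, here is a plain-Python sketch of the mapping I mean (the names values and index are just illustrative):

values = ['a', 'b', 'a', 'c', 'd', 'b', 'e']
index = {v: i for i, v in enumerate(sorted(set(values)))}
print(index)  # {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}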

Upvotes: 1

Views: 1410

Answers (2)

user3689574

Reputation: 1676

What you are looking for is zipWithIndex.

So for your example (the sortBy is only there so that 'a' ends up as 0, 'b' as 1, and so on):

rdd = sc.parallelize(['a','b','a','c','d','b','e'])

print(rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap())
# {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3}
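
If the mapping then needs to be applied back on the executors, one common pattern is to ship it out as a broadcast variable; a minimal sketch (the names mapping and encoded are illustrative, not part of the answer):

mapping = sc.broadcast(rdd.distinct().sortBy(lambda x: x).zipWithIndex().collectAsMap())
encoded = rdd.map(lambda x: mapping.value[x])
print(encoded.collect())
# [0, 1, 0, 2, 3, 1, 4]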

Upvotes: 1

Alper t. Turker

Reputation: 35249

If you can accept gaps in the indices, this should do the trick (zipWithIndex numbers every element, and reduceByKey(min) keeps each key's first position):

rdd.zipWithIndex().reduceByKey(min).collectAsMap()
# {'b': 1, 'c': 3, 'a': 0, 'e': 6, 'd': 4}

Otherwise (much more expensive, since it sorts and re-indexes in a second pass):

(rdd
    .zipWithIndex()          # ('a', 0), ('b', 1), ('a', 2), ...
    .reduceByKey(min)        # keep each key's first position
    .sortBy(lambda x: x[1])  # order keys by first appearance
    .keys()
    .zipWithIndex()          # re-number consecutively, without gaps
    .collectAsMap())
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

Upvotes: 1
