Reputation: 348
Here's the work I've done so far. I'm using PySpark to read in a csv file:
x = sc.textFile("file:///tmp/data.csv").map(lambda l: l.split(',')).map(clean)
clean is simply a function that removes non-ASCII characters, since the strings I'm importing come in wrapped in a u'' prefix, as in u'"This is a string"'. I've parsed them and have them in the form 'This is a string'. I wrote this function myself (I don't know if there's a more efficient way to do this, as I'm fairly new to Python), but there are non-ASCII characters that PySpark can't handle. I'm using the Python 2.6.6 that comes with the Hortonworks sandbox.
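Roughly, the function looks something like this (a simplified sketch, not my exact code; it assumes each record is the list of fields produced by the split):

def clean(fields):
    # Drop non-ASCII characters and strip the surrounding quotes from each field
    return [field.encode('ascii', 'ignore').strip('"') for field in fields]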
Now my problem is that I'm trying to put this into a dictionary structure. It should fit in memory, so I tried .collectAsDict(), but I received a runtime error.
The keys are simply strings (although unicode strings), which I assume is why I'm getting the error. Is there a good solution?
Upvotes: 1
Views: 6230
Reputation: 1188
If your RDD has a tuple structure, you can use the collectAsMap operation to get the key-value pairs from the RDD as a dictionary. (Note that PySpark has no collectAsDict method, which is likely the source of your runtime error; collectAsMap is the equivalent.) The following should work:
>>> xDict = x.collectAsMap()
>>> xDict["a key"]
Upvotes: 4