Reputation: 348
Here's the work I've done so far. I'm using PySpark to read in a csv file:
x = sc.textFile("file:///tmp/data.csv").map(lambda l: l.split(',')).map(clean)
clean is simply a function that removes non-ASCII characters, since the strings I'm importing come in wrapped in a u'' prefix, as in u'"This is a string"'. I've parsed them and have them in the form 'This is a string'. I wrote this function myself (I don't know if there's a more efficient way to do this, as I'm fairly new to Python), but there are non-ASCII characters that PySpark can't handle. I'm using the Python 2.6.6 that comes with the Hortonworks sandbox.
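Roughly, the function looks something like this (a simplified sketch, not my exact code; it assumes each record is the list of fields produced by the split):

def clean(fields):
    # Drop non-ASCII characters and strip the surrounding quotes from each field
    return [field.encode('ascii', 'ignore').strip('"') for field in fields]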
Now my problem is that I'm trying to put this into a dictionary structure. It should fit in memory, so I tried .collectAsDict(), but I received a runtime error.
The keys are simply strings (although unicode strings), which I assume is why I'm getting the error. Is there a good solution?
Upvotes: 1
Views: 6230
Reputation: 1188
If your RDD has a tuple structure, you can use the collectAsMap operation to get the key-value pairs from the RDD as a dictionary. (Note that PySpark has no collectAsDict method, which is likely the source of your runtime error; collectAsMap is the equivalent.) The following should work:
>>> xDict = x.collectAsMap()
>>> xDict["a key"]
Upvotes: 4