raul

Reputation: 781

converting rdd list of list unicode values into string

I have a Spark RDD whose values are lists of unicode strings:

ex_rdd.take(5)
[[u'70450535982412348', u'1000000923', u'1'],
 [u'535982417348', u'1000000923', u'1'],
 [u'50535929459899', u'1000000923', u'99'],
 [u'8070450535936297811', u'1000000923', u'1'],
 [u'5937908667', u'1000000923', u'1']]

Writing them to an HDFS file raises a unicode error. How do I convert them to strings and write them to a file efficiently in PySpark? The HDFS output file should look like this:

 70450535982412348,1000000923,1
 535982417348,1000000923,1 

and so on

Upvotes: 0

Views: 705

Answers (1)

A.M.

Reputation: 44

You can use Python's join method for strings, along with the map and saveAsTextFile operations on pyspark.RDD objects (see the pyspark.RDD documentation).

# Join each list's fields with commas, then write one line per record to HDFS.
ex_rdd.map(lambda L: ','.join(L)).saveAsTextFile('/path/to/hdfs/save/file')

This should be available even on early versions (>= 1.0) of PySpark, if I'm not mistaken.

I'm not sure what you mean by "unicode error". Is this an exception in Python? Or is this an exception in the Java internals?
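
If it is a Python UnicodeEncodeError raised while the lines are written out, explicitly encoding each field to UTF-8 before joining usually avoids it. Here's a minimal sketch, assuming Python 2 (which the u'' literals in your output suggest) and a hypothetical output path:

# Encode each unicode field to a UTF-8 byte string before joining,
# so saveAsTextFile writes plain byte strings rather than implicitly
# encoding unicode objects as ASCII.
encoded = ex_rdd.map(lambda row: ','.join(s.encode('utf-8') for s in row))
encoded.saveAsTextFile('/path/to/hdfs/save/file')

saveAsTextFile writes one line per RDD element into part-* files under that directory, which matches the output format shown in the question.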

Upvotes: 1
