Reputation: 79
I am very new to Python/PySpark and currently using it with Databricks. I have the following list
dummyJson= [
('{"name":"leo", "object" : ["191.168.192.96", "191.168.192.99"]}',),
('{"name":"anne", "object" : ["191.168.192.103", "191.168.192.107"]}',),
]
When I try to create an RDD with
jsonRDD = sc.parallelize(dummyJson)
and then read it into a dataframe with
spark.read.json(jsonRDD)
the JSON is not parsed correctly. The resulting dataframe is a single column with _corrupt_record
as the header.
Looking at the elements in dummyJson, it looks like there is an extra/unnecessary comma just before the closing parenthesis on each element/record.
How can I remove this comma from each element of this list?
Thanks
Upvotes: 0
Views: 241
Reputation: 182
If you can fix the input format at the source, that would be ideal.
For your given case, though, the trailing comma is not part of the data: it is Python's tuple syntax, so each element of dummyJson is a one-element tuple wrapping a JSON string. You can fix the read by taking the strings out of the tuples first.
>>> dJson = [i[0] for i in dummyJson]
>>> jsonRDD = sc.parallelize(dJson)
>>> jsonDF = spark.read.json(jsonRDD)
>>> jsonDF.show()
+----+--------------------+
|name| object|
+----+--------------------+
| leo|[191.168.192.96, ...|
|anne|[191.168.192.103,...|
+----+--------------------+
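As a quick sanity check (a plain-Python sketch, no Spark needed), you can confirm the JSON strings themselves are valid once unwrapped from their tuples:

```python
import json

dummyJson = [
    ('{"name":"leo", "object" : ["191.168.192.96", "191.168.192.99"]}',),
    ('{"name":"anne", "object" : ["191.168.192.103", "191.168.192.107"]}',),
]

# Each element is a one-element tuple; t[0] is the JSON string itself.
unwrapped = [t[0] for t in dummyJson]

# The strings parse fine with the standard json module, confirming the
# data was never corrupt -- only wrapped in tuples.
parsed = [json.loads(s) for s in unwrapped]
print(parsed[0]["name"])    # leo
print(parsed[1]["object"])  # ['191.168.192.103', '191.168.192.107']
```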
Upvotes: 2