iKnowNothing

Reputation: 79

How to Remove / Replace Character from PySpark List

I am very new to Python/PySpark and am currently using it with Databricks. I have the following list:

dummyJson= [
 ('{"name":"leo", "object" : ["191.168.192.96", "191.168.192.99"]}',), 
 ('{"name":"anne", "object" : ["191.168.192.103", "191.168.192.107"]}',),
]

When I run

jsonRDD = sc.parallelize(dummyJson) and then load it into a dataframe with spark.read.json(jsonRDD)

the JSON is not parsed correctly. The resulting dataframe has a single column with _corrupt_record as the header.

Looking at the elements in dummyJson, there seems to be an extra/unnecessary comma just before the closing parenthesis of each element/record.

How can I remove this comma from each element of the list?

Thanks

Upvotes: 0

Views: 241

Answers (1)

SpaceJammer

Reputation: 182

If you can fix the input format at the source, that would be ideal.

The trailing comma you are seeing is not part of the data: it is Python's syntax for a single-element tuple. Each element of dummyJson is a tuple wrapping a JSON string, which is why spark.read.json reports the records as corrupt. You can fix it by taking the strings out of the tuples:

>>> dJson = [i[0] for i in dummyJson]
>>> jsonRDD = sc.parallelize(dJson)
>>> jsonDF = spark.read.json(jsonRDD)
>>> jsonDF.show()
+----+--------------------+
|name|              object|
+----+--------------------+
| leo|[191.168.192.96, ...|
|anne|[191.168.192.103,...|
+----+--------------------+
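Without a Spark cluster at hand, you can check with plain Python that the unwrapped strings are valid JSON on their own (a minimal sketch using only the standard-library `json` module; the variable names mirror the question):

```python
import json

# The input from the question: a list of one-element tuples, each
# wrapping a JSON string. The trailing comma is tuple syntax, not
# part of the string itself.
dummyJson = [
    ('{"name":"leo", "object" : ["191.168.192.96", "191.168.192.99"]}',),
    ('{"name":"anne", "object" : ["191.168.192.103", "191.168.192.107"]}',),
]

# Unwrap each one-element tuple to get the bare JSON strings.
dJson = [t[0] for t in dummyJson]

# Each unwrapped string now parses cleanly.
records = [json.loads(s) for s in dJson]
print([r["name"] for r in records])  # → ['leo', 'anne']
```

Equivalently, you could do the unwrapping inside Spark with sc.parallelize(dummyJson).map(lambda t: t[0]) before calling spark.read.json.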

Upvotes: 2
