Reputation: 45
I am trying to read a simple json file into a dataframe. Here is the JSON:
{
"apiName": "bctt-e-card-operations",
"correlationId": "bctt-e-card-operations",
"msg": "{ "MSG_TIP": "1261", "MSG_VER": "02", "LOG_SIS": "32", "LOG_PERN01": "2122", "LOG_NUMN01": "30108916", "MSG_RESTIPA00": "0", "MSG_IDE": "00216452708916", "MSG_REACOD": "000", "SAN_NUM": "000010X01941XXX", "EXT_MOECOD": "978", "SAN_SDIMNT": "00000043830", "LOG_SINMOV": "C", "SAN_SAUMNT": "00000043830", "LOG_SINMOV": "C", "SAN_SCOMNT": "00000043830", "LOG_SINMOV": "C", "SAN_SCODAT": "20220502", "SDF_FORDAD": "0", "CAR_PAGXTR": "0", "CLI_LIMCRE": "0000000", "SDF_AUTCAT": "000000000", "LOG_SINMOV": "C", "SDF_SLDDIV": "000000000", "CLI_SDICSH": "000000000", "LOG_SINMOV": "C", "CLI_SDICPR": "000000000", "LOG_SINMOV": "C", "MSG_DADLGT": "0000" }"
"contentType": "application/json",
"msgFormat": "SIBS"
"step": "response",
"status": "0",
"transTimestamp": "2022-05-02 16:45:28.487",
"operationMetadata": {
"msgType": "PAYMENT",
"msgSubtype": "116100-02300",
"accountId": "10X01941XXX",
"cardNumber": "451344X063617XXX",
"reference": "212230108916",
"originalReference": null,
"reversalReference": null,
"timeout": false
"flow": "PRT"
},
"error": {
"errorCode": null
"errorDescription": null
}
}
Here is my code to read the file
file_location = "/mnt/test/event example.json"
df = spark.read.option("multiline", "true").json(file_location, schema=schema)
display(df)
But as you can see the json file is missing some commas and also the 'msg' key has its value wrapped in quotes. This makes the dataframe return everything with nulls.
Is there a way to format the JSON, (in Pyspark because the file comes like that from the source), for removing the quotes and adding commas so the json is properly formated?
Thanks in advance
Upvotes: 1
Views: 106
Reputation: 81
You could turn the JSON into a string and use Python RegEx sub() function to reformat the string. Example:
import re
json_string= " "contentType": "application/json",
"msgFormat": "SIBS"
"step": "response","
x = re.sub("\"SIBS\"", "\"SIBS\"," , json_string)
print(json_string)
" "contentType": "application/json",
"msgFormat": "SIBS",
"step": "response","
In this case, the function is looking for the value "SIBS" and replacing it with "SIBS", .You might need to play around with it and use escape characters.
Upvotes: 1