How to parse nested JSON (blob format) using PySpark

Question

I am receiving the following records in blob format, with one event per line. Below is an example of two events separated by a newline:

A few things to note:
In the example below, the event structure is inconsistent. Only certain events contain the Channel Id, conversation Id, replyActivity Id, from Id, and locale fields; where a field is absent, I need to populate the corresponding column with null in my DataFrame.

How can I achieve this in PySpark?

Example:

{
   "event":[
      {
         "name":"Zip/Postal Code",
         "count":1
      }
   ],
   "internal":{
      "data":{
         "id":"XXXX",
         "documentVersion":"1.61"
      }
   },
   "context":{
      "application":{
         "version":"Thu 10/15/2020  2:46:54.65 
UTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] 
[IntercomWebUIVersion 1.6.20-169031]  [IntercomBotAppTemplatesVersion 1.3.27-165664] 
"
      },
      "data":{
         "eventTime":"2020-10-20T15:54:48.7734934Z",
         "isSynthetic":false,
         "samplingRate":100.0
      },
      "cloud":{
         
      },
      "device":{
         "type":"PC",
         "roleName":"bc-directline-eus2",
         "roleInstance":"RD0004FFA145F5",
         "screenResolution":{
            
         }
      },
      "session":{
         "isFirst":false
      },
      "operation":{
         "id":"f115c4bf-4fa31385d9a8f248",
         "parentId":"|f115c4bf-4fa31385d9a8f248."
      },
      "location":{
         "clientip":"0.0.0.0",
         "continent":"North America",
         "country":"United States",
         "province":"Virginia",
         "city":"Boydton"
      },
      "custom":{
         "dimensions":[
            {
               "Timestamp":"XXXX"
            },
            {
               "StatusCode":"200"
            },
            {
               "Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000006"
            },
            {
               "From ID":"XXXX"
            },
            {
               "Correlation ID":"|f115c4bf-4fa31385d9a8f248."
            },
            {
               "Channel ID":"directline"
            },
            {
               "Recipient ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums@Ye6TP1LJz0o"
            },
            {
               "Bot ID":"XXXX"
            },
            {
               "Activity Type":"message"
            },
            {
               "Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
            }
         ]
      }
   }
}
{
   "event":[
      {
         "name":"Activity",
         "count":1
      }
   ],
   "internal":{
      "data":{
         "id":"992b0fc7-12ec-11eb-b59a-fb2df7d234d8",
         "documentVersion":"1.61"
      }
   },
   "context":{
      "application":{
         "version":"Thu 10/15/2020  2:46:54.65 
UTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] 
[IntercomWebUIVersion 1.6.20-169031]  [IntercomBotAppTemplatesVersion 1.3.27-165664] 
"
      },
      "data":{
         "eventTime":"2020-10-20T15:54:34.3811795Z",
         "isSynthetic":false,
         "samplingRate":100.0
      },
      "cloud":{
         
      },
      "device":{
         "type":"PC",
         "roleName":"bc-directline-eastus3",
         "roleInstance":"RD00155D33F838",
         "screenResolution":{
            
         }
      },
      "session":{
         "isFirst":false
      },
      "operation":{
         "id":"00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00",
         "parentId":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
      },
      "location":{
         "clientip":"0.0.0.0",
         "continent":"North America",
         "country":"United States",
         "province":"Virginia",
         "city":"Washington"
      },
      "custom":{
         "dimensions":[
            {
               "Timestamp":"XXXX"
            },
            {
               "StatusCode":"200"
            },
            {
               "Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000000"
            },
            {
               "From ID":"XXXX"
            },
            {
               "Correlation ID":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
            },
            {
               "Channel ID":"directline"
            },
            {
               "Bot ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums"
            },
            {
               "Activity Type":"message"
            },
            {
               "Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
            }
         ]
      }
   }
}

I need to extract these records into the following table format (column names listed below):

ActivityId | ActivityType | ChannelId | conversationId | replyActivityId | fromId | locale | recipientId | speak | text | name | eventTime | Date | InstanceId | DialogId | StepName | applicationId | intent | intentScore | entities | question | sentimentLabel | sentimentScore | knowledgeBaseId | answer | articleFound | originalQuestion | question | questionId | score | username | city | province | country | Feedback | Comment | Tag
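
To make the requirement concrete, here is a minimal sketch of the per-record flattening step in plain Python. It assumes that each event occupies exactly one line of the blob, that each entry of custom.dimensions is a single-key object as in the example, and that only a handful of the target columns are mapped. The DIMENSION_COLUMNS mapping and the flatten_event helper are hypothetical names introduced for illustration; they are not part of PySpark.

```python
import json

# Hypothetical mapping from keys seen in context.custom.dimensions to target
# column names. Only a few columns are shown; extend it for the full table.
DIMENSION_COLUMNS = {
    "Activity ID": "ActivityId",
    "Activity Type": "ActivityType",
    "Channel ID": "ChannelId",
    "Conversation ID": "conversationId",
    "From ID": "fromId",
    "Recipient ID": "recipientId",
}


def flatten_event(line):
    """Turn one JSON event (assumed to be one line of the blob) into a flat dict."""
    record = json.loads(line)
    context = record.get("context", {})

    # custom.dimensions is a list of single-key objects; merge them into one dict.
    dimensions = {}
    for item in context.get("custom", {}).get("dimensions", []):
        dimensions.update(item)

    # dict.get returns None for absent keys, so missing fields become null.
    row = {column: dimensions.get(key) for key, column in DIMENSION_COLUMNS.items()}

    # Take the name from the first entry of the "event" array, if there is one.
    events = record.get("event") or [{}]
    row["name"] = events[0].get("name")
    row["eventTime"] = context.get("data", {}).get("eventTime")

    location = context.get("location", {})
    for field in ("city", "province", "country"):
        row[field] = location.get(field)
    return row


# A shortened, single-line version of the second example event.
sample = (
    '{"event":[{"name":"Activity","count":1}],'
    '"context":{"data":{"eventTime":"2020-10-20T15:54:34.3811795Z"},'
    '"location":{"city":"Washington","province":"Virginia","country":"United States"},'
    '"custom":{"dimensions":[{"Activity Type":"message"},{"Channel ID":"directline"}]}}}'
)
row = flatten_event(sample)
print(row["ChannelId"], row["recipientId"])  # prints: directline None
```

In PySpark, one possible way to apply this would be to read the blob with spark.read.text(path), map each row's value field through flatten_event on the underlying RDD, convert each dict to a tuple in column order, and pass the result to spark.createDataFrame together with an explicit StructType of nullable string fields. This is a sketch under the assumptions above, not a tested pipeline. If the real data contains raw line breaks inside string values, as the version field does in the example, line-by-line parsing will not work as written.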

