Read multiline json string using Python Spark dataframe

Question

I am reading the contents of an api into a dataframe using the pyspark code below in a databricks notebook. I validated the json payload and the string is in valid json format. I guess the error is due to multiline json string. The below code worked fine with other json api payloads.

Spark version < 2.2

import requests
user = "usr"
password = "aBc!23"
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
jsondata = response.json()
from pyspark.sql import *
df = spark.read.json(sc.parallelize([jsondata]))
df.show()

JSON payload:

{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    },
    {
      "color": "red",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          0,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "blue",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          0,
          0,
          255,
          1
        ],
        "hex": "#00F"
      }
    },
    {
      "color": "yellow",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "green",
      "category": "hue",
      "type": "secondary",
      "code": {
        "rgba": [
          0,
          255,
          0,
          1
        ],
        "hex": "#0F0"
      }
    }
  ]
}

Error:

pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]

Modified code:

spark.sql("set spart.databricks.delta.preview.enabled=true")
spark.sql("set spart.databricks.delta.retentionDutationCheck.preview.enabled=false")
import json
import requests
from requests.auth import HTTPDigestAuth
import pandas as pd
user = "username"
password = "password"
myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
if(myResponse.ok):
  jData = json.loads(myResponse.content)
  s1 = json.dumps(jData)
  #load data from api
  x = json.loads(s1)
  data = pd.read_json(json.dumps(x))
  #create dataframe
  spark_df = spark.createDataFrame(data)
  spark_df.show()          
  spark.conf.set("fs.azure.account.key..blob.core.windows.net","")
  spark_df.write.mode("overwrite").json("wasbs://@.blob.core.windows.net//")
else:
  myResponse.raise_for_status()

Output not in right format as the source.

Output modified: (not same as source)

{
  "colors": 
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    }
    }
{
  "colors":     
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    }
    }

Could you please point to me where I am going wrong as the output file I am storing in ADLS Gen2 does not match the source api json payload.

Read multiline json string using Python Spark dataframe

Answers (1)

Related Questions