Utkarsh Saraf

Reputation: 495

Convert JSON to the format expected by Spark to create a dataframe schema in Java

I have test JSON data at the following link:

http://developer.trade.gov/api/market-research-library.json

When I try to read the schema directly from it in the following manner:

public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}

Spark reports only a corrupt record instead of the actual schema:

root
 |-- _corrupt_record: string (nullable = true)

If the file is malformed, how can I convert it into the format expected by Spark?

Upvotes: 2

Views: 1692

Answers (3)

Vikas Subramaniam

Reputation: 11

By default, Spark expects JSON Lines (JSONL), with one complete JSON record per line, rather than a single standard JSON document spread over many lines. Here is a small Python script that converts your JSON to the expected format:

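import json

import jsonlines  # third-party package: pip install jsonlines

# Load the whole document as standard JSON.
with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)

# write_all expects an iterable of records, so wrap a single top-level object in a list.
records = json_data if isinstance(json_data, list) else [json_data]

# Write one JSON record per line.
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    writer.write_all(records)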

Then you can access the file in your program as you have written in your code.
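If installing the jsonlines package is not an option, the same conversion can be sketched with the standard library alone. The function name to_json_lines and the file paths are hypothetical, chosen only for illustration. A top-level array is written as one line per element, and any other top-level value, such as the single object in the question's file, is written as one line.

```python
import json


def to_json_lines(src_path, dst_path):
    """Rewrite a standard JSON file as JSON Lines and return the record count."""
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)

    # A top-level array yields one line per element; anything else
    # (for example a single object) is treated as one record.
    records = data if isinstance(data, list) else [data]

    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # json.dumps without indent emits the record on a single line;
            # newlines inside string values are escaped as \n.
            dst.write(json.dumps(record) + "\n")
    return len(records)


# Usage (hypothetical paths):
# to_json_lines("market-research-library.json", "market-research-library.jsonl")
```

Because the file is parsed and serialized again instead of being edited as text, spaces inside string values are preserved.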

Upvotes: 1

Ramesh Maharjan

Reputation: 41957

If this is the only JSON file you want to convert to a dataframe, I suggest using the wholeTextFiles API. The file is not in a Spark-readable format as it stands, and you can only convert it once the whole content has been read as a single string, which is what wholeTextFiles does.

You can then remove the line feeds and spaces from the JSON string and pass the result to the JSON reader. Note that removing every space also strips the spaces inside string values such as descriptions; the schema is unaffected, but the text content is altered.

sqlContext.read.json(sc.wholeTextFiles("path to market-research-library.json file").map(_._2.replace("\n", "").replace(" ", "")))

You should get the required dataframe with the following schema:

root
 |-- basePath: string (nullable = true)
 |-- definitions: struct (nullable = true)
 |    |-- Report: struct (nullable = true)
 |    |    |-- properties: struct (nullable = true)
 |    |    |    |-- click_url: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- country: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- description: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- expiration_date: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- id: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- industry: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- report_type: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- source_industry: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- title: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- url: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |-- host: string (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- paths: struct (nullable = true)
 |    |-- /market_research_library/search: struct (nullable = true)
 |    |    |-- get: struct (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- parameters: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |    |-- format: string (nullable = true)
 |    |    |    |    |    |-- in: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |-- required: boolean (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- responses: struct (nullable = true)
 |    |    |    |    |-- 200: struct (nullable = true)
 |    |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |    |-- schema: struct (nullable = true)
 |    |    |    |    |    |    |-- items: struct (nullable = true)
 |    |    |    |    |    |    |    |-- $ref: string (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- tags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |-- produces: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- schemes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- swagger: string (nullable = true)

Upvotes: 1

Zhang Tong

Reputation: 4719

Convert your JSON to a single line.

Or set option("multiLine", true) to allow multi-line JSON.
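As a minimal sketch of the second suggestion, here is how the option might be applied from PySpark. The helper name read_multiline_json is hypothetical, and the code assumes an existing SparkSession and a Spark version that supports multiLine (2.2 or later, as far as I know). In Java, the equivalent call should be spark.read().option("multiLine", true).json(path).

```python
def read_multiline_json(spark, path):
    """Read a JSON file whose records span several lines.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    # multiLine tells the JSON reader to parse each file as a whole
    # instead of treating every line as a separate record.
    return spark.read.option("multiLine", True).json(path)


# Usage, assuming a SparkSession named `spark` already exists:
# df = read_multiline_json(spark, "market-research-library.json")
# df.printSchema()
```

With this option the original file can be read without any conversion step, at the cost of each file being parsed as a single unit rather than split by line.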

Upvotes: 3
