Reputation: 495
I have test JSON data at following link
http://developer.trade.gov/api/market-research-library.json
When I am trying to read schema directly from it in following manner
public void readJsonFormat() {
Dataset<Row> people = spark.read().json("market-research-library.json");
people.printSchema();
}
It is giving me error as
root
|-- _corrupt_record: string (nullable = true)
If it is malformed, how to convert it into format as expected by Spark.
Upvotes: 2
Views: 1692
Reputation: 11
The format expected by spark is JSONL(JSON lines) which is not the standard JSON. Got to know this from here. Here's a small python script to convert your json to expected format:
import jsonlines
import json
with open('C:/Users/ak/Documents/card.json', 'r') as f:
json_data = json.load(f)
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
writer.write_all(json_data)
Then you can access the file in your program as you have written in your code.
Upvotes: 1
Reputation: 41957
If this is the only json
you would like to convert to dataframe
then I suggest you to go with wholeTextFiles
api. Since the json
is not in spark readable format, you can convert it to spark readable format only when whole of the data is read as one parameter and wholeTextFiles
api does that.
Then you can replace
the linefeed and spaces from the json
string. And finally you should have required dataframe
.
sqlContext.read.json(sc.wholeTextFiles("path to market-research-library.json file").map(_._2.replace("\n", "").replace(" ", "")))
You should have your required dataframe
with following schema
root
|-- basePath: string (nullable = true)
|-- definitions: struct (nullable = true)
| |-- Report: struct (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- click_url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- country: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- description: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- expiration_date: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- id: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- report_type: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- source_industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- title: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
|-- host: string (nullable = true)
|-- info: struct (nullable = true)
| |-- description: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- paths: struct (nullable = true)
| |-- /market_research_library/search: struct (nullable = true)
| | |-- get: struct (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- parameters: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- in: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- required: boolean (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | |-- responses: struct (nullable = true)
| | | | |-- 200: struct (nullable = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- schema: struct (nullable = true)
| | | | | | |-- items: struct (nullable = true)
| | | | | | | |-- $ref: string (nullable = true)
| | | | | | |-- type: string (nullable = true)
| | | |-- summary: string (nullable = true)
| | | |-- tags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- produces: array (nullable = true)
| |-- element: string (containsNull = true)
|-- schemes: array (nullable = true)
| |-- element: string (containsNull = true)
|-- swagger: string (nullable = true)
Upvotes: 1
Reputation: 4719
Converting your json to single line.
Or set option("multiLine", true)
to allow multiply line json.
Upvotes: 3