Reputation: 69
I am facing an issue for which I am seeking your help. I have a task to convert a JSON file to a Dataset so that it can be loaded into Hive.
SparkSession spark1 = SparkSession
.builder()
.appName("File_Validation")
.config("spark.some.config.option", "some-value")
.getOrCreate();
Dataset<Row> df = spark1.read().json("input/sample.json");
df.show();
The above code throws a NullPointerException. I tried another way:
JavaRDD<String> jsonFile = context.textFile("input/sample.json");
Dataset<Row> df2 = spark1.read().json(jsonFile);
df2.show();
Here I created an RDD and passed it to spark1 (the SparkSession).
This second attempt reads the JSON into a different format, with the header
+--------------------+
| _corrupt_record|
+--------------------+
with the schema: |-- _corrupt_record: string (nullable = true)
Please help me fix it.
Sample JSON
{
"user": "gT35Hhhre9m",
"dates": ["2016-01-29", "2016-01-28"],
"status": "OK",
"reason": "some reason",
"content": [{
"foo": 123,
"bar": "val1"
}, {
"foo": 456,
"bar": "val2"
}, {
"foo": 789,
"bar": "val3"
}, {
"foo": 124,
"bar": "val4"
}, {
"foo": 126,
"bar": "val5"
}]
}
Upvotes: 2
Views: 3829
Reputation: 4623
You can't read pretty-printed (formatted) JSON in Spark; your JSON should be on a single line, like this:
{"user": "gT35Hhhre9m","dates": ["2016-01-29", "2016-01-28"],"status": "OK","reason": "some reason","content": [{"foo": 123,"bar": "val1"}, {"foo": 456,"bar": "val2"}, {"foo": 789,"bar": "val3"}, {"foo": 124,"bar": "val4"}, {"foo": 126,"bar": "val5"}]}
Or it can be in JSON Lines format (one object per line), like this:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
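If you want to keep the pretty-printed source file, one workaround is to collapse it onto a single line before handing it to Spark. A minimal sketch using only the JDK; the class name FlattenJson and the output path are my own, and the quote tracking is naive (it does not handle every escape sequence, e.g. an escaped backslash immediately before a quote):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FlattenJson {
    // Collapse a pretty-printed JSON document onto one line so that
    // Spark's line-oriented JSON reader can parse it as a single record.
    static String flatten(String prettyJson) {
        StringBuilder out = new StringBuilder();
        boolean inString = false;
        for (int i = 0; i < prettyJson.length(); i++) {
            char c = prettyJson.charAt(i);
            // Track whether we are inside a string literal, skipping
            // escaped quotes (naive: ignores escaped backslashes).
            if (c == '"' && (i == 0 || prettyJson.charAt(i - 1) != '\\')) {
                inString = !inString;
            }
            // Drop newline characters outside of string literals;
            // remaining indentation spaces are still valid JSON.
            if (!inString && (c == '\n' || c == '\r')) {
                continue;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get("input/sample.json");
        if (Files.exists(in)) {
            String pretty = new String(Files.readAllBytes(in));
            // Hypothetical output path; point Spark at this file instead.
            Files.write(Paths.get("input/sample-flat.json"),
                        flatten(pretty).getBytes());
        }
    }
}
```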
Upvotes: 0
Reputation: 16086
Your JSON should follow the JSON Lines convention: one complete JSON object per line. For example:
{ "property1": 1 }
{ "property1": 2 }
This will be read as a Dataset with 2 rows and one column.
From documentation:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
Of course, read the data with SparkSession, as it will infer the schema.
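Alternatively, Spark 2.2 and later can read a regular multi-line JSON file directly via the multiLine reader option, so the original pretty-printed file works as-is. A short sketch against the question's file, assuming Spark 2.2+ and a local run:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadMultiLineJson {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("File_Validation")
                .master("local[*]") // local master for illustration only
                .getOrCreate();

        // multiLine tells Spark to parse the whole file as one JSON
        // document instead of one JSON object per line (Spark 2.2+).
        Dataset<Row> df = spark.read()
                .option("multiLine", true)
                .json("input/sample.json");

        df.printSchema();
        df.show();
        spark.stop();
    }
}
```

On versions before 2.2 this option is not available, which is why both answers above recommend reshaping the file into single-line or JSON Lines form.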
Upvotes: 4