manshul goel

Reputation: 69

JSON to dataset in Spark

I need to convert a JSON file to a Dataset so that it can be loaded into Hive, and I am running into a problem.

Code 1

    SparkSession spark1 = SparkSession
                  .builder()
                  .appName("File_Validation")
                  .config("spark.some.config.option", "some-value")
                  .getOrCreate();
    Dataset<Row> df = spark1.read().json("input/sample.json");
    df.show();

The code above throws a NullPointerException, so I tried another way:

Code 2

    // "context" is assumed to be an existing JavaSparkContext
    JavaRDD<String> jsonFile = context.textFile("input/sample.json");
    Dataset<Row> df2 = spark1.read().json(jsonFile);
    df2.show();

Here I created an RDD and passed it to spark1 (the SparkSession).

Code 2 runs, but the JSON is not parsed. The output has a single column:

 +--------------------+
 |     _corrupt_record|
 +--------------------+

and the schema is: |-- _corrupt_record: string (nullable = true)

Please help me fix it.

Sample JSON

  {
    "user": "gT35Hhhre9m",
    "dates": ["2016-01-29", "2016-01-28"],
    "status": "OK",
    "reason": "some reason",
    "content": [{
        "foo": 123,
        "bar": "val1"
    }, {
        "foo": 456,
        "bar": "val2"
    }, {
        "foo": 789,
        "bar": "val3"
    }, {
        "foo": 124,
        "bar": "val4"
    }, {
        "foo": 126,
        "bar": "val5"
    }]
  }

Upvotes: 2

Views: 3829

Answers (2)

Pawan B

Reputation: 4623

By default, Spark cannot read pretty-printed (multi-line) JSON. Each JSON object should be on a single line, like this:

{"user": "gT35Hhhre9m","dates": ["2016-01-29", "2016-01-28"],"status": "OK","reason": "some reason","content": [{"foo": 123,"bar": "val1"}, {"foo": 456,"bar": "val2"}, {"foo": 789,"bar": "val3"}, {"foo": 124,"bar": "val4"}, {"foo": 126,"bar": "val5"}]}

Or the file can contain several objects, one per line, like this:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Upvotes: 0

T. Gawęda

Reputation: 16086

Your JSON should have one object per line. For example:

{ "property1": 1 }
{ "property1": 2 }

This will be read as a Dataset with two rows and one column.

From documentation:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

Also, read the data with SparkSession, as it will infer the schema.
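To see why Code 2 in the question ends up with _corrupt_record, it can help to check which lines of the file are not complete objects on their own, since each line is parsed separately. Below is a rough sketch in plain Java; `JsonLinesCheck` and its methods are hypothetical names, and the check is a heuristic (opening and closing brackets counted outside strings), not a JSON validator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-check (not a Spark API): reports the 1-based numbers of
// lines that do not look like a complete JSON object on their own.
final class JsonLinesCheck {

    // Heuristic only: the line must start with '{', end with '}', and its
    // nesting depth must return to zero exactly at the final character.
    static boolean looksLikeWholeObject(String line) {
        String s = line.trim();
        if (s.length() < 2 || s.charAt(0) != '{' || s.charAt(s.length() - 1) != '}') {
            return false;
        }
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
                // Reaching depth zero before the last character means trailing content.
                if (depth < 0 || (depth == 0 && i < s.length() - 1)) {
                    return false;
                }
            }
        }
        // An unterminated string or unclosed bracket also fails the check.
        return depth == 0 && !inString;
    }

    // Blank lines are skipped by this check; every other line is tested.
    static List<Integer> suspectLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (!line.trim().isEmpty() && !looksLikeWholeObject(line)) {
                bad.add(i + 1);
            }
        }
        return bad;
    }
}
```

Called with the lines of the pretty-printed sample from the question (for instance via Files.readAllLines), suspectLines would flag every line, because no single line there is a whole object. A file with one object per line should come back with an empty list.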

Upvotes: 4
