Aguinore

Reputation: 91

Apache Spark: parsing JSON with records split across lines

As far as I know, Apache Spark requires a JSON file to have exactly one record per line. My JSON file has records split across multiple lines, like this:

{"id": 123,
"name": "Aaron",
"city": {
    "id" : 1,
    "title": "Berlin"
}}
{"id": 125,
"name": "Bernard",
"city": {
    "id" : 2,
    "title": "Paris"
}}
{...many more lines
...}

How can I parse it using Spark? Do I need a preprocessor, or can I provide a custom splitter?

Upvotes: 3

Views: 381

Answers (1)

Assaf Mendelson

Reputation: 13001

Spark splits its input on newlines to distinguish records. This means that with the standard JSON reader you need exactly one record per line.

You can convert the data by doing something like in this answer: https://stackoverflow.com/a/30452120/1547734

The basic idea would be to read the input with wholeTextFiles, parse each file's contents into individual records, and flatMap the results before handing them to the JSON reader.

Of course, this assumes each file is small enough to fit in memory and be parsed in one pass. Otherwise you would need a more complicated solution.
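Here is a minimal sketch of that idea in Scala, using Jackson (which ships with Spark) to stream the concatenated top-level objects out of each file; the input path, app name, and object name are placeholders:

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConverters._

object SplitJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-json").getOrCreate()
    import spark.implicits._

    // Read each file as a single (path, contents) pair so that records
    // spanning several lines stay together.
    val files = spark.sparkContext.wholeTextFiles("/path/to/json/dir")

    // Jackson's MappingIterator walks the concatenated top-level objects
    // in each file; re-serializing each node puts one record on one line.
    val oneRecordPerLine = files.flatMap { case (_, contents) =>
      val mapper = new ObjectMapper()
      val parser = mapper.getFactory.createParser(contents)
      mapper.readValues(parser, classOf[JsonNode]).asScala.map(_.toString)
    }

    // The standard JSON reader can now infer the schema as usual.
    val df = spark.read.json(oneRecordPerLine.toDS())
    df.show()
  }
}

Creating the ObjectMapper inside the flatMap keeps the closure serializable, and emitting plain strings means schema inference is still left entirely to Spark's JSON reader.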

Upvotes: 2
