Reputation: 91
As far as I know, Apache Spark requires a JSON file to have one record per line. I have a JSON file whose records are split across multiple lines, like this:
{"id": 123,
"name": "Aaron",
"city": {
"id" : 1,
"title": "Berlin"
}}
{"id": 125,
"name": "Bernard",
"city": {
"id" : 2,
"title": "Paris"
}}
{...many more lines
...}
How can I parse it using Spark? Do I need a preprocessor, or can I provide a custom splitter?
Upvotes: 3
Views: 381
Reputation: 13001
Spark splits input on newlines to distinguish records. This means that with the standard JSON reader you need one record per line.
You can convert by doing something like in this answer: https://stackoverflow.com/a/30452120/1547734
The basic idea is to read each file in full with wholeTextFiles, feed the contents to a JSON parser, and flatMap the resulting records.
Of course, this assumes each file is small enough to fit in memory and be parsed in one go. Otherwise you would need a more complicated solution.
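As a rough sketch of that idea (the helper name `parse_records` is my own, not from the linked answer): split the file's text into individual JSON objects with `json.JSONDecoder.raw_decode`, then flatMap that over `wholeTextFiles` in Spark.

```python
import json

def parse_records(text):
    """Split a string of concatenated multi-line JSON objects
    into a list of parsed records."""
    decoder = json.JSONDecoder()
    records = []
    text = text.strip()
    pos = 0
    while pos < len(text):
        # raw_decode returns the parsed object and the index
        # just past it, so we can walk through the string.
        obj, end = decoder.raw_decode(text, pos)
        records.append(obj)
        # skip whitespace/newlines between records
        while end < len(text) and text[end].isspace():
            end += 1
        pos = end
    return records

sample = """{"id": 123,
"name": "Aaron",
"city": {"id": 1, "title": "Berlin"}}
{"id": 125,
"name": "Bernard",
"city": {"id": 2, "title": "Paris"}}"""

for rec in parse_records(sample):
    print(rec["id"], rec["city"]["title"])

# In Spark you would apply this per file, e.g.:
#   sc.wholeTextFiles("path/to/json/dir") \
#     .flatMap(lambda kv: parse_records(kv[1]))
```

Each element of `wholeTextFiles` is a `(path, contents)` pair, hence the `kv[1]` in the flatMap.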
Upvotes: 2