aa8y

Reputation: 3942

Distributed processing of JSON in Hadoop

I want to process a ~300 GB JSON file in Hadoop. As far as my understanding goes, a JSON file consists of a single string with data nested in it. Now if I want to parse the JSON string using Google's GSON, won't Hadoop have to put the entire load on a single node, since the JSON is not logically divisible for it?

How do I partition the file (I can make out the partitions logically by looking at the data) if I want it to be processed in parallel on different nodes? Do I have to break up the file before I load it onto HDFS? Is it absolutely necessary that the JSON is parsed by one machine (or node) at least once?

Upvotes: 2

Views: 3380

Answers (3)

Tariq

Reputation: 34184

You might find this JSON SerDe useful. It allows Hive to read and write in JSON format. If it works for you, it will be a lot more convenient to process your JSON data with Hive, since you won't have to worry about writing a custom InputFormat that reads your JSON data and creates splits for you.

Upvotes: 0

greedybuddha

Reputation: 7507

Assuming you can parse the JSON into logically separate components, you can accomplish this just by writing your own InputFormat.

Conceptually, you can think of each of the logically divisible JSON components as one "line" of data, where each component contains the minimal amount of information that can be acted on independently.

Then you will need to write a class, a FileInputFormat, that returns each of these JSON components as a record.

public class JSONInputFormat extends FileInputFormat<Text, JSONComponent> { ... }
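
For a fuller picture, here is a minimal sketch of what such an InputFormat could look like. It is not the poster's actual class: it assumes each JSON component already sits on its own line (as the other answers suggest), reuses Hadoop's LineRecordReader to do the split-aware reading, and hands the component to the mapper as plain Text (in place of the hypothetical JSONComponent type above), leaving the Gson parsing to the map function.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch only: each record is one input line, assumed to hold one self-contained JSON component.
public class JSONInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new JSONRecordReader();
    }

    public static class JSONRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            key.set(lineReader.getCurrentKey().toString()); // byte offset of the line
            value.set(lineReader.getCurrentValue());        // the JSON component itself
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException { return lineReader.getProgress(); }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }
}

With this in place, a Mapper<Text, Text, ...> receives one JSON component per call to map() and can deserialize it there with Gson.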

Upvotes: 1

Dmitry

Reputation: 2993

If you can logically divide your giant JSON into parts, do it, and save these parts as separate lines in a file (or as records in a sequence file). Then, if you feed this new file to Hadoop MapReduce, the mappers will be able to process the records in parallel.

So yes, the JSON has to be parsed by one machine at least once. This preprocessing phase doesn't need to be performed in Hadoop; a simple script can do the work. Use a streaming API to avoid loading a lot of data into memory.
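
As an illustration, here is a minimal sketch of such a preprocessing step using Gson's streaming JsonReader. It assumes the big file is a single top-level JSON array of records, and the file names are placeholders. The output has one compact JSON object per line, which the default TextInputFormat (or the custom one in the other answer) can then split across mappers.

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.stream.JsonReader;

// Sketch: stream a huge JSON array and write one record per line,
// never holding more than one record in memory at a time.
public class JsonToLines {
    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        try (JsonReader reader = new JsonReader(new InputStreamReader(
                     new FileInputStream("big.json"), StandardCharsets.UTF_8));        // placeholder input
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("records.jsonl"), StandardCharsets.UTF_8))) { // placeholder output
            reader.beginArray();
            while (reader.hasNext()) {
                JsonObject record = gson.fromJson(reader, JsonObject.class); // reads exactly one array element
                out.write(record.toString()); // compact, single-line JSON
                out.newLine();
            }
            reader.endArray();
        }
    }
}

The resulting file can then be copied to HDFS and processed with a plain TextInputFormat, with each mapper parsing its lines using Gson.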

Upvotes: 0
