Istvan

Reputation: 8582

How to convert a 10G JSON file to Avro?

I have a JSON file of roughly 10 GB in which each line contains exactly one JSON document. What is the best way to convert it to Avro? Ideally I would like to keep many documents (around 10M) per output file; as far as I know, Avro supports storing multiple documents in the same file.

Upvotes: 0

Views: 1890

Answers (2)

Istvan

Reputation: 8582

The easiest way to convert a large JSON file to Avro is to use avro-tools, which is available from the Avro website.

After creating a simple schema, the file can be converted directly:

java -jar avro-tools-1.7.7.jar fromjson --schema-file cpc.avsc --codec deflate test.1g.json > test.1g.deflate.avro

The example schema:

{
        "type": "record",
        "name": "cpc_schema",
        "namespace": "com.streambright.avro",
        "fields": [{
                "name": "section",
                "type": "string",
                "doc": "Section of the CPC"
        }, {
                "name": "class",
                "type": "string",
                "doc": "Class of the CPC"
        }, {
                "name": "subclass",
                "type": "string",
                "doc": "Subclass of the CPC"
        }, {
                "name": "main_group",
                "type": "string",
                "doc": "Main-group of the CPC"
        }, {
                "name": "subgroup",
                "type": "string",
                "doc": "Subgroup of the CPC"
        }, {
                "name": "classification_value",
                "type": "string",
                "doc": "Classification value of the CPC"
        }, {
                "name": "doc_number",
                "type": "string",
                "doc": "Patent doc_number"
        }, {
                "name": "updated_at",
                "type": "string",
                "doc": "Document update time"
        }],
        "doc": "A basic schema for CPC codes"
}
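After the conversion it can be worth checking that no records were lost. The following is only a sketch, assuming the input really has one JSON document per line and that the avro-tools tojson command prints one record per line (its default when --pretty is not given). The function name check_record_count is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of JSON lines with the number of Avro records.
# Assumption: exactly one JSON document per line, with a trailing newline.

check_record_count() {
    # $1 = avro-tools jar, $2 = source JSON file, $3 = converted Avro file
    local json_count avro_count
    json_count=$(wc -l < "$2" | tr -d ' ')
    # tojson dumps the Avro file back to JSON, one record per line
    avro_count=$(java -jar "$1" tojson "$3" | wc -l | tr -d ' ')
    if [ "$json_count" -eq "$avro_count" ]; then
        echo "OK: $json_count records"
    else
        echo "MISMATCH: $json_count JSON lines, $avro_count Avro records" >&2
        return 1
    fi
}
```

With the file names from the command above, this would be called as: check_record_count avro-tools-1.7.7.jar test.1g.json test.1g.deflate.avro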

Upvotes: 0

Matthieu

Reputation: 4731

You should be able to use the avro-tools fromjson command (see here for more information and examples). You will probably want to split your file into 10M chunks beforehand, for example with split(1).
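A rough sketch of the split-then-convert approach, assuming "10M" means 10,000,000 documents (lines) per chunk. The function names, the chunk_ prefix and the file names in the usage line are placeholders, not anything prescribed by avro-tools. Because each line is one document, splitting by line count with split -l never cuts a document in half.

```shell
#!/usr/bin/env bash
# Sketch: split a line-delimited JSON file, then convert each chunk with avro-tools.
# Assumption: one JSON document per line, so line-based splitting is safe.

split_json_lines() {
    # $1 = input file, $2 = lines (documents) per chunk, $3 = output prefix
    split -l "$2" "$1" "$3"
}

convert_chunks() {
    # $1 = avro-tools jar, $2 = schema file, $3 = chunk prefix used above
    local chunk
    for chunk in "$3"*; do
        [ -e "$chunk" ] || continue               # no chunks found
        case "$chunk" in *.avro) continue ;; esac # skip earlier output
        java -jar "$1" fromjson --schema-file "$2" --codec deflate \
            "$chunk" > "$chunk.avro"
    done
}
```

Usage would look like: split_json_lines test.json 10000000 chunk_ followed by convert_chunks avro-tools-1.7.7.jar cpc.avsc chunk_. If the chunks should instead be limited by size, GNU split also offers -C (for example -C 10m), which caps the bytes per file while keeping lines whole; other split implementations may not have that option.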

Upvotes: 3
