Sam

Reputation: 12049

Pig: load data with .pig_schema schema file

How can I load a data file with a .pig_schema schema file in the same directory?

Neither the official Apache Pig documentation nor this answer properly explains what the different schema fields mean or what the data type values are.

Could someone give a better, more detailed example?

Upvotes: 1

Views: 1504

Answers (1)

Sam

Reputation: 12049

When you load data in Pig, you can optionally define its schema in a .pig_schema JSON file that sits in your data directory:

data/
├── data_file.csv
└── .pig_schema

If your data_file.csv looks like:

3,0,(mybytearray),{(1.7)},[wesam#2.9]
9,8,(mybytearray),{(0.6)},[elshamy#6.5]

and you use this .pig_schema file:

{
  "fields": [
    {
      "name": "myint",
      "type": 10
    },
    {
      "name": "mylong",
      "type": 15
    },
    {
      "name": "mytupe",
      "type": 110,
      "schema": {
        "fields": [
          {
            "name": "mybytearray",
            "type": 50
          }
        ]
      }
    },
    {
      "name": "mybag",
      "type": 120,
      "schema": {
        "fields": [
          {
            "name": "mytupe",
            "type": 110,
            "schema": {
              "fields": [
                {
                  "name": "myfloat",
                  "type": 20
                }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "mymap",
      "type": 100,
      "schema": {
        "fields": [
          {
            "name": null,
            "type": 25
          }
        ]
      }
    }
  ]
}

and load your data with this Pig script:

b = LOAD '/path/to/data' USING PigStorage(',');

your data will have the following schema:

b: {myint: int,mylong: long,mytupe: (mybytearray: bytearray),mybag: {mytupe: (myfloat: float)},mymap: map[double]}

In the .pig_schema JSON file, the value of the "fields" key is an array of all the fields you have in your data. Every field is defined by a JSON object with:

  • "name": the field name (e.g. "my_field").
  • "type": an integer code for the field's data type (e.g. 55 for chararray); the codes are listed below.
  • "schema" (optional): the nested schema for a complex type (tuple, bag, map), written in the same "fields" format.
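
As a rough illustration of that structure, here is a minimal Python sketch that walks a schema dict and reports fields that do not follow the shape described above. The function name check_fields and the exact rules it enforces are my own assumptions for illustration; this is not how Pig itself validates the file.

```python
def check_fields(schema, path="schema"):
    """Return a list of problems found in a .pig_schema-style dict.

    Hypothetical helper: the rules below only mirror the description
    in this answer, not Pig's own parsing logic.
    """
    fields = schema.get("fields")
    if not isinstance(fields, list):
        return [f"{path}: 'fields' must be an array"]

    problems = []
    for i, field in enumerate(fields):
        where = f"{path}.fields[{i}]"
        if not isinstance(field, dict):
            problems.append(f"{where}: must be a JSON object")
            continue

        # "name" may be null, as for the map value in the example above.
        name = field.get("name")
        if name is not None and not isinstance(name, str):
            problems.append(f"{where}: 'name' must be a string or null")

        # "type" is an integer code, not a type name.
        code = field.get("type")
        if not isinstance(code, int) or isinstance(code, bool):
            problems.append(f"{where}: 'type' must be an integer code")

        # "schema" is optional; when present it nests another "fields" array.
        nested = field.get("schema")
        if nested is not None:
            if isinstance(nested, dict):
                problems.extend(check_fields(nested, where + ".schema"))
            else:
                problems.append(f"{where}: 'schema' must be a JSON object")
    return problems


if __name__ == "__main__":
    good = {"fields": [{"name": "myint", "type": 10}]}
    bad = {"fields": [{"name": "myint", "type": "int"}]}
    print(check_fields(good))
    print(check_fields(bad))
```

To check a real file, you could load it with json.load and pass the resulting dict to check_fields.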

The "type" values for the different Pig data types are:

int       : 10
long      : 15
float     : 20
double    : 25
bytearray : 50
chararray : 55
map       : 100
tuple     : 110
bag       : 120
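
To sanity-check a hand-written schema against the table above, the following Python sketch turns a schema dict into text in the same style as the schema line shown earlier. It is only an approximation written for this answer: the names TYPE_NAMES, render_field and render_schema are hypothetical, and only the type codes listed above are handled.

```python
# Type codes exactly as listed in this answer.
TYPE_NAMES = {
    10: "int", 15: "long", 20: "float", 25: "double",
    50: "bytearray", 55: "chararray",
    100: "map", 110: "tuple", 120: "bag",
}


def render_field(field):
    """Render one field object as 'name: type' (or just 'type' if unnamed)."""
    code = field["type"]
    nested = field.get("schema") or {}
    inner = ",".join(render_field(f) for f in nested.get("fields", []))

    if code == 110:        # tuple: nested fields in parentheses
        body = "(" + inner + ")"
    elif code == 120:      # bag: nested fields in braces
        body = "{" + inner + "}"
    elif code == 100:      # map: value type in square brackets
        body = "map[" + inner + "]"
    else:                  # simple type: look up its name
        body = TYPE_NAMES[code]

    name = field.get("name")
    return f"{name}: {body}" if name else body


def render_schema(schema, alias):
    """Render a whole schema dict, prefixed with the relation alias."""
    inner = ",".join(render_field(f) for f in schema["fields"])
    return alias + ": {" + inner + "}"


# The example schema from this answer, written as a Python dict.
EXAMPLE = {"fields": [
    {"name": "myint", "type": 10},
    {"name": "mylong", "type": 15},
    {"name": "mytupe", "type": 110, "schema": {"fields": [
        {"name": "mybytearray", "type": 50}]}},
    {"name": "mybag", "type": 120, "schema": {"fields": [
        {"name": "mytupe", "type": 110, "schema": {"fields": [
            {"name": "myfloat", "type": 20}]}}]}},
    {"name": "mymap", "type": 100, "schema": {"fields": [
        {"name": None, "type": 25}]}},
]}

if __name__ == "__main__":
    print(render_schema(EXAMPLE, "b"))
```

For the example schema this prints the same line as the schema shown above. An unknown type code raises a KeyError, which is a quick way to spot a typo in a "type" value.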

Upvotes: 2
