Reputation: 12049
How can I load a data file with a .pig_schema schema file in the same directory?
The official Apache Pig documentation and this answer lack any proper explanation of what the different schema fields mean or what the different data type values are.
Could someone give a better, more detailed example?
Upvotes: 1
Views: 1504
Reputation: 12049
When you load data in Pig, you can optionally define its schema in a .pig_schema JSON file that sits in your data directory:
data/
├── data_file.csv
└── .pig_schema
If your data_file.csv looks like:
3,0,(mybytearray),{(1.7)},[wesam#2.9]
9,8,(mybytearray),{(0.6)},[elshamy#6.5]
and you use this .pig_schema file:
{
  "fields": [
    {
      "name": "myint",
      "type": 10
    },
    {
      "name": "mylong",
      "type": 15
    },
    {
      "name": "mytupe",
      "type": 110,
      "schema": {
        "fields": [
          {
            "name": "mybytearray",
            "type": 50
          }
        ]
      }
    },
    {
      "name": "mybag",
      "type": 120,
      "schema": {
        "fields": [
          {
            "name": "mytupe",
            "type": 110,
            "schema": {
              "fields": [
                {
                  "name": "myfloat",
                  "type": 20
                }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "mymap",
      "type": 100,
      "schema": {
        "fields": [
          {
            "name": null,
            "type": 25
          }
        ]
      }
    }
  ]
}
and load your data with this Pig script:
b = LOAD '/path/to/data' USING PigStorage(',');
your data will have the following schema:
b: {myint: int,mylong: long,mytupe: (mybytearray: bytearray),mybag: {mytupe: (myfloat: float)},mymap: map[double]}
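Hand-writing the nested JSON gets error-prone for bags and maps, so here is a minimal Python sketch that generates the same structure. The helper names `field` and `write_pig_schema` are hypothetical (they are not part of Pig), and the sketch assumes that a file containing only the "fields" key is enough, as in the example above.

```python
import json
import os

# Type codes as listed in this answer.
INT, LONG, FLOAT, DOUBLE = 10, 15, 20, 25
BYTEARRAY, CHARARRAY = 50, 55
MAP, TUPLE, BAG = 100, 110, 120


def field(name, type_code, *subfields):
    """Build one entry of a "fields" array (hypothetical helper).

    Any subfields are wrapped in a nested "schema" object, which is how
    tuples, bags, and maps describe their contents in the example above.
    """
    entry = {"name": name, "type": type_code}
    if subfields:
        entry["schema"] = {"fields": list(subfields)}
    return entry


def write_pig_schema(data_dir, fields):
    """Write {"fields": [...]} to <data_dir>/.pig_schema and return the path."""
    path = os.path.join(data_dir, ".pig_schema")
    with open(path, "w") as f:
        json.dump({"fields": fields}, f, indent=2)
    return path


# The same schema as the hand-written JSON in this answer.
example_fields = [
    field("myint", INT),
    field("mylong", LONG),
    field("mytupe", TUPLE, field("mybytearray", BYTEARRAY)),
    field("mybag", BAG, field("mytupe", TUPLE, field("myfloat", FLOAT))),
    field("mymap", MAP, field(None, DOUBLE)),  # map value type, name is null
]
```

Calling `write_pig_schema("/path/to/data", example_fields)` would place the file next to data_file.csv, where the LOAD statement above expects it.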
In the .pig_schema JSON file, the value of the "fields" key is an array of all the fields you have in your data. Every field is defined by a JSON object with:
"name" : field name (e.g. "my_field").
"type" : integer representing the type of the field (e.g. 55) (see type values below).
"schema" : [optional] defines schema for complex types (tuple, bag, map).
The "type" values for the different Pig data types are:
int : 10
long : 15
float : 20
double : 25
bytearray : 50
chararray : 55
map : 100
tuple : 110
bag : 120
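To sanity-check a .pig_schema file without running Pig, the type codes above can be decoded back into a readable schema string. This is a rough Python sketch: `TYPE_NAMES`, `describe_field`, and `describe` are hypothetical names, and the output format only imitates the schema line shown earlier in this answer. It is not Pig's own implementation.

```python
# Type codes from the list above, mapped back to Pig type names.
TYPE_NAMES = {
    10: "int", 15: "long", 20: "float", 25: "double",
    50: "bytearray", 55: "chararray",
    100: "map", 110: "tuple", 120: "bag",
}


def describe_field(f):
    """Render one field object as 'name: type', recursing into "schema"."""
    code = f["type"]
    # Map value fields have a null name, so they get no 'name: ' prefix.
    prefix = f"{f['name']}: " if f.get("name") is not None else ""
    inner = ",".join(
        describe_field(sub) for sub in f.get("schema", {}).get("fields", [])
    )
    if code == 110:    # tuple
        body = "(" + inner + ")"
    elif code == 120:  # bag
        body = "{" + inner + "}"
    elif code == 100:  # map
        body = "map[" + inner + "]"
    else:              # simple type
        body = TYPE_NAMES[code]
    return prefix + body


def describe(schema, alias):
    """Render a parsed .pig_schema dict in a DESCRIBE-like format."""
    fields = ",".join(describe_field(f) for f in schema["fields"])
    return alias + ": {" + fields + "}"
```

Passing the parsed JSON from the example (for instance via `json.load`) together with the alias `"b"` should reproduce the schema line shown earlier, which makes typos in type codes easy to spot.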
Upvotes: 2