Polymerase

Reputation: 6811

Pig LOAD: How to mix scalar and map datatypes?

Using Apache Pig version 0.10.1.21 (rexported)

Content of data sample file:

AtomicNumber,ElementName,Symbol,AtomicMass,PropertyMap
46,Palladium,Pd,106.42,[P#46,N#60,Struc#Cubic]
49,Indium,In,114.818,[P#49,N#66,Struc#Tetragonal]
52,Tellurium,Te,127.6,[P#52,N#76,Struc#Hexagonal]
86,Radon,222.0,Rn,[P#86,N#136,Struc#Cubic]
38,Strontium,Sr,87.62,[P#38,N#50,Struc#Cubic]
Plutonium,94,Pu,244.0,[P#94,N#150,Struc#Monoclinic]

NOTE: Some columns are swapped intentionally (for Radon and Plutonium) to see how Pig handles datatype mismatches.

Pig script:

AtomElem = LOAD 'data/Atoms.txt' USING PigStorage(',') AS (AtomicNumber:int, ElementName:chararray, Symbol:chararray, AtomicMass:float, PropertyMap:map[]);
DUMP AtomElem;

Results:

(,ElementName,Symbol,,)
(46,Palladium,Pd,106.42,)
(49,Indium,In,114.818,)
(52,Tellurium,Te,127.6,)
(86,Radon,222.0,,)
(38,Strontium,Sr,87.62,)
(,94,Pu,244.0,)

Question 1: I was hoping that the PropertyMap would be displayed. Can you please show me how to modify either the Pig script or the data file so that the PropertyMap column is displayed as a map datatype?

Question 2: In the declaration of the map schema, I would like to strongly type the values. I declared the schema as PropertyMap:map[int, int, chararray], but Pig rejected the syntax (error at the comma: right bracket expected). Is it possible to declare a map having several keys? If yes, what should the schema declaration look like?

Thanks in advance for any help.

Upvotes: 1

Views: 514

Answers (2)

Eli

Reputation: 39009

Personally, I'd store all the data as JSON and load from that. When your data set contains complicated data structures, JSON makes things much easier to manage for you and for anyone who works on this afterwards, because it is a much more straightforward standard for nested structures than loading maps and the like in Pig.

I think @WinnieNicklaus' answer should work as well, but whoever works on this next will run into the same problem, as will you the next time you need to add something complicated to your data. Just store everything as JSON and load it with Pig's built-in JSON loader:

a = load 'a.json' using 
JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');

If you don't want to provide a schema, you can also load using ElephantBird's JSON loader:

loaded = LOAD '/path/to/some_file.json'
    using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Both of those should work fine with .gz files, if I remember correctly, and the ElephantBird version works with LZO-compressed files as well.

Upvotes: 0

reo katoa

Reputation: 5811

The reason your script does not successfully produce the map is that you have used a comma as your field delimiter, but the comma is also the delimiter between map key-value pairs. So when Pig splits your line into fields, the fifth field is not [P#46,N#60,Struc#Cubic], as you expect, but rather [P#46. Pig cannot parse this as a map, so it is converted to NULL.
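
One way to see the split for yourself, as a minimal sketch: load the file with no schema, so that every comma-separated piece becomes its own field, then write the trailing fields back out with a different delimiter. The aliases raw and pieces and the output path split_check are placeholder names chosen for illustration, and '|' is just an arbitrary delimiter that does not occur in the data.

```pig
-- No AS clause: PigStorage splits on every comma, including those inside [...]
raw = LOAD 'data/Atoms.txt' USING PigStorage(',');

-- Positions 4, 5 and 6 hold the three pieces of what was meant to be one map
pieces = FOREACH raw GENERATE $4, $5, $6;

-- Store with '|' so the field boundaries are visible in the output files
STORE pieces INTO 'split_check' USING PigStorage('|');
```

If the map were arriving as a single field, the second and third output columns would be empty; with a comma delimiter, each map entry should instead land in its own column.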

As to your second question, you cannot specify the datatypes of individual map values. For one thing, order means nothing in a map, and a map can have any number of elements. You can declare a single datatype for all of the values (for example, map[chararray]); beyond that, either Pig will figure out what type a value is or you will need to explicitly cast the value when you use it.

To illustrate both of these points, I have modified your input data to be tab-delimited (and accordingly updated the script to be USING PigStorage('\t')), and reordered the map elements in the second data row to show that Pig does not reproduce the order they were provided in.

$ cat data.txt
AtomicNumber    ElementName     Symbol  AtomicMass      PropertyMap
46      Palladium       Pd      106.42  [P#46,N#60,Struc#Cubic]
49      Indium  In      114.818 [N#66,Struc#Tetragonal,P#49]
52      Tellurium       Te      127.6   [P#52,N#76,Struc#Hexagonal]
86      Radon   222.0   Rn      [P#86,N#136,Struc#Cubic]
38      Strontium       Sr      87.62   [P#38,N#50,Struc#Cubic]
Plutonium       94      Pu      244.0   [P#94,N#150,Struc#Monoclinic]

$ cat test.pig
AtomElem = LOAD 'data.txt' USING PigStorage('\t') AS (AtomicNumber:int, ElementName:chararray, Symbol:chararray, AtomicMass:float, PropertyMap:map[]);
DUMP AtomElem;

$ pig -x local test.pig
(,ElementName,Symbol,,)
(46,Palladium,Pd,106.42,[P#46,N#60,Struc#Cubic])
(49,Indium,In,114.818,[P#49,N#66,Struc#Tetragonal])
(52,Tellurium,Te,127.6,[P#52,N#76,Struc#Hexagonal])
(86,Radon,222.0,,[P#86,N#136,Struc#Cubic])
(38,Strontium,Sr,87.62,[P#38,N#50,Struc#Cubic])
(,94,Pu,244.0,[P#94,N#150,Struc#Monoclinic])
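
The single-value-type declaration and the explicit cast mentioned above can be sketched as follows. This assumes the tab-delimited data.txt shown here and a Pig version that accepts the map[type] form; the alias Props and the output field names Protons, Neutrons and Structure are made up for illustration.

```pig
-- Declare one value type (chararray) for every entry in the map
AtomElem = LOAD 'data.txt' USING PigStorage('\t')
    AS (AtomicNumber:int, ElementName:chararray, Symbol:chararray,
        AtomicMass:float, PropertyMap:map[chararray]);

-- Look values up by key with #, casting the numeric ones where they are used
Props = FOREACH AtomElem GENERATE
    ElementName,
    (int)PropertyMap#'P' AS Protons,
    (int)PropertyMap#'N' AS Neutrons,
    PropertyMap#'Struc' AS Structure;

DUMP Props;
```

Because values are looked up by key rather than by position, the order of the entries inside the brackets does not matter here. A key that is missing from a row's map simply yields null for that row.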

Upvotes: 1
