Reputation: 95
I have a requirement to convert an existing ETL flow to a Pig script. I am trying to keep my dimensions in HDFS so that I don't need to look at the database when I load my fact table. The problem I'm facing is that I'm not able to create surrogate keys for new dimension records. Is it possible to create a surrogate key in Pig? If yes, please let me know how.
Thanks Selvam
Upvotes: 2
Views: 2272
Reputation: 2333
The primary mechanism I've used to create surrogate keys with Pig is the DataFu hash UDFs, which return an MD5 or SHA value. As long as the source data used to derive the hash is unique, the hash should also be unique.
In cases where I've generated a surrogate key from a non-unique source, I've usually created a ranking of the data using RANK <alias>
, and then used that field as part of the hash.
Example 1:
Simple hash created from 2 concatenated fields. Notice in the results that there's a duplicate hash. That's because the record "30000,sydney,joseph" appears twice in the source dataset.
REGISTER datafu-1.2.0.jar;
DEFINE SHA datafu.pig.hash.SHA();
S1 = LOAD 'surrogate_hash' USING PigStorage('|') AS (c1:chararray,c2:chararray,c3:chararray);
S2 = FOREACH S1 GENERATE SHA((chararray)CONCAT(c2,c3)),c1,c2,c3;
dump S2;
Results 1:
(291fe24ea7fcc35113d5bf1ea582334f8fe60b7a8028d0a5ae27c207ae2faa1,20000,newyork,john)
(518afaee8ca15c30f21ed0da1b1db89cf01d92ac5416e376c9e529ccbd71550d,30000,sydney,joseph)
(db7bbb6227e6643058bc1a343f60fef4eaa6e5490ff2701a6ec75a1af06a6419,60000,delhi,mike)
(518afaee8ca15c30f21ed0da1b1db89cf01d92ac5416e376c9e529ccbd71550d,30000,sydney,joseph)
Example 2:
Same basic logic as Example 1, but this time placing a simple unordered rank on the data. RANK without an ORDER BY clause assigns each record a sequential number, which provides a unique incremental field that can be folded into the hash. Notice in the results that all of the hash keys are now unique.
REGISTER datafu-1.2.0.jar;
DEFINE SHA datafu.pig.hash.SHA();
S1 = LOAD 'surrogate_hash' USING PigStorage('|') AS (c1:chararray,c2:chararray,c3:chararray);
S2 = RANK S1;
S3 = FOREACH S2 GENERATE SHA((chararray)CONCAT((chararray)rank_S1,c3)),c1,c2,c3;
dump S3;
Results 2:
(3729ce50c28bdfecd94b528de9482d5bd72b07a24a82a47946de7afceea35bee,20000,newyork,john)
(7c0f39f39e5ffff2153634025d87c44550968fde2386f42bfcdb9febebae15de,30000,sydney,joseph)
(777372af10efb2d55fd0e122cfc69be6884395520dbb20b39ec4d53923611728,60000,delhi,mike)
(eb0cffcedab1e5703d85d80684f8f6314c92011b5435b3fa18c03b385254c6bf,30000,sydney,joseph)
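One caveat with the examples above: a plain CONCAT of fields can collide even when the field values differ (e.g. CONCAT('ab','c') and CONCAT('a','bc') produce the same string). A defensive variant is to insert a delimiter between the fields before hashing. The sketch below assumes the same sample file and jar as the examples above; MD5 is shown simply as the shorter of DataFu's hash UDFs, not as a recommendation over SHA.

```
REGISTER datafu-1.2.0.jar;
DEFINE MD5 datafu.pig.hash.MD5();

S1 = LOAD 'surrogate_hash' USING PigStorage('|') AS (c1:chararray,c2:chararray,c3:chararray);
-- Join the fields with a '|' separator so different field splits
-- cannot produce the same concatenated input string
S2 = FOREACH S1 GENERATE MD5(CONCAT(CONCAT(c2,'|'),c3)), c1, c2, c3;
dump S2;
```

The nested CONCAT is used because older Pig versions only accept two arguments to CONCAT; on recent versions you can write CONCAT(c2,'|',c3) directly.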
Upvotes: 2