Reputation: 6588
I have to format data in a flat file before it gets loaded into a Hive table.
CF32|4711|00010101Z| +34.883| 98562AS1D |N8594ãä| 00 | 2
The file is pipe-separated and I need to apply different cleaning and formatting functions to the different columns in the flat file. I have multiple functions such as Clean_Text, Format_Date, Format_TimeStamp, Format_Integer, etc.
My idea is to pass the schema as a constructor argument to my UDF and call the different functions on the flat file in Pig.
A = LOAD 'call_detail_records' USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;
REGISTER ZPigUdfs.jar;
DEFINE DFormat com.zna.pig.udf.DataColumnFormatter(A);
B = FOREACH A GENERATE DFormat($0);
DUMP B;
But how can I pass the schema? DUMP A actually dumps the entire table, but I need only the metadata. My current UDF pseudo-code looks like:
public class DataColumnFormatter extends EvalFunc&lt;String&gt; {

    private Tuple schema;

    public DataColumnFormatter(Tuple schema) {
        this.schema = schema;
    }

    @Override
    public String exec(Tuple inputTuple) throws IOException {
        if (inputTuple != null && inputTuple.size() > 0) {
            String inpString = inputTuple.get(0).toString();
            System.out.println(inpString);
            System.out.println(schema);

            /*
             * Logic for splitting the string on pipes and applying
             * functions based on the positions in the schema:
             *
             * if (schema[i] -> DATE) { formatDate(input); }
             * else if (schema[i] -> INT) { formatInt(input); }
             */
        }
        return null;
    }
}
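The dispatch logic sketched in the comment above can be prototyped in plain Java without any Pig dependencies. This is only an illustrative sketch: the type names (`TEXT`, `INT`) and formatter behavior are hypothetical stand-ins for the real Clean_Text / Format_Integer functions.

```java
// Hypothetical sketch: split a pipe-separated record and format each
// field according to a parallel array of type names.
public class ColumnDispatchSketch {

    static String formatField(String raw, String type) {
        String value = raw.trim();
        switch (type) {
            case "INT":
                // normalize integers (e.g. strip leading zeros)
                return String.valueOf(Long.parseLong(value));
            case "TEXT":
                // drop non-printable / non-ASCII characters
                return value.replaceAll("[^\\x20-\\x7E]", "");
            default:
                return value;
        }
    }

    static String cleanRecord(String record, String[] types) {
        // -1 keeps trailing empty fields
        String[] fields = record.split("\\|", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) out.append('|');
            out.append(formatField(fields[i], types[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] types = {"TEXT", "INT", "INT"};
        System.out.println(cleanRecord("N8594ãä| 00 | 2", types));
        // prints: N8594|0|2
    }
}
```

The same per-position lookup would run inside `exec()` once the schema is available there.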
How can I get the schema in a Pig UDF, or is there an alternative way to achieve this?
Thanks in advance.
Upvotes: 1
Views: 422
Reputation: 7111
From within your EvalFunc you can call this.getInputSchema() (available at least since Pig v0.12, maybe earlier). You shouldn't need to do anything special to pass in the schema, and since you loaded from HCatalog, A will already be decorated with it.
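A minimal sketch of what that could look like inside the UDF (assumes the Pig jars on the classpath; the printed output is just for inspection):

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class DataColumnFormatter extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Schema of the relation this UDF is applied to; Pig populates
        // it automatically, and HCatLoader supplies the column metadata.
        Schema schema = getInputSchema();
        if (schema != null) {
            for (Schema.FieldSchema field : schema.getFields()) {
                // field.alias is the column name, field.type the Pig type
                System.out.println(field.alias + " : " + field.type);
            }
        }
        return (input == null || input.size() == 0)
                ? null
                : input.get(0).toString();
    }
}
```

No constructor argument is needed, so the DEFINE line reduces to DEFINE DFormat com.zna.pig.udf.DataColumnFormatter();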
Alternatively, you could consider breaking out separate UDFs for each data type, something like:
B = FOREACH A GENERATE dateFormat($0), cleanText($1), dateFormat($2);
Upvotes: 1