Abhi

Reputation: 6588

Table Schema inside PIG UDF

I have to format the data in a flat file before it is loaded into a Hive table.

CF32|4711|00010101Z| +34.883|  98562AS1D |N8594ãä| 00   | 2

The file is pipe-separated, and I need to apply different cleaning and formatting functions to the different columns in the flat file. I have multiple functions: Clean_Text, Format_Date, Format_TimeStamp, Format_Integer, etc.

My idea is to pass the schema as a constructor argument to my UDF and call the appropriate function for each column of the flat file in Pig.

A = LOAD 'call_detail_records'  USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;

REGISTER ZPigUdfs.jar;
DEFINE DFormat com.zna.pig.udf.DataColumnFormatter(A);

B = FOREACH A GENERATE DFormat($0);
DUMP B;

But how can I pass the schema? DUMP A dumps the entire table, but I need only the metadata. My current UDF pseudocode looks like this:

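import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DataColumnFormatter extends EvalFunc<String> {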

private Tuple schema;

public DataColumnFormatter(Tuple schema) {
    this.schema = schema;
}

@Override
public String exec(Tuple inputTuple) throws IOException {

    if (inputTuple != null && inputTuple.size() > 0) {
        String inpString = inputTuple.get(0).toString();
        System.out.println(inpString);
        System.out.println(schema);

        /**
         * Logic for splitting the string as pipe and apply functions based
         * on positions of schema if(schema[1] -> date ){
         * 
         * formatDate(input) }else if(schema[1] -> INT ){
         * 
         * formatInt(input); }
         * 
         */

    }

    return null;
}

}

How can I get the schema inside a Pig UDF, or is there an alternative way to achieve this?

Thanks in advance.

Upvotes: 1

Views: 422

Answers (1)

G__

Reputation: 7111

From within your EvalFunc you can call this.getInputSchema() (at least since Pig v0.12, maybe earlier). You shouldn't need to do anything special to pass in the schema, and since you loaded from HCatalog, A will already carry its schema.

Alternatively, you could consider breaking the logic out into a separate UDF for each data type. Something like B = FOREACH A GENERATE dateFormat($0), cleanText($1), dateFormat($2);
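
As a rough illustration of the per-column dispatch described in the question, here is a minimal plain-Java sketch that does not depend on Pig. The class name ColumnFormatters, the Kind enum and the three formatting rules are all hypothetical stand-ins for your Clean_Text, Format_Integer, etc.; in a real UDF the list of kinds would presumably be derived from the input schema rather than hard-coded.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnFormatters {

    // Hypothetical column kinds; a real UDF would map Pig/Hive types onto these.
    enum Kind { TEXT, INTEGER, DECIMAL }

    // Stand-in for Clean_Text: strips surrounding whitespace.
    static String cleanText(String raw) {
        return raw.trim();
    }

    // Stand-in for Format_Integer: trims, then normalises via Long parsing ("00" becomes "0").
    static String formatInteger(String raw) {
        return Long.toString(Long.parseLong(raw.trim()));
    }

    // Assumed decimal rule: trims, drops an explicit '+' sign, avoids exponent notation.
    static String formatDecimal(String raw) {
        return new BigDecimal(raw.trim()).toPlainString();
    }

    // Splits a pipe-separated line and formats each field according to its kind.
    static String formatLine(String line, List<Kind> kinds) {
        // The -1 limit keeps trailing empty fields instead of silently dropping them.
        String[] fields = line.split("\\|", -1);
        if (fields.length != kinds.size()) {
            throw new IllegalArgumentException(
                "Expected " + kinds.size() + " fields but found " + fields.length);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            switch (kinds.get(i)) {
                case INTEGER:
                    out.add(formatInteger(fields[i]));
                    break;
                case DECIMAL:
                    out.add(formatDecimal(fields[i]));
                    break;
                default:
                    out.add(cleanText(fields[i]));
                    break;
            }
        }
        return String.join("|", out);
    }

    public static void main(String[] args) {
        List<Kind> kinds = Arrays.asList(
            Kind.TEXT, Kind.INTEGER, Kind.DECIMAL, Kind.TEXT, Kind.INTEGER, Kind.INTEGER);
        // Prints: CF32|4711|34.883|98562AS1D|0|2
        System.out.println(formatLine("CF32|4711| +34.883|  98562AS1D | 00   | 2", kinds));
    }
}
```

Keeping each formatter as its own small static method means the same code could back either design: a single UDF that dispatches on the schema, or one thin UDF per data type.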

Upvotes: 1
