Petro

Reputation: 3652

Pig: How to pass a relation to a Java UDF as an argument?

My Pig script needs to pass data to the Java constructor of my UDF:

UPCFIND = LOAD 'testdatabase.item' USING org.apache.hive.hcatalog.pig.HCatLoader() AS (upc:chararray,description:chararray); 
UPCDATA = FOREACH UPCFIND GENERATE upc,description;
DUMP UPCDATA;
-- output:
(00001123456789," Table       ")
(00000123456789," PICTURE       ")

My UDF is called like this:

loading = LOAD '/incoming/files/*' USING com.readingitems.loading.TheLoader(UPCDATA) as
 (upc:chararray, description:chararray,

Can I pass UPCDATA to my UDF? If so, how would I get it into a HashMap where upc is the key and description is the value? Is the data treated as an ArrayList or as a tuple? Thanks in advance!

The problem right now is passing this data into the Java constructor:

UPCFIND = LOAD 'testdatabase.item' USING org.apache.hive.hcatalog.pig.HCatLoader() AS (upc:chararray,description:chararray);
UPCDATA = FOREACH UPCFIND GENERATE upc,description;
UPCDATA_SCALAR = GROUP UPCDATA ALL;

loading = LOAD 'files/incoming/*' USING com.readingitems.loading.TheLoader(UPCDATA_SCALAR)

I get this error:

ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : UPCDATA_SCALAR

Dumping UPCDATA_SCALAR produces the correct results.

I'm doing this so that a loader function that parses files can also see a Hive table's data. I need to compare the data in the files against the Hive table data in order to make changes and insert the result into a new table.

My loader function starts with:

public class TheLoader extends LoadFunc {

    public TheLoader (DataBag item_master_stream) throws SQLException {

Upvotes: 0

Views: 411

Answers (1)

mr2ert

Reputation: 5186

In your example, UPCDATA is a relation. To pass it into a function as an argument, you have to convert it into a scalar. You can accomplish this with:

UPCDATA_SCALAR = GROUP UPCDATA ALL;

In Java, this will be represented as a DataBag of Tuples. You can read more about that here.
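To get from a bag of (upc, description) tuples to the HashMap the question asks about, something along these lines might work. This is only a sketch: `UpcLookup` and `toMap` are hypothetical names, and plain Java lists stand in for Pig's `DataBag` and `Tuple` so the code runs without Pig on the classpath. With the real classes, the loop would iterate the `DataBag` (it is an `Iterable` of `Tuple`) and read each field with `tuple.get(0)` and `tuple.get(1)`, which can throw `ExecException`. It also assumes the padding in the DUMP output is whitespace that should be trimmed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Pig and not the asker's actual code.
final class UpcLookup {

    // Each inner list stands in for one Pig Tuple:
    // index 0 = upc, index 1 = description (an assumption based on the schema above).
    static Map<String, String> toMap(Iterable<? extends List<?>> bag) {
        Map<String, String> byUpc = new HashMap<>();
        for (List<?> tuple : bag) {
            // Skip malformed rows instead of failing the whole lookup.
            if (tuple == null || tuple.size() < 2 || tuple.get(0) == null) {
                continue;
            }
            String upc = tuple.get(0).toString();
            Object description = tuple.get(1);
            // Trim whitespace padding around the description; null becomes "".
            byUpc.put(upc, description == null ? "" : description.toString().trim());
        }
        return byUpc;
    }
}
```

Calling `UpcLookup.toMap(bag)` once and keeping the resulting map in a field would let later lookups by upc avoid rescanning the bag. If the same upc appears more than once, the last description wins.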

Keep in mind that GROUP ALL is expensive, since it collects the whole relation into a single bag, so project away every column your UDF does not need before grouping.

Upvotes: 1
