RVandakar
RVandakar

Reputation: 81

Pig Latin using two data sources in one FILTER statement

In my pig script, am reading data from more than 5 data sources (Hive tables), where one is the main source data and rest were kind of dimension data tables. I am trying to filter the main data source relation (or alias) w.r.t some value in one of the dimension relation.

E.g.

 -- main_data is main data source and dept_data is department data
 filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;

 filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7; 

In my pig script there are minimum 20 instances where I need to match for some value between multiple data sources and produce a new relation. But am getting some error as

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1. Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... ) Details at logfile: /root/pig_1403263965493.log

I tried to use "relation::field" approach also, no use. Alternatively, am joining these two relations (data sources) to get filtered data, but I feel, this will slow down the execution process and unnecessirity huge data will be dumped.

Please guide me how two use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid using JOIN statements and get it done from FILTER statement itself.

Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID

Upvotes: 0

Views: 374

Answers (1)

Dennis Jaheruddin
Dennis Jaheruddin

Reputation: 21563

If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:

Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID

If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a

FILTER BY someID IN someIDList

Upvotes: 0

Related Questions