Reputation: 425
I have a list of filter keywords (about 1000 of them), and I need to use this list to filter a field of a relation in Pig.
Initially, I declared these keywords like: %declare p1 '.*keyword1.*'; .... ...
%declare p1000 '.*keyword1000.*';
I am then doing filtering like:
Filtered = FILTER SRC BY (not $0 matches '$p1') and (not $0 matches '$p2') and ... and (not $0 matches '$p1000');
DUMP Filtered;
Assume that my source relation is in SRC and I need to apply filtering on first field i.e. $0.
If I reduce the number of filters to 100-200, it works fine, but once the number of filters grows to 1000, it stops working.
Can somebody suggest a workaround to get the right results?
Thanks in advance
Upvotes: 1
Views: 1462
Reputation: 25909
You can write a simple filter UDF where you'd perform all the checks, something like:
package myudfs;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class MYFILTER extends FilterFunc
{
    static List<String> filterList = new ArrayList<String>();
    static {
        // Load all filter keywords once per task JVM;
        // "filter_keywords.txt" is a placeholder for wherever the list lives.
        try {
            BufferedReader reader = new BufferedReader(new FileReader("filter_keywords.txt"));
            String line;
            while ((line = reader.readLine()) != null)
                filterList.add(line.trim());
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Failed to load filter keywords", e);
        }
    }

    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        try {
            String str = (String) input.get(0);
            // Keep the row only if it contains none of the keywords,
            // mirroring the (not $0 matches '.*keyword.*') semantics.
            for (String keyword : filterList)
                if (str.contains(keyword))
                    return false;
            return true;
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
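Once compiled and packaged, the UDF would be registered and called from the Pig script roughly like this (the jar name is an assumption):
REGISTER myudfs.jar;
Filtered = FILTER SRC BY myudfs.MYFILTER($0);
DUMP Filtered;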
Upvotes: 1
Reputation: 3540
One shallow approach is to divide the filtering into stages: filter keywords 1 through 100 in the first stage, the next 100 in the second, and so on, for a total of count(keywords)/100 stages. However, given more details about your data, there is probably a better solution.
As for the shallow solution above, you can wrap the Pig script in a shell script that parcels out the keyword subsets and starts a run for the current subset being filtered; a sketch of the staged filtering follows.
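A minimal sketch of the staged version, reusing the %declare parameters from the question with only a couple of keywords per stage for illustration (in practice each stage would carry its batch of ~100):
Stage1   = FILTER SRC    BY (not $0 matches '$p1') and (not $0 matches '$p2');
Stage2   = FILTER Stage1 BY (not $0 matches '$p3') and (not $0 matches '$p4');
Filtered = FILTER Stage2 BY (not $0 matches '$p5') and (not $0 matches '$p6');
DUMP Filtered;
Because each FILTER statement stays small, no single statement has to hold all 1000 conditions, which is what appears to break in the original script.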
Upvotes: 0