Reputation: 7800
I have two files (messages, keys). I want to pull out all the lines from 'messages' that include a word from 'keys'.
messages = LOAD 'my-messages.txt' as (message:chararray);
keys = LOAD 'keys.txt' as (key: chararray);
Now I know I can do an inner join between messages & keys, but that won't work in situations such as:
message = "hi there"
key = "hi"
I'm thinking of a UDF as a way to get around it:
DEFINE containsKey my.udf.Matches("path/keys.txt");
matches = FILTER messages BY containsKey(message);
Then, inside the UDF, I would loop through all the keys (yikes!). That doesn't feel right. I'm not sure my approach is correct, so feel free to offer suggestions.
Upvotes: 0
Views: 66
Reputation: 2287
This looks like a use case for CROSS. Reference: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#CROSS
It may not be the optimal solution, since CROSS pairs every message with every key before filtering, but it is a feasible approach.
Input :
Messages :
hi there
He said "Hi, how are you doing ?"
HI there
Hello there
Keys :
hi
Pig Script :
messages = LOAD 'messages.csv' USING PigStorage('\t') AS (message:chararray);
keys = LOAD 'keys.csv' USING PigStorage('\t') AS (key:chararray);
crossed_data = CROSS messages, keys ;
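-- Case-insensitive substring match; CONCAT is nested because older Pig releases accept only two arguments.
filt_required_data = FILTER crossed_data BY LOWER(messages::message) MATCHES CONCAT('.*', CONCAT(LOWER(keys::key), '.*'));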
required_data = FOREACH filt_required_data GENERATE messages::message AS message;
DUMP required_data;
Output :
(hi there)
(He said "Hi, how are you doing ?")
(HI there)
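Note that the CROSS script above matches substrings, so a message such as "this" would also match the key "hi". If whole-word matching is what you need, and you prefer the UDF route from the question, the per-message work does not have to be a loop over all keys. Below is a minimal sketch of just the matching logic in plain Java, assuming single-word keys and a key list that fits in memory. The class name KeyMatcher and its methods are hypothetical and not part of Pig. In a real UDF this logic would be called from the exec method of a class extending org.apache.pig.FilterFunc, with the keys loaded once from the file.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Pig API: holds the keys in a hash set so each
// message costs one lookup per word instead of one comparison per key.
final class KeyMatcher {
    private final Set<String> keys = new HashSet<>();

    // Assumption: each entry in rawKeys is a single word (one line of keys.txt).
    KeyMatcher(List<String> rawKeys) {
        for (String k : rawKeys) {
            String normalized = k.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                keys.add(normalized);
            }
        }
    }

    // Returns true if any whole word of the message equals a key, ignoring case.
    boolean containsKey(String message) {
        if (message == null) {
            return false;
        }
        // Split on runs of characters that are neither letters nor digits,
        // so punctuation such as quotes and commas does not stick to words.
        String[] words = message.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String word : words) {
            if (keys.contains(word)) {
                return true;
            }
        }
        return false;
    }
}
```

With the key "hi", this accepts the same three sample messages as the CROSS script and rejects "Hello there", but unlike the CROSS script it also rejects "this", because it compares whole words rather than substrings.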
Upvotes: 2