hba

Reputation: 7800

Pig (Hadoop): is there a way to do an inner join with a regex?

I have 2 files (messages, keys). I want to pull out all the lines of 'messages' that contain a word from 'keys'.

messages = LOAD 'my-messages.txt' as (message:chararray);
keys = LOAD 'keys.txt' as (key: chararray);

I know I can do an inner join between messages and keys, but a join only matches on equality, so it won't work in situations such as:

message = "hi there"
key = "hi"

I'm thinking of a UDF as a way to get around it:

DEFINE containsKey my.udf.Matches("path/keys.txt");
matches = FILTER messages BY containsKey(message);

Then, inside the UDF, I would loop through all the keys for every message (yikes!). That doesn't feel right. I'm not sure my approach is sound, so feel free to offer suggestions.
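One way to avoid looping over every key for every message is to compile all the keys into a single pattern once and reuse it. The sketch below shows only that matching logic in plain Java. The class name KeyMatcher and its methods are hypothetical, and the assumption is that a Pig filter UDF would build the matcher once from the keys file and call containsAnyKey from its exec method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: holds all keys as one precompiled pattern.
public class KeyMatcher {
    private final Pattern anyKey;

    public KeyMatcher(List<String> keys) {
        List<String> quoted = new ArrayList<>();
        for (String key : keys) {
            if (!key.isEmpty()) {
                // Pattern.quote makes the key match literally, even if it
                // contains regex metacharacters such as '.' or '('.
                quoted.add(Pattern.quote(key));
            }
        }
        // With no usable keys, fall back to "(?!)", a pattern that never matches.
        String regex = quoted.isEmpty() ? "(?!)" : String.join("|", quoted);
        this.anyKey = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // True if the message contains at least one key as a substring.
    public boolean containsAnyKey(String message) {
        return message != null && anyKey.matcher(message).find();
    }

    public static void main(String[] args) {
        KeyMatcher matcher = new KeyMatcher(Arrays.asList("hi"));
        for (String m : new String[] {"hi there", "HI there", "Hello there"}) {
            System.out.println(m + " -> " + matcher.containsAnyKey(m));
        }
    }
}
```

This only pays off if the key list is small enough to load into memory in each task. Otherwise a relational approach inside Pig is probably the better fit.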

Upvotes: 0

Views: 66

Answers (1)

Murali Rao

Reputation: 2287

This looks like a use case for CROSS. Ref: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#CROSS

This may not be the optimal solution, since CROSS pairs every message with every key, but it is a feasible approach.

Input :

Messages :

hi there
He said "Hi, how are you doing ?"
HI there
Hello there

Keys :

hi

Pig Script :

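messages = LOAD 'messages.csv' USING PigStorage('\t') AS (message:chararray);
keys = LOAD 'keys.csv' USING PigStorage('\t') AS (key:chararray);

-- Pair every message with every key.
crossed_data = CROSS messages, keys;

-- Keep the pairs where the message contains the key, ignoring case.
-- If your Pig release only accepts two arguments to CONCAT, nest two CONCAT calls instead.
filt_required_data = FILTER crossed_data BY LOWER(messages::message) MATCHES CONCAT('.*', LOWER(keys::key), '.*');

-- Drop the key column and keep only the message.
required_data = FOREACH filt_required_data GENERATE messages::message AS message;

DUMP required_data;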

Output :

(hi there)
(He said "Hi, how are you doing ?")
(HI there)
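One caveat worth checking with this approach: MATCHES takes a Java-style regular expression that has to match the whole string, so the key is spliced into the pattern unescaped. The plain-Java sketch below mirrors the FILTER expression to show what that means for a key containing a metacharacter. The class and method names are hypothetical, and it illustrates the regex behaviour only, not Pig itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class mirroring the FILTER expression in plain Java.
public class MatchesCaveat {
    // Same shape as the Pig expression: the key goes into the regex unescaped.
    static boolean rawMatch(String message, String key) {
        return message.toLowerCase().matches(".*" + key.toLowerCase() + ".*");
    }

    // Variant that quotes the key so it is matched literally.
    static boolean quotedMatch(String message, String key) {
        return message.toLowerCase()
                .matches(".*" + Pattern.quote(key.toLowerCase()) + ".*");
    }

    public static void main(String[] args) {
        // A plain key behaves the same either way.
        System.out.println(rawMatch("HI there", "hi"));
        // The '.' in the key "a.b" acts as a wildcard when left unescaped.
        System.out.println(rawMatch("axb", "a.b"));
        // Quoted, the same key only matches the literal text "a.b".
        System.out.println(quotedMatch("axb", "a.b"));
    }
}
```

If the keys are plain words, as in the sample input, the two variants agree and nothing needs to change.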

Upvotes: 2
