Krishna Kalyan

Reputation: 1702

Pig matching with an external file

I have a file (loaded into relation A) with all the tweets:

today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
...

I have another file (loaded into relation B) with the words to be filtered:

    sick
    viral fever
    feeling
    ...

My Code

--Loads all the tweets
A = LOAD 'tweets' AS (tweet:chararray);
--Loads all the words to be filtered
B = LOAD 'filter_list' AS (word:chararray);

Expected Output

(sick,1)
(viral fever,2)
(feeling,1)
...

How do I achieve this in Pig using a join?

Upvotes: 1

Views: 393

Answers (2)

JamCon

Reputation: 2333

EDITED SOLUTION

The basic concept that I supplied earlier will work, but it requires adding a UDF to generate n-gram pairs from the tweets. You then union the n-gram pairs with the tokenized tweets and perform the word count on that combined dataset.

I've tested the code below, and it works fine against the data provided. If records in your filter_list have more than two words in a string (e.g. "I feel bad"), you'll need to recompile the ngram-udf with the appropriate count (or, ideally, turn the count into a variable and set the n-gram size on the fly).

You can get the source code for the NGramGenerator UDF here: Github
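Purely as an illustration of making the n-gram size configurable, and assuming the UDF were recompiled so that its constructor accepts the size as an argument (the stock tutorial NGramGenerator does not, so NGGen3 and its '3' parameter below are hypothetical), the Pig side could look like this:

REGISTER ngram-udf.jar
--Hypothetical: assumes a modified NGramGenerator whose constructor takes the maximum n-gram size; the stock tutorial UDF hardcodes 2
DEFINE NGGen3 org.apache.pig.tutorial.NGramGenerator('3');

A = LOAD 'tweets.txt' as (tweet:chararray);
--Would emit n-grams of up to 3 words, covering phrases like "I feel bad"
B = FOREACH A GENERATE FLATTEN(NGGen3(tweet)) as ngram;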



ngrams.pig

REGISTER ngram-udf.jar
DEFINE NGGen org.apache.pig.tutorial.NGramGenerator;

--Load the initial data
A = LOAD 'tweets.txt' as (tweet:chararray);

--Create NGram tuple with a size limit of 2 from the tweets
B = FOREACH A GENERATE FLATTEN(NGGen(tweet)) as ngram; 
--Tokenize the tweets into single word tuples
C = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)tweet)) as ngram;

--Union the Ngram and word tuples
D = UNION B,C;
--Group similar tuples together
E = GROUP D BY ngram;
--For each unique ngram, generate the ngram name and a count
F = FOREACH E GENERATE group, COUNT(D);


--Load the wordlist for joining
Z = LOAD 'wordlist.txt' as (word:chararray);

--Perform the inner join of the ngrams and the wordlist
Y = JOIN F BY group, Z BY word;

--For each intersecting record, store the ngram and count
X = FOREACH Y GENERATE $0,$1;


DUMP X;



RESULTS/OUTPUT

(feeling,1)
(viral fever,2)



tweets.txt

today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever



wordlist.txt

sick
viral fever
feeling





Original Solution

I don't have access to my Hadoop system at the moment to test this answer, so the code may be off slightly. The logic should be sound, however. An easy approach would be:

  1. Perform the classic wordcount program against the tweets dataset
  2. Perform an inner join of the wordlist and tweets
  3. Generate the data again to get rid of the duplicate word in the tuple
  4. Dump/Store the join results

Example code:

--Perform the classic word count against the tweets
A = LOAD 'tweets.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);

--Join the word counts with the wordlist
Z = LOAD 'wordlist.txt' as (word:chararray);
Y = JOIN D BY group, Z BY word;
--Keep only the word and its count, dropping the duplicate word column
X = FOREACH Y GENERATE $0, $1;
DUMP X;

Upvotes: 1

Frederic

Reputation: 3284

As far as I know, this is not possible using a join.

You could do a CROSS followed by a FILTER with a regex match.
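A minimal sketch of that approach, reusing the file names from the other answer; it uses the builtin INDEXOF as a simple substring test in place of a regex, and note that CROSS gets expensive on large inputs:

--Load the tweets and the filter words
A = LOAD 'tweets.txt' as (tweet:chararray);
B = LOAD 'wordlist.txt' as (word:chararray);

--Pair every tweet with every filter word
C = CROSS A, B;

--Keep only the pairs where the tweet contains the filter word
D = FILTER C BY INDEXOF(tweet, word, 0) >= 0;

--Count the matching tweets per filter word
E = GROUP D BY word;
F = FOREACH E GENERATE group, COUNT(D);

DUMP F;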

Upvotes: 0
