Reputation: 1702
I have a file (loaded into relation A) containing all the tweets:
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
...
I have another file (loaded into relation B) containing the words to filter on:
sick
viral fever
feeling
...
My Code
-- loads all the tweets
A = load 'tweets' as tweets;
-- loads all the words to be filtered
B = load 'filter_list' as filter_list;
Expected Output
(sick,1)
(viral fever,2)
(feeling,1)
...
How do I achieve this in Pig using a join?
Upvotes: 1
Views: 393
Reputation: 2333
The basic concept that I supplied earlier will work, but it requires adding a UDF that generates n-gram pairs from the tweets. You then union the n-gram pairs with the tokenized tweets and run the word count over the combined dataset.
I've tested the code below, and it works against the data provided. If records in your filter_list contain more than 2 words (e.g. "I feel bad"), you'll need to recompile the ngram-udf with the appropriate count (or, ideally, turn it into a variable and set the n-gram count on the fly).
You can get the source code for the NGramGenerator UDF here: Github
ngrams.pig
REGISTER ngram-udf.jar;
DEFINE NGGen org.apache.pig.tutorial.NGramGenerator;
--Load the initial data
A = LOAD 'tweets.txt' as (tweet:chararray);
--Create NGram tuple with a size limit of 2 from the tweets
B = FOREACH A GENERATE FLATTEN(NGGen(tweet)) as ngram;
--Tokenize the tweets into single word tuples
C = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)tweet)) as ngram;
--Union the Ngram and word tuples
D = UNION B,C;
--Group similar tuples together
E = GROUP D BY ngram;
--For each unique ngram, generate the ngram name and a count
F = FOREACH E GENERATE group, COUNT(D);
--Load the wordlist for joining
Z = LOAD 'wordlist.txt' as (word:chararray);
--Perform the inner join of the ngrams and the wordlist
Y = JOIN F BY group, Z BY word;
--For each intersecting record, store the ngram and count
X = FOREACH Y GENERATE $0,$1;
DUMP X;
RESULTS/OUTPUT
(feeling,1)
(viral fever,2)
tweets.txt
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
wordlist.txt
sick
viral fever
feeling
I don't have access to my Hadoop system at the moment to test this answer, so the code may be slightly off, but the logic should be sound. A simple solution would be:
Example code:
A = LOAD 'tweets.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
Z = LOAD 'wordlist.txt' as (word:chararray);
Y = JOIN D BY group, Z BY word;
X = FOREACH Y GENERATE $0,$1;
DUMP X;
Upvotes: 1
Reputation: 3284
As far as I know, this is not possible using a join.
You could do a CROSS followed by a FILTER with a regex match.
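A minimal, untested sketch of that idea follows. It assumes the file names tweets.txt and wordlist.txt used in the other answer, and the relation and field names are made up for illustration. Instead of a regex it uses the built-in INDEXOF as a plain substring test, which is a simplification of the regex match suggested above.

```pig
-- Assumed inputs: one tweet per line, one filter phrase per line.
tweets = LOAD 'tweets.txt' AS (tweet:chararray);
words = LOAD 'wordlist.txt' AS (word:chararray);

-- Pair every tweet with every filter phrase.
-- This produces |tweets| x |words| rows, so it is expensive on large inputs.
pairs = CROSS tweets, words;

-- Keep only the pairs where the tweet contains the phrase as a substring.
matched = FILTER pairs BY INDEXOF(tweets::tweet, words::word, 0) >= 0;

-- Count the matching tweets per phrase.
grouped = GROUP matched BY words::word;
counts = FOREACH grouped GENERATE group AS word, COUNT(matched) AS hits;
DUMP counts;
```

Because this pairs whole tweets with whole phrases, multi-word entries such as "viral fever" need no n-gram UDF. Two caveats: a substring test also matches inside longer words (for example "fever" inside "feverish"), and the count is the number of tweets containing the phrase, not the number of occurrences.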
Upvotes: 0