Reputation: 240
I'm trying to run a simple word counter in Pig Latin as follows:
lines = LOAD 'SOME_FILES' using PigStorage('#') as (line:chararray);
word = FILTER lines BY (line matches '.*SOME_VALUE.*');
I want to count how many times SOME_VALUE is found when searching SOME_FILES, so the expected output should be something like:
(SOME_VALUE,xxxx)
where xxxx is the total number of occurrences of SOME_VALUE found.
How can I search for multiple values and print a count for each one as above?
Upvotes: 0
Views: 1056
Reputation: 5186
What you should do is split each line into a bag of tokens, then FLATTEN it. Then you can do a GROUP on the words to pull all occurrences of each word into its own line. Once you do a COUNT of the resulting bag, you'll have the total count for each word in the document.
This will look something like:
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) ;
C = GROUP B BY $0 ;
D = FOREACH C GENERATE group AS word, COUNT(B) AS count ;
If you aren't sure what each step is doing, you can use DESCRIBE and DUMP to help visualize what is happening.
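For example, a minimal sketch of inspecting the relations defined above (it assumes B, C and D exist exactly as written; no output is shown here because it depends on your data and Pig version):

```pig
-- DESCRIBE prints the schema of a relation without running the job
DESCRIBE B;
DESCRIBE C;

-- DUMP runs the pipeline and prints the resulting tuples to the console
DUMP D;
```

On large inputs, DUMP prints every tuple, so it is usually worth trying this on a small sample file first.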
Update: If you want to filter the results to contain only the few strings you care about, you can do:
E = FILTER D BY (word == 'foo') OR
(word == 'bar') OR
(word == 'etc') ;
-- Another way...
E = FILTER D BY (word matches 'foo|bar|etc') ;
However, you can also do this between B and C so you don't do any COUNTs you don't need.
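A rough sketch of that early filter, assuming the same input as in the question. The `AS word` alias on B and the relation name `wanted` are additions for illustration, not part of the original script, and 'foo|bar|etc' is the same placeholder pattern as above:

```pig
lines = LOAD 'SOME_FILES' USING PigStorage('#') AS (line:chararray);

-- Split each line into tokens and name the flattened field (alias is an assumption)
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Keep only the words of interest before grouping
wanted = FILTER B BY (word matches 'foo|bar|etc');

-- Group and count only the surviving words
C = GROUP wanted BY word;
D = FOREACH C GENERATE group AS word, COUNT(wanted) AS count;

DUMP D;
```

Note that `matches` has to match the whole token, so 'foo|bar|etc' keeps only tokens that are exactly foo, bar or etc. Words that never occur in the input will not appear in the output at all, rather than showing up with a count of 0.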
Upvotes: 1