A.Elnaggar

Reputation: 240

Counting result lines in pig latin

I'm trying to run a simple word counter in Pig Latin, as follows:

lines = LOAD 'SOME_FILES' using PigStorage('#') as (line:chararray);
word = FILTER lines BY (line matches '.*SOME_VALUE.*');

I want to count how many times SOME_VALUE is found when searching SOME_FILES, so the expected output should be something like:

(SOME_VALUE,xxxx)

where xxxx is the total number of occurrences of SOME_VALUE.

How can I search for multiple values and print a count for each one, as above?

Upvotes: 0

Views: 1056

Answers (1)

mr2ert

Reputation: 5186

What you should do is split each line into a bag of tokens, then FLATTEN it so that each token becomes its own row. Then you can do a GROUP on the words to pull all occurrences of each word into its own group. Once you do a COUNT of the resulting bag, you'll have the total count for each word in the document.

This will look something like:

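-- One row per token: TOKENIZE returns a bag, FLATTEN unnests it
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line));
-- Collect identical tokens into one group per distinct word
C = GROUP B BY $0;
-- Emit each word with the size of its group
D = FOREACH C GENERATE group AS word, COUNT(B) AS count;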

If you aren't sure what each step is doing, then you can use DESCRIBE and DUMP to help visualize what is happening.
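
As a rough sketch of that kind of inspection, assuming the aliases B and D defined above (the alias sample and the row limit of 10 are only illustrative choices):

-- Print the schema of an intermediate relation without running a job
DESCRIBE B;
-- DUMP executes the pipeline, so cap the output on large inputs
sample = LIMIT D 10;
DUMP sample;

DESCRIBE only reports field names and types, while DUMP actually runs the script up to that alias, so it can be slow on big files.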


Update: If you want the results to contain only the few strings you care about, you can filter D:

E = FILTER D BY (word == 'foo') OR
                (word == 'bar') OR
                (word == 'etc');

-- Another way: matches has to match the whole string, so this keeps exactly foo, bar, or etc
E = FILTER D BY (word matches 'foo|bar|etc');

However, you can also apply this filter between B and C (on $0, since the field is not yet named word at that point), so that you don't compute any COUNTs you don't need.
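
A minimal sketch of that ordering, assuming the lines relation from the question. The alias names tokens, wanted, grouped, and counts are made up for illustration, and foo, bar, and etc stand in for the real search values:

-- Name the flattened field so later steps can refer to it as word
tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Drop every token that is not one of the wanted words before grouping
wanted = FILTER tokens BY (word matches 'foo|bar|etc');
grouped = GROUP wanted BY word;
-- One output row per wanted word that actually occurs, e.g. (foo,xxxx)
counts = FOREACH grouped GENERATE group AS word, COUNT(wanted) AS total;
DUMP counts;

Note that this counts whole tokens only. A word that never appears in the input produces no row at all, rather than a row with a count of 0.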

Upvotes: 1
