frazman

Reputation: 33303

Doing word count in Pig

I have data that has already been processed into the following form:

( id ,{ bag of words})

So for example:

(foobar, {(foo), (foo),(foobar),(bar)})
(foo,{(bar),(bar)})

and so on. Running describe processed gives me:

processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}

Now I also want to count the number of times each word appears for each id in this data and output it as:

foobar,foo,2
foobar,foobar,1
foobar,bar,1
foo,bar,2

and so on...

How do I do this in Pig?

Upvotes: 1

Views: 348

Answers (2)

Ruslan

Reputation: 3283

Try this:

$ cat input 
foobar  foo
foobar  foo
foobar  foobar
foobar  bar
foo bar
foo bar

--preparing
inputs = LOAD 'input' AS (first: chararray, second: chararray);
grouped = GROUP inputs BY first;
formatted = FOREACH grouped GENERATE group, inputs.second AS second;
--what you need
flattened = FOREACH formatted GENERATE group, FLATTEN(second);
result = FOREACH (GROUP flattened BY (group, second)) GENERATE FLATTEN(group), COUNT(flattened);
DUMP result;

Output:

(foo,bar,2)
(foobar,bar,1)
(foobar,foo,2)
(foobar,foobar,1)
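
For readers who want to check the logic outside Pig, here is a minimal plain-Python sketch of what the FLATTEN followed by GROUP ... BY (group, second) and COUNT computes. The `processed` list and the `count_tokens` function name are assumptions made for illustration; the rows are the sample data from the question.

```python
from collections import Counter

# Sample rows from the question: (id, bag of tokens)
processed = [
    ("foobar", ["foo", "foo", "foobar", "bar"]),
    ("foo", ["bar", "bar"]),
]

def count_tokens(rows):
    """Count occurrences of each (id, token) pair (hypothetical helper)."""
    # Equivalent of FLATTEN: emit one (id, token) pair per token in the bag
    pairs = ((row_id, token) for row_id, tokens in rows for token in tokens)
    # Equivalent of GROUP BY (id, token) followed by COUNT
    counts = Counter(pairs)
    # Sort only to get a stable order for display
    return sorted((row_id, token, n) for (row_id, token), n in counts.items())

result = count_tokens(processed)
for row in result:
    print(row)
```

The sorted rows should line up with the DUMP output above, which is a quick way to sanity-check the Pig script on a small sample.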

Upvotes: 1

mr2ert

Reputation: 5184

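Though you can do this in pure Pig, it should be much more efficient to do this with a UDF. Something along the lines of:

@outputSchema('wordcounts: {T:(word:chararray, count:int)}')
def generate_wordcount(BAG):
    # Pig passes a bag as a list of tuples, so each element is a one-field tuple
    d = {}
    for t in BAG:
        word = t[0]  # unwrap the token from its tuple
        if word in d:
            d[word] += 1
        else:
            d[word] = 1
    # list of (word, count) tuples, matching the declared schema
    return d.items()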

You can then use this UDF like this:

REGISTER 'myudfs.py' USING jython AS myudfs ;

-- A: (id, words: {T:(word:chararray)})

B = FOREACH A GENERATE id, FLATTEN(myudfs.generate_wordcount(words)) ;
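
Before registering the UDF, its counting logic can be tried in a plain Python interpreter. This is only a sketch: the `outputSchema` function below is a stand-in written for local testing (Pig supplies the real decorator when the script is registered), and it assumes the bag arrives as a list of one-field tuples. `sample_bag` is made up from the question's first row.

```python
def outputSchema(schema):
    # Stand-in for the decorator Pig provides to Jython UDFs;
    # an assumption for local testing only.
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema('wordcounts: {T:(word:chararray, count:int)}')
def generate_wordcount(bag):
    d = {}
    for t in bag:
        word = t[0]  # each bag element is a one-field tuple
        d[word] = d.get(word, 0) + 1
    # list() so the result is a plain list under Python 3 as well
    return list(d.items())

# Hypothetical sample bag, shaped like the tokens of the 'foobar' row
sample_bag = [("foo",), ("foo",), ("foobar",), ("bar",)]
counts = sorted(generate_wordcount(sample_bag))
print(counts)
```

Each returned pair becomes one output row per id once FLATTEN is applied in the Pig script.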

Upvotes: 1
