grautur

Reputation: 30505

WordCount with custom word delimiters in Pig?

I'm new to Pig, and I'm trying to write a word count program.

One way of getting words from text is to use the TOKENIZE function:

WORDS = foreach INPUT generate flatten(TOKENIZE(text)) AS word;

But I only want to split on whitespace, whereas TOKENIZE splits on things like commas, too. How would I do this? I tried using STRSPLIT(text, ' '), but STRSPLIT seems to return a tuple whereas TOKENIZE returns a bag, so I'm not sure how to use STRSPLIT for this.

Upvotes: 1

Views: 1338

Answers (2)

msponer

Reputation: 111

It depends on what your input data looks like, but the following could work for you:

  1. Use MyRegExLoader (in PiggyBank) with a regex to load your data.
  2. Use STREAM with Perl, sed, or your favorite scripting language to munge your input data into a format that TOKENIZE will then handle the way you want.

Also, it's possible to convert tuples to a bag with ToBag (also in PiggyBank).
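As a rough illustration of the STREAM option, here is a minimal Python sketch. It is a variant that skips TOKENIZE entirely and emits one word per output line, splitting on whitespace only. The file name split_words.py, the function name split_words, and the alias splitter are made up for this example, and the Pig statements in the comments are an untested assumption about how the script would be wired in.

```python
import sys


def split_words(lines, out):
    """Write every whitespace-separated word of every input line on its own line."""
    for line in lines:
        # str.split() with no argument splits on runs of whitespace only,
        # so "b,c" stays a single word and empty lines produce nothing.
        for word in line.split():
            out.write(word + "\n")


# Assumed usage from Pig (hypothetical names, not verified against a cluster):
#   DEFINE splitter `python split_words.py` SHIP('split_words.py');
#   WORDS = STREAM INPUT THROUGH splitter AS (word:chararray);
if __name__ == "__main__":
    split_words(sys.stdin, sys.stdout)
```

Because each output line becomes one tuple, the streamed relation already has one word per row and needs no flatten step.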

Upvotes: 2

Kevin

Reputation: 1000

We can't directly transform a tuple into a bag (or vice versa). I suggest the following:

  1. Load your data
  2. Use STRSPLIT to split your value into a tuple
  3. Convert your tuples into a bag with a UDF
  4. Flatten your bag
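The steps above could be sketched with a small Jython UDF for step 3. This is only one possible shape, not necessarily what the answer had in mind: the function name tuple_to_bag, the file name udfs.py, and the schema string are assumptions, and the fallback decorator exists only so the file also runs outside Pig, where outputSchema is not provided.

```python
# Pig provides the outputSchema decorator to Jython UDF scripts.
# Outside Pig it is undefined, so fall back to a no-op decorator.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("words:{t:(word:chararray)}")
def tuple_to_bag(fields):
    """Turn a tuple of strings into a bag, i.e. a list of one-field tuples."""
    if fields is None:
        return None
    # Skip None and empty strings, which STRSPLIT can produce when
    # the input contains consecutive delimiters.
    return [(f,) for f in fields if f]


# Assumed usage from Pig (hypothetical alias and file name):
#   REGISTER 'udfs.py' USING jython AS udfs;
#   WORDS = foreach INPUT generate
#           flatten(udfs.tuple_to_bag(STRSPLIT(text, '\\s+'))) AS word;
```

Commas stay inside words here because the only splitting is done by the STRSPLIT regex.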

Upvotes: 1
