Jay

Reputation: 71

How to count the number of letters, not words using Pig

Hi everyone, I have found many examples of counting words, but none for counting letters. I want to split the words into letters and count each letter, but my code is wrong. Can someone help me with this? Thanks very much. This is my code:

A = load './in/*.txt';
B = FOREACH A GENERATE  FLATTEN(TOKENIZE(LOWER((chararray)$0))) as words;
C = FOREACH B GENERATE  FLATTEN(REGEX_EXTRACT_ALL(words, '([a-zA-Z])')) as letter;
D = group C by letter;
E = FOREACH D GENERATE COUNT(C), group;
DUMP E;

Upvotes: 4

Views: 1681

Answers (1)

sujit

Reputation: 2328

Change the line that defines C as follows:

C = foreach B generate flatten(TOKENIZE(REPLACE(words,'','|'), '|')) as letter;

The trick I used is to replace each letter boundary with a special character (|) and then tokenize with that character as the delimiter. You can also use an uncommon string sequence instead of the special character.
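For reference, here is how the whole script might look with that line swapped in. This is only a sketch: the `AS (line:chararray)` schema, the alias names `word`, `letters` and `n`, and the extra FILTER step are my additions, not part of the original code. It also assumes REPLACE treats its second argument as a Java regular expression, so the empty pattern matches at every character boundary. The original REGEX_EXTRACT_ALL line most likely fails because the regex has to match the entire word, so only one-letter words produce a result.

```
-- Load each line of the input files as a single chararray field
A = LOAD './in/*.txt' AS (line:chararray);

-- Lowercase each line and split it into words
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(line))) AS word;

-- Insert '|' at every character boundary, then tokenize on '|' to get single characters
C = FOREACH B GENERATE FLATTEN(TOKENIZE(REPLACE(word, '', '|'), '|')) AS letter;

-- Assumption: only a-z should be counted, so drop digits and punctuation
letters = FILTER C BY letter MATCHES '[a-z]';

-- Group identical letters and count each group
D = GROUP letters BY letter;
E = FOREACH D GENERATE group AS letter, COUNT(letters) AS n;
DUMP E;
```

If you want to count every character rather than only letters, remove the FILTER line and group C directly.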

Upvotes: 0
