midnightsoul

Reputation: 29

Pig script to count the number of letters in a file

I want to extend the Hadoop "hello world" word count program so that it counts the letters in the input file.

This is what I have written so far, but I can't figure out what is wrong with it. Any help identifying the issue would be appreciated.

A = load '/tmp/alice.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(REGEX_EXTRACT_ALL(word, '([a-zA-Z])')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into '/tmp/alice_wordcount';

Upvotes: 1

Views: 2644

Answers (2)

Ashok

Reputation: 75

Try the following code:

-- Load the data
A = load '/tmp/alice.txt';

-- Split the line into words
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

-- Split words into chars
C = foreach B generate flatten(TOKENIZE(REPLACE($0,'','|'),'|')) as letter;

-- Group the letters
D = GROUP C BY letter;

-- Generate the results with count of each letter
E = foreach D generate COUNT(C), group;

-- Store the results
store E into '/tmp/alice_wordcount';

Upvotes: 0

sujit

Reputation: 2328

Let me say that I am a Pig newbie, but somehow this question got me interested. I went down all kinds of complex paths like nested foreach, UDFs, etc. But in the end, the answer is pretty simple. It's just a correction to one of your Pig Latin lines, as below:

D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;

Instead of using REGEX_EXTRACT_ALL, I REPLACE each letter boundary with a special character ('|' here, though you can use an uncommon sequence if you like) and then TOKENIZE around that delimiter.
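For reference, here is a sketch of the whole script with that line swapped in. As far as I can tell, the original line fails because REGEX_EXTRACT_ALL only returns groups when the pattern matches the entire string, so '([a-zA-Z])' comes back null for any word longer than one letter. The extra filter into D_letters is my own assumption and not part of the original script: '\\w+' also lets digits and underscores through, so this step keeps only a-z and A-Z.

```pig
-- Load each line of the input as a single field
A = load '/tmp/alice.txt';
-- Split each line into words
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- Keep only tokens made of word characters
C = filter B by word matches '\\w+';
-- Put '|' at every character boundary, then split on it
D = foreach C generate flatten(TOKENIZE(REPLACE(word, '', '|'), '|')) as letter;
-- Assumption (not in the original): drop digits and underscores
D_letters = filter D by letter matches '[a-zA-Z]';
-- Count occurrences of each letter
E = group D_letters by letter;
F = foreach E generate COUNT(D_letters), group;
store F into '/tmp/alice_wordcount';
```

Upper and lower case are counted separately here. If that is not what you want, wrapping word in LOWER(...) before the REPLACE should merge them.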

Upvotes: 3
