duke_sastry

Reputation: 43

Error while trying to aggregate data using Apache Pig

This is the code I'm running:

bigrams = LOAD 's3://******' AS (bigram:chararray, year:int, occurrences:int, books:int);
bg_tmp = filter bigrams BY (occurrences >= 300) AND (books >= 12);
bg_tmp_2 = GROUP bg_tmp ALL;
occ_cnt = FOREACH bg_tmp_2 GENERATE bigram, SUM(bg_tmp_2.occurrences);
x = LIMIT occ_cnt 100;
DUMP x;

This is the error I get when computing occ_cnt:

81201 [main] ERROR org.apache.pig.tools.grunt.Grunt  - ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
18/10/26 16:05:07 ERROR grunt.Grunt: ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
Details at logfile: /mnt/var/log/pig/pig_1540569826316.log

I have no idea why this is happening. I'm using Apache Pig 0.17.0 and Hadoop 2.8.4 on AWS EMR.

Upvotes: 0

Views: 159

Answers (1)

Koji

Reputation: 409

I would rewrite your query as:

bg_tmp_2 = GROUP bg_tmp by (bigram);
occ_cnt = FOREACH bg_tmp_2 GENERATE group, SUM(bg_tmp.occurrences);

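I replaced GROUP ALL with GROUP BY bigram, since I think you want the SUM per bigram entry. I also replaced bg_tmp_2 with bg_tmp inside SUM: after grouping, the rows of each group are held in a bag named bg_tmp inside the bg_tmp_2 relation, so that bag is what SUM has to reference. Writing bg_tmp_2.occurrences makes Pig try to read the relation itself as a scalar, which is the "Invalid scalar projection" the parser reports.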

If you run "describe bg_tmp_2", you'll see the following schema:

bg_tmp_2: {group: chararray,bg_tmp: {(bigram: chararray,year: int,occurrences: int,books: int)}}
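Putting this together, here is a minimal sketch of both readings of the question, assuming bg_tmp is the filtered relation exactly as defined in the original script. The aliases bg_by_bigram, bg_all, total_occ and the field name total_occurrences are names I made up for illustration. I have not run this against your data.

```pig
-- Assumes bg_tmp is the filtered relation from the question.

-- Per-bigram totals: group by bigram, then sum the occurrences
-- held in each group's bag (the bag is named after the input, bg_tmp).
bg_by_bigram = GROUP bg_tmp BY bigram;
occ_cnt = FOREACH bg_by_bigram GENERATE group AS bigram, SUM(bg_tmp.occurrences) AS total_occurrences;
x = LIMIT occ_cnt 100;
DUMP x;

-- Alternative: one grand total over all rows.
-- GROUP ALL produces a single row, so there is no top-level bigram field
-- to project; only the bag bg_tmp is referenced.
bg_all = GROUP bg_tmp ALL;
total_occ = FOREACH bg_all GENERATE SUM(bg_tmp.occurrences) AS total_occurrences;
DUMP total_occ;
```

If you really did want GROUP ALL, the second half is the one to use. In that case the original script had two problems on the same line: bigram is not a top-level field of the grouped relation, and SUM has to point at the bag bg_tmp rather than at bg_tmp_2.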

Upvotes: 1
