Lin Ma
Lin Ma

Reputation: 10139

advice to make my below Pig code simple

Here is my code and I do two group all operations and my code works. My purpose is to generate all student unique user count with their total scores, student located in CA unique user count. Wondering if good advice to make my code simple to use only one group operation, or any constructive ideas to make code simple, for example using only one FOREACH operation? Thanks.

student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;

student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);

Sample input (student ID, location ID, mathScore, verbScore),

1 1 10  20
2 1 20  30
3 1 30  40
4 2 30  50
5 2 30  50
6 3 30  50

Sample output (unique user, unique user in CA, sum of mathScore of all students, sum of verb Score of all students),

7 3 150 240

thanks in advance, Lin

Upvotes: 1

Views: 72

Answers (3)

hello_abhishek
hello_abhishek

Reputation: 545

The answer accepted has an logical error.

Try to have the below input file

1 1 10  20
2 1 20  30
3 1 30  40
4 2 30  50
5 2 30  50
6 3 30  50
7 1 10  10

The output will be

(13,4,160,250)

The output should be

(7,4.170,260)

I have modified the script to work correct.

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);

gdata = group data all;

result = foreach gdata {
    student_CA_sum = COUNT( data.sid ) ;
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT( student_CA.sid ) ;
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore  as mathScore, verbScore as verbScore;

};

Output

(7,4,160,250)

Upvotes: 1

hello_abhishek
hello_abhishek

Reputation: 545

first load the file(student)in hadoop file system. The perform the below action.

split student into student_CA if locationId == 1, student_Other if locationId != 1;

student_CA_all = group student_CA all;

student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;

student_Other_all = group student_Other all;

student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount;

student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;

student_summary_all = group student_CAandOther_all_summary all;

student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;

output:

dump student_summary;
(6,3,150,240)

Hope this helps :)

While solving your problem, I also encountered an issue with PIG. I assume it is because of improper exception handling done in UNION command. Actually, it can hang you command line prompt, if you execute that command, without proper error message. If you want I can share you the snippet for that.

Upvotes: 1

Mahendra
Mahendra

Reputation: 1426

You might be looking for this.

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);

gdata = group data all;

result = foreach gdata {
        student_CA = filter data by lid == 1; 
        student_CA_sum = SUM( student_CA.sid ) ;
        student_CA_count = COUNT( student_CA.sid ) ;
        mathScore = SUM(data.ms);
        verbScore = SUM(data.vs);
        GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore  as mathScore, verbScore as verbScore;
 };

Output is:

grunt> dump result
    (6,3,150,240)
grunt> describe result
    result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}

Upvotes: 1

Related Questions