MarcL

Reputation: 3593

PIG: Calculate highest monthly growth in wiki pagecount data requests per article

I have several wiki dump files from https://dumps.wikimedia.org/other/pagecounts-raw/2015/. I want to calculate the monthly growth in requests for each wiki article for the year 2015, and then find out which month has the highest growth in requests for an article and how high that growth is. For explanation, each line of the data has the format "wikiproject" "article-url" "number of requests" "size of page in bytes", e.g.:

fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
en Main_Page 242332 4737756101

Our cluster setup is still a work in progress, so I have to try this out on the Cloudera QuickStart VM with a smaller dataset. I used only the page dumps for one hour from each of the three months. However, when I try to ILLUSTRATE it, it runs out of Java heap space, or I get a "GC overhead limit exceeded" message.

This is my code:

m1  = LOAD '/user/cloudera/2015/2015-01' USING PigStorage(' ') as(proj:chararray, url:chararray, req:long, size:long);
m2  = LOAD '/user/cloudera/2015/2015-02' USING PigStorage(' ') as(proj:chararray, url:chararray, req:long, size:long);
m3  = LOAD '/user/cloudera/2015/2015-03' USING PigStorage(' ') as(proj:chararray, url:chararray, req:long, size:long);

m11 = SAMPLE m1 0.1;
m22 = SAMPLE m2 0.1;
m33 = SAMPLE m3 0.1;

a = COGROUP m11 by url, m22 by  url, m33 by  url;
b = FOREACH a generate group, SUM(m11.req) as s1, SUM(m22.req) as s2, SUM(m33.req) as s3;
c = FOREACH b generate group, ((s2-s1) > 0 ? (s2-s1): 0) as dm2, ((s3-s2) > 0 ? (s3-s2): 0) as dm3 parallel 10;
d = FOREACH c generate group as Artikel, MAX(TOBAG(dm2,dm3)) as maxZugriffe;
e = order d by maxZugriffe desc;
f = limit e 10;

What I'm trying to do is this: first I sample 10% of the original data, then I cogroup the monthly datasets by article (= url). Then I calculate the sum of requests for each article and each month. To calculate the growth in requests, I take the sum of requests for the later month, subtract the sum of requests for the same article from the previous month, and check whether the result is > 0 (i.e. whether there is any growth). Finally I calculate the maximum of all growth values, order the relation by maxRequests (= maxZugriffe) in descending order, and limit the output to 10.

Can somebody tell from the code whether that's right, or am I missing something? As I said, it seems to be too much for the QuickStart VM to calculate the result, but it doesn't seem that complicated to me.

My second question is: is it possible to use an alias inside a bincond expression in Pig? E.g.: c = FOREACH b generate group, ((s2-s1) as 'diff' > 0 ? diff: 0) as dm2; I want to replace the first case with the alias 'diff' that I already calculated, instead of calculating (s2-s1) again.

Edit: some weeks have passed and there are still no answers. Can anybody help?

Upvotes: 4

Views: 129

Answers (1)

Aamir

Reputation: 143

Answer to your second question, "Is it possible to use an alias for the bincond expression in Pig?": no, we can't use an alias inside a bincond expression. This is the case not only in Pig; in SQL we can't do it either. We cannot give an alias to an expression without the ( = ) assignment operator.
If you really want to avoid repeating the expression, do as below:

b = FOREACH a generate group, SUM(m11.req) as s1, SUM(m22.req) as s2, SUM(m33.req) as s3;  
x = FOREACH b generate group,s1,s2,s3,(s2-s1) as diff;  
c = FOREACH x generate group, (diff > 0 ? diff: 0) as dm2;

What we have done here is create another column for (s2-s1) with the alias diff and then use that column in the expression. Hope you find this answer useful. Thank you.
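
The same pattern extends to both month-to-month differences, so that dm3 is still available for the later MAX step in the question. A minimal sketch, assuming the relation b from above with the fields group, s1, s2 and s3; the names diff2 and diff3 are illustrative and do not appear in the original code:

```pig
-- Compute each difference once and name it (diff2 and diff3 are illustrative names).
x = FOREACH b GENERATE group, (s2 - s1) AS diff2, (s3 - s2) AS diff3;
-- Reuse the named columns in the bincond; differences that are not positive become 0.
c = FOREACH x GENERATE group, (diff2 > 0 ? diff2 : 0) AS dm2, (diff3 > 0 ? diff3 : 0) AS dm3;
```

With c in this shape, the d, e and f steps from the question can follow unchanged, since they only refer to group, dm2 and dm3.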

Upvotes: 2
