Reputation: 2225
I have a dataset in Pig that looks like this:
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
6009544 "NY" 6009548 "NY"
The first line is read like so: "patent 6009544 originated in New York, and cites patent 6009545 which originated in New York." I'm trying to find, for each state, the % of patents cited that originate from the same state. So my expected output should be
NY: .5
OR: 1
AZ: .5
because of the 6 patents that originated in New York, 3 cite patents that also originated in New York. The 1 patent that originated in Oregon cites a patent that also originated in New York. And of the 2 patents that originated in Arizona, 1 cites a patent that also originated in Arizona.
Can anyone suggest a good way for performing this in Pig?
Upvotes: 1
Views: 330
Reputation: 1
I change the sample data and separate data using space:
A = load '/padata' using PigStorage(' ' ) as (pno:int,pcity:chararray,pci:int,pccity:chararray);
b = group A by pcity ;
r = foreach b {
copcity= COUNT(A.pcity) ;
samdata = FILTER A by pcity==pccity;
csamdata = COUNT(samdata);
percent = (float)csamdata/(float)copcity;
generate group,percent ;
}
dump r ;
Output :-
("AZ",0.5)
("NY",0.5)
("OR",1.0)
Upvotes: 0
Reputation: 4724
Can you try this?
input.txt
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
6009544 "NY" 6009548 "NY"
PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray);
C = GROUP B BY f2;
D = FOREACH C {
FilterByPatent = FILTER B BY f2==f4;
CityPatentCount = COUNT(B.f2);
GENERATE group,((float)COUNT(FilterByPatent)/(float)CityPatentCount);
}
DUMP D;
Output:
(AZ,0.5)
(NY,0.5)
(OR,1.0)
Upvotes: 1