Luke
Luke

Reputation: 2225

Pig - Calculation

I have a dataset in Pig that looks like this:

6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"

The first line is read like so: "patent 6009544 originated in New York, and cites patent 6009545 which originated in New York." I'm trying to find, for each state, the % of patents cited that originate from the same state. So my expected output should be

NY: .5
OR: 1
AZ: .5

because of the 6 patents that originated in New York, 3 cite patents that also originated in New York. The 1 patent that originated in Oregon cites a patent that also originated in New York. And of the 2 patents that originated in Arizona, 1 cites a patent that also originated in Arizona.

Can anyone suggest a good way for performing this in Pig?

Upvotes: 1

Views: 330

Answers (2)

harit singh
harit singh

Reputation: 1

I change the sample data and separate data using space:

A = load '/padata' using PigStorage(' ' ) as (pno:int,pcity:chararray,pci:int,pccity:chararray);

b = group A by pcity ;

r = foreach b {

               copcity= COUNT(A.pcity) ;

               samdata = FILTER A by pcity==pccity;

               csamdata = COUNT(samdata);

               percent = (float)csamdata/(float)copcity;

               generate group,percent ;

               }

dump r ; 

Output :-

("AZ",0.5)

("NY",0.5)

("OR",1.0)

Upvotes: 0

Sivasakthi Jayaraman
Sivasakthi Jayaraman

Reputation: 4724

Can you try this?

input.txt
6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"

PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray);
C = GROUP B BY f2;
D = FOREACH C {
                FilterByPatent = FILTER B BY f2==f4;
                CityPatentCount = COUNT(B.f2);
                GENERATE group,((float)COUNT(FilterByPatent)/(float)CityPatentCount);
              }
DUMP D;

Output:
(AZ,0.5)
(NY,0.5)
(OR,1.0)

Upvotes: 1

Related Questions