Pig - Calculation

Question

I have a dataset in Pig that looks like this:

6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"

The first line reads: "patent 6009544 originated in New York and cites patent 6009545, which originated in New York." For each state, I'm trying to find the fraction of citations made by its patents that point to patents from the same state. So my expected output is:

NY: .5
OR: 1
AZ: .5

because of the 6 citations made by patents that originated in New York, 3 are to patents that also originated in New York. The 1 citation made by a patent that originated in Oregon is to a patent that also originated in Oregon. And of the 2 citations made by patents that originated in Arizona, 1 is to a patent that also originated in Arizona.

Can anyone suggest a good way for performing this in Pig?

Sivasakthi Jayaraman · Accepted Answer

Can you try this?

input.txt
6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"

PigScript:
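-- Load each row as a single field (the default PigStorage splits on tabs, so this assumes space-separated columns)
A = LOAD 'input.txt' AS (line:chararray);
-- Split each row into citing id/state and cited id/state (backslashes are doubled inside Pig string literals)
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray);
-- Group rows by the citing patent's state
C = GROUP B BY f2;
D = FOREACH C {
                -- Rows where the cited patent comes from the same state
                FilterByPatent = FILTER B BY f2==f4;
                -- All rows for this state
                StatePatentCount = COUNT(B.f2);
                GENERATE group,((float)COUNT(FilterByPatent)/(float)StatePatentCount);
              }
DUMP D;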

Output:
(AZ,0.5)
(NY,0.5)
(OR,1.0)
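
The figures above can be cross-checked outside Pig. The following is a minimal Python sketch, not part of the original answer: the function name same_state_share and the hard-coded rows list are illustrative assumptions, and the rows simply mirror input.txt. It counts, per citing state, how many rows cite a patent from the same state and divides by the total number of rows for that state, which is the same ratio the nested FOREACH computes.

```python
from collections import defaultdict

# Rows mirror input.txt: (citing_id, citing_state, cited_id, cited_state)
rows = [
    (6009544, "NY", 6009545, "NY"),
    (6009544, "NY", 6009545, "NY"),
    (6009548, "NY", 6009546, "OR"),
    (6009546, "OR", 6009546, "OR"),
    (6009545, "NY", 6009546, "OR"),
    (6009548, "NY", 6009547, "AZ"),
    (6009547, "AZ", 6009547, "AZ"),
    (6009547, "AZ", 6009548, "NY"),
    (6009544, "NY", 6009548, "NY"),
]


def same_state_share(rows):
    """Return, per citing state, the fraction of rows citing the same state."""
    total = defaultdict(int)  # all rows per citing state
    same = defaultdict(int)   # rows where citing and cited states match
    for _citing_id, citing_state, _cited_id, cited_state in rows:
        total[citing_state] += 1
        if citing_state == cited_state:
            same[citing_state] += 1
    return {state: same[state] / total[state] for state in total}


shares = same_state_share(rows)
for state, share in sorted(shares.items()):
    print(state, share)
```

Like the Pig script, this treats every input row as one citation, so the duplicated first row counts twice.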
