kskp
kskp

Reputation: 712

Pig: Count only specific rows

I have a data that has location, sentiment and brand fields. I want to calculate the number of positives, negatives and neutrals in a location for a brand.

Assuming x has the data, I did:

a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand), COUNT(x.sentiment=="positive"?1:0) AS positive_count, COUNT(x.sentiment=="negative"?1:0) AS negative_count, COUNT(x.sentiment=="neutral:?1:0) as neutral_count;

But I am getting a syntax error saying Unexpected character '"'

I tried grouping by all three: location, sentiment and brand but I am getting only the overall count like:

{location: "newyork", brand: "pampers", sentiment = "positive", count = 10}
{location: "newyork", brand: "pampers", sentiment = "negative", count = 2}
{location: "newyork", brand: "pampers", sentiment = "neutral", count = 20}

I want seperate fields for positives_count, negatives_count and neutrals_count. Something like this:

{location: "newyork", brand: "pampers", positive_count = 10, negative_count = 2, neutral_count = 20}
{location: "london", brand: "pampers", positive_count = 12, negative_count = 0, neutral_count = 35}
{location: "newyork", brand: "huggies", positive_count = 40, negative_count = 6, neutral_count = 10}

Can some one help me out please?

Upvotes: 0

Views: 161

Answers (2)

nobody
nobody

Reputation: 11090

Use single quotes

a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand), 
                    COUNT(x.sentiment=='positive'?1:0) AS positive_count, 
                    COUNT(x.sentiment=='negative'?1:0) AS negative_count, 
                    COUNT(x.sentiment=='neutral'?1:0) as neutral_count;

EDIT

newyork pampers positive
newyork pampers positive
newyork pampers negative
newyork pampers positive
newyork pampers positive
newyork pampers neutral
newyork pampers positive
newyork pampers negative
newyork pampers neutral
newyork pampers positive
newyork pampers positive
newyork pampers neutral

Script

B = GROUP A BY (location,brand);
C = FOREACH B  { 
                  A1 = FILTER A BY sentiment matches 'positive';
                  A2 = FILTER A BY sentiment matches 'negative';
                  A3 = FILTER A BY sentiment matches 'neutral';
                  GENERATE FLATTEN(group) as (location,brand),COUNT(A1),COUNT(A2),COUNT(A3);
               };

Output

enter image description here

Upvotes: 0

kskp
kskp

Reputation: 712

I filtered the alias that contained the original data and counted each number of entries and joined them all.

p = FILTER y BY (sentiment == 'positive');
p1 = GROUP p BY (location, brand, avl_author_type);
p2 = FOREACH p1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(p) AS positive_counts;

n = FILTER y BY (sentiment == 'negative');
n1 = GROUP n BY (location, brand, avl_author_type);
n2 = FOREACH n1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(n) AS negative_counts;

ne = FILTER y BY (sentiment == 'neutral');
ne1 = GROUP ne BY (location, brand, avl_author_type);
ne2 = FOREACH ne1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(ne) AS neutral_counts;

j1 = JOIN p2 BY (location, brand, avl_author_type) LEFT OUTER, n2 BY (location, brand, avl_author_type);
j2 = FOREACH j1 GENERATE p2::location as location, p2::brand as brand, p2::avl_author_type as avl_author_type, p2::positive_counts as positive_counts, n2::negative_counts as negative_counts;

j3 = JOIN j2 BY (location, brand, avl_author_type) LEFT OUTER, ne2 BY (location, brand, avl_author_type);
j4 = FOREACH j3 GENERATE j2::location as location, j2::brand as brand, j2::avl_author_type as avl_author_type, j2::positive_counts as positive, j2::negative_counts as negative, ne2::neutral_counts as neutral;

Kind of lengthy but worked.

Upvotes: 0

Related Questions