FirstName LastName

Reputation: 1911

Pig: Cast error while grouping data

This is the script I am trying to run. It does the following:

  1. Take an input (there is a .pig_schema file in the input folder)
  2. Take only two fields (chararray) from it and remove duplicates
  3. Group on one of those fields

The code is as follows:

x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}

distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}

grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;

When I run the GROUP step, I get the following error:

ERROR org.apache.pig.tools.pigstats.SimplePigStats  - 
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String

keywords is a chararray and Pig should be able to group on a chararray. Any ideas?

EDIT: Input file:

0000010000014743       call for midwife    23      1425761139
0000010000062069       naruto 1    56      1425780386
0000010000079919       the following    98     1425788874
0000010000081650       planes 2    76      1425721945
0000010000118785       law and order    21     1425763899
0000010000136965       family guy    12    1425766338
0000010000136100       american dad    19      1425766702

.pig_schema file

{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}

Upvotes: 0

Views: 122

Answers (1)

Murali Rao

Reputation: 2287

Pig is not able to identify the value of keywords as a chararray. It is better to name and type the fields in the initial LOAD; that way the field types are stated explicitly.

x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
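If the .pig_schema file in the input folder is getting in the way, one option worth trying is PigStorage's '-noschema' option, which tells the loader to ignore that file so that the types come only from the AS clause. The following is a minimal sketch of the whole pipeline under that assumption; the alias names (raw, pairs, uniquePairs, byKeyword) are made up for illustration, and I have not verified that this removes the cast error on your data.

```pig
-- Sketch only: '-noschema' asks PigStorage to ignore the .pig_schema file,
-- so the field names and types come solely from the AS clause below.
raw = LOAD '$input' USING PigStorage('\t', '-noschema')
      AS (id:chararray, keywords:chararray, score:chararray, time:long);

-- Keep the two fields of interest and drop duplicate (keywords, id) pairs.
pairs = FOREACH raw GENERATE keywords, id;
uniquePairs = DISTINCT pairs;

-- Group the de-duplicated pairs by keyword.
byKeyword = GROUP uniquePairs BY keywords;
DUMP byKeyword;
```

Running DESCRIBE on raw before the GROUP is a quick way to confirm that keywords is reported as chararray from the AS clause.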

UPDATE:

I updated the .pig_schema to include score, used '\t' as the separator, and ran the steps below against the input you shared.

x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;

I would also suggest using a unique alias name for each step, rather than reassigning an existing alias, for better readability and maintainability.

Output :

    (naruto 1,{(naruto 1,0000010000062069)})
    (planes 2,{(planes 2,0000010000081650)})
    (family guy,{(family guy,0000010000136965)})
    (american dad,{(american dad,0000010000136100)})
    (law and order,{(law and order,0000010000118785)})
    (the following,{(the following,0000010000079919)})
    (call for midwife,{(call for midwife,0000010000014743)})

Upvotes: 1
