NYCeyes

Reputation: 5669

PIG (v0.10.0) exception during FILTER operation: java.lang.Integer cannot be cast to java.lang.String

Here is my (seemingly trivial) PIG script, followed by the exception that it generates:

raw_logs = LOAD './Apache-WebLog-Samples.d/access_log.txt' USING TextLoader() AS (line:chararray);

logs = FOREACH raw_logs GENERATE FLATTEN (
    REGEX_EXTRACT_ALL(line, '^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([\\w:/]+\\s[+\\-]\\d{4})\\]\\s+"(..*)"\\s+(\\S+)\\s+(\\S+)'))
       AS (remoteAddr:    chararray,
           remoteLogname: chararray,
           user:          chararray,
           date_time:     chararray, 
           request:       chararray,
           httpStatus:          int, -- Here's the problem. It goes away when I set this to chararray.
           numBytes:            int);

httpGET200 = FILTER logs BY (request MATCHES '^GET\\s.*') AND (httpStatus == 200);

mylimit = LIMIT httpGET200 40;

DUMP mylimit;

PIG SCRIPT

java.lang.Exception: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

[ ... non meaningful error output removed ... ]

2013-03-13 14:04:10,882 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.0.0-cdh4.2.0  0.10.0-cdh4.2.0 nmvega  2013-03-13 14:04:05 2013-03-13 14:04:10 FILTER,LIMIT

Failed!    
Failed Jobs:
JobId   Alias   Feature Message Outputs
job_local1982169921_0001    httpGET200,logs,mylimit,raw_logs        Message: Job failed!    

Input(s):
Failed to read data from "file:///home/user/Dropbox/CodeDEV.d/BIG-DATA-SNIPPETS.d/PIG.d/Apache-WebLog-Samples.d/access_log.txt"

Output(s):

EXCEPTION MESSAGES

Everything works except for the 'httpGET200' relation. For reasons not obvious to me, the clause "httpStatus == 200" causes the exception above. When I remove that clause, the problem goes away. Alternatively, when I alter the schema and declare 'httpStatus' as "chararray" instead of "int" (the script above uses "int", which seems appropriate for HTTP status codes), the problem also goes away. Of course, when I do this I have to quote the literal in the filter, like so: httpStatus == '200'.

I checked the input data file and verified that, for every line, the field corresponding to 'httpStatus' is indeed always an integer (... well, a substring representing an integer).
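To make the "substring representing an integer" point concrete, here is a minimal Java sketch that applies the same regular expression to one log line. The class name, the helper `extractAll`, and the sample line are assumptions made for illustration; they are not taken from the actual input file. It only shows that regex capture groups are strings, even when they contain nothing but digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGroupsSketch {
    // Same pattern as in the Pig script above, written as a Java string literal.
    static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([\\w:/]+\\s[+\\-]\\d{4})\\]\\s+\"(..*)\"\\s+(\\S+)\\s+(\\S+)");

    // Hypothetical helper: returns all capture groups as strings,
    // or null when the whole line does not match the pattern.
    static String[] extractAll(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Illustrative sample line in common log format (an assumption, not real data).
        String sample = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                      + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] fields = extractAll(sample);
        // The status field is a String; it becomes an int only after explicit parsing.
        System.out.println(fields[5].getClass().getName());
        System.out.println(Integer.parseInt(fields[5]) == 200);
    }
}
```

The status code arrives as a `String` and only becomes a number after an explicit conversion step.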

Here, by the way, is the schema as grunt reports it to be (i.e. what is expected):

grunt> describe httpGET200;
httpGET200: {remoteAddr: chararray,remoteLogname: chararray,user: chararray,date_time: chararray,request: chararray,httpStatus: int,numBytes: int}

I'd like to understand what is happening here (either a misunderstanding on my part or a PIG limitation). Can anyone shed light?

Thank you!

Upvotes: 2

Views: 3101

Answers (2)

sashaostr

Reputation: 655

I ran into the same problem when trying to compare two integers in a FILTER statement in a Pig script. The most elegant solution I found was to use GenericInvoker. For your problem I would use:

-- StringToInt is a function that invokes the valueOf method of the Integer class with a String argument.
DEFINE StringToInt InvokeForInt('java.lang.Integer.valueOf', 'String');

-- Now we can use it in the FILTER statement (no extra projection is needed to get the right types into the tuple schema).
httpGET200 = FILTER logs BY (request MATCHES '^GET\\s.*') AND (StringToInt(httpStatus) == 200);

Voila!
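For readers unfamiliar with what the DEFINE above delegates to, here is a rough Java sketch of a reflective call to the static method `java.lang.Integer.valueOf(String)`. The class and method names are hypothetical, and this is not Pig's actual invoker code, only an approximation of the idea.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

class StringToIntSketch {
    // Looks up the static method Integer.valueOf(String) and calls it reflectively.
    static int stringToInt(String s) {
        try {
            Method valueOf = Integer.class.getMethod("valueOf", String.class);
            return (Integer) valueOf.invoke(null, s); // null target: static method
        } catch (InvocationTargetException e) {
            // Surface the underlying failure, e.g. NumberFormatException for "-".
            if (e.getCause() instanceof RuntimeException) {
                throw (RuntimeException) e.getCause();
            }
            throw new IllegalStateException(e.getCause());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stringToInt("200") == 200);
        try {
            stringToInt("-");
        } catch (NumberFormatException e) {
            System.out.println("non-numeric input is rejected");
        }
    }
}
```

One thing to keep in mind: `Integer.valueOf` throws `NumberFormatException` on non-numeric text. In this sketch a field such as a byte count logged as `-` would fail, so the approach fits best on fields known to be numeric.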

Upvotes: 1

Lorand Bendig

Reputation: 10650

It seems to me that, with REGEX_EXTRACT_ALL, declaring a field as int in the output schema causes a ClassCastException later, when an arithmetic operation is performed on that field. This is probably because all fields inside the returned tuple remain chararray, and are treated as such, regardless of the declared schema.
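The following Java sketch shows one way such an exception can arise in plain Java when a value's runtime type disagrees with what the surrounding code expects. It is an illustration under assumptions, not Pig's actual comparison code; the class and method names are hypothetical.

```java
class SchemaMismatchSketch {
    // Compares two values through the raw Comparable interface, the way generic
    // code might when it trusts that both operands share a type.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static boolean equalsViaCompareTo(Object field, Object constant) {
        return ((Comparable) field).compareTo(constant) == 0;
    }

    // Reports whether the comparison fails with a ClassCastException.
    static boolean failsWithClassCast(Object field, Object constant) {
        try {
            equalsViaCompareTo(field, constant);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object fieldFromRegex = "200";  // still a String at runtime
        Object filterConstant = 200;    // boxed to an Integer

        // String.compareTo casts its argument to String, so an Integer argument fails.
        System.out.println(failsWithClassCast(fieldFromRegex, filterConstant));

        // After an explicit conversion both sides are Integers and the comparison works.
        System.out.println(equalsViaCompareTo(Integer.valueOf("200"), filterConstant));
    }
}
```

In this sketch the failure happens because the Integer constant is handed to `String.compareTo`, which would explain why the message names Integer as the value that cannot be cast. Converting the field explicitly first removes the mismatch.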

As a workaround you may set all fields to chararray and then perform an explicit cast (conv):

logs = FOREACH raw_logs ....
conv = FOREACH logs generate remoteAddr, remoteLogname, user, date_time, 
         request, (int)httpStatus, (int)numBytes;

Then you can apply the filter you initially used:

httpGET200 = FILTER conv BY (request MATCHES '^GET\\s.*') AND (httpStatus == 200);

You can find some more information about a similar issue in this ticket:

Upvotes: 6
