Pig filter fails due to unexpected data

Question

I am running Cassandra and have about 20k records in it to play with. I am trying to run a filter in Pig on this data, but I get the following message back:

2015-07-23 13:02:23,559 [Thread-4] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: com.datastax.driver.core.exceptions.InvalidQueryException: Expected 8 or 0 byte long (1)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:260)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:205)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Expected 8 or 0 byte long (1)
    at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
    at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:263)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:179)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:44)
    at org.apache.cassandra.hadoop.cql3.CqlRecordReader$RowIterator.<init>(CqlRecordReader.java:259)
    at org.apache.cassandra.hadoop.cql3.CqlRecordReader.initialize(CqlRecordReader.java:151)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:256)
    ... 7 more

You would think this is an obvious error, and believe me, there are a ton of results on Google for it. It's clear that some piece of my data doesn't conform to the expected type of a given column. What I don't understand is (1) why this is happening and (2) how to debug it. If I try to insert invalid data into Cassandra from my Node.js app, it throws this kind of error whenever the data type doesn't match the column's data type, so invalid data shouldn't be able to get in at all, should it? I've read that data validation using UTF8 is wonky and that setting a different kind of validation is the answer, but I don't know how to do that. Here are my steps to reproduce:

grunt> define CqlNativeStorage org.apache.cassandra.hadoop.pig.CqlNativeStorage();
grunt> test = load 'cql://blah/blahblah' USING CqlNativeStorage();
grunt> describe test;
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - Found ksDef name: blah
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - partition keys: ["ad_id"]
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - cluster keys: []
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - row key validator: org.apache.cassandra.db.marshal.UTF8Type
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - cluster key validator: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type)
blahblah: {ad_id: chararray,address: chararray,city: chararray,date_created: long,date_listed: long,fireplace: bytearray,furnished: bytearray,garage: bytearray,neighbourhood: chararray,num_bathrooms: int,num_bedrooms: int,pet_friendly: bytearray,postal_code: chararray,price: double,province: chararray,square_feet: int,url: chararray,utilities_included: bytearray}
grunt> query1 = FILTER blahblah BY city == 'New York';
grunt> dump query1;

Then it runs for a while, dumps out tons of logs, and the error appears.
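
As far as I can tell, the number in parentheses in "Expected 8 or 0 byte long (1)" is the byte length of the value that was rejected: a 64-bit long must serialize to exactly 8 bytes (or be empty), and something 1 byte long was supplied where a long was expected. The Python sketch below only imitates that length check so the message is easier to read. It is not Cassandra's code, the helper names validate_long and find_bad_longs are hypothetical, and the sample rows are made up (only the column names date_created and date_listed come from the schema above). It does not say which value in my setup is actually at fault.

```python
import struct


def validate_long(raw: bytes) -> None:
    """Imitate the length check: accept only 8-byte or empty values."""
    if len(raw) not in (0, 8):
        raise ValueError(f"Expected 8 or 0 byte long ({len(raw)})")


def find_bad_longs(rows, long_columns):
    """Return (row_index, column, byte_length) for each value failing the check."""
    bad = []
    for index, row in enumerate(rows):
        for column in long_columns:
            raw = row.get(column, b"")
            try:
                validate_long(raw)
            except ValueError:
                bad.append((index, column, len(raw)))
    return bad


# Made-up sample rows holding raw serialized values (assumption, not real data).
rows = [
    # A proper big-endian 8-byte long and an empty value: both pass.
    {"date_created": struct.pack(">q", 1437656543559), "date_listed": b""},
    # A 1-byte value where a long is expected: this one fails.
    {"date_created": b"\x01", "date_listed": struct.pack(">q", 0)},
]

bad = find_bad_longs(rows, ["date_created", "date_listed"])
print(bad)  # [(1, 'date_created', 1)]
```

Read this way, the error suggests looking for anything typed as long that arrives as a single byte, rather than at the chararray columns such as city.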

