Reputation: 2010
I am running Cassandra and have about 20k records in it to play with. I am trying to run a filter in pig on this data but am getting the following message back:
2015-07-23 13:02:23,559 [Thread-4] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: com.datastax.driver.core.exceptions.InvalidQueryException: Expected 8 or 0 byte long (1)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:260)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:205)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Expected 8 or 0 byte long (1)
    at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
    at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:263)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:179)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:44)
    at org.apache.cassandra.hadoop.cql3.CqlRecordReader$RowIterator.<init>(CqlRecordReader.java:259)
    at org.apache.cassandra.hadoop.cql3.CqlRecordReader.initialize(CqlRecordReader.java:151)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:256)
    ... 7 more
You would think this is an obvious error, and believe me, there are a ton of results on Google for it. It seems clear that some piece of my data doesn't conform to the expected type of a given column. What I don't understand is (1) why this is happening and (2) how to debug it. If I try to insert invalid data into Cassandra from my Node.js app, it throws this kind of error whenever a value doesn't match the column's data type, so badly typed data shouldn't be able to get into the table in the first place. I've read that data validation using UTF8 is wonky and that setting a different kind of validation is the answer, but I don't know how to do that. Here are my steps to reproduce:
grunt> define CqlNativeStorage org.apache.cassandra.hadoop.pig.CqlNativeStorage();
grunt> test = load 'cql://blah/blahblah' USING CqlNativeStorage();
grunt> describe test;
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - Found ksDef name: blah
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - partition keys: ["ad_id"]
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - cluster keys: []
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - row key validator: org.apache.cassandra.db.marshal.UTF8Type
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - cluster key validator: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type)
blahblah: {ad_id: chararray,address: chararray,city: chararray,date_created: long,date_listed: long,fireplace: bytearray,furnished: bytearray,garage: bytearray,neighbourhood: chararray,num_bathrooms: int,num_bedrooms: int,pet_friendly: bytearray,postal_code: chararray,price: double,province: chararray,square_feet: int,url: chararray,utilities_included: bytearray}
grunt> query1 = FILTER test BY city == 'New York';
grunt> dump query1;
Then it runs for a while, dumps out tons of logs, and the error appears.
Upvotes: 1
Views: 154
Reputation: 2010
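Discovered my problem: the Pig partitioner did not match the partitioner my CQL3 table's cluster uses, so the data was being parsed incorrectly. Previously the environment variable was PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner. After I changed it to PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner, it started working. Murmur3Partitioner has been the default since Cassandra 1.2 and its tokens are 8-byte longs, which is presumably why the server complained "Expected 8 or 0 byte long" when Pig sent it tokens built for RandomPartitioner.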
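For anyone hitting the same thing, here is a minimal sketch of the change, assuming you launch Pig from a shell and that your cluster uses Murmur3Partitioner (check yours before copying this). The cqlsh check in the comment assumes cqlsh is on your PATH and can reach a node, so it is left commented out.

```shell
# To see which partitioner the cluster really uses (assumes cqlsh is
# installed and can connect to a node):
#   cqlsh -e "SELECT partitioner FROM system.local;"

# Point the Pig/Cassandra integration at the same partitioner.
# The value must match what the cluster reports; Murmur3Partitioner is
# what worked in my case.
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner

# Confirm what the Pig process will inherit.
echo "PIG_PARTITIONER=$PIG_PARTITIONER"
```

Set the variable in the same shell (or startup script) that starts grunt, otherwise the old value is still picked up.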
Upvotes: 1