Reputation: 499
So I'm trying to run the following simple query on redshift spectrum:
select * from company.vehicles where vehicle_id is not null
and it return 0 rows(all of the rows in the table are null). However when I run the same query on athena it works fine and return results. Tried msck repair but both athena and redshift are using the same metastore so it shouldn't matter. I also don't see any errors.
The format of the files is orc.
The create table query is:
CREATE EXTERNAL TABLE 'vehicles'(
'vehicle_id' bigint,
'parent_id' bigint,
'client_id' bigint,
'assets_group' int,
'drivers_group' int)
PARTITIONED BY (
'dt' string,
'datacenter' string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
Upvotes: 3
Views: 4743
Reputation: 21
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I used json_serialization_enable to true;
before my queries to make the nested JSON columns queryable. This lead to some columns being NULL
when the JSON exceeded a size limit, see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.
Upvotes: 0
Reputation: 499
Eventually it turned out to be a bug in redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE properties(‘orc.schema.resolution’=‘position’);
Upvotes: 1
Reputation: 931
How did you create your external table ?? For Spectrum,you have to explicitly set the parameters to treat what should be treated as null
add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns with '' will be treated as NULL to your external table in spectrum
**
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
**
Alternatively,you can setup the SERDE-PROPERTIES while creating the external table which will automatically recognize NULL values
Upvotes: 1