data appears as null on redshift external table while working right on athena

So I'm trying to run the following simple query on redshift spectrum:

select * from company.vehicles where vehicle_id is not null

and it return 0 rows(all of the rows in the table are null). However when I run the same query on athena it works fine and return results. Tried msck repair but both athena and redshift are using the same metastore so it shouldn't matter. I also don't see any errors.

The format of the files is orc.

The create table query is:

CREATE EXTERNAL TABLE 'vehicles'(
  'vehicle_id' bigint, 
  'parent_id' bigint, 
  'client_id' bigint, 
  'assets_group' int, 
  'drivers_group' int)
PARTITIONED BY ( 
  'dt' string, 
  'datacenter' string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0',  
  'classification'='orc', 
  'compressionType'='none')

Any idea?

Upvotes: 3

Answers (3)

Stefan Käser

Reputation: 21

I had a similar problem and found this solution.

In my case I had external tables that were created with Athena pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I used json_serialization_enable to true; before my queries to make the nested JSON columns queryable. This lead to some columns being NULL when the JSON exceeded a size limit, see here:

If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.

To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.

Upvotes: 0

Gal Itzhak

Reputation: 499

Eventually it turned out to be a bug in redshift. In order to fix it, we needed to run the following command:

ALTER TABLE table_name SET TABLE properties(‘orc.schema.resolution’=‘position’);

Upvotes: 1

theDbGuy

Reputation: 931

How did you create your external table ?? For Spectrum,you have to explicitly set the parameters to treat what should be treated as null

add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns with '' will be treated as NULL to your external table in spectrum

CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
    fields terminated by ','
    stored as textfile 
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');

Alternatively,you can setup the SERDE-PROPERTIES while creating the external table which will automatically recognize NULL values

Upvotes: 1

data appears as null on redshift external table while working right on athena

Answers (3)

Related Questions