shantanuo
shantanuo

Reputation: 32304

Redshift spectrum shows NULL values for all rows

When I run this query in Athena query editor, it works as expected.

SELECT * FROM "sampledb"."elb_logs" limit 10;

elb_logs table has been generated based on the official tutorial. When I try to use spectrum in redshift, I can see all "NULL" values for all columns. I am creating athena_schema using these commands:

drop schema "athena_schema";

create external schema athena_schema from data catalog 
database 'sampledb' 
iam_role 'arn:aws:iam::94331XXXXXXX:role/RedshiftCopyUnload'
region 'ap-south-1';

And the output of system table:

select * from svv_external_tables;

schemaname  tablename   location    input_format    output_format   serialization_lib   serde_parameters    compressed  parameters

athena_schema   elb_logs    s3://athena-examples-ap-south-1/elb/plaintext   org.apache.hadoop.mapred.TextInputFormat    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  org.apache.hadoop.hive.serde2.RegexSerDe    {"input.regex":"([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*)     0   {"EXTERNAL":"TRUE","transient_lastDdlTime":"1480278335"}

I am not sure why the athena console shows the correct values for all columns while redshift shows all NULLs?

Upvotes: 3

Views: 2018

Answers (2)

Sanket Prabhu
Sanket Prabhu

Reputation: 1

There could be multiple causes of this issue:

  1. Role you have created external table does not have access to S3 bucket.
  2. If the role has access to S3 bucket that also needs to be associated with the redshift cluster.
  3. There could be a S3-based policy restricting the getObject access with bucket name or bucket_name_* in the resources.
  4. Check if there is endpoint created in VPC for S3 bucket and check any policy restricting the S3 bucket access..
  5. For testing, create separate bucket load data into it, and try creating an external table using data. catalog and check if it is accessible.

Upvotes: 0

shantanuo
shantanuo

Reputation: 32304

This was because elb_logs table was using regular expression serialization that spectrum can not handle. I converted the table into parquet file format using this command.

CREATE TABLE elb_logs3
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://elb163/parqfiles'
) AS SELECT * from elb_logs

Now athena will have 2 tables "elb_logs" and "elb_logs3". Once I create the external schema using standard commands like this...

drop schema "athena_schema";

create external schema athena_schema from data catalog 
database 'sampledb' 
iam_role 'arn:aws:iam::XXX:role/RedshiftCopyUnload'
region 'us-east-1';

I can now select records from elb_logs table like this...

select * from athena_schema.elb_logs3 limit 10;

Note that selecting from elb_logs table still shows NULL values for all columns.

Upvotes: 2

Related Questions