Reputation: 711
My S3 file (s3://MY_s3/stage/my_s3_file) has a partition on load_dt:
S3 Structure:
-MY_S3
--stage
---my_s3_file
----load_dt=2016-02-01
-----000024
----load_dt=2016-02-02
-----000036
The actual files sit under the load_dt partition directories, e.g. 000024 and 000036.
The COPY command works fine if I leave load_dt out of the Redshift table, but when I add load_dt to the Redshift table the COPY command fails with a data error because the input layout and the target layout do not match (the extra load_dt column exists only on the target).
The Hive table that creates the S3 files shows the partitioned load_dt column at the end.
How can I make this COPY command work with load_dt present in the target Redshift table?
I even tried changing the S3 location to s3://MY_s3/stage/my_s3_file/load_dt, but no luck.
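For reference, a minimal sketch of the kind of table and COPY involved; only load_dt and the S3 path come from the question, while the other column names, types, and the IAM role are placeholders, and COPY format options are omitted:
-- Target table with the partition column added on the Redshift side.
CREATE TABLE my_s3_file (
    col_a   VARCHAR(256),
    col_b   VARCHAR(256),
    load_dt DATE
);

-- Scans every object under the prefix, including the load_dt=... directories,
-- but the files contain no load_dt field, so the layouts no longer match.
COPY my_s3_file
FROM 's3://MY_s3/stage/my_s3_file'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';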
Upvotes: 8
Views: 9992
Reputation: 1
You can achieve this with Redshift's external schema support for a Hive metastore (Redshift Spectrum): create an external schema in Redshift pointing to Hive.
Example:
create external schema "redshift_external_schema"
from hive metastore
database 'hive_database'
uri 'hive.host.name.com' port 9083
iam_role 'arn:aws:iam::2188*********:role/redshift-role-name';
Then use a CTAS statement:
create table redshift_schema.table_name as select * from
redshift_external_schema.table_name
Note: this will not work if your Hive tables have complex data types such as array, map, or struct. In that case, you need to run the CREATE TABLE DDL in Redshift yourself and use the COPY command.
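A minimal sketch of that fallback, assuming hypothetical column names, an illustrative S3 path and IAM role, and Parquet files (none of these identifiers come from the answer):
-- Run the CREATE TABLE DDL in Redshift yourself, then COPY the table's underlying files.
CREATE TABLE redshift_schema.table_name (
    id      BIGINT,
    payload VARCHAR(1024),
    load_dt DATE
);

COPY redshift_schema.table_name
FROM 's3://your-bucket/path/to/hive/table/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-role-name'
FORMAT AS PARQUET;  -- or DELIMITER/GZIP options for delimited text files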
Upvotes: 0
Reputation: 11
The Redshift COPY command raises errors when the table schema and the Parquet columns do not match. If you use a range (e.g. daily) partition, you may be able to use a script like the one below.
# Redshift connection settings used by psql
export PGUSER=sample
export PGPASSWORD=sample
export PGHOST=xxxx.xxxx.redshift.amazonaws.com
export PGPORT=5439
export PGDATABASE=xxxx

# first partition date (epoch seconds) and the S3 prefix up to the partition column
start_date=`date -d '2018-01-01' +%s`
base_s3='s3://bucket_and_table_root_path/range_column_name='

# one COPY per daily partition, covering a full year
for day in {0..364}
do
base_ymd=`date -d @$((${start_date} + 3600 * 24 * ${day})) +'%Y-%m-%d'`
base_ymd_lower=`echo ${base_ymd} | sed '1,$s/-/_/g'`  # 2018-01-01 -> 2018_01_01, for the table name
echo ${base_s3}${base_ymd}
echo 'start-time' `date`
psql <<EOF
DROP TABLE IF EXISTS loading_table_${base_ymd_lower}_temp;
CREATE TABLE loading_table_${base_ymd_lower}_temp (
col_a VARCHAR(xxxx) ENCODE LZO,
...
)
DISTSTYLE EVEN
;
COPY loading_table_${base_ymd_lower}_temp
FROM '${base_s3}${base_ymd}'
iam_role 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx'
format as parquet
;
EOF
done
Then, in Redshift, you can combine the daily staging tables with a CTAS that uses the UNION keyword, as sketched below.
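A minimal sketch of that final step, using just two of the daily staging tables and a hypothetical combined table name; the added date literal is an illustration of how the partition value, which is absent from the Parquet files, can be reintroduced, and UNION ALL avoids the duplicate-elimination cost of plain UNION:
-- Combine the per-day staging tables into a single table, re-adding the
-- partition value as a regular column.
CREATE TABLE loading_table_combined AS
SELECT *, DATE '2018-01-01' AS range_column_name FROM loading_table_2018_01_01_temp
UNION ALL
SELECT *, DATE '2018-01-02' AS range_column_name FROM loading_table_2018_01_02_temp;
-- ...one SELECT per remaining daily table.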
Upvotes: 1
Reputation: 269390
When Hive (running under Apache Hadoop) creates a partitioned EXTERNAL TABLE, it separates files by directory. For example, all rows where load_dt=2016-02-01 are stored in the directory called load_dt=2016-02-01.
The files within the directory do not store a value for the partition column (load_dt). Instead, the value of the partition column is stored as part of the directory name. As such, renaming the directory will actually change the value in that column for all rows within the directory.
Yes, it's a bit strange, but that's how Hive stores its data!
Amazon Redshift can import CSV files (including compressed CSV files) from Amazon S3. It can even import files from multiple sub-directories because it only looks at the path prefix of the files to load. However, it does not understand the way Hive stores partitioned data, so it will not load the partition column from the directory name.
Some options:
- Have Hive also write the load_dt value inside the data files themselves, so COPY can load it directly
- Load the data with COPY, then run an UPDATE that sets the column value based on some calculation (eg from a date field); a variation of this is sketched below
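A minimal sketch of a variation of the second option, copying one partition directory at a time and then supplying load_dt as a literal; the column list and IAM role are placeholders, not taken from the answer:
-- Load a single partition directory; the omitted load_dt column is left NULL.
COPY my_s3_file (col_a, col_b)
FROM 's3://MY_s3/stage/my_s3_file/load_dt=2016-02-01'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';

-- Supply the partition value that the files themselves do not contain.
UPDATE my_s3_file
SET load_dt = DATE '2016-02-01'
WHERE load_dt IS NULL;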
Upvotes: 7
Reputation: 711
I think I found the answer for my case.
I was not able to load the Hive partition column because Hive stores the partition value in its metadata, not in the data files, which is why the column was missing from the S3 files.
Now I am adding a new column, Load_Dt_New, to the S3 output through Hive, so the files contain the required column and my Redshift COPY command works.
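A minimal HiveQL sketch of that idea, with hypothetical table names, column names, file format, and output location (the actual Hive job is not shown in the thread):
-- Hypothetical unpartitioned staging table whose files Redshift will COPY;
-- the last column materializes the partition value as ordinary data.
CREATE EXTERNAL TABLE stage_my_s3_file (
    col_a       STRING,
    col_b       STRING,
    load_dt_new STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://MY_s3/stage/my_s3_file_with_dt';

-- Rewrite the partitioned data, turning the partition column into a column
-- that is physically stored in the files.
INSERT OVERWRITE TABLE stage_my_s3_file
SELECT col_a, col_b, load_dt
FROM partitioned_source_table;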
Upvotes: 2
Reputation: 71
"copy table from 's3://MY_s3/stage/my_s3_file/load_dt' " should work if you can change the file name to "load_dt_20160201" rather than "load_dt=2016-02-01". COPY command will take the last part of S3 path as prefix.
Upvotes: 1