Reputation: 681
I am trying to design a data pipeline to migrate my Hive tables into BigQuery. Hive is running on an on-premise Hadoop cluster. My current design is actually very simple: it is just a shell script that does the following for each table:
for each table source_hive_table {
    CREATE TABLE target_avro_hive_table STORED AS AVRO AS
        SELECT * FROM source_hive_table;
    distcp the Avro files from HDFS to Cloud Storage
    bq load --source_format=AVRO your_dataset.something gs://your_bucket/something.avro
}
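In practice each iteration boils down to something like the sketch below (HIVE_DB, GCS_BUCKET and BQ_DATASET are placeholders, the HDFS path is the default Hive warehouse location, and the distcp to gs:// assumes the GCS connector is installed on the cluster):

#!/usr/bin/env bash
HIVE_DB=my_db            # placeholder
GCS_BUCKET=my-bucket     # placeholder
BQ_DATASET=your_dataset  # placeholder

for TABLE in $(hive -S -e "USE ${HIVE_DB}; SHOW TABLES;"); do
    # 1. materialise the source table as Avro in a staging Hive table
    hive -e "CREATE TABLE ${HIVE_DB}.${TABLE}_avro STORED AS AVRO AS SELECT * FROM ${HIVE_DB}.${TABLE};"
    # 2. copy the Avro files from HDFS to Cloud Storage
    hadoop distcp "/user/hive/warehouse/${HIVE_DB}.db/${TABLE}_avro" "gs://${GCS_BUCKET}/staging/${TABLE}"
    # 3. load the staged files into BigQuery (table and schema are created from the Avro files)
    bq load --source_format=AVRO "${BQ_DATASET}.${TABLE}" "gs://${GCS_BUCKET}/staging/${TABLE}/*"
done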
Do you think it makes sense? Is there any better way, perhaps using Spark? I am not happy with the way I am handling the casting, and I would like to avoid creating the BigQuery table twice.
Upvotes: 4
Views: 6667
Reputation: 719
Yes, your migration logic makes sense.
I personally prefer to do the CAST for specific types directly in the initial Hive query that generates your Avro (Hive) data. For instance, the Hive decimal type maps to the Avro type {"type":"bytes","logicalType":"decimal","precision":10,"scale":2}.
BigQuery will only take the primitive type (here bytes) and ignore the logicalType, which is why I find it easier to cast directly in Hive (here to double). The same problem happens with the Hive date type.
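For example, assuming a hypothetical source table with columns id, amount DECIMAL(10,2) and order_date DATE, the staging query could do the casts up front so that the generated Avro schema only contains primitive types:

CREATE TABLE target_avro_hive_table STORED AS AVRO AS
SELECT
    id,
    CAST(amount AS DOUBLE) AS amount,        -- decimal(10,2) -> double
    CAST(order_date AS STRING) AS order_date -- date -> string; convert again in BigQuery if needed
FROM source_hive_table;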
Upvotes: 3