Reputation: 154
I'm trying to use a Dataproc cluster to import large CSV files into HDFS, then export them to SequenceFile format, and finally import the latter into Bigtable as described here: https://cloud.google.com/bigtable/docs/exporting-importing
I initially imported the CSV files as an external table in Hive, then exported them by inserting them in a SequenceFile backed table.
However (probably because Dataproc seems to ship with Hive 1.0?), I hit the cast exception error mentioned here: Bigtable import error
I can't seem to get the HBase shell or ZooKeeper up and running on the Dataproc master VM, so I can't run a simple export job from the CLI.
Is there an alternative way to export Bigtable-compatible SequenceFiles from Dataproc?
What's the proper configuration to set up to get HBase and ZooKeeper running on the Dataproc master VM node?
Upvotes: 3
Views: 2118
Reputation: 1528
The import instructions you linked to are instructions for importing data from an existing HBase deployment.
If the input format you're working with is CSV, creating SequenceFiles is probably an unnecessary step. How about writing a Hadoop MapReduce job to process the CSV files and write directly to Cloud Bigtable? A Dataflow pipeline would also be a good fit here.
Take a look at samples here: https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/java
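For a rough idea of what the MapReduce route could look like, here is a minimal, untested sketch that uses the bigtable-hbase connector's `BigtableConfiguration` together with the stock HBase `TableOutputFormat`. The project ID, instance ID, table name, column family (`cf`), and CSV layout are all placeholders you'd need to adapt:

```java
import java.io.IOException;

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CsvToBigtable {

  // Turns each CSV line into an HBase Put keyed on the first column.
  static class CsvMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] FAMILY = Bytes.toBytes("cf");  // placeholder column family

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",", -1);
      if (fields.length < 2) {
        return;  // skip malformed rows
      }
      byte[] rowKey = Bytes.toBytes(fields[0]);
      Put put = new Put(rowKey);
      // Write the remaining columns under numbered qualifiers; adapt to your schema.
      for (int i = 1; i < fields.length; i++) {
        put.addColumn(FAMILY, Bytes.toBytes("col" + i), Bytes.toBytes(fields[i]));
      }
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    // Configures the HBase client to talk to Cloud Bigtable over gRPC,
    // so no local HBase master or ZooKeeper is needed on the cluster.
    Configuration conf = BigtableConfiguration.configure("my-project", "my-instance");  // placeholders

    Job job = Job.getInstance(conf, "csv-to-bigtable");
    job.setJarByClass(CsvToBigtable.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // directory holding the CSV files

    job.setMapperClass(CsvMapper.class);
    job.setNumReduceTasks(0);  // map-only: each Put goes straight to Bigtable
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "my-table");  // placeholder table

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You'd submit this with `hadoop jar` on the Dataproc master (or via `gcloud dataproc jobs submit hadoop`), pointing the first argument at the CSV directory. Since the connector resolves the Bigtable endpoints itself, the HBase/ZooKeeper services you couldn't start on the master VM aren't needed at all.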
Upvotes: 2