Cheater

Reputation: 435

Creating hive table on spark output on HDFS

I have a Spark job that runs every 30 minutes and writes its output to HDFS (e.g. /tmp/data/1497567600000). The job runs continuously in the cluster.
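A minimal sketch of this kind of write, assuming PySpark and ORC output (neither is stated above, so treat the names and format as placeholders):

    # Hypothetical sketch of the half-hourly write described above; the real job,
    # input table and file format are assumptions, not taken from the question.
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("half-hourly-export").getOrCreate()

    df = spark.table("source_db.source_table")   # placeholder for the real input

    # each run lands in its own epoch-millisecond directory, e.g. /tmp/data/1497567600000
    run_ts = int(time.time() * 1000)
    df.write.orc("/tmp/data/{}".format(run_ts))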

How can I create a Hive table on top of this data? I have seen a solution on Stack Overflow that creates a Hive table on top of data partitioned by a date field, like this:

 CREATE EXTERNAL TABLE `mydb.mytable`
 (`col1` string, 
  `col2` decimal(38,0), 
  `create_date` timestamp, 
  `update_date` timestamp)
  PARTITIONED BY (`my_date` string)
  STORED AS ORC
  LOCATION '/tmp/out/'

and the solution suggests altering the table as follows:

ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'

But in my case the output directories are not written in that layout, so I can't create the partitions as suggested above.

How can I handle this case, where the output directories are named by timestamp and are not in the partitioned format (/tmp/data/timestamp=1497567600000)?

How can I make Hive pick up the data under the directory /tmp/data?

Upvotes: 2

Views: 672

Answers (1)

Piotr Kalański

Reputation: 689

I can suggest two solutions:

  • If you can change your Spark job, then you can partition your data by hour (e.g. /tmp/data/1, /tmp/data/2), add Hive partitions for each hour and just write to the relevant partition (see the first sketch after this list)
  • You can write a bash script responsible for adding Hive partitions (see the second sketch after this list), which can be achieved by:
    • listing the HDFS subdirectories with the command hadoop fs -ls /tmp/data
    • listing the Hive partitions for the table with the command: hive -e 'show partitions table;'
    • comparing the two lists to find the missing partitions
    • adding the new Hive partitions with the command shown above: ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
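For the first option, a minimal PySpark sketch, assuming the DDL quoted in the question (mydb.mytable, partition column my_date); the input table and paths are illustrative, not the real job:

    # Sketch of the first option: each run writes into its own hour directory and
    # registers that directory as a partition of the external table.
    import time
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partitioned-export")
             .enableHiveSupport()      # needed so ALTER TABLE reaches the Hive metastore
             .getOrCreate())

    df = spark.table("source_db.source_table")   # placeholder for the real input

    hour = time.strftime("%Y%m%d%H")             # e.g. 2016010112
    path = "/tmp/out/{}".format(hour)

    df.write.mode("overwrite").orc(path)         # this run's data under its own hour directory

    # register the new directory as a Hive partition
    spark.sql(
        "ALTER TABLE mydb.mytable ADD IF NOT EXISTS PARTITION (my_date='{0}') "
        "LOCATION '{1}'".format(hour, path)
    )

Because the partition is registered at the end of every run, the new data is queryable from mydb.mytable as soon as the job finishes.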
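For the second option, the same four steps are sketched below in Python (the answer describes them as a bash script), shelling out to the hadoop and hive CLIs. The directory, table and partition-column names are assumptions based on the question:

    # Rough rendering of the partition-sync script described above:
    # list HDFS run directories, list existing Hive partitions, and register
    # whatever is missing. Table, path and column names are assumptions.
    import subprocess

    DATA_DIR = "/tmp/data"
    TABLE = "mydb.mytable"

    # 1. list the HDFS subdirectories (one per Spark run)
    ls_out = subprocess.check_output(["hadoop", "fs", "-ls", DATA_DIR]).decode()
    run_dirs = [line.split()[-1] for line in ls_out.splitlines() if line.startswith("d")]

    # 2. list the partitions Hive already knows about, e.g. "my_date=1497567600000"
    parts_out = subprocess.check_output(
        ["hive", "-e", "show partitions {};".format(TABLE)]).decode()
    existing = {line.split("=", 1)[1] for line in parts_out.splitlines() if "=" in line}

    # 3. compare the two lists and 4. add any missing partitions
    for path in run_dirs:
        ts = path.rstrip("/").rsplit("/", 1)[-1]   # e.g. 1497567600000
        if ts not in existing:
            ddl = ("ALTER TABLE {0} ADD IF NOT EXISTS PARTITION (my_date='{1}') "
                   "LOCATION '{2}'").format(TABLE, ts, path)
            subprocess.check_call(["hive", "-e", ddl])

Such a script could run from cron at the same cadence as the Spark job (or slightly after), so new directories are picked up shortly after they land.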

Upvotes: 1
