Cheater

Reputation: 435

Creating hive table on spark output on HDFS

I have a Spark job that runs every 30 minutes and writes its output to HDFS (e.g. /tmp/data/1497567600000). The job runs continuously in the cluster.
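A minimal sketch of this kind of write, assuming PySpark and ORC output (neither is stated above, so treat the names and format as placeholders):

    # Hypothetical sketch of the half-hourly write described above; the real job,
    # input table and file format are assumptions, not taken from the question.
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("half-hourly-export").getOrCreate()

    df = spark.table("source_db.source_table")   # placeholder for the real input

    # each run lands in its own epoch-millisecond directory, e.g. /tmp/data/1497567600000
    run_ts = int(time.time() * 1000)
    df.write.orc("/tmp/data/{}".format(run_ts))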

How can I create a Hive table on top of this data? I have seen a solution on Stack Overflow that creates a Hive table on top of data partitioned by a date field, like this:

 CREATE EXTERNAL TABLE `mydb.mytable`
 (`col1` string, 
  `col2` decimal(38,0), 
  `create_date` timestamp, 
  `update_date` timestamp)
  PARTITIONED BY (`my_date` string)
  STORED AS ORC
  LOCATION '/tmp/out/'

and the solution suggests altering the table as follows:

ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'

But in my case the output directories are not written in that layout, so I can't create the partitions as suggested above.

How can I handle this case, where the output directories are named by timestamp and are not in the partitioned format (/tmp/data/timestamp=1497567600000)?

How can I make Hive pick up the data under the directory /tmp/data?

Upvotes: 2

Views: 672

Answers (1)

Piotr Kalański

Reputation: 689

I can suggest two solutions:

  • If you can change your Spark job, then you can partition your data by hour (e.g. /tmp/data/1, /tmp/data/2), add Hive partitions for each hour and just write to the relevant partition (see the first sketch after this list)
  • You can write a bash script responsible for adding Hive partitions (see the second sketch after this list), which can be achieved by:
    • listing the HDFS subdirectories with the command hadoop fs -ls /tmp/data
    • listing the Hive partitions for the table with the command: hive -e 'show partitions table;'
    • comparing the two lists to find the missing partitions
    • adding the new Hive partitions with the command shown above: ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
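For the first option, a minimal PySpark sketch, assuming the DDL quoted in the question (mydb.mytable, partition column my_date); the input table and paths are illustrative, not the real job:

    # Sketch of the first option: each run writes into its own hour directory and
    # registers that directory as a partition of the external table.
    import time
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partitioned-export")
             .enableHiveSupport()      # needed so ALTER TABLE reaches the Hive metastore
             .getOrCreate())

    df = spark.table("source_db.source_table")   # placeholder for the real input

    hour = time.strftime("%Y%m%d%H")             # e.g. 2016010112
    path = "/tmp/out/{}".format(hour)

    df.write.mode("overwrite").orc(path)         # this run's data under its own hour directory

    # register the new directory as a Hive partition
    spark.sql(
        "ALTER TABLE mydb.mytable ADD IF NOT EXISTS PARTITION (my_date='{0}') "
        "LOCATION '{1}'".format(hour, path)
    )

Because the partition is registered at the end of every run, the new data is queryable from mydb.mytable as soon as the job finishes.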
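For the second option, the same four steps are sketched below in Python (the answer describes them as a bash script), shelling out to the hadoop and hive CLIs. The directory, table and partition-column names are assumptions based on the question:

    # Rough rendering of the partition-sync script described above:
    # list HDFS run directories, list existing Hive partitions, and register
    # whatever is missing. Table, path and column names are assumptions.
    import subprocess

    DATA_DIR = "/tmp/data"
    TABLE = "mydb.mytable"

    # 1. list the HDFS subdirectories (one per Spark run)
    ls_out = subprocess.check_output(["hadoop", "fs", "-ls", DATA_DIR]).decode()
    run_dirs = [line.split()[-1] for line in ls_out.splitlines() if line.startswith("d")]

    # 2. list the partitions Hive already knows about, e.g. "my_date=1497567600000"
    parts_out = subprocess.check_output(
        ["hive", "-e", "show partitions {};".format(TABLE)]).decode()
    existing = {line.split("=", 1)[1] for line in parts_out.splitlines() if "=" in line}

    # 3. compare the two lists and 4. add any missing partitions
    for path in run_dirs:
        ts = path.rstrip("/").rsplit("/", 1)[-1]   # e.g. 1497567600000
        if ts not in existing:
            ddl = ("ALTER TABLE {0} ADD IF NOT EXISTS PARTITION (my_date='{1}') "
                   "LOCATION '{2}'").format(TABLE, ts, path)
            subprocess.check_call(["hive", "-e", ddl])

Such a script could run from cron at the same cadence as the Spark job (or slightly after), so new directories are picked up shortly after they land.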

Upvotes: 1
