Reputation: 435
I have a Spark job that runs every 30 minutes and writes its output to HDFS, e.g. /tmp/data/1497567600000. The job runs continuously in the cluster.
How can I create a Hive table on top of this data? I have seen a solution on StackOverflow that creates a Hive table on top of data partitioned by a date field, like this:
CREATE EXTERNAL TABLE `mydb.mytable`
(`col1` string,
`col2` decimal(38,0),
`create_date` timestamp,
`update_date` timestamp)
PARTITIONED BY (`my_date` string)
STORED AS ORC
LOCATION '/tmp/out/'
and the solution suggests altering the table like this:
ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
But in my case I don't know in advance which output directories will be written, so I can't create the partitions as suggested above.
How can I handle this case, where the output directories are written with arbitrary timestamps and are not in the format /tmp/data/timestamp=1497567600000? How can I make Hive pick up the data under the /tmp/data directory?
Upvotes: 2
Views: 672
Reputation: 689
I can suggest two solutions:
1. Change the way you write the data: write each hour's output into its own directory (e.g. /tmp/data/1, /tmp/data/2), add a Hive partition for each hour and just write to the relevant partition (see the first sketch below).
2. Write a script that runs periodically and registers any new directories as partitions (see the second sketch below):
- get the output directories: hadoop fs -ls /tmp/data
- get the partitions the table already has: hive -e 'show partitions table;'
- add each directory that has no partition yet, e.g.
ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
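For the first option, here is a minimal PySpark sketch, assuming your job is (or can be) written with the DataFrame API. The table name, partition column my_date, ORC format and /tmp/out/ location come from the DDL in the question; the sample rows and the hourly partition value are purely illustrative.

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .appName("hourly-partition-writer")
         .enableHiveSupport()
         .getOrCreate())

# Stand-in for the DataFrame the real job computes every 30 minutes.
df = spark.createDataFrame(
    [("a", 1, "2016-01-01 00:00:00", "2016-01-01 00:00:00")],
    ["col1", "col2", "create_date", "update_date"])

# One partition value per hour, e.g. "2016010112".
my_date = datetime.now().strftime("%Y%m%d%H")

# partitionBy produces directories named /tmp/out/my_date=<value>,
# which is the layout the external table in the question expects.
(df.withColumn("my_date", lit(my_date))
   .write
   .mode("append")
   .partitionBy("my_date")
   .orc("/tmp/out/"))

# Let Hive discover any new partition directories
# (ALTER TABLE ... ADD PARTITION works just as well).
spark.sql("MSCK REPAIR TABLE mydb.mytable")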
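For the second option, here is a hedged sketch of such a script in Python, built around the same hadoop fs -ls and hive -e commands. It assumes the table's partition column is the string my_date and that every timestamp directory under /tmp/data should become one partition; adjust paths and names to your setup.

import subprocess

DATA_DIR = "/tmp/data"
TABLE = "mydb.mytable"

def run(cmd):
    # Run a shell command and return its stdout as text.
    return subprocess.check_output(cmd, shell=True, text=True)

# 1. Directories produced by the Spark job, e.g. 1497567600000.
dirs = [line.split("/")[-1]
        for line in run("hadoop fs -ls {}".format(DATA_DIR)).splitlines()
        if line.startswith("d")]

# 2. Partitions Hive already knows about, e.g. my_date=1497567600000.
existing = {p.split("=", 1)[1]
            for p in run("hive -e 'show partitions {};'".format(TABLE)).splitlines()
            if "=" in p}

# 3. Register every directory that is not a partition yet.
for ts in dirs:
    if ts not in existing:
        run("hive -e \"ALTER TABLE {} ADD PARTITION (my_date='{}') "
            "LOCATION '{}/{}'\"".format(TABLE, ts, DATA_DIR, ts))

Run it from cron (for example every 30 minutes, right after the Spark job) so new directories become queryable shortly after they land.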
Upvotes: 1