Dynamic partitioning in Hive through the exact inserted timestamp

Question

I need to insert data to a given external table which should be partitioned by the inserted date. My question is how is Hive handling the timestamp generation? When I select a timestamp for all inserted records like this:

WITH delta_insert AS (
        SELECT trg.*, from_unixtime(unix_timestamp()) AS generic_timestamp
        FROM target_table trg
        )
SELECT *
FROM delta_insert;

Will the timestamp always be identical for all records, even if the query takes a lot of time to un?

Or should I alternatively only select an actual timestamp and join it with everything that is selected afterwards?

WITH insert_timestamp AS (
    SELECT from_unixtime(unix_timestamp()) AS insert_timestamp
),
delta_insert AS (
        SELECT trg.*, insert_timestamp.insert_timestamp
        FROM target_table trg, insert_timestamp
        )
SELECT *
FROM delta_insert;

Since it's not recommended to do cross joins in Hive, I wonder what would be the best approach, since I don't have enough test data to simulate long running queries to be sure they get inserted within the same partition of a delta load.

leftjoin · Accepted Answer

No, the timestamp generated using unix_timestamp() WILL NOT always be identical for all records. This function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant (calculated only one time for the query). See this docs: Date functions This not applies to unix_timestamp(string date) with argument. unix_timestamp(string date) function is deterministic.

current_timestamp returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. If you need date, use current_date function.

Dynamic partitioning in Hive through the exact inserted timestamp

Answers (1)

Related Questions