hive unix_timestamp() UDF giving multiple values

Question

I am using HQL to extract some data from a Hive table, while adding an extra column containing the current time.

Something like: select col1, col2, col3, unix_timestamp() from myTable;

I expected every record to have the same value in the timestamp column, something like:

col1Value, col2Value, col3Value, col4Value, timeT
col1Value, col2Value, col3Value, col4Value, timeT
col1Value, col2Value, col3Value, col4Value, timeT
col1Value, col2Value, col3Value, col4Value, timeT
col1Value, col2Value, col3Value, col4Value, timeT
col1Value, col2Value, col3Value, col4Value, timeT

However I am getting something like this:

col1Value, col2Value, col3Value, col4Value, timeT1
col1Value, col2Value, col3Value, col4Value, timeT1
col1Value, col2Value, col3Value, col4Value, timeT1
col1Value, col2Value, col3Value, col4Value, timeT2
col1Value, col2Value, col3Value, col4Value, timeT2
col1Value, col2Value, col3Value, col4Value, timeT2
col1Value, col2Value, col3Value, col4Value, timeT2
col1Value, col2Value, col3Value, col4Value, timeT3
col1Value, col2Value, col3Value, col4Value, timeT3

The dataset is not that large and only a single mapper is used. So my question is:

On a single machine, is unix_timestamp() evaluated for every row that is selected (each line in Hive's mapper), or is one value evaluated and reused for all the rows?

I am using MapR M5/hive 0.9.0

Lukas Vermeer · Accepted Answer

According to the LanguageManual: "the context of a UDF's evaluate method is one row at a time". I believe this means your unix_timestamp() call is evaluated during the map phase, once for each record emitted, so rows processed at different seconds get different values.

Perhaps you could use a subquery to evaluate unix_timestamp() once and then join the result to your original query?
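
A rough, untested sketch of that idea follows. The aliases t, ts and now_ts are names made up for illustration, and whether a join without an ON clause is accepted is an assumption that depends on your Hive version and on hive.mapred.mode (strict mode rejects Cartesian products):

```sql
-- Sketch only: evaluate unix_timestamp() in a one-row subquery,
-- then attach that single value to every row of myTable.
-- t, ts and now_ts are hypothetical names, not from the original question.
SELECT t.col1, t.col2, t.col3, ts.now_ts
FROM myTable t
JOIN (
  -- LIMIT 1 keeps a single row, so only one timestamp value survives
  SELECT unix_timestamp() AS now_ts
  FROM myTable
  LIMIT 1
) ts;
```

Because the subquery yields a single row, every output row should carry the same now_ts, at the cost of an extra pass over the table.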
