jangorecki

Reputation: 16697

Ensure that Impala query gets materialized

Is there any reliable and efficient way to ensure that Impala query results get fully materialized without printing the results to the console? As an example I will use an INNER JOIN query.

The obvious way to materialize query results is to create table as select.

CREATE TABLE t3 STORED AS PARQUET AS SELECT t1.* FROM t1 INNER JOIN t2 ON t1.id=t2.id;

The problem with it is that it writes to disk, and is therefore inefficient. I'm looking for the most efficient way to execute a query and ensure its results are materialized.

As an example, in Spark I can use the .cache method followed by .count to ensure the query is materialized.

val t3 = t1.join(t2, "id")
t3.cache
t3.count

I could try a workaround with a sub-query.

SELECT COUNT(*) FROM (SELECT t1.* FROM t1 INNER JOIN t2 ON t1.id=t2.id) t3;

But I still need to ensure the sub-query is materialized, which is not guaranteed if the query optimizer discovers that I'm only interested in the total count. Maybe there are some hints to enforce that, or other tricks?

Upvotes: 3

Views: 787

Answers (1)

Samson Scharfrichter

Reputation: 9067

AFAIK you can't do that with Impala, and will never be able to.
Cloudera designed that tool specifically to support BI tools such as Tableau, Qlik, MicroStrategy, etc. -- not ad hoc ETL scripts.

On the other hand, Hive now ships with an "HPL/SQL" procedural language wrapper that might fit your needs. Caveats:

  • requires Hive 2.0+
  • requires running your whole script inside the HPL/SQL interpreter, not the base Hive client (nor a standard JDBC connection)

The HPL/SQL tool also claims to support Impala queries, but I never investigated that claim. It could solve your problem, albeit as a somewhat clumsy workaround.

References:
  HIVE-11055 (PL/HQL tool contributed to the Hive code base)
  HPL/SQL website


Speaking of workarounds, why not use Spark, as you suggested yourself? You could read the Impala/Hive tables either with Spark's native Parquet libraries or through a custom JDBC connection to an Impala daemon. In essence it would be similar to an HPL/SQL solution.

Upvotes: 1
