Reputation: 16697
It there any reliable and efficient way to ensure that impala query results gets fully materialized without printing results to console? As example I will use INNER JOIN query.
The obvious way to materialize query results is to create table as select.
CREATE TABLE t3 STORED AS PARQUET AS SELECT t1.* FROM t1 INNER JOIN t2 ON t1.id=t2.id;
The problem with it is that it writes to disc therefore is inefficient. I'm looking for most efficient way to execute query and ensure results are materialized.
As an example, in Spark I can use .cache
method followed by .count
to ensure query is materialized.
val t3 = t1.join(t2, "id")
t3.cache
t3.count
I could try workaround with sub-query.
SELECT COUNT(*) FROM (SELECT t1.* FROM t1 INNER JOIN t2 ON t1.id=t2.id) t3;
But still I need to ensure the sub-query is materialized, which is not obvious if query optimizer discovers that I'm only interested in total count. Maybe there are some hints to enforce that or other tricks?
Upvotes: 3
Views: 787
Reputation: 9067
AFAIK you can't do that with Impala, and will never be able to.
Cloudera designed that tool specifically to support BI tools such as Tableau, Qlik, MicroStrategy etc. -- but not to support ad hoc ETL scripts.
On the other hand Hive now ships with a "HPL-SQL" procedural language wrapper that might fit your needs. Caveats:
And that HPL-SQL tool claims that it also supports Impala queries but I never investigated that claim. Could solve your problem, as a kind of clumsy workaround.
References:
HIVE-11055 (PL/HQL tool contributed to the Hive code base)
HPL/SQL website
Speaking of workarounds, why not use Spark, as you suggested yourself? You might read the Impala/Hive tables, either with Spark native Parquet libraries, or with a custom JDBC connection to an Impala daemon. In essence it would be similar to a HPL/SQL solution.
Upvotes: 1