seiya
seiya

Reputation: 1534

How to cache subquery result in WITH clause in Spark SQL

I wonder if Spark SQL support caching result for the query defined in WITH clause. The Spark SQL query is something like this:

with base_view as
(
 select some_columns from some_table
WHERE 
 expensive_udf(some_column) = true
)
... multiple query join based on this view

While this query works with Spark SQL, I noticed that the UDF were applied to the same data set multiple times. In this use case, the UDF is very expensive. So I'd like to cache the query result of base_view so the subsequent queries would benefit from the cached result.

P.S. I know you can create and cache a table with the given query and then reference it in the subqueries. In this specific case, though, I can't create any tables or views.

Upvotes: 3

Views: 3900

Answers (4)

marie20
marie20

Reputation: 783

Not sure if you are still interested in the solution, but the following is a workaround to accomplish the same:-

spark.sql("""
         | create temp view my_view
         | as
         | WITH base_view as
         | (
         | select some_columns 
         | from some_table
         | WHERE 
         | expensive_udf(some_column) = true
         | )
         | SELECT *
         | from base_view 
      """);

spark.sql("""CACHE TABLE my_view""");

Now you can use the my_view temp view to join to other tables as shown below-

spark.sql("""
         | select mv.col1, t2.col2, t3.col3
         | from my_view mv
         | join tab2 t2
         | on mv.col2 = t2.col2 
         | join tab3 t3
         | on mv.col3 = t3.col3 
      """);

Remember to uncache the view after using-

spark.sql("""UNCACHE TABLE my_view""");

Hope this helps.

Upvotes: 0

Mason Ma
Mason Ma

Reputation: 21

This can be done by excuting several sql query.

-- first cache sql
spark.sql("
CACHE TABLE base_view as
  select some_columns
  from some_table
  WHERE 
  expensive_udf(some_column) = true")

-- then use
spark.sql("
... multiple query join based on this view
")

Upvotes: 0

Jim Castro
Jim Castro

Reputation: 894

The WITH clause allows you to give a name to a temporary result set so it ca be reused several times within a single query. I believe what he's asking for is a materialized view.

Upvotes: 1

Ged
Ged

Reputation: 18003

That is not possible. The WITH result cannot be persisted after execution or substituted into new Spark SQL invocation.

Upvotes: 2

Related Questions