R3Tech

Reputation: 801

Run a SQL query directly with Apache Spark SQL in Java

I'm trying to figure out how to execute a query directly with Spark SQL. I mean something like:

SQLContext sql = new SQLContext(ctx);
sql.sql("QUERY HERE");

But how do I set the connection information for the database? I'm using an Oracle DB. Previously I used the sql.read().jdbc.. approach, where I pass the connection URL as a parameter. But that way is really slow (4 seconds) compared to running the query directly in the SQL console (0.05 seconds).

Greetz

Upvotes: 3

Views: 3001

Answers (1)

T. Gawęda

Reputation: 16076

You are probably missing the concept behind Spark SQL.

It is NOT an engine for proxying a database in real time. For fast caches you may want to use a data grid such as Oracle Coherence, Hazelcast or Apache Ignite (in no particular order).

Spark is for fast computation over massive datasets. On 03.10 there was an article on the Databricks blog about a CERN use case of Spark: a big query that ran for 12 hours on the database takes only 2 minutes in Spark!

So why is your query slow? Spark SQL is closer to an OLAP system than to an OLTP one. It can process massive datasets very fast, but the data must first be read from the database and only then processed in Spark. That is why the time is much higher in your case: it is load time + calculation time. A database engine can read and calculate in one step (approximately, of course; the implementation may differ).

When you have more data, load time becomes a smaller share of the execution time and processing time a much bigger one. That is where Spark is at its best, because on large datasets processing in the database engine is much slower than in Spark: Spark can parallelize the query better.

How can you tune your query? Read the data once, cache it in memory, and then run your queries against the cached DataFrame. On small datasets it can still be slower, but on big datasets and with heavy reuse of this DataFrame it can help.
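
A minimal sketch of the read-once, cache, then query idea might look like the following. It assumes Spark 2.x or later, where SparkSession takes the place of the SQLContext used in the question, and an Oracle JDBC driver on the classpath. The class name, the helper cacheAsView, the view name "cached_table" and the command-line arguments are hypothetical and only there for illustration.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CachedJdbcQuery {

    // Hypothetical helper: caches the DataFrame and registers it as a temp view
    // so that later spark.sql(...) calls read from memory instead of the database.
    static Dataset<Row> cacheAsView(Dataset<Row> df, String viewName) {
        Dataset<Row> cached = df.cache();          // lazy: only marks the data for caching
        cached.count();                            // action that triggers the load and fills the cache
        cached.createOrReplaceTempView(viewName);  // makes the data visible to spark.sql(...)
        return cached;
    }

    public static void main(String[] args) {
        // Assumed arguments: JDBC URL, table name, user, password.
        String url = args[0];       // e.g. jdbc:oracle:thin:@//host:1521/service (placeholder)
        String table = args[1];
        Properties props = new Properties();
        props.setProperty("user", args[2]);
        props.setProperty("password", args[3]);
        props.setProperty("driver", "oracle.jdbc.OracleDriver");

        SparkSession spark = SparkSession.builder()
                .appName("cached-jdbc-query")
                .master("local[*]")                // local mode, assumed for the sketch
                .getOrCreate();

        // The slow part: the table is read from the database once.
        Dataset<Row> loaded = spark.read().jdbc(url, table, props);
        cacheAsView(loaded, "cached_table");

        // These queries run against the in-memory copy, not against Oracle.
        spark.sql("SELECT COUNT(*) FROM cached_table").show();
        spark.sql("SELECT * FROM cached_table LIMIT 10").show();

        spark.stop();
    }
}
```

The first action still pays the full load time, so this only helps when the cached DataFrame is queried several times.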

Upvotes: 4
