Reputation: 1304
I am trying to understand Spark SQL concepts and am wondering whether I could use Spark SQL as an in-memory database, similar to H2/SQLite.
Once I have processed all the records from 100 files, I would like to save the data in tabular form and query those tables for results instead of searching the files each time. Does this make sense?
Dataset<Row> results = spark.sql("SELECT distinct(name) FROM mylogs");
At runtime, if a user asks for the distinct names from the table 'mylogs', the result should come from the table itself, not from the underlying files the table is derived from.
What I noticed is that Spark SQL scans the files all over again, and the user has to wait for the response until all 100 files have been scanned and the data fetched.
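For reference, this is roughly how I load the files and register the table today (the path, file format and column name are just placeholders):

// given an existing SparkSession called `spark`
// read all 100 log files into one Dataset (path and format are illustrative)
Dataset&lt;Row&gt; logs = spark.read().option("header", "true").csv("/data/logs/*.csv");

// register a temporary view so the data can be queried with SQL
logs.createOrReplaceTempView("mylogs");

// every such query currently goes back to the underlying files
Dataset&lt;Row&gt; results = spark.sql("SELECT distinct(name) FROM mylogs");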
Is this a use case for Spark? Is there any better way to achieve this?
Upvotes: 2
Views: 3229
Reputation: 74619
In theory it's doable and you could use Spark SQL as an in-memory database, but I wouldn't be surprised if the data were gone at some point and you had to re-query the 100 files.
You could set it up so that you execute a query over the 100 files once and then cache / persist the results to avoid repeated scans.
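A minimal sketch of that idea, assuming the files are exposed as a view called mylogs as in the question (path, format and storage level are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CachedLogs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cached-mylogs")
                .master("local[*]")
                .getOrCreate();

        // scan the 100 files once (path and format are illustrative)
        Dataset<Row> logs = spark.read()
                .option("header", "true")
                .csv("/data/logs/*.csv");

        // keep the rows in memory, spilling to disk if they don't fit
        logs.persist(StorageLevel.MEMORY_AND_DISK());
        logs.createOrReplaceTempView("mylogs");

        // the first action materializes the cache by scanning the files...
        spark.sql("SELECT DISTINCT name FROM mylogs").show();

        // ...subsequent queries read from the cached data, not the files
        spark.sql("SELECT name, count(*) FROM mylogs GROUP BY name").show();
    }
}

Note that persist is lazy: the cache is only populated the first time an action runs, which is why the first query still pays for the full scan.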
That's pretty much how the Spark Thrift Server works, so you should read the documentation at Running the Thrift JDBC/ODBC server.
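If you go that route, the rough flow (scripts come with the standard Spark distribution; host/port, path and table name below are illustrative) is to start the server, connect with beeline, and cache the table once so later queries reuse it:

# start the JDBC/ODBC server that ships with Spark
./sbin/start-thriftserver.sh

# connect with the bundled beeline client (default host/port)
./bin/beeline -u jdbc:hive2://localhost:10000

Then, inside beeline:

-- expose the 100 files as a table and cache it once
CREATE TEMPORARY VIEW mylogs
  USING csv
  OPTIONS (path "/data/logs/*.csv", header "true");

CACHE TABLE mylogs;

-- later queries in this session read from the cache instead of the files
SELECT DISTINCT name FROM mylogs;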
Upvotes: 2