Reputation: 1304
I am trying to understand Spark SQL concepts and am wondering whether I could use Spark SQL as an in-memory database, similar to H2/SQLite.
Once I have processed all the records from 100 files, I would like to save the data in tabular form and query those tables for results instead of searching the files each time. Does this make sense?
Dataset<Row> results = spark.sql("SELECT distinct(name) FROM mylogs");
At runtime, if a user asks for the distinct names from the table 'mylogs', the result should come from the table itself, not from the underlying files the table is derived from.
What I noticed is that Spark SQL scans the files all over again, and the user has to wait for the response until all 100 files have been scanned and the data fetched.
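For reference, this is roughly how I load the files and register the table today (the path, file format and column name are just placeholders):

// given an existing SparkSession called `spark`
// read all 100 log files into one Dataset (path and format are illustrative)
Dataset&lt;Row&gt; logs = spark.read().option("header", "true").csv("/data/logs/*.csv");

// register a temporary view so the data can be queried with SQL
logs.createOrReplaceTempView("mylogs");

// every such query currently goes back to the underlying files
Dataset&lt;Row&gt; results = spark.sql("SELECT distinct(name) FROM mylogs");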
Is this a use case for Spark? Is there any better way to achieve this?
Upvotes: 2
Views: 3229
Reputation: 74619
In theory it's doable and you could use Spark SQL as an in-memory database, but I wouldn't be surprised if the data were gone at some point and you had to re-query the 100 files.
You could set it up so that you execute a query over the 100 files once and then cache / persist the results to avoid repeated scans.
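A minimal sketch of that idea, assuming the files are exposed as a view called mylogs as in the question (path, format and storage level are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CachedLogs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cached-mylogs")
                .master("local[*]")
                .getOrCreate();

        // scan the 100 files once (path and format are illustrative)
        Dataset<Row> logs = spark.read()
                .option("header", "true")
                .csv("/data/logs/*.csv");

        // keep the rows in memory, spilling to disk if they don't fit
        logs.persist(StorageLevel.MEMORY_AND_DISK());
        logs.createOrReplaceTempView("mylogs");

        // the first action materializes the cache by scanning the files...
        spark.sql("SELECT DISTINCT name FROM mylogs").show();

        // ...subsequent queries read from the cached data, not the files
        spark.sql("SELECT name, count(*) FROM mylogs GROUP BY name").show();
    }
}

Note that persist is lazy: the cache is only populated the first time an action runs, which is why the first query still pays for the full scan.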
That's pretty much how the Spark Thrift Server works, so you should read the documentation at Running the Thrift JDBC/ODBC server.
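If you go that route, the rough flow (scripts come with the standard Spark distribution; host/port, path and table name below are illustrative) is to start the server, connect with beeline, and cache the table once so later queries reuse it:

# start the JDBC/ODBC server that ships with Spark
./sbin/start-thriftserver.sh

# connect with the bundled beeline client (default host/port)
./bin/beeline -u jdbc:hive2://localhost:10000

Then, inside beeline:

-- expose the 100 files as a table and cache it once
CREATE TEMPORARY VIEW mylogs
  USING csv
  OPTIONS (path "/data/logs/*.csv", header "true");

CACHE TABLE mylogs;

-- later queries in this session read from the cache instead of the files
SELECT DISTINCT name FROM mylogs;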
Upvotes: 2