Reputation: 602
I'm working with Apache Spark, and in my project I want to use Spark SQL. But I need to be sure about Spark SQL's query performance. I know that Spark SQL is not as efficient as an RDBMS for this kind of workload, but I want to know: is there a large time gap between Spark SQL and RDBMS queries?
For example, I'm working on a virtual machine with 4 GB RAM and a 1-core CPU, so it is a slow system. I have a small data set with two tables: the first has 5M records, the second has 1K records. When I join the two tables, the query takes about 60 seconds. Is that normal for Spark SQL on this hardware? If I did the same join in an RDBMS it would take much less time, but I can't test it under the same physical limits at the office.
And one last question: how can I reduce the query time in Spark SQL?
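For reference, the join I'm measuring looks roughly like the sketch below (table names, column names, and the Parquet paths are placeholders, not my actual schema):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("JoinTimingExample")
  .getOrCreate()

// big_table: ~5M records, small_table: ~1K records (loaded from Parquet as an example)
val bigTable = spark.read.parquet("/data/big_table")
val smallTable = spark.read.parquet("/data/small_table")

bigTable.createOrReplaceTempView("big_table")
smallTable.createOrReplaceTempView("small_table")

// Plain join between the two tables; this is the query that takes ~60 seconds
val joined = spark.sql(
  "SELECT b.*, s.name FROM big_table b JOIN small_table s ON b.key = s.key")

joined.count() // force execution so the query time can be measured
```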
Upvotes: 1
Views: 1379
Reputation: 11
I believe the problem is the virtual machine. I was in the same boat, and what ended up fixing it was installing Spark directly on Windows (you can do that, just google it). The performance was much better (I have a 4-core laptop with 4 GB RAM and an SSD).
Spark SQL is really powerful, depending on your needs. The performance compared to what you are measuring now can be great, but you need to do and implement things differently than you would in a regular RDBMS.
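One example of "doing things differently": when one side of the join is tiny (your 1K-row table), broadcasting it to all executors avoids shuffling the 5M-row table. A minimal sketch, assuming Spark 2.x and hypothetical table paths/column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("BroadcastJoinExample")
  .getOrCreate()

val bigTable = spark.read.parquet("/data/big_table")     // ~5M rows
val smallTable = spark.read.parquet("/data/small_table") // ~1K rows

// Ask Spark to ship the small table to every executor instead of shuffling the big one
val joined = bigTable.join(broadcast(smallTable), Seq("key"))

// Caching helps if the same joined data is queried repeatedly
joined.cache()
joined.count()
```

Spark can also pick a broadcast join on its own when the small table is below the `spark.sql.autoBroadcastJoinThreshold` setting, but with only 4 GB RAM and 1 core you will still be limited mostly by the hardware.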
Upvotes: 0