user11856426

Can Apache Spark be used as a database replacement? (For example, to replace MySQL)

I need a scalable database solution that can scale out to multiple worker nodes, and I came across Apache Spark SQL, which seems to be very powerful and resilient. Can I use it as a MySQL replacement? I've tried creating, reading, updating, and deleting values from a DataFrame, but it seems like it wasn't built for this purpose? I (currently) can't find a way to update any rows... It's almost like it's really good for querying data once you have it, but not for inserting data.

Am I mistaken? I'm extremely new to Spark and I don't want to spend time trying to make it into something it's not.

In the case that it can't be used as a database... does that mean that Spark is just used for analytics? Should I store my data in a database and then later load the dataset into Spark if I want real-time information?

Upvotes: 2

Views: 3671

Answers (3)

stevel

Reputation: 13430

As an OLTP database for transactions where you update multiple tables and commit the work: no, not a chance.

As a basis for data warehouse-style analysis of data, e.g. OLAP (Online Analytical Processing): yes.

Put differently, if your SQL code has this line at the top:

BEGIN TRANSACTION

then you need a database like MySQL, Postgres, etc.
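
To illustrate, here is a minimal Python sketch using the built-in sqlite3 module (the table and amounts are made up) of the kind of multi-statement, commit-or-rollback work that calls for a real OLTP database rather than Spark:

import sqlite3

# Open an in-memory database and set up a toy table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])

try:
    # Both updates succeed together or not at all -- the
    # BEGIN TRANSACTION ... COMMIT semantics Spark SQL does not offer
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE id = 2")
    conn.commit()
except sqlite3.Error:
    conn.rollback()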

Upvotes: 1

id3a

Reputation: 118

Explore Delta Lake. Delta Lake provides ACID transactions, and you can build a reliable "data warehouse" inside a data lake (like S3 or ADLS).

This means you can do update/delete/insert/merge on Delta tables.
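
Here is a minimal PySpark sketch, assuming the delta-spark package is installed and using a throwaway path (/tmp/demo_table) as a placeholder:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Configure a SparkSession with the Delta Lake extensions
spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Write an ordinary DataFrame out as a Delta table
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/demo_table")

# Update and delete rows in place -- something a plain Spark DataFrame can't do
table = DeltaTable.forPath(spark, "/tmp/demo_table")
table.update(condition="id = 1", set={"value": "'updated'"})
table.delete(condition="id = 2")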

Keep in mind that Spark is a strong candidate for the processing and preparation layer: you can ingest data from various sources in batch or streaming mode, mix them together if needed, and make sense of your data with Delta Lake.

However, there are better tools for the serving layer that can handle lots of concurrent users/queries, such as SQL databases or Dremio.

Upvotes: 3

ernest_k

Reputation: 45309

Short answer: No.

The description line on Spark's website reads:

Apache Spark™ is a unified analytics engine for large-scale data processing.

And the Spark SQL documentation describes it:

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. [...] When running SQL from within another programming language the results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.

So yes, Spark allows one to run SQL queries on DataFrames (resulting in other DataFrames), but DataFrames are immutable, and "changing" data usually means exporting queried and transformed data sets back to an underlying database (relational or not) or other storage (file system/DFS).
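
As a sketch of that pattern (the JDBC URL, table names, and credentials below are placeholders, and the MySQL JDBC driver is assumed to be on the classpath): read from a database, query with SQL into a new DataFrame, then write the result back out:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Read an existing table from a relational database over JDBC
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/shop")
          .option("dbtable", "orders")
          .option("user", "reader")
          .option("password", "...")
          .load())

# Running SQL produces a new, immutable DataFrame; nothing changes in place
orders.createOrReplaceTempView("orders")
totals = spark.sql(
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")

# "Updating" data means exporting the transformed result back to storage
(totals.write.format("jdbc")
       .option("url", "jdbc:mysql://localhost:3306/shop")
       .option("dbtable", "customer_totals")
       .option("user", "writer")
       .option("password", "...")
       .mode("overwrite")
       .save())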

Spark even allows applications to connect to it via JDBC and submit queries as though it were an RDBMS, but it is not meant to replace databases. Stick to using Spark for batch and ad hoc processing or analysis. In fact, even for normal applications' SQL queries, you should prefer a database, because Spark can be an inefficient alternative for typical, random-access queries (it processes data in memory, so it might be forced to make unnecessary reads just to find and return a small fraction of the data).

Upvotes: 2
