user2535982

Reputation:

What exactly is SparkSQL?

I am very new to this whole world of "big data" tech, and recently started reading about Spark. One thing that keeps coming up is SparkSQL, yet I consistently fail to comprehend what exactly it is.

Is it supposed to convert SQL queries into MapReduce jobs that operate on the data you give it? But aren't DataFrames already essentially SQL tables in terms of functionality?

Or is it some tech that allows you to connect to an SQL database and use Spark to query it? In this case, what's the point of Spark in here at all - why not use SQL directly? Or is the point that you can use your structured SQL data in combination with the flat data?

Again, I am emphasizing that I am very new to all of this and may or may not be talking out of my butt :). So please do correct me and be forgiving if you see that I'm clearly misunderstanding something.

Upvotes: 4

Views: 700

Answers (2)

Orléando Dassi

Reputation: 466

Spark

Spark is a framework, a very large set of components, used for scalable, efficient analysis of big data.

For example, people upload a petabyte of video to YouTube every day. Reading just one terabyte from a disk takes about three hours at 100 megabytes per second. That's quite a long time (and inexpensive disks cannot help us here). So the challenge we face is that one machine cannot process, or even store, all of the data. The solution is to distribute the data over a cluster of machines.
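As a rough check of those numbers, here is a small back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), a steady 100 MB/s per disk, and perfectly parallel reads across machines, which real clusters only approximate. The function name `read_hours` is made up for this illustration.

```python
# Back-of-the-envelope read times; all figures are idealised assumptions.
TERABYTE = 10**12          # bytes, decimal units
PETABYTE = 10**15          # bytes, decimal units
READ_RATE = 100 * 10**6    # bytes per second per disk (100 MB/s)


def read_hours(num_bytes, machines=1):
    """Hours needed to read num_bytes, split evenly across `machines` disks."""
    seconds = num_bytes / (READ_RATE * machines)
    return seconds / 3600


one_tb_hours = read_hours(TERABYTE)                      # one disk, one terabyte
one_pb_days = read_hours(PETABYTE) / 24                  # one disk, one petabyte
one_pb_hours_1000 = read_hours(PETABYTE, machines=1000)  # 1000 disks in parallel

print(f"1 TB on one disk:   {one_tb_hours:.1f} hours")
print(f"1 PB on one disk:   {one_pb_days:.0f} days")
print(f"1 PB on 1000 disks: {one_pb_hours_1000:.1f} hours")
```

Under these assumptions a single disk needs a bit under three hours per terabyte, so a petabyte would take months on one machine but only hours when spread over a thousand.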

DataFrames are the primary abstraction in Spark.

We can construct a DataFrame from text files, JSON files, the Hadoop Distributed File System (HDFS), Apache Parquet, Hypertable, Amazon S3 files, or Apache HBase, and then perform operations and transformations on it regardless of where the data comes from.

Spark SQL

Spark SQL is a Spark module for structured data processing, as described on its documentation page.

So one of the benefits of Spark SQL is that it lets us query structured data from many data sources using SQL syntax, and it offers many other possibilities besides. I think this is the reason we don't use SQL directly.

Upvotes: 0

maxymoo

Reputation: 36555

Your first guess is essentially correct: it's an API in Spark where you can write queries in SQL and they will be converted to a parallelised Spark job (Spark can do more complex types of operations than just map and reduce). Spark DataFrames are actually just a wrapper around this API; they are an alternative way of accessing it, depending on whether you're more comfortable coding in SQL or in Python/Scala.

Upvotes: 3
