ZK Zhao

Reputation: 21613

PySpark & JDBC: When should I use spark with JDBC?

I'm not very familiar with Spark, so please forgive me if this is naive.


I have an HDFS data lake to work with, and the data can be queried through Hive, Presto, Impala, and Spark (in the cluster).

However, Spark does not have built-in access control, so for security reasons I can only use Hive/Presto for queries.

My questions:

  1. Can I install Spark locally and use JDBC to query the data?

  2. Is Spark better at handling big data than pandas on a local machine?

  3. Will query, limit, and transformation restrictions be a problem?

  4. How does Spark compare with Presto in terms of speed?

Thanks!

Upvotes: 1

Views: 5136

Answers (1)

pissall

Reputation: 7419

  1. Yes, you can install Spark locally and use JDBC to connect to your databases. Here is a function to help you connect to MySQL, which you can generalize to any JDBC source by changing the JDBC connection string (a usage sketch follows this list):
def connect_to_sql(
        spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    """Read a table from a JDBC source into a Spark DataFrame."""
    jdbc_url = "jdbc:mysql://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)

    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
    }

    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df
  2. Spark handles big data better than pandas even on a local machine, but its parallelism and distributed-computing machinery add overhead. It will definitely serve your purpose on the cluster, but local mode should be used for development only.

  3. Rest assured, Spark (installed locally) will push queries, limits, and transformations down to the source, and even handle them better if done correctly (see the sketch after this list). Search, sort, and filter operations executed in Spark itself are expensive, though, since a DataFrame is a non-indexed distributed data structure.

  4. I'm not aware of the speed difference between Presto and Spark; I haven't tried a comparison.
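
To make points 1 and 3 concrete, here is a minimal sketch of running the function above from a local Spark installation. The hostname, credentials, table, and column names are placeholders, and the Connector/J package coordinates are an assumption; swap in whatever JDBC driver your database needs:

from pyspark.sql import SparkSession

# Local session; spark.jars.packages fetches the JDBC driver from Maven.
# The connector version is an assumption - match it to your MySQL server.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("jdbc-demo")
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)

# Placeholder connection details - replace with your own.
df = connect_to_sql(spark, "localhost", 3306, "sales_db", "orders", "user", "secret")

# Filters and column pruning on a JDBC DataFrame are pushed down to the
# database where possible; look for PushedFilters in the physical plan.
recent = df.filter(df["order_date"] >= "2019-01-01").select("order_id", "amount")
recent.explain()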

Hope this helps.

Note: a performance improvement is not guaranteed on a local machine even with an optimally parallel workload, since a single machine offers no opportunity for real distribution.
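
That said, a large JDBC read can still be split across local cores (or executors, on a cluster) by partitioning the scan on a numeric column. A sketch, with the table name and bounds assumed for illustration, and with jdbc_url and connection_details built the same way as inside connect_to_sql above:

# Split the scan into 8 range queries on the numeric "id" column.
# lowerBound/upperBound only control how the ranges are carved up;
# rows outside them are still read. The values here are assumptions.
partitioned_df = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    column="id",
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties=connection_details,
)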

Upvotes: 3
