Reputation: 33
I am using the MovieLens dataset in Apache PySpark. To load the data into Spark DataFrames, I used two methods.

spark.read.load() -> Works fine.

spark.read.format("jdbc") -> The second method reads from PostgreSQL over JDBC:

dataframeList[table] = spark.read.format("jdbc") \
    .options(
        url='jdbc:postgresql://localhost:5432/movielens_dataset',
        dbtable=table,
        user='postgres',
        password='postgres',
        driver='org.postgresql.Driver') \
    .load()
This code is inside a for loop that reads 6 tables; a sketch of the loop is below.
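For context, the loop looks roughly like this (the table names here are just placeholders, not necessarily the exact ones I use):

tables = ['movies', 'ratings', 'tags', 'links', 'genome_scores', 'genome_tags']  # placeholder names
dataframeList = {}
for table in tables:
    dataframeList[table] = spark.read.format("jdbc") \
        .options(
            url='jdbc:postgresql://localhost:5432/movielens_dataset',
            dbtable=table,
            user='postgres',
            password='postgres',
            driver='org.postgresql.Driver') \
        .load()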
This method does not seem to work when I call show(5) on a big DataFrame. It runs into:
ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.OutOfMemoryError: Java heap space
I tried increasing the driver memory, setting it to 4/5/6g, but that led to Chrome crashing and shutting down.
from pyspark import SparkConf
from pyspark.sql import SparkSession

sparkConf = SparkConf() \
    .setAppName("My app") \
    .set("spark.jars", "postgresql-42.5.0.jar") \
    .set("spark.driver.memory", "6g")
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
I am running this code in a Jupyter Notebook in Google Chrome, and my laptop has 8 GB of RAM. I am unable to understand why it does not work in this case, since in both cases the result is essentially a DataFrame; it's just that they are created by different methods.
Can someone explain why this is happening, and how I can fix this problem? Can someone also recommend a good resource for understanding the Spark configuration variables (driver/executor cores, memory, memoryOverhead, etc.) and how tweaking them can help in different situations? Thank you so much!
Upvotes: 0
Views: 497
Reputation: 2033
When you use JDBC to connect to an external database, PostgreSQL in your case, increasing driver memory will not help solve your error, since the data fetching is done by the executor. Your error also shows that the Java heap space OOM occurred in task 0.0 of stage 1.0 on your executor.
Based on the code and configuration you provided, you didn't set any connection properties, so all parameters are using their default values. Therefore, when you read a large dataset from PostgreSQL, your Spark application reads it as a single partition, meaning one task on one executor has to fetch the whole table. Depending on how your executor is configured, fetching a large dataset that way can easily trigger an OOM error.
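You can verify this from the DataFrame itself: with no partitioning options, the JDBC read comes back as a single partition (the table name and credentials below are placeholders):

# With only url/dbtable/user/password/driver set, Spark issues a single
# "SELECT * FROM <table>" and the whole result lands in one partition.
df = spark.read.format("jdbc") \
    .options(
        url='jdbc:postgresql://localhost:5432/movielens_dataset',
        dbtable='ratings',            # placeholder table name
        user='postgres',
        password='postgres',
        driver='org.postgresql.Driver') \
    .load()

print(df.rdd.getNumPartitions())      # typically prints 1 for an unpartitioned JDBC read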
When we use JDBC in Spark, parallelism is one of the most critical factors for application performance. Increasing the number of executors is only one way to increase parallelism; you can also increase the number of partitions of the read itself with the partitionColumn, numPartitions, lowerBound and upperBound options, and tune fetchsize to control how many rows are fetched per round trip. A sketch of a partitioned read is below.
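A minimal sketch of a partitioned JDBC read, assuming the table has a numeric column such as user_id that can serve as the partition column (the table name, column name and bounds are illustrative):

# Spark generates numPartitions queries, each covering a slice of
# [lowerBound, upperBound] on partitionColumn, so the table is fetched
# by several tasks in parallel instead of a single one.
df = spark.read.format("jdbc") \
    .options(
        url='jdbc:postgresql://localhost:5432/movielens_dataset',
        dbtable='ratings',                # placeholder table name
        user='postgres',
        password='postgres',
        partitionColumn='user_id',        # illustrative numeric column
        lowerBound='1',                   # illustrative bounds; use the column's real min/max
        upperBound='200000',
        numPartitions='8',
        fetchsize='10000',                # rows fetched per database round trip
        driver='org.postgresql.Driver') \
    .load()

df.show(5)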
The best way to understand the Spark configuration is through the Spark documentation: https://spark.apache.org/docs/latest/configuration.html; for the JDBC part you can check this one: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html. I also recommend reading this post, as it is a good introduction to Spark optimization for JDBC reads.
There is no standard solution or set of parameters that suits and optimizes every case, since performance is affected by the database, the JDBC design, the network, the disks the database uses, and so on. The best way to find fitting parameters is trial and error.
Upvotes: 1