ugurtosun

Reputation: 327

Conversion from a pandas DataFrame to a Spark DataFrame takes a huge amount of time

I use a 48-core remote machine, but the operation shown below takes a huge amount of time for a pandas DataFrame of shape (1009224, 232). Moreover, I cannot see any running stage on the Spark web UI. Any idea or suggestion? [Update] My main problem is that I cannot manage to use all 48 available cores of the machine. I guess my configuration is wrong. The code executes, but not on all 48 cores.

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("rocket3") \
    .config('spark.driver.memory', '30g') \
    .config('spark.executor.memory', '30g') \
    .config('spark.executor.cores', '40') \
    .config('spark.cores.max', '40') \
    .getOrCreate()

import time

start = time.time()
df_sp = spark_session.createDataFrame(x_df)
end = time.time()
print(end - start)
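
For reference: spark.executor.cores and spark.cores.max only take effect under a cluster manager (standalone, YARN, Mesos); when the script runs directly on a single 48-core machine, the degree of parallelism comes from the master URL instead. A minimal sketch, assuming local mode:

from pyspark.sql import SparkSession

# local[*] starts one worker thread per available core;
# use local[48] to pin the count explicitly
spark_session = SparkSession.builder.appName("rocket3") \
    .master('local[*]') \
    .config('spark.driver.memory', '30g') \
    .getOrCreate()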

Upvotes: 2

Views: 765

Answers (1)

Rushikesh Sabde

Reputation: 1626

Use this code snippet for the conversion:

import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

dataset = pd.read_csv("data/file.csv")
sc = SparkContext(conf=SparkConf().setAppName("conversion"))  # placeholder app name
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)
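
If the conversion itself is the slow part, enabling Apache Arrow usually speeds up createDataFrame on a pandas DataFrame considerably, since the data is transferred in columnar batches instead of being serialized row by row. A minimal sketch for Spark 3.x (on Spark 2.3+ the flag is spark.sql.execution.arrow.enabled instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conversion") \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .getOrCreate()

# same conversion as above, now using the Arrow path
sdf = spark.createDataFrame(dataset)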

If you get this error:

TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

then change the data type of the offending columns to str. For example:

df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)
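
If you don't know in advance which columns are mixed, a sketch that casts every object-dtype column (where pandas keeps mixed values) to str before the conversion:

# cast every object-dtype column to str so Spark's schema
# inference sees a single consistent type per column
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].astype(str)
sdf = sqlCtx.createDataFrame(df)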

Upvotes: 2
