ugurtosun

Reputation: 327

Conversion from a pandas DataFrame to a Spark DataFrame takes a huge amount of time

I use a 48-core remote machine, but the operation shown below takes a huge amount of time for a pandas DataFrame of shape (1009224, 232). Moreover, I cannot see any running stage on the Spark web UI. Any idea or suggestion? [Update] My main problem is that I cannot manage to use all 48 available cores of the machine. I guess my configuration is wrong. The code executes, but not on all 48 cores.

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("rocket3") \
    .config('spark.driver.memory', '30g') \
    .config('spark.executor.memory', '30g') \
    .config('spark.executor.cores', '40') \
    .config('spark.cores.max', '40') \
    .getOrCreate()

import time

start = time.time()
df_sp = spark_session.createDataFrame(x_df)
end = time.time()
print(end - start)
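
For reference: spark.executor.cores and spark.cores.max only take effect under a cluster manager (standalone, YARN, Mesos); when the script runs directly on a single 48-core machine, the degree of parallelism comes from the master URL instead. A minimal sketch, assuming local mode:

from pyspark.sql import SparkSession

# local[*] starts one worker thread per available core;
# use local[48] to pin the count explicitly
spark_session = SparkSession.builder.appName("rocket3") \
    .master('local[*]') \
    .config('spark.driver.memory', '30g') \
    .getOrCreate()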

Upvotes: 2

Views: 765

Answers (1)

Rushikesh Sabde

Reputation: 1626

Use this code snippet for the conversion:

import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

dataset = pd.read_csv("data/file.csv")
sc = SparkContext(conf=SparkConf().setAppName("conversion"))  # placeholder app name
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)
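
If the conversion itself is the slow part, enabling Apache Arrow usually speeds up createDataFrame on a pandas DataFrame considerably, since the data is transferred in columnar batches instead of being serialized row by row. A minimal sketch for Spark 3.x (on Spark 2.3+ the flag is spark.sql.execution.arrow.enabled instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conversion") \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .getOrCreate()

# same conversion as above, now using the Arrow path
sdf = spark.createDataFrame(dataset)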

If you get this error:

TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

then change the data type of the offending columns to str. For example:

df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)
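
If you don't know in advance which columns are mixed, a sketch that casts every object-dtype column (where pandas keeps mixed values) to str before the conversion:

# cast every object-dtype column to str so Spark's schema
# inference sees a single consistent type per column
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].astype(str)
sdf = sqlCtx.createDataFrame(df)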

Upvotes: 2
