Reputation: 327
I am using a 48-core remote machine, but the operation shown below takes a huge amount of time for a pandas DataFrame of shape (1009224, 232). I also cannot see a running stage in the Spark web UI. Any ideas or suggestions? [Update] My main problem is that I cannot get Spark to use all 48 cores of the machine. I suspect my configuration is wrong: the code runs, but not on 48 cores.
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("rocket3") \
    .config('spark.driver.memory', '30g') \
    .config('spark.executor.memory', '30g') \
    .config('spark.executor.cores', '40') \
    .config('spark.cores.max', '40') \
    .getOrCreate()
import time

start = time.time()
# x_df is the pandas DataFrame of shape (1009224, 232)
df_sp = spark_session.createDataFrame(x_df)
end = time.time()
print(end - start)
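For reference, this is the kind of configuration I would try next. It is an untested sketch based on my reading of the docs: in local mode the master URL (local[48]), not spark.executor.cores, determines how many cores are used, and spark.sql.execution.arrow.enabled (Spark 2.3+) is supposed to speed up the pandas-to-Spark conversion, which otherwise serializes rows on the driver in a single thread:

from pyspark.sql import SparkSession

# Sketch only: in local mode the master URL controls parallelism;
# 'spark.executor.cores' / 'spark.cores.max' apply to cluster deployments.
spark_session = SparkSession.builder.appName("rocket3") \
    .master("local[48]") \
    .config('spark.driver.memory', '30g') \
    .config('spark.sql.execution.arrow.enabled', 'true') \
    .getOrCreate()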
Upvotes: 2
Views: 765
Reputation: 1626
Use this code snippet for the conversion:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()  # set master / app name here as needed
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)

dataset = pd.read_csv("data/file.csv")
sdf = sqlCtx.createDataFrame(dataset)
If you get this error:
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
then change the data type of the mixed columns to str first, e.g.:
df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)
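Alternatively, instead of casting columns to str, you can pass an explicit schema to createDataFrame so Spark skips type inference altogether. A minimal sketch, assuming the sqlCtx from above and using SomeCol/Col2 as placeholder column names:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Placeholder schema: replace the fields with your real column names/types.
schema = StructType([
    StructField("SomeCol", StringType(), True),
    StructField("Col2", DoubleType(), True),
])
sdf = sqlCtx.createDataFrame(dataset[["SomeCol", "Col2"]], schema=schema)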
Upvotes: 2