Reputation: 11935
I am trying to convert an RDD to a DataFrame, but it fails with an error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, 10.139.64.5, executor 0)
This is my code:
items = [(1, 12), (1, float('Nan')), (1, 14), (1, 10), (2, 22), (2, 20),
         (2, float('Nan')), (3, 300), (3, float('Nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
itemsRdd = rdd.map(lambda x: Row(id=x[0], col1=int(x[1])))
df = itemsRdd.toDF()  # The error is thrown in this line.
Upvotes: 0
Views: 1317
Reputation: 1874
You have multiple problems with this code.
The first problem, which you have probably encountered here, is the missing import of the Row class; without it, the toDF() method fails to execute and create a logical plan for your DataFrame.
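A minimal sketch of the fix for that (this is the standard pyspark.sql import path):

from pyspark.sql import Row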
The second problem occurs in the definition of the col1 column. Executing int(float('nan')) raises a ValueError, which crashes the execution later on when you call an action on the DataFrame.
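You can reproduce the second problem in plain Python, without Spark:

int(float('Nan'))  # ValueError: cannot convert float NaN to integer

If you want to keep the Row-based map, one possible workaround (a sketch using the standard math module, not the only option) is to map NaN to None before converting:

import math
itemsRdd = rdd.map(lambda x: Row(id=x[0], col1=None if math.isnan(x[1]) else int(x[1])))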
You can solve both problems, for example, this way:
items = [(1, 12), (1, float('Nan')), (1, 14), (1, 10), (2, 22), (2, 20),
         (2, float('Nan')), (3, 300), (3, float('Nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
df = rdd.toDF(["id", "col1"])
If you wish to retype the columns, I'd suggest using the cast method on the specific column you want to retype. It's a safer, faster, and more stable way to change column types in a Spark DataFrame than forcing a Python type on each row.
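A sketch of that approach with the standard DataFrame API (the column name "col1" comes from the snippet above):

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = df.withColumn("col1", col("col1").cast(IntegerType()))

How NaN values come out of such a cast follows Spark's own casting rules, so you may want to clean them up first, for example with df.na.drop() or df.na.fill(0).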
Upvotes: 1