Reputation: 11935
I am trying to convert an RDD to a DataFrame, but it fails with an error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, 10.139.64.5, executor 0)
This is my code:
items = [(1, 12), (1, float('Nan')), (1, 14), (1, 10), (2, 22), (2, 20),
         (2, float('Nan')), (3, 300), (3, float('Nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
itemsRdd = rdd.map(lambda x: Row(id=x[0], col1=int(x[1])))
df = itemsRdd.toDF()  # The error is thrown in this line.
Upvotes: 0
Views: 1317
Reputation: 1874
You have multiple problems with this code.
The first problem, which you have probably encountered here, is the missing import of the Row class; without it, the toDF() method fails to execute and create a logical plan for your DataFrame.
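A minimal sketch of the fix for that (this is the standard pyspark.sql import path):

from pyspark.sql import Row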
The second problem occurs in the definition of the col1 column. Executing int(float('nan')) raises a ValueError, which crashes the execution later on when you call an action on the DataFrame.
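You can reproduce the second problem in plain Python, without Spark:

int(float('Nan'))  # ValueError: cannot convert float NaN to integer

If you want to keep the Row-based map, one possible workaround (a sketch using the standard math module, not the only option) is to map NaN to None before converting:

import math
itemsRdd = rdd.map(lambda x: Row(id=x[0], col1=None if math.isnan(x[1]) else int(x[1])))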
You can solve both problems, for example, this way:
items = [(1, 12), (1, float('Nan')), (1, 14), (1, 10), (2, 22), (2, 20),
         (2, float('Nan')), (3, 300), (3, float('Nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
df = rdd.toDF(["id", "col1"])
If you wish to retype the columns, I'd suggest using the cast method on the specific column you want to retype. It's a safer, faster, and more stable way to change column types in a Spark DataFrame than forcing a Python type on each row.
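A sketch of that approach with the standard DataFrame API (the column name "col1" comes from the snippet above):

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = df.withColumn("col1", col("col1").cast(IntegerType()))

How NaN values come out of such a cast follows Spark's own casting rules, so you may want to clean them up first, for example with df.na.drop() or df.na.fill(0).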
Upvotes: 1