Reputation: 7732
I'm studying Apache Spark and I've run into something really strange. See the code below:
ClimateRdd = ClimateRdd.map(lambda x: tuple(x))
print ClimateRdd.first()
These commands return this line:
('1743-11-01', '4.3839999999999995', '2.294', '\xc3\x85land')
Then I move this into a DataFrame like this:
schemaDf = sqlContext.createDataFrame(ClimateRdd, schema)
schemaDf.registerTempTable('globalTemp')
result = sqlContext.sql("SELECT dt FROM globalTemp")
result.show(5)
This works perfectly and I get this result:
+----------+
| dt|
+----------+
|1743-11-01|
|1743-12-01|
|1744-01-01|
|1744-02-01|
|1744-03-01|
+----------+
only showing top 5 rows
Then I take the query result and try to run these lines:
dates = result.map(lambda x: "Datas: " + x.dt)
print dates.collect()
I get a Java exception with this cause:
Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 5 values are provided.
I did a lot of research and found a workaround: I changed the first part of the code to this:
ClimateRdd = ClimateRdd.map(lambda x: (x[0], x[1], x[2], x[3]))
And it worked!
The point is: why didn't the first version work? Why do I have to build the tuple manually? Is there a way to create this tuple dynamically?
Upvotes: 1
Views: 5518
Reputation: 7732
The issue was dirty data: some rows did not match the default split pattern. The tuple conversion assumed the structure had 4 fields, which was true for most of the data, but one specific line split into 5 values.
That is why the DataFrame crashed on the tuple conversion.
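For anyone hitting the same thing, here is a minimal sketch of how the bad rows can be spotted before the conversion. The file name and delimiter are hypothetical placeholders, and it assumes sc is your SparkContext:
# Hypothetical file name and delimiter -- adjust to the real dataset.
raw = sc.textFile("GlobalLandTemperatures.csv")
ClimateRdd = raw.map(lambda line: line.split(','))

# Inspect rows whose field count doesn't match the 4-column schema.
bad = ClimateRdd.filter(lambda x: len(x) != 4)
print bad.take(5)

# Keep only well-formed rows; after this the plain tuple() conversion is safe.
ClimateRdd = ClimateRdd.filter(lambda x: len(x) == 4).map(lambda x: tuple(x))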
Upvotes: 3
Reputation: 800
That is a little bit weird. Why do you need tuples? Lists work fine with map:
ClimateRdd.map(lambda x: [x[0], x[1], x[2], x[3]])
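And to answer the "dynamically" part of the question: a slice does the same thing without spelling out every index. A sketch, assuming the schema always describes the first four fields:
# Slice keeps exactly the fields the schema expects, however many columns a row has.
ClimateRdd = ClimateRdd.map(lambda x: x[:4])
schemaDf = sqlContext.createDataFrame(ClimateRdd, schema)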
Upvotes: 0