Thiago Baldim

Reputation: 7732

Getting an error when converting an RDD to a DataFrame in PySpark

I'm studying Apache Spark and I'm running into something really strange. See the code below:

ClimateRdd = ClimateRdd.map(lambda x: tuple(x))
print ClimateRdd.first()

These commands return this line: ('1743-11-01', '4.3839999999999995', '2.294', '\xc3\x85land')

Then I move this into a DataFrame like this:

schemaDf = sqlContext.createDataFrame(ClimateRdd, schema)
schemaDf.registerTempTable('globalTemp')
result = sqlContext.sql("SELECT dt FROM globalTemp")
result.show(5)

This works perfectly, and I get this result:

+----------+
|        dt|
+----------+
|1743-11-01|
|1743-12-01|
|1744-01-01|
|1744-02-01|
|1744-03-01|
+----------+
only showing top 5 rows

After that, I take the query result and try to run these lines:

dates = result.map(lambda x: "Datas: " + x.dt)
print dates.collect()

I get a Java exception with this cause: Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 5 values are provided.

Well, I did a lot of research and found the problem. I changed the first part of my code to this:

ClimateRdd = ClimateRdd.map(lambda x: (x[0], x[1], x[2], x[3]))       

And it worked!

The point is: why didn't the first part work? Why do I have to build the tuple manually? Is there a way to create this tuple dynamically?

Upvotes: 1

Views: 5518

Answers (2)

Thiago Baldim

Reputation: 7732

The issue was dirty data. Some rows did not match the default split parameter, and that is where the problem was.

When I did the tuple conversion, I assumed the structure had 4 fields, which is true for most of the data. But for one specific line that wasn't the case.

So that is why my DataFrame crashed on the tuple conversion.
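
A minimal sketch of how such rows could be spotted and filtered out before the conversion (this assumes ClimateRdd still holds the split fields from the question, and that a simple length check against the 4-field schema is enough for this data):

# Hypothetical check: look at a few rows whose field count does not match the schema
print ClimateRdd.filter(lambda x: len(x) != 4).take(5)

# Keep only the rows that really have 4 fields before building the DataFrame
cleanRdd = ClimateRdd.filter(lambda x: len(x) == 4).map(lambda x: tuple(x))
schemaDf = sqlContext.createDataFrame(cleanRdd, schema)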

Upvotes: 3

Alexis Benichoux

Reputation: 800

That is a little bit weird. Why do you need tuples? Lists work fine with map:

ClimateRdd.map(lambda x: [x[0], x[1], x[2], x[3]])       
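
If a tuple is really needed, one way to build it dynamically (just a sketch, assuming the first four fields are the ones the schema expects) is to slice before converting:

# Slice to the schema length, then convert, instead of indexing each field by hand
ClimateRdd.map(lambda x: tuple(x[:4]))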

Upvotes: 0
