Reputation: 7732
I'm studying Apache Spark and I've run into something really strange. See the code below:
ClimateRdd = ClimateRdd.map(lambda x: tuple(x))
print ClimateRdd.first()
These commands return this line:
('1743-11-01', '4.3839999999999995', '2.294', '\xc3\x85land')
Then I move this into a DataFrame like this:
schemaDf = sqlContext.createDataFrame(ClimateRdd, schema)
schemaDf.registerTempTable('globalTemp')
result = sqlContext.sql("SELECT dt FROM globalTemp")
result.show(5)
This works perfectly and I get this result:
+----------+
| dt|
+----------+
|1743-11-01|
|1743-12-01|
|1744-01-01|
|1744-02-01|
|1744-03-01|
+----------+
only showing top 5 rows
Then I take the query result and try to run these lines:
dates = result.map(lambda x: "Datas: " + x.dt)
print dates.collect()
I get a Java exception with this cause:
Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 5 values are provided.
I did a lot of research and found a workaround: I changed the first part of the code to this:
ClimateRdd = ClimateRdd.map(lambda x: (x[0], x[1], x[2], x[3]))
And it worked!
The point is: why didn't the first version work? Why do I have to build the tuple manually? Is there a way to create this tuple dynamically?
Upvotes: 1
Views: 5518
Reputation: 7732
The issue was dirty data: some rows did not match the default split pattern. The tuple conversion assumed the structure had 4 fields, which was true for most of the data, but one specific line split into 5 values.
That is why the DataFrame crashed on the tuple conversion.
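For anyone hitting the same thing, here is a minimal sketch of how the bad rows can be spotted before the conversion. The file name and delimiter are hypothetical placeholders, and it assumes sc is your SparkContext:
# Hypothetical file name and delimiter -- adjust to the real dataset.
raw = sc.textFile("GlobalLandTemperatures.csv")
ClimateRdd = raw.map(lambda line: line.split(','))

# Inspect rows whose field count doesn't match the 4-column schema.
bad = ClimateRdd.filter(lambda x: len(x) != 4)
print bad.take(5)

# Keep only well-formed rows; after this the plain tuple() conversion is safe.
ClimateRdd = ClimateRdd.filter(lambda x: len(x) == 4).map(lambda x: tuple(x))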
Upvotes: 3
Reputation: 800
That is a little bit weird. Why do you need tuples? Lists work fine with map:
ClimateRdd.map(lambda x: [x[0], x[1], x[2], x[3]])
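And to answer the "dynamically" part of the question: a slice does the same thing without spelling out every index. A sketch, assuming the schema always describes the first four fields:
# Slice keeps exactly the fields the schema expects, however many columns a row has.
ClimateRdd = ClimateRdd.map(lambda x: x[:4])
schemaDf = sqlContext.createDataFrame(ClimateRdd, schema)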
Upvotes: 0