Reputation: 63062
My issue is with trying to read data from a sql.Row as a String. I'm using pyspark, but I've heard people have this issue with the Scala API too.
The pyspark.sql.Row object is a pretty intransigent creature. The following exception is thrown:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(Row.scala:183)
So what we have is one of the fields being represented as a byte array. The following Python printing constructs do NOT work:
repr(sqlRdd.take(2))
Also
import pprint
pprint.pprint(sqlRdd.take(2))
Both result in the ClassCastException.
So... how do other folks handle this? I started to roll my own conversion (I can't copy/paste it here, unfortunately), but that feels a bit like reinventing the wheel, or so I suspect.
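For reference, a rough sketch of the kind of thing I mean (not my actual code; it assumes mapping the SchemaRDD to a plain RDD sidesteps the getString call, and that the binary fields are UTF-8 text):

import pprint

# Decode any bytearray fields to str, leaving other values untouched.
def decode_fields(row):
    return tuple(
        field.decode("utf-8") if isinstance(field, bytearray) else field
        for field in row
    )

decoded = sqlRdd.map(decode_fields)  # plain RDD of tuples now
pprint.pprint(decoded.take(2))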
Upvotes: 3
Views: 4263
Reputation: 31513
Try
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
I think they broke this in Spark 1.1.0: reading binary as strings used to work, then they made it not work, but added this flag, with its default set to false.
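In pyspark it would look something like this (a sketch; it assumes you're building the SQLContext yourself, and the Parquet path is made up, so substitute your own):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="binaryAsString-demo")
sqlContext = SQLContext(sc)

# Interpret Parquet BINARY columns as strings (default is false).
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

# Hypothetical path -- substitute your own Parquet data.
sqlRdd = sqlContext.parquetFile("hdfs:///path/to/data.parquet")
print(repr(sqlRdd.take(2)))  # string fields come back as str, not [B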
Upvotes: 4