Reputation: 735
I have an uncompressed Parquet file which has "crawler log" sort of data.
I import it into Spark via PySpark as
sq = SQLContext(sc)
p = sq.read.parquet('/path/to/stored_as_parquet/table/in/hive')
p.take(1)
This shows strings in the source data converted to
Row(host=bytearray(b'somehostname'), checksum=bytearray(b'stuff'), ...)
When I do p.dtypes I see
[('host', 'binary'), ('checksum', 'binary'), ...]
What can I do to avoid this conversion, or alternatively, how do I convert back to the strings I need?
I.e., when I do p.dtypes I want to see
[('host', 'string'), ('checksum', 'string'), ...]
Thanks.
Upvotes: 10
Views: 10952
Reputation: 432
For people using SparkSession, it is:
spark = SparkSession.builder.config('spark.sql.parquet.binaryAsString', 'true').getOrCreate().newSession()
Upvotes: 1
Reputation: 1067
For Spark 2.0 or later, set the runtime option:
spark.conf.set("spark.sql.parquet.binaryAsString","true")
Upvotes: 5
Reputation: 879
I ran into the same problem. Adding
sqlContext.setConf("spark.sql.parquet.binaryAsString","true")
right after creating my SQLContext solved it for me.
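Put together, a minimal sketch of this fix (assumes an existing SparkContext `sc` and the path from the question; the conf must be set before the read):

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

# Must come before the read: Parquet BINARY columns are then
# returned as Spark strings instead of bytearrays.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

p = sqlContext.read.parquet('/path/to/stored_as_parquet/table/in/hive')
# p.dtypes should now report ('host', 'string'), ('checksum', 'string'), ...
```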
Upvotes: 18
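As for the other half of the question ("how do I convert back"): if a DataFrame was already read without the conf, values collected to the driver can be decoded in plain Python. A minimal sketch, assuming the columns hold UTF-8 text (a plain dict stands in for a collected Row; the column names come from the question):

```python
# Stand-in for a collected Row such as
# Row(host=bytearray(b'somehostname'), checksum=bytearray(b'stuff'))
row = {'host': bytearray(b'somehostname'), 'checksum': bytearray(b'stuff')}

# Decode every bytes/bytearray value back to a str, leaving other types alone.
decoded = {k: v.decode('utf-8') if isinstance(v, (bytes, bytearray)) else v
           for k, v in row.items()}

print(decoded)  # {'host': 'somehostname', 'checksum': 'stuff'}
```

Within Spark itself, the equivalent is casting the column, e.g. p.withColumn('host', p['host'].cast('string')).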