Nitin

Reputation: 735

Spark import of Parquet files converts strings to bytearray

I have an uncompressed Parquet file containing "crawler log"-style data.

I import it into Spark via PySpark as follows:

sq = SQLContext(sc)
p = sq.read.parquet('/path/to/stored_as_parquet/table/in/hive')
p.take(1)

This shows that strings in the source data were converted to

Row(host=bytearray(b'somehostname'), checksum=bytearray(b'stuff'), ...)

When I do p.dtypes I see

[('host', 'binary'), ('checksum', 'binary'), ...]

What can I do to avoid this conversion, or alternatively, how do I convert back to what I need?

I.e., when I do p.dtypes I want to see

[('host', 'string'), ('checksum', 'string'), ...]

Thanks.
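If re-reading the file is not an option, the binary columns can also be cast back to strings after the fact. A minimal sketch, assuming the DataFrame p and the host and checksum columns from the question:

from pyspark.sql.functions import col

# Cast each binary column back to string (interpreted as UTF-8)
p = p.withColumn('host', col('host').cast('string')) \
     .withColumn('checksum', col('checksum').cast('string'))

p.dtypes  # now [('host', 'string'), ('checksum', 'string'), ...]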

Upvotes: 10

Views: 10952

Answers (3)

emjeexyz

Reputation: 432

For people using SparkSession, it is:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.sql.parquet.binaryAsString', 'true')
         .getOrCreate()
         .newSession())
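Any Parquet file read through that session should then come back with string columns. A short usage sketch, assuming the path from the question:

p = spark.read.parquet('/path/to/stored_as_parquet/table/in/hive')
p.dtypes  # now reports [('host', 'string'), ('checksum', 'string'), ...]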

Upvotes: 1

Vijay Krishna

Reputation: 1067

For Spark 2.0 or later, set the option at runtime:

spark.conf.set("spark.sql.parquet.binaryAsString","true")
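Note that the option has to be set before the Parquet file is read. A minimal sketch, assuming an existing SparkSession named spark and the path from the question:

spark.conf.set("spark.sql.parquet.binaryAsString", "true")
p = spark.read.parquet('/path/to/stored_as_parquet/table/in/hive')
p.dtypes  # binary columns now come back as string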

Upvotes: 5

uuazed

Reputation: 879

I ran into the same problem. Adding

sqlContext.setConf("spark.sql.parquet.binaryAsString","true")

right after creating my SQLContext solved it for me.
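Put together with the code from the question, the full flow looks roughly like this (a sketch, assuming sc is an existing SparkContext and the same path):

from pyspark.sql import SQLContext

sq = SQLContext(sc)
sq.setConf("spark.sql.parquet.binaryAsString", "true")  # set before any Parquet read
p = sq.read.parquet('/path/to/stored_as_parquet/table/in/hive')
p.dtypes  # [('host', 'string'), ('checksum', 'string'), ...]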

Upvotes: 18
