Avik Aggarwal

Reputation: 619

Drill failing to read most of the columns in Parquet generated by Spark

I am running Drill 1.15 in distributed mode on the data nodes only (3 nodes with 32 GB of memory each). I am trying to read a Parquet file in HDFS that was generated by a Spark job.

The generated file reads just fine in Spark, but when reading it in Drill, all but a few of the columns fail with the following error:

org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Exception occurred while reading from disk. File: [file_name].parquet Column: Line Row Group Start: 111831 File: [file_name].parquet Column: Line Row Group Start: 111831 Fragment 0:0 [Error Id: [Error_id] on [host]:31010]

In the Drill storage configuration for dfs, I have the default config for the parquet format.

I am trying to run a simple query:

select * from dfs.`/hdfs/path/to/parquet/file.parquet`

The file size is also only in the tens of MBs, so it is not large.

I am using Spark 2.3 to generate the Parquet file and Drill 1.15 to read it.

Is there any configuration I am missing, or something else I should check?

Upvotes: 0

Views: 277

Answers (1)

Vitalii Diravka

Reputation: 855

This looks like a bug.
Please create a Jira ticket and attach the file.parquet and the log files.
Thanks

Upvotes: 1

Related Questions