Josh D

Reputation: 894

Polybase from Parquet error: Cannot cast java.lang.Double to

I'm loading Azure Data Warehouse via PolyBase, reading Parquet files stored on Azure Blob.

First I created an external table in SQL pointing to the Parquet file, then loaded it with CTAS. No matter what data type I use in SQL, I get the same type-casting error. I've tried DECIMAL, NUMERIC, and FLOAT, but loading into VARCHAR works fine.

I suspect it has something to do with how the Parquet file was created, which is from a Python pandas DataFrame, using df.to_parquet with the pyarrow engine. Digging into the source code and experimenting, I see that the column's data type while it is in Arrow (the step before Parquet) is double. Maybe that's why?
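
To check that, I print the Arrow schema before writing; a minimal sketch (Table.from_pandas is the conversion the pyarrow engine of to_parquet performs under the hood):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'price': [101, 102, 103]})

# Convert to an Arrow table (the intermediate step before Parquet)
# and print the schema to see which type each column was given.
table = pa.Table.from_pandas(df, preserve_index=False)
print(table.schema)
# In this minimal repro the integer literals come out as int64 (a long);
# a float column would come out as double.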

I also tried both Gzip and Snappy as the compression type, both when creating the file and when creating the SQL external file format. No dice.
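
For the record, the Gzip attempt paired the writer and the external file format roughly like this (codec name per the PolyBase docs):

Python:

df.to_parquet('C:/aapl_test.parquet', engine='pyarrow', compression='gzip', index=False)

SQL:

CREATE EXTERNAL FILE FORMAT [ParquetFileGzip] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = N'org.apache.hadoop.io.compress.GzipCodec'
)
GO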

Going crazy from this. Any ideas?

Steps to reproduce

Environment:

conda create -n testenv python=3.6

conda install -n testenv -c conda-forge pyarrow

conda list -n testenv

# Name                    Version                   Build  Channel
arrow-cpp                 0.13.0           py36hee3af98_1    conda-forge
boost-cpp                 1.68.0            h6a4c333_1000    conda-forge
brotli                    1.0.7             he025d50_1000    conda-forge
ca-certificates           2019.3.9             hecc5488_0    conda-forge
certifi                   2019.3.9                 py36_0    conda-forge
gflags                    2.2.2             he025d50_1001    conda-forge
glog                      0.3.5                h6538335_1
intel-openmp              2019.3                      203
libblas                   3.8.0                     5_mkl    conda-forge
libcblas                  3.8.0                     5_mkl    conda-forge
liblapack                 3.8.0                     5_mkl    conda-forge
libprotobuf               3.7.1                h1a1b453_0    conda-forge
lz4-c                     1.8.1.2              h2fa13f4_0
mkl                       2019.3                      203
numpy                     1.16.2           py36h8078771_1    conda-forge
openssl                   1.1.1b               hfa6e2cd_2    conda-forge
pandas                    0.24.2           py36h6538335_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
pip                       19.0.3                   py36_0
pyarrow                   0.13.0           py36h8c67754_0    conda-forge
python                    3.6.8                h9f7ef89_7
python-dateutil           2.8.0                      py_0    conda-forge
pytz                      2019.1                     py_0    conda-forge
re2                       2019.04.01       vc14h6538335_0  [vc14]  conda-forge
setuptools                41.0.0                   py36_0
six                       1.12.0                py36_1000    conda-forge
snappy                    1.1.7             h6538335_1002    conda-forge
sqlite                    3.27.2               he774522_0
thrift-cpp                0.12.0            h59828bf_1002    conda-forge
vc                        14.1                 h0510ff6_4
vs2015_runtime            14.15.26706          h3a45250_0
wheel                     0.33.1                   py36_0
wincertstore              0.2              py36h7fe50ca_0
zlib                      1.2.11            h2fa13f4_1004    conda-forge
zstd                      1.3.3                    vc14_1    conda-forge

Python:

>>> import pandas as pd
>>> df = pd.DataFrame({'ticker':['AAPL','AAPL','AAPL'],'price':[101,102,103]})
>>> df
  ticker  price
0   AAPL    101
1   AAPL    102
2   AAPL    103
>>> df.to_parquet('C:/aapl_test.parquet',engine='pyarrow',compression='snappy',index=False)
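
A quick way to see the physical types that PolyBase's Java reader will encounter is to read the schema back from the written file (output trimmed; price is stored as a 64-bit long):

>>> import pyarrow.parquet as pq
>>> pq.ParquetFile('C:/aapl_test.parquet').schema
ticker: BYTE_ARRAY UTF8
price: INT64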

Azure steps:

  1. Uploaded the Parquet file to Azure Blob
  2. Using Azure Data Warehouse Gen2, size: DW400c
  3. Per the docs and a tutorial, created the DATABASE SCOPED CREDENTIAL (sketched below), EXTERNAL DATA SOURCE, and EXTERNAL FILE FORMAT
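
The credential itself was along these lines (a sketch; the secret is the storage account access key, both values redacted):

-- Requires a database master key (CREATE MASTER KEY) to exist first.
CREATE DATABASE SCOPED CREDENTIAL [AzureQuantBlobStorageCredential]
WITH IDENTITY = '[redacted: storageAccountName]',
     SECRET = '[redacted: storageAccountKey]'
GO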

SQL Code:

CREATE EXTERNAL FILE FORMAT [ParquetFileSnappy] WITH (
    FORMAT_TYPE = PARQUET, 
    DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec'
)
GO

CREATE EXTERNAL DATA SOURCE [AzureBlobStorage] WITH (
    TYPE = HADOOP, 
    LOCATION = N'wasbs://[redacted: containerName]@[redacted: storageAccountName].blob.core.windows.net', 
    CREDENTIAL = [AzureQuantBlobStorageCredential] -- created earlier
)
GO

CREATE EXTERNAL TABLE ext.technicals(
    [ticker] VARCHAR(5) NOT NULL ,
    [close_px] DECIMAL(8,2) NULL
) WITH (
    LOCATION='/aapl_test.parquet', 
    DATA_SOURCE=AzureBlobStorage, 
    FILE_FORMAT=ParquetFileSnappy 
);

CREATE TABLE [dbo].TechnicalFeatures
WITH
( 
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].technicals
OPTION (LABEL = 'CTAS : Load [dbo].[TechnicalFeatures]')
;

And here is the error:

Msg 106000, Level 16, State 1, Line 20
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Long cannot be cast to class parquet.io.api.Binary (java.lang.Long is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')

Edit: Also tried using fastparquet instead of pyarrow, same error.

Upvotes: 1

Views: 6374

Answers (1)

Ron Dunn

Reputation: 3078

I repeated your Python file creation ... you owe me a beer for the pain and suffering inflicted by an Anaconda install ;)

Examining the file with parquet-tools shows the problem: your data values are being written as long integers (101, 102, 103), but you're trying to map them to decimals in your CREATE EXTERNAL TABLE statement.

If you change the DECIMAL(8,2) to BIGINT, then your data will load.
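
That is, the same DDL as above with only the column type changed:

CREATE EXTERNAL TABLE ext.technicals(
    [ticker] VARCHAR(5) NOT NULL,
    [close_px] BIGINT NULL -- matches the INT64 values in the file
) WITH (
    LOCATION='/aapl_test.parquet',
    DATA_SOURCE=AzureBlobStorage,
    FILE_FORMAT=ParquetFileSnappy
);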

Alternatively, write your data values as doubles by adding a decimal point (101.0, 102.0, 103.0); then you can read them by changing DECIMAL(8,2) to DOUBLE PRECISION, or even FLOAT, since they are small, precise numbers in this case.
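
On the Python side, an explicit cast does the same thing without retyping the literals (a sketch):

import pandas as pd

df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'price': [101, 102, 103]})

# Cast to float64 so Arrow/Parquet stores the column as DOUBLE
# instead of INT64, matching DOUBLE PRECISION (or FLOAT) in SQL.
df['price'] = df['price'].astype('float64')
df.to_parquet('C:/aapl_test.parquet', engine='pyarrow', compression='snappy', index=False)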

(just kidding about the beer)

Upvotes: 2
