Reputation: 894
Loading Azure SQL Data Warehouse via Polybase, I'm reading Parquet files that are on Azure Blob Storage.
First I create an external table in SQL pointing to the Parquet file, then load it with CTAS. No matter what numeric data type I use in SQL, I get the type casting error below. I've tried DECIMAL, NUMERIC, and FLOAT; loading into VARCHAR works fine, though.
I suspect it has something to do with how the Parquet file was created: it's written from a Python pandas DataFrame using df.to_parquet with the pyarrow engine. Digging into the source code and experimenting, I see that the data type of the data, while it is in Arrow (the step before Parquet), is double. Maybe that's why?
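For what it's worth, here is a minimal sketch of how to inspect the Arrow schema (pa.Table.from_pandas is the same conversion the pyarrow engine runs before writing):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'price': [101, 102, 103]})

# Same pandas-to-Arrow conversion that to_parquet performs internally
table = pa.Table.from_pandas(df, preserve_index=False)
print(table.schema)  # with whole-number values like these, price shows as int64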
Also, I tried both Gzip and Snappy as the compression type, both when creating the file and when creating the SQL external file format; no dice.
Going crazy from this. Any ideas?
conda create -n testenv python=3.6
conda install -n testenv -c conda-forge pyarrow
conda list -n testenv
# Name Version Build Channel
arrow-cpp 0.13.0 py36hee3af98_1 conda-forge
boost-cpp 1.68.0 h6a4c333_1000 conda-forge
brotli 1.0.7 he025d50_1000 conda-forge
ca-certificates 2019.3.9 hecc5488_0 conda-forge
certifi 2019.3.9 py36_0 conda-forge
gflags 2.2.2 he025d50_1001 conda-forge
glog 0.3.5 h6538335_1
intel-openmp 2019.3 203
libblas 3.8.0 5_mkl conda-forge
libcblas 3.8.0 5_mkl conda-forge
liblapack 3.8.0 5_mkl conda-forge
libprotobuf 3.7.1 h1a1b453_0 conda-forge
lz4-c 1.8.1.2 h2fa13f4_0
mkl 2019.3 203
numpy 1.16.2 py36h8078771_1 conda-forge
openssl 1.1.1b hfa6e2cd_2 conda-forge
pandas 0.24.2 py36h6538335_0 conda-forge
parquet-cpp 1.5.1 2 conda-forge
pip 19.0.3 py36_0
pyarrow 0.13.0 py36h8c67754_0 conda-forge
python 3.6.8 h9f7ef89_7
python-dateutil 2.8.0 py_0 conda-forge
pytz 2019.1 py_0 conda-forge
re2 2019.04.01 vc14h6538335_0 [vc14] conda-forge
setuptools 41.0.0 py36_0
six 1.12.0 py36_1000 conda-forge
snappy 1.1.7 h6538335_1002 conda-forge
sqlite 3.27.2 he774522_0
thrift-cpp 0.12.0 h59828bf_1002 conda-forge
vc 14.1 h0510ff6_4
vs2015_runtime 14.15.26706 h3a45250_0
wheel 0.33.1 py36_0
wincertstore 0.2 py36h7fe50ca_0
zlib 1.2.11 h2fa13f4_1004 conda-forge
zstd 1.3.3 vc14_1 conda-forge
>>> import pandas as pd
>>> df = pd.DataFrame({'ticker':['AAPL','AAPL','AAPL'],'price':[101,102,103]})
>>> df
ticker price
0 AAPL 101
1 AAPL 102
2 AAPL 103
>>> df.to_parquet('C:/aapl_test.parquet',engine='pyarrow',compression='snappy',index=False)
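The physical types actually written to the file can be checked like this (a minimal sketch using pyarrow.parquet):

import pyarrow.parquet as pq

# Inspect the Parquet schema of the file that was just written
pf = pq.ParquetFile('C:/aapl_test.parquet')
print(pf.schema)  # for this frame: ticker as BYTE_ARRAY (UTF8), price as INT64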
CREATE EXTERNAL FILE FORMAT [ParquetFileSnappy] WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec'
)
GO
CREATE EXTERNAL DATA SOURCE [AzureBlobStorage] WITH (
TYPE = HADOOP,
LOCATION = N'wasbs://[redacted: containerName]@[redacted: storageAccountName].blob.core.windows.net',
CREDENTIAL = [AzureQuantBlobStorageCredential] -- created earlier
)
GO
CREATE EXTERNAL TABLE ext.technicals(
[ticker] VARCHAR(5) NOT NULL ,
[close_px] DECIMAL(8,2) NULL
) WITH (
LOCATION='/aapl_test.parquet',
DATA_SOURCE=AzureBlobStorage,
FILE_FORMAT=ParquetFileSnappy
);
CREATE TABLE [dbo].TechnicalFeatures
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].technicals
OPTION (LABEL = 'CTAS : Load [dbo].[TechnicalFeatures]')
;
Msg 106000, Level 16, State 1, Line 20
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Long cannot be cast to class parquet.io.api.Binary (java.lang.Long is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')
Edit: I also tried using fastparquet instead of pyarrow; same error.
Upvotes: 1
Views: 6374
Reputation: 3078
I repeated your Python file creation ... you owe me a beer for the pain and suffering inflicted by an Anaconda install ;)
Examining the file with parquet-tools shows the problem: your data values are being written as long integers (101, 102, 103), but you're trying to map them as decimals in your CREATE EXTERNAL TABLE statement.
If you change the DECIMAL(8,2) to BIGINT, then your data will load.
Alternatively, write your data values as doubles by adding a decimal point (101.0, 102.0, 103.0); then you can read them by changing DECIMAL(8,2) to DOUBLE PRECISION, or even FLOAT, as they are small, precise numbers in this case.
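For that second option, a minimal sketch of the write side (same toy frame, with the column cast to float so pyarrow emits DOUBLE instead of INT64):

import pandas as pd

# Decimal points force a float64 dtype, which pyarrow writes as Parquet DOUBLE
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'],
                   'price': [101.0, 102.0, 103.0]})
df.to_parquet('C:/aapl_test.parquet', engine='pyarrow',
              compression='snappy', index=False)

The external table column then needs to be FLOAT or DOUBLE PRECISION rather than DECIMAL(8,2).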
(just kidding about the beer)
Upvotes: 2