Reputation: 2513
I have an Azure Data Factory pipeline that reads a CSV file via an HTTP connection and stores the data in Azure Data Lake Storage Gen2. The file encoding is UTF-8. It seems the file somehow gets corrupted because of the polygon definitions.
The file content is as follows:
Shape123|"MULTIPOLYGON (((496000 6908000, 495000 6908000, 495000 6909000, 496000 6909000, 496000 6908000)))"|"Red"|"Long"|"208336"|"5"|"-1"
Problem 1: PolyBase complains about the encoding and cannot read the file.
Problem 2: A Databricks dataframe cannot handle this either; it cuts the row and reads only "Shape123|"MULTIPOLYGON (((496000 6908000,"
Quick workaround: open the CSV file with Notepad++ and re-save it with UTF-8 encoding. After that, PolyBase is able to handle it.
Question: Is there an automatic way to fix the CSV file? And how can I make the dataframe handle the entire row if the CSV file cannot be fixed?
Upvotes: 1
Views: 65
Reputation: 14389
PolyBase can cope perfectly well with UTF-8 files and various delimiters. Did you create an external file format with a pipe delimiter and double quote as the string delimiter, something like this?
CREATE EXTERNAL FILE FORMAT ff_pipeFileFormatSHAPE
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '"',
        ENCODING = 'UTF8'
    )
);
GO
CREATE EXTERNAL TABLE shape_data (
    col1 VARCHAR(20),
    col2 VARCHAR(8000),
    col3 VARCHAR(20),
    col4 VARCHAR(20),
    col5 VARCHAR(20),
    col6 VARCHAR(20),
    col7 VARCHAR(20)
)
WITH (
    LOCATION = 'yourPath/shape/shape working.txt',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = ff_pipeFileFormatSHAPE,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
My results:
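For the Databricks side of the question, the row is most likely being cut because the dataframe reader is left at its default comma delimiter and quote handling, so it splits on the commas inside the MULTIPOLYGON text. A minimal PySpark sketch (the storage path is a placeholder, adjust it to your account and container) that reads the pipe-delimited, double-quoted file so the geometry stays in a single column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the pipe-delimited file; quote/escape keep the quoted
# MULTIPOLYGON text (which contains commas) in one column.
df = (
    spark.read
    .option("sep", "|")            # pipe field terminator, matching the file
    .option("quote", '"')          # double quote as the string delimiter
    .option("escape", '"')         # doubled quotes inside a field stay literal
    .option("encoding", "UTF-8")   # read the file as UTF-8
    .option("header", "false")
    .csv("abfss://yourContainer@yourAccount.dfs.core.windows.net/shape/shape working.txt")  # placeholder path
)

df.show(truncate=False)

With sep and quote set explicitly, each line parses into seven columns instead of being truncated at the first embedded comma.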
Upvotes: 2