Kenny_I

Reputation: 2513

Databricks and Polybase cannot parse CSV containing polygons

I have Azure Data Factory, which reads a CSV via an HTTP connection and stores the data to Azure Storage Gen2. The file encoding is UTF-8. It seems the file gets corrupted somehow because of the polygon definitions.

The file content is the following:

Shape123|"MULTIPOLYGON (((496000 6908000, 495000 6908000, 495000 6909000, 496000 6909000, 496000 6908000)))"|"Red"|"Long"|"208336"|"5"|"-1"

Problem 1: Polybase complains about the encoding and cannot read the file.

Problem 2: A Databricks dataframe cannot handle this either; it cuts the row and reads only "Shape123|"MULTIPOLYGON (((496000 6908000,"
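
If the dataframe is read with Spark's default comma separator, the row would break at exactly this point: the first comma inside the polygon. A minimal PySpark sketch (the path is a placeholder, not from the question) that declares the pipe separator and quote character explicitly, which should keep the MULTIPOLYGON field intact:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("sep", "|")           # fields are pipe-delimited, not comma-delimited
    .option("quote", '"')         # double quote wraps the polygon field
    .option("encoding", "UTF-8")  # be explicit about the file encoding
    .csv("abfss://container@account.dfs.core.windows.net/shape/shape.csv")
)
df.show(truncate=False)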

Quick solution: Open the CSV file with Notepad++ and re-save it with UTF-8 encoding. Then Polybase is able to handle it.
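
An automated equivalent of that manual re-save, as a minimal Python sketch (file names are placeholders, and the source encoding is an assumption): decode the raw bytes with the encoding ADF actually produced and rewrite the file as plain UTF-8.

with open("shape_raw.csv", "rb") as f:
    raw = f.read()

# "utf-8-sig" strips a UTF-8 BOM if present; swap in "utf-16" if the
# source file turns out to be UTF-16.
text = raw.decode("utf-8-sig")

with open("shape_utf8.csv", "w", encoding="utf-8", newline="") as f:
    f.write(text)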

Question: What is an automated way to fix the CSV file? And how can I make the dataframe handle the entire row if the CSV file cannot be fixed?

Upvotes: 1

Views: 65

Answers (1)

wBob

Reputation: 14389

Polybase can cope perfectly well with UTF-8 files and various delimiters. Did you create an external file format with a pipe delimiter and double quote as the string delimiter, something like this?

CREATE EXTERNAL FILE FORMAT ff_pipeFileFormatSHAPE
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',   -- fields are pipe-delimited
        STRING_DELIMITER = '"',   -- double quote wraps the polygon field
        ENCODING = 'UTF8' 
        )
);
GO

CREATE EXTERNAL TABLE shape_data (
    col1                VARCHAR(20),
    col2                VARCHAR(8000),  -- wide enough for the MULTIPOLYGON text
    col3                VARCHAR(20),
    col4                VARCHAR(20),
    col5                VARCHAR(20),
    col6                VARCHAR(20),
    col7                VARCHAR(20)
)
WITH (
    LOCATION = 'yourPath/shape/shape working.txt',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = ff_pipeFileFormatSHAPE,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0   -- fail the query on any rejected row
);

My results: [screenshot]

Upvotes: 2
