Kaharon

Reputation: 395

How to write data to Snowflake with Python from Parquet files

I am trying to write data to Snowflake from Parquet files stored locally in the data/ folder. For context, these files come from a Delta Lake table.

# files_list contains this ['part-00000-c8fc3190-8a49-49c5-a000-b3f885e3a053-c000.snappy.parquet', 'part-00001-cb8e2d2a-0449-406c-8d6f-3ec1249c3c36-c000.snappy.parquet']
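# cs is assumed to be the cursor of an open snowflake.connector connection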

createStmt = f"""
CREATE OR REPLACE TABLE {TABLE_NAME} (
    id int,
    firstName string,
    middleName string,
    lastName string,
    gender string,
    birthDate timestamp,
    ssn string,
    salary int
);
"""
cs.execute(createStmt)

createFileFormat = f"""
CREATE OR REPLACE FILE FORMAT sf_delta_parquet_format
  type=PARQUET COMPRESSION=SNAPPY;
"""
cs.execute(createFileFormat)

createStage = f"""
CREATE OR REPLACE STAGE sf_delta_stage
  FILE_FORMAT = sf_delta_parquet_format;
"""
cs.execute(createStage)

for file in files_list:
    uploadStmt = f'put file://{FOLDER_LOCAL}{file} @sf_delta_stage;'
    cs.execute(uploadStmt)

for file in files_list:
    copyStmt = f"""COPY INTO {SCHEMA_NAME}.{TABLE_NAME} FROM 
    (
        SELECT
            $1:id:VARIANT,
            $1:firstName:VARIANT,
            $1:middleName:VARIANT,
            $1:lastName:VARIANT,
            $1:gender:VARIANT,
            $1:birthDate:VARIANT,
            $1:ssn:VARIANT,
            $1:salary:VARIANT
        FROM @sf_delta_stage/{file}
    )
        FORCE = TRUE
    """
    cs.execute(copyStmt)

When I check my Snowflake table, it has the correct number of rows, but all of the data is NULL.

Do you know what could have happened?

Upvotes: 0

Views: 1350

Answers (1)

Powers

Reputation: 19308

It's easiest to use the Delta Lake readers to read in the underlying Parquet files. Here's how to do this with Spark:

df = spark.read.format("delta").load("path/to/data")
df.write.format(snowflake_source_name)...
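
For example, here is a minimal sketch of the write step with the Snowflake Spark connector; the connection options and the target table name are placeholders you would fill in for your own account:

# Minimal sketch (assumptions: an existing SparkSession `spark`, the Snowflake
# Spark connector on the classpath, and placeholder connection options)
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

df = spark.read.format("delta").load("path/to/data")

(df.write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**sf_options)
    .option("dbtable", "MY_TABLE")  # placeholder target table
    .mode("overwrite")
    .save())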

You can also use delta-rs to load your Delta Lake table into a pandas DataFrame and write it to Snowflake with pure Python. Alternatively, you can use delta-rs simply to get the list of Parquet files in the latest table version and load them one by one.
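
A minimal pure-Python sketch, assuming the deltalake (delta-rs) and snowflake-connector-python packages and a target table that already exists; the connection parameters and table name are placeholders:

from deltalake import DeltaTable

import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Read only the latest version of the Delta table (tombstoned files are skipped)
dt = DeltaTable("data/")
pdf = dt.to_pandas()

# Placeholder connection parameters
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
)

# write_pandas stages the DataFrame and runs COPY INTO under the hood
write_pandas(conn, pdf, "MY_TABLE")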

Your current code may be a bit wrong because it's probably picking up Parquet files that have been tombstoned by Delta Lake (i.e. Parquet files that have been marked for deletion, but haven't been physically removed from disk yet).
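
If you want to keep your PUT/COPY INTO approach, a small sketch of using delta-rs to list only the Parquet files referenced by the current table version (instead of everything sitting in data/) could look like this:

from deltalake import DeltaTable

dt = DeltaTable("data/")

# Only the files that belong to the latest Delta version; tombstoned
# Parquet files still present on disk are not returned
files_list = dt.files()

# files_list can then drive the existing PUT/COPY INTO loop
for file in files_list:
    print(file)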

Upvotes: 1
