CodingInCircles

Reputation: 2807

Snowflake pandas pd_writer writes out tables with NULLs

I have a Pandas dataframe that I'm writing out to Snowflake using a SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of a Snowflake limit. This is also fine for smaller dataframes. However, some dataframes are 500k+ rows, and at 15k records per chunk, it takes forever to finish writing to Snowflake.
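
For reference, the slow path looks roughly like this (a sketch; the connection values are the same environment-variable placeholders as in the code further down, and the table name is made up):

import os
import pandas as pd
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

# Sketch of the current, slow approach: plain to_sql with chunksize.
engine = create_engine(
    URL(
        account=os.environ.get("DB_ACCOUNT"),
        user=os.environ.get("DB_USERNAME"),
        password=os.environ.get("DB_PASSWORD"),
        database="dummy_db",
        schema="dummy_schema",
        warehouse=os.environ.get("DB_WAREHOUSE"),
        role=os.environ.get("DB_ROLE"),
    )
)

df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

# chunksize is required because of the Snowflake limit mentioned above;
# at 15k records per chunk, a 500k+ row dataframe takes a very long time.
df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=15000)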

I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster, and it creates a table with all the right columns and the right row count, but every value in every row is NULL.

I thought it was a NaN to NULL issue and tried everything possible to replace the NaNs with None, and while it does the replacement within the dataframe, by the time it gets to the table, everything becomes NULL.
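
For reference, the kind of NaN-to-None replacement I mean is sketched below; it visibly changes the dataframe in memory, but the written table still comes out all NULL:

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Mark", None], "balance": [10, np.nan]})

# Replace NaN/NaT with None inside the dataframe...
df = df.where(pd.notnull(df), None)
print(df)

# ...the dataframe now holds None, yet the table written via pd_writer
# still ends up entirely NULL.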

How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?

EDIT: Following Chris' answer, I decided to try with the official example. Here's my code and the result set:

import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()

    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )

    except ConnectionError:
        print("Unable to connect to database!")

    finally:
        connection.close()
        engine.dispose()

    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

print(df.head())

create_table(df, "dummy_demo_table")

The code runs with no hitches, but when I look at the table that gets created, it's all NULLs. Again.

This is what dummy_demo_table shows me (screenshot omitted; every cell is NULL).

Upvotes: 9

Views: 8397

Answers (2)

Mahen Ramdhaney

Reputation: 21

I have had this exact same issue; don't despair, there is a solution in sight. When you create a table in Snowflake from the Snowflake worksheet or Snowflake environment, it names the object and all columns and constraints in uppercase. However, when you create the table from Python using the dataframe, the object gets created in the exact case that you specified in your dataframe. In your case that is columns=['name', 'balance']. So when the insert happens, it looks for all-uppercase column names in Snowflake, cannot find them, and does the insert anyway but sets your two columns to NULL, since the columns are created as nullable.

The best way to get past this issue is to create your columns in uppercase in the dataframe: columns=['NAME', 'BALANCE'].
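
For example, building the question's toy dataframe with uppercase column names (a sketch):

import pandas as pd

# Uppercase column names match the identifiers Snowflake resolves during the insert.
df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["NAME", "BALANCE"])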

I do think this is something that Snowflake should address and fix, as it is not expected behavior.

Even if you tried to do a select against your table that has the NULLs, you would get an error, e.g.: select name, balance from dummy_demo_table

You would probably get an error like the following: SQL compilation error: error line 1 at position 7 invalid identifier 'name'

BUT the following will work: SELECT * from dummy_demo_table

Upvotes: 2

CodingInCircles

Reputation: 2807

Turns out, the documentation (arguably, Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is a single character in the column name to be upper case and it works perfectly.

My workaround is to simply do: df.columns = map(str.upper, df.columns) before invoking to_sql.
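
Applied to the toy example from my edit above, the whole thing looks roughly like this (a sketch; same placeholder connection details as before):

import os
import pandas as pd
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
from snowflake.connector.pandas_tools import pd_writer

engine = create_engine(
    URL(
        account=os.environ.get("DB_ACCOUNT"),
        user=os.environ.get("DB_USERNAME"),
        password=os.environ.get("DB_PASSWORD"),
        database="dummy_db",
        schema="dummy_schema",
        warehouse=os.environ.get("DB_WAREHOUSE"),
        role=os.environ.get("DB_ROLE"),
    )
)

df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

# Uppercase the column names so they match Snowflake's unquoted identifiers.
df.columns = map(str.upper, df.columns)

with engine.connect() as connection:
    df.to_sql(
        "dummy_demo_table", connection, if_exists="append", index=False, method=pd_writer
    )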

Upvotes: 21
