Jeet Patel

Reputation: 1241

to_sql() method of pandas sends primary key column as NULL even if the column is not present in dataframe

I want to insert a data frame into a Snowflake database table. The table has columns such as id, which is the primary key, and event_id, which is a nullable integer field.

I have created a declarative_base() class using SQLAlchemy as shown below -

class AccountUsageLoginHistory(Base):

    __tablename__ = constants.TABLE_ACCOUNT_USAGE_LOGIN_HISTORY
    __table_args__ = {
        'extend_existing':True,
        'schema' : os.environ.get('SCHEMA_NAME_AUDITS')
    }

    id = Column(Integer, Sequence('id_account_usage_login_history'), primary_key=True)
    event_id = Column(Integer, nullable=True)

The class stated above creates a table in the Snowflake database.

I have a data frame that has just one column event_id.

When I try to insert the data using pandas' to_sql() method, Snowflake returns the error shown below -

snowflake.connector.errors.ProgrammingError: 100072 (22000): 01991f2c-0be5-c903-0000-d5e5000c6cee: NULL result in a non-nullable column

This error is generated by Snowflake because to_sql() is appending an id column with the value set to NULL for each row.

dataframe.to_sql(table_name, self.engine, index=False, method=pd_writer, if_exists="append")

Consider this as case 1 -

I tried to run an insert query directly to snowflake -

insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (ID, EVENT_ID) values(NULL, 33)

The query above returned me the same error -

NULL result in a non-nullable column

The query stated above is probably the kind of query the to_sql() method generates.

Consider this as case 2 -

I also tried to insert a row by executing the query stated below -

insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (EVENT_ID) values(33)

Now, this particular query executes successfully, inserting the data into the table and auto-generating a value for the id column.

How can I make pandas' to_sql() method behave like case 2?

Upvotes: 0

Views: 3191

Answers (1)

olv1do

Reputation: 96

Please note that pandas.DataFrame.to_sql() defaults to index=True, which means it adds an extra column (df.index) when inserting the data.
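As a quick illustration of that default (using an in-memory SQLite table here rather than Snowflake), to_sql() writes the DataFrame index as its own column:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"event_id": [33, 34]})
conn = sqlite3.connect(":memory:")

# With the default index=True, the unnamed DataFrame index is written
# as an extra column called "index".
df.to_sql("demo", conn, index=True)

cols = [row[1] for row in conn.execute("PRAGMA table_info(demo)")]
print(cols)  # ['index', 'event_id']
```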

Some databases, like PostgreSQL, have a serial data type that fills a column with sequential, incrementing numbers.

Snowflake doesn't have that data type, but there are other ways to handle it:

First Option: You can use a CREATE SEQUENCE statement to create a sequence directly in the db - here is the official documentation on this topic. The downside of this approach is that you need to convert your DataFrame into a proper SQL statement:

db preparation part:

CREATE OR REPLACE SEQUENCE schema.my_sequence START = 1 INCREMENT = 1;
CREATE OR REPLACE TABLE schema.my_table (i bigint, b text);

You would need to convert the DataFrame into Snowflake's INSERT statement and use schema.my_sequence.nextval to get the next ID value:

INSERT INTO schema.my_table VALUES
(schema.my_sequence.nextval, 'string_1'),
(schema.my_sequence.nextval, 'string_2');

The result will be:

i b
1 string_1
2 string_2
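One way to do that conversion is sketched below; the table and sequence names follow the example above, and real code should escape or parameterize the values rather than interpolate them:

```python
import pandas as pd

df = pd.DataFrame({"b": ["string_1", "string_2"]})

# Render each DataFrame row as a "(nextval, 'value')" tuple so the
# sequence supplies the ID instead of the DataFrame.
rows = ",\n".join(f"(schema.my_sequence.nextval, '{b}')" for b in df["b"])
insert_sql = f"INSERT INTO schema.my_table VALUES\n{rows};"
print(insert_sql)
```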

Please note that there are some limitations to this approach: you need to ensure that each insert succeeds, because calling schema.my_sequence.nextval without inserting the value leaves gaps in the numbering. To avoid this, you can have a separate script that checks whether the insert was successful and, if not, recreates the sequence by calling:

CREATE OR REPLACE SEQUENCE schema.my_sequence start = (SELECT max(i) FROM schema.my_table) increment = 1;

Alternative Option: You can create an extra function that runs SQL to get the last i you inserted previously:

SELECT max(i) AS max_i FROM schema.my_table;

and then update the index in your DataFrame before running to_sql():

df.index = range(max_i+1, len(df)+max_i+1)

This ensures that your DataFrame index continues i from your table. Once that is done you can use:

df.to_sql(index_label='i', name='my_table', con=connection_object)

It will use your index as one of the columns you insert, allowing you to maintain a unique index in the table.
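Putting the alternative option together as a sketch (continue_index is a hypothetical helper; max_i and connection_object come from the steps above):

```python
import pandas as pd

def continue_index(df, max_i):
    # Start the DataFrame index right after the largest i already in the table
    out = df.copy()
    out.index = range(max_i + 1, len(df) + max_i + 1)
    return out

df = pd.DataFrame({"b": ["string_3", "string_4"]})
shifted = continue_index(df, max_i=2)  # e.g. SELECT max(i) returned 2
print(list(shifted.index))  # [3, 4]

# Then insert, writing the index as column i:
# shifted.to_sql(index_label='i', name='my_table', con=connection_object)
```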

Upvotes: 2
