Reputation: 1241
I want to insert a data frame into a Snowflake database table. The table has columns like id, which is the primary key, and event_id, which is a nullable integer field.
I have created a declarative_base() class using SQLAlchemy as shown below -
class AccountUsageLoginHistory(Base):
    __tablename__ = constants.TABLE_ACCOUNT_USAGE_LOGIN_HISTORY
    __table_args__ = {
        'extend_existing': True,
        'schema': os.environ.get('SCHEMA_NAME_AUDITS')
    }

    id = Column(Integer, Sequence('id_account_usage_login_history'), primary_key=True)
    event_id = Column(Integer, nullable=True)
The class stated above creates a table in the Snowflake database.
I have a data frame that has just one column, event_id.
When I try to insert the data using the pandas to_sql() method, Snowflake returns the error shown below -
snowflake.connector.errors.ProgrammingError: 100072 (22000): 01991f2c-0be5-c903-0000-d5e5000c6cee: NULL result in a non-nullable column
This error is generated by Snowflake because to_sql() is appending an id column whose value is NULL for every row.
dataframe.to_sql(table_name, self.engine, index=False, method=pd_writer, if_exists="append")
Consider this as case 1 -
I tried to run an insert query directly against Snowflake -
insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (ID, EVENT_ID) values(NULL, 33)
The query above returned the same error -
NULL result in a non-nullable column
This is probably how the to_sql() method constructs its insert.
Consider this as case 2 -
I also tried to insert a row by executing the query stated below -
insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (EVENT_ID) values(33)
Now, this particular query executed successfully, inserting the data into the table and auto-generating a value for the column id.
How can I make the to_sql() method of pandas behave like case 2?
Upvotes: 0
Views: 3191
Reputation: 96
Please note that pandas.DataFrame.to_sql() has the parameter index=True by default, which means that it will add an extra column (df.index) when inserting the data.
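As a quick illustration of that default (a minimal sketch using an in-memory SQLite database rather than Snowflake; the table names are made up for the demo):

```python
import sqlite3

import pandas as pd

# Hypothetical one-column frame, mirroring the event_id data in the question
df = pd.DataFrame({"event_id": [33, 34]})

con = sqlite3.connect(":memory:")

# Default index=True: pandas writes df.index as an extra column named "index"
df.to_sql("t_default", con)
cols_default = [row[1] for row in con.execute("PRAGMA table_info(t_default)")]

# index=False: only the DataFrame's own columns are written
df.to_sql("t_noindex", con, index=False)
cols_noindex = [row[1] for row in con.execute("PRAGMA table_info(t_noindex)")]

print(cols_default)   # ['index', 'event_id']
print(cols_noindex)   # ['event_id']
```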
Some databases, like PostgreSQL, have a serial data type which sequentially fills the column with incremental numbers. Snowflake doesn't have that concept, but there are other ways to handle it:
First Option:
You can use a CREATE SEQUENCE statement and create a sequence directly in the DB - here is the official documentation on this topic. The downside of this approach is that you would need to convert your DataFrame into a proper SQL statement:
db preparation part:
CREATE OR REPLACE SEQUENCE schema.my_sequence START = 1 INCREMENT = 1;
CREATE OR REPLACE TABLE schema.my_table (i bigint, b text);
You would need to convert the DataFrame into Snowflake's INSERT statement and use schema.my_sequence.nextval to get the next ID value:
INSERT INTO schema.my_table VALUES
(schema.my_sequence.nextval, 'string_1'),
(schema.my_sequence.nextval, 'string_2');
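The conversion itself can be scripted. A minimal sketch (df_to_sequence_insert is a hypothetical helper; the naive quoting assumes the b values contain no single quotes, so use proper parameter binding in real code):

```python
import pandas as pd

def df_to_sequence_insert(df: pd.DataFrame, table: str, seq: str) -> str:
    # Build one multi-row INSERT; the sequence generates the id for each row.
    rows = ",\n".join(f"({seq}.nextval, '{b}')" for b in df["b"])
    return f"INSERT INTO {table} VALUES\n{rows};"

df = pd.DataFrame({"b": ["string_1", "string_2"]})
sql = df_to_sequence_insert(df, "schema.my_table", "schema.my_sequence")
print(sql)
```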
The result will be:
i  b
1  string_1
2  string_2
Please note that there are some limitations to this approach: you need to ensure that each insert statement succeeds, because calling schema.my_sequence.nextval without actually inserting the value will leave gaps in the numbering.
To avoid that, you can have a separate script that checks whether the current insert was successful and, if not, recreates the sequence by calling:
CREATE OR REPLACE SEQUENCE schema.my_sequence START = (SELECT max(i) FROM schema.my_table) INCREMENT = 1;
Alternative Option: You would need to create an extra function that runs SQL to get the last i you inserted previously:
SELECT max(i) AS max_i FROM schema.my_table;
and then update the index in your DataFrame before running to_sql():
df.index = range(max_i + 1, len(df) + max_i + 1)
This ensures that your DataFrame index continues the i values in your table. Once that is done, you can use
df.to_sql(index_label='i', name='my_table', con=connection_object)
It will use your index as one of the columns you insert, allowing you to maintain a unique index in the table.
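Putting the alternative option together (sketched against an in-memory SQLite database instead of Snowflake; table and column names follow the example above):

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (i INTEGER, b TEXT)")
con.execute("INSERT INTO my_table VALUES (1, 'string_1'), (2, 'string_2')")

# Step 1: get the last i inserted previously
max_i = con.execute("SELECT max(i) AS max_i FROM my_table").fetchone()[0]

# Step 2: shift the DataFrame index so it continues after max_i
df = pd.DataFrame({"b": ["string_3", "string_4"]})
df.index = range(max_i + 1, len(df) + max_i + 1)

# Step 3: append, writing the index as column i alongside the data
df.to_sql(index_label="i", name="my_table", con=con, if_exists="append")

rows = con.execute("SELECT i, b FROM my_table ORDER BY i").fetchall()
print(rows)
# [(1, 'string_1'), (2, 'string_2'), (3, 'string_3'), (4, 'string_4')]
```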
Upvotes: 2