Catarina Ribeiro
Catarina Ribeiro

Reputation: 646

How to delete and insert rows based on a column with SQLAlchemy in Python?

My code reads some excel files and appends them to a list, also adding a column to extract the file name. Then it concatenates everything and sends it to a table.


all_df_list = []

for file in files_list:  
     #reads and appends excel files
     frame = pd.read_excel(pd.read_excel(file, header=0, engine='openpyxl'))
     frame['filename'] = os.path.basename(file)
     all_df_list.append(frame)

xls=pd.concat(all_df_list)
xls.to_sql(table, con=engine, if_exists='append', index=False, chunksize=10000)

This code works just fine and the output is something like this:

Column A Column B filname
First row file 01.xlsx
Second row file 02.xlsx

What I need now is to alter my code to check for that filename in the column ( (os.path.basename(file) ) and either overwrite for only those rows or delete rows where filename = (os.path.basename(file) and then execute the code above. Like, if that filename exists, delete the corresponding rows and execute that code above, something like that.

Any ideas how I can do that? I'm using create_engine method from sqlalchemy to access the database

Upvotes: 2

Views: 806

Answers (2)

newUser
newUser

Reputation: 102

For what I understand, you want to delete the rows based on the filename before inserting the rows once again. Maybe this will help you

engine.execute("DELETE FROM %s WHERE filename = %s", (table, file))

if file is the full path , you should useos.path.basename(file)instead

Upvotes: 1

constantstranger
constantstranger

Reputation: 9379

UPDATED ANSWER:

Assuming you want to minimize the work done on the database side and do as much as possible using dataframes, this will achieve the same result as my original answer (further down).

Initial state: For testing purposes, initialize the main database table, SomeTable, to simulate a situation in which it has been updated previously:

xls=pd.concat([
    pd.DataFrame({'Column A':['First'], 'Column B':['row'], 'filename':['file 01.xlsx']}),
    pd.DataFrame({'Column A':['Second'], 'Column B':['row'], 'filename':['file 02.xlsx']}),
    pd.DataFrame({'Column A':['Third'], 'Column B':['row'], 'filename':['file 03.xlsx']})
])
xls.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)

Contents of SomeTable:

  Column A Column B      filename
0    First      row  file 01.xlsx
1   Second      row  file 02.xlsx
2    Third      row  file 03.xlsx

Source of updates: Specify the files to use to update SomeTable in the database and load these into dfBasenames:

files_list = ['file 01.xlsx', 'file 04.xlsx']

import os

# accumulate new file names in files_list into "new_file"
basenames = set()
for file in files_list:
    basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})

Contents of dfBasenames:

       filename
0  file 04.xlsx
1  file 01.xlsx

Delete stale rows: Delete rows in df with filename column value found in dfBasenames:

df = df.join(dfBasenames.assign(is_stale=True).set_index('filename'), on='filename')
df = df[df['is_stale'].isna()].drop(columns=['is_stale'])

Contents of df:

  Column A Column B      filename
1   Second      row  file 02.xlsx
2    Third      row  file 03.xlsx

Note that the original row with filename == "file 01.xlsx" has been deleted.

Update df from latest files: Aggregate rows found in Excel files in files_list and concat them together with non-stale rows in df:

all_df_list = []
for file in files_list:  
    #reads and appends excel files
    frame = pd.read_excel(file)
    frame['filename'] = os.path.basename(file)
    print(f'\nfile {file}:\n{frame}')
    all_df_list.append(frame)
df = pd.concat([df] + all_df_list)

Contents of df:

  Column A Column B      filename
1   Second      row  file 02.xlsx
2    Third      row  file 03.xlsx
0   1.a.01   1.b.01  file 01.xlsx
1   1.a.02   1.b.02  file 01.xlsx
0   4.a.01   4.b.01  file 04.xlsx
1   4.a.02   4.b.02  file 04.xlsx

Commit changes to database: Overwrite SomeTable in database using df:

df.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)

Contents of SomeTable:

  Column A Column B      filename
0   Second      row  file 02.xlsx
1    Third      row  file 03.xlsx
2   1.a.01   1.b.01  file 01.xlsx
3   1.a.02   1.b.02  file 01.xlsx
4   4.a.01   4.b.01  file 04.xlsx
5   4.a.02   4.b.02  file 04.xlsx

Full test code:

from sqlalchemy import create_engine
import pandas as pd
connectable = 'sqlite:///foo.db'
engine = create_engine(connectable)

# For testing purposes only, initialize SomeTable so it's not empty
xls=pd.concat([
    pd.DataFrame({'Column A':['First'], 'Column B':['row'], 'filename':['file 01.xlsx']}),
    pd.DataFrame({'Column A':['Second'], 'Column B':['row'], 'filename':['file 02.xlsx']}),
    pd.DataFrame({'Column A':['Third'], 'Column B':['row'], 'filename':['file 03.xlsx']})
])
xls.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)

# read and print SomeTable for verification of initial state
df = pd.read_sql_table("SomeTable", connectable) 
print('======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):')
print(df)

# specify files to use to update SomeTable in the database
files_list = ['file 01.xlsx', 'file 04.xlsx']

import os

# accumulate new file names in files_list into "new_file"
basenames = set()
for file in files_list:
    basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})
print('======== VERIFY FILENAMES TO UPDATE FROM:')
print(dfBasenames)

# delete rows in SomeTable copy with filename column found in dfBasenames
df = df.join(dfBasenames.assign(is_stale=True).set_index('filename'), on='filename')
df = df[df['is_stale'].isna()].drop(columns=['is_stale'])
print('======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES:')
print(df)

# aggregate rows found in Excel files in dfBasenames into all_df_list and concat to remaining non-stale rows of SomeTable copy
all_df_list = []
for file in files_list:  
    #reads and appends excel files
    frame = pd.read_excel(file)
    frame['filename'] = os.path.basename(file)
    print(f'\nfile {file}:\n{frame}')
    all_df_list.append(frame)
df = pd.concat([df] + all_df_list)
print('======== VERIFY UPDATED DF READY TO COMMIT TO DB:')
print(df)

# overwrite SomeTable in database
df.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)

# read and print table for verification of correct result
df = pd.read_sql_table("SomeTable", connectable) 
print('======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):')
print(df)

Test output:

======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):
  Column A Column B      filename
0    First      row  file 01.xlsx
1   Second      row  file 02.xlsx
2    Third      row  file 03.xlsx
======== VERIFY FILENAMES TO UPDATE FROM:
       filename
0  file 04.xlsx
1  file 01.xlsx
======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES:
  Column A Column B      filename
1   Second      row  file 02.xlsx
2    Third      row  file 03.xlsx

file file 01.xlsx:
  Column A Column B      filename
0   1.a.01   1.b.01  file 01.xlsx
1   1.a.02   1.b.02  file 01.xlsx

file file 04.xlsx:
  Column A Column B      filename
0   4.a.01   4.b.01  file 04.xlsx
1   4.a.02   4.b.02  file 04.xlsx
======== VERIFY UPDATED DF READY TO COMMIT TO DB:
  Column A Column B      filename
1   Second      row  file 02.xlsx
2    Third      row  file 03.xlsx
0   1.a.01   1.b.01  file 01.xlsx
1   1.a.02   1.b.02  file 01.xlsx
0   4.a.01   4.b.01  file 04.xlsx
1   4.a.02   4.b.02  file 04.xlsx
======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):
  Column A Column B      filename
0   Second      row  file 02.xlsx
1    Third      row  file 03.xlsx
2   1.a.01   1.b.01  file 01.xlsx
3   1.a.02   1.b.02  file 01.xlsx
4   4.a.01   4.b.01  file 04.xlsx
5   4.a.02   4.b.02  file 04.xlsx


ORIGINAL ANSWER:

Here is a way to do what you've asked.

Initial state: (Same as in UPDATED ANSWER above.)

Source of updates: Specify the files to use to update SomeTable in the database and load these into a temporary database table new_file:

files_list = ['file 01.xlsx', 'file 04.xlsx']
basenames = set()
for file in files_list:
    basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})
dfBasenames.to_sql("new_file", con=engine, if_exists='replace', index=False, chunksize=10000)

Contents of new_file:

       filename
0  file 01.xlsx
1  file 04.xlsx

Delete stale rows: Delete rows in SomeTable with filename column value found in new_file table:

with engine.connect() as connection:
    result = connection.execute('delete from SomeTable where exists (select 1 from new_file where new_file.filename = SomeTable.filename)')

Contents of SomeTable:

  Column A Column B      filename
0   Second      row  file 02.xlsx
1    Third      row  file 03.xlsx

Note that the original row with filename == "file 01.xlsx" has been deleted.

Update the database from latest files: Aggregate rows found in Excel files in files_list and append them to SomeTable:

all_df_list = []
for file in files_list:  
    #reads and appends excel files
    frame = pd.read_excel(file)
    frame['filename'] = os.path.basename(file)
    all_df_list.append(frame)
xls = pd.concat(all_df_list)
xls.to_sql("SomeTable", con=engine, if_exists='append', index=False, chunksize=10000)

Contents of SomeTable:

  Column A Column B      filename
0   Second      row  file 02.xlsx
1    Third      row  file 03.xlsx
2   1.a.01   1.b.01  file 01.xlsx
3   1.a.02   1.b.02  file 01.xlsx
4   4.a.01   4.b.01  file 04.xlsx
5   4.a.02   4.b.02  file 04.xlsx

Full test code: Test code (with print statements) is:

from sqlalchemy import create_engine
import pandas as pd
connectable = 'sqlite:///foo.db'
engine = create_engine(connectable)

# For testing purposes only, initialize SomeTable so it's not empty
xls=pd.concat([
    pd.DataFrame({'Column A':['First'], 'Column B':['row'], 'filename':['file 01.xlsx']}),
    pd.DataFrame({'Column A':['Second'], 'Column B':['row'], 'filename':['file 02.xlsx']}),
    pd.DataFrame({'Column A':['Third'], 'Column B':['row'], 'filename':['file 03.xlsx']})
])
xls.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)

# read and print SomeTable for verification of initial state
df = pd.read_sql_table("SomeTable", connectable) 
print('======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):')
print(df)

# specify files to use to update SomeTable in the database
files_list = ['file 01.xlsx', 'file 04.xlsx']

import os

# accumulate new file names in files_list into "new_file"
basenames = set()
for file in files_list:
    basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})
dfBasenames.to_sql("new_file", con=engine, if_exists='replace', index=False, chunksize=10000)

# read and print table for verification of correct result
df = pd.read_sql_table("new_file", connectable) 
print('======== VERIFY FILENAMES TO UPDATE FROM: read_sql_table("new_file"):')
print(df)

# delete rows in SomeTable with filename column found in new_file table
with engine.connect() as connection:
    result = connection.execute('delete from SomeTable where exists (select 1 from new_file where new_file.filename = SomeTable.filename)')

# read and print table for verification of correct result
df = pd.read_sql_table("SomeTable", connectable) 
print('======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES: read_sql_table("SomeTable"):')
print(df)

# aggregate rows found in Excel files in files_list into all_df_list
all_df_list = []
for file in files_list:  
    #reads and appends excel files
    frame = pd.read_excel(file)
    frame['filename'] = os.path.basename(file)
    print(f'\nfile {file}:\n{frame}')
    all_df_list.append(frame)

# append rows in all_df_list to SomeTable
xls = pd.concat(all_df_list)
xls.to_sql("SomeTable", con=engine, if_exists='append', index=False, chunksize=10000)

# read and print table for verification of correct result
df = pd.read_sql_table("SomeTable", connectable) 
print('======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):')
print(df)

Test output:

======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):
  Column A Column B      filename
0    First      row  file 01.xlsx
1   Second      row  file 02.xlsx
2    Third      row  file 03.xlsx
======== VERIFY FILENAMES TO UPDATE FROM: read_sql_table("new_file"):
       filename
0  file 01.xlsx
1  file 04.xlsx
======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES: read_sql_table("SomeTable"):
  Column A Column B      filename
0   Second      row  file 02.xlsx
1    Third      row  file 03.xlsx

file file 01.xlsx:
  Column A Column B      filename
0   1.a.01   1.b.01  file 01.xlsx
1   1.a.02   1.b.02  file 01.xlsx

file file 04.xlsx:
  Column A Column B      filename
0   4.a.01   4.b.01  file 04.xlsx
1   4.a.02   4.b.02  file 04.xlsx
======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):
  Column A Column B      filename
0   Second      row  file 02.xlsx
1    Third      row  file 03.xlsx
2   1.a.01   1.b.01  file 01.xlsx
3   1.a.02   1.b.02  file 01.xlsx
4   4.a.01   4.b.01  file 04.xlsx
5   4.a.02   4.b.02  file 04.xlsx

Upvotes: 1

Related Questions