Reputation: 646
My code reads some excel files and appends them to a list, also adding a column to extract the file name. Then it concatenates everything and sends it to a table.
all_df_list = []
for file in files_list:
#reads and appends excel files
frame = pd.read_excel(pd.read_excel(file, header=0, engine='openpyxl'))
frame['filename'] = os.path.basename(file)
all_df_list.append(frame)
xls=pd.concat(all_df_list)
xls.to_sql(table, con=engine, if_exists='append', index=False, chunksize=10000)
This code works just fine and the output is something like this:
Column A | Column B | filname |
---|---|---|
First | row | file 01.xlsx |
Second | row | file 02.xlsx |
What I need now is to alter my code to check for that filename in the column ( (os.path.basename(file)
) and either overwrite for only those rows or delete rows where filename = (os.path.basename(file)
and then execute the code above. Like, if that filename exists, delete the corresponding rows and execute that code above, something like that.
Any ideas how I can do that? I'm using create_engine method from sqlalchemy to access the database
Upvotes: 2
Views: 806
Reputation: 102
For what I understand, you want to delete the rows based on the filename before inserting the rows once again. Maybe this will help you
engine.execute("DELETE FROM %s WHERE filename = %s", (table, file))
if file is the full path , you should useos.path.basename(file)
instead
Upvotes: 1
Reputation: 9379
UPDATED ANSWER:
Assuming you want to minimize the work done on the database side and do as much as possible using dataframes, this will achieve the same result as my original answer (further down).
Initial state:
For testing purposes, initialize the main database table, SomeTable
, to simulate a situation in which it has been updated previously:
xls=pd.concat([
pd.DataFrame({'Column A':['First'], 'Column B':['row'], 'filename':['file 01.xlsx']}),
pd.DataFrame({'Column A':['Second'], 'Column B':['row'], 'filename':['file 02.xlsx']}),
pd.DataFrame({'Column A':['Third'], 'Column B':['row'], 'filename':['file 03.xlsx']})
])
xls.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)
Contents of SomeTable
:
Column A Column B filename
0 First row file 01.xlsx
1 Second row file 02.xlsx
2 Third row file 03.xlsx
Source of updates:
Specify the files to use to update SomeTable
in the database and load these into dfBasenames
:
files_list = ['file 01.xlsx', 'file 04.xlsx']
import os
# accumulate new file names in files_list into "new_file"
basenames = set()
for file in files_list:
basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})
Contents of dfBasenames
:
filename
0 file 04.xlsx
1 file 01.xlsx
Delete stale rows:
Delete rows in df
with filename column value found in dfBasenames
:
df = df.join(dfBasenames.assign(is_stale=True).set_index('filename'), on='filename')
df = df[df['is_stale'].isna()].drop(columns=['is_stale'])
Contents of df
:
Column A Column B filename
1 Second row file 02.xlsx
2 Third row file 03.xlsx
Note that the original row with filename == "file 01.xlsx"
has been deleted.
Update df
from latest files:
Aggregate rows found in Excel files in files_list
and concat
them together with non-stale rows in df
:
all_df_list = []
for file in files_list:
#reads and appends excel files
frame = pd.read_excel(file)
frame['filename'] = os.path.basename(file)
print(f'\nfile {file}:\n{frame}')
all_df_list.append(frame)
df = pd.concat([df] + all_df_list)
Contents of df
:
Column A Column B filename
1 Second row file 02.xlsx
2 Third row file 03.xlsx
0 1.a.01 1.b.01 file 01.xlsx
1 1.a.02 1.b.02 file 01.xlsx
0 4.a.01 4.b.01 file 04.xlsx
1 4.a.02 4.b.02 file 04.xlsx
Commit changes to database:
Overwrite SomeTable
in database using df
:
df.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)
Contents of SomeTable
:
Column A Column B filename
0 Second row file 02.xlsx
1 Third row file 03.xlsx
2 1.a.01 1.b.01 file 01.xlsx
3 1.a.02 1.b.02 file 01.xlsx
4 4.a.01 4.b.01 file 04.xlsx
5 4.a.02 4.b.02 file 04.xlsx
Full test code:
from sqlalchemy import create_engine
import pandas as pd
connectable = 'sqlite:///foo.db'
engine = create_engine(connectable)
# For testing purposes only, initialize SomeTable so it's not empty
xls=pd.concat([
pd.DataFrame({'Column A':['First'], 'Column B':['row'], 'filename':['file 01.xlsx']}),
pd.DataFrame({'Column A':['Second'], 'Column B':['row'], 'filename':['file 02.xlsx']}),
pd.DataFrame({'Column A':['Third'], 'Column B':['row'], 'filename':['file 03.xlsx']})
])
xls.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)
# read and print SomeTable for verification of initial state
df = pd.read_sql_table("SomeTable", connectable)
print('======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):')
print(df)
# specify files to use to update SomeTable in the database
files_list = ['file 01.xlsx', 'file 04.xlsx']
import os
# accumulate new file names in files_list into "new_file"
basenames = set()
for file in files_list:
basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})
print('======== VERIFY FILENAMES TO UPDATE FROM:')
print(dfBasenames)
# delete rows in SomeTable copy with filename column found in dfBasenames
df = df.join(dfBasenames.assign(is_stale=True).set_index('filename'), on='filename')
df = df[df['is_stale'].isna()].drop(columns=['is_stale'])
print('======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES:')
print(df)
# aggregate rows found in Excel files in dfBasenames into all_df_list and concat to remaining non-stale rows of SomeTable copy
all_df_list = []
for file in files_list:
#reads and appends excel files
frame = pd.read_excel(file)
frame['filename'] = os.path.basename(file)
print(f'\nfile {file}:\n{frame}')
all_df_list.append(frame)
df = pd.concat([df] + all_df_list)
print('======== VERIFY UPDATED DF READY TO COMMIT TO DB:')
print(df)
# overwrite SomeTable in database
df.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)
# read and print table for verification of correct result
df = pd.read_sql_table("SomeTable", connectable)
print('======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):')
print(df)
Test output:
======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):
Column A Column B filename
0 First row file 01.xlsx
1 Second row file 02.xlsx
2 Third row file 03.xlsx
======== VERIFY FILENAMES TO UPDATE FROM:
filename
0 file 04.xlsx
1 file 01.xlsx
======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES:
Column A Column B filename
1 Second row file 02.xlsx
2 Third row file 03.xlsx
file file 01.xlsx:
Column A Column B filename
0 1.a.01 1.b.01 file 01.xlsx
1 1.a.02 1.b.02 file 01.xlsx
file file 04.xlsx:
Column A Column B filename
0 4.a.01 4.b.01 file 04.xlsx
1 4.a.02 4.b.02 file 04.xlsx
======== VERIFY UPDATED DF READY TO COMMIT TO DB:
Column A Column B filename
1 Second row file 02.xlsx
2 Third row file 03.xlsx
0 1.a.01 1.b.01 file 01.xlsx
1 1.a.02 1.b.02 file 01.xlsx
0 4.a.01 4.b.01 file 04.xlsx
1 4.a.02 4.b.02 file 04.xlsx
======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):
Column A Column B filename
0 Second row file 02.xlsx
1 Third row file 03.xlsx
2 1.a.01 1.b.01 file 01.xlsx
3 1.a.02 1.b.02 file 01.xlsx
4 4.a.01 4.b.01 file 04.xlsx
5 4.a.02 4.b.02 file 04.xlsx
ORIGINAL ANSWER:
Here is a way to do what you've asked.
Initial state: (Same as in UPDATED ANSWER above.)
Source of updates:
Specify the files to use to update SomeTable
in the database and load these into a temporary database table new_file
:
files_list = ['file 01.xlsx', 'file 04.xlsx']
basenames = set()
for file in files_list:
basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})
dfBasenames.to_sql("new_file", con=engine, if_exists='replace', index=False, chunksize=10000)
Contents of new_file
:
filename
0 file 01.xlsx
1 file 04.xlsx
Delete stale rows:
Delete rows in SomeTable
with filename column value found in new_file
table:
with engine.connect() as connection:
result = connection.execute('delete from SomeTable where exists (select 1 from new_file where new_file.filename = SomeTable.filename)')
Contents of SomeTable
:
Column A Column B filename
0 Second row file 02.xlsx
1 Third row file 03.xlsx
Note that the original row with filename == "file 01.xlsx"
has been deleted.
Update the database from latest files:
Aggregate rows found in Excel files in files_list
and append them to SomeTable
:
all_df_list = []
for file in files_list:
#reads and appends excel files
frame = pd.read_excel(file)
frame['filename'] = os.path.basename(file)
all_df_list.append(frame)
xls = pd.concat(all_df_list)
xls.to_sql("SomeTable", con=engine, if_exists='append', index=False, chunksize=10000)
Contents of SomeTable
:
Column A Column B filename
0 Second row file 02.xlsx
1 Third row file 03.xlsx
2 1.a.01 1.b.01 file 01.xlsx
3 1.a.02 1.b.02 file 01.xlsx
4 4.a.01 4.b.01 file 04.xlsx
5 4.a.02 4.b.02 file 04.xlsx
Full test code: Test code (with print statements) is:
from sqlalchemy import create_engine
import pandas as pd
connectable = 'sqlite:///foo.db'
engine = create_engine(connectable)
# For testing purposes only, initialize SomeTable so it's not empty
xls=pd.concat([
pd.DataFrame({'Column A':['First'], 'Column B':['row'], 'filename':['file 01.xlsx']}),
pd.DataFrame({'Column A':['Second'], 'Column B':['row'], 'filename':['file 02.xlsx']}),
pd.DataFrame({'Column A':['Third'], 'Column B':['row'], 'filename':['file 03.xlsx']})
])
xls.to_sql("SomeTable", con=engine, if_exists='replace', index=False, chunksize=10000)
# read and print SomeTable for verification of initial state
df = pd.read_sql_table("SomeTable", connectable)
print('======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):')
print(df)
# specify files to use to update SomeTable in the database
files_list = ['file 01.xlsx', 'file 04.xlsx']
import os
# accumulate new file names in files_list into "new_file"
basenames = set()
for file in files_list:
basenames.add(os.path.basename(file))
dfBasenames = pd.DataFrame({'filename': list(basenames)})
dfBasenames.to_sql("new_file", con=engine, if_exists='replace', index=False, chunksize=10000)
# read and print table for verification of correct result
df = pd.read_sql_table("new_file", connectable)
print('======== VERIFY FILENAMES TO UPDATE FROM: read_sql_table("new_file"):')
print(df)
# delete rows in SomeTable with filename column found in new_file table
with engine.connect() as connection:
result = connection.execute('delete from SomeTable where exists (select 1 from new_file where new_file.filename = SomeTable.filename)')
# read and print table for verification of correct result
df = pd.read_sql_table("SomeTable", connectable)
print('======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES: read_sql_table("SomeTable"):')
print(df)
# aggregate rows found in Excel files in files_list into all_df_list
all_df_list = []
for file in files_list:
#reads and appends excel files
frame = pd.read_excel(file)
frame['filename'] = os.path.basename(file)
print(f'\nfile {file}:\n{frame}')
all_df_list.append(frame)
# append rows in all_df_list to SomeTable
xls = pd.concat(all_df_list)
xls.to_sql("SomeTable", con=engine, if_exists='append', index=False, chunksize=10000)
# read and print table for verification of correct result
df = pd.read_sql_table("SomeTable", connectable)
print('======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):')
print(df)
Test output:
======== VERIFY INITIAL STATE: read_sql_table("SomeTable"):
Column A Column B filename
0 First row file 01.xlsx
1 Second row file 02.xlsx
2 Third row file 03.xlsx
======== VERIFY FILENAMES TO UPDATE FROM: read_sql_table("new_file"):
filename
0 file 01.xlsx
1 file 04.xlsx
======== VERIFY DELETION OF ROWS MATCHING NEW FILENAMES: read_sql_table("SomeTable"):
Column A Column B filename
0 Second row file 02.xlsx
1 Third row file 03.xlsx
file file 01.xlsx:
Column A Column B filename
0 1.a.01 1.b.01 file 01.xlsx
1 1.a.02 1.b.02 file 01.xlsx
file file 04.xlsx:
Column A Column B filename
0 4.a.01 4.b.01 file 04.xlsx
1 4.a.02 4.b.02 file 04.xlsx
======== VERIFY UPDATED TABLE: read_sql_table("SomeTable"):
Column A Column B filename
0 Second row file 02.xlsx
1 Third row file 03.xlsx
2 1.a.01 1.b.01 file 01.xlsx
3 1.a.02 1.b.02 file 01.xlsx
4 4.a.01 4.b.01 file 04.xlsx
5 4.a.02 4.b.02 file 04.xlsx
Upvotes: 1