Reputation: 1054
I am fetching data from a SQL Server database and saving it to files for subsequent processing in Python.
I am using Make to automate fetching and re-fetching of the data (when some settings change, only the affected queries are run anew, not all of them). So I have a simple Makefile as follows:
    rawdata: datafile1.h5 datafile2.h5 # ... more files like this

    datafile1.h5: data1_specific_config.py common_config.py
        python fetch_data1.py

    datafile2.h5: data2_specific_config.py common_config.py
        python fetch_data2.py

    # ... similar rules for other files

(Note: prerequisites in Make are separated by spaces, not commas.)
and when needed I just run `make rawdata`.
Now all the SQL queries executed by the scripts fetch_dataN.py share a significant common part. Schematically, the queryN run by fetch_dataN.py looks like this:

    select ... into ##common_tmp_table ... /* this is identical for all queries */
    select ... from (... ##common_tmp_table ...) /* this is queryN-specific, but the same ##common_tmp_table is used */
Here is the problem: when I now run `make rawdata` in a situation where, say, five different datafiles need to be rebuilt, the same query select ... into ##common_tmp_table ... is run five times with identical output into ##common_tmp_table. The query takes quite a long time to run, so re-executing it five times slows everything down significantly. But the temporary table is always dropped when a script fetch_dataN.py finishes, because the db connection that created it is closed.
Question:
Is there a way to force the table ##common_tmp_table to be created only once and to persist across all the scripts fetch_dataN.py executed by `make rawdata`?
In particular, is there a way to use the same db connection in all the scripts run by `make rawdata`? Or perhaps to open one extra connection that persists while all the scripts are running and prevents the global temporary table from being dropped?
Work-around that I know of:
I can work around this by manually creating the ##common_tmp_table (e.g. in MS SQL Server Management Studio) before running `make rawdata` and keeping the connection used for this open until all the scripts finish. But this is obviously ugly and annoying.
If `make rawdata` could spawn a separate process that opens a connection, creates the tmp table, and keeps waiting until everything else finishes, that would be a solution. But I don't know whether this is possible.
Notes:
- MS SQL Server 2008 R2
- pyodbc 4.0.28 (for connecting to the database)
- python 3.7.6
- make 4.3
- conda 4.7.12
Thanks.
Upvotes: 0
Views: 251
Reputation: 1054
So I found a solution which works very nicely. The idea is to let `make rawdata` execute a python script which:

1. opens a db connection, creates the ##common_tmp_table, and keeps the connection open,
2. runs `make rawdata_`, which takes care of rebuilding the datafiles (just the same as `make rawdata` did in the code posted in the question, but now without select ... into ##common_tmp_table ... in the queries),
3. closes the connection (which drops the tmp table).

In code:
Makefile:
    # THIS IS NEW
    .PHONY: rawdata  # to always rebuild the rawdata target
    rawdata:
        python fetch_all_non_uptodate.py  # just call a script that (among other stuff) runs `make rawdata_`

    # THE REST IS AS BEFORE (just added underscore)
    rawdata_: datafile1.h5 datafile2.h5 # ... more files like this

    datafile1.h5: data1_specific_config.py common_config.py
        python fetch_data1.py

    datafile2.h5: data2_specific_config.py common_config.py
        python fetch_data2.py

    # ... similar rules for other files
fetch_all_non_uptodate.py:
    import subprocess
    import pyodbc

    conn = pyodbc.connect(...)  # open db connection

    # Simulate a run of make with the -q flag to find out whether all the datafiles
    # are up to date (return code 0) or not (return code 1); nothing is re-fetched yet.
    uptodate = (subprocess.run(['make', '-q', 'rawdata_']).returncode == 0)

    if not uptodate:
        # the raw datafiles are not up to date
        create_common_tmp_table(conn)  # create the ##common_tmp_table; it persists while conn is open
        conn.commit()  # commit the creation of the tmp table (important! - otherwise the other connections won't see it)
        subprocess.run(['make', 'rawdata_'])  # re-fetch whatever datafiles need to be re-fetched;
                                              # the queries can make use of the existing tmp table
    else:
        # otherwise just simulate the make output saying that everything is up to date
        print("make: Nothing to be done for 'rawdata'.")

    conn.close()
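One detail worth noting: `make -q` exits with 0 when everything is up to date, 1 when some target needs rebuilding, and 2 when an error occurs, so the one-liner above also treats Makefile errors as "rebuild needed". A small sketch (the function name is mine) that distinguishes the three cases:

```python
import subprocess

def needs_rebuild(target: str) -> bool:
    """Ask `make -q` whether `target` would be rebuilt.

    Exit code 0 means up to date, 1 means a rebuild is needed,
    and 2 (or anything else) means make itself hit an error.
    """
    rc = subprocess.run(['make', '-q', target]).returncode
    if rc in (0, 1):
        return rc == 1
    raise RuntimeError(f"'make -q {target}' failed with exit code {rc}")
```

With this, fetch_all_non_uptodate.py would call `needs_rebuild('rawdata_')` instead of comparing the return code to 0 directly, so a broken Makefile raises an error instead of silently triggering a re-fetch.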
queryN:
/*keep just the specific part - the ##common_tmp_table already exists*/
select ... from (... ##common_tmp_table ...)
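For completeness, a hypothetical sketch of what a fetch_dataN.py can look like after this change. The connection string, HDF5 key, and file name are illustrative assumptions, and the elided parts of the query are kept as placeholders:

```python
QUERY_SPECIFIC = """
select ... from (... ##common_tmp_table ...)
"""  # placeholder from the question; the real query-specific part goes here

def fetch(query: str, out_path: str) -> None:
    # Imports kept inside the function: pyodbc needs a working ODBC setup.
    import pyodbc
    import pandas as pd

    conn = pyodbc.connect('DSN=mydb')  # assumed connection string
    try:
        # ##common_tmp_table already exists: fetch_all_non_uptodate.py created it
        # and keeps the connection that created it open while this script runs.
        df = pd.read_sql(query, conn)
        df.to_hdf(out_path, key='data', mode='w')  # assumed HDF5 key
    finally:
        conn.close()

if __name__ == '__main__':
    fetch(QUERY_SPECIFIC, 'datafile1.h5')
```

Because the script only reads from the shared temp table, it needs no commit of its own; it just has to run while the creating connection is still open.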
Upvotes: 1