Reputation: 915
I have a python script that aggregates data from multiple sources to one, for technical reasons.
In this script, I create an employees table fills it with data and in a second step, fetch each employee's name/last name from another data source. My code is the following:
Create the table and fill it with data:
def createIdentite(mariaConnector, fmsConnector):
print('Creating table "Identite"...')
mariadbCursor = mariaConnector.cursor()
# verify we have the destination tables we need
print(' Checking for table Identite...')
if mariaCheckTableExists(mariadbConnector, 'Identite') == False:
print(' Table doesn\'t exist, creating it...')
mariadbCursor.execute("""
CREATE TABLE Identite (
PK_FP VARCHAR(50) NOT NULL,
LieuNaissance TEXT,
PaysNaissance TEXT,
Name TEXT,
LastName TEXT,
Nationalite TEXT,
PaysResidence TEXT,
PersonneAPrevenir TEXT,
Tel1_PAP TEXT,
Tel2_PAP TEXT,
CategorieMutuelle TEXT,
Ep1_MUTUELLE BOOLEAN,
TypeMutuelle BOOLEAN,
NiveauMutuelle BOOLEAN,
NiveauMutuelle2 BOOLEAN,
NiveauMutuelle3 BOOLEAN,
PartMutuelleSalarie FLOAT,
PartMutuelleSalarieOption FLOAT,
PRIMARY KEY (PK_FP)
)
""")
mariadbCursor.execute("CREATE INDEX IdentitePK_FP ON Identite(PK_FP)")
else:
# flush the table
print(' Table exists, flushing it...')
mariadbCursor.execute("DELETE FROM Identite")
# now fill it with fresh data
print(' Retrieving the data from FMS...')
fmsCursor = fmsConnector.cursor()
fmsCursor.execute("""
SELECT
PK_FP,
Lieu_Naiss_Txt,
Pays_Naiss_Txt,
Nationalite_Txt,
Pays_Resid__Txt,
Pers_URG,
Tel1_URG,
Tel2_URG,
CAT_MUTUELLE,
CASE WHEN Ep1_MUTUELLE = 'OUI' THEN 1 ELSE 0 END as Ep1_MUTUELLE,
CASE WHEN TYPE_MUT = 'OUI' THEN 1 ELSE 0 END as TYPE_MUT,
CASE WHEN Niv_Mutuelle IS NULL THEN 0 ELSE 1 END as Niv_Mutuelle,
CASE WHEN NIV_MUTUELLE[2] IS NULL THEN 0 ELSE 1 END as Niv_Mutuelle2,
CASE WHEN NIV_MUTUELLE[3] IS NULL THEN 0 ELSE 1 END as Niv_Mutuelle3,
PART_MUT_SAL,
PART_MUT_SAL_Option
FROM B_EMPLOYE
WHERE PK_FP IS NOT NULL
""")
print(' Transferring...')
#for row in fmsCursor:
insert = """INSERT INTO Identite (
PK_FP,
LieuNaissance,
PaysNaissance,
Nationalite,
PaysResidence,
PersonneAPrevenir,
Tel1_PAP,
Tel2_PAP,
CategorieMutuelle,
Ep1_MUTUELLE,
TypeMutuelle,
NiveauMutuelle,
NiveauMutuelle2,
NiveauMutuelle3,
PartMutuelleSalarie,
PartMutuelleSalarieOption
) VALUES (
%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
)"""
values = fmsCursor.fetchall()
mariadbCursor.executemany(insert, values)
mariaConnector.commit()
print(' Inserted '+str(len(values))+' values')
return len(values)
And the part where I retrieve first name and last name:
def updateEmployeeNames(mariaConnector, mssqlConnector):
print("Updating employee names...")
mariadbCursor = mariaConnector.cursor()
mssqlCursor = mssqlConnector.cursor()
mssqlCursor.execute("SELECT Name, LastName, PK_FP FROM F_Person")
rows = mssqlCursor.fetchall()
query = """
UPDATE Identite
SET Name = %s, LastName = %s
WHERE PK_FP = %s
"""
mariadbCursor.executemany(query, rows)
mariadbConnector.commit()
As you might have guessed, the first function takes almost no time to execute (less that 2 seconds), where the second one take almost 20.
Python's not my strong suit, but there might be another way, the aim is to make it much faster.
I already tried adding values to createIdentite's each tuple before the executeMany, but Mysql connector won't let me do that.
Thanks a lot for your help.
Upvotes: 0
Views: 796
Reputation: 142346
CREATE TABLE ... SELECT ...;
and INSERT INTO table ... SELECT ...
. However, they may be difficult to perform from Python.JOIN
to get information simultaneously from two tables.Upvotes: 0
Reputation: 7476
In all MySQL Python Drivers the execute_many is rewritten, since bulk operations are not supported in MySQL, they are supported only via binary protocol in MariaDB since 10.2, full support (including delete and update) was added later and is available in the lastest 10.2, 10.3 and 10.4 versions of MariaDB Server.
The python Driver is rewriting an insert query, iterates over the number of rows and transforms the statement to
INSERT INTO t1 VALUES (row1_id, row1_data), (row2_id, row2_data),....(rown_id, row_n data)
This is quite fast, but the SQL Syntax doesn't allow this for UPDATE or DELETE. In this case the Driver needs to execute the statement n times (n= number of rows), passing the values for each row in a single statment.
MariaDB binary protocol allows to prepare the statement, executing it by sending all data at once (The execute package also contains the data).
If C would be an alternative, take a look at the bulk unittests on Github repository of MariaDB Connector/C. Otherwise you have to wait, MariaDB will likey release it's own python Driver next year.
Upvotes: 1
Reputation: 123729
So the UPDATE to the existing MariaDB table is the bottleneck, in which case it might be faster to do the update on a pandas DataFrame and then push the result the MariaDB table using pandas to_sql
method. A simplified example would be ...
df_main = pd.read_sql_query(fms_query, fms_engine, index_col='PK_FP')
df_mssql = pd.read_sql_query(mssql_query, mssql_engine, index_col='PK_FP')
df_main.update(df_mssql)
df_main.to_sql('Identite', mariadb_engine, if_exists='replace',
dtype={'PK_FP': sqlalchemy.types.String(50)})
... where fms_query
and mssql_query
are the queries from your question. fms_engine
, mssql_engine
, and mariadb_engine
would be SQLAlchemy Engine
objects.
Upvotes: 1