Reputation: 915

make MariaDB update from Python much faster

I have a python script that aggregates data from multiple sources to one, for technical reasons.

In this script, I create an employees table fills it with data and in a second step, fetch each employee's name/last name from another data source. My code is the following:

Create the table and fill it with data:

def createIdentite(mariaConnector, fmsConnector):
    print('Creating table "Identite"...')
    mariadbCursor = mariaConnector.cursor()

    # verify we have the destination tables we need
    print(' Checking for table Identite...')
    if mariaCheckTableExists(mariadbConnector, 'Identite') == False:
        print(' Table doesn\'t exist, creating it...')
        mariadbCursor.execute("""
            CREATE TABLE Identite (
                PK_FP VARCHAR(50) NOT NULL,
                LieuNaissance TEXT, 
                PaysNaissance TEXT, 
                Name TEXT,
                LastName TEXT,

                Nationalite TEXT, 
                PaysResidence TEXT, 
                PersonneAPrevenir TEXT, 
                Tel1_PAP TEXT, 
                Tel2_PAP TEXT,
                CategorieMutuelle TEXT,
                Ep1_MUTUELLE BOOLEAN,
                TypeMutuelle BOOLEAN,
                NiveauMutuelle BOOLEAN,
                NiveauMutuelle2 BOOLEAN,
                NiveauMutuelle3 BOOLEAN,
                PartMutuelleSalarie FLOAT,
                PartMutuelleSalarieOption FLOAT,
                PRIMARY KEY (PK_FP)
            )
            """)
        mariadbCursor.execute("CREATE INDEX IdentitePK_FP ON Identite(PK_FP)")
    else:
        # flush the table
        print(' Table exists, flushing it...')
        mariadbCursor.execute("DELETE FROM Identite")    

    # now fill it with fresh data
    print(' Retrieving the data from FMS...')
    fmsCursor = fmsConnector.cursor()
    fmsCursor.execute("""
        SELECT 
            PK_FP,
            Lieu_Naiss_Txt, 
            Pays_Naiss_Txt, 
            Nationalite_Txt, 
            Pays_Resid__Txt, 
            Pers_URG, 
            Tel1_URG, 
            Tel2_URG,
            CAT_MUTUELLE,
            CASE WHEN Ep1_MUTUELLE = 'OUI' THEN 1 ELSE 0 END as Ep1_MUTUELLE,
            CASE WHEN TYPE_MUT = 'OUI' THEN 1 ELSE 0 END as TYPE_MUT, 
            CASE WHEN Niv_Mutuelle IS NULL THEN 0 ELSE 1 END as Niv_Mutuelle,
            CASE WHEN NIV_MUTUELLE[2] IS NULL THEN 0 ELSE 1 END as Niv_Mutuelle2,
            CASE WHEN NIV_MUTUELLE[3] IS NULL THEN 0 ELSE 1 END as Niv_Mutuelle3,
            PART_MUT_SAL,
            PART_MUT_SAL_Option
        FROM B_EMPLOYE
        WHERE PK_FP IS NOT NULL
        """)

    print(' Transferring...')
    #for row in fmsCursor:
    insert = """INSERT INTO Identite (
                PK_FP,
                LieuNaissance, 
                PaysNaissance, 
                Nationalite, 
                PaysResidence,
                PersonneAPrevenir, 
                Tel1_PAP,
                Tel2_PAP, 
                CategorieMutuelle,
                Ep1_MUTUELLE,
                TypeMutuelle, 
                NiveauMutuelle,
                NiveauMutuelle2,
                NiveauMutuelle3,
                PartMutuelleSalarie,
                PartMutuelleSalarieOption
            ) VALUES (
                %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
            )"""
    values = fmsCursor.fetchall()
    mariadbCursor.executemany(insert, values)
    mariaConnector.commit()
    print(' Inserted '+str(len(values))+' values')
    return len(values)

And the part where I retrieve first name and last name:

def updateEmployeeNames(mariaConnector, mssqlConnector):
    print("Updating employee names...")
    mariadbCursor = mariaConnector.cursor()
    mssqlCursor = mssqlConnector.cursor()

    mssqlCursor.execute("SELECT Name, LastName, PK_FP FROM F_Person")
    rows = mssqlCursor.fetchall()

    query = """
        UPDATE Identite
        SET Name = %s, LastName = %s
        WHERE PK_FP = %s
    """
    mariadbCursor.executemany(query, rows)

    mariadbConnector.commit()

As you might have guessed, the first function takes almost no time to execute (less that 2 seconds), where the second one take almost 20.

Python's not my strong suit, but there might be another way, the aim is to make it much faster.

I already tried adding values to createIdentite's each tuple before the executeMany, but Mysql connector won't let me do that.

Thanks a lot for your help.

Upvotes: 0

Answers (3)

Rick James

Reputation: 142346

Create the index as you create the temp table.
These combined statements work: CREATE TABLE ... SELECT ...; and INSERT INTO table ... SELECT .... However, they may be difficult to perform from Python.
It is unclear whether you need the temp table at all.
Learn how to use JOIN to get information simultaneously from two tables.

Upvotes: 0

Georg Richter

Reputation: 7476

In all MySQL Python Drivers the execute_many is rewritten, since bulk operations are not supported in MySQL, they are supported only via binary protocol in MariaDB since 10.2, full support (including delete and update) was added later and is available in the lastest 10.2, 10.3 and 10.4 versions of MariaDB Server.

The python Driver is rewriting an insert query, iterates over the number of rows and transforms the statement to

INSERT INTO t1 VALUES (row1_id, row1_data), (row2_id, row2_data),....(rown_id, row_n data)

This is quite fast, but the SQL Syntax doesn't allow this for UPDATE or DELETE. In this case the Driver needs to execute the statement n times (n= number of rows), passing the values for each row in a single statment.

MariaDB binary protocol allows to prepare the statement, executing it by sending all data at once (The execute package also contains the data).

If C would be an alternative, take a look at the bulk unittests on Github repository of MariaDB Connector/C. Otherwise you have to wait, MariaDB will likey release it's own python Driver next year.

Upvotes: 1

Gord Thompson

Reputation: 123729

So the UPDATE to the existing MariaDB table is the bottleneck, in which case it might be faster to do the update on a pandas DataFrame and then push the result the MariaDB table using pandas to_sql method. A simplified example would be ...

df_main = pd.read_sql_query(fms_query, fms_engine, index_col='PK_FP')
df_mssql = pd.read_sql_query(mssql_query, mssql_engine, index_col='PK_FP')
df_main.update(df_mssql)
df_main.to_sql('Identite', mariadb_engine, if_exists='replace',
               dtype={'PK_FP': sqlalchemy.types.String(50)})

... where fms_query and mssql_query are the queries from your question. fms_engine, mssql_engine, and mariadb_engine would be SQLAlchemy Engine objects.

Upvotes: 1

make MariaDB update from Python much faster

Answers (3)

Related Questions