Walucas

Reputation: 2568

Python CSV Import taking too much time after first 200k rows

I have a Python script that imports a CSV into a MySQL database. It takes around 45 seconds to import 20k rows. The problem is that I now have a huge file (4 million rows), and once it reaches roughly line 200k of the file, importing the same 20k rows takes about 200 seconds.

I don't know whether this is a DB issue or the file reading itself. Here is my script:

import csv
import mysql.connector
from mysql.connector import Error
import argparse
from itertools import islice

parser = argparse.ArgumentParser()
parser.add_argument('file', type=argparse.FileType('r'))
args = parser.parse_args() 



def parseData(row):
    sql = """INSERT INTO TABLE 
    (column1,column2) 
    VALUES( '%s','%s');""" % (
        row['col1'],row['col2'],) 


    return sql


def dataImport(filename, connection, cursor):
    data = csv.DictReader(filename, delimiter=';')
    i = 0
    for row in data:
        sql = parseData(row)
        try:
            result = cursor.execute(sql)
        except Error as e:
            pass

        if i % 20000 == 0:
            connection.commit()

        i = i + 1

def importa(file):

    try:
        connection = mysql.connector.connect(host=host,
                                            database=db,
                                            user=user,
                                            password=password)

        if connection.is_connected():
            db_Info = connection.get_server_info()
            print("Connected to MySQL Server version ", db_Info)

            cursor = connection.cursor()
            dataImport(file,connection,cursor)

            cursor.close()


    except Error as e:
        print("Error while connecting to MySQL", e)


if __name__ == "__main__":
    importa(args.file)

Upvotes: 1

Views: 868

Answers (1)

Deepstop

Reputation: 3807

Using executemany to send one batched INSERT of 20K rows instead of 20K individual INSERT statements will speed up the transfer, but the database itself could also be a factor in the slow performance. There is a code example just a bit past the middle of this page.
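A minimal sketch of what that could look like, assuming hypothetical table and column names (my_table, column1, column2, and the col1/col2 CSV headers from the question) and the connection/cursor already set up by the script:

import csv

BATCH_SIZE = 20000

def dataImportBatched(csvfile, connection, cursor):
    # Parameterized statement; the driver handles quoting and escaping.
    sql = "INSERT INTO my_table (column1, column2) VALUES (%s, %s)"

    reader = csv.DictReader(csvfile, delimiter=';')
    batch = []
    for row in reader:
        batch.append((row['col1'], row['col2']))
        if len(batch) >= BATCH_SIZE:
            cursor.executemany(sql, batch)  # one round trip per batch
            connection.commit()
            batch = []

    if batch:  # flush the final partial batch
        cursor.executemany(sql, batch)
        connection.commit()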

Based on our exchange of comments, the database is the ultimate culprit in the slow performance. One idea that I've seen is to disable indexing until after everything is loaded.

ALTER TABLE table DISABLE KEYS;
... executemany called a bunch of times...
ALTER TABLE table ENABLE KEYS;

That might be worth a try.
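A sketch of how that could wrap the batched load above, assuming the same hypothetical my_table name; note that DISABLE KEYS only suspends non-unique index maintenance on MyISAM tables, so on InnoDB it is mostly a no-op:

def bulkLoad(csvfile, connection, cursor):
    # Suspend non-unique index updates while loading (MyISAM only).
    cursor.execute("ALTER TABLE my_table DISABLE KEYS")
    try:
        dataImportBatched(csvfile, connection, cursor)  # executemany loop sketched above
    finally:
        # Rebuild the indexes once, after all rows are in.
        cursor.execute("ALTER TABLE my_table ENABLE KEYS")
        connection.commit()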

Upvotes: 1
