user735304

Reputation: 447

Python CSV to SQLite

I am "converting" a large (~1.6GB) CSV file and inserting specific fields of the CSV into a SQLite database. Essentially my code looks like:

import csv, sqlite3

conn = sqlite3.connect( "path/to/file.db" )
conn.text_factory = str  #bugger 8-bit bytestrings
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

reader = csv.reader(open("filecsv.txt", "rb"))
for field1, field2, field3, field4, field5 in reader:
  cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))

Everything works as I expect it to with the exception... IT TAKES AN INCREDIBLE AMOUNT OF TIME TO PROCESS. Am I coding it incorrectly? Is there a better way to achieve a higher performance and accomplish what I'm needing (simply convert a few fields of a CSV into SQLite table)?

**EDIT:** I tried directly importing the CSV into SQLite as suggested, but it turns out my file has commas inside fields (e.g. "My title, comma"), which creates errors with the import. It appears there are too many of those occurrences to edit the file manually...

Any other thoughts??

Upvotes: 24

Views: 44791

Answers (6)

cursor.executemany is faster than cursor.execute

As per the benchmarking at https://stackoverflow.com/a/76659706/895245, cursor.executemany should be significantly faster (3x observed) than looping over cursor.execute as in https://stackoverflow.com/a/7137270/895245. executemany runs in a single transaction, but so does the loop by default without any explicit BEGIN/COMMIT, so the speedup must come from some other factor.

main.py

from pathlib import Path
import csv
import sqlite3

f = 'tmp.sqlite'
Path(f).unlink(missing_ok=True)
connection = sqlite3.connect(f)
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x integer)")
cursor.executemany('INSERT INTO t VALUES (?)', csv.reader(open('10m.csv', 'r')))
connection.commit()
connection.close()

Tested with:

python -c 'for i in range(10000000): print(i)' > 10m.csv
time python main.py

which finished in 9s, only a small slowdown compared to the dummy range() loop (without any CSV reading) used in https://stackoverflow.com/a/76659706/895245

To select or pre-process columns from the CSV we can now just use a generator expression e.g. along the lines of:

python -c 'for i in range(10): print(f"{i},{i*2},{i*4}")' > 10x3.csv
cursor.execute("CREATE TABLE t (x integer, z integer)")
cursor.executemany('INSERT INTO t VALUES (?, ?)',
    ((d[0], d[2]) for d in csv.reader(open('10x3.csv', 'r'))))

csvkit is slow

I wish csvkit, previously mentioned at https://stackoverflow.com/a/9913925/895245, supported this use case better, but currently it is just much slower than the minimal manual executemany script above, e.g.:

echo x > 10m-head.csv
python -c 'for i in range(10000000): print(i)' >> 10m-head.csv
time csvsql --db sqlite:///tmp.sqlite --insert --table t 10m-head.csv

gives 1m25s.

Theoretical speed limit with sqlite .import

Just to see how far off we are due to Python overhead, assuming that nothing can beat the more native .import command:

sqlite3 tmp.sqlite 'create table t(x integer)'
time sqlite3 tmp.sqlite ".import --csv 10m.csv t"

Result on my SSD: 5.8s. So considerably faster than Python, but good to see that we are not that far off either.

Tested on Ubuntu 23.04, Python 3.11.2, Lenovo ThinkPad P51, SSD: Samsung MZVLB512HAJQ-000L7 512GB SSD, 3 GB/s nominal speed, csvkit==1.0.7, sqlite 3.40.1.

Upvotes: 0

Powers

Reputation: 19308

Pandas makes it easy to load big files into databases. Read the CSV file into a Pandas DataFrame and then use the Pandas SQL writer, so Pandas does all the hard work. Here's how to write the data to the database in 100,000-row chunks.

import pandas as pd
import sqlite3

conn = sqlite3.connect('path/to/file.db')

orders = pd.read_csv('path/to/your/file.csv')
orders.to_sql('orders', conn, if_exists='append', index=False, chunksize=100000)

Modern Pandas versions are very performant. Don't reinvent the wheel.
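If you would rather not read the whole 1.6GB file into memory at once, read_csv can also stream it in chunks. A minimal sketch, assuming the CSV has a header row that names the columns (field2/field4 are just the question's placeholder names):

import pandas as pd
import sqlite3

conn = sqlite3.connect('path/to/file.db')

# read_csv(..., chunksize=...) returns an iterator of DataFrames,
# so only ~100,000 rows are held in memory at a time
for chunk in pd.read_csv('path/to/your/file.csv', chunksize=100000):
    # keep only the columns of interest before writing
    chunk[['field2', 'field4']].to_sql('mytable', conn,
                                       if_exists='append', index=False)

conn.close()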

Upvotes: 2

Chris N

Reputation: 627

Try using transactions.

begin    
insert 50,000 rows    
commit

That will commit data periodically rather than once per row.
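A minimal sketch of that pattern with the sqlite3 module (it assumes the table from the question already exists; the 50,000-row batch size is arbitrary):

import csv
import sqlite3
from itertools import islice

conn = sqlite3.connect("path/to/file.db")
cur = conn.cursor()

with open("filecsv.txt", "r", newline="") as f:
    reader = csv.reader(f)
    while True:
        # pull the next batch of rows off the reader
        batch = list(islice(reader, 50000))
        if not batch:
            break
        # the sqlite3 module opens a transaction implicitly before the
        # inserts; commit() ends it, so there is one commit per batch
        # instead of one per row
        cur.executemany('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)',
                        ((row[1], row[3]) for row in batch))
        conn.commit()

conn.close()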

Upvotes: 4

migonzalvar

Reputation: 1685

As has been said (by Chris and Sam), transactions greatly improve insert performance.

Let me also recommend another option: csvkit, a suite of Python utilities for working with CSV.

To install:

pip install csvkit

To solve your problem:

csvsql --db sqlite:///path/to/file.db --insert --table mytable filecsv.txt

Upvotes: 19

Sam

Reputation: 421

Chris is right - use transactions; divide the data into chunks and then store it.

"... Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database." - Source: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

"... there is another trick you can use to speed up SQLite: transactions. Whenever you have to do multiple database writes, put them inside a transaction. Instead of writing to (and locking) the file each and every time a write query is issued, the write will only happen once when the transaction completes." - Source: How Scalable is SQLite?

import csv, sqlite3, time

def chunks(data, rows=10000):
    """ Divides the data into chunks of `rows` rows each """

    for i in xrange(0, len(data), rows):
        yield data[i:i+rows]


if __name__ == "__main__":

    t = time.time()

    conn = sqlite3.connect( "path/to/file.db" )
    conn.text_factory = str  #bugger 8-bit bytestrings
    conn.isolation_level = None  # manage transactions manually with BEGIN/COMMIT
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

    # chunks() slices the data, so read the whole CSV into a list first
    csvData = list(csv.reader(open("filecsv.txt", "rb")))

    divData = chunks(csvData) # divide into 10000 rows each

    for chunk in divData:
        cur.execute('BEGIN TRANSACTION')

        for field1, field2, field3, field4, field5 in chunk:
            cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))

        cur.execute('COMMIT')

    print "\n Time Taken: %.3f sec" % (time.time()-t)

Upvotes: 32

fengb

Reputation: 1488

It's possible to import the CSV directly:

sqlite> .separator ","
sqlite> .import filecsv.txt mytable

http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
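Regarding the commas-inside-fields problem from the question's edit: .separator "," splits on every comma, but switching the shell into CSV mode first should make .import respect quoted fields (a sketch, assuming a reasonably recent sqlite3 shell that supports .mode csv):

sqlite> .mode csv
sqlite> .import filecsv.txt mytable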

Upvotes: 26
