Luke
Luke

Reputation: 7099

Fastest way to write database table to file in python

I'm trying to extract huge amounts of data from a DB and write it to a csv file. I'm trying to find out what the fastest way would be to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.

with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder 
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))

Edit: The database I'm using is unique to the small company that I'm working for, so unfortunately I can't provide much information on that front. I'm using jpype to connect with the database since the only means of connecting is via a jdbc driver. I'm running cPython 2.7.5; would love to use PyPy but it doesn't work with Pandas.

Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!

Upvotes: 2

Views: 5054

Answers (1)

abarnert
abarnert

Reputation: 366153

With the little you've given us to go on, it's hard to be more specific, but…

I've wrapped your code up as a function, and written three alternative versions:

def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())

Notice that the last one is the one you say you tried.

Now, I wrote this test driver:

def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'

for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))

And here are the results:

manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588

So, your code takes nearly twice as long as calling fetchall and writerows in my test!

When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.

I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone is doing all of the same work as fetchall and then feeding you the list one at a time; with a remote database, fetchone may do anything from fetch all the lines, to fetching a buffer at a time, to fetching a row at a time, making it potentially much slower or faster than fetchall, depending on your data.

But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version—CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one also might run your code faster too…) and so on.

Upvotes: 5

Related Questions