Reputation: 7099
I'm trying to extract huge amounts of data from a DB and write it to a CSV file, and I'm trying to figure out the fastest way to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.
with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))
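For comparison, the fetchall-and-writerows version I benchmarked against was along these lines (a rough sketch, using the same cursor and filename as above):

with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])
    # Pull every row into memory at once, then hand the whole list to writerows
    writer.writerows(cursor.fetchall())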
Edit: The database I'm using is unique to the small company that I'm working for, so unfortunately I can't provide much information on that front. I'm using jpype to connect to the database, since the only means of connecting is via a JDBC driver. I'm running CPython 2.7.5; I'd love to use PyPy, but it doesn't work with Pandas.
Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. The row version from the answer below has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!
Upvotes: 2
Views: 5054
Reputation: 366153
With the little you've given us to go on, it's hard to be more specific, but…
I've wrapped your code up as a function, and written three alternative versions:
def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())
Notice that the last one is the one you say you tried.
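For reference, manual in the timings below is just your code from the question wrapped up in a function with the same setup as the others; roughly:

def manual():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        count = 0
        builder = []
        row = cursor.fetchone()
        DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
        while row:
            count += 1
            builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
            if count == 1000:
                count = 0
                f.write(''.join(builder))
                builder[:] = []
            row = cursor.fetchone()
        f.write(''.join(builder))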
Now, I wrote this test driver:
import csv
import random
import sqlite3
import string
import timeit

def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'
for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))
And here are the results:
manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588
So, your code takes nearly twice as long as calling fetchall and writerows in my test!
When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.
I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone is doing all of the same work as fetchall and then feeding you the list one row at a time; with a remote database, fetchone may do anything from fetching all the lines, to fetching a buffer at a time, to fetching a row at a time, making it potentially much slower or faster than fetchall, depending on your data.
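If memory is what's keeping you away from fetchall, one middle ground worth trying is fetchmany, which pulls a fixed-size batch at a time but still lets writerows do the formatting. A sketch in the same style as the functions above (the rowsmany name and the batch size of 1000 are arbitrary):

def rowsmany():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        while True:
            # Only ever hold one batch of rows in memory at a time
            batch = cursor.fetchmany(1000)
            if not batch:
                break
            writer.writerows(batch)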
But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version: CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one also might run your code faster too…) and so on.
Upvotes: 5