Reputation: 913
I have a huge CSV file and I am reading it using the Python csv library's DictReader. Each record has a serial number and some associated information. In my application I take a list of serial numbers provided by the user and check whether those are present in the CSV file or not. First Implementation:
import csv

reader = csv.DictReader(open('sample.csv', 'rb'))
arr = [1000, 7777, 3434, 2121, 9999]
for row in reader:
    if row['Id'] in arr:
        print row['Title']
But this takes too long, as my CSV file contains over 100,000 entries.
Second Implementation:
reader = csv.DictReader(open('sample.csv', 'rb'))
arr = [1000, 7777, 3434, 2121, 9999]
arr.sort()
i = 0
for row in reader:
    if row['Id'] == arr[i]:
        print row['Title']
        i = i + 1
But this gives inconsistent results, i.e. sometimes it prints the Title only for the first two or three serial numbers in arr.
I want a more efficient way, ideally a direct hit on a particular serial number. Is that possible?
Please don't suggest linecache or anything line-based, because my Title is spread over multiple lines, so one CSV record is not equal to one line in the file.
Upvotes: 0
Views: 242
Reputation: 22443
You are attempting to read a 100,000-line text file to find a tiny number of matches.
I would seriously consider preprocessing that csv file into an sqlite3 database prior to these lookups.
I doubt that the csv file is supplied each time that the user requests a few lookup details, so it should be possible.
Of course this depends on how often the csv file is updated but I would bet that it isn't that often. A single preprocessing of the csv into an sqlite database, used for multiple lookups would pay dividends.
When the only tool you have is a hammer, everything looks like a nail!
Edit: The other thing to consider is that you think you are having issues now, but what happens when the csv file has grown to 2 or 3 Lakh lines? At some point you are going to have to bite the bullet and either have the csv file delivered in some structured format or structure it yourself.
There is also the issue of what the csv file contains. Currently you have no assurance that it is not full of duplicates, which could be seriously messing up your processing. If you apply a structure to the data, not only would it be vastly quicker to search, but you would also be assured of clean data at the same time.
Edit 2:
Here is a tiny Python script to create a database with 2 Lakh records.
Clearly in your case you will have to read the csv file and populate more fields, but this simple test takes just 4.5 seconds on an old 64-bit PC.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sqlite3

db_name = "2lakh.db"
try:
    os.remove(db_name)
except OSError:
    pass
db = sqlite3.connect(db_name)
cursor = db.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS Big (Big_id INTEGER NOT NULL PRIMARY KEY UNIQUE, Big_data CHAR)')
cursor.execute('PRAGMA synchronous = 0')  # Hands off data handling to the OS
n = 0
while n < 200000:
    try:
        db.execute("insert into Big (Big_id, Big_data) values (?,?)", (n, "This is some data"))
    except sqlite3.Error as e:
        print 'Big Insert Error ' + str(e), 'Error'
    n += 1
    # only report progress and commit the database every 10000 records (speeds things up immensely)
    if (n % 10000) == 0:
        db.commit()
        print n, "records written"
db.commit()
db.close()
If you only perform the db.commit() every 100,000 transactions, the time taken to create the entire database is less than 3 seconds.
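To complete the picture, here is a minimal sketch of the lookup side, assuming the real table has been populated from the csv with Id and Title columns (the table and column names here are illustrative, not the ones from the script above):

import sqlite3

arr = [1000, 7777, 3434, 2121, 9999]
db = sqlite3.connect('sample.db')
cursor = db.cursor()
# Build one parameterised IN (...) query so sqlite can use the primary-key index for a direct hit.
placeholders = ','.join('?' * len(arr))
cursor.execute('SELECT Id, Title FROM Records WHERE Id IN (%s)' % placeholders, arr)
for record_id, title in cursor.fetchall():
    print record_id, title
db.close()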
I hope that this helps.
Upvotes: 2
Reputation: 25023
Your first implementation is the way to go, algorithmically speaking, but if it is too slow you have two or three possible optimizations:

1. Use a set instead of a list.
2. Use a list of lists instead of a list of dictionaries, i.e. don't use csv.DictReader but the simpler csv.reader.
3. Use a compiled re to match your targets and test the current id against the compiled re.

I wrote "two or three" because I'm not so sure that the third one is a real optimization; if all else fails then imho it's worth testing this last possibility, but ... The first two points are sketched below. BTW, what is a Lakh?
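A minimal sketch of the first two suggestions, assuming the Id column is the first field and the Title column is the second in sample.csv (adjust the indexes to the real layout):

import csv

targets = {'1000', '7777', '3434', '2121', '9999'}  # a set of strings, since csv fields arrive as strings

with open('sample.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)                  # skip the header row
    for row in reader:
        if row[0] in targets:     # row[0] assumed to hold the Id
            print row[1]          # row[1] assumed to hold the Title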
Upvotes: 0
Reputation: 8202
If the csv file is to be accessed multiple times, read it once and save the data in a randomly-accessible indexed form, such as a database. Or, if you are filtering it to obtain a small subset of the available rows, make a single first pass to throw away everything that is certainly rubbish and write a new, smaller csv file containing only the useful data.
If the code you've written extracts all that you'll ever need from this instance of the csv file, I don't think that there's much you can do to improve it.
As to it not working right, are you certain you don't want to look for arr[1] (7777) until after you have found arr[0] (1000)? If you want all lines having row['Id'] matching anything in arr regardless of ordering, you need to test row['Id'] in arr. Another potential problem is that a csv might contain a number 1000 (or even 999.999999999) on some lines and a string "1000" (or "1000 " etc.) on others, matching the original spreadsheet. 1000 != "1000" and "1000" != "1000 ", so some careful data massaging may be in order before comparing values for equality.
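A minimal sketch of that kind of massaging, assuming the ids are meant to be whole numbers (the helper normalise_id is just an illustrative name, not something from the question):

import csv

def normalise_id(value):
    # Turn "1000", "1000 " or 1000.0 into the integer 1000; return None for anything else.
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    return int(number) if number == int(number) else None

wanted = {1000, 7777, 3434, 2121, 9999}
reader = csv.DictReader(open('sample.csv', 'rb'))
for row in reader:
    if normalise_id(row['Id']) in wanted:
        print row['Title']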
Upvotes: 0
Reputation: 2745
How large is arr in your real code? If it is much bigger than this, it would probably pay to use a set.
arr={1000,7777,3434,2121,9999}
A set has much faster containment checking, which seems to be the main bottleneck here.
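For example, dropped into the loop from the question; note that the serial numbers are written as strings here, on the assumption that the Id field comes out of DictReader as a string:

import csv

arr = {'1000', '7777', '3434', '2121', '9999'}  # set literal: membership tests are O(1) on average
reader = csv.DictReader(open('sample.csv', 'rb'))
for row in reader:
    if row['Id'] in arr:
        print row['Title']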
Upvotes: 0