Reputation: 14691
I read many files from my system. I want to read them faster, maybe like this:
results=[]
for file in open("filenames.txt").readlines():
    results.append(open(file,"r").read())
I don't want to use threading. Any advice is appreciated.
The reason I don't want to use threads is that they would make my code unreadable; I want some clever way to make it faster with less code that is easier to understand.
Yesterday I tested another solution with multiprocessing, but it performed badly and I don't know why. Here is the code:
def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()
@cost_time
@statistics_db
def batch_xml2db():
    from multiprocessing import Pool,Queue
    p=Pool(5)
    #q=Queue()
    files=glob.glob(g_filter)
    #for file in files:
    #    q.put(file)
    def P():
        while q.qsize()<>0:
            xml2db(q.get())
    p.map(xml2db,files)
    p.join()
Upvotes: 1
Views: 285
Reputation: 26088
Not sure if this is still the code you are using. Here are a couple of adjustments I would consider making.
Original:
def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()
Updated: remove the use of the dict; it adds extra object creation, iteration, and collection.
def xml2db(file):
    s=pq(open(file,"r").read())
    p=Product()
    for k in g_fields:
        v=s("field[@name='%s']"%k).text()
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()
You could profile the code using the Python profiler. This might tell you where the time is being spent. It may be in session.commit(); if so, you may need to reduce it to one commit every couple of files. I have no idea what it does, so that is really a stab in the dark; you might also try running it without sending or writing any output.
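A minimal sketch of how that profiling could look with the standard-library cProfile and pstats modules, assuming batch_xml2db() is defined in the script being profiled:

# Minimal profiling sketch using the standard-library cProfile/pstats modules.
# Assumes batch_xml2db() is defined in the current script's namespace.
import cProfile
import pstats

cProfile.run("batch_xml2db()", "batch_stats")   # write raw stats to a file

stats = pstats.Stats("batch_stats")
stats.sort_stats("cumulative").print_stats(20)  # top 20 functions by cumulative time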
If you can separate your code into reading, processing, and writing, you can test the cost of each stage individually:
A) Reading cost: time how long it takes just to read all the files.
B) Processing cost: load a single file into memory and process it enough times to represent the entire job, without paying any extra reading I/O.
C) Output cost: save a whole batch of sessions representative of your job size.
This should show you what is taking the most time and whether any improvement can be made in any area.
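One way to sketch that three-way split, reusing the g_filter, pq, g_fields and session names from the question (the exact loops are only illustrative):

# Rough sketch: time reading (A), processing (B) and writing (C) separately.
# g_filter, pq, g_fields and session are the names used in the question.
import glob, time

files = glob.glob(g_filter)

# A) Reading cost: just read every file, nothing else.
t0 = time.time()
contents = [open(f, "r").read() for f in files]
print "read:    %.2fs" % (time.time() - t0)

# B) Processing cost: parse one file over and over to stand in for the whole
#    job without paying any additional read I/O.
t0 = time.time()
for _ in xrange(len(files)):
    doc = pq(contents[0])
    for field in g_fields:
        doc("field[@name='%s']" % field).text()
print "process: %.2fs" % (time.time() - t0)

# C) Output cost: commit a batch of sessions representative of the job size.
t0 = time.time()
for _ in xrange(len(files)):
    session.commit()
print "write:   %.2fs" % (time.time() - t0)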
Upvotes: 0
Reputation: 670
Is this a one-off requirement or something that you need to do regularly? If it's something you're going to be doing often, consider using MySQL or another database instead of a file system.
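For illustration only, here is a minimal sketch of that idea using the standard-library sqlite3 module as a stand-in for MySQL; the database, table, and column names are made up:

# Sketch: keep file contents in one database instead of many small files.
# sqlite3 is used here only as a stand-in for MySQL; names are illustrative.
import sqlite3

conn = sqlite3.connect("files.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, body TEXT)")

for line in open("filenames.txt"):
    name = line.strip()
    conn.execute("INSERT OR REPLACE INTO files (name, body) VALUES (?, ?)",
                 (name, open(name, "r").read()))
conn.commit()

# Later, a single query pulls everything back without touching thousands of files.
rows = conn.execute("SELECT name, body FROM files").fetchall()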
Upvotes: 0
Reputation: 64740
Well, if you want to improve performance then improve the algorithm, right? What are you doing with all this data? Do you really need it all in memory at the same time, potentially running out of memory if filenames.txt specifies too many files or files that are too large?
If you're doing this with lots of files I suspect you are thrashing, hence your 700+ second (roughly twelve-minute) run time. Even my poor little HD can sustain 42 MB/s writes (42 MB/s * 714 s ≈ 30 GB). Take that with a grain of salt, knowing you must both read and write, but I'm guessing you don't have over 8 GB of RAM available for this application. A related SO question/answer suggested you use mmap, and the answer above that suggested an iterative/lazy read like what you get in Haskell for free. These are probably worth considering if you really do have tens of gigabytes to munge.
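A minimal sketch of the lazy, one-file-at-a-time approach (process_data is a hypothetical placeholder for whatever you actually do with each file), plus the mmap variant for very large individual files:

# Lazy, iterative reading: only one file's contents is in memory at a time,
# rather than accumulating everything in a results list.
def iter_file_contents(list_path):
    for line in open(list_path):
        name = line.strip()
        if name:
            yield open(name, "r").read()

for data in iter_file_contents("filenames.txt"):
    process_data(data)   # hypothetical placeholder for the real work

# For a very large individual file, mmap lets the OS page the data in on demand:
import mmap
f = open("big_input.xml", "rb")          # file name is illustrative
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)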
Upvotes: 0
Reputation: 11438
results = [open(f.strip()).read() for f in open("filenames.txt").readlines()]
This may be insignificantly faster, but it's probably less readable (depending on the reader's familiarity with list comprehensions).
Your main problem here is that your bottleneck is disk I/O; buying a faster disk will make much more of a difference than modifying your code.
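If you want to confirm that, a quick sanity check is to time nothing but the raw reads (a sketch, reusing filenames.txt from the question):

# Rough check that disk I/O dominates: time the raw reads and nothing else.
import time

t0 = time.time()
total = 0
for line in open("filenames.txt"):
    total += len(open(line.strip(), "r").read())
elapsed = time.time() - t0
print "read %d bytes in %.2fs (%.1f MB/s)" % (
    total, elapsed, total / (1024.0 * 1024.0) / elapsed)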
Upvotes: 1