mlzboy

Reputation: 14691

How to accelerate reads from batches of files

I read many files from my system. I want to read them faster, maybe like this:

results=[]
for file in open("filenames.txt").readlines():
    results.append(open(file,"r").read())

I don't want to use threading. Any advice is appreciated.

The reason I don't want to use threads is that they would make my code unreadable. I'm looking for some simple trick that makes it faster while keeping the code shorter and easier to understand.

Yesterday I tested another solution using multiprocessing, but it performed badly and I don't know why. Here is the code:

def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()
@cost_time
@statistics_db
def batch_xml2db():
    from multiprocessing import Pool,Queue
    p=Pool(5)
    #q=Queue()
    files=glob.glob(g_filter)
    #for file in files:
    #    q.put(file)

    # leftover from an earlier queue-based attempt; never called, so commented out
    #def P():
    #    while q.qsize()<>0:
    #        xml2db(q.get())
    p.map(xml2db,files)
    p.close()   # the pool has to be closed before join(), otherwise join() raises
    p.join()

Upvotes: 1

Views: 285

Answers (4)

kevpie

Reputation: 26088

Not sure if this is still the code you are using. Here are a couple of adjustments I would consider making.

Original:

def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()

Updated: remove the use of the dict; it's just extra object creation, iteration, and garbage collection.

def xml2db(file):
    s=pq(open(file,"r").read())

    p=Product()
    for k in g_fields:
        v=s("field[@name='%s']"%field).text()
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()

You could profile the code using the Python profiler.

This might tell you where the time is being spent. It may be in session.commit(); if so, it may help to commit only every couple of files instead of after every file.

I have no idea what the code does, so that is really a stab in the dark; you might also try running it without sending or writing any output.
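For example, a minimal profiling run might look something like this (just a sketch; batch_xml2db is the function from your question, and cProfile/pstats are in the standard library):

import cProfile
import pstats

# profile one batch run and dump the stats to a file
cProfile.run("batch_xml2db()", "batch.prof")

# show the 20 calls with the largest cumulative time
stats = pstats.Stats("batch.prof")
stats.sort_stats("cumulative").print_stats(20)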

If you can separate your code into reading, processing, and writing, you can test the cost of each stage individually:

A) Reading cost: see how long it takes just to read all the files.

B) Processing cost: load a single file into memory and process it enough times to represent the entire job, without any extra reading IO.

C) Output cost: save a whole batch of sessions representative of your job size.

This should show you what is taking the most time and whether any improvement can be made in any area.
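As a rough sketch of timing the stages separately (the stage functions here are placeholders for your own reading/processing/writing code, and g_filter is the glob pattern from your question):

import glob
import time

def read_all(files):
    # stage A: just read every file into memory, nothing else
    return [open(f, "r").read() for f in files]

def time_stage(label, func, *args):
    # run one stage in isolation and report how long it took
    start = time.time()
    result = func(*args)
    print "%s took %.2f seconds" % (label, time.time() - start)
    return result

files = glob.glob(g_filter)
raw = time_stage("A) reading", read_all, files)
# B) processing and C) output would be timed the same way,
# passing your own processing and commit functions to time_stage.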

Upvotes: 0

Richard Careaga

Reputation: 670

Is this a one-off requirement or something that you need to do regularly? If it's something you're going to be doing often, consider using MySQL or another database instead of a file system.

Upvotes: 0

Thomas M. DuBuisson

Reputation: 64740

Well, if you want to improve performance then improve the algorithm, right? What are you doing with all this data? Do you really need it all in memory at the same time, potentially causing OOM if filenames.txt specifies too many files or files that are too large?

If you're doing this with lots of files, I suspect you are thrashing, hence your 700s+ (roughly twelve minutes) of run time. Even my poor little HD can sustain 42 MB/s writes (42 MB/s * 714 s = 30 GB). Take that with a grain of salt, knowing you must both read and write, but I'm guessing you don't have over 8 GB of RAM available for this application. A related SO question/answer suggested using mmap, and the answer above that suggested an iterative/lazy read like what you get in Haskell for free. These are probably worth considering if you really do have tens of gigabytes to munge.
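As a sketch of the iterative/lazy idea (reading each file in fixed-size chunks instead of slurping everything into one big list; process_chunk is a placeholder for whatever you actually do with the data):

def read_in_chunks(path, chunk_size=1024 * 1024):
    # yield the file one megabyte at a time instead of reading it whole
    f = open(path, "rb")
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        f.close()

for name in open("filenames.txt"):
    for chunk in read_in_chunks(name.strip()):
        process_chunk(chunk)  # placeholder for your own processing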

Upvotes: 0

Zooba

Reputation: 11438

results = [open(f.strip()).read() for f in open("filenames.txt").readlines()]

This may be insignificantly faster, but it's probably less readable (depending on the reader's familiarity with list comprehensions).

Your main problem here is that your bottleneck is disk IO: buying a faster disk will make much more of a difference than modifying your code.
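One quick way to confirm that is to time the raw reads on their own; if they account for most of the runtime, no rewrite of the surrounding code will help much (a sketch, reusing the filenames.txt from your question):

import time

start = time.time()
total_bytes = 0
for name in open("filenames.txt"):
    total_bytes += len(open(name.strip(), "rb").read())
elapsed = time.time() - start
# report throughput so you can compare it against your disk's rated speed
print "read %d bytes in %.2f seconds (%.1f MB/s)" % (
    total_bytes, elapsed, total_bytes / (1024.0 * 1024.0) / elapsed)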

Upvotes: 1
