Brian

Reputation: 14846

Python, read many files and merge the results

I might be asking a very basic question, but I really can't figure out how to make a simple parallel application in Python. I am running my scripts on a machine with 16 cores and I would like to use all of them efficiently. I have 16 huge files to read and I would like each CPU to read one file and then merge the results. Here is a quick example of what I would like to do:

  from numpy import loadtxt, arange

  parameter1_glob = []
  parameter2_glob = []

  for cpu in arange(0, 16):
      parameter1, parameter2 = loadtxt('file' + str(cpu) + '.dat', unpack=True)

      parameter1_glob.append(parameter1)
      parameter2_glob.append(parameter2)

I think that the multiprocessing module might help but I couldn't understand how to apply it to what I want to do.
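
For reference, here is a minimal sketch of the kind of thing I have in mind with multiprocessing.Pool (untested; the helper name load_one is made up, and it assumes the per-file work is just the numpy loadtxt call above and that each result fits in memory):

  from multiprocessing import Pool
  from numpy import loadtxt

  def load_one(cpu):
      # read one file in a worker process; returns (parameter1, parameter2)
      return loadtxt('file' + str(cpu) + '.dat', unpack=True)

  if __name__ == '__main__':
      pool = Pool(processes=16)
      results = pool.map(load_one, range(16))   # one task per file
      pool.close()
      pool.join()

      parameter1_glob = [p1 for p1, p2 in results]
      parameter2_glob = [p2 for p1, p2 in results]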

Upvotes: 5

Views: 2600

Answers (3)

dimo414

Reputation: 48864

I agree with what Colin Dunklau said in his comment: this process will bottleneck on reading and writing these files, and the CPU demands are minimal. Even if you had 17 dedicated drives, you wouldn't be maxing out even one core. Additionally, though I realize this is tangential to your actual question, you'll likely run into memory limitations with these "huge" files: loading 16 files into memory as arrays and then combining them into another file will almost certainly take more memory than you have.

You may find better results by tackling this problem with shell scripting. In particular, GNU sort uses a memory-efficient merge sort to sort one or more files very rapidly, much faster than all but the most carefully written applications in Python or most other languages.

I would suggest avoiding any sort of multi-threading effort; it will dramatically add to the complexity for minimal benefit. Be sure to keep as little of the file(s) in memory at a time as possible, or you'll run out quickly. In any case, you will absolutely want the reading and the writing to happen on two separate disks. The slowdown associated with reading and writing to the same disk simultaneously is tremendously painful.
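
To make the memory point concrete, here is a minimal sketch of a streaming concatenation in plain Python (assuming the merge is a simple concatenation of the inputs; the output path is hypothetical and should live on a different disk from the inputs):

# Stream each input line by line so only one line is held in memory at a time.
input_names = ['file%d.dat' % i for i in range(16)]

with open('/otherdisk/merged.dat', 'w') as output:   # hypothetical path on a second disk
    for name in input_names:
        with open(name) as source:
            for line in source:
                output.write(line)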

Upvotes: 2

luispedro

Reputation: 7024

Assuming that the results from each file are smallish, you could do this with my package jug:

from numpy import loadtxt, arange
from jug import TaskGenerator

# Wrap loadtxt so that each call becomes a jug task that can run in a separate process
loadtxt = TaskGenerator(loadtxt)

@TaskGenerator
def write_parameter(oname, ps):
    with open(oname, 'w') as output:
        for p in ps:
            print >>output, p

parameter1_glob = []
parameter2_glob = []

for cpu in arange(0, 16):
    ps = loadtxt('file' + str(cpu) + '.dat', unpack=True)
    parameter1_glob.append(ps[0])
    parameter2_glob.append(ps[1])

write_parameter('output1.txt', parameter1_glob)
write_parameter('output2.txt', parameter2_glob)

Now you can start several "jug execute" processes (one per core, for example) and jug will distribute the tasks among them.

Upvotes: 0

Paulo Scardine

Reputation: 77359

Do you want to merge line by line? Sometimes coroutines are more interesting for I/O-bound applications than classic multitasking. You can chain generators and coroutines for all sorts of routing, merging and broadcasting. Blow your mind with this nice presentation by David Beazley.

You can use a coroutine as a sink (untested, please refer to dabeaz examples):

# dabeaz's coroutine decorator: prime the coroutine so it is ready to receive send()
def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        cr.next()
        return cr
    return start

# A sink that just prints the lines
@coroutine
def printer():
    while True:
        line = (yield)
        print line,

sources = [
    open('file1'),
    open('file2'),
    open('file3'),
    open('file4'),
    open('file5'),
    open('file6'),
    open('file7'),
]

output = printer()
while sources:
    # iterate over a copy so exhausted sources can be removed safely
    for source in list(sources):
        line = source.readline()
        if not line:  # EOF
            sources.remove(source)
            source.close()
            continue
        output.send(line)

Upvotes: 1
