Reputation: 7924
I have a text file of several GB with this format:
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 273 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 273 593868.39 6734999.90 124.44 1,
0 273 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
I have a simple function to filter it in Python 2.7 on Windows. The function reads the entire file, selects the lines with the same idtile (first and second column) and returns the list of points (x, y, z, and label) and the idtile.
tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
idtile = tiles_id[0]

def file_filter(name, idtile):
    lst = []
    for line in file(name, mode="r"):
        element = line.split()  # add value
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
            dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx
The file is more than 32 GB and the bottleneck is reading the file. I am looking for suggestions or examples to speed up my function (e.g. parallel computing or other approaches).
My current solution is to split the text file into tiles (using the x and y locations). That solution is not elegant and I am looking for an efficient approach.
Upvotes: 3
Views: 314
Reputation: 41813
Maybe the best and fastest way to solve your problem is to use a map-reduce algorithm on a (massively) parallel system.
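A minimal sketch of that idea on a single machine, using Python's multiprocessing (the chunking scheme and all names below are my own illustration, not something from this thread): every worker scans one byte range of the file (the "map" step) and the matching rows are concatenated at the end (the "reduce" step).

import multiprocessing
import os

def scan_chunk(args):
    # Map step: scan the lines that *start* inside [start, end) for the wanted idtile.
    name, idtile, start, end = args
    matches = []
    with open(name, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            if f.read(1) != "\n":
                f.readline()  # discard the partial line at the chunk boundary
        while f.tell() < end:
            element = f.readline().split()
            if len(element) >= 2 and (int(element[0]), int(element[1])) == idtile:
                matches.append(element[2:])
    return matches

def parallel_filter(name, idtile, nworkers=8):
    size = os.path.getsize(name)
    bounds = [size * i // nworkers for i in range(nworkers + 1)]
    jobs = [(name, idtile, bounds[i], bounds[i + 1]) for i in range(nworkers)]
    pool = multiprocessing.Pool(nworkers)
    return sum(pool.map(scan_chunk, jobs), [])  # reduce step: concatenate the per-chunk results

Note that on Windows (the asker's platform) the parallel_filter call must live under an if __name__ == "__main__": guard. Whether this actually beats a single sequential scan depends on the disk; on one spinning disk the extra seeking may even make it slower.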
Upvotes: 1
Reputation: 5330
Here are some statistics. I am going to update them as more solutions appear. The following program works with a file that consists of repetitions of the lines from the question.
import sys

def f0(name, idtile):
    # Baseline: read every line and do nothing with it.
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            pass

# A second baseline that also splits every line:
"""def f0(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            line.split()"""
def f1(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            element = line.split()  # add value
            if (int(element[0]), int(element[1])) == idtile:
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx
def f2(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            element = line.split()  # add value
            pdx, pdy = int(element[0]), int(element[1])
            if (pdx, pdy) == idtile:
                lst.append(element[2:])
                dy, dx = pdx, pdy
    return lst, dy, dx
def f3(name, idtile):
    lst = []
    id_str = "%d %d " % idtile
    with open(name) as f:
        for line in f:
            if line.startswith(id_str):
                element = line.split()  # add value
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx

functions = [f0, f1, f2, f3]
functions[int(sys.argv[1])]("./srcdata.txt", (0, 273))
The shell script for timing is straightforward:

#!/bin/sh
time python ./timing.py 0
time python ./timing.py 1
time python ./timing.py 2
time python ./timing.py 3

I prefer to run it this way so that previously run functions cannot influence the timing of the others.
And the result is:
0.02user 0.01system 0:00.03elapsed
0.42user 0.01system 0:00.44elapsed
0.32user 0.02system 0:00.34elapsed
0.33user 0.01system 0:00.35elapsed
The good news: reading the file is NOT the bottleneck, and removing the extra conversion to int is effective.
The bad news: I still do not know how to optimize it significantly.
Upvotes: 1
Reputation: 6117
Major rule of speed: Do less.

- You could create a sorted version of huge.txt and keep track of the positions of each idtile. So, if you search for (223872, 23239), you already know at which position in the file the given information is and can skip everything before it (file.seek). One might argue that this information is somewhat equal to an 'INDEX'. (A small sketch of this idea follows at the end of this answer.)
- You could use mmap to map the file into RAM.

Minor thought: skip lines that cannot match before calling line.split(), to avoid lots of tiny memory allocations.

Outline of how to use a database. Assume you have something like PostgreSQL around:
CREATE TABLE tiles
(
tile_x integer,
tile_y integer,
x double precision,
y double precision,
z double precision,
flag integer
);
Then you could replace all whitespace in the input file with | and strip the trailing , (to create a nice and shiny .csv) and feed that directly into the database:
COPY "tiles" from "\full\path\to\huge.txt" WITH DELIMITER "|";
You can then do fancy stuff like this:
SELECT * FROM "tiles";
tile_x | tile_y | x | y | z | flag
-------+--------+-----------+------------+--------+-----
0 | 274 | 593869.99 | 6734999.96 | 121.83 | 1
0 | 273 | 593869.51 | 6734999.92 | 121.57 | 1
0 | 273 | 593869.15 | 6734999.89 | 121.57 | 1
0 | 273 | 593868.79 | 6734999.86 | 121.65 | 1
0 | 273 | 593868.44 | 6734999.84 | 121.65 | 1
0 | 273 | 593869 | 6734999.94 | 124.21 | 1
0 | 273 | 593868.68 | 6734999.92 | 124.32 | 1
0 | 273 | 593868.39 | 6734999.9 | 124.44 | 1
0 | 273 | 593866.94 | 6734999.71 | 121.37 | 1
0 | 273 | 593868.73 | 6734999.99 | 127.28 | 1
Or something like this:
SELECT * FROM "tiles" WHERE tile_y > 273;
tile_x | tile_y | x | y | z | flag
-------+--------+-----------+------------+--------+-----
0 | 274 | 593869.99 | 6734999.96 | 121.83 | 1
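Back to the file.seek idea from the first bullet: here is a minimal sketch of such an offset index (my own illustration, not part of the original answer; it assumes the file is sorted so that each idtile occupies one contiguous block of lines):

def build_index(name):
    # One full pass: remember the byte offset at which each idtile starts.
    # Binary mode keeps the offsets byte-accurate on Windows.
    index = {}
    with open(name, "rb") as f:
        offset = 0
        for line in f:
            element = line.split()
            idtile = (int(element[0]), int(element[1]))
            if idtile not in index:
                index[idtile] = offset
            offset += len(line)
    return index

def read_tile(name, index, idtile):
    # Jump straight to the tile and stop at the first non-matching line.
    lst = []
    with open(name, "rb") as f:
        f.seek(index[idtile])
        for line in f:
            element = line.split()
            if (int(element[0]), int(element[1])) != idtile:
                break
            lst.append(element[2:])
    return lst

The index itself could be pickled once and reused across runs.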
Upvotes: 1
Reputation: 41813
You can convert your filter to a generator function:
def file_filter(name):
    lst = []
    idtile = None
    dx, dy = None, None
    for line in file(name, mode="r"):
        element = line.split()  # add value
        if idtile is None:
            idtile = (int(element[0]), int(element[1]))
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
            dy, dx = int(element[0]), int(element[1])
        else:
            yield lst, dx, dy
            # start collecting the next tile with the current line
            idtile = (int(element[0]), int(element[1]))
            lst = [element[2:]]
            dy, dx = idtile
    if lst:
        yield lst, dx, dy  # do not lose the last tile in the file
This function should yield one tuple of list_of_data, dx, dy for each idtile, provided that the file is sorted on idtile.
Now you can use it like this:
for lst, dx, dy in file_filter('you_name_it'):
    process_tile_data(lst, dx, dy)
Upvotes: 1
Reputation: 25197
You can avoid doing the split() and int() on every line by doing a string comparison instead:
def file_filter(name, idtile):
    lst = []
    id_str = "%d %d " % idtile
    with open(name) as f:
        for line in f:
            if line.startswith(id_str):
                element = line.split()  # add value
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx
Upvotes: 1
Reputation: 97281
My solution is to split the large text file into many small binary files, one for each idtile. To read the text file faster, you can use pandas:
import pandas as pd
import numpy as np

n = 400000  # read n rows as one block
for df in pd.read_table(large_text_file, sep=" ", comment=",", header=None, chunksize=n):
    for key, g in df.groupby([0, 1]):
        fn = "%d_%d.tmp" % key
        with open(fn, "ab") as f:
            data = g.ix[:, 2:5].values
            data.tofile(f)
Then you can get the content of one binary file with:
np.fromfile("0_273.tmp").reshape(-1, 4)
Upvotes: 1
Reputation: 2563
Okay. If you need to do this many times, you obviously need to create some sort of an index. But if this is not a frequent activity, the best bet would be to multithread it like this:
import threading
import Queue

NUMWORKERS = 8
workQ = Queue.Queue()

def file_filter(name, idtile):
    lst = []
    workers = []
    for i in range(NUMWORKERS):
        worker = threading.Thread(target=workerThread, args=(lst, idtile))
        worker.start()
        workers.append(worker)
    for line in file(name, mode="r"):
        workQ.put(line)
    for i in range(NUMWORKERS):
        workQ.put(None)  # one stop sentinel per worker
    for worker in workers:
        worker.join()
    return lst, idtile[0], idtile[1]

def workerThread(lst, idtile):
    while True:
        line = workQ.get()
        if line is None:
            return
        element = line.split()  # add value
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
In case this is a very frequent activity that happens for each idtile, the solution would be drastically different. Doing this for a number of idtiles together will give you the best amortized performance, because any number of already-known idtiles can be processed in a single loop over the file.
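A minimal sketch of that last point (the helper below is my own illustration, not from the answer): all known idtiles are collected in a single pass by keying the result lists on the tile.

def file_filter_many(name, idtiles):
    # idtiles: an iterable of (dy, dx) tuples that are known in advance
    results = dict((idtile, []) for idtile in idtiles)
    for line in open(name):
        element = line.split()
        key = (int(element[0]), int(element[1]))
        if key in results:
            results[key].append(element[2:])
    return results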
Upvotes: 0
Reputation: 100766
I suggest you change your code so that you read the big file once and write (temporary) files for each tile id. Something like:
def create_files(name, idtiles=None):
    # idtiles: an optional collection restricting which tiles to extract
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        if idtile not in files:
            files[idtile] = open("tempfile_{}_{}".format(elements[0], elements[1]), "w")
        print >>files[idtile], line,  # trailing comma: line already ends with a newline
    for f in files.itervalues():
        f.close()
    return files
create_files() will return a {(tilex, tiley): fileobject} dictionary.
Here is a variant that closes each file after writing each line, to work around the "Too many open files" error. This variant returns a {(tilex, tiley): filename} dictionary and will probably be a bit slower.
def create_files(name, idtiles=None):
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        filename = "tempfile_{}_{}".format(elements[0], elements[1])
        files[idtile] = filename
        with open(filename, "a") as f:
            print >>f, line,  # trailing comma: line already ends with a newline
    return files
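A possible second pass over the result (illustrative only): since every temporary file now holds exactly one tile, reading a tile back no longer touches the 32 GB file.

files = create_files("huge.txt")  # second variant: {(tilex, tiley): filename}
for idtile, filename in files.iteritems():
    with open(filename) as f:
        points = [line.split()[2:] for line in f]
    # ... process the points of this tile ...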
Upvotes: 1
Reputation: 5330
I would suggest comparing the time used for your full reading procedure with the time for just reading the lines and doing nothing with them. If those times are close, the only thing you can really do is change the approach (splitting your files etc.), because what you can probably optimize is the data processing time, not the file reading time.
I also see two things in your code that are worth fixing:
with open(name) as f:
    for line in f:
        pass  # here goes the loop body
Use with to explicitly close your file. Your solution should work in CPython, because its reference counting closes the file for you, but that is implementation-dependent and not guaranteed to always work.
You perform the transformation of a string to int twice. It is a relatively slow operation; remove the second one by reusing the result.
P.S. It looks like an array of depth or height values for a set of points on the Earth's surface, and the surface is split into tiles. :-)
Upvotes: 1
Reputation: 3024
Your idtiles appear to be in a certain order. That is, the sample data suggests that once you traverse a certain idtile and hit the next one, there is no chance that a line with that idtile will show up again. If this is the case, you may break the for loop once you finish dealing with the idtile you want and hit a different one. Off the top of my head:
loopkiller = False
for line in file(name, mode="r"):
    element = line.split()
    if (int(element[0]), int(element[1])) == idtile:
        lst.append(element[2:])
        dy, dx = int(element[0]), int(element[1])
        loopkiller = True
    elif loopkiller:
        break
This way, once you are done with a certain 'idtile', you stop; whereas in your example, you keep on reading until the end of the file.
If your idtiles appear in a random order, maybe you could try writing an ordered version of your file first.
Also, evaluating the digits of your idtiles separately may help you traverse the file faster. Supposing your idtile is a two-tuple of a one-digit and a three-digit integer, perhaps something along the lines of:
for line in file(name, mode="r"):
    element = line.split()
    if int(element[0][0]) == idtile[0]:
        if element[1][0] == str(idtile[1])[0]:
            if element[1][1] == str(idtile[1])[1]:
                if element[1][2] == str(idtile[1])[2]:
                    dy, dx = int(element[0]), int(element[1])
                else: go_forward(walk)    # pseudocode: skip a few lines
            else: go_forward(run)         # skip more lines
        else: go_forward(sprint)          # skip even more
    else: go_forward(warp)                # skip the most
Upvotes: 1