Reputation: 7924
I have a text file of several GB with this format:
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 273 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 273 593868.39 6734999.90 124.44 1,
0 273 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
I have a simple function to filter it in Python 2.7 on Windows. The function reads the entire file, selects the lines with the same idtile (first and second column) and returns the list of points (x, y, z, and label) and the idtile.
tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
idtile = tiles_id[0]

def file_filter(name, idtile):
    lst = []
    for line in file(name, mode="r"):
        element = line.split()  # add value
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
            dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx
The file is more than 32 GB and the bottleneck is reading the file. I am looking for suggestions or examples to speed up my function (e.g. parallel computing or other approaches).
My current solution is to split the text file into tiles (using the x and y locations). That solution is not elegant and I am looking for an efficient approach.
Upvotes: 3
Views: 314
Reputation: 41813
Maybe the best and fastest way to solve your problem is to use a map-reduce algorithm on a (massively) parallel system.
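A minimal sketch of that idea on a single machine, using Python's multiprocessing (the chunking scheme and all names below are my own illustration, not something from this thread): every worker scans one byte range of the file (the "map" step) and the matching rows are concatenated at the end (the "reduce" step).

import multiprocessing
import os

def scan_chunk(args):
    # Map step: scan the lines that *start* inside [start, end) for the wanted idtile.
    name, idtile, start, end = args
    matches = []
    with open(name, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            if f.read(1) != "\n":
                f.readline()  # discard the partial line at the chunk boundary
        while f.tell() < end:
            element = f.readline().split()
            if len(element) >= 2 and (int(element[0]), int(element[1])) == idtile:
                matches.append(element[2:])
    return matches

def parallel_filter(name, idtile, nworkers=8):
    size = os.path.getsize(name)
    bounds = [size * i // nworkers for i in range(nworkers + 1)]
    jobs = [(name, idtile, bounds[i], bounds[i + 1]) for i in range(nworkers)]
    pool = multiprocessing.Pool(nworkers)
    return sum(pool.map(scan_chunk, jobs), [])  # reduce step: concatenate the per-chunk results

Note that on Windows (the asker's platform) the parallel_filter call must live under an if __name__ == "__main__": guard. Whether this actually beats a single sequential scan depends on the disk; on one spinning disk the extra seeking may even make it slower.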
Upvotes: 1
Reputation: 5330
Here are some statistics. I am going to update them as more solutions appear. The following program works with a file that consists of repetitions of the lines from the question.
import sys

def f0(name, idtile):
    # Baseline: read every line and do nothing with it.
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            pass

# A second baseline that also splits every line:
"""def f0(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            line.split()"""
def f1(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            element = line.split()  # add value
            if (int(element[0]), int(element[1])) == idtile:
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx
def f2(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            element = line.split()  # add value
            pdx, pdy = int(element[0]), int(element[1])
            if (pdx, pdy) == idtile:
                lst.append(element[2:])
                dy, dx = pdx, pdy
    return lst, dy, dx
def f3(name, idtile):
    lst = []
    id_str = "%d %d " % idtile
    with open(name) as f:
        for line in f:
            if line.startswith(id_str):
                element = line.split()  # add value
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx

functions = [f0, f1, f2, f3]
functions[int(sys.argv[1])]("./srcdata.txt", (0, 273))
The shell script for timing is straightforward:

#!/bin/sh
time python ./timing.py 0
time python ./timing.py 1
time python ./timing.py 2
time python ./timing.py 3

I prefer to run it this way so that previously run functions cannot influence the timing of the others.
And the result is:
0.02user 0.01system 0:00.03elapsed
0.42user 0.01system 0:00.44elapsed
0.32user 0.02system 0:00.34elapsed
0.33user 0.01system 0:00.35elapsed
The good news: reading the file is NOT the bottleneck, and removing the extra conversion to int is effective.
The bad news: I still do not know how to optimize it significantly.
Upvotes: 1
Reputation: 6117
Major rule of speed: Do less.

- You could create a sorted version of huge.txt and keep track of the positions of each idtile. So, if you search for (223872, 23239), you already know at which position in the file the given information is and can skip everything before it (file.seek). One might argue that this information is somewhat equal to an 'INDEX'. (A small sketch of this idea follows at the end of this answer.)
- You could use mmap to map the file into RAM.

Minor thought: skip lines that cannot match before calling line.split(), to avoid lots of tiny memory allocations.

Outline of how to use a database. Assume you have something like PostgreSQL around:
CREATE TABLE tiles
(
tile_x integer,
tile_y integer,
x double precision,
y double precision,
z double precision,
flag integer
);
Then you could replace all whitespace in the input file with | and strip the trailing , (to create a nice and shiny .csv) and feed that directly into the database:
COPY "tiles" from "\full\path\to\huge.txt" WITH DELIMITER "|";
You can then do fancy stuff like this:
SELECT * FROM "tiles";
tile_x | tile_y | x | y | z | flag
-------+--------+-----------+------------+--------+-----
0 | 274 | 593869.99 | 6734999.96 | 121.83 | 1
0 | 273 | 593869.51 | 6734999.92 | 121.57 | 1
0 | 273 | 593869.15 | 6734999.89 | 121.57 | 1
0 | 273 | 593868.79 | 6734999.86 | 121.65 | 1
0 | 273 | 593868.44 | 6734999.84 | 121.65 | 1
0 | 273 | 593869 | 6734999.94 | 124.21 | 1
0 | 273 | 593868.68 | 6734999.92 | 124.32 | 1
0 | 273 | 593868.39 | 6734999.9 | 124.44 | 1
0 | 273 | 593866.94 | 6734999.71 | 121.37 | 1
0 | 273 | 593868.73 | 6734999.99 | 127.28 | 1
Or something like this:
SELECT * FROM "tiles" WHERE tile_y > 273;
tile_x | tile_y | x | y | z | flag
-------+--------+-----------+------------+--------+-----
0 | 274 | 593869.99 | 6734999.96 | 121.83 | 1
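Back to the file.seek idea from the first bullet: here is a minimal sketch of such an offset index (my own illustration, not part of the original answer; it assumes the file is sorted so that each idtile occupies one contiguous block of lines):

def build_index(name):
    # One full pass: remember the byte offset at which each idtile starts.
    # Binary mode keeps the offsets byte-accurate on Windows.
    index = {}
    with open(name, "rb") as f:
        offset = 0
        for line in f:
            element = line.split()
            idtile = (int(element[0]), int(element[1]))
            if idtile not in index:
                index[idtile] = offset
            offset += len(line)
    return index

def read_tile(name, index, idtile):
    # Jump straight to the tile and stop at the first non-matching line.
    lst = []
    with open(name, "rb") as f:
        f.seek(index[idtile])
        for line in f:
            element = line.split()
            if (int(element[0]), int(element[1])) != idtile:
                break
            lst.append(element[2:])
    return lst

The index itself could be pickled once and reused across runs.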
Upvotes: 1
Reputation: 41813
You can convert your filter to a generator function:
def file_filter(name):
    lst = []
    idtile = None
    dx, dy = None, None
    for line in file(name, mode="r"):
        element = line.split()  # add value
        if idtile is None:
            idtile = (int(element[0]), int(element[1]))
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
            dy, dx = int(element[0]), int(element[1])
        else:
            yield lst, dx, dy
            # start collecting the next tile with the current line
            idtile = (int(element[0]), int(element[1]))
            lst = [element[2:]]
            dy, dx = idtile
    if lst:
        yield lst, dx, dy  # do not lose the last tile in the file
This function should yield one tuple of list_of_data, dx, dy for each idtile, provided that the file is sorted on idtile.
Now you can use it like this:
for lst, dx, dy in file_filter('you_name_it'):
    process_tile_data(lst, dx, dy)
Upvotes: 1
Reputation: 25197
You can avoid doing the split() and int() on every line by doing a string comparison instead:
def file_filter(name, idtile):
    lst = []
    id_str = "%d %d " % idtile
    with open(name) as f:
        for line in f:
            if line.startswith(id_str):
                element = line.split()  # add value
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx
Upvotes: 1
Reputation: 97281
My solution is to split the large text file into many small binary files, one for each idtile. To read the text file faster, you can use pandas:
import pandas as pd
import numpy as np

n = 400000  # read n rows as one block
for df in pd.read_table(large_text_file, sep=" ", comment=",", header=None, chunksize=n):
    for key, g in df.groupby([0, 1]):
        fn = "%d_%d.tmp" % key
        with open(fn, "ab") as f:
            data = g.ix[:, 2:5].values
            data.tofile(f)
Then you can get the content of one binary file with:
np.fromfile("0_273.tmp").reshape(-1, 4)
Upvotes: 1
Reputation: 2563
Okay. If you need to do this many times, you obviously need to create some sort of an index. But if this is not a frequent activity, the best bet would be to multithread it like this:
import threading
import Queue

NUMWORKERS = 8
workQ = Queue.Queue()

def file_filter(name, idtile):
    lst = []
    workers = []
    for i in range(NUMWORKERS):
        worker = threading.Thread(target=workerThread, args=(lst, idtile))
        worker.start()
        workers.append(worker)
    for line in file(name, mode="r"):
        workQ.put(line)
    for i in range(NUMWORKERS):
        workQ.put(None)  # one stop sentinel per worker
    for worker in workers:
        worker.join()
    return lst, idtile[0], idtile[1]

def workerThread(lst, idtile):
    while True:
        line = workQ.get()
        if line is None:
            return
        element = line.split()  # add value
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
In case this is a very frequent activity that happens for each idtile, the solution would be drastically different. Doing this for a number of idtiles together will give you the best amortized performance, because any number of already-known idtiles can be processed in a single loop over the file.
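A minimal sketch of that last point (the helper below is my own illustration, not from the answer): all known idtiles are collected in a single pass by keying the result lists on the tile.

def file_filter_many(name, idtiles):
    # idtiles: an iterable of (dy, dx) tuples that are known in advance
    results = dict((idtile, []) for idtile in idtiles)
    for line in open(name):
        element = line.split()
        key = (int(element[0]), int(element[1]))
        if key in results:
            results[key].append(element[2:])
    return results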
Upvotes: 0
Reputation: 100766
I suggest you change your code so that you read the big file once and write (temporary) files for each tile id. Something like:
def create_files(name, idtiles=None):
    # idtiles: an optional collection restricting which tiles to extract
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        if idtile not in files:
            files[idtile] = open("tempfile_{}_{}".format(elements[0], elements[1]), "w")
        print >>files[idtile], line,  # trailing comma: line already ends with a newline
    for f in files.itervalues():
        f.close()
    return files
create_files() will return a {(tilex, tiley): fileobject} dictionary.
Here is a variant that closes each file after writing each line, to work around the "Too many open files" error. This variant returns a {(tilex, tiley): filename} dictionary and will probably be a bit slower.
def create_files(name, idtiles=None):
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        filename = "tempfile_{}_{}".format(elements[0], elements[1])
        files[idtile] = filename
        with open(filename, "a") as f:
            print >>f, line,  # trailing comma: line already ends with a newline
    return files
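A possible second pass over the result (illustrative only): since every temporary file now holds exactly one tile, reading a tile back no longer touches the 32 GB file.

files = create_files("huge.txt")  # second variant: {(tilex, tiley): filename}
for idtile, filename in files.iteritems():
    with open(filename) as f:
        points = [line.split()[2:] for line in f]
    # ... process the points of this tile ...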
Upvotes: 1
Reputation: 5330
I would suggest comparing the time used for your full reading procedure with the time for just reading the lines and doing nothing with them. If those times are close, the only thing you can really do is change the approach (splitting your files etc.), because what you can probably optimize is the data processing time, not the file reading time.
I also see two things in your code that are worth fixing:
with open(name) as f:
    for line in f:
        pass  # here goes the loop body
Use with to explicitly close your file. Your solution should work in CPython, because its reference counting closes the file for you, but that is implementation-dependent and not guaranteed to always work.
You perform the transformation of a string to int twice. It is a relatively slow operation; remove the second one by reusing the result.
P.S. It looks like an array of depth or height values for a set of points on the Earth's surface, and the surface is split into tiles. :-)
Upvotes: 1
Reputation: 3024
Your idtiles appear to be in a certain order. That is, the sample data suggests that once you traverse a certain idtile and hit the next one, there is no chance that a line with that idtile will show up again. If this is the case, you may break the for loop once you finish dealing with the idtile you want and hit a different one. Off the top of my head:
loopkiller = False
for line in file(name, mode="r"):
    element = line.split()
    if (int(element[0]), int(element[1])) == idtile:
        lst.append(element[2:])
        dy, dx = int(element[0]), int(element[1])
        loopkiller = True
    elif loopkiller:
        break
This way, once you are done with a certain 'idtile', you stop; whereas in your example, you keep on reading until the end of the file.
If your idtiles appear in a random order, maybe you could try writing an ordered version of your file first.
Also, evaluating the digits of your idtiles separately may help you traverse the file faster. Supposing your idtile is a two-tuple of a one-digit and a three-digit integer, perhaps something along the lines of:
for line in file(name, mode="r"):
    element = line.split()
    if int(element[0][0]) == idtile[0]:
        if element[1][0] == str(idtile[1])[0]:
            if element[1][1] == str(idtile[1])[1]:
                if element[1][2] == str(idtile[1])[2]:
                    dy, dx = int(element[0]), int(element[1])
                else: go_forward(walk)    # pseudocode: skip a few lines
            else: go_forward(run)         # skip more lines
        else: go_forward(sprint)          # skip even more
    else: go_forward(warp)                # skip the most
Upvotes: 1