Am1rr3zA

Reputation: 7421

How to read a super huge file into numpy array N lines at a time

I have a huge file (around 30 GB); each line contains the coordinates of a point on a 2D surface. I need to load the file into a NumPy array, points = np.empty((0, 2)), and apply scipy.spatial.ConvexHull to it. The file is too large to load into memory all at once, so I want to load it as batches of N lines, apply scipy.spatial.ConvexHull to each small part, and then load the next N rows. What's an efficient way to do this?
I found out that in Python you can use islice to read N lines of a file, but the problem is that lines_gen is a generator object, which yields the lines of the file one at a time and is meant to be consumed in a loop, so I am not sure how to convert lines_gen into a NumPy array efficiently.

from itertools import islice
with open(input, 'r') as infile:
    lines_gen = islice(infile, N)
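
Roughly, this is the loop I have in mind (an untested sketch: the batch size N, the np.loadtxt call, and the idea of carrying only the previous hull's vertices into the next batch are just my guesses):

import numpy as np
from itertools import islice
from scipy.spatial import ConvexHull

N = 1000000                        # lines per batch -- placeholder value
hull_points = np.empty((0, 2))     # hull vertices kept so far

with open(input, 'r') as infile:
    while True:
        batch = list(islice(infile, N))
        if not batch:
            break
        pts = np.loadtxt(batch)              # parse the batch into an (n, 2) array
        pts = np.vstack((hull_points, pts))  # add the previously kept hull vertices
        hull = ConvexHull(pts)               # needs at least 3 non-collinear points
        hull_points = pts[hull.vertices]     # keep only the current hull's vertices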

My input file:

0.989703    1
0   0
0.0102975   0
0.0102975   0
1   1
0.989703    1
1   1
0   0
0.0102975   0
0.989703    1
0.979405    1
0   0
0.020595    0
0.020595    0
1   1
0.979405    1
1   1
0   0
0.020595    0
0.979405    1
0.969108    1
...
...
...
0   0
0.0308924   0
0.0308924   0
1   1
0.969108    1
1   1
0   0
0.0308924   0
0.969108    1
0.95881 1
0   0

Upvotes: 8

Views: 9962

Answers (4)

hpaulj

Reputation: 231375

With your data, I can read it in 5 line chunks like this:

In [182]: from itertools import islice
with open(input,'r') as infile:
    while True:
        gen = islice(infile,N)
        arr = np.genfromtxt(gen, dtype=None)
        print arr
        if arr.shape[0]<N:
            break
   .....:             
[(0.989703, 1) (0.0, 0) (0.0102975, 0) (0.0102975, 0) (1.0, 1)]
[(0.989703, 1) (1.0, 1) (0.0, 0) (0.0102975, 0) (0.989703, 1)]
[(0.979405, 1) (0.0, 0) (0.020595, 0) (0.020595, 0) (1.0, 1)]
[(0.979405, 1) (1.0, 1) (0.0, 0) (0.020595, 0) (0.979405, 1)]
[(0.969108, 1) (0.0, 0) (0.0308924, 0) (0.0308924, 0) (1.0, 1)]
[(0.969108, 1) (1.0, 1) (0.0, 0) (0.0308924, 0) (0.969108, 1)]
[(0.95881, 1) (0.0, 0)]

The same thing read as one chunk is:

In [183]: with open(input,'r') as infile:
    arr = np.genfromtxt(infile, dtype=None)
   .....:     
In [184]: arr
Out[184]: 
array([(0.989703, 1), (0.0, 0), (0.0102975, 0), (0.0102975, 0), (1.0, 1),
       (0.989703, 1), (1.0, 1), (0.0, 0), (0.0102975, 0), (0.989703, 1),
       (0.979405, 1), (0.0, 0), (0.020595, 0), (0.020595, 0), (1.0, 1),
       (0.979405, 1), (1.0, 1), (0.0, 0), (0.020595, 0), (0.979405, 1),
       (0.969108, 1), (0.0, 0), (0.0308924, 0), (0.0308924, 0), (1.0, 1),
       (0.969108, 1), (1.0, 1), (0.0, 0), (0.0308924, 0), (0.969108, 1),
       (0.95881, 1), (0.0, 0)], 
      dtype=[('f0', '<f8'), ('f1', '<i4')])

(This is in Python 2.7; in 3 there's a byte/string issue I need to work around).
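
For reference, a Python 3 version of the same loop might look like this (not part of the original session; it assumes NumPy >= 1.14, where genfromtxt gained an encoding keyword):

import numpy as np
from itertools import islice

N = 5
with open('points.txt') as infile:       # 'points.txt' is a placeholder file name
    while True:
        chunk = list(islice(infile, N))  # next N lines as a list of strings
        if not chunk:
            break
        arr = np.genfromtxt(chunk, dtype=None, encoding=None)
        print(arr)
        if len(chunk) < N:               # a short chunk means end of file
            break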

Upvotes: 5

Denti

Reputation: 434

You could look at DAGPype's chunk_stream_bytes. I haven't worked with it myself, but I hope it helps.

This is an example of reading and processing some .csv file (_f_name) in chunks:

np.chunk_stream_bytes(_f_name, num_cols=2) | \
    filt(lambda a: a[logical_and(a[:, 0] < 10, a[:, 1] < 10), :]) | \
    np.corr()
# note: np.chunk_stream_bytes, filt and np.corr are DAGPype pipeline stages here; np is not numpy

Upvotes: 0

Haleemur Ali

Reputation: 28243

You can define a chunk reader as a generator function, as follows:

def read_file_chunk(fname, chunksize=500000):
    with open(fname, 'r') as myfile:
        lines = []
        for i, line in enumerate(myfile, start=1):
            line_values = tuple(float(val) for val in line.split())
            lines.append(line_values)
            if i % chunksize == 0:
                yield lines
                lines = [] # reset the list for the next chunk
        if lines:
            yield lines # final few lines of the file.

# and, assuming the function you want to apply is called `my_func`
chunk_gen = read_file_chunk(my_file_name)
for chunk in chunk_gen:
    my_func(chunk)
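
For example, my_func could build an array from each chunk and run the hull computation on it (just an illustration, not part of the answer; it assumes every chunk holds at least three non-collinear points):

import numpy as np
from scipy.spatial import ConvexHull

def my_func(chunk):
    pts = np.array(chunk)      # chunk is a list of (x, y) tuples -> (n, 2) array
    hull = ConvexHull(pts)     # assumes >= 3 non-collinear points in the chunk
    return pts[hull.vertices]  # vertices of this chunk's convex hull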

Upvotes: 1

Mr. Girgitt

Reputation: 2903

You could try the second method from this post: read the file in chunks by seeking to a given line through a pre-computed line-offset array, provided that array fits into memory. Here is an example of what I typically use to avoid loading whole files into memory:

data_file = open("data_file.txt", "rb")

# Build an index of byte offsets, one entry per line.
line_offset = []
offset = 0

while 1:
    lines = data_file.readlines(100000)  # read roughly 100 kB worth of lines at a time
    if not lines:
        break

    for line in lines:
        line_offset.append(offset)
        offset += len(line)

# Reading an arbitrary line by its number:
line_to_read = 1

data_file.seek(line_offset[line_to_read])
line = data_file.readline()
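
A chunk of N consecutive lines can then be pulled into a NumPy array along these lines (my own sketch on top of the answer's data_file and line_offset; N and start_line are placeholders):

import numpy as np

N = 5             # lines per chunk (placeholder)
start_line = 0    # index of the first line in the chunk

data_file.seek(line_offset[start_line])
chunk = [data_file.readline() for _ in range(N)]
chunk = [ln for ln in chunk if ln]       # drop empty reads past end of file
arr = np.genfromtxt(chunk, dtype=None)   # structured array of (x, y) rows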

Upvotes: 2
