Reputation: 37172
I have a very big file (4 GB), and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece, store the processed piece into another file and read the next piece.
Is there any method to yield these pieces?
I would love to have a lazy method.
Upvotes: 386
Views: 382973
Reputation: 3078
Update: You can also use file_object.readlines if you want each chunk to contain only complete lines; that is, no unfinished lines will be present in the result.
For example:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.readlines(chunk_size)
        if not data:
            break
        yield data
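A short usage sketch of the above; process_line here is just a hypothetical placeholder for whatever per-line work you do:

# each chunk yielded by read_in_chunks is a list of complete lines
with open('big_file.txt') as f:
    for lines in read_in_chunks(f):
        for line in lines:
            process_line(line)  # process_line is a placeholder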
-- Adding on to the answer given --

When I was reading a file in chunks (say a text file named split.txt), the issue I faced was that my use case processed the data line by line, and because I was reading the file in chunks, a chunk would sometimes end with a partial line, which broke my code (it expected a complete line to process).

After reading around, I realized I could overcome this issue by keeping track of the last bit of each chunk: if a chunk ends in the middle of a line, I store that partial last line in a variable and concatenate it with the unfinished line at the start of the next chunk. That way every line gets processed whole, and I was able to get over the issue.

Sample code:
# in this function I am reading the file in chunks
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# file where I am writing my final output
write_file = open('split.txt', 'w')
# variable I am using to store the last partial line from the chunk
placeholder = ''
file_count = 1

try:
    with open('/Users/rahulkumarmandal/Desktop/combined.txt') as f:
        for piece in read_in_chunks(f):
            # print('---->>>', piece, '<<<--')
            line_by_line = piece.split('\n')
            for one_line in line_by_line:
                # if placeholder is set, the last chunk had a partial line that we need to concatenate with the current one
                if placeholder:
                    # print('----->', placeholder)
                    # concatenating the previous partial line with the current one
                    one_line = placeholder + one_line
                    # then setting the placeholder empty, so that if the next chunk has a partial line we can store it again
                    placeholder = ''

                # further logic that revolves around my specific use case
                segregated_data = one_line.split('~')
                # print(len(segregated_data), type(segregated_data), one_line)
                if len(segregated_data) < 18:
                    placeholder = one_line
                    continue
                else:
                    placeholder = ''
                # print('--------', segregated_data)
                if segregated_data[2] == '2020' and segregated_data[3] == '2021':
                    # write this
                    data = "~".join(segregated_data)
                    # print('data', data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
                elif segregated_data[2] == '2021' and segregated_data[3] == '2022':
                    # write this
                    data = "-".join(segregated_data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
except Exception as e:
    print('error is', e)
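For reference, a more minimal, use-case-agnostic sketch of the same trick, built on top of the read_in_chunks() generator above; it carries the unfinished tail of each chunk into the next one (the name lines_from_chunks is just illustrative):

def lines_from_chunks(file_object, chunk_size=1024):
    """Yield only complete lines from a file read in fixed-size chunks."""
    leftover = ''
    for chunk in read_in_chunks(file_object, chunk_size):
        chunk = leftover + chunk
        lines = chunk.split('\n')
        leftover = lines.pop()   # the last element may be a partial line
        for line in lines:
            yield line
    if leftover:
        yield leftover           # whatever remains at end of file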
Upvotes: 0
Reputation: 482
Refer to Python's official documentation on iter(): https://docs.python.org/3/library/functions.html#iter
Maybe this method is more Pythonic:
"""A file object returned by open() is a iterator with
read method which could specify current read's block size
"""
with open('mydata.db', 'r') as f_in:
block_read = partial(f_in.read, 1024 * 1024)
block_iterator = iter(block_read, '')
for index, block in enumerate(block_iterator, start=1):
block = process_block(block) # process your block data
with open(f'{index}.txt', 'w') as f_out:
f_out.write(block)
Upvotes: 7
Reputation: 2552
There are already many good answers, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size blocks), these answers will not help you.
99% of the time, it is possible to process files line by line. Then, as suggested in this answer, you can use the file object itself as a lazy generator:
with open('big.csv') as f:
    for line in f:
        process(line)
However, one may run into very big files where the row separator is not '\n' (a common case is '|'). Converting '|' to '\n' before processing may not be an option, because it can mess up fields which may legitimately contain '\n' (e.g. free-text user input). For these kinds of situations, I created the following snippet [updated in May 2021 for Python 3.8+]:
def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    row = ''
    while (chunk := f.read(chunksize)) != '':   # stop at end of file
        while (i := chunk.find(sep)) != -1:     # as long as a separator is found
            yield row + chunk[:i]
            chunk = chunk[i+1:]
            row = ''
        row += chunk
    yield row
[For older versions of Python]:
def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '':  # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:  # No separator found
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk
I was able to use it successfully to solve various problems. It has been extensively tested, with various chunk sizes. Here is the test suite I am using, for those who need to convince themselves:
import os

test_file = 'test_file'

def cleanup(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        os.unlink(test_file)
    return wrapper

@cleanup
def test_empty(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1_char_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1_char(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1025_chars_1_row(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1024_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1023):
            f.write('a')
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1026

@cleanup
def test_2048_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk --
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_2049_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk --
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

if __name__ == '__main__':
    for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        test_empty(chunksize)
        test_1_char_2_rows(chunksize)
        test_1_char(chunksize)
        test_1025_chars_1_row(chunksize)
        test_1024_chars_2_rows(chunksize)
        test_1025_chars_1026_rows(chunksize)
        test_2048_chars_2_rows(chunksize)
        test_2049_chars_2_rows(chunksize)
Upvotes: 49
Reputation:
In Python 3.8+ you can use .read() in a while loop:
with open("somefile.txt") as f:
while chunk := f.read(8192):
do_something(chunk)
Of course, you can use any chunk size you want; you don't have to use 8192 (2**13) bytes. Unless your file's size happens to be a multiple of your chunk size, the last chunk will be smaller than your chunk size.
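Since the question also wants to write each processed piece to another file, the same pattern works in binary mode as well; a rough sketch, where process_chunk stands in for whatever processing you need:

with open("somefile.dat", "rb") as src, open("processed.dat", "wb") as dst:
    while chunk := src.read(8192):
        dst.write(process_chunk(chunk))  # process_chunk is a hypothetical placeholder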
Upvotes: 16
Reputation: 222792
To write a lazy function, just use yield:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
Another option would be to use iter
and a helper function:
f = open('really_big_file.dat')

def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
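Note that if the file is opened in binary mode, the sentinel passed to iter should be b'' rather than '', for example:

f = open('really_big_file.dat', 'rb')
for piece in iter(lambda: f.read(1024), b''):  # b'' is what read() returns at EOF in binary mode
    process_data(piece)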
If the file is line-based, the file object is already a lazy generator of lines:
for line in open('really_big_file.dat'):
    process_data(line)
Upvotes: 559
Reputation: 529
file.readlines() takes an optional size argument: it returns complete lines whose total size is approximately that many bytes.
BUF_SIZE = 1024 * 1024  # read roughly 1 MB worth of complete lines per call

bigfile = open('bigfilename', 'r')
tmp_lines = bigfile.readlines(BUF_SIZE)
while tmp_lines:
    process([line for line in tmp_lines])
    tmp_lines = bigfile.readlines(BUF_SIZE)
Upvotes: 49
Reputation:
If your computer, OS, and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:
import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()
If your computer, OS, or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
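On a 64-bit setup you can still keep memory use low by walking the map in fixed-size slices; a minimal sketch, where process_data and the 1 MiB window size are just illustrative choices:

import mmap

chunk_size = 1024 * 1024  # 1 MiB windows; pick whatever suits your processing
with open('really_big_file.dat', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset in range(0, len(mm), chunk_size):
            process_data(mm[offset:offset + chunk_size])  # each slice is a bytes object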
Upvotes: 41
Reputation: 453
You can use the following code.

file_obj = open('big_file', 'rb')

open() returns a file object.

Then use os.stat for getting the size:

import os

file_size = os.stat('big_file').st_size

# ceiling division, so the final partial chunk is not skipped
for i in range(-(-file_size // 1024)):
    print(file_obj.read(1024))
Upvotes: -2
Reputation: 81
I think we can write it like this:
def read_file(path, block_size=1024):
    with open(path, 'rb') as f:
        while True:
            piece = f.read(block_size)
            if piece:
                yield piece
            else:
                return

for piece in read_file(path):
    process_piece(piece)
Upvotes: 5
Reputation: 3793
f = ...  # file-like object, i.e. supporting read(size) function and
         # returning empty string '' when there is nothing to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    process(data)  # process the data
UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592
Upvotes: 14
Reputation: 319531
I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:
def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]
Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line 'between' chunks.
chunk = [next(gen) for i in range(lines_required)]
This does the trick without losing any lines, but it doesn't look very nice.
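A slightly cleaner way to take fixed-size batches of lines without dropping any is itertools.islice; a sketch, where process_batch is just a placeholder for your own handling:

from itertools import islice

lines_required = 100
with open('4gb_file') as f:
    while True:
        batch = list(islice(f, lines_required))  # next batch of up to 100 lines
        if not batch:  # an empty batch means end of file
            break
        process_batch(batch)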
Upvotes: 1
Reputation: 141
I am not allowed to comment due to my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint]).
Edit: SilentGhost is right, but this should be better than:
s = ""
for i in xrange(100):
s += file.next()
Upvotes: 2