Reputation: 51
I am currently trying to split a large text file (>200 GB). The goal is to divide the large file into smaller chunks. I have written the following code and it works great on smaller files. However, on the large file my computer restarts. At this point I can't figure out if it is a hardware issue (i.e. processing power) or some other reason. I am also looking for ideas on a more efficient way of doing the same thing.
import os

def split(source, target, lines):
    index = 0
    block = 0
    if not os.path.exists(target):
        os.mkdir(target)
    with open(source, 'rb') as s:
        chunk = s.readlines()   # reads the whole file into memory
        while block < len(chunk):
            with open(target + f'file_{index:04d}.txt', 'wb') as t:
                t.writelines(chunk[block: block + lines])
            index += 1
            block += lines
Upvotes: 0
Views: 93
Reputation: 1083
It's the s.readlines() call that kills it, since it tries to load the entire file into memory at once.
You could do something like
with open("largeFile",'rb') as file:
while True:
data = file.read(1024) //blocksize
file.read() only reads the specified block size at a time, which should avoid the memory issue you're currently running into.
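For example, here is a minimal sketch of a byte-based splitter along those lines (the function name, the 1 GiB chunk_size, and the output naming are just illustrative assumptions; note the chunk boundaries won't fall on line breaks):

import os

def split_by_bytes(source, target, chunk_size=1024 * 1024 * 1024):
    # chunk_size is illustrative: roughly 1 GiB per output file
    os.makedirs(target, exist_ok=True)
    index = 0
    with open(source, 'rb') as s:
        while True:
            data = s.read(chunk_size)   # read at most chunk_size bytes
            if not data:                # end of file
                break
            with open(os.path.join(target, f'file_{index:04d}.txt'), 'wb') as t:
                t.write(data)
            index += 1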
EDIT:
I'm not smart, I missed the "text file" part in your title, sorry. In that case it should be enough to use file.readline() instead of file.readlines(), reading one line at a time, as in the sketch below.
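A minimal line-based sketch that keeps your one-file-per-N-lines layout (the function name, the default of 100000 lines, and the output naming are assumptions; iterating over the file object reads one line at a time, equivalent to calling readline() in a loop):

import os

def split_by_lines(source, target, lines=100000):
    os.makedirs(target, exist_ok=True)
    index = 0
    count = 0
    out = None
    with open(source, 'rb') as s:
        for line in s:                  # lazily reads one line at a time
            if count % lines == 0:      # start a new chunk file every `lines` lines
                if out:
                    out.close()
                out = open(os.path.join(target, f'file_{index:04d}.txt'), 'wb')
                index += 1
            out.write(line)
            count += 1
    if out:
        out.close()

This never holds more than one line in memory, so it should work the same on a 200 GB file as on a small one.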
Upvotes: 1