Reputation: 2283
I have a large file (400+ MB) that I'm reading in from S3 using get_contents_as_string(), which means that I end up with the entire file in memory as a string. I'm running several other memory-intensive operations in parallel, so I need a memory-efficient way of splitting the resulting string into chunks by line number. Is split() efficient enough? Or is something like re.finditer() a better way to go?
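Roughly what the read looks like, as a minimal sketch assuming the boto 2 S3 API (the bucket and key names here are placeholders):

from boto.s3.connection import S3Connection

conn = S3Connection()                    # credentials from the environment
bucket = conn.get_bucket('my-bucket')    # placeholder bucket name
key = bucket.get_key('big-file.txt')     # placeholder key name
data = key.get_contents_as_string()      # pulls the whole 400+ MB object into one string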
Upvotes: 3
Views: 384
Reputation: 3215
I see three options here, from the most memory-consuming to the least:

1. split() will create a copy of your file as a list of strings, meaning an additional ~400 MB used. Easy to implement, costs RAM.
2. re: use re.finditer() to locate the \n positions; only the match offsets are kept, not a second copy of the string.
3. Or simply iterate over the string and memorize the \n positions yourself: for i, c in enumerate(s): if c == '\n': newlines.append(i+1) (see the sketch after this list).
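A sketch of options 2 and 3, assuming the S3 contents are held in the string s (the function names are illustrative): both build a list of line-start offsets, so a chunk of lines can later be sliced out of s without copying the whole thing.

import re

def line_starts_re(s):
    # Option 2: every line starts at offset 0 or right after a '\n'.
    return [0] + [m.end() for m in re.finditer('\n', s)]

def line_starts_loop(s):
    # Option 3: plain loop that memorizes the position after each '\n'.
    starts = [0]
    for i, c in enumerate(s):
        if c == '\n':
            starts.append(i + 1)
    return starts

def lines_between(s, starts, begin, end):
    # Slice out lines [begin, end) using the precomputed index.
    stop = starts[end] if end < len(starts) else len(s)
    return s[starts[begin]:stop]

Only the integer offsets are stored, not another copy of the data.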
I would also suggest encapsulating solution 2 or 3 in a separate class, in order to keep the newline indexes and the string contents consistent. The Proxy pattern and the idea of lazy evaluation would fit here, I think.
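As a sketch of that suggestion (the class and method names are my own, not from the answer): a small wrapper that holds the string and builds the newline index lazily, on first use, so the index and the contents cannot drift apart.

import re

class ChunkedString(object):
    def __init__(self, data):
        self._data = data
        self._starts = None                      # index is built lazily

    def _line_starts(self):
        if self._starts is None:
            self._starts = [0] + [m.end() for m in re.finditer('\n', self._data)]
        return self._starts

    def num_lines(self):
        return len(self._line_starts())

    def lines(self, begin, end):
        # Return lines [begin, end) as a single substring, without a full copy.
        starts = self._line_starts()
        stop = starts[end] if end < len(starts) else len(self._data)
        return self._data[starts[begin]:stop]

Usage would look like chunk = ChunkedString(data).lines(0, 100000).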
Upvotes: 1
Reputation: 1982
You could try to read the file line by line:

f = open(filename)
partialstring = f.readline()

See https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
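Since the data in the question already arrives as an in-memory string rather than a file on disk, one way to apply the same line-by-line idea is to wrap it in a file-like object. A sketch, assuming Python 2 (to match the linked docs) and that the S3 contents are in the variable data:

from cStringIO import StringIO

f = StringIO(data)
for line in f:           # yields one line at a time
    do_something(line)   # do_something is a placeholder for your own handling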
Upvotes: 0