LateCoder

Reputation: 2283

Best way to chunk a large string by line

I have a large file (400+ MB) that I'm reading in from S3 using get_contents_as_string(), which means that I end up with the entire file in memory as a string. I'm running several other memory-intensive operations in parallel, so I need a memory-efficient way of splitting the resulting string into chunks by line number. Is split() efficient enough? Or is something like re.finditer() a better way to go?
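
For reference, a minimal sketch of the setup described, assuming boto's S3 Key API (the bucket and key names are placeholders):

from boto.s3.connection import S3Connection

conn = S3Connection()                    # credentials from the environment
bucket = conn.get_bucket('my-bucket')    # placeholder bucket name
key = bucket.get_key('large-file.txt')   # placeholder key name
s = key.get_contents_as_string()         # whole 400+ MB file in memory
chunks = s.split('\n')                   # roughly doubles memory usage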

Upvotes: 3

Views: 384

Answers (2)

u354356007

Reputation: 3215

I see three options here, from the most memory-consuming to the least:

  1. split will create a copy of your file as a list of strings, meaning an additional 400 MB used. Easy to implement, but it costs RAM.
  2. Use re, or simply iterate over the string and memorize the \n positions: for i, c in enumerate(s): if c == '\n': newlines.append(i + 1) (see the sketch after this list).
  3. The same as point 2, but with the string stored as a file on disk. Slow but really memory-efficient. It also addresses a disadvantage of Python strings: they're immutable, so if one wants to make changes, the interpreter will create a copy. Files don't suffer from this, allowing in-place operations without loading the whole file into memory at all.
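
A minimal sketch of option 2, using a short stand-in string (in practice s would be the 400+ MB contents):

s = "alpha\nbeta\ngamma\ndelta\n"  # stand-in for the large string

# Record the offset at which each line starts.
starts = [0]
for i, c in enumerate(s):
    if c == '\n':
        starts.append(i + 1)

def chunk(start_line, end_line):
    """Return lines start_line..end_line-1 as one substring; only that slice is copied."""
    end = starts[end_line] if end_line < len(starts) else len(s)
    return s[starts[start_line]:end]

print(chunk(1, 3))  # beta\ngamma\n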

I would also suggest encapsulating solution 2 or 3 in a separate class in order to keep the newline indexes and the string contents consistent. The Proxy pattern and the idea of lazy evaluation would fit here, I think.
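
For illustration, one way such a wrapper might look (LineChunker is a hypothetical name; it combines options 2 and 3 by keeping the data on disk and building the index lazily):

class LineChunker(object):
    """Wraps a file on disk; the newline index is built lazily on first use."""

    def __init__(self, path):
        self.path = path
        self._starts = None  # lazy: computed only when first needed

    @property
    def starts(self):
        if self._starts is None:
            starts = [0]
            with open(self.path, 'rb') as f:
                for line in f:
                    starts.append(starts[-1] + len(line))  # cumulative byte offsets
            self._starts = starts
        return self._starts

    def chunk(self, start_line, end_line):
        """Read only lines start_line..end_line-1 from disk."""
        with open(self.path, 'rb') as f:
            f.seek(self.starts[start_line])
            return f.read(self.starts[end_line] - self.starts[start_line])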

Upvotes: 1

Marc Wagner

Reputation: 1982

You could try reading the file line by line:

with open(filename) as f:
    partialstring = f.readline()  # reads a single line; the rest of the file stays on disk

See https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files for details.
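
If the goal is chunks of N lines rather than single lines, itertools.islice can group them while still reading lazily (chunks_of is a hypothetical helper):

from itertools import islice

def chunks_of(path, n):
    """Yield successive n-line chunks, reading lazily from disk."""
    with open(path) as f:
        while True:
            chunk = list(islice(f, n))
            if not chunk:
                break
            yield chunk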

Upvotes: 0
