lightxx

Reputation: 1067

Weird behavior parsing large text file using a foreach loop (C# .NET 4)

I have a VERY large text file to parse (~2 GB). For various reasons I have to process the file line-wise. I do this by loading the text file into memory (the server I'm running the parser on has more than enough memory) with var records = Regex.Split(File.ReadAllText(dumpPath, Encoding.Default), @"my regex here").Where(s => !string.IsNullOrEmpty(s));. This consumes RAM equivalent to the size of the text file plus a few MB of IEnumerable overhead. So far so good. Then I go over the collection with foreach (var record in records) { ... }
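As a minimal compilable sketch of the setup described above (the real regex is elided in the post, so a line-break split stands in for it here, and the dump path is a temp file created just so the sketch runs):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

class DumpParser
{
    public static IEnumerable<string> LoadRecords(string dumpPath)
    {
        // Reads the ENTIRE file into one big string, then splits it into
        // records: memory use is at least the size of the file up front.
        return Regex.Split(File.ReadAllText(dumpPath, Encoding.Default),
                           @"\r?\n") // stand-in for the real record delimiter
                    .Where(s => !string.IsNullOrEmpty(s));
    }

    static void Main()
    {
        var dumpPath = Path.GetTempFileName(); // placeholder for the real dump
        File.WriteAllText(dumpPath, "rec1\nrec2\n");

        foreach (var record in LoadRecords(dumpPath))
        {
            // heavy string manipulation / regex work happens here
            Console.WriteLine(record);
        }
    }
}
```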

Here comes the interesting part. I do a lot of string manipulation and regexing in the foreach loop, and the program quickly bombs with a System.OutOfMemoryException, even though I never use more than a few kB inside the loop. I made a few memory snapshots using the profiler of my choice (ANTS Memory Profiler) and saw millions and millions of Generation 2 string objects on the heap, consuming all available memory.

Seeing that, I included - just as a test - a GC.Collect(); at the end of each foreach iteration, and voilà, problem solved: no more out-of-memory exceptions (though, sure enough, the constant garbage collections now make the program run painstakingly slow). The only memory consumed is the size of the actual file.

Now I can't explain why this happens or how to prevent it. To my understanding, the moment a variable goes out of scope and has no more (active) references to it, it should be marked for garbage collection, right?

On another side note, I tried running the program on a really massive machine (64 GB RAM). The program finished successfully but never released a single byte of memory before it was closed. Why? If there are no more references to an object and the object goes out of scope, why is the memory never released?

Upvotes: 2

Views: 350

Answers (1)

Jon Skeet

Reputation: 1502086

Now I can't explain why this happens or how to prevent it. To my understanding, the moment a variable goes out of scope and has no more (active) references to it, it should be marked for garbage collection, right?

No. There's no such thing as being "marked" for garbage collection, and variables aren't garbage collected: objects are. And an object which is already in gen2 won't be garbage collected until the next time the GC looks at gen2, which is relatively rare.
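The promotion Jon describes can be observed directly with GC.GetGeneration: an object that survives a collection moves up a generation, and once it reaches gen2 it sits there until the next (relatively rare) full collection. A small sketch:

```csharp
using System;

class GenDemo
{
    static void Main()
    {
        var survivor = new object();
        Console.WriteLine(GC.GetGeneration(survivor)); // 0 right after allocation

        GC.Collect(); // survives the collection -> promoted to gen 1
        GC.Collect(); // survives again -> promoted to gen 2
        Console.WriteLine(GC.GetGeneration(survivor));

        GC.KeepAlive(survivor); // keep the reference live through the collections
    }
}
```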

For various reasons I have to process the file line-wise.

Then there's your answer: use File.ReadLines if you're using .NET 4, and write the equivalent (it's easy) if you're not. Then you don't need the whole file in memory at a time - just one line. Your memory usage should absolutely plummet. (Note that that's ReadLines, not ReadAllLines - the latter will read the whole file into an array of strings, which isn't what you want.)
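The pre-.NET 4 equivalent Jon mentions can be sketched as an iterator over a StreamReader, so only the current line is ever held in memory (FileLines is a hypothetical helper name, not a framework type):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class FileLines
{
    // Yields one line at a time; the reader is disposed when iteration
    // completes or the caller abandons the enumerator.
    public static IEnumerable<string> ReadLines(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}

class Demo
{
    static void Main()
    {
        var tmp = Path.GetTempFileName();
        File.WriteAllLines(tmp, new[] { "first", "second" });

        foreach (var line in FileLines.ReadLines(tmp, Encoding.Default))
        {
            Console.WriteLine(line);
        }
    }
}
```

Usage is the same shape as the original loop: foreach (var record in FileLines.ReadLines(dumpPath, Encoding.Default)) { ... }.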

on another side note, i tried to run the program on a really massive machine (64GB RAM). the program finished successfully but never released a single byte of memory before it was closed. why?

If you're talking about memory that the process takes from the operating system, I don't believe that the CLR ever releases memory. I assume it takes the approach that if you've used that much memory once, you'll probably use that much again.

Upvotes: 5
