Reputation: 2213
I have a task to fix/improve a program of ours that constantly throws OOM exceptions when processing large files (60-90 MB) in memory. This program is essentially an in-house solution to move files from our infrastructure to a client SFTP and the other way around.
High level description: The data being moved is in the form of file(s) containing payment data. Payment data is tokenized on our end and must be untokenized when it reaches the client SFTP, or, if we are moving the file FROM the client, the data is untokenized on their end and we must tokenize the payment info BEFORE creating and storing that data as a file on our end. So the rule we MUST observe is that the app HAS to load the whole file into memory (we cannot copy it into a file, temporary or otherwise), tokenize/detokenize it in memory, and then either push that data from memory to the client SFTP in clear text or store it on our end in tokenized form. Under no circumstances are we allowed to store a file with clear text payment data on our end.
Currently this application loads the whole file into a custom class that is more or less an array of strings, loops through that array picking up payment information, and sends that payment information to the Tokenization API as another array of values (we have no ability to batch the process or load files partially at this moment). It then converts the outcome into a MemoryStream and either saves it into a file (with tokenized payment data) or uploads it to SFTP (with untokenized, i.e. clear text, payment data).
That's the high-level background on our process and the application. Now the issue: the app used to throw OOM when processing a 90 MB file. I've improved the code somewhat. The app is compiled as x86 (we cannot compile it as x64 for several reasons), so I've also applied the editbin.exe /largeaddressaware patch to the app executable to let it handle addresses larger than 2 GB.
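(For reference, that patch is just a post-build flag flip on the compiled executable; editbin.exe ships with Visual Studio, and MyApp.exe below is a placeholder for our binary.)

editbin.exe /LARGEADDRESSAWARE MyApp.exe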
All these adjustments got us to the point where files up to 120 MB can be processed. I could probably make some more code improvements, but the servers that run the app have 16 or more GB of RAM and never seem to be low on physical memory when processing these files. Furthermore, when running locally, I hit the same OOM issues while still having about 40% of my RAM free. Reading a bit on this topic makes me feel like this is a limitation of x86 and/or the .NET infrastructure.
The error sometimes happens when we send the data to the Tokenization API, and sometimes when we try converting the whole file (held as a StringBuilder) to an array of bytes to be saved as a file: System.Text.Encoding.UTF8.GetBytes(sb.ToString())
Exception: Weirdly enough, we get two types of exceptions. Sometimes this straightforward OOM exception:
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.IO.MemoryStream.set_Capacity(Int32 value)
at System.IO.MemoryStream.EnsureCapacity(Int32 value)
at System.IO.MemoryStream.Write(Byte[] buffer, Int32 offset, Int32 count)
at xxxxx.ToMemoryStream()
or sometimes this weird one, which shuts the app down completely, ignoring any try/catch blocks:
The runtime has encountered a fatal error. The address of the error was at 0x73aa2847, on thread 0x5070. The error code is 0xe0004743. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.
Question: Is there a more reliable way to load things into memory without being limited by x86 or the .NET infrastructure, and without breaking the data up or loading it partially? I feel like with 4-8 GB of available RAM we shouldn't be running out of it when loading a 150 MB file into memory. There seem to be other ways of handling memory, like Memory<T> or MemoryPool<T>, but before spending any significant time researching all the possibilities I thought I'd ask this helpful community for directions I should focus on.
Tech used: C#, .NET framework 4.5.2
Thank you
PS: when this app runs, its RAM footprint is usually around 2-3 GB, and like I mentioned before, whether running locally or on the proper server, there are often another 4-8 GB of RAM available when the app hits OOM.
EDIT 1:
public MemoryStream ToMemoryStream()
{
    MemoryStream ms = new MemoryStream();
    //byte[] msBuffer;
    var sb = new StringBuilder();
    foreach (var section in Lines.Values)
    {
        var sec = section.ToString();
        if (sec.Contains("\n"))
            sb.Append(sec);      // section already carries its own line breaks
        else
            sb.AppendLine(sec);
        if (sb.Length <= 0) continue;
        //msBuffer = System.Text.Encoding.UTF8.GetBytes(sb.ToString());
        //ms.Write(msBuffer, 0, msBuffer.Length);
        // Disabled msBuffer above to help alleviate the out-of-memory issues.
        // Note: the count passed to Write must be the UTF-8 byte count, not the character count.
        var sectionBytes = System.Text.Encoding.UTF8.GetBytes(sb.ToString());
        ms.Write(sectionBytes, 0, sectionBytes.Length);
        sb.Clear();
    }
    ms.Position = 0;
    return ms;
}
The commented-out lines are my improvements to the function. Previously it would create essentially another copy of the file (already being held in the Lines object) by continuously appending to the sb StringBuilder, then create another copy by converting that to an array of bytes assigned to msBuffer, and then yet another copy as the actual MemoryStream object that was returned.
After the improvements, the only full copies of the file in memory are the original Lines object and the MemoryStream, which is returned and later saved as a file.
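For what it's worth, a variant I'm considering (just a sketch, not something we run yet) is to write each section through a StreamWriter wrapping the MemoryStream, so no intermediate string or per-section byte[] is ever materialized:

// Sketch only, assuming the same Lines dictionary of sections.
// Requires: using System.IO; using System.Text;
public MemoryStream ToMemoryStreamViaWriter()
{
    var ms = new MemoryStream();
    // leaveOpen: true so disposing the writer flushes it without closing the MemoryStream;
    // UTF8Encoding(false) avoids a BOM, matching what Encoding.UTF8.GetBytes produced.
    using (var writer = new StreamWriter(ms, new UTF8Encoding(false), 4096, true))
    {
        foreach (var section in Lines.Values)
        {
            var sec = section.ToString();
            if (sec.Contains("\n"))
                writer.Write(sec);       // section already carries its own line breaks
            else
                writer.WriteLine(sec);
        }
    }
    ms.Position = 0;
    return ms;
}

The Lines object and the MemoryStream are still both in memory, but the temporary per-section string/byte[] copies go away.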
EDIT: while there is no clear solution to my problem, knowing the size of the file beforehand and allocating that much memory at once when creating the MemoryStream helped a fair amount (by about 20%). Thank you for that suggestion, Ralf.
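For anyone else who hits this: the suggestion amounts to passing a capacity hint to the MemoryStream constructor, roughly like this (expectedByteCount stands in for whatever size estimate is available up front, e.g. the source file length):

// expectedByteCount is a placeholder for the known/estimated output size in bytes.
// Pre-sizing avoids the repeated buffer doubling inside MemoryStream.EnsureCapacity,
// which is where the first stack trace above was throwing.
var ms = new MemoryStream(expectedByteCount);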
Upvotes: -1
Views: 78
Reputation: 4728
Instead of trying to store the entire file in RAM, you can try something different.
First, DON'T read the entire thing into a string. Use a StreamReader and parse it line by line.
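A minimal sketch of that shape (inputPath and ProcessLine are placeholders for your file and your own parsing/tokenization step):

// Only one line is held in memory at a time.
using (var reader = new StreamReader(inputPath, Encoding.UTF8))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        ProcessLine(line); // placeholder: pick out and tokenize/detokenize the payment fields here
    }
}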
If it contains normalized data that you need to summarize or rearrange, then you could create a SQLite database file (and the requisite tables) to temporarily store that data as you read it from the StreamReader. You will delete this file at the end of your process. The benefit here is that the SQLite db engine can summarize, sort, and so forth. You would have a second method that pushes data to the API you mentioned, using a query against the SQLite db. Since it's absolutely important that you don't keep the SQLite file around afterwards, you need to make sure the file is deleted even if exceptions occur.
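A rough sketch of that idea, assuming the System.Data.SQLite package (the table and column names are purely illustrative); the finally block is what guarantees the temporary file is removed even when an exception occurs:

var dbPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".sqlite");
try
{
    using (var conn = new SQLiteConnection("Data Source=" + dbPath))
    {
        conn.Open();
        using (var cmd = new SQLiteCommand(
            "CREATE TABLE payment_rows (id INTEGER PRIMARY KEY, payload TEXT)", conn))
        {
            cmd.ExecuteNonQuery();
        }
        // ... insert rows here as they come off the StreamReader,
        //     then query them back in batches and push them to the tokenization API ...
    }
}
finally
{
    if (File.Exists(dbPath))
        File.Delete(dbPath); // never leave the temporary file behind
}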
The point here is to keep all that data out of RAM. Holding everything in RAM is not really a great architecture, because if the input data grows you have to keep worrying about running out of memory, like you are right now. Disk space is cheaper and more plentiful. Use that.
Upvotes: 0