Reputation: 121
I have to read large files (4-10 GB each) line by line, but my .NET process gets an OutOfMemoryException when I have read roughly 2 GB.
At first I am just attempting to count the lines, but eventually I will need to access each line individually to strip some data from it.
From what I can see, every option keeps the previous lines in memory, whereas I only want it to keep the currently read line (unless anyone knows a trick to keep all of it).
Here is what I tried, along with several variations of it:
int count = 0;
string line;
StreamReader reader = File.OpenText(FilePath);
while ((line = reader.ReadLine()) != null) // This is where it errors
{
    count++;
}
reader.Close();
The exception is:
Exception of type 'System.OutOfMemoryException' was thrown.
at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
at System.Text.StringBuilder.Append(Char* value, Int32 valueCount)
at System.Text.StringBuilder.Append(Char[] value, Int32 startIndex, Int32 charCount)
at System.IO.StreamReader.ReadLine()
at CSV.Program.NumLines() in C:\Users\ted\Documents\Visual Studio 2015\Projects\vConnect\CSV\CSV\Program.cs:line 100
at CSV.Program.Main(String[] args) in C:\Users\ted\Documents\Visual Studio 2015\Projects\vConnect\CSV\CSV\Program.cs:line 20
at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
Upvotes: 2
Views: 817
Reputation: 14555
You should perhaps be using memory-mapped files [1].
These let you open a file and read from it chunk by chunk, without ever loading the whole thing; the functionality is exposed through the MemoryMappedFile [2] class.
[1] https://learn.microsoft.com/en-us/dotnet/standard/io/memory-mapped-files
[2] https://learn.microsoft.com/en-us/dotnet/api/system.io.memorymappedfiles.memorymappedfile
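A minimal sketch of that approach (the chunk size and file name here are illustrative assumptions, not part of the question):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfChunkReader
{
    static void Main()
    {
        const long chunkSize = 64L * 1024 * 1024;   // 64 MB views; an arbitrary choice
        string path = "big.csv";                    // hypothetical file name
        long fileLength = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        {
            for (long offset = 0; offset < fileLength; offset += chunkSize)
            {
                long size = Math.Min(chunkSize, fileLength - offset);
                // Map only this window of the file into the address space.
                using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
                {
                    var buffer = new byte[64 * 1024];
                    int read;
                    while ((read = view.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        // Process 'read' bytes of the current chunk here.
                    }
                }
            }
        }
    }
}

Only one view is mapped at a time, so memory use stays bounded regardless of file size.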
Upvotes: 1
Reputation: 54
You can use methods from the FileStream class: FileStream.Read and FileStream.Seek should allow you to do what you need. An example can be found here: http://www.codeproject.com/Questions/543821/ReadplusBytesplusfromplusLargeplusBinaryplusfilepl
You'll have to modify it slightly, but essentially you start at position 0, read until you find a newline character, process the line, continue from where you left off, and repeat. It won't be terribly efficient, but it will get the job done.
Hope this helps.
Have a look at:
FileStream.Read
FileStream.Seek
That pretty much covers what you need to know.
[Update] Your implementation should look a bit like this:
const int megabyte = 1024 * 1024;

public void ReadAndProcessLargeFile(string theFilename, long whereToStartReading = 0)
{
    using (FileStream fileStream = new FileStream(theFilename, FileMode.Open, FileAccess.Read))
    {
        byte[] buffer = new byte[megabyte];
        fileStream.Seek(whereToStartReading, SeekOrigin.Begin);
        int bytesRead = fileStream.Read(buffer, 0, megabyte);
        while (bytesRead > 0)
        {
            // Only one megabyte-sized buffer is ever alive, so memory use stays flat.
            ProcessChunk(buffer, bytesRead);
            bytesRead = fileStream.Read(buffer, 0, megabyte);
        }
    }
}

private void ProcessChunk(byte[] buffer, int bytesRead)
{
    // Do the processing here
}
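ProcessChunk is left empty above. A possible body for the line-splitting idea described earlier might look like this (a sketch assuming UTF-8 text with '\n' line endings; the leftover field and the ProcessLine handler are hypothetical names, and System.IO plus System.Text are required):

// Carries the partial line left at the end of the previous chunk.
private readonly MemoryStream leftover = new MemoryStream();

private void ProcessChunk(byte[] buffer, int bytesRead)
{
    int lineStart = 0;
    for (int i = 0; i < bytesRead; i++)
    {
        if (buffer[i] == (byte)'\n')
        {
            // Complete the line with whatever was carried over, then handle it.
            leftover.Write(buffer, lineStart, i - lineStart);
            string line = Encoding.UTF8.GetString(leftover.ToArray()).TrimEnd('\r');
            leftover.SetLength(0);
            ProcessLine(line);     // hypothetical per-line handler
            lineStart = i + 1;
        }
    }
    // Save the trailing partial line for the next chunk.
    leftover.Write(buffer, lineStart, bytesRead - lineStart);
}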
Best regards, Espen Harlinn
In addition to the correct answer by Espen Harlinn:
Breaking the file into chunks will hardly help unless those chunks are of different natures (different formats, representing different data structures) that were put into one file without proper justification.
In other cases, it's better to use the big file directly and keep it open; in some cases you will need to seek within it in two steps. That is just the basic idea; see below.
So, I would assume that the file is big because it represents a collection of objects of the same type, or of a few different types. If all the items are of the same size (in file storage units), addressing them is trivial: you simply multiply the item size by the required item index to get the position parameter for Stream.Seek. The only non-trivial case is a collection of items of different sizes. In that case, you should index the file and build an index table. The index table consists of units of the same size, typically a list/array of file positions per index. Because of that, the index table itself can be addressed by index (shift), as described above; you then read the position of the item in the "big" file, move the file position there, and read the data.
You have two options: 1) keep the index table in memory; you could recalculate it each time, but it's better to build it once (cache it) and store it in a file, either the same one or a separate one; 2) keep it only in a file and read that file at the required position. Either way, you have to seek the position in the file(s) in two steps. In principle, this method allows you to access files of any size (limited only by System.UInt64.MaxValue).
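A minimal sketch of such an index table for line-oriented data (the names BuildLineIndex and ReadLineAt are illustrative, and '\n'-terminated UTF-8 records are assumed; for fixed-size items you would skip the table entirely and seek to itemSize * index):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static List<long> BuildLineIndex(string path)
{
    var index = new List<long> { 0 };          // position of line 0
    using (var stream = File.OpenRead(path))
    {
        int b;
        while ((b = stream.ReadByte()) != -1)
            if (b == '\n')
                index.Add(stream.Position);    // the next line starts here
    }
    return index;
}

static string ReadLineAt(string path, List<long> index, int lineNumber)
{
    using (var stream = File.OpenRead(path))
    {
        stream.Seek(index[lineNumber], SeekOrigin.Begin);
        long end = lineNumber + 1 < index.Count ? index[lineNumber + 1] : stream.Length;
        var buffer = new byte[end - index[lineNumber]];
        stream.Read(buffer, 0, buffer.Length);
        return Encoding.UTF8.GetString(buffer).TrimEnd('\r', '\n');
    }
}

The table itself is a list of same-sized units (64-bit positions), so it can be addressed by index, cached, or persisted to its own file, exactly as described above.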
After you have positioned the stream of the "big" file, you can read a single item. You can use serialization for this purpose; please see:
http://en.wikipedia.org/wiki/Serialization#.NET_Framework,
http://msdn.microsoft.com/en-us/library/vstudio/ms233843.aspx,
http://msdn.microsoft.com/en-us/library/system.runtime.serialization.formatters.binary.binaryformatter.aspx
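For instance, reading one item at an indexed position with the BinaryFormatter from the last link might look like this (a sketch assuming a serializable Item type of your own design; note that BinaryFormatter is deprecated in modern .NET):

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
class Item
{
    public string Name;
    public int Value;
}

static Item ReadItemAt(FileStream stream, long position)
{
    stream.Seek(position, SeekOrigin.Begin);       // jump to the indexed offset
    var formatter = new BinaryFormatter();
    return (Item)formatter.Deserialize(stream);    // reads exactly one item
}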
A fancy way of implementing all of these index-table solutions would be to encapsulate everything in a class with an indexed property.
—SA
Upvotes: 1