Reputation: 2077
We have an application which logs its processing steps into text files. These files are used during implementation and testing to analyse problems. Each file is up to 10 MB in size and contains up to 100,000 text lines.
Currently the analysis of these logs is done by opening them in a text viewer (Notepad++, etc.) and looking for specific strings and data, depending on the problem.
I am building an application which will help with the analysis. It will enable a user to read files, search, highlight specific strings, and perform other operations related to isolating relevant text.
The files will not be edited!
While playing a little with some concepts, I found out immediately that TextBox (and RichTextBox) don't handle the display of large text very well. I managed to implement a viewer using DataGridView with acceptable performance, but that control does not support color highlighting of specific strings.
I am now thinking of holding the entire text file in memory as a string, and only displaying a very limited number of records in the RichTextBox. For scrolling and navigating I thought of adding an independent scrollbar.
One problem I have with this approach is how to get specific lines from the stored string.
If anyone has any ideas, or can highlight problems with my approach, thank you.
Upvotes: 2
Views: 2210
Reputation: 345
I would suggest using MemoryMappedFile in .NET 4 (or via DllImport in previous versions) to handle just the small portion of the file that is visible on screen, instead of wasting memory and time loading the entire file.
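A minimal sketch of that idea (my illustration, not the original answer's code; the offset/length arguments and the UTF-8 assumption are mine): map the file read-only and decode only the byte range that needs to be displayed.

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class MappedLogViewer
{
    // Read only the requested byte window; the rest of the file is never loaded.
    // The caller must keep offset + length within the file size.
    static string ReadWindow(string path, long offset, int length)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[length];
            view.ReadArray(0, buffer, 0, length);
            return Encoding.UTF8.GetString(buffer); // assumes UTF-8 log files
        }
    }
}

Re-creating the mapping on every call keeps the sketch short; a real viewer would hold the MemoryMappedFile open and create new views as the user scrolls.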
Upvotes: 1
Reputation: 1992
I suppose that when one has multiple gigabytes of RAM available, one naturally gravitates towards the "load the whole file into memory" path, but is anyone here really satisfied with such a shallow understanding of the problem? What happens when this guy wants to load a 4 gigabyte file? (Yeah, probably not likely, but programming is often about abstractions that scale and the quick fix of loading the whole thing into memory just isn't scalable.)
There are, of course, competing pressures: do you need a solution yesterday, or do you have the luxury of time to dig into the problem and learn something new? The framework also influences your thinking by presenting block-mode files as streams... you have to check the reader's BaseStream.CanSeek value and, if that is true, use the BaseStream.Seek() method to get random access. Don't get me wrong, I absolutely love the .NET framework, but I see a construction site where a bunch of "carpenters" can't put up the frame for a house because the air-compressor is broken and they don't know how to use a hammer. Wax-on, wax-off, teach a man to fish, etc.
So if you have time, look into a sliding window. You can probably do this the easy way by using a memory-mapped file (let the framework/OS manage the sliding window), but the fun solution is to write it yourself. The basic idea is that you only have a small chunk of the file loaded into memory at any one time (the part of the file that is visible in your interface with maybe a small buffer on either side). As you move forward through the file, you can save the offsets of the beginning of each line so that you can easily seek to any earlier section of the file.
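A rough sketch of the do-it-yourself version (the class and member names are mine, not the answer's, and it assumes \n line endings): one forward pass records the byte offset at which each line starts, after which any line can be reached with a Seek while only the visible window is ever held in memory.

using System.Collections.Generic;
using System.IO;
using System.Text;

class SlidingWindowReader
{
    private readonly List<long> _lineOffsets = new List<long>();
    private readonly string _path;

    // One forward pass: remember where every line starts.
    public SlidingWindowReader(string path)
    {
        _path = path;
        using (var stream = File.OpenRead(path)) // FileStream, so CanSeek is true
        {
            long offset = 0;
            _lineOffsets.Add(0);
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                offset++;
                if (b == '\n') _lineOffsets.Add(offset); // the next line starts here
            }
        }
    }

    // Load only the lines currently visible (plus whatever buffer the caller wants).
    public IList<string> ReadLines(int firstLine, int count)
    {
        var lines = new List<string>(count);
        using (var stream = File.OpenRead(_path))
        {
            stream.Seek(_lineOffsets[firstLine], SeekOrigin.Begin);
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                string line;
                for (int i = 0; i < count && (line = reader.ReadLine()) != null; i++)
                    lines.Add(line);
            }
        }
        return lines;
    }
}

For a 10 MB file the offset index is tiny (one long per line), and scrolling becomes a cheap seek-and-read instead of a full reload.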
Yes, there are performance implications... welcome to the real world where one is faced with various requirements and constraints and must find the acceptable balance between time and memory utilization. This is the fun of programming... figuring out the various ways that a goal can be reached and learning what the tradeoffs are between the various paths. This is how you grow beyond the skill levels of that guy in the office who sees every problem as a nail because he only knows how to use a hammer.
[/rant]
Upvotes: 2
Reputation: 106956
Here is an approach that scales well on modern CPUs with multiple cores.
You create an iterator block that yields the lines from the text file (or multiple text files if required):
IEnumerable<String> GetLines(String fileName) {
    using (var streamReader = File.OpenText(fileName))
        while (!streamReader.EndOfStream)
            yield return streamReader.ReadLine();
}
You then use PLINQ to search the lines in parallel. Doing that can speed up the search considerably if you have a modern CPU.
GetLines(fileName)
    .AsParallel()
    .AsOrdered()
    .Where(line => ...)
    .ForAll(line => ...);
You supply a predicate in Where that matches the lines you need to extract. You then supply an action to ForAll that will send the lines to their final destination.
This is a simplified version of what you need to do. Your application is a GUI application and you cannot perform the search on the main thread. You will have to start a background task for this. If you want this task to be cancellable you need to check a cancellation token in the while loop in the GetLines method.
ForAll will call the action on threads from the thread pool. If you want to add the matching lines to a user interface control you need to make sure that this control is updated on the user interface thread. Depending on the UI framework you use there are different ways to do that.
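As an illustration only (WinForms is assumed, and the control name, sample predicate, and task setup are mine, not the answer's), the pieces could fit together like this:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

public class LogSearchForm : Form
{
    private readonly ListBox resultsListBox = new ListBox { Dock = DockStyle.Fill };
    private readonly CancellationTokenSource cts = new CancellationTokenSource();

    public LogSearchForm()
    {
        Controls.Add(resultsListBox);
    }

    // The same iterator block as above, now checking the token on every line.
    IEnumerable<string> GetLines(string fileName, CancellationToken token)
    {
        using (var streamReader = File.OpenText(fileName))
            while (!streamReader.EndOfStream)
            {
                token.ThrowIfCancellationRequested();
                yield return streamReader.ReadLine();
            }
    }

    // Call after the form is shown so the control handle exists.
    public void StartSearch(string fileName)
    {
        Task.Factory.StartNew(() =>
            GetLines(fileName, cts.Token)
                .AsParallel()
                .AsOrdered()
                .Where(line => line.Contains("ERROR")) // sample predicate
                .ForAll(line =>
                    // ForAll runs on pool threads; marshal back to the UI thread.
                    BeginInvoke((Action)(() => resultsListBox.Items.Add(line)))),
            cts.Token);
    }
}

Calling cts.Cancel() makes the iterator throw OperationCanceledException on the next line read, which stops the query and ends the background task.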
This solution assumes that you can extract the lines you need by doing a single forward pass of the file. If you need to do multiple passes, perhaps based on user input, you may need to cache all lines from the file in memory instead. Caching 10 MB is not much, but let's say you decide to search multiple files. Caching 1 GB can strain even a powerful computer, but using less memory and more CPU as I suggest will allow you to search very big files within a reasonable time on a modern desktop PC.
Upvotes: 3
Reputation: 1504082
I would suggest loading the whole thing into memory, but as a collection of strings rather than a single string. It's very easy to do that:
string[] lines = File.ReadAllLines("file.txt");
Then you can search for matching lines with LINQ, display them easily, etc.
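For instance (my illustration; the search term and output are just samples), you can pair each line with its number and filter:

// requires using System.Linq;
var matches = lines
    .Select((text, number) => new { text, number })
    .Where(x => x.text.Contains("timeout")) // sample search term
    .ToList();

foreach (var match in matches)
    Console.WriteLine("Line {0}: {1}", match.number + 1, match.text);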
Upvotes: 4