MaYaN
MaYaN

Reputation: 6996

How can I efficiently index a file?

I am dealing with an application that needs to randomly read an entire line of text from a series of potentially large text files (~3+ GB).

The lines can be of a different length.

In order to reduce GC and create unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line and store that in a map in one pass therefore producing an index of lineNo => positioni.e:

// maps each line to it's corresponding fileStream.position in the file    
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
  1. go through the entire file
  2. when detect a new line increment lineCount and add the fileStream.Position to the _lineNumberToFileStreamPositionMapping

We then use an API similar to:

public void ReadLine(int lineNumber)
{
     var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
     //... set the stream position, read the byte array, convert to string etc.
}

This solution is currently providing a good performance however there are two things I do not like:

  1. Since I do not know the total number of lines in the file, I cannot preallocate an array therefore I have to use a List<int> which has the potential inefficiency of resizing to double of what I actually need;
  2. Memory usage, so as an example for a text file of ~1GB with ~5 million lines of text the index occupies ~150MB I would really like to decrease this as much as possible.

Any ideas are very much appreciated.

Upvotes: 6

Views: 463

Answers (1)

smead
smead

Reputation: 1808

  1. Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.

  2. If you want to trade performance for memory, you can do this: instead of storing the positions of every line, store only the positions of every 100th (or something) line. Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.

Upvotes: 4

Related Questions