Blankman

Reputation: 266960

What is the fastest way to read a line from a very large file when you know the line number

First I loop through the file line by line, keeping track in an array of the specific line numbers that I will need to reference later.

The file is very large (say 1 GB or more), so once I scan it and load the specific line numbers into an array, the file is no longer in memory.

What would be the fastest and most efficient way to read a specific line in a file?

The file contains lines of text separated by line breaks, where each row represents a transaction.

Instead of the line number, would it make more sense to somehow store the byte offset?

Upvotes: -1

Views: 1400

Answers (3)

Stephen C

Reputation: 718758

What is the fastest way to read a line from a very large file when you know the line number.

The following assumes that the file is too large to hold in memory, either as a data structure or by memory mapping it.

If you just know the line number and you are reading just one line, then the fastest (simple) way will be to read with a BufferedReader.read and count the line separators.
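
A sketch of that simple one-shot case, using readLine() to do the separator counting for you (the class and method names are illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineByNumber {
    // Returns the 1-based lineNumber-th line, or null if the file is shorter.
    static String readLine(String path, int lineNumber) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            int current = 0;
            while ((line = reader.readLine()) != null) {
                if (++current == lineNumber) {
                    return line;
                }
            }
        }
        return null; // the file has fewer than lineNumber lines
    }
}
```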

If you are doing the operation multiple times per file, then it is more complicated to do this efficiently. Firstly, you need a data structure to map line numbers to file offsets. This has to be created by reading the file, counting lines and recording byte offsets. There are various ways to represent this mapping, but an array of offsets will be the most memory efficient ... and the fastest if you are only going to use it to map a line number to a byte offset.

Unfortunately, you cannot get the current byte offset from a Reader, and you cannot "seek" a Reader to a particular byte or character offset.

Thus, to implement efficient line no -> byte offset -> line of text retrieval, you will need to use a BufferedInputStream or similar. But you will also need to make your code aware of the charset of the file you are reading:

  1. You will need to use new String(byte[], ..., Charset) and / or the Charset or CharsetDecoder APIs to turn the bytes that comprise a line into a Java String object.

  2. If you decide to use read(byte[], ...) or ByteBuffer for efficiency, you need to be careful not to mangle multi-byte characters that span buffer boundaries when converting to strings.

  3. In theory, your code also needs to detect line separators in an encoding aware way. In a few character sets, simple encoding-agnostic tests for ASCII NL and CR byte values won't work. Examples where that would be problematic include EBCDIC (which uses 0x15 as the code for NL), and UTF-16 encodings (where CR and NL are represented as 2 bytes not one).

This is not trivial ...

The second problem is that each time your application reads the line with a given line number, it will first need to "seek" the stream to the right byte offset. Depending on the access patterns, that may entail a seek system call followed by a read system call. It may also be advisable to implement some kind of caching: you could cache just the line that you read, or you could read on a disk block boundary and cache the preceding and/or following lines. The best strategy will depend on the access patterns, and probably can only be determined experimentally.
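
A minimal sketch of the seek-and-read step (no caching), assuming an offsets array built in an earlier pass whose final entry is the file length. Note that the trailing-separator stripping below only works for ASCII-compatible encodings, per point 3 above; the names are illustrative:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

public class OffsetLineReader {
    // Reads line i (0-based) given a precomputed offsets array whose last
    // entry is the file length, decoding with the file's charset.
    static String readLineAt(RandomAccessFile file, long[] offsets, int i,
                             Charset charset) throws IOException {
        long start = offsets[i];
        int length = (int) (offsets[i + 1] - start);
        byte[] bytes = new byte[length];
        file.seek(start);       // one seek per lookup
        file.readFully(bytes);  // one (or more) reads
        // Strip the trailing '\n' (and '\r' for CRLF files) before decoding.
        // This byte-level test assumes an ASCII-compatible encoding.
        while (length > 0 && (bytes[length - 1] == '\n' || bytes[length - 1] == '\r')) {
            length--;
        }
        return new String(bytes, 0, length, charset);
    }
}
```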

(It might be simpler and/or more efficient to store the lines in a database rather than reading from a 1GB+ text file. A database will typically do some server-side caching for you, and there are various ways of implementing client-side caching.)

Upvotes: 2

Hasibul Hasan Rupok

Reputation: 5

You can use a BufferedReader to read from a file; it is faster than the other ways. Put simply, when you ask a BufferedReader for a line, it reads multiple lines at a time but gives you only the line you asked for; the rest are stored in its buffer. When you ask for another line, it first checks the buffer, and if the line is there it returns it from the buffer rather than going back to the file on disk for every read.

BufferedReader.readLine()!

Upvotes: -1

Parsifal

Reputation: 4486

The question is a little unclear, so I'm going to break out multiple possibilities and suggest solutions to each.

I keep track of specific line numbers that I will need to reference later

Do you know all of the lines that you'll need to retrieve before you start? Or have some mechanism to identify those lines?

If yes, then the simplest solution is to read the file once, using BufferedReader.readLine(). Store each retrieved line in a Map, keyed by its line number.
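
For that case, a single-pass sketch (the class and method names are illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class WantedLines {
    // Single pass: collect only the line numbers (1-based) you know you need.
    static Map<Integer, String> collect(String path, Set<Integer> wanted)
            throws IOException {
        Map<Integer, String> result = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            int number = 0;
            while ((line = reader.readLine()) != null) {
                number++;
                if (wanted.contains(number)) {
                    result.put(number, line);
                    if (result.size() == wanted.size()) {
                        break; // stop early once everything is found
                    }
                }
            }
        }
        return result;
    }
}
```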

The file is very large (say 1 GB or more)

At that size, I'd actually read the entire file into the Map -- 1 GB isn't that big. I'd probably use a TreeMap, since its internal structures are slightly smaller than a HashMap, and the difference in performance isn't going to be noticeable.

Instead of the line number, would it make more sense to somehow store the byte offset?

Yes, in fact that's what you must do. I would store the starting and ending offset of each line, but you can get away with storing just the start -- the end is the start of the next line minus 1.

It would be most efficient to store them in a long[] rather than an ArrayList<Long>, since the latter has object overhead for each entry. But with the former you'll have to either allocate an array larger than the likely number of lines, or handle resizing yourself.

You'll need to read the entire file to figure out the line separators. However, as long as the file uses the standard \n (or \r) line separator, and uses an encoding compatible with US-ASCII (eg, iso-8859-x, utf-8, or even utf-16/utf-32 as long as you pay attention to read size), then you don't need to do any character-set translation when reading the file.

I would use a BufferedInputStream, and read it a byte at a time. This translates to retrieving individual bytes out of an array, so it is at least as fast as any other approach -- at least using java.io.
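
Under those assumptions (a '\n' separator and an ASCII-compatible encoding), the index-building pass might be sketched like this; the names are illustrative, and the array doubles itself to handle the resizing mentioned above:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class OffsetIndex {
    // Scans the file once with a BufferedInputStream, recording the byte
    // offset at which each line starts, plus the file length as a sentinel.
    // Assumes '\n' separators and an ASCII-compatible encoding.
    static long[] buildIndex(String path) throws IOException {
        long[] offsets = new long[1024];
        int count = 0;
        long position = 0;
        offsets[count++] = 0; // line 0 starts at offset 0
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                position++;
                if (b == '\n') {
                    if (count == offsets.length) {
                        offsets = Arrays.copyOf(offsets, count * 2); // resize by hand
                    }
                    offsets[count++] = position; // the next line starts here
                }
            }
        }
        // If the file doesn't end with '\n', add the file length as the sentinel.
        if (offsets[count - 1] != position) {
            if (count == offsets.length) {
                offsets = Arrays.copyOf(offsets, count + 1);
            }
            offsets[count++] = position;
        }
        return Arrays.copyOf(offsets, count);
    }
}
```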

If your files are less than 2GB, then you can use a MappedByteBuffer. This will give you a slight performance boost when reading a byte at a time, because it doesn't need to copy bytes from the OS buffer onto the Java heap. It also leverages the OS to read and cache pages, which may result in more efficient reads (eg, because the OS decides to read more than one block at a time).
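
A sketch of that approach, assuming the whole file fits in a single mapping; the offsets would come from the index pass, and the names are illustrative:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedLineReader {
    private final MappedByteBuffer map;

    MappedLineReader(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // Map once up front; the mapping stays valid after the channel closes.
            map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Reads the bytes in [start, end) and decodes them with the given charset.
    String readRange(int start, int end, Charset charset) {
        byte[] bytes = new byte[end - start];
        map.position(start); // a "seek" is just a buffer position change
        map.get(bytes);      // a memory copy, no syscall
        return new String(bytes, charset);
    }
}
```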

For reading the lines, you need to seek to the starting offset, grab the bytes between it and the ending offset, and then call new String(b, charset).

If you don't use a MappedByteBuffer, then I'd recommend a FileChannel rather than a RandomAccessFile. This will also leverage the page cache (albeit to a lesser extent), and doesn't require a syscall for each read.
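A positional-read sketch along those lines, using FileChannel.read(ByteBuffer, long) so there is no separate seek call per lookup (the names are illustrative; in real code you'd keep the channel open across reads):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelLineReader {
    // Positional read: read(dst, position) does not move the channel's own
    // position, so no explicit seek is needed before each read.
    static String readRange(Path path, long start, long end, Charset charset)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate((int) (end - start));
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, start + buffer.position());
                if (n < 0) {
                    break; // hit end of file
                }
            }
        }
        return new String(buffer.array(), 0, buffer.position(), charset);
    }
}
```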

Upvotes: 0
