InBetween

Reputation: 32760

Obtaining a reliable character count from a text file's size in bytes

Ok, so the question is: given a random text file's FileInfo object, and knowing the encoding of said file (it can be ASCII, UTF7, UTF8, Unicode, etc.), is there a way to get the exact character count of the file without reading it?

You know the file's size in bytes through the FileInfo.Length property, so theoretically, knowing the character size of the encoding, you should be able to get the character count.

Testing with some encodings (ASCII, Unicode) seems to work, but others (UTF8, for instance) are slightly off.

Is this even possible in general, or do you always have to read the whole file to get a reliable character count?

Upvotes: 0

Views: 2329

Answers (2)

oleksii

Reputation: 35905

In the general case, it's not possible without reading the whole content.

The reason is that an encoding doesn't guarantee that a char takes exactly N bytes. For example, the default C# encoding Unicode, aka UTF-16, encodes some chars in 2 bytes and others in 4 (surrogate pairs, covered in the other answer; never 3). Some other encodings do give you an exact number, like ASCII, where each character is 7 bits padded to 8, i.e. exactly one byte.
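For example, a quick check with Encoding.GetByteCount shows the variation directly:

using System;
using System.Text;

class EncodingWidthDemo
{
    static void Main()
    {
        // UTF-16: 'a' is a single 2-byte code unit; U+1D11E (musical G clef)
        // needs a surrogate pair, i.e. 4 bytes.
        Console.WriteLine(Encoding.Unicode.GetByteCount("a"));          // 2
        Console.WriteLine(Encoding.Unicode.GetByteCount("\U0001D11E")); // 4

        // UTF-8 is variable-width too: 1 byte for ASCII, 2 to 4 otherwise.
        Console.WriteLine(Encoding.UTF8.GetByteCount("a"));  // 1
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"));  // 2
    }
}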

You can get a good estimate, but probably not an exact number.

One possible solution: show the user an estimate first, which is fast because you don't need to read the content, and if the user wants the exact number, read the content and count it, with a clear warning that this may take some time.

Upvotes: 1

Aidiakapi

Reputation: 6249

Problem

As mentioned before, it's not possible without reading all the characters, due to variable-width character encodings.

What you did is approximate the number of characters by assuming every character fits in the encoding's smallest unit. This will be exact for encodings like UTF8 or UTF16 when the file contains only ASCII characters.

Better approximation

If you know the target language, you might be able to approximate the character count better by assuming each character takes a certain number of bytes on average. For example, with UTF8 and English, most characters are 1 byte. You could say that on average a character takes 1.005 bytes (one 2-byte character every 200 characters), and get a better approximation.
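A minimal sketch of that estimate (EstimateCharacterCount is a made-up helper; the 1.005 factor is just the illustrative figure above, so tune it to your own text):

using System;
using System.IO;

static class CharCountEstimator
{
    // Estimates the character count from the byte length alone; nothing is
    // read beyond the metadata the FileInfo already carries.
    public static long EstimateCharacterCount(FileInfo file, double avgBytesPerChar = 1.005)
    {
        return (long)Math.Round(file.Length / avgBytesPerChar);
    }
}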

Faster decoding

Since the speed of reading the entire file is the problem here, I'm going to assume you're dealing with either massive files or large quantities of files. If neither of these is true, there's no point in trying to optimize anyway.

Memory issues

Both have their own problems. In the first case, the file likely won't fit into memory all at once (at least not contiguously, or alongside the rest of the running app). The solution is to stream the file instead of loading it all at once.

The downside is that C# doesn't provide an efficient built-in method for counting characters from a stream. The only built-in solution I can think of is the one listed in this SO answer. It does take surrogates into account, and you can specify the encoding.
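If you'd rather roll your own, here's a minimal sketch of the streaming approach. It counts .NET chars (UTF-16 code units, the same unit string.Length uses), and the Decoder carries state between reads so multi-byte sequences split across buffer boundaries are still decoded correctly:

using System;
using System.IO;
using System.Text;

static class StreamingCharCounter
{
    public static long CountCharacters(string filePath, Encoding encoding, int bufferSize = 64 * 1024)
    {
        Decoder decoder = encoding.GetDecoder();
        var byteBuffer = new byte[bufferSize];
        // GetMaxCharCount gives the worst-case chars one full byte buffer can produce.
        var charBuffer = new char[encoding.GetMaxCharCount(bufferSize)];
        long charCount = 0;

        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            int bytesRead;
            while ((bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
            {
                // GetChars (unlike GetCharCount) updates the decoder's state,
                // so a partial character at a buffer boundary carries over.
                charCount += decoder.GetChars(byteBuffer, 0, bytesRead, charBuffer, 0);
            }
        }
        return charCount;
    }
}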

Speed issues

If the problem is the sheer number of files, then you're likely already spending a lot of time fetching each file's metadata, in which case I recommend avoiding the problem altogether (see below). If you do need to read the files, you might gain some benefit from a specialized function that shares a large read buffer across multiple calls. Code sample:

/// <summary>
/// Counts all the characters in a file, sharing a read buffer across multiple calls.
/// </summary>
/// <param name="filePath">The path to the file.</param>
/// <param name="encoding">Encoding to use.</param>
/// <param name="buffer">The buffer to share; recreated if it cannot contain the file.</param>
/// <returns>The number of characters in the file.</returns>
public static int GetCharacterCount(string filePath, Encoding encoding, ref byte[] buffer)
{
    int fileLength;
    using (var fstream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        if (fstream.Length > int.MaxValue)
            throw new IOException("File is too large for a single in-memory buffer.");
        fileLength = (int)fstream.Length;

        // Expand the buffer if necessary.
        if (buffer == null || buffer.Length < fileLength)
            buffer = new byte[fileLength];

        // Read may return fewer bytes than requested, so loop until done.
        int totalRead = 0;
        while (totalRead < fileLength)
        {
            int read = fstream.Read(buffer, totalRead, fileLength - totalRead);
            if (read == 0)
                throw new EndOfStreamException("Couldn't read all bytes from the file.");
            totalRead += read;
        }
    }

    return encoding.GetCharCount(buffer, 0, fileLength);
}
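
For example, to share the buffer over a whole directory (the folder path and pattern are placeholders):

byte[] sharedBuffer = null;
foreach (var path in Directory.EnumerateFiles(@"C:\SomeFolder", "*.txt"))
{
    int count = GetCharacterCount(path, Encoding.UTF8, ref sharedBuffer);
    Console.WriteLine("{0}: {1} characters", path, count);
}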

Sidestepping the problem

Instead of counting the characters in a file on demand, you could try to avoid it altogether by counting once and storing the result. That way you don't even need to decode the files, but you do need to do some bookkeeping. If your workload queries often and creates or refreshes files only rarely, this might be your best approach. You can keep a cache of filenames and character counts, and query that instead of reading the actual files.

Whether this is a valid solution depends entirely on your use case.
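A minimal sketch of that bookkeeping (it assumes the GetCharacterCount method above is in scope, and uses the file's last-write time to invalidate stale entries):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class CharacterCountCache
{
    private struct Entry { public DateTime LastWrite; public int Count; }

    private readonly Dictionary<string, Entry> _cache = new Dictionary<string, Entry>();
    private readonly Encoding _encoding;
    private byte[] _buffer;

    public CharacterCountCache(Encoding encoding) { _encoding = encoding; }

    public int GetCount(string filePath)
    {
        // Re-count only when the file changed since the cached count was taken.
        var lastWrite = File.GetLastWriteTimeUtc(filePath);
        Entry entry;
        if (_cache.TryGetValue(filePath, out entry) && entry.LastWrite == lastWrite)
            return entry.Count; // Cache hit: no file read at all.

        int count = GetCharacterCount(filePath, _encoding, ref _buffer);
        _cache[filePath] = new Entry { LastWrite = lastWrite, Count = count };
        return count;
    }
}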

Optimizing the decoding

If you have no control over the input files, and they may be excessively large or excessively numerous, you could see major gains from writing specialized code, anything from more efficient file access patterns in C# to going as far as C with SIMD and cache optimizations. It's going to get hairy quickly regardless of which path you choose. In general, unless the sole purpose of your application is counting the characters in files, I wouldn't waste time on this.

Upvotes: 1
