Jader Dias

Reputation: 90465

How to find out how many characters a file has without reading the whole of it?

If the file is a text file, and StreamReader can figure out the Encoding it uses, how can I find out how many characters it has without reading the whole file?

I'm reading 1GB CSV files and it takes at least 4 seconds to read one with a StreamReader. File.ReadAllText().Length would cause a System.OutOfMemoryException.

I imagine that if I had the FileInfo(filename).Length and the Encoding, I could calculate the number of characters.

Upvotes: 3

Views: 3467

Answers (5)

Jeffrey L Whitledge

Reputation: 59453

For ASCII, CP-437, CP-1252, ISO-8859-1, or code pages similar to these, the number of characters will be the number of bytes.
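To make the single-byte case concrete, here is a minimal sketch in Python (chosen only to illustrate the arithmetic, not the .NET API). The helper name `char_count_single_byte` and the temporary demo file are assumptions for illustration; the point is just that the file size alone gives the count when every character is one byte.

```python
import os
import tempfile

def char_count_single_byte(path):
    """For a single-byte encoding, the character count equals the file size in bytes."""
    return os.path.getsize(path)

# Demo: write a small ISO-8859-1 (Latin-1) file and compare sizes.
text = "café,naïve\n" * 3
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as f:
    f.write(text.encode("latin-1"))
    demo_path = f.name

size_based = char_count_single_byte(demo_path)  # no file content is read
decoded_len = len(text)                         # count after decoding
os.remove(demo_path)
```

Here `size_based` and `decoded_len` agree, which would not hold if the same text were saved as UTF-8.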

If the file is in UTF-16, then you cannot know the exact number of characters from the number of bytes, because characters outside plane 0 take two 16-bit code units, but it will likely be close to the number of bytes / 2. In any case, you can calculate exactly how much memory is needed to hold the file in a .NET string: it will be the size of the file (since .NET uses UTF-16 internally) plus a constant overhead. The Length of such a string will be the number of bytes divided by 2 (not counting a byte-order mark, if the file has one).

If the file is in UTF-8 (or any other variable-width encoding), then the number of characters could be anywhere from the number of bytes (one character per byte) down to a fraction of it, since a single character can take several bytes. It just depends on the data.

If the file is in UTF-32 (which is extremely unlikely), then the number of characters will be exactly the length of the file in bytes divided by four. But even though this is the exact number of characters, it does not indicate the length of the .NET string created from this file, since that might involve the use of surrogate pairs for characters in the high planes, so the answer still depends on what you intend to do with the information.
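The difference between the UTF-32 character count and the resulting UTF-16 string length can be sketched like this. It is a Python illustration, and `utf16_code_units` is a hypothetical helper that mimics what a .NET string's Length would report by counting 16-bit code units.

```python
def utf16_code_units(text):
    """Number of UTF-16 code units, i.e. what a UTF-16 string length reports."""
    return len(text.encode("utf-16-le")) // 2

# 'a', one emoji from outside plane 0, 'b'
sample = "a\U0001F600b"

utf32_bytes = len(sample.encode("utf-32-le"))  # 4 bytes per character, no BOM
code_points = utf32_bytes // 4                 # exact character count
units = utf16_code_units(sample)               # the emoji needs a surrogate pair
```

With this sample, `code_points` is 3 while `units` is 4, because the emoji occupies two UTF-16 code units.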

Upvotes: 1

Amadan

Reputation: 198324

You can't. The reason is that some encodings (notably UTF-8) have variable character width: some characters take up only 1 byte (ASCII), a lot take up 2 bytes, and there are even cases with 3 or more bytes per character. Thus, without decoding the characters, it is impossible to know the length of the file under such an encoding.
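Decoding does not have to mean holding the whole file in memory, though. As a rough sketch (in Python, with a hypothetical function name and an arbitrary chunk size), an incremental decoder can count characters chunk by chunk while correctly handling multi-byte sequences that straddle a chunk boundary. Note that this counts Unicode code points, which can be lower than the UTF-16 length a .NET string would report.

```python
import codecs
import io

def count_chars_streaming(stream, encoding="utf-8", chunk_size=1 << 20):
    """Count decoded characters in a binary stream without keeping the text."""
    decoder = codecs.getincrementaldecoder(encoding)()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Incomplete trailing byte sequences are buffered by the decoder.
        total += len(decoder.decode(chunk))
    # Flush; raises UnicodeDecodeError if the data ends mid-character.
    total += len(decoder.decode(b"", final=True))
    return total

# Demo with a tiny chunk size so multi-byte characters get split across reads.
data = "naïve,€,😀\n".encode("utf-8")
streamed_count = count_chars_streaming(io.BytesIO(data), chunk_size=3)
```

For a real file, the stream would be opened in binary mode, for example `open(path, "rb")`.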

Also, all characters in C# strings are represented as UTF-16, AFAIK, so unless you have a very weird text (i.e. you're using many characters from outside plane 0), you can estimate the memory requirements in bytes rather easily by multiplying the character count by 2 (and vice versa, estimate the number of characters by halving the in-memory byte size).

Now, a better question is: why do you need the character count? What are you doing with the CSV file later that requires loading it all into memory, and why would knowing its size help?

Upvotes: 4

cusimar9

Reputation: 5259

The problem with this is that if the file is UTF-8 encoded, each character can occupy between 1 and 4 bytes, so you have no way of 'calculating' the number of characters without processing the file in some way.

Other encoding methods may prove more fruitful.
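One cheap way of "processing the file in some way" for UTF-8 is to skip full decoding and count only the bytes that start a character, since continuation bytes always have the bit pattern `10xxxxxx`. The sketch below is in Python, uses a hypothetical function name, and assumes the input is valid UTF-8 (it does no validation). A leading byte-order mark, if present, is counted as one character.

```python
import io

def count_utf8_code_points(stream, chunk_size=1 << 20):
    """Count code points in a valid UTF-8 byte stream without decoding it."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Continuation bytes are 0x80..0xBF; every other byte starts a character.
        continuation = sum(1 for b in chunk if (b & 0xC0) == 0x80)
        total += len(chunk) - continuation
    return total

data = "naïve,€,😀\n".encode("utf-8")
lead_byte_count = count_utf8_code_points(io.BytesIO(data), chunk_size=3)
```

Because each byte is classified on its own, chunk boundaries that fall inside a character do not affect the result.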

Upvotes: 0

GvS

Reputation: 52518

For some encodings this works (ASCII, Windows-1252, IBM-850, etc.), but not for UTF-8 and UTF-7, since they have some characters encoded as 1 byte, some as 2 (and I believe some as even more than 2).

Upvotes: 0

carlosfigueira

Reputation: 87228

I don't think you really can. Some encodings use a different number of bytes for different characters, so you'd really need to convert the bytes into characters to find the number of characters.

For example, in UTF-8, the characters from \u0000 to \u007F are represented in 1 byte only; those between \u0080 and \u07FF need 2 bytes, and so on.
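The full set of ranges can be written down as a small lookup. This is a Python sketch with a hypothetical function name; the boundaries follow the standard UTF-8 scheme (1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, 4 up to U+10FFFF).

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses to encode a single Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Boundary values on either side of each range, plus the euro sign and an emoji.
samples = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x20AC, 0xFFFF, 0x10000, 0x1F600]
lengths = [utf8_length(cp) for cp in samples]
```

Summing `utf8_length` over the characters of a text gives its UTF-8 size, but going the other way (from size to character count) is not possible without looking at the bytes.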

Upvotes: 0
