Reputation: 1897
I'm doing a lot of string manipulation in C#, and I really need the strings to be stored one byte per character. I need gigabytes of text in memory simultaneously, and the resulting memory pressure is causing problems. I know for certain that this text will never contain non-ASCII characters, so for my purposes the fact that System.String and System.Char store everything as two bytes per character is both unnecessary and a real problem.
I'm about to start coding my own CharAscii and StringAscii classes - the string one will basically hold its data as byte[], and expose string manipulation methods similar to the ones that System.String does. However, this seems like a lot of work for what looks like a very standard problem, so I'm really posting here to check that there isn't already an easier solution. Is there, for example, some way I can make System.String store its data internally as UTF-8 that I haven't noticed, or some other way around the problem?
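For what it's worth, here is a minimal sketch of the sort of wrapper I have in mind (the class name and members are just placeholders, not an existing API):

```csharp
using System;
using System.Text;

// Hypothetical one-byte-per-character string: stores text as byte[]
// and converts to System.String only at the boundaries.
public sealed class AsciiString
{
    private readonly byte[] _data;

    public AsciiString(string s)
    {
        // Encoding.ASCII maps each character to a single byte;
        // non-ASCII input would be silently replaced with '?'.
        _data = Encoding.ASCII.GetBytes(s);
    }

    private AsciiString(byte[] data) { _data = data; }

    public int Length => _data.Length;

    public char this[int index] => (char)_data[index];

    public AsciiString Substring(int start, int length)
    {
        var slice = new byte[length];
        Array.Copy(_data, start, slice, 0, length);
        return new AsciiString(slice);
    }

    public override string ToString() => Encoding.ASCII.GetString(_data);
}
```

The open question is whether something like this already exists, or whether I'd have to reimplement every System.String method I need by hand.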
Upvotes: 38
Views: 38620
Reputation: 1186
As I see it, your problem is that char in C# occupies two bytes instead of one.
One way to read a text file is to open it with:
System.IO.FileStream fs = new System.IO.FileStream(file, System.IO.FileMode.Open);
System.IO.BinaryReader br = new System.IO.BinaryReader(fs);
// Size the buffer to the file so the whole file fits in one read;
// a fixed 1024-byte buffer would throw on files larger than that.
byte[] buffer = new byte[fs.Length];
int read = br.Read(buffer, 0, (int)fs.Length);
br.Close();
fs.Close();
And this way you are reading the raw bytes from the file. I tried it with *.txt files encoded in UTF-8 (where ASCII characters take one byte each, though other characters take more) and in ANSI (one byte per character).
Upvotes: 0
Reputation: 113242
Not really. System.String is designed for storing strings in general; your requirement is for a very particular subset of strings with particular memory benefits.
Now, "very particular subset of strings with particular memory benefits" comes up a lot, but it's not always the same very particular subset. Text that is ASCII-only generally isn't meant for reading by human beings, so it tends to be either short codes, or something that can be handled in a stream-processing manner, or else chunks of text mixed in with bytes doing other jobs (e.g. quite a few binary formats have small sections that translate directly to ASCII).
As such, you've a pretty strange requirement.
All the more so when you come to the gigabytes part. If I'm dealing with gigs, I'm immediately thinking about how I can stop having to deal with gigs, and/or get much more serious savings than just 50%. I'd be thinking about mapping chunks I'm not currently interested in to a file, or about ropes, or about a bunch of other things. Of course, those are going to work for some cases and not for all, so yet again, we're not talking about something where .NET should stick in something as a one-size-fits-all, because one size will not fit all.
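As a rough illustration of the file-mapping idea above (the path, file contents, and window size here are arbitrary), you can view a chunk of a large file without loading the rest into memory:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class ChunkDemo
{
    static void Main()
    {
        // Write a sample 1 MB ASCII file so the example is self-contained.
        string path = Path.Combine(Path.GetTempPath(), "chunk-demo.txt");
        File.WriteAllText(path, new string('a', 1 << 20), Encoding.ASCII);

        // Map the file and read only a 64 KB window of it;
        // the OS pages the rest in and out as needed.
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, 64 * 1024))
        {
            var buffer = new byte[64 * 1024];
            view.ReadArray(0, buffer, 0, buffer.Length);
            Console.WriteLine((char)buffer[0]); // first byte of the window
        }
    }
}
```

Whether that helps depends entirely on the access pattern, which is exactly why it isn't a one-size-fits-all answer.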
Beyond that, just the UTF-8 bit isn't that hard. It's all the other methods that become the work. And again, what you need there won't be the same as what someone else needs.
Upvotes: 3
Reputation: 2895
As you've found, the CLR uses UTF-16 for character encoding. Your best bet may be to use the Encoding classes & a BitConverter to handle the text. This question has some good examples for converting between the two encodings:
Convert String (UTF-16) to UTF-8 in C#
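As a short sketch of the kind of conversion that question covers (the string contents here are arbitrary):

```csharp
using System.Text;

string s = "hello world";

// The CLR's native representation: 2 bytes per char (22 bytes here).
byte[] utf16Bytes = Encoding.Unicode.GetBytes(s);

// Re-encode as UTF-8: 1 byte per character for ASCII text (11 bytes here).
byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

// And back to a System.String when you need to manipulate it.
string roundTrip = Encoding.UTF8.GetString(utf8Bytes);
```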
Upvotes: 6
Reputation: 71563
Well, you could create a wrapper that retrieves the data as UTF-8 bytes and converts pieces as needed to System.String, then vice-versa to push the string back out to memory. The Encoding class will help you out here:
var utf8 = Encoding.UTF8;                        // System.Text.Encoding
byte[] utfBytes = utf8.GetBytes(myString);       // 1 byte per character for ASCII text
var myReturnedString = utf8.GetString(utfBytes); // decode back to a System.String on demand
Upvotes: 12