xmen

Reputation: 1967

Limit UTF-8 encoded bytes length from string

I need to limit the length of the byte[] produced by encoding a string as UTF-8. For example, the byte[] length must be less than or equal to 1000. First I wrote the following code:

            int maxValue = 1000;

            if (text.Length > maxValue)
                text = text.Substring(0, maxValue);
            var textInBytes = Encoding.UTF8.GetBytes(text);

This works well if the string uses only ASCII characters, because each of those encodes to 1 byte. But characters beyond ASCII take 2, 3, or 4 bytes each in UTF-8 (a character outside the Basic Multilingual Plane takes 4 bytes and occupies two chars, a surrogate pair, in a .NET string), so the code above can produce more than 1000 bytes. To fix that problem I wrote this:

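            List<byte> textInBytesList = new List<byte>();
            char[] textInChars = text.ToCharArray();
            for (int a = 0; a < textInChars.Length; a++)
            {
                // Encode a surrogate pair as one unit; half a pair on its own would encode as U+FFFD.
                int count = (a + 1 < textInChars.Length
                    && char.IsSurrogatePair(textInChars[a], textInChars[a + 1])) ? 2 : 1;
                byte[] valueInBytes = Encoding.UTF8.GetBytes(textInChars, a, count);
                // Stop before the first character that would push the total past the limit.
                if ((textInBytesList.Count + valueInBytes.Length) > maxValue)
                    break;

                textInBytesList.AddRange(valueInBytes);
                a += count - 1; // skip the low surrogate that was just encoded
            }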
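The per-character byte counts described above can be checked directly. This is a minimal sketch; the class and method names (`Utf8Widths`, `Of`) are made up for illustration, and only `Encoding.UTF8.GetByteCount` is a real framework call.

```csharp
using System.Text;

public static class Utf8Widths
{
    // Number of bytes the given string occupies when encoded as UTF-8.
    // Pass a string holding one code point to see that character's width.
    public static int Of(string s) => Encoding.UTF8.GetByteCount(s);
}
```

For instance, `Utf8Widths.Of("a")` is 1, `Utf8Widths.Of("\u00E9")` is 2, `Utf8Widths.Of("\u20AC")` is 3, and `Utf8Widths.Of("\U0001F600")` is 4, even though that last string has a Length of 2.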

I haven't tested this code, but I'm sure it will work as I want. However, I don't like the way it is done. Is there a better way to do this? Something I'm missing or not aware of?

Thank you.

Upvotes: 0

Views: 2705

Answers (1)

Kate Sinclair

Reputation: 11

My first posting on Stack Overflow, so be gentle! This method should take care of things pretty quickly for you:

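    public static byte[] GetBytes(string text, int maxArraySize, Encoding encoding) {
        if (string.IsNullOrEmpty(text)) return null;

        // Every char encodes to at least one byte, so no fitting prefix is longer than maxArraySize chars.
        int tail = Math.Min(text.Length, maxArraySize);
        // Don't let the prefix end between the two halves of a surrogate pair.
        if (tail > 0 && tail < text.Length && char.IsSurrogatePair(text[tail - 1], text[tail])) --tail;
        int size = encoding.GetByteCount(text.Substring(0, tail));
        while (tail > 0 && size > maxArraySize) {
            // Drop one whole character from the end: two chars for a surrogate pair, otherwise one.
            int step = (tail >= 2 && char.IsSurrogatePair(text[tail - 2], text[tail - 1])) ? 2 : 1;
            size -= encoding.GetByteCount(text.Substring(tail - step, step));
            tail -= step;
        }

        return encoding.GetBytes(text.Substring(0, tail));
    }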

It's similar to what you're doing, but without the added overhead of the List or of counting from the beginning of the string every time. I start from the other end of the string and trim characters until the encoded size fits. The assumption is, of course, that every character takes at least one byte, so a prefix that fits in maxArraySize bytes can never be longer than maxArraySize characters. That means there's no sense in starting to iterate down through the string any farther in than maxArraySize (or the total length of the string).

Then you can call the method like so:

        byte[] bytes = GetBytes(text, 1000, Encoding.UTF8);
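The same "work back from the end" idea can also be applied to the encoded bytes rather than to the string. The sketch below is a different technique from the method above, and it only works for UTF-8: it encodes once, then backs the cut point up past any continuation bytes (those of the form 10xxxxxx) so that no multi-byte sequence is split. The names `Utf8Truncate` and `GetBytesCapped` are hypothetical, and the sketch assumes that encoding the whole string up front is acceptable for your input sizes.

```csharp
using System;
using System.Text;

public static class Utf8Truncate
{
    // Hypothetical helper: encodes text as UTF-8 and returns at most maxBytes bytes,
    // cutting only on a code-point boundary.
    public static byte[] GetBytesCapped(string text, int maxBytes)
    {
        if (maxBytes < 0) throw new ArgumentOutOfRangeException(nameof(maxBytes));

        byte[] all = Encoding.UTF8.GetBytes(text ?? string.Empty);
        if (all.Length <= maxBytes) return all;

        // all[cut] is the first byte that would be dropped. If it is a continuation
        // byte (10xxxxxx), the cut falls inside a multi-byte sequence, so move back
        // until all[cut] is the lead byte of that sequence.
        int cut = maxBytes;
        while (cut > 0 && (all[cut] & 0xC0) == 0x80) cut--;

        byte[] result = new byte[cut];
        Array.Copy(all, result, cut);
        return result;
    }
}
```

A call such as `Utf8Truncate.GetBytesCapped(text, 1000)` backs up at most three bytes from the limit, so the loop is cheap. The cost is encoding the entire string first, and the cut respects code points only, so a base letter can still be separated from a combining mark that follows it.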

Upvotes: 1
