Reputation: 17

Number of bytes used for Unicode characters in varchar

A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored

According to this statement from Microsoft, I assume, n is the data length of a string and when we store unicode characters in varchar, a single character should take 2 bytes. But, when I try with a sample as below, I see varchar data taking 1 byte instead of 2 bytes.

declare @varchar varchar(6), @nvarchar nvarchar(6)

set @varchar = 'Ø'

select @varchar as VarcharString, len(@varchar) as VarcharStringLength, DATALENGTH(@varchar) as VarcharStringDataLength

Could someone explain the reason behind it?

Upvotes: 0

Answers (4)

lptr

Reputation: 6808

You can store unicode in varchar (if you want to), however every byte is interpreted as a single character, while unicode (for sql server, utf16, ucs2) uses 2 bytes for a single character and you have to account for that, when displaying unicode stored in varchar.

declare @nv nvarchar(10) = N'❤'
select @nv;

declare @v varchar(10) = cast(cast(@nv as varbinary(10)) as varchar(10))
select @v, len(@v); --two chars

select cast(@nv as varbinary(10)), cast(@v as varbinary(10)); --same bytes in both n/var char
--display nchar from char
select cast(cast(@v as varbinary(10)) as nvarchar(10));

Upvotes: 0

devio

Reputation: 37225

Found time to test the assumptions of my first answer:

Create UTF8-enabled database

CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8

Create table with all kinds of N/VARCHAR columns

CREATE TABLE [dbo].[UTF8Test]( [Id] [int] IDENTITY(1,1) NOT NULL, [VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL, [VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL, [VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL, [NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL, [NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL, [NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL)

Insert test data from various Unicode ranges

INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]) VALUES ('a','a','a','a','a','a') INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]) VALUES ('ö','ö','ö',N'ö',N'ö',N'ö') -- U+56D7 INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]) VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗') -- U+2000B INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8]) VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')

SELECT lengths

SELECT TOP (1000) [Id] ,[VarcharText] ,[VarcharTextSC] ,[VarcharUTF8] ,[NVarcharText] ,[NVarcharTextSC] ,[NVarcharUTF8] FROM [test-sc].[dbo].[UTF8Test] SELECT TOP (1000) [Id] ,LEN([VarcharText]) VT ,LEN([VarcharTextSC]) VTSC ,LEN([VarcharUTF8]) VU ,LEN([NVarcharText]) NVT ,LEN([NVarcharTextSC]) NVTSC ,LEN([NVarcharUTF8]) NVU FROM [test-sc].[dbo].[UTF8Test] SELECT TOP (1000) [Id] ,DATALENGTH([VarcharText]) VT ,DATALENGTH([VarcharTextSC]) VTSC ,DATALENGTH([VarcharUTF8]) VU ,DATALENGTH([NVarcharText]) NVT ,DATALENGTH([NVarcharTextSC]) NVTSC ,DATALENGTH([NVarcharUTF8]) NVU FROM [test-sc].[dbo].[UTF8Test]

I was surprised to find that the old mantra "a VARCHAR only stores single byte characters" needs to be revised when using UTF8 collations.

Note that only table columns are associated with collations, but not T-SQL variables:

SELECT @VarcharText = [VarcharText], @NVarcharText = [NVarcharText] FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4 SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText) SELECT @VarcharText = [VarcharTextSC], @NVarcharText = [NVarcharTextSC] FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4 SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText) SELECT @VarcharText = [VarcharUTF8], @NVarcharText = [NVarcharUTF8] FROM [test-sc].[dbo].[UTF8Test] WHERE [Id] = 4 SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)

Upvotes: 1

Jay Rindani

Reputation: 40

declare @char varchar(4)
declare @nvarchar nvarchar(4)

Set @char = '@'
Set @nvarchar = '@'

select @char as charString, 
LEN(@char) as charStringLength,
DATALENGTH(@char) as charStringDataLength

select @nvarchar as nvarcharString, 
LEN(@nvarchar) as nvarcharStringLength,
DATALENGTH(@nvarchar) as nvarcharStringDataLength

Upvotes: 0

devio

Reputation: 37225

I thought the original quote was a bit confusion, as it continues

The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.

but since it mentions encodings, my guess is that the statement refers to the UTF encodings supported in SQL Server 2019 and higher which seem to allow (I haven't tried yet) to store Unicode in VARCHAR columns.

Upvotes: 0

Number of bytes used for Unicode characters in varchar

Answers (4)

Related Questions