Reputation: 17
A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored
According to this statement from Microsoft, I assume, n is the data length of a string and when we store unicode characters in varchar
, a single character should take 2 bytes. But, when I try with a sample as below, I see varchar
data taking 1 byte instead of 2 bytes.
declare @varchar varchar(6), @nvarchar nvarchar(6)
set @varchar = 'Ø'
select @varchar as VarcharString, len(@varchar) as VarcharStringLength, DATALENGTH(@varchar) as VarcharStringDataLength
Could someone explain the reason behind it?
Upvotes: 0
Views: 2699
Reputation: 6788
You can store unicode in varchar (if you want to), however every byte is interpreted as a single character, while unicode (for sql server, utf16, ucs2) uses 2 bytes for a single character and you have to account for that, when displaying unicode stored in varchar.
declare @nv nvarchar(10) = N'❤'
select @nv;
declare @v varchar(10) = cast(cast(@nv as varbinary(10)) as varchar(10))
select @v, len(@v); --two chars
select cast(@nv as varbinary(10)), cast(@v as varbinary(10)); --same bytes in both n/var char
--display nchar from char
select cast(cast(@v as varbinary(10)) as nvarchar(10));
Upvotes: 0
Reputation: 37215
Found time to test the assumptions of my first answer:
CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8
CREATE TABLE [dbo].[UTF8Test](
[Id] [int] IDENTITY(1,1) NOT NULL,
[VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL,
[VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
[VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
[NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
[NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
[NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL)
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')
SELECT TOP (1000) [Id]
,[VarcharText]
,[VarcharTextSC]
,[VarcharUTF8]
,[NVarcharText]
,[NVarcharTextSC]
,[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
,LEN([VarcharText]) VT
,LEN([VarcharTextSC]) VTSC
,LEN([VarcharUTF8]) VU
,LEN([NVarcharText]) NVT
,LEN([NVarcharTextSC]) NVTSC
,LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
,DATALENGTH([VarcharText]) VT
,DATALENGTH([VarcharTextSC]) VTSC
,DATALENGTH([VarcharUTF8]) VU
,DATALENGTH([NVarcharText]) NVT
,DATALENGTH([NVarcharTextSC]) NVTSC
,DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
I was surprised to find that the old mantra "a VARCHAR
only stores single byte characters" needs to be revised when using UTF8 collations.
SELECT @VarcharText = [VarcharText],
@NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)
SELECT @VarcharText = [VarcharTextSC],
@NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)
SELECT @VarcharText = [VarcharUTF8],
@NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, Len(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, Len(@NVarcharText), DATALENGTH(@NVarcharText)
Upvotes: 1
Reputation: 40
declare @char varchar(4)
declare @nvarchar nvarchar(4)
Set @char = '@'
Set @nvarchar = '@'
select @char as charString,
LEN(@char) as charStringLength,
DATALENGTH(@char) as charStringDataLength
select @nvarchar as nvarcharString,
LEN(@nvarchar) as nvarcharStringLength,
DATALENGTH(@nvarchar) as nvarcharStringDataLength
Upvotes: 0
Reputation: 37215
I thought the original quote was a bit confusion, as it continues
The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n.
but since it mentions encodings, my guess is that the statement refers to the UTF encodings supported in SQL Server 2019 and higher which seem to allow (I haven't tried yet) to store Unicode in VARCHAR
columns.
Upvotes: 0