IHateATMFees
IHateATMFees

Reputation: 346

Reading characters from an unknown character encoding

I have a string that came from an old database of unknown character encoding. I am having trouble encoding/filtering the string to show the correct text.

What the data looks like in the database: Marronnière à quatre pans
What we need the string to show up as: Marronnière à quatre pans

Specifically, I am having trouble parsing the string so I can display the character à (à)

This is an asp.Net 2.0 site written in VB using a Sql Server 2005 Database. Not sure if it matters, but data comes from a column with this collation: SQL_Latin1_General_CP1_CI_AS

I've tried encoding the string to various encodings in the code to no avail. I've also passed the string (encoded different ways) into a byte array to find a unique byte pattern for the bad characters without success.

Any ideas or leads would be greatly appreciated, thanks.

Upvotes: 0

Views: 840

Answers (1)

MarkJ
MarkJ

Reputation: 30398

It sounds like the collation in the SQL Server database doesn't match the character encoding that was actually used :( It's a fairly common mistake for careless developers.

That's why the SQL Server administration tools are showing weird characters rather than the strings that you are expecting.

Possibly it is UTF-8? In UTF-8 à is represented by the bytes 0xC3 0xA8, which would be interpreted under the Windows code page Latin-1 as è. I know nothing about SQL Server collations, but it seems likely that SQL_Latin1_CP1_CI_AS is similar to Windows "Latin-1".

You either need to

  • fix up the encoding when reading from the database. Ugly and confusing for the next poor victim who has to deal with this database and code.
  • or, better, correct the data in the database so that it matches the collation. You might like to change the collation to UTF-8 or UTF-16: you will also need to change the data though.

Upvotes: 1

Related Questions