MySQL - find and fix incorrect characters

I have a table in MySQL that contains people's names and now some people are putting in characters with accents. Whatever original character set/collation this database used, it could not handle characters with accents and characters. Characters such as 'é' turned into 'é', characters such as 'ü' turned into 'ü' in the front-end applications that connect to this database. The database has since been modified (as well as most front-end applications that needed it) so that these fields use a UTF8 encoding and collation. Most names with accents now render correctly.

The problem is, there are some fields that now have a literal 'é' (some weird double-encoding caused by manually copying and pasting stuff? I don't know, I wasn't there!). I now need to find all of these and modify these fields to use the correct accented characters. To find them, I wrote a query (based on the table I found here):

select count(*), bad_char
from some_table
inner join (
  select '€' as bad_char union
  select '‚' as bad_char union
  select 'Æ’' as bad_char union
  select '„' as bad_char union
  select '…' as bad_char union
  ...
  -- snip a whole bunch
  ...
  select 'þ' as bad_char union
  select 'ÿ' as bad_char ) bad_chars
where some_table.some_text_field like CONCAT('%',bad_chars.bad_char,'%')
group by bad_char
order by count(*);

And I get results like this:

count(*), bad_char
------------------
'1',     '¯'
'1',     'Ñ'
'1',     'Ö'
'1',     'Ž'
'1',     'Ç'
'1',     '¬'
...snip...
'1797',  'ß'
'4450',  'Ê'

But I have a feeling this doesn't quite work the way I think it does. One of the "bad_char" combinations I'm searching for is 'Ê' but when I run a query like this:

select some_text_field from some_table where some_text_field like '%Ê%';

I get many results that are 'as' which are the same letters but without the accents. But for other cases such as '¯' the query seems to work fine.

How can I get this query to not treat 'as' the same as 'Ê'?

Upvotes: 2

Views: 3561

Answers (3)

andreszs
andreszs

Reputation: 2956

For anyone else trying to fix broken UTF8 accented text on latin1 fields, when some of them have proper latin1 and others have broken UTF8, the solution is this:

update posts set post_text = @txt where length(post_text) = length(@txt := convert(binary post_text using utf8));

This will fix only those rows with wrongly encoded UTF8 back into proper latin1 encoding, and á will become á, etc.

Upvotes: 0

Rick James
Rick James

Reputation: 142278

For Mojibake ('é' turned into 'é'), the fix for the data is

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

Your SELECTs sound like Mojibake -- € should have been , correct?

The fix for the code is to understand that

  • The bytes you have in the client are correctly encoded in utf8 (good).
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.

The SET NAMES is often done by language-specific code; what programming language are you using.

Use SHOW CREATE TABLE to see what CHARACTER SET you are using.

Upvotes: 1

C3roe
C3roe

Reputation: 96250

I get many results that are 'as' which are the same letters but without the accents.

That would be an issue of the collation used - those are rule sets for character comparison, and they define which characters are to be treated as equal in different languages.

But you can use the BINARYoperator to change that directly within the query.

Upvotes: 1

Related Questions