Reputation: 222771
I have a database where some of the elements consist of HTML special characters:
| Universidad Tecnológica Nacional - UTN |
| Instituto Tecnológico de Buenos Aires |
| Instituto Superior del Profesorado "Dr. Joaquín V. González" |
| Escuela Nacional de Náutica "Manuel Belgrano" |
| Conservatorio Nacional de Música "Carlos López Buchardo" |
| Instituto Argentino de Computacion - IAC |
| Conservatorio de Superior de Música "Manuel de Falla" |
I need to convert it to a proper UTF format. Can I do better than just iterating through the database, and having a mapping from each code to the equivalent symbol?
á -> 'á'
" -> '"'
...
Upvotes: 2
Views: 8500
Reputation: 795
I had to do the same thing, although my fields were in UTF-8 so ' '= '%20'. Here is a simple query to fix (worked on my tables.. of course you need to adjust for your own):
update TABLE set FIELD_NAME=replace(FIELD_NAME,'&#225','á');
I got this from a similar question: Search and replace part of string in database
Also if you have a small number of character replacements, this is still very easy to do.
Upvotes: 0
Reputation: 198116
If I would be you, I would double cross my fingers and take care that:
What I probably would do is to create some PHP routine that is generating queries, one per line. Checking for rows that contain the double-encoded values and generating the needed changes.
That is taking care of having the unique ID of each row (primary key) in the WHERE criteria so it's known which rows are changed, not which data.
If you're done, you can run this SQL as a batch against the database. You can also run it against a copy of the actual database so to do some kind of dry-run and also to do some quality checks later on.
Hope this helps.
Upvotes: 1
Reputation: 125955
As mentioned in my comment above, it's terribly unclear what you're trying to do in your own case.
Can I do better than just iterating through the database, and having a mapping from each code to the equivalent symbol?
Well, yes. You can replace character code entities (e.g. {
and ƫ
) with their replacement characters without having to lookup the character code in a "mapping". But named entities (e.g. "
) will always need to be looked up.
Here's my attempt to solve the general case:
Create a table to store named character entities defined in HTML:
CREATE TABLE ents (
ref VARCHAR(8) NOT NULL COLLATE utf8_bin,
rep CHAR(1) NOT NULL,
PRIMARY KEY (ref)
);
Populate this table - I suggest using a script, for example from PHP:
$dbh = new PDO("mysql:dbname=$dbname", $username, $password);
$dbh->setAttribute(PDO::ATTR_EMULATE_PREPARES, FALSE);
$ins = $dbh->prepare('INSERT INTO ents (ref, rep) VALUES (?, ?)');
$t = get_html_translation_table(HTML_ENTITIES);
foreach ($t as $k => $v) $ins->execute([substr($v, 1, -1), $k]);
Define an SQL function to perform entity replacements (using this table where applicable, or else by character code):
DELIMITER ;;
CREATE FUNCTION dhe(s TEXT) RETURNS TEXT
BEGIN
DECLARE n, p, i, t INT DEFAULT 0;
DECLARE r VARCHAR(12);
entity_search: LOOP
SET n := LOCATE('&', s, n+1);
IF (!n) THEN
LEAVE entity_search;
END IF;
IF (SUBSTRING(s, n+1, 1) = '#') THEN
CASE
WHEN SUBSTRING(s, n+2, 1) RLIKE '[[:digit:]]' THEN
SET t := 2, p := n+2, r := '[[:digit:]]';
WHEN SUBSTRING(s, n+2, 1) = 'x' THEN
SET t := 3, p := n+3, r := '[[:xdigit:]]';
ELSE ITERATE entity_search;
END CASE;
ELSE
SET t := 1, p := n+1, r := '[[:alnum:]_]';
END IF;
SET i := 0;
reference: LOOP
IF SUBSTRING(s, p+i, 1) NOT RLIKE r THEN
IF SUBSTRING(s, p+i, 1) RLIKE '[[:alnum:]_]' THEN
ITERATE entity_search;
END IF;
LEAVE reference;
END IF;
IF i = 8 THEN ITERATE entity_search; END IF;
SET i := i + 1;
END LOOP reference;
SET s := CONCAT(
LEFT(s, n-1),
CASE t
WHEN 1 THEN COALESCE(
(SELECT rep FROM ents WHERE ref = SUBSTRING(s, p, i))
, SUBSTRING(s, n, i + IF(SUBSTRING(s, p+i, 1)=';',1,0))
)
WHEN 2 THEN CHAR(SUBSTRING(s, p, i))
WHEN 3 THEN CHAR(CONV(SUBSTRING(s, p, i), 16, 10))
END,
SUBSTRING(s, p + i + IF(SUBSTRING(s, p+i, 1)=';',1,0))
);
END LOOP entity_search;
RETURN s;
END;;
DELIMITER ;
Apply this function twice to decode your (apparently) doubly-encoded table:
UPDATE my_table SET my_column = dhe(dhe(my_column));
Upvotes: 6
Reputation: 806
You can go at this a couple of ways.
UPDATE all of the HTML entity data into the correct encoding
This is the best long-term solution. Constantly converting between html entities is wasting CPU time. This may not be much for 1 query (100ms or less), but scale that up to 1,000+ users, doing it dozens of times a second, and it quickly becomes a meaningful amount of CPU time.
Used an SQL stored procedure / function to convert.
I've done this a few times in the past, it is a quick fix. The upside is that you can reuse this functionality, however you'll have to manually add every HTML entity instance you want to convert, which gets tedious very fast. Here is the function I wrote.
CREATE DEFINER = `root`@`localhost` FUNCTION `NewProc`(x longtext)
RETURNS longtext
NO SQL
DETERMINISTIC
BEGIN
DECLARE TextString LONGTEXT;
SET TextString = x ;
#quotation mark
IF INSTR( x , '"' )
THEN SET TextString = REPLACE(TextString, '"','"') ;
END IF ;
#apostrophe
IF INSTR( x , ''' )
THEN SET TextString = REPLACE(TextString, ''','"') ;
END IF ;
#ampersand
IF INSTR( x , '&' )
THEN SET TextString = REPLACE(TextString, '&','&') ;
END IF ;
#less-than
IF INSTR( x , '<' )
THEN SET TextString = REPLACE(TextString, '<','<') ;
END IF ;
#greater-than
IF INSTR( x , '>' )
THEN SET TextString = REPLACE(TextString, '>','>') ;
END IF ;
#non-breaking space
IF INSTR( x , ' ' )
THEN SET TextString = REPLACE(TextString, ' ',' ') ;
END IF ;
#inverted exclamation mark
IF INSTR( x , '¡' )
THEN SET TextString = REPLACE(TextString, '¡','¡') ;
END IF ;
#cent
IF INSTR( x , '¢' )
THEN SET TextString = REPLACE(TextString, '¢','¢') ;
END IF ;
#pound
IF INSTR( x , '£' )
THEN SET TextString = REPLACE(TextString, '£','£') ;
END IF ;
#currency
IF INSTR( x , '¤' )
THEN SET TextString = REPLACE(TextString, '¤','¤') ;
END IF ;
#yen
IF INSTR( x , '¥' )
THEN SET TextString = REPLACE(TextString, '¥','¥') ;
END IF ;
#broken vertical bar
IF INSTR( x , '¦' )
THEN SET TextString = REPLACE(TextString, '¦','¦') ;
END IF ;
#section
IF INSTR( x , '§' )
THEN SET TextString = REPLACE(TextString, '§','§') ;
END IF ;
#spacing diaeresis
IF INSTR( x , '¨' )
THEN SET TextString = REPLACE(TextString, '¨','¨') ;
END IF ;
#copyright
IF INSTR( x , '©' )
THEN SET TextString = REPLACE(TextString, '©','©') ;
END IF ;
#feminine ordinal indicator
IF INSTR( x , 'ª' )
THEN SET TextString = REPLACE(TextString, 'ª','ª') ;
END IF ;
#angle quotation mark (left)
IF INSTR( x , '«' )
THEN SET TextString = REPLACE(TextString, '«','«') ;
END IF ;
#negation
IF INSTR( x , '¬' )
THEN SET TextString = REPLACE(TextString, '¬','¬') ;
END IF ;
#soft hyphen
IF INSTR( x , '­' )
THEN SET TextString = REPLACE(TextString, '­','') ;
END IF ;
#registered trademark
IF INSTR( x , '®' )
THEN SET TextString = REPLACE(TextString, '®','®') ;
END IF ;
#spacing macron
IF INSTR( x , '¯' )
THEN SET TextString = REPLACE(TextString, '¯','¯') ;
END IF ;
#degree
IF INSTR( x , '°' )
THEN SET TextString = REPLACE(TextString, '°','°') ;
END IF ;
#plus-or-minus
IF INSTR( x , '±' )
THEN SET TextString = REPLACE(TextString, '±','±') ;
END IF ;
#superscript 2
IF INSTR( x , '²' )
THEN SET TextString = REPLACE(TextString, '²','²') ;
END IF ;
#superscript 3
IF INSTR( x , '³' )
THEN SET TextString = REPLACE(TextString, '³','³') ;
END IF ;
#spacing acute
IF INSTR( x , '´' )
THEN SET TextString = REPLACE(TextString, '´','´') ;
END IF ;
#micro
IF INSTR( x , 'µ' )
THEN SET TextString = REPLACE(TextString, 'µ','µ') ;
END IF ;
#paragraph
IF INSTR( x , '¶' )
THEN SET TextString = REPLACE(TextString, '¶','¶') ;
END IF ;
#middle dot
IF INSTR( x , '·' )
THEN SET TextString = REPLACE(TextString, '·','·') ;
END IF ;
#spacing cedilla
IF INSTR( x , '¸' )
THEN SET TextString = REPLACE(TextString, '¸','¸') ;
END IF ;
#superscript 1
IF INSTR( x , '¹' )
THEN SET TextString = REPLACE(TextString, '¹','¹') ;
END IF ;
#masculine ordinal indicator
IF INSTR( x , 'º' )
THEN SET TextString = REPLACE(TextString, 'º','º') ;
END IF ;
#angle quotation mark (right)
IF INSTR( x , '»' )
THEN SET TextString = REPLACE(TextString, '»','»') ;
END IF ;
#fraction 1/4
IF INSTR( x , '¼' )
THEN SET TextString = REPLACE(TextString, '¼','¼') ;
END IF ;
#fraction 1/2
IF INSTR( x , '½' )
THEN SET TextString = REPLACE(TextString, '½','½') ;
END IF ;
#fraction 3/4
IF INSTR( x , '¾' )
THEN SET TextString = REPLACE(TextString, '¾','¾') ;
END IF ;
#inverted question mark
IF INSTR( x , '¿' )
THEN SET TextString = REPLACE(TextString, '¿','¿') ;
END IF ;
#multiplication
IF INSTR( x , '×' )
THEN SET TextString = REPLACE(TextString, '×','×') ;
END IF ;
#division
IF INSTR( x , '÷' )
THEN SET TextString = REPLACE(TextString, '÷','÷') ;
END IF ;
#capital a, grave accent
IF INSTR( x , 'À' )
THEN SET TextString = REPLACE(TextString, 'À','À') ;
END IF ;
#capital a, acute accent
IF INSTR( x , 'Á' )
THEN SET TextString = REPLACE(TextString, 'Á','Á') ;
END IF ;
#capital a, circumflex accent
IF INSTR( x , 'Â' )
THEN SET TextString = REPLACE(TextString, 'Â','Â') ;
END IF ;
#capital a, tilde
IF INSTR( x , 'Ã' )
THEN SET TextString = REPLACE(TextString, 'Ã','Ã') ;
END IF ;
#capital a, umlaut mark
IF INSTR( x , 'Ä' )
THEN SET TextString = REPLACE(TextString, 'Ä','Ä') ;
END IF ;
#capital a, ring
IF INSTR( x , 'Å' )
THEN SET TextString = REPLACE(TextString, 'Å','Å') ;
END IF ;
#capital ae
IF INSTR( x , 'Æ' )
THEN SET TextString = REPLACE(TextString, 'Æ','Æ') ;
END IF ;
#capital c, cedilla
IF INSTR( x , 'Ç' )
THEN SET TextString = REPLACE(TextString, 'Ç','Ç') ;
END IF ;
#capital e, grave accent
IF INSTR( x , 'È' )
THEN SET TextString = REPLACE(TextString, 'È','È') ;
END IF ;
#capital e, acute accent
IF INSTR( x , 'É' )
THEN SET TextString = REPLACE(TextString, 'É','É') ;
END IF ;
#capital e, circumflex accent
IF INSTR( x , 'Ê' )
THEN SET TextString = REPLACE(TextString, 'Ê','Ê') ;
END IF ;
#capital e, umlaut mark
IF INSTR( x , 'Ë' )
THEN SET TextString = REPLACE(TextString, 'Ë','Ë') ;
END IF ;
#capital i, grave accent
IF INSTR( x , 'Ì' )
THEN SET TextString = REPLACE(TextString, 'Ì','Ì') ;
END IF ;
#capital i, acute accent
IF INSTR( x , 'Í' )
THEN SET TextString = REPLACE(TextString, 'Í','Í') ;
END IF ;
#capital i, circumflex accent
IF INSTR( x , 'Î' )
THEN SET TextString = REPLACE(TextString, 'Î','Î') ;
END IF ;
#capital i, umlaut mark
IF INSTR( x , 'Ï' )
THEN SET TextString = REPLACE(TextString, 'Ï','Ï') ;
END IF ;
#capital eth, Icelandic
IF INSTR( x , 'Ð' )
THEN SET TextString = REPLACE(TextString, 'Ð','Ð') ;
END IF ;
#capital n, tilde
IF INSTR( x , 'Ñ' )
THEN SET TextString = REPLACE(TextString, 'Ñ','Ñ') ;
END IF ;
#capital o, grave accent
IF INSTR( x , 'Ò' )
THEN SET TextString = REPLACE(TextString, 'Ò','Ò') ;
END IF ;
#capital o, acute accent
IF INSTR( x , 'Ó' )
THEN SET TextString = REPLACE(TextString, 'Ó','Ó') ;
END IF ;
#capital o, circumflex accent
IF INSTR( x , 'Ô' )
THEN SET TextString = REPLACE(TextString, 'Ô','Ô') ;
END IF ;
#capital o, tilde
IF INSTR( x , 'Õ' )
THEN SET TextString = REPLACE(TextString, 'Õ','Õ') ;
END IF ;
#capital o, umlaut mark
IF INSTR( x , 'Ö' )
THEN SET TextString = REPLACE(TextString, 'Ö','Ö') ;
END IF ;
#capital o, slash
IF INSTR( x , 'Ø' )
THEN SET TextString = REPLACE(TextString, 'Ø','Ø') ;
END IF ;
#capital u, grave accent
IF INSTR( x , 'Ù' )
THEN SET TextString = REPLACE(TextString, 'Ù','Ù') ;
END IF ;
#capital u, acute accent
IF INSTR( x , 'Ú' )
THEN SET TextString = REPLACE(TextString, 'Ú','Ú') ;
END IF ;
#capital u, circumflex accent
IF INSTR( x , 'Û' )
THEN SET TextString = REPLACE(TextString, 'Û','Û') ;
END IF ;
#capital u, umlaut mark
IF INSTR( x , 'Ü' )
THEN SET TextString = REPLACE(TextString, 'Ü','Ü') ;
END IF ;
#capital y, acute accent
IF INSTR( x , 'Ý' )
THEN SET TextString = REPLACE(TextString, 'Ý','Ý') ;
END IF ;
#capital THORN, Icelandic
IF INSTR( x , 'Þ' )
THEN SET TextString = REPLACE(TextString, 'Þ','Þ') ;
END IF ;
#small sharp s, German
IF INSTR( x , 'ß' )
THEN SET TextString = REPLACE(TextString, 'ß','ß') ;
END IF ;
#small a, grave accent
IF INSTR( x , 'à' )
THEN SET TextString = REPLACE(TextString, 'à','à') ;
END IF ;
#small a, acute accent
IF INSTR( x , 'á' )
THEN SET TextString = REPLACE(TextString, 'á','á') ;
END IF ;
#small a, circumflex accent
IF INSTR( x , 'â' )
THEN SET TextString = REPLACE(TextString, 'â','â') ;
END IF ;
#small a, tilde
IF INSTR( x , 'ã' )
THEN SET TextString = REPLACE(TextString, 'ã','ã') ;
END IF ;
#small a, umlaut mark
IF INSTR( x , 'ä' )
THEN SET TextString = REPLACE(TextString, 'ä','ä') ;
END IF ;
#small a, ring
IF INSTR( x , 'å' )
THEN SET TextString = REPLACE(TextString, 'å','å') ;
END IF ;
#small ae
IF INSTR( x , 'æ' )
THEN SET TextString = REPLACE(TextString, 'æ','æ') ;
END IF ;
#small c, cedilla
IF INSTR( x , 'ç' )
THEN SET TextString = REPLACE(TextString, 'ç','ç') ;
END IF ;
#small e, grave accent
IF INSTR( x , 'è' )
THEN SET TextString = REPLACE(TextString, 'è','è') ;
END IF ;
#small e, acute accent
IF INSTR( x , 'é' )
THEN SET TextString = REPLACE(TextString, 'é','é') ;
END IF ;
#small e, circumflex accent
IF INSTR( x , 'ê' )
THEN SET TextString = REPLACE(TextString, 'ê','ê') ;
END IF ;
#small e, umlaut mark
IF INSTR( x , 'ë' )
THEN SET TextString = REPLACE(TextString, 'ë','ë') ;
END IF ;
#small i, grave accent
IF INSTR( x , 'ì' )
THEN SET TextString = REPLACE(TextString, 'ì','ì') ;
END IF ;
#small i, acute accent
IF INSTR( x , 'í' )
THEN SET TextString = REPLACE(TextString, 'í','í') ;
END IF ;
#small i, circumflex accent
IF INSTR( x , 'î' )
THEN SET TextString = REPLACE(TextString, 'î','î') ;
END IF ;
#small i, umlaut mark
IF INSTR( x , 'ï' )
THEN SET TextString = REPLACE(TextString, 'ï','ï') ;
END IF ;
#small eth, Icelandic
IF INSTR( x , 'ð' )
THEN SET TextString = REPLACE(TextString, 'ð','ð') ;
END IF ;
#small n, tilde
IF INSTR( x , 'ñ' )
THEN SET TextString = REPLACE(TextString, 'ñ','ñ') ;
END IF ;
#small o, grave accent
IF INSTR( x , 'ò' )
THEN SET TextString = REPLACE(TextString, 'ò','ò') ;
END IF ;
#small o, acute accent
IF INSTR( x , 'ó' )
THEN SET TextString = REPLACE(TextString, 'ó','ó') ;
END IF ;
#small o, circumflex accent
IF INSTR( x , 'ô' )
THEN SET TextString = REPLACE(TextString, 'ô','ô') ;
END IF ;
#small o, tilde
IF INSTR( x , 'õ' )
THEN SET TextString = REPLACE(TextString, 'õ','õ') ;
END IF ;
#small o, umlaut mark
IF INSTR( x , 'ö' )
THEN SET TextString = REPLACE(TextString, 'ö','ö') ;
END IF ;
#small o, slash
IF INSTR( x , 'ø' )
THEN SET TextString = REPLACE(TextString, 'ø','ø') ;
END IF ;
#small u, grave accent
IF INSTR( x , 'ù' )
THEN SET TextString = REPLACE(TextString, 'ù','ù') ;
END IF ;
#small u, acute accent
IF INSTR( x , 'ú' )
THEN SET TextString = REPLACE(TextString, 'ú','ú') ;
END IF ;
#small u, circumflex accent
IF INSTR( x , 'û' )
THEN SET TextString = REPLACE(TextString, 'û','û') ;
END IF ;
#small u, umlaut mark
IF INSTR( x , 'ü' )
THEN SET TextString = REPLACE(TextString, 'ü','ü') ;
END IF ;
#small y, acute accent
IF INSTR( x , 'ý' )
THEN SET TextString = REPLACE(TextString, 'ý','ý') ;
END IF ;
#small thorn, Icelandic
IF INSTR( x , 'þ' )
THEN SET TextString = REPLACE(TextString, 'þ','þ') ;
END IF ;
#small y, umlaut mark
IF INSTR( x , 'ÿ' )
THEN SET TextString = REPLACE(TextString, 'ÿ','ÿ') ;
END IF ;
RETURN TextString ;
END;
Convert the entities in code.
As JMack mentioned, this functionality is built into PHP, as well as various other languages such as Ruby, Java, etc. This is another quick fix.
Upvotes: 2
Reputation: 32094
It looks like this is going to be one time solution, so I would not think about anything too much generic. I would just find all escaped symbols with something like (replace unions with the reference to your table):
SELECT DISTINCT IF(@p:=LOCATE('&', s),
SUBSTR(s, @p, LOCATE(';', s, @p+5)-@p+1),
NULL) as e_chars
FROM (
SELECT 'Universidad Tecnol&#243;gica Nacional - UTN' as s
UNION ALL
SELECT 'Instituto Tecnol&#243;gico de Buenos Aires'
UNION ALL
SELECT 'Instituto Superior del Profesorado &quot;Dr. Joaqu&#237;n V. Gonz&#225;lez&quot;'
UNION ALL
SELECT 'Escuela Nacional de N&#225;utica &quot;Manuel Belgrano&quot;'
UNION ALL
SELECT 'Conservatorio Nacional de M&#250;sica &quot;Carlos L&#243;pez Buchardo&quot;'
UNION ALL
SELECT 'Instituto Argentino de Computacion - IAC'
UNION ALL
SELECT 'Conservatorio de Superior de M&#250;sica &quot;Manuel de Falla&quot;'
) as s;
to receive:
'&#243;'
'&quot;'
'&#225;'
'&#250;'
And then I would write a simple update query like:
UPDATE t1 SET
s = REPLACE(s, '&#243;', 'ó'),
s = REPLACE(s, '&quot;', ''''),
s = REPLACE(s, '&#225;', 'á'),
s = REPLACE(s, '&#250;', 'ú')
WHERE LOCATE(s, '&');
Then you can repeat the first query to see if there is something left.
Upvotes: 3
Reputation: 8743
MySQL does not provide any function to decode HTML entities. MySQL has no concern about HTML, and will not provide any functions for it specifically.
If you do not want these in the database, you need to decode your strings before they are inserted. If you are using PHP, the html_entity_decode() function is probably what you are looking for.
As for those already in the database, you will need to create a PHP script to go through and read each row, process it, then replace the old row with the new decoded one.
Upvotes: 2