Doug Cassidy

Reputation: 1905

html_entity_decode to plain text, not UTF-8 (ellipsis to ..., etc.)

I'm trying to figure out this decoding. I want to wind up with the most generic text possible: an ellipsis becomes '...', fancy quotes become plain single or double quotes, and an em dash becomes a regular '-'. Is there a way to do this other than str_replace with a table of fancy vs. regular strings?

$str = 'Hey,&#8230;I came back&#8230;.ummm,&#8230;OK,&#8230;cool';

echo htmlspecialchars_decode($str, ENT_QUOTES);
// Hey,&#8230;I came back&#8230;.ummm,&#8230;OK,&#8230;cool

echo html_entity_decode($str, ENT_QUOTES, 'ISO-8859-15');
// Hey,&#8230;I came back&#8230;.ummm,&#8230;OK,&#8230;cool

echo html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// this works, but changes it to the ellipsis character
// Hey,…I came back….ummm,…OK,…cool

echo str_replace("&#8230;", "...", $str);
// Hey,...I came back....ummm,...OK,...cool
// desired result

Upvotes: 0

Views: 897

Answers (1)

Álvaro González

Reputation: 146460

I'm not sure of your specs but I have the impression you want something like this:

$str = 'Hey,&#8230;I came back&#8230;.ummm,&#8230;OK,&#8230;cool';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));

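This transliterates every Unicode character to its closest 7-bit ASCII approximation. The exact output depends on the iconv implementation and the current locale, so unexpected results may arise.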

Update: Examples of unexpected results:

$str = 'Álvaro España €£¥¢©®';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# 'Alvaro Espa~na EURlbyenc(c)(R)

$str = 'Test: உதாரண';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# Notice: iconv(): Detected an illegal character in input string

$str = 'Test: உதாரண End Test';
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# Test:  End Test
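
If surprises like the ones above are a problem, an explicit replacement table gives predictable output at the cost of maintaining the list by hand. The following is a minimal sketch, assuming UTF-8 and PHP 7 or later; the function name `to_plain_text` and the characters in the map are illustrative choices, not a complete list:

```php
<?php
// Hypothetical helper: decode entities, then map common typographic
// characters to plain ASCII using a fixed table.
function to_plain_text(string $html): string
{
    // Turn entities such as &#8230; or &rsquo; into real UTF-8 characters.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Illustrative map only; extend it with whatever your input contains.
    static $map = [
        "\u{2026}" => '...',  // horizontal ellipsis
        "\u{2018}" => "'",    // left single quote
        "\u{2019}" => "'",    // right single quote
        "\u{201C}" => '"',    // left double quote
        "\u{201D}" => '"',    // right double quote
        "\u{2013}" => '-',    // en dash
        "\u{2014}" => '-',    // em dash
        "\u{00A0}" => ' ',    // no-break space
    ];

    // strtr() with an array replaces every key it finds in one pass.
    return strtr($text, $map);
}

echo to_plain_text('Hey,&#8230;I came back&#8230;.ummm,&#8230;OK,&#8230;cool');
// Hey,...I came back....ummm,...OK,...cool
```

Characters that are not in the table pass through unchanged, so the result is only guaranteed to be ASCII if the table covers everything in the input.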

You should note that HTML entities like &#8230; are just a trick to allow browsers to display characters that do not belong to the document encoding. They have nothing to do with databases! If you're getting them into your DB, it's probably because your app is not using UTF-8 (which can represent any character) but users are typing those characters anyway, and the browser does its best to fit them into the document. The easiest fix is to just switch to UTF-8, as explained in "UTF-8 all the way through".
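
To make the point concrete: a numeric entity is only a reference to a Unicode code point, and whether it can be decoded depends on the target encoding you ask for. A small sketch, assuming the input is the entity form `&#8230;` (the variable names are illustrative):

```php
<?php
$entity = '&#8230;'; // decimal reference to U+2026 HORIZONTAL ELLIPSIS

// With UTF-8 as the target, the entity becomes the byte sequence E2 80 A6.
$utf8 = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
echo bin2hex($utf8), "\n"; // e280a6

// ISO-8859-1 has no ellipsis character, so the entity is left untouched.
$latin = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-1');
echo $latin, "\n"; // &#8230;
```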

Fb doesn't like these &# characters, and I would assume it doesn't like the ellipsis characters either

HTML entities are, well, HTML, not plain text. If Facebook expects plain text, HTML entities will be displayed as-is rather than being decoded. As for «…», I really doubt that Facebook (which is using UTF-8) treats it specially. You're probably sending it in the wrong encoding.
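
One way to test the wrong-encoding theory is to validate the bytes before sending them. This is a minimal sketch, assuming the mbstring extension is available and that any input that is not valid UTF-8 is Windows-1252; both are assumptions, and `ensure_utf8` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: return $text as valid UTF-8.
function ensure_utf8(string $text, string $fallback = 'Windows-1252'): string
{
    // Already valid UTF-8 (plain ASCII also passes): nothing to do.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise assume the fallback encoding and convert from it.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// Decode entities first so the receiver gets plain text, not HTML.
$plain = html_entity_decode('OK,&#8230;cool', ENT_QUOTES, 'UTF-8');
echo ensure_utf8($plain), "\n";

// The byte 0x85 is the ellipsis in Windows-1252 but is not valid UTF-8 on its own.
echo bin2hex(ensure_utf8("\x85")), "\n";
```

Guessing the source encoding is inherently unreliable; if you know what encoding your pages and database actually use, pass that as the fallback instead.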

Upvotes: 2
