Puka

Reputation: 1575

How to remove the embedded formatting of a UTF-8 string?

I'm querying the Facebook API in PHP to get a list of posts and display it on a website.

// $facebook is an instance of Facebook\Facebook
$response = $facebook->get('posts?fields=id,message,created_time,full_picture,permalink_url,status_type&limit=20');
$graphEdge = $response->getGraphEdge();
$posts = [];

foreach ($graphEdge as $post) {
    $message = $post->getField('message');
}

The text returned by the call looks like the picture below:

enter image description here

My problem is that sometimes the formatting of the text seems to be embedded in the characters themselves. For example, the text "Montélimar - aux Portes du Soleil" uses a different font than the one defined in CSS, and I can't force it to use a different style. The HTML looks like this:

<p>
  Profitez d’un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de 𝐌𝐨𝐧𝐭𝐞́𝐥𝐢𝐦𝐚𝐫 - 𝐚𝐮𝐱 𝐏𝐨𝐫𝐭𝐞𝐬 𝐝𝐮 𝐒𝐨𝐥𝐞𝐢𝐥 ☀️
  Notre lotissement « 𝐋𝐞 𝐃𝐨𝐦𝐚𝐢𝐧𝐞 𝐝𝐞 𝐆𝐞́𝐫𝐲 » ...
</p>

We even store the data in a JSON object and it looks like this (see the "description" field):

[
    {
        "pageName": "---",
        "type": "---",
        "date": "---",
        "description": "Profitez d’un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de 𝐌𝐨𝐧𝐭𝐞́𝐥𝐢𝐦𝐚𝐫 - 𝐚𝐮𝐱 𝐏𝐨𝐫𝐭𝐞𝐬 𝐝𝐮 𝐒𝐨𝐥𝐞𝐢𝐥 ☀️ Notre lotissement « 𝐋𝐞 𝐃𝐨𝐦𝐚𝐢𝐧𝐞 𝐝𝐞 𝐆𝐞́𝐫𝐲 » ...",
        "time": 0000,
        "thumbnail": "---",
        "url": "---",
        "img": "---"
    }
]

As you can see, some text has a default styling that I can't figure out how to get rid of. I've tried re-encoding the text to UTF-8 in PHP with mb_convert_encoding(), but this doesn't solve the problem because the string is already UTF-8.

How can I remove this formatting? Is this even formatting, or just special UTF-8 symbols?

Upvotes: 1

Views: 596

Answers (2)

jspit

Reputation: 7703

If the special UTF-8 characters get in the way, you can try converting the string to ASCII with iconv. Be aware that this conversion is lossy: characters without an ASCII equivalent are approximated or dropped, so important information may be lost.

$strUTF8mb4 = "Profitez d’un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de 𝐌𝐨𝐧𝐭𝐞́𝐥𝐢𝐦𝐚𝐫 - 𝐚𝐮𝐱 𝐏𝐨𝐫𝐭𝐞𝐬 𝐝𝐮 𝐒𝐨𝐥𝐞𝐢𝐥 ☀️ Notre lotissement « 𝐋𝐞 𝐃𝐨𝐦𝐚𝐢𝐧𝐞 𝐝𝐞 𝐆𝐞́𝐫𝐲 » ...";
$strASCII = iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $strUTF8mb4);
//string(181) "Profitez d'un cadre de vie id'eal pour faire construire votre maison individuelle sur la commune de Montelimar - aux Portes du Soleil Notre lotissement << Le Domaine de Gery >> ..."

For French text in particular, a round trip through ISO-8859-15 could produce slightly better results, because precomposed accented letters such as "é" and the guillemets are kept:

$strIso = iconv("UTF-8", "ISO-8859-15//TRANSLIT//IGNORE", $strUTF8mb4);
$strUtf8 = iconv("ISO-8859-15", "UTF-8", $strIso);
//"Profitez d'un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de Montelimar - aux Portes du Soleil Notre lotissement « Le Domaine de Gery » ..."
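
If the intl extension is available, Unicode compatibility normalization (NFKC) may be a less lossy alternative to the iconv conversions above, since it folds the "Mathematical Bold" letters back to plain letters and recombines the accents instead of dropping them. This is only a minimal sketch under that assumption; the helper name stripCompatibilityStyling is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper for illustration; assumes the intl extension is installed.
function stripCompatibilityStyling(string $text): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required for Normalizer.');
    }
    // NFKC replaces compatibility characters (such as U+1D40C) with their plain
    // equivalents, then recomposes base letters with combining accents.
    $normalized = Normalizer::normalize($text, Normalizer::FORM_KC);
    // normalize() returns false on error (e.g. invalid UTF-8); keep the input then.
    return $normalized === false ? $text : $normalized;
}

// "Montélimar" written with Mathematical Bold letters and a combining acute accent.
$styled = "\u{1D40C}\u{1D428}\u{1D427}\u{1D42D}\u{1D41E}\u{0301}"
        . "\u{1D425}\u{1D422}\u{1D426}\u{1D41A}\u{1D42B}";

echo stripCompatibilityStyling($styled), PHP_EOL; // Montélimar
```

Note that NFKC also rewrites other compatibility characters (ligatures, superscripts, full-width forms), so it is worth checking that this is acceptable for the content being displayed.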

Upvotes: 1

Puka

Reputation: 1575

If you copy one of the characters (for example, the "M" of "Montélimar") and look it up in the Unicode Character Table (https://unicode-table.com/en/1D40C/), you will find that it is not the ordinary Latin letter but a "Mathematical Bold Capital M", identified by:

  • Unicode number: U+1D40C
  • HTML-code: &#119820;
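
The same check can be done from PHP instead of the online table. A small sketch, assuming the mbstring extension (mb_ord needs PHP 7.2 or later); the variable names are illustrative only:

```php
<?php
// Assumes the mbstring extension is loaded (mb_ord exists since PHP 7.2).
$char = "\u{1D40C}"; // the "M" copied from the post
$codePoint = mb_ord($char, 'UTF-8');

printf("U+%X\n", $codePoint);        // U+1D40C
printf("&#%d;\n", $codePoint);       // &#119820;
echo strlen($char), PHP_EOL;         // 4 (bytes used in UTF-8)
echo mb_ord('M', 'UTF-8'), PHP_EOL;  // 77, the ordinary Latin capital M
```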

So this is a problem with the content itself, not an encoding problem. The encoding is fine, and I don't think you can do anything to fix this appearance issue.

Upvotes: 1
