Ximik

Reputation: 2495

UTF-8 and HTML entities

I'm trying to extract text from a Word .DOC file with PHP. Everything seems OK, except that I get something like

СУДОВА БУХГАЛТЕРІЯ

instead of Russian text. I've tried html_entity_decode and utf8_encode, but they didn't help. Is there a simple solution?

Upvotes: 5

Views: 1530

Answers (1)

Gumbo

Reputation: 655239

html_entity_decode should work if you pass the charset explicitly (PHP 5.4.0 and later already use UTF-8 by default):

html_entity_decode($str, ENT_QUOTES, 'UTF-8')

This converts the character references into UTF-8. Before PHP 5.4.0, the charset parameter’s default value was ISO-8859-1. In that case the Cyrillic characters can’t be converted, because the ISO 8859-1 character set doesn’t contain them.
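As a rough illustration of the call above, here is a minimal sketch. The sample string is an assumption: it stands in for whatever the .DOC extraction actually returns, and it presumes the text arrives as decimal numeric character references.

```php
<?php
// Hypothetical input: the word "СУДОВА" written as decimal numeric
// character references, standing in for the text extracted from the .DOC file.
$str = '&#1057;&#1059;&#1044;&#1054;&#1042;&#1040;';

// Naming 'UTF-8' explicitly makes the result independent of
// the default charset of the PHP version in use.
$decoded = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n"; // prints: СУДОВА
```

utf8_encode is unlikely to help here, since it only converts ISO-8859-1 bytes to UTF-8 and leaves character references untouched.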

Upvotes: 4
