Nidhi Kaushal

Reputation: 299

How to normalize encoding names, like ks_c_5601-1987 to CP949?

I am fetching emails from a mail server, converting each message to UTF-8, and saving it in a database. To convert the charset I am using mb_convert_encoding, but it fails to convert gb2312 and ks_c_5601-1987. After some searching I found that I can use CP936 instead of gb2312 and CP949 instead of ks_c_5601-1987.

This approach would mean maintaining a separate list of charset mappings in my code. Is there a way to normalize encoding names to the names PHP supports internally, so that I do not need to maintain a local map?

Upvotes: 7

Views: 3265

Answers (1)

borrible

Reputation: 17336

According to PHP's list of supported character encodings, only a small number of encodings are listed explicitly by code page. Given how few of these cases there are, maintaining a list of mappings may not be unreasonable, even though it is not the built-in normalisation you asked for.

The relevant ones appear to be the following (the name on the right is the one you'll need to convert from):

  • CP932 shift_jis
  • CP51932 euc_jp
  • CP50220 iso-2022-jp
  • CP50221 csISO2022JP
  • CP50222 iso-2022-jp
  • CP936 gb2312
  • CP950 big5
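
A minimal sketch of such a mapping follows. The function name `normalize_charset` and the constant `CHARSET_MAP` are hypothetical; the pairs are copied from the list above, plus the ks_c_5601-1987 to CP949 pair from the question. Because iso-2022-jp appears against more than one code page, choosing CP50220 here is an assumption, as is reading the fourth entry as the IANA alias csISO2022JP.

```php
<?php
// Hypothetical lookup table: incoming charset label (lowercased) => name to
// pass to mbstring. Pairs come from the list above and from the question.
const CHARSET_MAP = [
    'shift_jis'      => 'CP932',
    'euc_jp'         => 'CP51932',
    'iso-2022-jp'    => 'CP50220', // assumption: CP50222 is listed for the same label
    'csiso2022jp'    => 'CP50221', // assumption: the IANA alias csISO2022JP is meant
    'gb2312'         => 'CP936',
    'big5'           => 'CP950',
    'ks_c_5601-1987' => 'CP949',   // from the question
];

// Return the mapped name, or the label unchanged when no mapping is known.
function normalize_charset(string $label): string
{
    $key = strtolower(trim($label));
    return CHARSET_MAP[$key] ?? $label;
}
```

With this in place, the conversion in the question could be written as `mb_convert_encoding($body, 'UTF-8', normalize_charset($charset))`, where `$body` and `$charset` stand for the message text and the label taken from the mail headers.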

The following are also listed by code page in the PHP documentation but appear to have suitable synonyms already:

  • CP866 (IBM866)
  • UHC (CP949)
  • Windows-1251 (CP1251)
  • Windows-1252 (CP1252)
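
Synonyms like these can also be looked up at runtime rather than hard-coded. The sketch below assumes the mbstring extension is loaded and uses its `mb_list_encodings()` and `mb_encoding_aliases()` functions to build a case-insensitive index; the helper names `build_alias_index` and `resolve_encoding` are hypothetical. It is one possible approach, not a built-in normalisation.

```php
<?php
// Build a lowercase lookup of every encoding name and alias that the
// installed mbstring reports, each pointing at the canonical name.
function build_alias_index(): array
{
    $index = [];
    foreach (mb_list_encodings() as $name) {
        $index[strtolower($name)] = $name;
        // Older PHP versions may return false here, hence the fallback to [].
        foreach (mb_encoding_aliases($name) ?: [] as $alias) {
            $index[strtolower($alias)] = $name;
        }
    }
    return $index;
}

// Return the canonical mbstring name for a label, or null if it is unknown.
function resolve_encoding(string $label, array $index): ?string
{
    return $index[strtolower(trim($label))] ?? null;
}
```

Which labels resolve depends on the PHP version and build, so a label that comes back as `null` would still need an entry in a manual map.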

Upvotes: 2
