How can I remove characters that are not supported by MySQL's utf8 character set?

Question

How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, I want to remove characters that take four bytes in UTF-8, such as "𝜀", which are only supported by MySQL's utf8mb4 character set.

For example,

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰

should become

C = -2.4‰ ± 0.3‰; H = -57‰

I want to load a data file into a MySQL table that has CHARSET=utf8.

ikegami · Accepted Answer

MySQL's utf8mb4 encoding is what the world calls UTF-8.

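MySQL's utf8 encoding is a subset of UTF-8 that only supports characters in the Basic Multilingual Plane (BMP), meaning characters U+0000 to U+FFFF inclusive. These are the characters that UTF-8 encodes in three bytes or fewer; everything above U+FFFF takes four bytes.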

So, the following will match the unsupported characters in question:

/[^\N{U+0000}-\N{U+FFFF}]/
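Before choosing a cleanup strategy, it can help to see which unsupported characters actually occur in your data. The following is a minimal sketch, not part of the original answer: `count_non_bmp` is a hypothetical helper name, the sample string is made up, and the input is assumed to be an already-decoded string. It needs Perl 5.16 or later.

```perl
use strict;
use warnings;
use 5.016;
use charnames ();   # Only for charnames::viacode

# Return a hash ref mapping each character outside the BMP
# found in $text to the number of times it occurs.
sub count_non_bmp {
    my ($text) = @_;
    my %count;
    $count{$1}++ while $text =~ /([^\N{U+0000}-\N{U+FFFF}])/g;
    return \%count;
}

# Sample input (an assumption for illustration only).
my $counts = count_non_bmp("\N{U+1D700}C = -2.4; \N{U+1D700}H = -57");

# Report each offending character as code point, Unicode name and count.
for my $ch (sort keys %$counts) {
    printf "U+%04X %s: %d\n",
        ord($ch),
        charnames::viacode(ord $ch) // '(unnamed)',
        $counts->{$ch};
}
```

A report like this is a convenient starting point for a translation map of the kind used in the third technique.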

Here are three different techniques you can use to clean your input:

1: Remove unsupported characters:

s/[^\N{U+0000}-\N{U+FFFF}]//g;

2: Replace unsupported characters with U+FFFD:

s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;

3: Replace unsupported characters using a translation map:

my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;

For example,

use utf8;                              # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

use strict;
use warnings;
use 5.010;               # say, //
use charnames ':full';   # Not needed in 5.16+

my %translations = (
   "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
   # ...
);

$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;

Output:

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰
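Since the goal in the question is to load a data file, the first technique (plain removal) can also be applied to a whole file before importing it. This is a minimal sketch under stated assumptions: the file is UTF-8 encoded, `strip_non_bmp` is a hypothetical helper name, and the file names in the usage comment are placeholders. It needs Perl 5.16 or later.

```perl
use strict;
use warnings;
use 5.016;

# Remove every character outside the BMP (U+0000..U+FFFF)
# from an already-decoded string and return the result.
sub strip_non_bmp {
    my ($text) = @_;
    $text =~ s/[^\N{U+0000}-\N{U+FFFF}]//g;
    return $text;
}

# Hypothetical usage: perl clean.pl data.txt data.clean.txt
if (@ARGV == 2) {
    my ($in_path, $out_path) = @ARGV;

    # Decode on input and encode on output so the regex sees characters, not bytes.
    open(my $in,  '<:encoding(UTF-8)', $in_path)  or die "Can't open $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path) or die "Can't create $out_path: $!";

    print {$out} strip_non_bmp($_) while <$in>;

    close($out) or die "Can't write $out_path: $!";
}
```

Decoding the input matters here: on raw bytes the character class would never match, because every byte is below U+0100.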
