Reputation: 1116

A script to strip ranges of UTF-8 Characters out of a file

My problem is that I have a data file containing UTF-8, most of which is valid and must be kept, but some of which has random "garbage" UTF-8, namely in the range of 0xf0 - 0xff. An example of the hex for the bad data can be seen below

 f4 80 80  ab f4 80 80 b6 f4 80 80 
 a5 f4 80 80 a6 f4 80 80  83 f4 80 80 b6 f4 80 81  
 84 f4 80 81 98 f4 80 81  87 f4 80 81 8c f4

I'm trying to write a Perl script that will search for and replace every character whose first byte is in the range 0xf0 - 0xff. On this website the codepage is listed as private use.

My existing attempts either do nothing or remove only the first byte of a multi-byte character, for example: perl -CSD -pi.orig -e 's/[\x{f4}-\x{ff}]/?/g'. I'm running perl v5.12.5.

I'm not much of a perl expert, nor a utf-8 expert. I'm also open to doing this in ruby/python/C++(98)/whatever as long as it's relatively portable on a linux box.

Here's a link to a snippet of the garbage data. http://pastebin.com/LR0StPHu

Upvotes: 3

Answers (3)

khw

Reputation: 578

It's a waste of your time to have to look up the hex ranges of Private Use areas. Simply say

s/\p{Private_Use}//g

perluniprops is the pod file that gives all the Unicode properties. If you want just the above-BMP private use areas, you can consult it (grepping for Private) to find how to match those.
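If you end up doing this in Python instead, a rough equivalent of \p{Private_Use} is to test each character's Unicode general category, which is "Co" for private use code points. This is only a sketch: strip_private_use is a hypothetical helper name, and it assumes the data has already been decoded from UTF-8 into a str.

```python
import unicodedata


def strip_private_use(text):
    """Return text without characters whose general category is Co (private use)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")


# Two Plane 16 private use characters and one BMP private use character,
# mixed with ordinary text that must survive.
sample = "snow \u2603 \U0010002b\U00100036 \ue000 end"
print(strip_private_use(sample))
```

Like the Perl property, this covers the BMP private use area as well as the two supplementary ones, so no hex ranges need to be spelled out.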

Upvotes: 3

Mark Reed

Reputation: 95267

OK, let's untangle a few things.

UTF-8 characters whose first byte is 0xf0 or higher are four bytes long, which is the most you ever need to encode a legal Unicode character. (In valid UTF-8 that lead byte is 0xf0 through 0xf4; the bytes 0xf5 through 0xff never appear at all.) Since over 94% of the possible Unicode range requires that fourth byte, such a lead byte doesn't map to any single code page, and certainly not specifically to the private use areas.

Such characters are outside the Basic Multilingual Plane. But that's different from being invalid or private use; it just means their code points are greater than U+FFFF (decimal value 65,535).
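To make the relationship between lead byte, encoded length and code point concrete, here is a small Python sketch. The helper name describe is made up for illustration, and the sample characters are my own picks, apart from U+10002B, which comes from the question's data.

```python
def describe(ch):
    """Return (code point, UTF-8 length in bytes, lead byte) for one character."""
    encoded = ch.encode("utf-8")
    return ord(ch), len(encoded), encoded[0]


# ASCII, Latin-1 range, BMP, Plane 1, and Plane 16 (private use) examples.
for ch in ["A", "\u00e9", "\u2603", "\U0001f600", "\U0010002b"]:
    code_point, length, lead = describe(ch)
    print(f"U+{code_point:04X}: {length} byte(s), lead byte 0x{lead:02x}")
```

Only the last two need four bytes, and only the last one is private use, even though both have a lead byte of 0xf0 or above.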

If you want to exclude all characters outside the BMP, you should be searching for the ones matching this regex:

[\x{10000}-\x{10FFFF}]

That uses Perl's \x{...} interpolation syntax to include characters by their hexadecimal code point value. If you're actually using Perl, then for ease of use you might want to put the regex into a variable (using the quote-regex construction qr(...), since bare slashes will immediately try to match the regex against $_ at assignment time):

my $not_bmp = qr([\x{10000}-\x{10FFFF}]);

But, again, removing characters matching that regex eliminates over 94% of possible Unicode characters, so be sure that's what you want.

If you really only want to eliminate private use characters - some of which are inside the BMP - just exclude those ranges specifically. With Perl or Python or any other UTF-8-aware language, you don't have to worry about bytes; just check the code points.

As Wikipedia will tell you, the three Private Use Areas are in these code point ranges:

U+E000..U+F8FF
U+F0000..U+FFFFF
U+100000..U+10FFFF

So the corresponding Perl regex looks like this:

my $pua = qr([\x{e000}-\x{f8ff}\x{f0000}-\x{fffff}\x{100000}-\x{10ffff}]);

Many other languages have similar Unicode support (matching against UTF-8 characters, including characters in a string by code point, and so on). For example, here's Ruby, which mainly differs in using \u{...} instead of \x{...} for the interpolation:

not_bmp = %r([\u{10000}-\u{10FFFF}])
pua = %r([\u{e000}-\u{f8ff}\u{f0000}-\u{fffff}\u{100000}-\u{10ffff}])

Python \u escapes only work with exactly four hex digits, but if you have Python3 - or a Python2 compiled in wide mode - you can use capital \U, which takes exactly eight (there's no variable-length support via {...} as Perl and Ruby have):

not_bmp = re.compile(u'[\U00010000-\U0010ffff]')
pua = re.compile(u'[\ue000-\uf8ff\U000f0000-\U000fffff\U00100000-\U0010ffff]')
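Putting the pua pattern to work on a file might look roughly like the following Python 3 sketch. The function names, the script name in the usage message and the two-argument command line are assumptions of mine, not anything from the question; it also assumes the input is valid UTF-8 and will raise an error if it is not.

```python
import re
import sys

# Same three Private Use Area ranges as the pua pattern above.
PUA = re.compile("[\ue000-\uf8ff\U000f0000-\U000fffff\U00100000-\U0010ffff]")


def strip_pua(text):
    """Remove every private use character from an already-decoded string."""
    return PUA.sub("", text)


def strip_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping private use characters line by line."""
    # Decoding as UTF-8 means the regex sees code points, not bytes.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_pua(line))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        strip_file(sys.argv[1], sys.argv[2])
    else:
        print("usage: strip_pua.py INPUT OUTPUT", file=sys.stderr)
```

Writing to a separate output file keeps the original intact, much like the .orig backup that perl -pi.orig makes.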

Upvotes: 5

simbabque

Reputation: 54333

You need to work with characters, not with bytes.
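The difference is easy to see outside Perl too. The following Python sketch (variable names are mine, and the sample string is invented around one character from your data) deletes only the lead byte at the byte level, which is what your attempt effectively did, and then does the same job at the character level.

```python
# One private use character (U+10002B) between two ASCII letters.
data = "x\U0010002by".encode("utf-8")  # b'x\xf4\x80\x80\xaby'

# Byte level: removing only the lead byte 0xf4 orphans three continuation bytes.
broken = data.replace(b"\xf4", b"")
try:
    broken.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Character level: decode first, then remove the whole code point at once.
clean = data.decode("utf-8").replace("\U0010002b", "")

print(still_valid, repr(clean))
```

The byte-level result is no longer valid UTF-8, while the character-level result is just the two surrounding letters.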

If you have your data inside your code, use the utf8 pragma to tell Perl that your program's source code is in UTF-8. I do this in the example so you can copy/paste my code.

You can do a string replace using the \x{} escape sequence in a character class []. Those can be used in ranges as well as individually.

use utf8;

my $foo = "asfd ☃ 􀀫􀀶􀀥􀀦􀀶􀁄􀁘􀁇􀁌􀀤􀁕􀁄􀁅􀁌􀁄􀀯􀁌􀁐􀁌􀁗􀁈􀁇 Բարեւ ສະບາຍດີ";
$foo =~ s/[\x{10002b}\x{100036}]//g;
CORE::say $foo;

This will output:

asfd ☃ 􀀥􀀦􀁄􀁘􀁇􀁌􀀤􀁕􀁄􀁅􀁌􀁄􀀯􀁌􀁐􀁌􀁗􀁈􀁇 Բարեւ ສະບາຍດີ

(There's also a "Wide character in print" warning, but let's ignore that; it appears because my STDOUT has no UTF-8 encoding layer.)

The two characters I substituted \x{10002b}\x{100036} are the first two characters in your example data. The font I use in my IDE shows the ordinals of characters that it doesn't have any glyphs for, so it's easy for me to tell what those characters are.
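If your editor doesn't show the ordinals, you can get the same information by decoding the bytes from the hex dump yourself. A minimal Python sketch, using the first eight bytes of the dump in the question (variable names are mine):

```python
# First two four-byte sequences from the hex dump in the question.
raw = bytes.fromhex("f4 80 80 ab f4 80 80 b6")

# Decode the bytes as UTF-8, then format each character's code point.
text = raw.decode("utf-8")
code_points = [f"U+{ord(ch):06X}" for ch in text]
print(code_points)  # ['U+10002B', 'U+100036']
```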

These characters are from the Supplementary Private Use Area-B. (Wikipedia)

Plane 16 (PUA-B): U+100000..U+10FFFF, Supplementary Private Use Area-B, 65,536 code points, 65,534 assigned, scripts: Unknown

So we can also do a range.

my $foo = "asfd ☃ 􀀫􀀶􀀥􀀦􀀶􀁄􀁘􀁇􀁌􀀤􀁕􀁄􀁅􀁌􀁄􀀯􀁌􀁐􀁌􀁗􀁈􀁇 Բարեւ ສະບາຍດີ";
$foo =~ s/[\x{100000}-\x{10ffff}]//g;
CORE::say $foo;

Output:

asfd ☃  Բարեւ ສະບາຍດີ

To get all Private Use Areas, you need to include the three ranges which are listed here.

$foo =~ s/[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFF}]//g;

Upvotes: 3
