Reputation: 1998
The XML character set is limited to the following:
[\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]
Entities may not be used to represent characters outside this set either.
I am parsing some XML data files from an external source using XML::DOM. Some of the files contain non-printable characters encoded as character references of the form &#xx;, which crash the parser because they are invalid. I am trying to find an easy way to remove these invalid characters. I have tried
$xml =~ s/(&#\c\c;)//g;
which doesn't seem to work. SO doesn't seem to have anything related and I have been searching online for a while with no success.
Upvotes: 0
Views: 2491
Reputation: 126722
It makes sense to write a substitution that finds all character entities in the HTML and uses the /e modifier so that the replacement string can be supplied by a block of Perl code.
This example creates the $html_chars regex pattern from the character class in your own question, which checks whether a single character is within range, and then uses it to test the value of every character entity in the string.
Note that the hash # in the pattern must be escaped as a consequence of the /x modifier, which allows whitespace and comments in order to make the regex more readable.
My test string uses entities for all ASCII character codes in both decimal and hex, and you can see that the substitution removes just the control characters except for HT, LF and CR.
use strict;
use warnings;
my $html_chars = qr/[\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;
my $html = do {
local $/;
<DATA>;
};
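# Match decimal (&#NN;) and hex (&#xNN;) character entities.
$html =~ s{ ( &\# ( x[0-9A-F]+ | [0-9]+ ) ; ) } {
my ($entity, $code) = ($1, $2);
$code = hex $code if $code =~ s/x//i;
# Keep the entity only if the character it names is in the allowed set.
chr($code) =~ $html_chars ? $entity : '';
}eixg;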
print $html;
__DATA__
Decimal
�	  

 !"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~
Hex
�	


 !"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~
output
Decimal
	
 !"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~
Hex
	

 !"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~
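
The same idea can be wrapped in a reusable subroutine and run over the raw text before it is handed to the parser. The following is only a sketch under a few assumptions: strip_invalid_refs and is_xml_char are hypothetical names, not part of XML::DOM, and the code point is compared numerically rather than through a character class, so that chr is never called on a value beyond the Unicode range.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point is in the allowed XML character set.
sub is_xml_char {
    my ($cp) = @_;
    return $cp == 0x9 || $cp == 0xA || $cp == 0xD
        || ($cp >= 0x20    && $cp <= 0xD7FF)
        || ($cp >= 0xE000  && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

# Hypothetical helper: drop every numeric character reference whose
# code point is outside the allowed set and keep all the others.
sub strip_invalid_refs {
    my ($text) = @_;
    $text =~ s{ ( &\# (?: x([0-9A-Fa-f]+) | ([0-9]+) ) ; ) }{
        my ($entity, $hex, $dec) = ($1, $2, $3);
        my $code = defined $hex ? hex $hex : $dec;
        is_xml_char($code) ? $entity : '';
    }gex;
    return $text;
}

print strip_invalid_refs('a&#1;b&#x9;c&#x1F;d&#65;'), "\n";   # prints ab&#x9;cd&#65;
```

As with the substitution above, this treats the input as plain text, so it would also rewrite anything that merely looks like a character reference.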
Upvotes: 2
Reputation: 35198
I would recommend explicitly specifying which characters you want to remove.
The following removes the unprintable character entities in the ASCII range. This could easily be expanded if you wanted to cover all of the unprintable entities as you've defined them.
Also, please note, as @ikegami mentioned in the question comments, that using a regex like this will break the contents of any CDATA section.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
# Allowed entities:
# [\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]
# Decimal Character Entities
$data =~ s/&#0*(?!(?:9|1[03])\b)(?:[12]?[0-9]|3[01]);//g;
# Hex Character Entities
$data =~ s/&#x0*(?![9ADad]\b)1?[[:xdigit:]];//g;
print $data;
__DATA__
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<hex_character_entities>
<hex00>�	

</hex00>
<hex10></hex10>
<hex20> !...</hex20>
</hex_character_entities>
<decimal_character_entities>
<dec00>�	</dec00>
<dec10>  </dec10>
<dec20></dec20>
<dec30> !...</dec30>
</decimal_character_entities>
</root>
Outputs:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<hex_character_entities>
<hex00>	

</hex00>
<hex10></hex10>
<hex20> !...</hex20>
</hex_character_entities>
<decimal_character_entities>
<dec0>	</dec0>
<dec1> </dec1>
<dec2></dec2>
<dec3> !...</dec3>
</decimal_character_entities>
</root>
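
One way to soften the CDATA caveat is to let the same substitution match whole CDATA sections first and pass them through unchanged. This is a rough sketch rather than a full solution: strip_outside_cdata and is_forbidden_control are hypothetical names, only references to control characters below 0x20 are handled, and comments and processing instructions are still treated as ordinary text.

```perl
use strict;
use warnings;

# Hypothetical helper: true for control characters below 0x20
# other than tab, line feed and carriage return.
sub is_forbidden_control {
    my ($cp) = @_;
    return $cp < 0x20 && $cp != 0x9 && $cp != 0xA && $cp != 0xD;
}

# Hypothetical helper: remove references to forbidden control characters,
# but copy CDATA sections through untouched. The first alternative
# captures a whole CDATA section; the second captures a reference whose
# value has at most two significant digits.
sub strip_outside_cdata {
    my ($xml) = @_;
    $xml =~ s{
        ( <!\[CDATA\[ .*? \]\]> )
      | ( &\# (?: x0*([0-9A-Fa-f]{1,2}) | 0*([0-9]{1,2}) ) ; )
    }{
        my ($cdata, $entity, $hex, $dec) = ($1, $2, $3, $4);
        my $code = defined $hex ? hex $hex : $dec;
        defined $cdata               ? $cdata
          : is_forbidden_control($code) ? ''
          :                               $entity;
    }gsex;
    return $xml;
}

print strip_outside_cdata('<a>&#1;<![CDATA[&#1;]]>&#x1F;&#x9;</a>'), "\n";
# prints <a><![CDATA[&#1;]]>&#x9;</a>
```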
Upvotes: 2
Reputation: 49
I would try using \w instead of \c.
The following produces the correct results for me:
my $xml = <<XML;
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner></inner>
</outer>
XML
$xml =~ s/&#\w{2};//g;
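
For background, \c in a Perl regex introduces a control-character escape such as \cA and is not a character class, which is why the original attempt matched nothing. Before relying on \w{2}, it may be worth checking which references it actually removes. The sketch below uses sample references chosen purely for illustration (they are not taken from the question's data), and strip_two_char_refs is a hypothetical wrapper around the substitution above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub strip_two_char_refs {
    my ($text) = @_;
    $text =~ s/&#\w{2};//g;
    return $text;
}

# Report, for each sample reference, whether the pattern removes it.
for my $ref ('&#10;', '&#x9;', '&#1;', '&#x1F;', '&#65;') {
    my $result = strip_two_char_refs($ref) eq '' ? 'removed' : 'kept';
    printf "%-7s %s\n", $ref, $result;
}
```

With these samples the pattern removes &#10;, &#x9; and &#65;, which are all valid, and keeps &#1; and &#x1F;, which are not, so the quantifier probably needs adjusting for real data.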
Upvotes: -1