Reputation: 3152
I have XML files which contain the characters <, > and &. For example:
<?xml version="1.0" encoding="utf-8"?>
<test>
<medi>bla bla >12 bla</medi>
<diag>bla & bla <12</diag>
</test>
These characters are reserved for XML notation and should be replaced by the escape strings &lt;, &gt; and &amp;. This also holds for quotes (" -> &quot;) and apostrophes (' -> &apos;).
Here is what I would like to get:
<?xml version="1.0" encoding="utf-8"?>
<test>
<medi>bla bla &gt;12 bla</medi>
<diag>bla &amp; bla &lt;12</diag>
</test>
Usually I use regular expressions with perl or sed for this but, frankly, I did not succeed. The difficulty is to avoid replacing the characters that belong to the XML markup itself, such as the < and > of the tags and the & of escape strings.
To make clear what I mean, here is a Perl attempt which does not work:
use strict;
use warnings;
my $input = $ARGV[0];
my $output = $ARGV[1];
open INPUT, $input or die "Couldn't open file $input, $!";
open OUTPUT, ">$output" or die "Couldn't open file $output, $!";
my $rec;
while (<INPUT>) {
    $rec = $_;
    print $rec;
    $rec =~ s/(<medi>.*)<(.*<\/medi>)/$1&lt;$2/g;
    $rec =~ s/(<medi>.*)>(.*<\/medi>)/$1&gt;$2/g;
    $rec =~ s/(<medi>.*)&(.*<\/medi>)/$1&amp;$2/g;
    $rec =~ s/(<medi>.*)'(.*<\/medi>)/$1&apos;$2/g;
    $rec =~ s/(<medi>.*)"(.*<\/medi>)/$1&quot;$2/g;
    $rec =~ s/(<diag>.*)<(.*<\/diag>)/$1&lt;$2/g;
    $rec =~ s/(<diag>.*)>(.*<\/diag>)/$1&gt;$2/g;
    $rec =~ s/(<diag>.*)&(.*<\/diag>)/$1&amp;$2/g;
    $rec =~ s/(<diag>.*)'(.*<\/diag>)/$1&apos;$2/g;
    $rec =~ s/(<diag>.*)"(.*<\/diag>)/$1&quot;$2/g;
    print $rec;
    print OUTPUT $rec;
}
close INPUT;
close OUTPUT;
This gives me:
<?xml version="1.0" encoding="utf-8"?>
<test>
<medi>bla bla &amp;gt;12 bla</medi>
<diag>bla & bla &amp;lt;12</diag>
</test>
What happens:
- > was replaced with &amp;gt; which is not intended
- the & in <diag>bla & bla ... is not replaced
I'm sure there is a regexp which can solve this problem. But if there is a completely different way to make the XML well formed, I am open to it.
Upvotes: 0
Views: 2190
Reputation:
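If your data is in the file 'd', with GNU sed:
sed -E 's/&/\&amp\;/g;s/</\&lt\;/g;s/>/\&gt\;/g;s/\x27/\&apos\;/g;/xml ver/!s/\"/\&quot\;/g' d
The " can also be written as \x22, if you want to be sure. Note that an unescaped & in a sed replacement stands for the matched text, hence the \&. Also note that this escapes every <, > and & on a line, including those of the tags themselves.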
Upvotes: 0
Reputation: 69244
If you have files that contain characters like '<', '>' and '&' in the text nodes, then you do not have XML files.
In order to fix this, you would need to parse the files with an XML parser. But it's likely that most XML parsers would refuse to parse these files as they aren't well-formed XML. It's possible that something like XML::Lenient could be useful here.
The correct approach is to go back to the source of these files and fix that process so that it generates well-formed XML files. If you are creating the files, then you need to fix the code that creates them. If someone is providing the files to you, then you need to go back to them and ask them to provide valid XML files.
Upvotes: 1