Reputation: 649
Hi,
I have an XML file that I need to validate. To do this I use the following code:
use strict;
use warnings;
use XML::Parser;
my $File="folder/file1.xml";
my $p1 = XML::Parser->new();
my $p2;
my $Crash_Error_String='';
eval{$p2=$p1->parsefile($File)};
$Crash_Error_String=$@ if !defined $p2 ;
if(!defined $p2){
print $Crash_Error_String . "\n";
}
Now, if the file does not contain valid XML I get a string in the variable $Crash_Error_String as follows:
not well-formed (invalid token) at line 1771, column 58, byte 248467 at /usr/lib64/perl5/XML/Parser.pm line 187.
This tells me that there is an XML-related problem in the file at byte 248467.
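The byte offset does not have to be copied by hand from the message. A minimal sketch, assuming the error string always follows the "at line N, column N, byte N" layout shown above; the helper name `parse_error_position` is hypothetical and not part of XML::Parser:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Parser): extract line, column and
# byte offset from an error message laid out like the one shown above.
sub parse_error_position {
    my ($message) = @_;
    # Bare return gives undef in scalar context when the layout differs.
    return unless $message =~ /at line (\d+), column (\d+), byte (\d+)/;
    return { line => $1, column => $2, byte => $3 };
}

my $msg = 'not well-formed (invalid token) at line 1771, column 58, '
        . 'byte 248467 at /usr/lib64/perl5/XML/Parser.pm line 187.';
my $pos = parse_error_position($msg);
printf "line %d, column %d, byte %d\n", @{$pos}{qw(line column byte)};
```

The `byte` field can then be used as the seek position instead of a hard-coded number.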
I can now print out the value where the problem occurs with:
my ($fh, $byte_position, $byte_value); # $File is already declared above
$byte_position = 248467;
open($fh, "+<", $File) || die "can't open $File: $!";
binmode($fh) || die "can't binmode $File";
sysseek($fh, $byte_position, 0) # NB: 0-based
|| die "couldn't seek to byte $byte_position in $File: $!";
sysread($fh, $byte_value, 1) == 1
|| die "couldn't read byte from $File: $!";
printf "read byte with ordinal value %#02x at position %d\n",
ord($byte_value), $byte_position;
close $fh;
Which, in this specific example, gives
read byte with ordinal value 0x1f at position 248467
Now for my problem: how can I replace the value 0x1f with the entry _x001f_?
I have tried the following (placing the code below between the calls to "sysread" and "close" in the code above)
sysseek($fh, $byte_position, 0) # NB: 0-based
|| die "couldn't seek to byte $byte_position in $File: $!";
my $NewV="_x001f_";
syswrite($fh,$NewV);
But it places the new value immediately to the right of the problem string. In addition it eats up the characters to the right.
So, before the replacement I have the following fragment within the file (the character that XML::Parser complains about is not shown below; it sits between the i and the e of vérifier)
pour vérifier la réaction
And after my replacement I have the following fragment within the file
pour vérifi_x001f_éaction
As you can see the replacement string has eaten into the following part of the string.
The replacement I want is:
pour vérifi_x001f_er la réaction
Any help much appreciated.
Upvotes: 0
Views: 399
Reputation: 385657
If the file is too big for memory, but disk space isn't a problem, the simplest solution is to write the corrected data to a new file and then replace the original with it.
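When disk space is available, one way to do this is to stream the file into a second file, substituting the byte as it passes, and then rename the copy over the original. A minimal sketch under those assumptions; the helper name `escape_byte_via_copy` and the `.new` suffix are hypothetical and not taken from the answer:

```perl
use strict;
use warnings;

use constant BLOCK_SIZE => 4*1024*1024;

# Hypothetical helper: copy $qfn to "$qfn.new", replacing the single byte at
# $offset with its _xHHHH_ escape, then rename the copy over the original.
sub escape_byte_via_copy {
    my ($qfn, $offset) = @_;
    my $tmp = "$qfn.new";   # Assumed temporary name, same directory as $qfn.

    open(my $fh_src, '<:raw', $qfn) or die("Can't open \"$qfn\": $!\n");
    open(my $fh_dst, '>:raw', $tmp) or die("Can't create \"$tmp\": $!\n");

    # Copy everything before the offending byte unchanged.
    my $remaining = $offset;
    while ($remaining > 0) {
        my $want = $remaining < BLOCK_SIZE ? $remaining : BLOCK_SIZE;
        my $rv = read($fh_src, my $buf, $want);
        die("Can't read \"$qfn\": $!\n") if !defined($rv);
        die("Offset $offset is past the end of \"$qfn\"\n") if !$rv;
        print {$fh_dst} $buf or die("Can't write \"$tmp\": $!\n");
        $remaining -= $rv;
    }

    # Replace the byte at $offset with its escaped form.
    my $rv = read($fh_src, my $byte, 1);
    die("Can't read \"$qfn\": $!\n") if !defined($rv);
    die("Offset $offset is past the end of \"$qfn\"\n") if !$rv;
    print {$fh_dst} sprintf("_x%04x_", ord($byte))
        or die("Can't write \"$tmp\": $!\n");

    # Copy the remainder of the file unchanged.
    while (1) {
        my $rv = read($fh_src, my $buf, BLOCK_SIZE);
        die("Can't read \"$qfn\": $!\n") if !defined($rv);
        last if !$rv;
        print {$fh_dst} $buf or die("Can't write \"$tmp\": $!\n");
    }

    close($fh_dst) or die("Can't write \"$tmp\": $!\n");
    close($fh_src);
    rename($tmp, $qfn) or die("Can't rename \"$tmp\" to \"$qfn\": $!\n");
    return;
}
```

It would be called as `escape_byte_via_copy('folder/file1.xml', 248467);`. Because the original is only replaced by the final `rename`, a failure partway through leaves it untouched.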
It can also be done in place, but that is far more complicated, and a failure partway through will lead to data loss:
use strict;
use warnings;
use Fcntl qw( SEEK_CUR SEEK_SET );
use constant BLOCK_SIZE => 4*1024*1024;
my $qfn = 'file';
my $offset = 248467;
# Two handles on the same file: one reads ahead, the other writes behind it.
open(my $fh_src, '<:raw', $qfn) or die("Can't open \"$qfn\": $!\n");
open(my $fh_dst, '+<:raw', $qfn) or die("Can't open \"$qfn\": $!\n");
sysseek($fh_src, $offset, SEEK_SET) or die($!);
sysseek($fh_dst, $offset, SEEK_SET) or die($!);
my $buf;
{
my $rv = sysread($fh_src, $buf, 1);
die($!) if !defined($rv);
die("Premature EOF") if !$rv;
# Since we're only reading one byte, we don't need to worry about a partial read.
$buf = sprintf("_x%04x_", ord($buf));
}
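while (1) {
# Read the next block before writing the pending one. The replacement is
# 6 bytes longer than the byte it replaces, so writing first would overwrite
# data that has not been read yet. This assumes every read except the last
# returns at least 6 bytes, which holds for regular files.
my $rv = sysread($fh_src, my $next, BLOCK_SIZE);
die($!) if !defined($rv);
my $written = 0;
while ($written < length($buf)) {
my $wv = syswrite($fh_dst, $buf, length($buf)-$written, $written);
die($!) if !defined($wv);
$written += $wv;
}
last if !$rv;
$buf = $next;
}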
# Must use sysseek instead of tell with sysread/syswrite.
truncate($fh_dst, sysseek($fh_dst, 0, SEEK_CUR))
or die($!);
Technically, the truncate isn't required because the new file will always be larger than the old one.
Upvotes: 1
Reputation: 39158
› perl -i.bak -0777 -lpe's/\x1f/_x001f_/g' 50738935.xml
› hex 50738935.xml.bak
0000 70 6f 75 72 20 76 c3 a9 72 69 66 69 1f 65 72 20 pour v.. rifi.er
0010 6c 61 20 72 c3 a9 61 63 74 69 6f 6e 0a la r..ac tion.
› hex 50738935.xml
0000 70 6f 75 72 20 76 c3 a9 72 69 66 69 5f 78 30 30 pour v.. rifi_x00
0010 31 66 5f 65 72 20 6c 61 20 72 c3 a9 61 63 74 69 1f_er la r..acti
0020 6f 6e 0a on.
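The one-liner above targets only `\x1f`. If other control characters may be present, the same substitution can be generalised. A minimal sketch, assuming the data fits in memory and is UTF-8 or another ASCII-compatible encoding (so bytes below 0x20 never occur inside a multi-byte character); the function name `escape_xml_control_bytes` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: escape every control byte that XML 1.0 does not allow
# (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F) as _xHHHH_. Tab, LF and CR are kept.
sub escape_xml_control_bytes {
    my ($data) = @_;
    $data =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F])/sprintf("_x%04x_", ord($1))/ge;
    return $data;
}

# Same bytes as the sample file in the hex dump above.
my $in  = "pour v\xc3\xa9rifi\x1fer la r\xc3\xa9action\n";
my $out = escape_xml_control_bytes($in);
print $out;
```

The trailing newline is preserved because LF (0x0A) is outside the escaped ranges.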
Upvotes: 0