Reputation: 12038
I have an XML file containing a number of HTTP responses, including the HTTP headers. I want to write each individual response out to a file with just the content, not the headers. I am struggling to remove the HTTP headers at the start of each response without messing with the rest.
#!/usr/bin/perl
use XML::Simple;
use MIME::Base64;
use URI::Escape;
#CheckArgs
....
my $input = $ARGV[0];
# Parse XML
my $xml = new XML::Simple;
my $data = $xml->XMLin("$input");
# Iterate through the file
for (my $i=0; $i < @{$data->{item}}; $i++){
my $status = $data->{item}[$i]->{status};
my $path = $data->{item}[$i]->{path};
if ($status != "200") {
print "Skipping $path due to status of $status\n";
next;
}
print "$status $path\n";
my $filename = uri_escape($path);
# The Content is Base64 Encoded
my $encoded = $data->{item}[$i]->{response}->{content};
my $decoded = decode_base64($encoded);
# Remove HTTP headers
$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//gm;
open(IMGFILE, "> $filename") or die("Can't open $filename: ".$!);
binmode IMGFILE;
print IMGFILE $decoded;
close IMGFILE;
}
$decoded looks like this before the search and replace:
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 12 Nov 2025 20:79:99 GMT
Content-Type: application/pdf
Content-Length: 88151
Last-Modified: Mon, 14 Sep 2025 20:79:99 GMT
Connection: keep-alive
ETag: "123123-123546"
Expires: Thu, 19 Nov 2025 20:79:99 GMT
Cache-Control: max-age=123456
Accept-Ranges: bytes
%PDF-1.6
%âãÏÓ
54 0 obj
<<
/Linearized 1
/O 56
/H [ 720 305 ]
/L 45164
/E 7644
/N 10
/T 43966
>>
endobj
[Lots more binary and text]
So I am trying to match from the start of the file to the first instance of two new lines with the following line:
$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m;
# s => Search Replace
# ^ => Start of file
# (.*?) => Non-greedy match anything including \r and \n
# ((\r\n)|\n|\r){2} => two new lines
# // => Replace with empty string
# m multiline to allow . to match \r\n
After a fair amount of playing with the regex I am still failing to get the result I want. From the example above, I would want my new file to start with the characters %PDF-1.6, and those characters and everything after them should be unaltered. Please note the PDF file is just an example; there are a lot of other file types I want this to work with.
$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m;
# This lets a single \r\n count as two line breaks, because the alternation can match \r and \n separately. So try:
$decoded =~ s/^(.*?)((\r\n)|([^\r]\n)|(\r[^\n])){2}//m;
Upvotes: 0
Views: 1131
Reputation: 126762
m multiline to allow . to match \r\n
The /m modifier affects only the ^ and $ characters. You need /s, which allows . to match LF.
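To make the difference concrete, here is a minimal sketch using a made-up two-header sample (the $sample string and the variable names are illustrative, not taken from the question). It applies the same substitution once with /m and once with /s:

```perl
use strict;
use warnings;

# Hypothetical sample: two header lines, a blank line, then the body.
my $sample = "A: 1\r\nB: 2\r\n\r\nBODY";

# /m only: . still refuses to match \n, so .*? cannot cross a line break.
# ^ is allowed to match at the start of any line, so the match begins at "B: 2".
(my $with_m = $sample) =~ s/^.*?(?:\r\n){2}//m;

# /s only: . matches any character, and ^ matches only at the start of the string.
(my $with_s = $sample) =~ s/^.*?(?:\r\n){2}//s;

# $with_m is "A: 1\r\nBODY" (only the last header line was removed)
# $with_s is "BODY"         (the whole header block was removed)
```

So with /m alone the substitution tends to remove just the final header line, which matches the "messing with the rest" symptom rather than stripping the whole block.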
((\r\n)|\n|\r){2} => two new lines
There is a metacharacter that does this already - \R
I suggest that something like
$decoded =~ s/^.*?\R{2,}//s
will do what you want
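As a rough sketch of how that substitution might be wrapped up, the helper below (the name strip_http_headers is hypothetical) removes everything up to and including the first blank line. It is a variant of the line above rather than the same regex: it uses \A instead of ^ and exactly two \R instead of {2,}, on the assumption that a body which itself begins with line-break bytes should keep them. \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the response body with the header block removed.
sub strip_http_headers {
    my ($response) = @_;
    # \A     start of string
    # .*?    shortest run of any characters (/s lets . match line breaks)
    # \R{2}  two consecutive line breaks; \R treats \r\n as a single break
    $response =~ s/\A.*?\R{2}//s;
    return $response;
}

# Made-up response for illustration only.
my $crlf = "HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.6\r\nrest";
my $body = strip_http_headers($crlf);    # "%PDF-1.6\r\nrest"
```

In the loop from the question this could be called as $decoded = strip_http_headers($decoded); just before the file is written.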
Upvotes: 1