David Waters

Reputation: 12038

Perl Regex: Matching From Start of File to Pattern

I have an XML file containing a number of HTTP responses, including the HTTP headers. I want to write each response out to its own file with just the content, not the headers. I am struggling to remove the HTTP headers at the start of each response without altering the rest.

#!/usr/bin/perl
use XML::Simple;
use MIME::Base64;
use URI::Escape;

#CheckArgs
....
my $input = $ARGV[0];

# Parse XML
my $xml = XML::Simple->new;
my $data = $xml->XMLin($input);

# Iterate through the file
for (my $i=0; $i < @{$data->{item}}; $i++){
    my $status = $data->{item}[$i]->{status};
    my $path = $data->{item}[$i]->{path};
    if ($status != 200) {
        print "Skipping $path due to status of $status\n";
        next;
    }
    print "$status $path\n";
    my $filename = uri_escape($path);
    # The Content is Base64 Encoded
    my $encoded = $data->{item}[$i]->{response}->{content};
    my $decoded = decode_base64($encoded);

    # Remove HTTP headers
    $decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//gm; 
    open(IMGFILE, '>', $filename) or die("Can't open $filename: $!");
    binmode IMGFILE;
    print IMGFILE $decoded;
    close IMGFILE;
}

$decoded looks like this before the search and replace:

HTTP/1.1 200 OK
Server: nginx
Date: Thu, 12 Nov 2025 20:79:99 GMT
Content-Type: application/pdf
Content-Length: 88151
Last-Modified: Mon, 14 Sep 2025 20:79:99 GMT
Connection: keep-alive
ETag: "123123-123546"
Expires: Thu, 19 Nov 2025 20:79:99 GMT
Cache-Control: max-age=123456
Accept-Ranges: bytes


%PDF-1.6
%âãÏÓ
54 0 obj
<< 
/Linearized 1 
/O 56 
/H [ 720 305 ] 
/L 45164 
/E 7644 
/N 10 
/T 43966 
>> 
endobj
[Lots more binary and text]

So I am trying to match from the start of the file to the first instance of two new lines with the following line:

$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m;
# s => Search Replace
# ^ => Start of file
# (.*?) => Non-greedy match anything including \r and \n
# ((\r\n)|\n|\r){2} => two new lines 
# // => Replace with empty string
# m multiline to allow . to match \r\n

After a fair amount of experimenting with the regex, I am still not getting the result I want. For the example above, the new file should start with the characters %PDF-1.6, and those characters and everything after them should be unaltered. Please note the PDF file is just an example; there are a lot of other file types I want this to work with.

EDIT 1

$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m; 
# Because of the alternation this can match a single \r\n as two newlines (\r, then \n), so try:
$decoded =~ s/^(.*?)((\r\n)|([^\r]\n)|(\r[^\n])){2}//m;

Upvotes: 0

Views: 1131

Answers (1)

Borodin

Reputation: 126762

"m multiline to allow . to match \r\n"

The /m modifier affects only the ^ and $ anchors: with it, ^ matches at the start of every line instead of only at the start of the string. To let . match a line feed you need /s instead.
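To make the difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical two-line sample string.
my $text = "first\nsecond\n";

# Without /m, ^ anchors only at the start of the string.
my @without_m = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

# Without /s, . stops at the first newline.
my ($without_s) = $text =~ /^(.*)/;

# With /s, . matches newlines too, so .* runs to the end of the string.
my ($with_s) = $text =~ /^(.*)/s;

print scalar(@without_m), " match(es) without /m, ",
      scalar(@with_m), " with /m\n";
print length($without_s), " chars captured without /s, ",
      length($with_s), " with /s\n";
```

For stripping headers you want the second behaviour of . (so add /s) and the first behaviour of ^ (so drop /m).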

"((\r\n)|\n|\r){2} => two new lines"

There is already a metacharacter for this, \R. It matches a generic line break, including \r\n, \n and \r, and is available from Perl 5.10.
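As a rough illustration of why \R helps here: the alternation can backtrack and count the \r and the \n of a single CRLF as two separate newlines, so the substitution stops after the status line. \R treats CRLF as one unit. The sample response below is invented, and I am assuming CRLF line endings, as HTTP uses.

```perl
use strict;
use warnings;

# Hypothetical sample response with CRLF line endings.
my $response = "HTTP/1.1 200 OK\r\nServer: nginx\r\n\r\nBODY";

# The alternation from the question (with /s instead of /m).
# After "OK" the engine falls back to matching \r and then \n
# as the "two newlines", so only the status line is removed.
(my $with_alternation = $response) =~ s/^(.*?)((\r\n)|\n|\r){2}//s;

# \R consumes \r\n as a single line break, so two of them
# are only found at the blank line that ends the headers.
(my $with_R = $response) =~ s/^.*?\R{2}//s;

print "alternation left the headers in place\n"
    if $with_alternation =~ /^Server:/;
print "\\R left only the body\n"
    if $with_R eq 'BODY';
```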

I suggest that something like

$decoded =~ s/^.*?\R{2,}//s

will do what you want
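One caveat, sketched below under some assumptions: {2,} will also swallow any line-break bytes that happen to open the body, which could matter for binary content. Since HTTP ends the header block with exactly one blank line, {2} may be the safer quantifier. The helper name strip_http_headers and the sample data are hypothetical, not from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the body, or undef if no blank line is found.
sub strip_http_headers {
    my ($raw) = @_;
    # \A anchors at the very start of the string regardless of /m.
    # {2} rather than {2,} so line breaks that open the body survive.
    return $raw =~ /\A.*?\R{2}(.*)\z/s ? $1 : undef;
}

# Invented response whose body itself starts with a line feed.
my $raw = "HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n\n%PDF-1.6\n";

my $body = strip_http_headers($raw);

# For comparison: {2,} also consumes the body's leading line feed.
(my $greedy = $raw) =~ s/^.*?\R{2,}//s;

print "exact {2} kept ", length($body), " bytes, ",
      "{2,} kept ", length($greedy), " bytes\n";
```

In the loop from the question you would call the helper on $decoded and skip the item (or warn) when it returns undef, rather than writing out a file that still contains headers.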

Upvotes: 1
