capser

Reputation: 2635

Perl - push lines in between regex matches into one element of an array

This is the log file I am dealing with:

|
blah1a
blah1b
blah1c
|
****blahnothing1
|
blah2a
blah2b
blah2c
|
blahnothing2
|
blah3a
blah3b
blah3c
|
blahnothing3

The information that I need is nestled between two pipe characters. There are a lot of lines that start with asterisks; I skip over them. Each line has Windows end-of-line characters. The data between the pipe characters is contiguous, but when read on a Linux host it is chopped up by the Windows newlines. I wrote the Perl script with a range operator between the two delimiter lines, hoping that everything from one pipe delimiter up to the next would get pushed into a single array element, and that the process would then start again at the following pipe. Each array element would hold all the lines between two pipe characters.

Ideally the array would look like this, sans the Windows control characters.

$lines[0] blah1a blah1b blah1c
$lines[1] blah2a blah2b blah2c
$lines[2] blah3a blah3b blah3c

However, the array elements do not look like that.

#!/usr/bin/perl

use strict ;
use warnings ;

my $delimiter = "|";
my $filename = $ARGV[0] ;
my @lines ;
open(my $fh, '<:encoding(UTF-8)' , $filename) or die "could not open file $filename $!";

while (my $line = readline $fh) {
    next if ($line =~/^\*+/) ;
    if ($line =~ /$delimiter/ ... $line =~/$delimiter/) {
    push (@lines, $line) ;
    }


}

print  $lines[0] ;
print  $lines[1] ;
print  $lines[2] ;

Upvotes: 0

Views: 908

Answers (2)

zdim

Reputation: 66899

It seems that you want to merge the lines between | characters into a string, which gets placed on an array.

One way is to set | as the input record separator, so that each read returns a chunk between pipes:

{  # localize the change to $/

    local $/ = "|";
    open(my $fh, '<:encoding(UTF-8)' , $filename) 
        or die "could not open file $filename $!";

    my @records;
    while (my $section = <$fh>)
    {
        next if $section =~ /^\s*\*/;  
        chomp $section;                # remove the record separator (| here)
        $section =~ s/\R/ /g;          # clean up newlines
        $section =~ s/^\s*//;          # clean up leading spaces
        push @records, $section if $section;
    }
    print "$_\n" for @records;
}

I skip a "section" if it starts with * (after optional whitespace). There can be more restrictive versions. The $section can end up being an empty string, so we push it on the array conditionally.

Output, with the example from the question copy-pasted into the input file named by $filename:

blah1a blah1b blah1c 
blah2a blah2b blah2c 
blahnothing2 
blah3a blah3b blah3c 
blahnothing3 
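
The records above still carry a trailing space, because only leading whitespace is cleaned up. If that matters, a minimal sketch of trimming both ends could look like this (trim_records is a hypothetical helper name, not part of the code above):

```perl
use strict;
use warnings;

# Hypothetical helper: returns copies of the records with
# leading and trailing whitespace removed.
sub trim_records {
    my @trimmed = @_;                 # copy, so the caller's data is untouched
    s/^\s+|\s+$//g for @trimmed;      # strip whitespace at both ends
    return @trimmed;
}

my @records = trim_records("blah1a blah1b blah1c ", "blahnothing2 ");
print "[$_]\n" for @records;
```

The brackets in the print are only there to make the absence of trailing spaces visible.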

The approach in the question is fine, but you need to merge lines within a "section" (between pipes) and place each such string on the array. So you need a flag to track when you enter and leave a section.
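
A minimal sketch of that flag-based idea, under a few assumptions that are mine rather than from the question: a line starting with | alternately opens and closes a section, lines starting with * are skipped, and anything outside a section is dropped. The name collect_sections is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: merges the lines found between pairs of pipe lines.
# Assumes pipe lines strictly alternate between opening and closing a section.
sub collect_sections {
    my @input = @_;
    my (@records, @current);
    my $in_section = 0;

    for my $line (@input) {
        $line =~ s/\R\z//;               # strip an LF or CRLF line ending
        next if $line =~ /^\*/;          # skip lines starting with an asterisk

        if ($line =~ /^\|/) {
            # A pipe line flips the flag; when it closes a section,
            # store the merged lines as one array element.
            push @records, join(' ', @current) if $in_section && @current;
            @current    = ();
            $in_section = !$in_section;
            next;
        }

        push @current, $line if $in_section;   # keep only lines inside a section
    }
    return @records;
}

my @lines = collect_sections(
    "|\r\n", "blah1a\r\n", "blah1b\r\n", "blah1c\r\n", "|\r\n",
    "****blahnothing1\r\n",
    "|\r\n", "blah2a\r\n", "blah2b\r\n", "blah2c\r\n", "|\r\n",
    "blahnothing2\r\n",
);
print "$_\n" for @lines;
```

With a file, the same function could be fed the lines read from the handle, e.g. collect_sections(<$fh>). Note that the pipe is escaped in the regex; an unescaped | is an alternation and matches every line.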

Upvotes: 1

Borodin

Reputation: 126742

This seems to satisfy your requirement.

I've left the two lines blahnothing2 and blahnothing3 in place, as I couldn't see a rationale for removing them.

The \R regex pattern is the generic newline: it matches the newline sequences from any platform, including CR, LF, and CRLF (the latter as a single unit).

use strict;
use warnings 'all';

my $data = do {
    open my $fh, '<:raw', 'blah.txt' or die $!;
    local $/;
    <$fh>;
};

$data =~ s/^\s*\*.*\R/ /gm; # Remove lines starting with *
$data =~ s/\R/ /g;          # Change all line endings to spaces

# Split on pipe and remove blank elements
my @data = grep /\S/, split /\s*\|\s*/, $data; 

use Data::Dump;
dd \@data;

Output:

[
  "blah1a blah1b blah1c",
  "blah2a blah2b blah2c",
  "blahnothing2",
  "blah3a blah3b blah3c",
  "blahnothing3 ",
]
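
To illustrate the point about \R made in this answer, here is a small sketch showing that the same substitution handles all three line-ending styles and produces a single space even for CRLF. The helper name newlines_to_spaces is hypothetical; \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every generic newline with a single space.
sub newlines_to_spaces {
    my ($text) = @_;
    $text =~ s/\R/ /g;    # \R treats CRLF as one unit, so it yields one space
    return $text;
}

for my $eol ("\n", "\r\n", "\r") {
    my $joined = newlines_to_spaces("blah1a${eol}blah1b${eol}");
    print "[$joined]\n";  # prints [blah1a blah1b ] for all three endings
}
```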

Upvotes: 2
