Tags: perl, split, multiline, text-parsing, data-analysis

Reputation: 13062

Parsing multiline data in Perl

I have some data that I need to analyze. The data is multiline, and each block is separated from the next by a blank line. It looks something like this:

Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

I need to select only those data blocks that have some particular Property present: for example, only those that have Property 4, or only those that have both Property 3 and Property 6. I might also need to select based on the values of these Properties, for example only those blocks that have Property 3 with the value 'an'.

How would I do this in Perl? I tried splitting on "\n", but that didn't seem to work properly. Am I missing something?

Upvotes: 1

Answers (8)

jmcnamara

Reputation: 41624

In relation to the first part of your question, you can read records in "paragraph mode" using perl's -00 commandline option, for example:

#!/usr/bin/perl -00

my @data = <>;

# Print the last block.
print $data[-1], "\n";

Upvotes: 0

Alex Reynolds

Reputation: 96967

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $propertyRef;
my $propertyRefIdx = 0;

while (<>) {
    chomp($_);
    if ($_ =~ /Property (\d+): (.*)/) {
        my $propertyKey = $1;
        my $propertyValue = $2;

        $propertyRef->[$propertyRefIdx]->{$propertyKey} = $propertyValue;
    }
    else {
        $propertyRefIdx++;
    }
}

print Dumper $propertyRef;

Let's say this script is called propertyParser.pl and you have a file containing the properties and values called properties.txt. You could call this as follows:

$ propertyParser.pl < properties.txt

Once you have populated $propertyRef with all your data, you can then loop through elements and filter them based on whatever rules you need to apply, such as certain key and/or value combinations:

foreach my $property (@{$propertyRef}) {
    if (defined $property->{1} && defined $property->{3} 
                               && ! defined $property->{6}) {
        # do something for keys 1 and 3 but not 6, etc.
    }
}

Upvotes: 2

Axeman

Reputation: 29854

Your record separator should be "\n\n". Every line ends with one newline, and a double newline marks the end of a block. Using this idea, it is fairly easy to pick out the blocks with Property 4.

use strict;
use warnings;
use English qw<$RS>;

open( my $inh, ... ) or die "I'm dead!";

local $RS = "\n\n";
while ( my $block = <$inh> ) { 
    if ( my ( $prop4 ) = $block =~ m/^Property 4:\s+(.*)/m ) { 
        ...
    }
    if ( my ( $prop3, $prop6 ) 
             = $block =~ m/
        ^Property \s+ 3: \s+ ([^\n]*)
        .*?
        ^Property \s+ 6: \s+ ([^\n]*)
        /smx 
       ) {
        ...
    }
}

Both expressions use the multiline ('m') flag, so that ^ matches at the start of any line. The second one also uses the 's' flag, so that '.' matches newlines, and the extended syntax ('x') flag, which, among other things, ignores whitespace within the expression.
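
To see what each of those flags changes, here is a minimal sketch. The sample string and the variable names are made up for illustration; the patterns are simplified versions of the ones above.

```perl
use strict;
use warnings;

# Hypothetical sample block, similar to the data in the question.
my $block = "Property 1: just\nProperty 3: an\nProperty 6: example\n";

# Without /m, ^ only matches at the very start of the string,
# so a property on the second line is not found.
my $no_m   = ( $block =~ /^Property 3:/  ) ? 1 : 0;
my $with_m = ( $block =~ /^Property 3:/m ) ? 1 : 0;

# Without /s, '.' does not match a newline, so the pattern
# cannot stretch from one line to a later one.
my $no_s   = ( $block =~ /^Property 3:.*?^Property 6:/m  ) ? 1 : 0;
my $with_s = ( $block =~ /^Property 3:.*?^Property 6:/ms ) ? 1 : 0;

# With /x, literal whitespace in the pattern is ignored,
# so real whitespace in the data has to be matched with \s.
my ($value) = $block =~ / ^Property \s+ 3: \s+ (\S+) /mx;

print "without /m: $no_m, with /m: $with_m\n";
print "without /s: $no_s, with /s: $with_s\n";
print "value of Property 3: $value\n";
```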

If the data were rather small, you could process it all in one go:

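use strict;
use warnings;
use English qw<$RS>;
use Data::Dumper;    # needed for the Dump call below

local $RS = "\n\n";
# The leading '+' makes Perl parse the inner braces as an anonymous hash.
my @block
    = map { +{ m/^Property \s+ (\d+): \s+ (.*?\S) \s+/gmx } } <DATA>
    ;
print Data::Dumper->Dump( [ \@block ], [ '*block' ] ), "\n";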

Which shows the result to be:

@block = (
           {
             '1' => '1234',
             '3' => 'ACBGD',
             '2' => '34546'
           },
           {
             '4' => '4567',
             '1' => '1234'
           },
           {
             '6' => 'example',
             '1' => 'just',
             '3' => 'an',
             '5' => 'simple'
           }
         );

Upvotes: 1

OMG_peanuts

Reputation: 1817

Assuming that your data is stored in a file (let's say mydata.txt), you could write the following Perl script (let's call him Bob.pl):

my @currentBlock = ();
my $displayCurrentBlock = 0;
# This will iterate on each line of the file
while (<>) {
  # We check the content of $_ (the current line)
  if ($_ =~ /^\s*$/) {
    # $_ is an empty line, so we display the current block if needed
    print @currentBlock if $displayCurrentBlock;
    # Current block and display status are reset
    @currentBlock = ();
    $displayCurrentBlock = 0;
  } else{
    # $_ is not an empty line, we add it to the current block
    push @currentBlock, $_;
    # We set the display status to true if a certain condition is met
    $displayCurrentBlock = 1 if ($_ =~ /Property 3: an\s+$/);
  }
}
# A last check and print for the last block
print @currentBlock if $displayCurrentBlock;

Next, you just have to launch perl Bob.pl < mydata.txt, and voilà!

localhost> perl Bob.pl < mydata.txt
Property 1: just
Property 3: an
Property 5: simple
Property 6: example

Upvotes: 0

Dave Cross

Reputation: 69294

The secret to making this task simple is to use the $/ variable to put Perl into "paragraph mode". That makes it easy to process your records one at a time. You can then filter them with something like grep.

#!/usr/bin/perl

use strict;
use warnings;

my @data = do {
  local $/ = '';
  <DATA>;
};

my @with_4   = grep { /^Property 4:/m } @data;

my @with_3   = grep { /^Property 3:/m } @data;
my @with_3_6 = grep { /^Property 6:/m } @with_3;

print scalar @with_3_6;

__DATA__
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

In that example I'm processing each record as plain text. For more complex work, I'd probably turn each record into a hash.

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

my @data;

{
  local $/ = '';

  while (<DATA>) {
    chomp;

    my @rec = split /\n/;
    my %prop;
    foreach my $r (@rec) {
      # The limit of 2 keeps a value intact even if it contains ': ' itself.
      my ($k, $v) = split /:\s+/, $r, 2;
      $prop{$k} = $v;
    }

    push @data, \%prop;
  }
}

my @with_4   = grep { exists $_->{'Property 4'} } @data;

my @with_3_6 = grep { exists $_->{'Property 3'} and
                      exists $_->{'Property 6'} } @data;

my @with_3an = grep { exists $_->{'Property 3'} and
                      $_->{'Property 3'} eq 'an' } @data;

print Dumper @with_3an;

__DATA__
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

Upvotes: 14

0xDEADBEEF

Reputation: 858

Check what the $/ variable (the input record separator) can do for you. You can set the 'end of line' separator to whatever you please. You could try setting it to "\n\n":

$/ = "\n\n";
foreach my $property (<DATA>)
    {
    print "$property\n";
    }


__DATA__
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

As your data elements seem to be delimited by blank lines, this will read each group of property lines one by one.
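
One thing to watch for: "\n\n" is taken literally, whereas the empty string puts Perl into "paragraph mode", which treats a run of several blank lines as a single separator. The following is a small sketch of the difference; the sample text (with two blank lines between the blocks) and the read_records helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample with TWO blank lines between the blocks.
my $text = "Property 1: 1234\n\n\nProperty 4: 4567\n";

# Hypothetical helper: read all records from an in-memory
# filehandle using the given input record separator.
sub read_records {
    my ($separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = $separator;
    my @records = <$fh>;
    return @records;
}

my @literal   = read_records("\n\n");    # exactly two newlines
my @paragraph = read_records('');        # paragraph mode

# With "\n\n" the extra newline is left at the front of the second
# record; paragraph mode swallows the whole run of blank lines.
for my $case ( [ '"\n\n"', \@literal ], [ '""', \@paragraph ] ) {
    my ( $label, $records ) = @$case;
    my $leading = $records->[1] =~ /^\n/ ? 'yes' : 'no';
    print "separator $label: second record starts with a newline: $leading\n";
}
```

If the input may contain stray extra blank lines, paragraph mode is probably the safer choice of the two.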

You could also read the entire file into an array and process it from memory

my(@lines) = <DATA>;

Upvotes: 0

musiKk

Reputation: 15189

Quick and dirty:

my $string = <<END;
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example
END

my @blocks = split /\n\n/, $string;

my @desired_blocks = grep /Property 1: 1234/, @blocks;

print join("\n----\n", @desired_blocks), "\n";

Upvotes: 2

Matt Gumbley

Reputation: 427

Depending on the size of each property set and how much memory you have...

I'd use a simple state machine that scans sequentially through the file - with a line-by-line sequential scan, not multiline - adding each property/id/value to a hash keyed on id. When you get a blank line or end-of-file, determine whether the elements of the hash should be filtered in or out, and emit them as necessary, then reset the hash.
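
A minimal sketch of that approach might look like the following. The names filter_blocks and wanted are hypothetical, and the filter shown (Property 3 equal to 'an') is just one of the conditions from the question; only one block is held in memory at a time while scanning.

```perl
use strict;
use warnings;

# Hypothetical filter: keep a block if Property 3 exists and equals 'an'.
sub wanted {
    my ($prop) = @_;
    return exists $prop->{3} && $prop->{3} eq 'an';
}

# Scan the input line by line, collecting properties into a hash
# keyed on the property id. A blank line or end of input closes
# the current block, which is then kept or discarded.
sub filter_blocks {
    my ( $fh, $keep ) = @_;
    my ( %prop, @kept );

    my $flush = sub {
        push @kept, {%prop} if %prop && $keep->( \%prop );
        %prop = ();    # reset state for the next block
    };

    while ( my $line = <$fh> ) {
        if ( $line =~ /^\s*$/ ) {
            $flush->();
        }
        elsif ( $line =~ /^Property\s+(\d+):\s*(.*?)\s*$/ ) {
            $prop{$1} = $2;
        }
    }
    $flush->();    # the last block may not end with a blank line
    return @kept;
}

# Demo on an in-memory filehandle holding the question's sample data.
my $sample = <<'END';
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example
END

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
for my $block ( filter_blocks( $fh, \&wanted ) ) {
    print "Property $_: $block->{$_}\n" for sort { $a <=> $b } keys %$block;
    print "\n";
}
```

Here the matching blocks are collected and returned for simplicity; to emit each block as soon as it is complete, print inside the $flush closure instead of pushing onto @kept.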

Upvotes: 3
