Reputation: 81

retrieve lines between patterns using perl

I have a file that contains a list like below:

ID: ID_A
attr1: attribute
attr2: name
attr3: city


ID: ID_B
attr1: attribute2
attr2: name2
attr3: city3
attr4: country

The file contains about 60k entries of this sort. The unique identifier is always on the ID line. Once I hit a new ID, I need to be able to retrieve all the attributes for that ID.

I am trying to do the following:

if($line=/ID/../ID)
{
    $job[0]=$line
}

But this doesn't work, and I would also have to create an array of exactly the right size for every entry. Any tips on how to proceed would help very much.

Thank you. JS

Upvotes: 0

Answers (3)

Sobrique

Reputation: 53508

This is much easier if you make use of $/, the input record separator, and set it to "\n\n".

But as noted in the comments by Dave Cross, it would probably be better still to set it to '' (paragraph mode), because then perl treats any run of consecutive blank lines as a single separator, whilst otherwise accomplishing the same result.

#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

#set record separator to (one or more) blank lines
local $/ = '';

#iterate each chunk of data 
while ( <DATA> ) {
    #with /g in list context the match returns every captured pair in order
    #(key, value, key, value, ...), which is conveniently exactly what you
    #need to assign straight to a hash
    my %record = m/(\w+): (.*)/g;
    print Dumper \%record;
}

__DATA__
ID: ID_A
attr1: attribute
attr2: name
attr3: city

ID: ID_B
attr1: attribute2
attr2: name2
attr3: city3
attr4: country

Once you've pulled your record/fields, you can either push them into an array of records:

push ( @all_records, \%record );

Giving:

$VAR1 = [
          {
            'attr2' => 'name',
            'ID' => 'ID_A',
            'attr1' => 'attribute',
            'attr3' => 'city'
          },
          {
            'attr2' => 'name2',
            'ID' => 'ID_B',
            'attr4' => 'country',
            'attr1' => 'attribute2',
            'attr3' => 'city3'
          }
        ];

Or put it into a hash-of-hashes, keyed on ID:

$all_records{$record{ID}} = \%record;

Giving:

$VAR1 = {
          'ID_A' => {
                      'ID' => 'ID_A',
                      'attr3' => 'city',
                      'attr1' => 'attribute',
                      'attr2' => 'name'
                    },
          'ID_B' => {
                      'attr2' => 'name2',
                      'attr3' => 'city3',
                      'attr1' => 'attribute2',
                      'attr4' => 'country',
                      'ID' => 'ID_B'
                    }
        };

Which one you want depends a bit on what you're doing with the records. You may not need to 'hold' them at all if you're just processing and discarding. And if you've got duplicate IDs, you probably don't want the hash-of-hashes approach, because a later record overwrites an earlier one with the same ID (ID must be unique for that to work).
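If duplicate IDs are a possibility, one option is to keep a list of records per ID instead of a single record. The following is only a minimal sketch of that idea, not part of the original answer: the sample input (with a repeated ID_A), the in-memory filehandle, and the name %records_by_id are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; the repeated ID_A is an assumption for illustration.
my $input = <<'END';
ID: ID_A
attr1: attribute
attr2: name

ID: ID_B
attr1: attribute2


ID: ID_A
attr1: other
END

my %records_by_id;    # ID => array ref of record hash refs

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
{
    local $/ = '';    # paragraph mode: records are separated by blank lines
    while ( my $chunk = <$fh> ) {
        # Each "key: value" line contributes one pair to the hash.
        my %record = $chunk =~ m/^(\w+): (.*)$/mg;
        next unless defined $record{ID};    # skip chunks with no ID line
        # Append rather than assign, so a repeated ID does not overwrite.
        push @{ $records_by_id{ $record{ID} } }, \%record;
    }
}
close $fh;

for my $id ( sort keys %records_by_id ) {
    my $count = @{ $records_by_id{$id} };
    print "$id: $count record(s)\n";
}
```

With this layout, $records_by_id{ID_A}[0]{attr1} is the attr1 of the first ID_A record seen, and every later record with the same ID is kept alongside it rather than replacing it.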

Upvotes: 1

Matt Jacob

Reputation: 6553

It's hard to provide a decent answer without knowing your expected output format or how you intend to use this data, but this will get you 90% of the way there:

use strict;
use warnings;

my %data;
my $id;

while (<DATA>) {
    chomp;
    next unless /\S/;
    # Limit of 2 so that any colons inside the value are left intact
    my ($key, $value) = split(/\s*:\s*/, $_, 2);

    if ($key eq 'ID') {
        $id = $value;
        next;
    }

    $data{$id}{$key} = $value;
}

print "$data{ID_B}{attr2}\n";  # prints name2

__DATA__
ID: ID_A
attr1: attribute
attr2: name
attr3: city

ID: ID_B
attr1: attribute2
attr2: name2
attr3: city3
attr4: country

Upvotes: 0

Ian McGowan

Reputation: 3799

I would create a hash-of-hashes (since you don't know what attributes may be encountered in the file). The key to the main hash is ID, and the contents of each entry are another sub-hash. That sub-hash has the attribute name as the key.

This is not idiomatic perl at all, but works in my testing...

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %master;
my %tmphash;
my $oldid="";
my $id;

# Create a hash-of-hashes
while (<>) {
  if (/^ID: (.*)/) {
    $id=$1;
    # We need to skip the first one to "prime the pump"
    if ($oldid ne "") {
      $master{$oldid}={%tmphash};
    }
    $oldid=$id;
    %tmphash=();
  } else {
    # Until we get to the next ID: add anything we find to tmphash
    # Non-greedy key, so a value that itself contains ": " is kept whole
    if (/^(.*?): (.*)/) {
      $tmphash{$1}=$2;
    }
  }
}
# Don't forget the last one (unless the input had no ID lines at all)...
$master{$oldid}={%tmphash} if $oldid ne "";

print Dumper(\%master);

foreach my $id (sort keys %master) {
    foreach my $attr (keys %{ $master{$id} }) {
        print "$id, $attr: $master{$id}{$attr}\n";
    }
}

Upvotes: 0
