Perl: Load file into hash using while

In my last question I asked for the proper way of storing data from a text file in my Perl script; the solution was to use an AoH (array of hashes).

Anyway, my implementation seems to be incomplete:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

# Open netstat output
my $netstat_dump = "tmp/netstat-output.txt";
open (my $fh, "<", $netstat_dump) or die "Could not open file '$netstat_dump': $!";

# Store data in a hash
my %hash;
while(<$fh>) {
  chomp;
  my ($Protocol, $RecvQ, $SendQ, $LocalAddress, $ForeignAddress, $State, $PID) = split(/\s+/);
  # Exclude $RecvQ and $SendQ
  $hash{$PID} = [$Protocol, $LocalAddress, $ForeignAddress, $State, $PID];
}
close $fh;
print Dumper \%hash;

The first problem is that I get an uninitialized value error on $PID, even though $PID is declared on the line above.

The second problem is that the script takes the trailing characters from some input lines and stores them under their own keys:

$VAR1 = {
...
'6907/thin' => [
                           'tcp',
                           '127.0.0.1:3001',
                           '0.0.0.0:*',
                           'LISTEN',
                           '6907/thin'
                         ],
          '' => [
                  'udp6',
                  ':::49698',
                  ':::*',
                  '31664/dhclient',
                  ''
                ],
          'r' => [
                   'udp6',
                   ':::45016',
                   ':::*',
                   '651/avahi-daemon:',
                   'r'
                 ]
        };

The '' => and 'r' => keys come from the input file, which looks like this:

tcp        0      0 0.0.0.0:3790            0.0.0.0:*               LISTEN      7550/nginx.conf 
tcp        0      0 127.0.1.1:53            0.0.0.0:*               LISTEN      1271/dnsmasq    
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      24202/cupsd     
tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN      11222/postgres  
tcp        0      0 127.0.0.1:3001          0.0.0.0:*               LISTEN      6907/thin server (1
tcp        0      0 127.0.0.1:50505         0.0.0.0:*               LISTEN      6874/prosvc     
tcp        0      0 127.0.0.1:7337          0.0.0.0:*               LISTEN      6823/postgres.bin
tcp6       0      0 ::1:631                 :::*                    LISTEN      24202/cupsd     
udp        0      0 0.0.0.0:46096           0.0.0.0:*                           651/avahi-daemon: r
udp        0      0 0.0.0.0:5353            0.0.0.0:*                           651/avahi-daemon: r
udp        0      0 127.0.1.1:53            0.0.0.0:*                           1271/dnsmasq    
udp        0      0 0.0.0.0:68              0.0.0.0:*                           31664/dhclient  
udp        0      0 0.0.0.0:631             0.0.0.0:*                           912/cups-browsed
udp        0      0 0.0.0.0:37620           0.0.0.0:*                           31664/dhclient  
udp6       0      0 :::5353                 :::*                                651/avahi-daemon: r
udp6       0      0 :::45016                :::*                                651/avahi-daemon: r
udp6       0      0 :::49698                :::*                                31664/dhclient

It also makes me suspect that my loop is not parsing the whole file and stops somewhere.

Upvotes: 4

Answers (5)

G. Cito

Reputation: 6378

You might want to use or look at the source of some related CPAN modules to see how the authors have solved similar problems: e.g. Parse::Netstat, Regexp::Common, etc.

Upvotes: 1

Axeman

Reputation: 29854

Sometimes splitting doesn't work as well as a full specification of the data you are likely to receive, and you need a regex, especially because you have a field ("LISTEN") that may or may not be there.

You're also having a hard time separating the PID from the rest of the process information.

So here's my regex:

my $netstat_regex
    = qr{
    \A                # The beginning of input
    ( \w+ )           # the proto
    \s+
    (?: \d+ \s+ ){2}  # we don't care about these
    (                 # Open capture
        [[:xdigit:]:.]+?               
        :
        (?: \d+ )
    )                 # Close capture
    \s+
    (                 # Open capture
        [[:xdigit:]:.]+?               
        :
        (?: \d+ | \* )
    )                 # Close capture
    \s+
    (?: LISTEN \s+ )? # It might not be a listen socket. 
    ( \d+ )           # Nothing but the PID
    /
    ( .*\S )          # All the other process data (trimmed)
    }x;

Then I process it so:

my %records;

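while ( <$fh> ) {
    my %rec;
    # A list assignment in scalar context yields the number of values on its
    # right-hand side, so this test is false when the regex does not match.
    # (Testing %rec afterwards would always be true, because the slice
    # assignment creates the keys even on a failed match.)
    if ( ( @rec{ qw<proto local remote PID data> } = m/$netstat_regex/ ) ) {
        $records{ $rec{PID} } = \%rec;
    }
    else {
        print "Error processing input line #$.:\n$_\n";
    }
}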

Note that I also have some code to show me what doesn't fit my pattern, so that I can refine it if necessary. I don't give my full trust to the input.

Nice and tidy dump:

%records: {
            11222 => {
                       PID => '11222',
                       data => 'postgres',
                       local => '127.0.0.1:5432',
                       proto => 'tcp',
                       remote => '0.0.0.0:*'
                     },
            1271 => {
                      PID => '1271',
                      data => 'dnsmasq',
                      local => '127.0.1.1:53',
                      proto => 'udp',
                      remote => '0.0.0.0:*'
                    },
            24202 => {
                       PID => '24202',
                       data => 'cupsd',
                       local => '::1:631',
                       proto => 'tcp6',
                       remote => ':::*'
                     },
            31664 => {
                       PID => '31664',
                       data => 'dhclient',
                       local => ':::49698',
                       proto => 'udp6',
                       remote => ':::*'
                     },
            651 => {
                     PID => '651',
                     data => 'avahi-daemon: r',
                     local => ':::45016',
                     proto => 'udp6',
                     remote => ':::*'
                   },
            6823 => {
                      PID => '6823',
                      data => 'postgres.bin',
                      local => '127.0.0.1:7337',
                      proto => 'tcp',
                      remote => '0.0.0.0:*'
                    },
            6874 => {
                      PID => '6874',
                      data => 'prosvc',
                      local => '127.0.0.1:50505',
                      proto => 'tcp',
                      remote => '0.0.0.0:*'
                    },
            6907 => {
                      PID => '6907',
                      data => 'thin server (1',
                      local => '127.0.0.1:3001',
                      proto => 'tcp',
                      remote => '0.0.0.0:*'
                    },
            7550 => {
                      PID => '7550',
                      data => 'nginx.conf',
                      local => '0.0.0.0:3790',
                      proto => 'tcp',
                      remote => '0.0.0.0:*'
                    },
            912 => {
                     PID => '912',
                     data => 'cups-browsed',
                     local => '0.0.0.0:631',
                     proto => 'udp',
                     remote => '0.0.0.0:*'
                   }
          }

Upvotes: 4

mpapec

Reputation: 50637

You can remove the state column before split() so that every row has the same number of columns:

# assuming that state is always upper case followed by spaces and digit(s)
$State = s/\b([A-Z]+)(?=\s+\d)// ? $1 : "";
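To show how that substitution might fit into the loop, here is a minimal sketch. The two sample lines are copied from the question; the limit of 6 passed to split and the trailing-whitespace trim are my own additions (not part of the snippet above), so that a program name containing spaces stays in one field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sample lines from the question: one with a state, one without.
my @lines = (
    'tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN      11222/postgres  ',
    'udp        0      0 0.0.0.0:68              0.0.0.0:*                           31664/dhclient  ',
);

my @rows;
for (@lines) {
    # Pull the state out of $_ first; rows without one get an empty string.
    my $state = s/\b([A-Z]+)(?=\s+\d)// ? $1 : "";

    # Assumption: a limit of 6 keeps any spaces in the program name
    # inside the last field instead of splitting it further.
    my ($protocol, undef, undef, $local, $foreign, $pid) = split /\s+/, $_, 6;
    $pid =~ s/\s+\z//;    # drop the padding at the end of the line

    push @rows, [ $protocol, $local, $foreign, $state, $pid ];
}

print join('|', @$_), "\n" for @rows;
```

The same caveat as in the comment above applies: this only works while the state is upper case and followed by whitespace and a digit.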

Upvotes: 2

choroba

Reputation: 241838

If your input contains tabs, you can split on /\t/ instead. \s+ matches any whitespace, i.e. one tab as well as two tabs, so the "empty columns" are skipped.

Fixing that still doesn't hash all the lines from the input, though. Hash keys must be unique, but the input contains some PIDs more than once (1271/dnsmasq and 24202/cupsd twice, 31664/dhclient three times, and 651/avahi-daemon: r four times). You can solve the problem by using a HoAoA (hash of arrays of arrays) instead:

#!/usr/bin/perl
use warnings;
use strict;

use Data::Dumper;

my $netstat_dump = 'input.txt';
open my $FH, '<', $netstat_dump or die "Could not open file '$netstat_dump': $!";

my %hash;
while (<$FH>) {
    chomp;
    my ($Protocol, $RecvQ, $SendQ, $LocalAddress, $ForeignAddress, $State, $PID)
         = split /\t/;
    push @{ $hash{$PID} }, [ $Protocol, $LocalAddress, $ForeignAddress, $State, $PID ];
}
close $FH;
print Dumper \%hash;
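Reading the structure back takes two nested loops: one over the PIDs, one over the rows stored for each PID. A minimal sketch, using a hand-written %hash in the shape the script above would build (the sample rows and the @report array are only for illustration):

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample in the same shape: PID => [ [row], [row], ... ]
my %hash = (
    '1271/dnsmasq' => [
        [ 'tcp', '127.0.1.1:53', '0.0.0.0:*', 'LISTEN', '1271/dnsmasq' ],
        [ 'udp', '127.0.1.1:53', '0.0.0.0:*', '',       '1271/dnsmasq' ],
    ],
);

my @report;
for my $pid (sort keys %hash) {
    # Each value is a reference to an array of row references.
    for my $row (@{ $hash{$pid} }) {
        my ($protocol, $local, $foreign, $state) = @$row;
        push @report, "$pid $protocol $local " . ($state || '-');
    }
}
print "$_\n" for @report;
```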

Upvotes: 2

Oesor

Reputation: 6642

When you split a line such as:

udp        0      0 0.0.0.0:37620           0.0.0.0:*                           31664/dhclient

on whitespace you get 6 elements, not 7. This is because the state column is empty, so the PID gets assigned to $State and $PID is left undefined.

Likewise,

udp        0      0 0.0.0.0:5353            0.0.0.0:*                           651/avahi-daemon: r

stores the PID as the 6th element (state) and 'r' as the 7th (PID), because of the space between the colon and the r in the program name.

You may want to look into using unpack to split apart fixed width fields. Note that if the input has varying column widths based on content, you will need to determine column widths to use unpack.

Refer to the tutorial for a how-to for this.
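As a rough sketch of the unpack approach: the column widths in $template below are an assumption, inferred from the sample input in the question (5, 6, 6, 23, 23 and 11 characters, each followed by one separator space). They are not taken from any netstat documentation, so check them against your own output. The sample lines are rebuilt with sprintf so that the spacing is exact.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed field widths, each including its trailing separator space:
# proto, Recv-Q, Send-Q, local address, foreign address, state, PID/program.
# "A" strips trailing spaces from the extracted field.
my $template = 'A6 A7 A7 A24 A24 A12 A*';

# Rebuild two lines in the same (assumed) layout as the sample input.
my $layout = "%-5s %6s %6s %-23s %-23s %-11s %s";
my @lines  = (
    sprintf($layout, 'tcp', 0, 0, '127.0.0.1:3001', '0.0.0.0:*', 'LISTEN', '6907/thin server (1'),
    sprintf($layout, 'udp', 0, 0, '0.0.0.0:5353',   '0.0.0.0:*', '',       '651/avahi-daemon: r'),
);

my @rows;
for my $line (@lines) {
    # Recv-Q and Send-Q are discarded by assigning them to undef.
    my ($proto, undef, undef, $local, $foreign, $state, $pid)
        = unpack $template, $line;
    push @rows, [ $proto, $local, $foreign, $state, $pid ];
}

print join('|', @$_), "\n" for @rows;
```

Because the fields are cut by position, an empty state column and a space inside the program name no longer shift anything. The flip side is that an address wider than its column would break the layout.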

Upvotes: 5
