CheeseConQueso

Reputation: 6051

What is the most efficient way to parse a text file using Perl?

Although this is pretty basic, I can't find a similar question, so please link to one if you know of an existing question/solution on SO.


I have a .txt file that is about 2MB and about 16,000 lines long. Each record is 160 characters long with a blocking factor of 10. It's an older type of data structure that almost looks like a tab-delimited file, but the fields are separated by single characters/whitespace.

First, I glob a directory for .txt files - there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.

my $txt_file = glob "/some/cheese/dir/*.txt";
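(For what it's worth, I think the list-context form below would do the same job while making it explicit when no file is found - same directory as above:)

my ($txt_file) = glob "/some/cheese/dir/*.txt";     # list context: take the first (and only) match
die "No .txt file found in /some/cheese/dir" unless defined $txt_file;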

Then I open the file with this line:

open (F, $txt_file) || die ("Could not open $txt_file");

As per the data dictionary for this file, I'm parsing each "field" out of each line using Perl's substr() function within a while loop.

while ($line = <F>)
{
    $nom_stat   = substr($line,0,1);
    $lname      = substr($line,1,15);
    $fname      = substr($line,16,15);
    $mname      = substr($line,31,1);
    $address    = substr($line,32,30);
    $city       = substr($line,62,20);
    $st         = substr($line,82,2);
    $zip        = substr($line,84,5);
    $lnum       = substr($line,93,9);
    $cl_rank    = substr($line,108,4);
    $ceeb       = substr($line,112,6);
    $county     = substr($line,118,2);
    $sex        = substr($line,120,1);
    $grant_type = substr($line,121,1);
    $int_major  = substr($line,122,3);
    $acad_idx   = substr($line,125,3);
    $gpa        = substr($line,128,5);
    $hs_cl_size = substr($line,135,4);
}


This approach takes a lot of time to process each line, and I'm wondering if there is a more efficient way of getting each field out of each line of the file.

Can anyone suggest a more efficient/preferred method?

Upvotes: 6

Views: 2719

Answers (4)

Ian C.

Reputation: 3973

A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module and came out with:

         Rate unpack substr regexp
 unpack 2.59/s     --   -59%   -67%
 substr 6.23/s   141%     --   -21%
 regexp 7.90/s   206%    27%     --

Input was a file with 20k lines, each containing the same 160 characters (16 repetitions of the characters 0123456789), so it's roughly the same input size as the data you're working with.

The Benchmark::cmpthese() routine lists the subroutine calls from slowest to fastest. The Rate column tells us how many times per second each subroutine can be run. The regular expression approach is the fastest, not unpack as I stated previously. Sorry about that.

The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.

#!/usr/bin/env perl
use Benchmark qw(:all);
use strict;
use warnings;

sub use_substr() {
    print "use_substr(): New itteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = (substr($line,0,1),
                           substr($line,1,15),
                           substr($line,16,15),
                           substr($line,31,1),
                           substr($line,32,30),
                           substr($line,62,20),
                           substr($line,82,2),
                           substr($line,84,5),
                           substr($line,93,9),
                           substr($line,108,4),
                           substr($line,112,6),
                           substr($line,118,2),
                           substr($line,120,1),
                           substr($line,121,1),
                           substr($line,122,3),
                           substr($line,125,3),
                           substr($line,128,5),
                           substr($line,135,4));
       #print "use_substr(): \$lname = $lname\n";
       #print "use_substr(): \$gpa   = $gpa\n";
    }    
    close(F);
    return 1;
}

sub use_regexp() {
    print "use_regexp(): New itteration\n";
    my $pattern = '^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})(.{9})(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5})(.{4})';
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        if ( $line =~ m/$pattern/o ) {
            my($nom_stat, 
               $lname,   
               $fname,      
               $mname,    
               $address,     
               $city,    
               $st,       
               $zip,         
               $lnum,        
               $cl_rank,
               $ceeb,    
               $county,
               $sex,     
               $grant_type,
               $int_major, 
               $acad_idx,  
               $gpa,   
               $hs_cl_size) = ( $1,
                                $2,
                                $3,
                                $4,
                                $5,
                                $6,
                                $7,
                                $8,
                                $9,
                                $10,
                                $11,
                                $12,
                                $13,
                                $14,
                                $15,
                                $16,
                                $17,
                                $18);
            #print "use_regexp(): \$lname = $lname\n";
            #print "use_regexp(): \$gpa   = $gpa\n";
        }
    }    
    close(F);
    return 1;
}

sub use_unpack() {
    print "use_unpack(): New itteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = unpack(
               "(A1)(A15)(A15)(A1)(A30)(A20)(A2)(A5)(A9)(A4)(A6)(A2)(A1)(A1)(A3)(A3)(A5)(A4)(A*)", $line
               );
        #print "use_unpack(): \$lname = $lname\n";
        #print "use_unpack(): \$gpa   = $gpa\n";
    }
    close(F);
    return 1;
}

# Benchmark it
my $itt = 50;
cmpthese($itt, {
        'substr' => sub { use_substr(); },
        'regexp' => sub { use_regexp(); },
        'unpack' => sub { use_unpack(); },
    }
);
exit(0);

Upvotes: 4

Joel Berger

Reputation: 20280

It looks to me like you are working with fixed-width fields here. Is that true? If it is, the unpack function is what you need. You provide a template for the fields and it will extract the info from those fields. There is a tutorial available in perlpacktut, and the template information is found in the documentation for pack, which is unpack's logical inverse. As a basic example, simply:

my @values = unpack("A1 A15 A15 ...", $line);

where 'A' means any text character (as I understand it) and the number is how many. There is quite an art to unpack as some people use it, but I believe this will suffice for basic use.
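To make that concrete, here is a rough, untested sketch of a template built from the substr offsets in the question (the xN codes skip the byte ranges the original code never captures, and 'A' also strips trailing spaces, which is convenient for fixed-width data):

my ($nom_stat, $lname, $fname, $mname, $address, $city, $st, $zip,
    $lnum, $cl_rank, $ceeb, $county, $sex, $grant_type,
    $int_major, $acad_idx, $gpa, $hs_cl_size)
    = unpack("A1 A15 A15 A1 A30 A20 A2 A5 x4 A9 x6 A4 A6 A2 A1 A1 A3 A3 A5 x2 A4", $line);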

Upvotes: 8

markijbema

Reputation: 4055

You could do something like:

while ($line = <F>){
   if ($line =~ /(.{1}) (.{15}) ........ /){
     $nom_stat = $1;
     $lname = $2;
     ...
   }
}

I think it's faster than your substr approach; I'm not sure whether it's the fastest possible solution, but I think it might very well be.

Upvotes: 0

Geo

Reputation: 96947

Do a split on each line, like this:

my @values = split(/\s/,$line);

and then work with your values.
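For example, a quick sketch (this assumes none of the fields contain embedded spaces, which may not hold for an address field, and uses /\s+/ so that runs of whitespace don't produce empty values):

my @values = split(/\s+/, $line);                 # split on runs of whitespace
my ($nom_stat, $lname, $fname) = @values[0..2];   # then pull out whichever fields you need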

Upvotes: 0
