teebeegee

Reputation: 23

Iterate through a file multiple times, each time finding a regex and returning one line (perl)

I have one file with ~90k lines of text in 4 columns.

col1    col2     col3    value1
...
col1    col2     col3    value90000

A second file contains ~200 lines, each one corresponding to a value from column 4 of the larger file.

value1
value2
...
value200

I want to read in each value from the smaller file, find the corresponding line in the larger file, and return that line. I have written a perl script that places all the values from the small file into an array, then iterates through that array using each value as a regex to search through the larger file. After some debugging, I feel like I have it almost working, but my script only returns the line corresponding to the LAST element of the array.

Here is the code I have:

open my $fh1, '<', $file1 or die "Could not open $file1: $!";

my @array = <$fh1>;
close $fh1;

my $count = 0;

while ($count < scalar @array) {
    my $value = $array[$count];
    
    open my $fh2, '<', $file2 or die "Could not open $file2: $!";
    
    while (<$fh2>) {
        if ($_ =~ /$value/) {
            my $line = $_;
            print $line;
        }
    }
    close $fh2;
    $count++;   
}

This returns only:


col1     col2     col3   value200

I can get it to print each value of the array, so I know it's iterating through properly, but it's not using each value to search the larger file as I intended. I can also plug any of the values from the array into the $value variable and return the appropriate line, so I know the lines are there. I suspect my bug may have to do with either:

  1. newlines in the array elements, since all the elements have a newline except the last one. I've tried chomp but get the same result.

or

  2. something to do with the way I'm handling the second file with opening/closing. I've tried moving or removing the close command, and that either breaks the code or doesn't help.
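One way to test the newline hypothesis is to print each value between visible delimiters; a stray Windows carriage return (\r), which chomp does not remove, would reveal itself. A minimal debugging sketch, not part of my original script:

```perl
use strict;
use warnings;

while (my $value = <DATA>) {
    chomp $value;          # removes the trailing \n, but NOT a \r
    print "[$value]\n";    # with a hidden \r, a terminal shows "]value1" instead of "[value1]"
}

__DATA__
value1
value2
```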

Upvotes: 2

Views: 511

Answers (2)

Polar Bear

Reputation: 6808

A slightly different approach to the problem:

use warnings;
use strict;
use feature 'say';

my $values = shift;

open my $fh1, '<', $values or die "Could not open $values: $!";
my @lookup = <$fh1>;
close $fh1;

chomp @lookup;
my $re = join '|', map { '\b'.$_.'\b' } @lookup;

((split)[3]) =~ /$re/ && print while <>;    # print lines whose 4th column matches

Run as script.pl value_file data_file
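To see what the joined pattern actually looks like, here is a small standalone sketch (values hard-coded for illustration):

```perl
use strict;
use warnings;

my @lookup = ('value1', 'value2');

# Single quotes keep '\b' as a literal backslash-b, so the regex
# engine sees word-boundary assertions when $re is interpolated.
my $re = join '|', map { '\b'.$_.'\b' } @lookup;
print "$re\n";    # \bvalue1\b|\bvalue2\b
```

If a lookup value could ever contain regex metacharacters, wrapping it with quotemeta (or \Q...\E) would keep the match literal.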

Upvotes: 0

Shawn

Reputation: 52539

You should only read the 90k-line file once, checking the fourth column of each line against the set of values from the other file as you go, instead of re-reading the whole large file once per line of the smaller one:

#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;

my ($file1, $file2) = @ARGV;

# Read the file of strings to match against
open my $fh1, '<', $file1 or die "Could not open $file1: $!";
my %words = map { chomp; $_ => 1 } <$fh1>;
close $fh1;

# Process the data file in one pass
open my $fh2, '<', $file2 or die "Could not open $file2: $!";    
while (my $line = <$fh2>) {
    chomp $line;
    # Only look at the fourth column
    my @fields = split /\s+/, $line, 4;
    say $line if exists $words{$fields[3]};
}
close $fh2;

Note this uses a straight-up string comparison (via hash-key lookup) against the last column instead of regular expression matching; your sample data looks like that's all that's needed. If you're using actual regular expressions, let me know and I'll update the answer.
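If the small file really did hold patterns rather than literal strings, the same one-pass structure still works. A hedged sketch with hypothetical hard-coded patterns:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical patterns, pre-compiled once with qr//
my @patterns = map { qr/$_/ } ('value1$', '^value2');

while (my $line = <DATA>) {
    chomp $line;
    # Only look at the fourth column
    my @fields = split /\s+/, $line, 4;
    say $line if grep { $fields[3] =~ $_ } @patterns;
}

__DATA__
col1 col2 col3 value1
col1 col2 col3 value2
col1 col2 col3 value3
```

This trades the O(1) hash lookup for a scan over all patterns per line, which is the price of supporting real regexes.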


Your code does look like it should work, just horribly inefficiently. In fact, after adjusting your sample data so that more than one line matches, it does print out multiple lines for me.

Upvotes: 3
