Reputation: 23
I have one file with ~90k lines of text in 4 columns.
col1 col2 col3 value1
...
col1 col2 col3 value90000
A second file contains ~200 lines, each one corresponding to a value from column 4 of the larger file.
value1
value2
...
value200
I want to read in each value from the smaller file, find the corresponding line in the larger file, and return that line. I have written a perl script that places all the values from the small file into an array, then iterates through that array using each value as a regex to search through the larger file. After some debugging, I feel like I have it almost working, but my script only returns the line corresponding to the LAST element of the array.
Here is the code I have:
open my $fh1, '<', $file1 or die "Could not open $file1: $!";
my @array = <$fh1>;
close $fh1;
my $count = 0;
while ($count < scalar @array) {
    my $value = $array[$count];
    open my $fh2, '<', $file2 or die "Could not open $file2: $!";
    while (<$fh2>) {
        if ($_ =~ /$value/) {
            my $line = $_;
            print $line;
        }
    }
    close $fh2;
    $count++;
}
This returns only:
col1 col2 col3 value200
I can get it to print each value of the array, so I know it's iterating through properly, but it's not using each value to search the larger file as I intended. I can also plug any of the values from the array into the $value variable and return the appropriate line, so I know the lines are there. I suspect my bug may have to do with either chomp (I have tried it but get the same result) or the close command (I have tried changing it, and that either breaks the code or doesn't help).
Upvotes: 2
Views: 511
Reputation: 6808
Slightly different approach to the problem
use warnings;
use strict;
use feature 'say';
my $values = shift;
open my $fh1, '<', $values or die "Could not open $values: $!";
my @lookup = <$fh1>;
close $fh1;
chomp @lookup;
my $re = join '|', map { '\b'.$_.'\b' } @lookup;
((split)[3]) =~ /$re/ && print while <>;
Run as script.pl value_file data_file
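One caveat with the alternation built above: each value is interpolated as a pattern, so a value containing regex metacharacters such as . or + would be interpreted rather than matched literally. The following is only a sketch of one way around that, assuming the values are meant as literal strings. The helper name build_matcher and the sample values are hypothetical, and it anchors on the whole field with \A and \z instead of using \b.

```perl
use strict;
use warnings;

# Hypothetical helper: build one compiled regex that matches any of the
# given values literally, and only when the value is the whole string.
sub build_matcher {
    my @values = @_;
    # quotemeta escapes regex metacharacters so "." and "+" match themselves
    my $alternation = join '|', map { quotemeta($_) } @values;
    return qr/\A(?:$alternation)\z/;
}

# Made-up values for illustration; they are not from the original files.
my $re = build_matcher('v1.5', 'a+b');

for my $field ('v1.5', 'v1x5', 'a+b', 'aab') {
    print "$field: ", ($field =~ $re ? 'match' : 'no match'), "\n";
}
```

Compiling the pattern once with qr// also avoids rebuilding it for every line of the data file.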
Upvotes: 0
Reputation: 52539
You should only be reading the 90k line file once, and checking each value from the other file against the fourth column of each line as you do, instead of reading the whole large file once per line of the smaller one:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
my ($file1, $file2) = @ARGV;
# Read the file of strings to match against
open my $fh1, '<', $file1 or die "Could not open $file1: $!";
my %words = map { chomp; $_ => 1 } <$fh1>;
close $fh1;
# Process the data file in one pass
open my $fh2, '<', $file2 or die "Could not open $file2: $!";
while (my $line = <$fh2>) {
    chomp $line;
    # Only look at the fourth column
    my @fields = split /\s+/, $line, 4;
    say $line if exists $words{$fields[3]};
}
close $fh2;
Note that this uses a straight-up string comparison (via hash key lookup) against the last column instead of regular expression matching - your sample data looks like that's all that's needed. If you're using actual regular expressions, let me know and I'll update the answer.
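To make the difference between the two kinds of matching concrete, here is a small sketch. The sample values and lines are made up for illustration, and the sub names match_exact and match_regex are hypothetical. An unanchored regex such as /value1/ also matches value10, while the hash lookup only accepts an exact fourth column.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the original files.
my @wanted = ('value1', 'value20');
my @lines  = (
    'a b c value1',
    'a b c value10',
    'a b c value20',
    'a b c value200',
);

# Exact match: the fourth column must be a key of the hash.
sub match_exact {
    my ($wanted, $lines) = @_;
    my %words = map { $_ => 1 } @$wanted;
    return grep {
        my @f = split ' ', $_, 4;
        @f == 4 && exists $words{ $f[3] };
    } @$lines;
}

# Regex match in the style of the question: any line that contains
# the value as a substring counts as a hit.
sub match_regex {
    my ($wanted, $lines) = @_;
    my @hits;
    for my $value (@$wanted) {
        push @hits, grep { /$value/ } @$lines;
    }
    return @hits;
}

print "exact: $_\n" for match_exact(\@wanted, \@lines);
print "regex: $_\n" for match_regex(\@wanted, \@lines);
```

With this data the exact version reports two lines and the regex version reports four, because value10 and value200 contain the wanted values as substrings.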
Your code does look like it should work, just horribly inefficiently. In fact, after adjusting your sample data so that more than one line matches, it does print out multiple lines for me.
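One possible explanation for seeing only the last value, offered purely as an assumption since the question doesn't confirm it: the values file has Windows-style CRLF line endings and its final line has no terminator. With the default input record separator, chomp removes the "\n" but leaves the "\r", so every value except the last carries a stray "\r" and never matches. A minimal sketch of stripping all trailing whitespace instead (the helper name clean_value and the simulated input are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: remove any trailing whitespace, including "\r" and "\n".
sub clean_value {
    my ($value) = @_;
    $value =~ s/\s+\z//;
    return $value;
}

# Simulated lines from a values file with CRLF endings;
# the last line is unterminated. This input is an assumption.
my @raw = ("value1\r\n", "value2\r\n", "value200");

# chomp alone strips "\n" but leaves the "\r" behind.
my @chomped = @raw;
chomp @chomped;

my %by_chomp = map { $_ => 1 } @chomped;
my %by_clean = map { ( clean_value($_) => 1 ) } @raw;

for my $key ('value1', 'value200') {
    printf "%-8s chomp:%s clean:%s\n", $key,
        (exists $by_chomp{$key} ? 'found' : 'missing'),
        (exists $by_clean{$key} ? 'found' : 'missing');
}
```

If that is what is happening, value1 is missing from the chomp-only hash but found in the cleaned one, while value200 is found in both, which would match the symptom described in the question.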
Upvotes: 3