ES55
ES55

Reputation: 480

Using a regular expression with nested for loops, using Perl

I have two arrays:

For example, these arrays could contain

@file_list = ('Bob_car', 'Bob_house', 'Bob_work', 'Fred_car', 'Fred_house', 'Fred_work', ...);
@name_list = ('Bob', 'Fred', ...);

(the real data is not that simple).

My goal is to compare each file with every name and see if they match. They match if the file string starts with the name.

I could then use these matches to sort the files into new directories, based on their corresponding name.

Here is my code:

for ( my $i = 0; $i < scalar @file_list ; $i++ )
   {
    for ( my $j = 0; $j < @name_list ; $j++ )
        {
         if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
            {
             print "$file_list[ $i ] goes with $name_list[ $j ]\n"; 
            } 
         else
            {
             print "no match\n";   
            }
        }
   }

However, I don't get any matches. I've tested the individual loops and they are working. Else, is there something off about the regex?

About how the arrays were made:

For @name_list, the file containing the names is organized in a seemingly random way, just because of how it was used for something else. The names in that file are on several different lines, with lots of blank lines in between and lots of blank entries within lines. Names can appear more than once.

I used the following code to make @name_list:

while (my $line = <$OriginalFILE>) 
    {
     chomp $line;
     my @current_line = split( "\t", $line );

     for ( my $i = 0; $i < scalar @current_line ; $i ++ )
         {
          if ( $current_line[ $i ] =~ m/^\s*$/ )
             {
              # print "$current_line[$i] is blank\n"; 
             }
          else 
             {
              push( @raw_name_list, $current_line[ $i ] );   
             }
         } # end of for
    } # while

# collect list without repeat instances of the same name

my %unique = ();
foreach my $name (@raw_name_list)
    {
     $unique{$name} ++;
    }
my @name_list = keys %unique; 

foreach my $name ( @name_list )
   {
    # print "$name\n";
    chomp $name; 

    unless(mkdir $name, 0700) 
        {
         die "Unable to create directory called $name\n";
        }
   }    

The array @file_list was made using:

opendir(DIR, $ARGV[1]);                             
my @file_list = grep ! /^\./, readdir DIR;
closedir(DIR); 
# print @file_list;

@amon, here is what i did to test the loops and regex:

FILE: for my $file (@transposed_files) {
  print "$file\n";
  for my $name (@transposedunique) {
    print "i see this $name\n";
    if ($file =~ /^\Q$name\E/) {
      print "$file goes with $name\n";
      next FILE;
    }
  }
  #print "no match for $file\n";
}

oh, and I transposed the arrays, so that they would print to an outfile into separate rows.

Upvotes: 1

Views: 944

Answers (4)

Hynek -Pichi- Vychodil
Hynek -Pichi- Vychodil

Reputation: 26141

I'm always interested in doing things in efficient way so every time I see O(N^2) algorithm rings bells for me. Why it should be O(N*M) and not O(N+M)?

my $re = join('|',map quotemeta, @name_list);
$re = qr/$re/;
for my $file (@file_list) {
  if($file =~ /^($re)/) {
    my $name = $1;
    ... do what you need
  }
}

Upvotes: 1

amon
amon

Reputation: 57646

Short version: You are building your name array wrong. Look at this line:

$unique{name} ++;

You are just incrementing the name entry of the hash. You probably wanted the $name variable.

The Longer Version

On English, and Foreach-Loops

Your code is a bit unperlish and looks more like C than like Perl. Perl is much closer to English than you might think. From the original wording of your question:

take the first element from @file_list and then to compare that to each element in @name_list

You wrote this as

for (my $i = 0; $i < @file_list; $i++) {
  for (my $j = 0; $j < @name_list; $j++) {
    ...; # compare $file_list[$i] with $name_list[$j]
  }
}

I'd rather do

for my $file (@file_list) {
  for my $name (@name_list) {
    ...; # compare $file with $name
  }
}

and save myself from the hassle of array subscripting.

Building Correct Regexes

Your code contains the following test:

$file_list[ $i ] =~ m/^$name_list[ $j ]/

This will not do what you think if $name_list[$j] contains special characters like (, ., +. You can match the literal contents of a variable by enclosing it in \Q ... \E. This would make the code

$file =~ /^\Q$name\E/

(if used with my variant of the loop).

You could also go the nifty route and compare the leading substring directly:

$name eq substr $file, 0, length($name)

This expresses the same condition.

On Loop Control

I will make two assumptions:

  1. You are only interested in the first matching name for any file
  2. You only want to print the no match message if no name was found

Perl allows us to break out of arbitrary loops, or restart the current iteration, or go directly to the next iteration, without using flags, as you would do in other languages. All we have to do is to label our loops like LABEL: for (...).

So once we have a match, we can start our search for the next file. Also, we only want to print no match if we left the inner loop without going to the next file. This code does it:

FILE: for my $file (@file_list) {
  for my $name (@name_list) {
    if ($file =~ /^\Q$name\E/) {
      print "$file goes with $name\n";
      next FILE;
    }
  }
  print "no match for $file\n";
}

The Zen of Negation

In your file parsing code, you express a condition

if ($field =~ /^\s*$/) {
} else {
  # do this stuff only if the field does not consist only of
  # zero or more whitespace characters
}

That description is far to complex. How about

if ($field =~ /\S/) {
  # do this stuff only if the field contains a non-whitespace character.
}

The same condition, but simpler, and more efficient.

Simplify your Parse

In short, your file parsing code can be condensed to

my %uniq;
while (<$OriginalFILE>) {
  chomp;
  $uniq{$_} = undef for grep /\S/, split /\t/;
}
my @name_list = sort { length($b) <=> length($a) } keys %uniq;

The split function takes a regex as first argument, and will split on $_ if no other string is specified. It returns a list of fields.

The grep function takes a condition and a list, and will return all elements of a list that match the condition. The current element is in $_, which regexes match by default. For explanation of the regex, see above.

Note: This still allows for the fields to contain whitespace, even in leading position. To split on all whitespace, you can give split the special argument of a string containing a single space: split ' '. This would make the grep unneccessary.

The for loop can also be used as a statement modifier, i.e. like EXPR for LIST. The current element is in $_. We assign something to the $_ entry in our %uniq hash (which is already initialized to the empty hash). This could be a number, but undef works as well.

The keys are returned in a seemingly random order. But as multiple names could match a file, but we only want to select one match, we will have to match the most specific name first. Therefore, I sort the names after their length in descending order.

Upvotes: 2

Civa
Civa

Reputation: 2176

its look something wrong in loop.

follow comments in code

for ( my $i = 0; $i < scalar @file_list ; $i++ )
{
    #use some string variable assign it ""
for ( my $j = 0; $j < @name_list ; $j++ )
    {
     if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
        {
        # assign string variable to founded name_list[$j]  
        break loop
        } 

    }
     # check condition if string not equal to  "" match found print your requirement with string value else match not found

}

Upvotes: 0

Matt Ryall
Matt Ryall

Reputation: 10595

Your code seems to work for me. All I did was construct two arrays like this:

my @file_list = qw/Bob_car Bob_house Bob_work Fred_car Fred_house Fred_work/;
my @name_list = qw/Fred Bob Mary/;

Then running your code produces output like this:

no match
Bob_car goes with Bob
no match
no match
Bob_house goes with Bob
no match
no match
Bob_work goes with Bob
no match
Fred_car goes with Fred
no match
no match
Fred_house goes with Fred
no match
no match
Fred_work goes with Fred
no match
no match

So it looks like it's working.

A common problem with reading input from files or from a user is forgetting to strip the newline character from the end of the input. This could be your problem. If so, have a read about perldoc -f chomp, and just chomp each value as you add it to your array.

Upvotes: 1

Related Questions