Makuza

Reputation: 149

How can I make it so that perl can recognize an unknown pattern?

I have a file such as file1:

tree_apple
tree_banana
tree_orange
tree_cherry

I want to make a script that recognizes a consistent naming structure between the items in the file. For file1, the consistent naming structure would be "tree". So I want a Perl script that reads through the file and saves the consistent naming structure in a variable, let's say $pattern. Assume that ALL items in a file share a consistent naming structure. It doesn't matter if, say, only 2 of the items in a list share a pattern; if the pattern is not present in all of the items, then it is not the consistent naming structure.

Note that the files do have some structure. They contain only alphanumeric characters, but these can be separated into groups by "_", such as the fruits being separated into a group after the "_".

Also note: the consistent naming structure is not always at the beginning; it could also be in the middle or at the end.

If we had a file such as file2:

mask_protection
gloves_protection
armour_protection
boots_protection

Now the consistent naming structure is "protection"; notice how it is at the end this time.

Or if we had a file such as file3:

123_red_456
123_blue_456
123_green_456
123_yellow_456

Now the consistent naming structure is in both the beginning and the end. It is 123 and 456.

Or finally, it could be in the middle, such as with "cell" in file4:

Apple_cell_phone
Blood_cell_donation
Prison_cell_inspection
Excel_cell_row

So is there a way to look through a file and find a consistent pattern with Perl?

Upvotes: 1

Views: 75

Answers (1)

ikegami

Reputation: 385506

If we can rely on the uniformity of the use of _ that is found in your examples, it's just a question of splitting on _ and finding columns with common values.

#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );

# The first line provides the initial template.
my @template;
if (defined( my $line = <> )) {
   chomp($line);
   @template = split(/_/, $line, -1);

   while (defined( $line = <> )) {
      chomp($line);
      my @fields = split(/_/, $line, -1);
      @template == @fields
         or die("Inconsistency in the number of fields at \"$ARGV\" line $.\n");

      # A field that differs from the template isn't part of the pattern.
      for my $i (0..$#template) {
         if (defined($template[$i]) && $template[$i] ne $fields[$i]) {
            $template[$i] = undef;
         }
      }
   }
}

# Show the varying fields as "*".
say join "_", map { $_ // '*' } @template;

Output:

$ ./a file1
tree_*

$ ./a file2
*_protection

$ ./a file3
123_*_456

$ ./a file4
*_cell_*
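
The question asks for the constant parts to end up in a variable such as $pattern. A minimal sketch of one way to do that with the same column-comparison idea follows. The helper name common_template and the choice to join the constant fields with a space are assumptions made for illustration, and the file3 lines are inlined instead of being read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the template for a list of names,
# with undef in every column whose value varies between names.
sub common_template {
    my @names = @_;
    return unless @names;

    my @template = split /_/, shift(@names), -1;
    for my $name (@names) {
        my @fields = split /_/, $name, -1;
        @fields == @template
            or die "Inconsistent number of fields in \"$name\"\n";

        for my $i (0 .. $#template) {
            $template[$i] = undef
                if defined $template[$i] && $template[$i] ne $fields[$i];
        }
    }
    return @template;
}

# The lines of file3 from the question, inlined for the example.
my @template = common_template(
    qw( 123_red_456 123_blue_456 123_green_456 123_yellow_456 )
);

# Keep only the columns that were the same on every line.
my @constants = grep { defined } @template;
my $pattern   = join ' ', @constants;

print "$pattern\n";   # prints "123 456"
```

Keeping @constants as a list may be more useful than a joined string when the constant fields are not adjacent, as in file3.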

If we can't rely on the uniformity of the use of _ that is found in your examples, you need to explain why the pattern for file3 isn't 123_*e*_456.
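
To illustrate why that matters, here is a small sketch (with the file3 lines inlined, and the glob-like pattern rewritten as a regular expression) that checks how many lines fit 123_*e*_456. Every colour name in file3 happens to contain an "e", so without the _ convention that pattern is just as consistent as 123_*_456.

```perl
use strict;
use warnings;

# The lines of file3 from the question, inlined for the example.
my @file3 = qw( 123_red_456 123_blue_456 123_green_456 123_yellow_456 );

# Count the lines that fit 123_*e*_456, written as a regex.
# grep returns the number of matches in scalar context.
my $matches = grep { /^123_.*e.*_456\z/ } @file3;

printf "%d of %d lines match\n", $matches, scalar(@file3);   # prints "4 of 4 lines match"
```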

Upvotes: 3
