kaylani2

Reputation: 83

Including regex on variable before matching string

I'm trying to find and count occurrences of words, read from one text file, in another text file. So far I can only find a word when it is written exactly, not when it is munged (a changed to @, or i changed to 1). Is it possible to add a regex to my strings for matching, or something similar? This is my code so far:

sub getOccurrenceOfStringInFileCaseInsensitive
{
  my $fileName = $_[0];
  my $stringToCount = $_[1];
  my $numberOfOccurrences = 0;
  my @wordArray = wordsInFileToArray ($fileName);

  foreach (@wordArray)
  {
    my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
    $numberOfOccurrences += $numberOfNewOccurrences;
  } 


  return $numberOfOccurrences;
}

The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them. Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.

Example: I would like to extract both lines from the file. example.txt:

russ1@anh@ck3r

russianhacker

# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);

Thanks in advance for any responses.

Edit:

The possible substitutions will be defined by a user, and the regex must be built to fit them. A user could say that a common substitution is to change the letter "a" to "@" or even "1". The possible changes are completely arbitrary. When searching for a specific word ("russian", for example) this could be done with something like:

(m/russian/i); # would just match the word as it is
(m/russi[a@1]n/i); # would match the munged word

But I'm not sure how to do that if I have the string to match stored in a variable, such as:

$stringToSearch = "russian";

Upvotes: 2

Views: 187

Answers (3)

zdim

Reputation: 66881

There are parts of the problem which aren't specified precisely enough (yet).

Some roll-your-own approaches, depending on the details, are

  • If user defined substitutions are global (replace every occurrence of a character in every string) the user can submit a mapping, as a hash say, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled, words, if found). There may be false positives so also plan on some post-processing

  • If the user can supply a list of substitutions along with words that they apply to (the mangled or the corresponding unmangled ones) then we can have a more targeted run
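For the first case, a minimal sketch of what a user-supplied mapping could look like, assuming one-to-one character substitutions only. The hash `%substitutions` and the helper `build_munged_regex` are made-up names for illustration, not part of any module. Every mapped letter in the word becomes a character class holding the letter and its possible replacements.

```perl
use strict;
use warnings;

# Hypothetical user-supplied mapping: letter => characters that may replace it.
my %substitutions = (
    a => ['@', '4'],
    i => ['1'],
    e => ['3'],
);

# Turn a plain word into a regex in which every mapped letter becomes a
# character class containing the letter and its possible replacements.
sub build_munged_regex {
    my ($word, $subs) = @_;
    my @parts;
    for my $char (split //, $word) {
        my $alts = $subs->{ lc $char };
        if ($alts) {
            push @parts, '[' . join('', map { quotemeta($_) } $char, @$alts) . ']';
        }
        else {
            push @parts, quotemeta($char);
        }
    }
    my $pattern = join '', @parts;
    return qr/$pattern/i;
}

my $re   = build_munged_regex('russianhacker', \%substitutions);
my @hits = grep { $_ =~ $re } qw(russianhacker russ1@nh@ck3r Russi4nHacker rusty);
print "$_\n" for @hits;   # prints the first three words, not "rusty"
```

Because this only swaps characters one for one, it won't match a string with extra characters, such as the `russ1@anh@ck3r` line from the question, which is one character longer than `russianhacker`. That kind of input is where approximate matching comes in.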

Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.

The String::Approx module seems to fit quite a few of your requirements.

The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
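To make the notion of edit distance concrete, here is a small core-Perl sketch of the textbook dynamic-programming computation. This is only an illustration of the concept; it is not how String::Approx is implemented, and `edit_distance` is a name I made up.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming Levenshtein distance between two strings.
sub edit_distance {
    my ($from, $to) = @_;
    my @s = split //, $from;
    my @t = split //, $to;

    # Distances from the empty prefix of $from to every prefix of $to.
    my @prev = (0 .. scalar(@t));

    for my $i (1 .. scalar(@s)) {
        my @curr = ($i);
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # replacement (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print edit_distance('h@cker', 'hacker'), "\n";                  # 1
print edit_distance('russ1@anh@ck3r', 'russianhacker'), "\n";   # 4
```

The second call needs four edits: three replacements (1, @, 3) and one deletion (the extra @).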

A simple-minded example:

use warnings;
use strict;
use feature 'say';

use String::Approx qw(amatch);

my $target = qq(russianhacker);

my @text = qw(that h@cker was a russ1@anh@ck3r);

my @matches = amatch($target, ["25%"], @text);

say for @matches;     #==>  russ1@anh@ck3r

See the documentation for what the module offers, but at least two comments are in order.

First, note that the second argument in amatch specifies the acceptable percentage deviation from the target string. For this particular example we need to allow every fourth character to be "edited." That much room for tweaking can result in accidental matches, which then need to be filtered out, so there will be some post-processing to do.

Second -- we didn't catch the easier one, h@cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.

Please study the documentation; the module offers a whole lot more than this simple example.

Upvotes: 2

kaylani2

Reputation: 83

I ended up solving the problem by putting the regex directly into the variable that I use to match against the lines of my file. It looks something like this:

sub getOccurrenceOfMungedStringInFile
{
  my $fileName = $_[0];
  my $mungedWordToCount = $_[1];
  my $numberOfOccurrences = 0;

  open (my $inputFile, "<", $fileName) or die "Can't open file: $!";

  $mungedWordToCount =~ s/a/\[a\@4\]/gi;

  while (my $currentLine = <$inputFile>)
  {
    chomp ($currentLine);
    $numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
  }

  close ($inputFile) or die "Can't close file: $!";

  return $numberOfOccurrences;
}

Where the line:

$mungedWordToCount =~ s/a/\[a\@4\]/gi;

is just one of the substitutions that are needed; others can be added similarly. I didn't know that Perl would interpret regex syntax stored in a variable when that variable is interpolated into a match. I had tried that before and could only get the wanted results by defining the variables inside the function using single quotes. I must have done something wrong the first time.
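For anyone hitting the same quoting confusion, a small sketch of the three usual ways to put a pattern in a variable. My guess (an assumption, since the earlier failing code isn't shown) is that the first attempt used double quotes without escaping the `@`. The variable names here are made up for the example.

```perl
use strict;
use warnings;

# Single quotes: nothing is interpolated, so '@' needs no escaping.
my $from_single = 'russi[a@1]n';

# Double quotes: '@' would start an array interpolation, so it is escaped.
my $from_double = "russi[a\@1]n";

# qr// compiles the pattern once, and the /i flag travels with it.
my $compiled = qr/russi[a\@1]n/i;

for my $word (qw(russian russi@n RUSSI1N russion)) {
    my @matched_by;
    push @matched_by, 'single' if $word =~ /$from_single/i;
    push @matched_by, 'double' if $word =~ /$from_double/i;
    push @matched_by, 'qr'     if $word =~ $compiled;
    printf "%-8s %s\n", $word, (@matched_by ? "@matched_by" : 'no match');
}
```

The two strings end up identical, so all three forms match the same words; only `russion` is rejected.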

Thanks for the suggestions, people.

Upvotes: 1

Grinnz

Reputation: 9231

This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.

use strict;
use warnings;
use Data::Munge 'list2re';
# ...
my %norms = (
  '@' => 'a',
  '1' => 'i',
  # ...
);
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for @wordArray;

This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
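As a rough end-to-end sketch of the normalize-then-match idea using only core Perl (no Data::Munge): the mapping entries beyond `@` and `1`, and the helper names `normalize` and `count_normalized`, are assumptions made up for the example. Both the search word and each line are normalized the same way before counting.

```perl
use strict;
use warnings;

# Hypothetical normalization table: munged character => plain letter.
my %norms = ('@' => 'a', '4' => 'a', '1' => 'i', '3' => 'e');

# Character class matching any single key of %norms.
my $class     = '[' . join('', map { quotemeta($_) } sort keys %norms) . ']';
my $munged_re = qr/$class/;

# Replace every munged character and fold case.
sub normalize {
    my ($text) = @_;
    $text =~ s/($munged_re)/$norms{$1}/g;
    return lc $text;
}

# Count occurrences of the normalized target in the normalized lines.
sub count_normalized {
    my ($target, @lines) = @_;
    my $needle = quotemeta(normalize($target));
    my $count  = 0;
    for my $line (@lines) {
        my $normalized = normalize($line);
        $count += () = $normalized =~ /$needle/g;
    }
    return $count;
}

my @lines = ('russ1@nh@ck3r', 'russianhacker', 'Russi4nHacker and a bystander');
print count_normalized('russianhacker', @lines), "\n";   # 3
```

The single-normalized-form caveat shows up right away: here `1` always becomes `i`, so a word where `1` stands for `l` would be normalized wrongly.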

As a side note, your regex m/$stringToCount/gi should be m/\Q$stringToCount/gi, since you don't want any regex metacharacters in $stringToCount to be interpreted as such. See the docs for quotemeta.
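A quick illustration of the difference, with a made-up search string that contains a metacharacter:

```perl
use strict;
use warnings;

my $search = 'a.b';              # the dot is meant literally
my @words  = qw(a.b axb a-b);

my @unquoted = grep { /$search/i }   @words;   # '.' matches any character
my @quoted   = grep { /\Q$search/i } @words;   # '.' matches only a dot

print 'without \Q: ', "@unquoted\n";   # without \Q: a.b axb a-b
print 'with \Q:    ', "@quoted\n";     # with \Q:    a.b
```

Note that \Q only makes sense for a plain search string. If you deliberately inject character classes into the variable, quote the plain word first and add the classes afterwards.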

Upvotes: 2
