Reputation: 11
How do I use keywords from an array in a regex to search a file?
I'm trying to look at a text file and see if and where the keywords appear. There are two files:
keywords.txt
word1
word2
word3
filetosearchon.txt
a lot of words that go on and on and contain line breaks (up to 100000 characters)
I would like to find each keyword and the position of its match. This works for one word, but I can't figure out how to iterate over the keywords in the regex.
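#!/usr/bin/perl
use strict;
use warnings;
# open keyword list and strip the trailing newlines
open(my $kw_fh, '<', 'keywords.txt') or die("Unable to open keywords.txt: $!");
chomp(my @keywords = <$kw_fh>);
close($kw_fh);
# open text file and slurp it into a single string
open(my $txt_fh, '<', 'filetosearchon.txt') or die("Unable to open filetosearchon.txt: $!");
my $txt = do { local $/; <$txt_fh> };
close($txt_fh);
# this works for a single hard-coded word
my $regex = "keyword";
my @section;
# record [offset of match, length of match, matched text]
push @section, [length($`), length($&), $1]
    while ($txt =~ m/($regex)/g);
foreach my $element (@section)
{
    print(join(", ", @$element), "\n");
}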
How can I run this while loop over all the keywords in the array to get each matched keyword and its position?
I'd appreciate any help. Thanks.
Upvotes: 1
Views: 8136
Reputation: 1000
Try grep:
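chomp(@keywords);    # strip the newlines left over from reading the file
@words = split(' ', $txt);    # split on whitespace, ignoring leading whitespace
for ($i = 0; $i < scalar(@words); ++$i) {
    # compare with eq so that regex metacharacters and substrings don't match
    print "word \#$i\n" if grep { $_ eq $words[$i] } @keywords;
}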
This gives you the word position (counting from 0) in your text string where a keyword was found. That may or may not be more helpful than a character-based position.
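If you want both kinds of position at once, here is a minimal sketch. It swaps the grep for a hash lookup and walks the text with a regex so the character offset is available too. The helper name keyword_positions and the sample text are made up for illustration, and it assumes keywords are whole whitespace-separated words.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [word index, character offset, word] for each
# whitespace-separated word of $text that is exactly one of @keywords.
sub keyword_positions {
    my ($text, @keywords) = @_;
    my %is_keyword = map { $_ => 1 } @keywords;

    my @hits;
    my $index = 0;
    while ($text =~ /(\S+)/g) {
        # $-[1] is the offset where capture group 1 starts
        push @hits, [ $index, $-[1], $1 ] if $is_keyword{$1};
        $index++;
    }
    return @hits;
}

# Sample input (an assumption, not the asker's real file)
for my $hit (keyword_positions("one word2 three\nword1 five", qw(word1 word2 word3))) {
    printf "word #%d, offset %d: %s\n", @$hit;
}
```

Note that a word with attached punctuation, such as "word1," with a trailing comma, would not match here because the comparison is exact.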
Upvotes: 2
Reputation: 36282
I am not sure what output you expect, but something like this could be useful. I save the keywords in a hash, read the second file, split each line into words and look each one up in the hash.
Content of script.pl:
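use warnings;
use strict;
die qq[Usage: perl $0 <keyword-file> <search-file>\n] unless @ARGV == 2;
# First argument: keyword file, one keyword per line, stored as hash keys
open my $fh, q[<], shift or die $!;
my %keyword = map { chomp; $_ => 1 } <$fh>;
# Remaining argument: file to search, read line by line
while ( <> ) {
    chomp;
    my @words = split;
    for ( my $i = 0; $i <= $#words; $i++ ) {
        if ( $keyword{ $words[ $i ] } ) {
            # $. is the current line number; $i is the 0-based word index
            printf qq[Line: %4d\tWord position: %4d\tKeyword: %s\n],
                $., $i, $words[ $i ];
        }
    }
}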
Run it like:
perl script.pl keyword.txt filetosearchon.txt
And output should be similar to this:
Line:    7    Word position:    7    Keyword: will
Line:    8    Word position:    8    Keyword: the
Line:    8    Word position:   10    Keyword: will
Line:   10    Word position:    4    Keyword: the
Line:   14    Word position:    1    Keyword: compile
Line:   18    Word position:    9    Keyword: the
Line:   20    Word position:    2    Keyword: the
Line:   20    Word position:    5    Keyword: the
Line:   22    Word position:    1    Keyword: the
Line:   22    Word position:   25    Keyword: the
Upvotes: 2
Reputation: 12486
One way to do this would be to just build a regex containing every word:
(alpha|bravo|charlie|delta|echo|foxtrot|...|zulu)
Since version 5.10, Perl's regex compiler turns an alternation of literal strings into a trie, so the regex will be more efficient than you might think. See this answer by Tom Christiansen. For example, the following regex:
(cat|rat|sat|mat)
is matched roughly as if it had been written:
(c|r|s|m)at
which is efficient to run. This approach probably beats the "search for each keyword in turn" approach because it only needs to make one pass over the input string; the naive approach requires one pass per keyword you want to search for.
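A minimal sketch of the single-regex approach, recording the position of each match. The helper name find_keywords and the sample text are assumptions for illustration. The \b anchors assume the keywords start and end with word characters; drop them if you want substring matches.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [offset, length, keyword] triples,
# one for every keyword occurrence in $text.
sub find_keywords {
    my ($text, @keywords) = @_;
    return () unless @keywords;

    # Longest first, so "foobar" is preferred over "foo" at the same position;
    # quotemeta keeps characters such as "." or "+" literal.
    my $alternation = join '|',
        map { quotemeta($_) } sort { length($b) <=> length($a) } @keywords;
    my $regex = qr/\b($alternation)\b/;

    my @hits;
    while ($text =~ /$regex/g) {
        # $-[1] is the offset where capture group 1 starts
        push @hits, [ $-[1], length($1), $1 ];
    }
    return @hits;
}

# Sample input (an assumption, not the asker's real file)
my $text = "word1 and then\nword3, but not word22; word1 again";
for my $hit (find_keywords($text, qw(word1 word2 word3))) {
    print join(", ", @$hit), "\n";
}
```

Using @- instead of $` and $& also avoids the slowdown that older Perl versions impose on every regex once those variables appear anywhere in the program.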
By the way, if you're building a profanity filter, as your sample code suggests, remember to account for intentional misspellings: 'pron', 'p0rn', etc. Then there's the fun you can have with Unicode!
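One possible way to deal with the simplest of these, sketched under heavy assumptions: normalise the text before matching. The helper name normalize and the digit-to-letter table are purely illustrative, and this does nothing for transposed letters such as 'pron' or for Unicode look-alikes.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-cases $text and maps a few look-alike digits
# back to letters. The substitution table is an illustrative assumption.
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    # 0 -> o, 1 -> i, 3 -> e, 4 -> a, 5 -> s, 7 -> t
    $text =~ tr/013457/oieast/;
    return $text;
}

print normalize("P0rn and l33t"), "\n";
```

Because tr replaces characters one for one, the normalised string has the same length as the original, so match offsets found in it still point at the right place in the original text.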
Upvotes: 3