kleqkleq

Reputation: 11

Perl: Search text file for keywords from array

How do I use keywords from an array in a regex to search a file?

I'm trying to look at a text file and see if and where the keywords appear. There are two files:

keywords.txt
word1
word2
word3

filetosearchon.txt
a lot of words that go on and on and contain line breaks (up to 100000 characters)

I would like to find each keyword and the position of its match. This works for one word, but I am unable to figure out how to iterate over the keywords in the regex.

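#!/usr/bin/perl
use strict;
use warnings;

# open profanity list
open(my $kw_fh, '<', "keywords.txt") or die("Unable to open file: $!");
chomp(my @keywords = <$kw_fh>);   # strip the trailing newline from each keyword
close($kw_fh);

# open text file and slurp it into one string
open(my $txt_fh, '<', "filetosearchon.txt") or die("Unable to open file: $!");
my $txt = do { local $/; <$txt_fh> };
close($txt_fh);

my $regex = "keyword";

# record [offset of match, length of match] for every match
my @section;
push @section, [length($`), length($&)]
    while ($txt =~ m/$regex/g);

foreach my $element (@section)
{
    print join(", ", @$element, $regex), "\n";
}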

How can I iterate over the keywords from the array in this while loop to get the matched keywords and their positions?

I'd appreciate any help. Thanks.

Upvotes: 1

Views: 8136

Answers (3)

mpe

Reputation: 1000

Try grep:

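chomp(@keywords);                    # keywords read from a file still end in "\n"
my @words = split(/\s+/, $txt);

for (my $i = 0; $i < scalar(@words); ++$i) {
    # exact string comparison: is word $i one of the keywords?
    print "word \#$i\n" if grep { $_ eq $words[$i] } @keywords;
}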

This gives you the word position (the index into @words) at which a keyword was found in your text string. This may or may not be more helpful than a character-based position.

Upvotes: 2

Birei

Reputation: 36282

I am not sure what output you expect, but something like this could be useful. I save the keywords in a hash, read the next file, split each line into words, and look each word up in the hash.

Content of script.pl:

use warnings;
use strict;

die qq[Usage: perl $0 <keyword-file> <search-file>\n] unless @ARGV == 2;

open my $fh, q[<], shift or die $!;

my %keyword = map { chomp; $_ => 1 } <$fh>;

while ( <> ) {
        chomp;
        my @words = split;
        for ( my $i = 0; $i <= $#words; $i++ ) {
                if ( $keyword{ $words[ $i ] } ) {
                        printf qq[Line: %4d\tWord position: %4d\tKeyword: %s\n], 
                                $., $i, $words[ $i ];
                }
        }
}

Run it like:

perl script.pl keyword.txt filetosearchon.txt

And output should be similar to this:

Line:    7      Word position:    7     Keyword: will
Line:    8      Word position:    8     Keyword: the
Line:    8      Word position:   10     Keyword: will
Line:   10      Word position:    4     Keyword: the
Line:   14      Word position:    1     Keyword: compile
Line:   18      Word position:    9     Keyword: the
Line:   20      Word position:    2     Keyword: the
Line:   20      Word position:    5     Keyword: the
Line:   22      Word position:    1     Keyword: the
Line:   22      Word position:   25     Keyword: the

Upvotes: 2

Li-aung Yip

Reputation: 12486

One way to do this would be to just build a regex containing every word:

(alpha|bravo|charlie|delta|echo|foxtrot|...|zulu)

Perl's regex compiler is pretty smart and will smoosh this down as much as it can (since Perl 5.10, an alternation of literal strings is compiled into a trie), so the regex will be more efficient than you think. See this answer by Tom Christiansen. For example, the following regex:

(cat|rat|sat|mat)

will be handled roughly as if it had been written:

(c|r|s|m)at

This is efficient to run. The approach probably beats the "search for each keyword in turn" approach because it only needs to make one pass over the input string, whereas the naive approach requires one pass per keyword.
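As a rough sketch of the single-regex idea, the keywords can be escaped with quotemeta, joined with |, and matched in one pass. The helper name find_keywords and the inline sample data are made up for illustration; in the original script @keywords and $txt would come from keywords.txt and filetosearchon.txt.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns a list of
# [offset, length, keyword] for every keyword match in $text.
sub find_keywords {
    my ($text, @keywords) = @_;

    # Escape regex metacharacters, and put longer keywords first so that
    # "word10" is preferred over "word1" at the same position.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @keywords;
    my $regex = qr/($alternation)/;

    my @section;
    while ($text =~ /$regex/g) {
        # $-[1] is the start offset of capture group 1
        push @section, [ $-[1], length($1), $1 ];
    }
    return @section;
}

# Inline sample data standing in for the two files
my @keywords = ('word1', 'word2', 'word3');
my $txt      = "some word2 text\nand word1 again";

print join(', ', @$_), "\n" for find_keywords($txt, @keywords);
# prints:
# 5, 5, word2
# 20, 5, word1
```

Using @- instead of $` and $& avoids the match variables that can slow down every regex in older Perls. If keywords should only match whole words, wrapping the group as \b(...)\b is one option.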

By the way, if you're building a profanity filter, as your sample code suggests, remember to account for intentional misspellings: 'pron', 'p0rn', etc. Then there's the fun you can have with Unicode!
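One partial way to catch digit-for-letter spellings such as 'p0rn' is to normalize a copy of the text before matching. The sketch below is only an illustration: normalize_lookalikes is a hypothetical helper, the character mapping is an assumed, incomplete list, and transpositions such as 'pron' are not handled at all.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-cases the text and maps a few look-alike
# characters back to letters (0->o, 1->i, 3->e, 4->a, 5->s, @->a, $->s).
# The mapping is an assumption for illustration, not a complete list.
sub normalize_lookalikes {
    my ($text) = @_;
    $text = lc $text;
    $text =~ tr/01345@$/oieasas/;   # tr does not interpolate, so @ and $ are literal
    return $text;
}

my $txt   = 'Some P0rn here';
my $clean = normalize_lookalikes($txt);

# tr replaces characters one for one, so offsets found in $clean
# are also valid offsets into the original $txt.
while ($clean =~ /(porn)/g) {
    print join(', ', $-[1], substr($txt, $-[1], length $1)), "\n";
}
# prints: 5, P0rn
```

Because the mapping is one character to one character, the reported position still points at the original spelling in the unmodified text.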

Upvotes: 3
