Speed in replacing a set of words in text in Perl

Question

Context first: I'm trying to highlight song titles in a Wikipedia page. First I get the quoted portions from the page, then I check whether they exist in a database of song titles, and finally I highlight those that I find. The database part is surprisingly fast, and so is extracting the song titles (they are quoted).

Therefore (I think) I need to replace a set of words (the titles) in the HTML and wrap them in a span like this (for every word):

s/word/<span>word<\/span>/gi

The text is about 100k long, and the list is about 300 words (neither is known in advance), so an iterative process replacing one word at a time is too slow (I need to keep this < 1 sec if possible).

So I've done

my $re = join '|', map { quotemeta($_) } @words;
$dom =~ s/($re)/<span>$1<\/span>/gi;

which seems to work and is fast (0.64 on my benchmark case).

Now I want to replace "$word" instead of just $word, so I tried this:

my $re = join '|', map { quotemeta(join '', '"', $_, '"') } @words;

and speed dropped by a factor of 10. Comparing the two runs with NYTProf, all of the difference seems to be inside CORE:substcont.

Why is that?

(Extra thanks for suggestions on how to avoid replacing text inside tags such as id="word_to_be_replaced")

brian d foy · Accepted Answer

I don't know what you are actually doing because you only pull out what you think the problem is (and we don't get to see the rest).

First, you have join '', '"', $_, '"', but that's just qq("$_").

Next, if you have an alternation of words surrounded by quotes, you don't need quotes around each word. Group the alternation of words and put the quotes around that:

s/ " (?: word1 | word2 | ... ) " /.../x;
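Applied to the code in the question, that could look roughly like the sketch below. The sample titles, the sample text, and the bare <span> wrapper are placeholders for illustration, not the asker's real data or markup.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the question these come from the database lookup.
my @words = ('Yesterday', 'Hey Jude', 'Hey');
my $dom   = 'He wrote "Hey Jude" and "yesterday", but not "Hey You".';

# Escape each title, then join them into a single alternation.
my $alternation = join '|', map { quotemeta } @words;

# One pair of quotes around the whole group instead of around every title.
my $re = qr/"($alternation)"/i;

# The <span> wrapper is a placeholder; use whatever markup the page needs.
(my $highlighted = $dom) =~ s/$re/"<span>$1<\/span>"/g;

print "$highlighted\n";
```

Because the quotes sit outside the capture group, $1 holds only the title, so the replacement has to put the quotes back if they should stay in the output.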

My first suspicion is that whatever your pattern does now involves much more backtracking.
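One way to check a suspicion like this is to time both forms of the pattern on the same input with the core Benchmark module. The sketch below uses synthetic titles and text (assumptions standing in for the real page), so the numbers it prints say nothing about the asker's case; it only shows how the comparison could be set up.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Synthetic stand-ins for the real data (assumptions, not the asker's input).
my @words = map { "title$_" } 1 .. 300;
my $text  = join ' ', map { qq(some text "title$_" more text) } 1 .. 2000;

# Variant A: quotes repeated inside every alternative, as in the question.
my $per_word = join '|', map { quotemeta(qq("$_")) } @words;
my $re_a     = qr/($per_word)/i;

# Variant B: quotes factored out of the alternation.
my $grouped = join '|', map { quotemeta } @words;
my $re_b    = qr/"($grouped)"/i;

my ($out_a, $out_b);

# Run each substitution 10 times on a fresh copy of the text and print timings.
timethese(10, {
    per_word_quotes => sub { ($out_a = $text) =~ s/$re_a/<span>$1<\/span>/g },
    grouped_quotes  => sub { ($out_b = $text) =~ s/$re_b/<span>"$1"<\/span>/g },
});
```

Both variants are written to produce the same output, so any timing difference comes from how the pattern is matched rather than from the replacement.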

To avoid replacing the same text where it appears inside the HTML markup itself (tag names, attributes such as id), I'd use an HTML parser and only look at the text. But that's going to take a lot longer than what's already happening.
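A real parser (for example the CPAN module HTML::Parser) is the robust route. If that cost is too high, a cheaper approximation is to split the page on anything that looks like a tag and substitute only in the pieces between tags. The sketch below does that with a plain regex split; it is not an HTML parser, and the sample input and <span> wrapper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical inputs for illustration.
my @words = ('Hey Jude');
my $html  = '<p id="Hey Jude">He wrote "Hey Jude" in 1968.</p>';

my $alternation = join '|', map { quotemeta } @words;
my $re          = qr/"($alternation)"/i;

# Split into alternating text and tag chunks; the capture group keeps the tags.
# Caveat: this is a crude regex split, NOT an HTML parser. It will mishandle
# comments, <script>/<style> content, and attribute values containing ">".
my @chunks = split /(<[^>]*>)/, $html;

for my $chunk (@chunks) {
    next if $chunk =~ /\A</;    # leave tags (and their attributes) untouched
    $chunk =~ s/$re/"<span>$1<\/span>"/g;
}

my $result = join '', @chunks;
print "$result\n";
```

Here the id attribute is left alone because it lives inside a tag chunk, while the quoted title in the paragraph text is wrapped.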
