Reputation: 2550

Fast Text Preprocessing

In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:

get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing

In brackets there are preprocessing steps. The application runs in about 10.265s, but preprocessing takes 9.18s! This is time for preprocessing 50 HTML pages (excluding downloading).

I use HtmlAgilityPack library to convert HTML into plain text. This is quite fast. It takes 2.5ms to convert 1 document, so it's relatively OK.

Problem comes later. Stemming one document takes up to 120ms. Unfortunately those HTML pages are in Polish. There no exists stemmer for Polish language written in C#. I know only 2 free to use written in Java: stempel and morfologic. I precompiled stempel.jar to stempel.dll with help of IKVM software. So there is nothing more to do.

Eliminating stop words takes also a lot of time (~70ms for 1 doc). It is done like this:


result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
   string stopword = stopwords.Current.ToString();                
   result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");                               
}
return result;

First i remove all numbers, special characters, 1- and 2-letter words. Then in loop I remove stop words. There are about 270 stop words.

Is it possible to make it faster?

Edit:

What I want to do is remove everything which is not a word longer than 2 letters. So I want to get out all special chars (including '.', ',', '?', '!', etc.) numbers, stop words. I need only pure words that I can use for Data Mining.

Upvotes: 6

Answers (6)

Ventus

Reputation: 2550

OK, I know that SO is not a pure forum and maybe I shouldn't answer my own question but I'd like to share with my results.

Finally, thanks to you guys, I managed to get better optimization of my text preprocessing. First of all I made simpler that long expression from my question (following Josh Kelley's answer):

[0-9]|[^\w]|(\b\w{1,2}\b)

It does the same as first one but is very simple. Then following Josh Kelley's suggestion again I put this regex into assembly. Great example of compiling expressions into assembly I found here. I did that, because this regex is used many, many times. After lecture of few articles about compiled regex, that was my decision. I removed the last expression after eliminating stop words (no real sense with that).

So the execution time on 12KiB text file was ~15ms. This is only for expression mentioned above.

Last step were stop words. I decided to make a test for 3 different options (Execution times are for the same 12KiB text file).

One big Regular Expression

with all stop words and compiled into assembly (mquander's suggestion). Nothing to clear here.

Execution time: ~215ms

String.Replace()

People say that this can be faster than Regex. So for each stop word I used string.Replace() method. Many loops to take with result:

Execution time: ~65ms

LINQ

method presented by LBushkin. Nothing to say more.

Execution time: ~2.5ms

I can only say wow. Just compare execution times of first one with the last one! Big thanks LBushkin!

Upvotes: 2

Josh Kelley

Reputation: 58412

Speed up your regexes

Your regexes could use some work.

For example, this line:

result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");

uses parentheses to capture the stopword for later use, then it never uses it. Maybe the .NET regex engine is smart enough to skip capturing in this case, maybe not.

This regex is way too complicated:

"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])"

"([-]|[.]|[-.]|[0-9])?" is identical to "([-.0-9])?". (Unless you're trying to match "-." as one of your possibilities? I'll assume not for now.) If you don't need capturing (and you don't in your example), then it's identical to "[-.0-9]?".
"[-.0-9]?" is a bit redundant coming before "[0-9]*". You can further simplify it to "[-.]?[0-9]*".
Similarly, if you don't need capturing, then "([.]|[,])*" is identical to "[,.]*".

Finally, test if compiling your regexes can yield better performance.

Cut down on Regex and string manipulation

Constructing a bunch of strings, making up a bunch of Regex objects, and then discarding them, as you're doing in this loop, is probably not very fast:

result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");

Try pre-processing stopwords into an array of Regex objects or creating one monster pre-compiled Regex (as others have suggested).

Restructure your algorithm

It looks like you're only interested in processing stemmed, non-stopword, text, and not punctuation, numbers, etc.

To do that, your algorithm takes the following approach:

Stem all text (including stopwords?).
Use regexes (not necessarily the fastest approach) to replace (which requires continually rearranging the string's body) non-words with whitespace.
Use regexes (again, not necessarily the fastest approach) to replace (again) stopwords with whitespace, one stopword at a time.

I started to write another approach here, but LBushkin beat me to it. Do what he says. Keep in mind, as a general rule, that changing your algorithm usually gives bigger improvements than micro-optimizations like improving your regex usage.

Upvotes: 1

LBushkin

Reputation: 131806

Iteratively replacing words is going to be the biggest bottleneck in your implementation. On each iteration, you have to scan the entire string for the stopword, then the replace operation has to allocate a new string and populate it with the text post-replacement. That's not going to be fast.

A much more efficient approach is to tokenize the string and perform replacement in a streaming fashion. Divide the input into individual words separated by whatever whitespace or separator characters are appropriate. You can do this incrementally so you don't need to allocate any additional memory to do so. For each word (token), you can now perform a lookup in a hashset of stopwords - if you find match, you will replace it as you stream out the final text to a separate StringBuilder. If the token is not a stopword, just stream it out the StringBuilder unmodified. This approach should have O(n) performance, as it only scans the string once and uses a HashSet to perform stopword lookup.

Below is one approach that I would expect to perform better. While it isn't fully streaming (it uses String.Split() which allocated an array of additional strings), it does all of the processing in a single pass. Refining the code to avoid allocating additional string is probably not going to provide much of an improvement, since you still need to extract out substrings to perform comparisons to your stopwords.

Code below returns a list of words that excludes all stopwords and words two letters or shorter form the result. It uses case-insensitive comparison on the stopwords as well.

public IEnumerable<string> SplitIntoWords( string input,
                                           IEnumerable<string> stopwords )
{
    // use case-insensitive comparison when matching stopwords
    var comparer = StringComparer.InvariantCultureIgnoreCase;
    var stopwordsSet = new HashSet<string>( stopwords, comparer );
    var splitOn = new char[] { ' ', '\t', '\r' ,'\n' };

    // if your splitting is more complicated, you could use RegEx instead...
    // if this becomes a bottleneck, you could use loop over the string using
    // string.IndexOf() - but you would still need to allocate an extra string
    // to perform comparison, so it's unclear if that would be better or not
    var words = input.Split( splitOn, StringSplitOptions.RemoveEmptyEntries );

    // return all words longer than 2 letters that are not stopwords...
    return words.Where( w => !stopwordsSet.Contains( w ) && w.Length > 2 );
}

Upvotes: 14

Xzhsh

Reputation: 2239

I agree with mquander, and here's a bit more info. Every time you use an regex, C# will create a table to match the text. This is fine and dandy if you only call the regex function a couple of times, but what you are doing here is creating around 270 new tables and trashing them for each html document.

What I would try is just creating one Regex object, and use the | operator to match all the different stop words and the first filter. After that, you should use regex compilation to assembly in order to let the JIT generate machine code.

http://en.csharp-online.net/CSharp_Regular_Expression_Recipes%E2%80%94Compiling_Regular_Expressions

You should see a dramatic speedup with just this

Upvotes: 0

mqp

Reputation: 72015

Instead of having the regex replace in the loop, why not dynamically construct a monster matching regex that matches any one of your stopwords, and then run one replace, replacing it with nothing? Something like "\b(what|ok|yeah)\b" if your stopwords are "what", "ok", and "yeah". That seems like it would probably be more efficient.

Upvotes: 2

Charlie Salts

Reputation: 13508

You might be running into the Schlemiel the Painter problem. In C# (and other languages) when you append or concatenate strings, you are actually creating an entirely new string. Doing this in a loop often causes a LOT of memory allocation which can otherwise be avoided.