Amit
Amit

Reputation: 34735

How to set a time limit on a java function running a regex

I am running a regex in a java function to parse a document and return true if it has found the string specified by the regex and return false if it hasn't. But the problem is that when the document doesn't contain the string specified by the regex It takes a very long time to return false and I want to terminate that function if it takes more than 6 seconds to execute.

How can I set a time limit of 6 seconds on that function so as to forcibly terminate that if it takes more than 6 seconds.

I am calling a method "method 1" of class 2 from class 1. The "method 1" calls "method 2" of the same class i.e. "class 2". Method 2 runs the regex code over a document. If it finds the string specified by the regex, then it returns the result to the method 1 which in turn return the result to the method in "class 1" which called the "method 1" of class 2. Now the problem is that the execution time of both the method1 and method2 of class 2 should be not more than 6 seconds.

So, I made a new RegexpThread class in the same file, in which my class2 was. Then I move the method2 of the class2 into class RegexpThread. Then whenever the method 1 is called, it instantiates the RegexpThread class as follows:

RegexpThread rt = new RegexpThread() {
    public void run() {
        method 2(m, urlCopy, document);
    }    
};

rt.start();

try {
    rt.join(6 * 1000);
} catch (InterruptedException e) {
    return "y";
}

if(rt.getResultXml().equals("")) {
    return "g";
}

resultXml.append(rt.getResultXml());

return resultXml.toString();

The code shown is in the method 1 of class2. The method 2 in the RegexpThread class perform some regex search over a document. There is a private field named "resultXml" in the RegexpThread class. If the method 2 has found the string specified by the regex then it assigns the result to the private field "resultXml". If not then the "resultXml" contains its default value i.e empty string.

So, in the above "if block", it is checking the "resultXml" field against empty string. If it is an empty string then that means the regex has not found its string in the document. But if it is not an empty string then that means the regex has found the string in the document and has assigned the result to the "resultXml" field.

so, look at this and tell me what to do...

Upvotes: 5

Views: 6198

Answers (10)

Ironluca
Ironluca

Reputation: 3762

The below answer is perhaps late for the post and Java version has also changed. However, the mechanism mentioned below works for me.

The central idea is to change the input text which is being evaluated to an empty string while the matching is in progress. The input for the below test has been taken from OWASP ReDoS example. The input text has been changed as the one provided was not of adequate length for the complexity.

package org.test.xpath;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InterruptableMatcherTest {

    public static void main(String[] args) throws Exception{

        Pattern pattern=Pattern.compile("^(([a-z])+.)+[A-Z]([a-z])+$");
        String input="aaaaaaaaaaaaaaaaaaaaadddddddddddddddddddddddddddddddddddddddaaaaaaaaaaaa!";

        PatternMatcher patternMatcher=new PatternMatcher(pattern, input);
        Thread thread=new Thread(patternMatcher);

        thread.start();

        Thread.sleep(1*1000);
        System.out.println("Done sleeping ...");
        if(patternMatcher.running)patternMatcher.reset();//Without this call the program will hang
        thread.join();

    }//main closing

}//class closing

class PatternMatcher implements Runnable{

    Pattern pattern;
    Matcher matcher;

    boolean running=false;

    PatternMatcher(Pattern pattern, String input) {

        this.pattern=pattern;
        matcher=this.pattern.matcher(input);

    }//constructor closing

    @Override
    public void run() {

        running=true;
        matcher.matches();
        running=false;

    }//run closing

    void reset(){

        System.out.println("Reset called ...");
        matcher.reset("");

    }//reset closing

}//class closing

The reset() method, resets the input of the matcher to an empty String. refer code for Matcher class, Matcher reset(CharSequence input) method, which calls the Matcher reset(), which in turn sets the start and end of the text region to be matched to 0, effectively stopping the matching process in the next stage match. The mechanism works for me by terminating the matching process after a set timeout.

Upvotes: 0

yegor256
yegor256

Reputation: 105053

You can use AOP and a @Timeable annotation from jcabi-aspects (I'm a developer):

@Timeable(limit = 1, unit = TimeUnit.SECONDS)
String yourMethod() {
  // execution as usual
}

Make sure that you somewhere in your method you check for Thread#isInterrupted():

if (Thread.currentThread.isInterrupted()) {
  throw new IllegalStateException("time out");
}

When time limit is reached your thread will get isInterrupted() flag set to true and it's your job to handle this situation correctly and to stop execution.

Upvotes: -1

Sebastian Kübeck
Sebastian Kübeck

Reputation: 196

I agree to check the regular expressions before they are used. If you need a safety net, you might use something like this...

http://gist.github.com/630969

Upvotes: 0

Ben S
Ben S

Reputation: 69342

The Java Thread class is not equipped to deal with this sort of interruption, and as such is not suitable for your requirements.

I would implement the functionality in a separate Process using the ProcessBuilder and use the Input and Output streams provided by the Process class for communication. Forceful interruption is provided by the destroy method of of the Process class.

I believe this is the correct, safest, implementation for the requirements you have. Unfortunately, Java doesn't make it easy to launch another Java process in a platform independent way so you'll have to have the java executable to your path and create a separate main method to do this. This is harder than it should be.

Upvotes: 0

Stephen C
Stephen C

Reputation: 718788

There are two ways to answer this question.

On the one hand, there is no practical/effective way that is known to be safe of killing a thread that is executing Matcher.find(...) or Matcher.match(...). Calling Thread.stop() would work, but there are significant safety issues. The only way to address this would be to develop your own regex engine that regularly checked the interrupted flag. (This is not totally impractical. For example, if GPL wasn't an issue for you, you could start with the existing regex engine in OpenJDK.)

On the other hand, the real root of your problem is (most likely) that you are using regexes the wrong way. Either you are trying to do something that is too complicated for a single regex, or your regex is suboptimal.

EDIT: The typical cause of regexes taking too long is multiple quantifiers (?, , +) causing pathological backtracking. For example, if you try to match a string of N "A" characters followed by a "B" with the regex "^AAAAAA$", the complexity of the computation is (at least) O(N**5). Here's a more "real world" example:

"(.*)<html>(.*)<head>(.*)</head>(.*)<body>(.*)</body>(.*)</html>(.*)"

Now imagine what happens if you encounter a "web page" like this:

<html><html><html><html><html><html><html><html><html><html>
<head><head><head><head><head><head><head><head><head><head>
</head></head></head></head></head></head></head></head></head></head>
<body><body><body><body><body><body><body><body><body><body><body>
</body></body></body></body></body></body></body></body></body></body>

Notice that there is no closing </html> tag. This will run for a long time before failing. (I'm not exactly sure what the complexity is ... but you can estimate it experimentally it you feel like it.)

In this case, a simple answer is to use simpler regexes to locate the 6 marker elements and then extract the stuff between then using substring().

Upvotes: 3

Brian Agnew
Brian Agnew

Reputation: 272267

I will assume for the moment that your regexp code is correct, and it really is some computational code that is CPU-bound for 6s.

Given the above, I think you have only one option. To execute your code in a number of stages/iterations and check a variable for a request to halt. You can't do this using the normal Pattern/Matcher code.

You could do this by splitting your input string beforehand in some fashion, and then feeding to your regexp bit by bit (your initial split would have to be independent of your regexp).

You can't do this by:

  1. using Thread.stop() etc. This has been deprecated and doesn't work properly.
  2. Using Thread.interrupt(). This sets an interrupted flag on the thread which is only checked when the thread performs IO. If the thread is CPU-bound, then that flag will never be checked.

Given the above, I would look again at why the regexp is taking 6s to match. Is the regexp correct ? Can you perform a regexp on smaller text segments ?

Upvotes: 2

waxwing
waxwing

Reputation: 18743

I might be mistaken here, but I think all ways to terminate a thread have been deprecated for some time. The recommended way is to use a shared isRunning variable that your worker thread periodically checks and gracefully exits when it's been set.

This will not work in your case, but it looks to me like you're treating the symptom - not the real problem. You should post the code of your regexp function, that takes 6 seconds to execute. If it's the regexp itself, the execution time may be a case of catastrophic backtracking.

Upvotes: 9

Tom
Tom

Reputation: 44821

What you've done kind of looks fine to me here's how I'd modify it:

final AtomicReference<String> resultXml = new AtomicReference<String>();

RegexpThread rt = new RegexpThread() {
  public void run() {
    method2(m, urlCopy, document, resultXml);
  }

};

rt.start();

try {
    rt.join(6 * 1000);
} catch (InterruptedException e) {
    return "y";
}

if(resultXml.get() == null) {
    rt.interupt();
    return "g";
}

resultXml.append(resultXml.get());

return resultXml.toString();

Upvotes: -2

kdgregory
kdgregory

Reputation: 39606

You don't show the function that actually performs the regex, so I'll assume that it reads lines from the file and executes the regex over each line.

If that is the case, then a better solution is to pass a timeout value to that function. After every N lines (whatever N might be), it checks the timeout value.

The real problem that you'll have is with blocking IO -- for example, reading from a network. In that case, there's nothing you can do from Java, as the block is actually happening in the OS kernel.

Upvotes: 0

Yuval Adam
Yuval Adam

Reputation: 165242

Start your thread via an ExecutorService and give it a timeout, like so:

ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
pool.execute(rt);
pool.awaitTermination(timeout, timeUnit);

awaitTermination() will wait until the task is finished (as well as all other tasks under this ExecutorService), the thread is interrupted, or timeout occurs - which ever comes first.

Sounds like this fits your needs.

Upvotes: 0

Related Questions