sdabet
sdabet

Reputation: 18670

Object.wait() exceeds timeout

What could explain that the duration of Object.wait(timeout) exceeds the provided timeout value?

long start = System.currentTimeMillis();
obj.wait(1000);
long duration = System.currentTimeMillis() - start;
// sometimes (very rarely) duration may exceed 1500

Context: Somewhere in the depth of a very complex software there is a piece of code that makes such a wait and generates warning logs in case of excessive durations. In a production environment, with high traffic, some logs report huge overwaits (like 30 seconds). So I'm trying to reproduce it, understand what may happen and how to fix/improve it.

Upvotes: 7

Views: 3012

Answers (3)

t0r0X
t0r0X

Reputation: 4802

The "user-time" or "wall-clock-time" spent with the "wait(timeout)" call is, ususally, the timeout value plus the time until the thread was re-scheduled for execution and executed.

See the Javadoc for the Object.wait(long timeout) method:

The thread T is then [...] re-enabled for thread scheduling. It then competes in the usual manner with other threads for the right to synchronize on the object;

So there is no guarantee for "real-time" operation, it's more a kind of "best try", depending on current system load and maybe also on other locking dependencies in your application. Therefore, if the system is under heavy load, or your application handles many threads, the wait might take considerably longer than the timeout.

PS
The quote @nathan-hughes mentioned in his comment to your question is probably the key sentence in the Javadoc of the "wait" method: The specified amount of real time has elapsed, more or less.

PPS
Based on your question edit with additional context information ('very complex software', 'high traffic', 'huge overwaits'): you have to find all usages of your obj object as a lock, and determine how those usages interact together.

This can get really complex. Here an attempt to sketch a "simple" scenario of what might go wrong, whith only two plain threads, like e.g. this:

// thread 1
synchronized (obj) {
    // wait 1000ms
    obj.wait(1000);
}
// check for overwait

// thread 2, after, let's say 500 ms
synchronized (obj) {
    obj.notify();
}

Easy scenario, everything is fine, the execution order is roughly:

  1. 0ms: T1 aquires the lock on 'obj'
  2. 0ms: T1 registers itself as waiting for 'obj', and gets excluded from thread scheduling. While excluded from thread scheduling, the lock on 'obj' is again released (!)
  3. 500ms: T2 aquires the lock on 'obj', notifies one thread waiting for notification (thread is chosen based on thread scheduling settings), and releases the lock on 'obj'
  4. 500ms + X: T1 is re-enabled for thread scheduling, it waits until it re-aquires the lock on 'obj' (!), then it finished it's block and releases the lock on 'obj'.

These are only 2 simple threads and synchronized blocks. Let's make this more complex, with poorly written code. What if 2nd thread would be something like that:

// bad variant of thread 2, after, let's say 500 ms
synchronized (obj) {
    obj.notify();

    // do complex operation, taking more than few ms,
    // maybe a heavy SQL query/update...
}

In this case, even though T1 has got notified (or maybe timed out), it has to wait until it gains again the lock on 'obj', which is still held by T2 as long as the complex operation runs (step 3 in the previous list)! This might indeed take up to ... seconds or more.

Even more complexity: we return to our initial simple threads T1 and T2, but add a 3rd thread:

// thread 3, after, let's say also 500 ms
synchronized (obj) {
    // do complex operation, taking more than few ms,
    // maybe a heavy SQL query/update...
}

The execution order could become, roughly:

  1. 0ms: T1 aquires the lock on 'obj'
  2. 0ms: T1 registers itself as waiting for 'obj', and gets excluded from thread scheduling. While excluded from thread scheduling, the lock on 'obj' is again released (!)
  3. 500ms: T2 aquires the lock on 'obj', notifies one thread waiting for notification (thread is chosen based on thread scheduling settings), and releases the lock on 'obj'
  4. 500ms + X: T2 is re-enabled for thread scheduling, but does not get the lock on 'obj', because
  5. 500ms + X: T3 is scheduled by thread scheduler before T1, and it aquires the lock on 'obj' (!), and starts doing it's complex operation. T1 can't do anything but wait!
  6. 500ms + MANY: T3 *releases the lock on 'obj'.
  7. 500ms + MANY: T1 re-aquires the lock on 'obj' (!), then exits its synchonized block and releases itself the lock on 'obj'.

This is only scratching the surface of what might happen in your 'very complex software', with 'high traffic'. Add more threads, maybe poorly coded (e.g. doing too much in the 'synchronized' blocks), high traffic, and you might easily get the overwaits you mentioned.

OPTIONS
How to solve this... depends on the purpose and complexity of your software, there is no simple plan. More can't be said based on the available information.

Maybe reanalysing the code with pen and paper is enough, maybe profiling it could help you find the locks, maybe you can get the needed information about the current locks via JMX or a thread dump (via signal, jconsole, jcmd, jvisualvm), or by monitoring with the Java Mission Control and Java Flight Recording (features available since ... JDK 7u40 I think).

You've asked in a comment if Thread.sleep(timeout) would help: can't be said without more info. Maybe it would help. Or maybe reentrant locks, or other locking options (see packages java.util.concurrent, java.util.concurrent.atomic, java.util.concurrent.locks) would be more appropriate. It depends on your code, your use case and on the Java version you're using.

If GC is not an issue (see below), and you have analyzed the code, it "looks fine", and you think the high traffic is the cause, you might also consider enabling biased locking or/and spin locking. See the Java 7 JVM options for more details (article contains links to Java 8 JVM options too).

GARBAGE COLLECTION
By the way, 'high traffic' should have made me ask this earlier: the garbage collection, have you monitored it? If not properly configured/tuned, GC might also often cause very significant pauses! (I had this week such a case, 15-30 seconds for full GC...)

Upvotes: 3

11thdimension
11thdimension

Reputation: 10653

Wait/Sleep should be used to ensure the logical ordering of program execution. There is no guarantee when a thread will be scheduled next, that's why wait is often kept in a loop.

That said, if you just want to check the cause of timeout then try to find the thread which owns the lock and analyze that thread.

ThreadMXBean bean = ManagementFactory.getThreadMXBean();
ThreadInfo[] ti = bean.getThreadInfo(bean.getAllThreadIds(), true, true);

ThreadInfo object contains LockInfo, you can get the hashCode of the lock and check the thread which has matching hashcode for a lock it holds.

Apart from that, if you can't really modify the code then try enabling JMX on the production (A restart might be required).

Following arguments are to be added to the Java process

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=8989

Port is the port for JMX. You can enable authentication if you need, then you'll have to provide the user/pass too. Check here.

After enabling JMX you'll be able to see the live state of threads using Jvisualvm or jconsole. These tools also allow you take take thread dump on click of a button. Analyzing thread dump also may give you a clue.

If you can enable remote debugging on the server with JVM then you'll be able to debug the thread which holds the lock from your IDE. Following are the arguments for the Java process to enable remote debugging

-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8888

This will enable remote debugging on port 8888. If you're using eclipse you can suspend (right click suspend) any intended thread in the debug perspective to see what it's currently doing and debug it.

Good luck!

Upvotes: 2

JJF
JJF

Reputation: 2777

When a thread actually awakens after a time out expires or sleep wakes up is not exact. The doc for sleep has this note

Causes the currently executing thread to sleep (temporarily cease execution) for the specified number of milliseconds, subject to the precision and accuracy of system timers and schedulers. The thread does not lose ownership of any monitors.

notify is also subject to these variances.

So it has to do with the precision of the timers in the system and what other threads or processes are doing at the exact time the thread is eligible to run again. The general rule is a timeout like this is a minimum amount of time that will elapse. Object.notify has a variant that takes nanoseconds as well that gives you finer grain control over the amount of time that elapses.

see the Javadoc description about public final void wait(long timeout, int nanos)

Upvotes: 5

Related Questions