CPU load-based timeouts in Selenium tests

Question

Background

We run integration tests in multiple browsers (Chrome and Firefox) using Selenium Grid, but the tests become flaky as the CPU load on the host machines increases. Timeouts for wait statements like:

    WebDriverWait(browser, WAIT_TIME).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".drawing-surface"))
    )

are more than sufficient for machines under minimal load, but become woefully insufficient as we run large test suites. The only remedies seem to be scaling horizontally (adding machines to the test infrastructure and registering them with Selenium Grid to distribute the load), scaling vertically (replacing our test machines with more powerful ones), or increasing the timeouts for these tests (which would dramatically increase the time it takes for devs to see that a test is failing). None of these are particularly good options.

I wonder whether this clock-time approach to waits is the right idea in the first place. Why use monotonic clock time when the OS already reports how much CPU time has been granted to a particular process running on the system?

Idea

Could it be possible to scale timeouts based on CPU load? Instead of waiting 20 seconds on a monotonic clock, could Selenium be configured or manipulated to wait based on 20 seconds of CPU time in the browser's OS process?

I'm imagining some service that could be called (subprocess? dbus?) from within a Selenium test. In very broad terms:

- While Selenium waits for the presence of a particular element, check in another thread how much CPU time has been consumed by the browser process running this test so far.
  - How could the test figure out the PID of the browser process running the current test?
  - Once the PID or process name is known, ps could be used to get the CPU time.
- The other thread continues running until it sees that the browser process has gained N seconds of CPU time.
- The test waits on both the Selenium call looking for the element and the timeout thread.
  - If the element is found first, kill the other thread and continue the rest of the test.
  - If the timeout hits first, fail the test immediately.
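To make the steps above concrete, here is a rough sketch of what I have in mind. It is not existing Selenium functionality. For simplicity it replaces the second thread with a single polling loop, and it assumes a Linux host where /proc/<pid>/stat can be read instead of calling ps. The names process_cpu_seconds, wait_for_cpu_budget and wall_clock_cap are made up for illustration. Note that modern browsers spread work over several processes, so reading a single PID probably undercounts, and that an idle browser gains no CPU time, which is why the sketch still keeps a wall-clock cap.

```python
import os
import time


def process_cpu_seconds(pid):
    """Return user + system CPU time consumed by `pid`, in seconds (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so only split the part after the last ')'.
    fields = stat.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime_ticks, stime_ticks = int(fields[11]), int(fields[12])
    return (utime_ticks + stime_ticks) / os.sysconf("SC_CLK_TCK")


def wait_for_cpu_budget(condition, cpu_seconds, budget,
                        poll_interval=0.1, wall_clock_cap=300.0):
    """Poll `condition` until it returns something truthy.

    Gives up once `cpu_seconds()` has advanced by `budget` seconds, or once
    `wall_clock_cap` seconds of monotonic time have passed (safety net for a
    browser that is idle and therefore never uses up its CPU budget).
    """
    start_cpu = cpu_seconds()
    deadline = time.monotonic() + wall_clock_cap
    while True:
        result = condition()
        if result:
            return result
        if cpu_seconds() - start_cpu >= budget:
            raise TimeoutError(f"condition not met within {budget}s of CPU time")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {wall_clock_cap}s of wall time")
        time.sleep(poll_interval)


# Hypothetical use in a test (browser_pid is assumed to be known somehow):
#   elements = wait_for_cpu_budget(
#       lambda: browser.find_elements(By.CSS_SELECTOR, ".drawing-surface"),
#       lambda: process_cpu_seconds(browser_pid),
#       budget=20,
#   )
```

This only works if the code reading the CPU time runs on the same machine as the browser, which is exactly what I am unsure about with Selenium Grid.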

Open questions

The points that I'm struggling with are:

- How can a test figure out the PID of the browser process that's running the current test?
- In Selenium Grid, is the unit test code itself run on the same machine as the browser that's running the test, or does this happen over the network?
  - If this happens over the network, is there an existing way for tests to communicate with the remote machine that I could hijack to send and receive information about the CPU time of the browser process? Or will this need to be built separately?
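One partial lead on the PID question: geckodriver includes a moz:processID entry in the capabilities it returns for a session, and the Python bindings expose those as browser.capabilities. A minimal sketch of reading it follows (the helper name browser_pid_from_capabilities is made up). As far as I can tell Chrome does not report an equivalent capability, and on a Grid the PID would only be meaningful on the node that runs the browser, so this does not solve the problem on its own.

```python
def browser_pid_from_capabilities(capabilities):
    """Return the browser PID reported in the session capabilities, or None.

    Only Firefox (via geckodriver) is assumed to report "moz:processID";
    for other browsers this returns None.
    """
    pid = capabilities.get("moz:processID")
    return int(pid) if pid is not None else None


# Hypothetical use in a test: browser_pid_from_capabilities(browser.capabilities)
```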

My understanding of Selenium (Grid), geckodriver/chromedriver, Firefox's Marionette, etc. is quite limited, so I'm trying to understand how I could accomplish the above (or something equivalent) in this rather complex ecosystem, even if this means a PR to these projects. Any tips are appreciated, thanks!
