AlejoDiaz49

Reputation: 95

Execution time inconsistency in a program with high priority in the scheduler using RT Kernel

Problem


We are trying to implement a program that sends commands to a robot at a given cycle time, so this program needs to be a real-time application. We set up a PC with a PREEMPT_RT Linux kernel and launch our programs with chrt -f 98 or chrt -rr 99 to define the scheduling policy and priority. Loading the kernel and launching the program seem to work fine (see details below).
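For reference, the same policy could also be requested from inside the program; the following is only a minimal sketch (the priority 98 is reused from the chrt call above, and the process needs root or CAP_SYS_NICE for this):

#include <sched.h>
#include <cstdio>

int main() {
    // Ask for SCHED_FIFO with priority 98 for this process -- the
    // programmatic equivalent of launching with `chrt -f 98`.
    sched_param sp{};
    sp.sched_priority = 98;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");  // needs root or CAP_SYS_NICE
        return 1;
    }
    // ... real-time work would run here under SCHED_FIFO ...
    return 0;
}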

Now we measured the time (CPU ticks) it takes our program to be computed. We expected this time to be constant with very little variation. What we measured, though, were quite significant differences in computation time. Of course, we thought this could be caused by undefined behavior in our rather complex program, so we created a very basic program and measured its time as well. The behavior was similarly bad.

Question


Why does the computation time of our program vary so much even though it runs with real-time priority on a PREEMPT_RT kernel, and what can we do to make it (more) constant?

Environment Description


First of all, we installed an RT Linux Kernel on the PC using this tutorial. The main characteristics of the PC are:

PC Characteristics | Details
CPU                | Intel(R) Atom(TM) Processor E3950 @ 1.60GHz, 4 cores
Memory             | 8 GB RAM
Operating System   | Ubuntu 20.04.1 LTS
Kernel             | Linux 5.9.1-rt20 SMP PREEMPT_RT
Architecture       | x86-64

Tests


We first detected this problem while measuring the time it takes to execute this "complex" program with a single thread. We ran a few tests with it, and also with a simpler one, measuring:

  1. The CPU execution time
  2. The wall time (the real-world elapsed time)
  3. The difference between them (Wall time - CPU time) and their ratio (CPU time / Wall time), computed per iteration as sketched after this list.
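The derived quantities in (3) come down to roughly the following (a simplified sketch; toSeconds is just an illustrative helper name, not part of our actual test code):

#include <time.h>
#include <cstdio>

// Illustrative helper: convert the interval between two
// clock_gettime() readings to seconds.
static double toSeconds(const timespec& begin, const timespec& end) {
    return (end.tv_sec - begin.tv_sec) + (end.tv_nsec - begin.tv_nsec) * 1e-9;
}

int main() {
    timespec beginWall, endWall, beginCpu, endCpu;
    clock_gettime(CLOCK_REALTIME, &beginWall);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCpu);
    // ... the code being measured ...
    clock_gettime(CLOCK_REALTIME, &endWall);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCpu);

    double wall  = toSeconds(beginWall, endWall);
    double cpu   = toSeconds(beginCpu, endCpu);
    double diff  = wall - cpu;   // time the process was not running on a CPU
    double ratio = cpu / wall;   // close to 1.0 if it ran the whole interval
    printf("diff = %g s, ratio = %g\n", diff, ratio);
    return 0;
}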

We also did a latency test on the PC.

Latency Test

For this one, we followed this tutorial, and these are the results:

Latency Test Generic Kernel

Latency Test RT Kernel

The processes are shown in htop with a priority of RT.
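For context, the core of such a latency test is a periodic sleep to an absolute deadline followed by a check of how late the wake-up actually was. The following is only a rough illustration of that idea, not the tutorial's tool:

#include <time.h>
#include <cstdio>

int main() {
    // Sleep until an absolute deadline every millisecond and print how far
    // past the deadline the process actually woke up (the wake-up latency).
    const long periodNs = 1000000L;  // 1 ms cycle
    timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (int i = 0; i < 1000; i++) {
        next.tv_nsec += periodNs;            // advance the absolute deadline
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        long latencyNs = (now.tv_sec - next.tv_sec) * 1000000000L
                       + (now.tv_nsec - next.tv_nsec);
        printf("%d: wake-up latency %ld ns\n", i, latencyNs);
    }
    return 0;
}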

Test Program - Complex

We called the function multiple times in the program and measured the time each call takes. The results of the two tests are:

Complex - Test 1 - Picture 1

From this we observed significant variation in the computation time between calls.

Test Program - Simple

We did the same test but this time with a simpler program:

#include <vector>
#include <iostream>
#include <time.h>

int main(int argc, char** argv) {
    int iterations = 5000;
    double a = 5.5;
    double b = 5.5;
    double c = 4.5;
    std::vector<double> wallTime(iterations, 0);
    std::vector<double> cpuTime(iterations, 0);
    struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;

    std::cout << "Iteration | WallTime | cpuTime" << std::endl;

    for (int i = 0; i < iterations; i++) {
        // Start measuring time
        clock_gettime(CLOCK_REALTIME, &beginWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);

        // Function
        a = b + c + i;

        // Stop measuring time and calculate the elapsed time
        clock_gettime(CLOCK_REALTIME, &endWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);

        wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec)*1e-9;
        cpuTime[i] = (endCPUTime.tv_sec - beginCPUTime.tv_sec) + (endCPUTime.tv_nsec - beginCPUTime.tv_nsec)*1e-9;

        std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
    }
    return 0;
}

Simple - Test 1 - Picture 1

Final Thoughts


We understand that:

Of course, we can give more details.

Thanks a lot for your help!

Upvotes: 0

Views: 312

Answers (1)

Goswin von Brederlow

Reputation: 12362

Your function will almost certainly be optimized away, so you are really just measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, with some exceptions.
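As an aside before going through those exceptions: if you want the addition itself to be part of the measurement, one simple option (just a sketch, not the only way) is to make the result volatile so the store cannot be removed:

#include <time.h>
#include <cstdio>

int main() {
    volatile double a = 0.0;      // volatile: the stores to 'a' cannot be optimized away
    double b = 5.5, c = 4.5;
    timespec t0, t1;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    for (int i = 0; i < 5000; i++) {
        a = b + c + i;            // the addition now stays inside the measured region
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

    double cpu = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("CPU time for 5000 iterations: %g s (last a = %f)\n", cpu, (double)a);
    return 0;
}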

The very first time you run the code (unless you have just compiled it), the pages need to be loaded from disk. If you are unlucky, the code spans pages and you include the loading of the next page in the measured time. That is quite unlikely given the code size.
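A common precaution in real-time programs anyway (an aside, not something this tiny test strictly needs) is to lock the whole address space into RAM so page faults cannot hit the timed path at all; a minimal sketch:

#include <sys/mman.h>
#include <cstdio>

int main() {
    // Lock all current and future pages into RAM so a page fault can never
    // land inside the timed path.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");   // may need CAP_IPC_LOCK or a raised memlock limit
        return 1;
    }
    // ... the measurement loop runs here without major page faults ...
    return 0;
}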

On the first loop iteration the code and any data need to be loaded into the caches, so that iteration takes longer to execute. The branch predictor might also need a few iterations to predict the loop correctly, so the second and third iterations might be slightly slower too.

For everything else I think you can blame scheduling:

  • an IRQ happens but nothing gets rescheduled
  • the process gets paused while another process runs
  • the process gets moved to another CPU thread (same core), leaving the caches hot
  • the process gets moved to another CPU core making L1 cache cold but leaving L2/L3 caches hot (if your L2 is shared)
  • the process gets moved to a CPU on another socket making L1/L2 caches cold but L3 cache hot (if L3 is shared)

You can do little about IRQs. Some can be pinned to specific cores, but others are simply essential (like the timer interrupt for the scheduler itself). You kind of just have to live with those.

But you can pin your program to a specific CPU core and pin everything else to the other cores, basically reserving that core for the real-time code. I guess you would have to use cgroups for this, to keep everything else off the chosen core. You might still get some kernel threads running on the reserved core; there is nothing you can do about that. But this should eliminate most of the large execution times.
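On the program side, the pinning itself could be done with sched_setaffinity (taskset on the command line would work too); a minimal sketch, with core 3 as an arbitrary example of the reserved core:

#include <sched.h>    // g++ defines _GNU_SOURCE, which sched_setaffinity needs
#include <cstdio>

int main() {
    // Pin this process to CPU core 3 (an arbitrary choice for the sketch;
    // use whichever core you keep everything else away from, e.g. via cgroups).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    // ... the real-time loop now stays on core 3 ...
    return 0;
}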

Upvotes: 1
