Reputation: 179
Registering a level-triggered eventfd with epoll_ctl only fires once when the eventfd counter is not decremented
To summarize the problem: I have observed that the epoll flags (EPOLLET, EPOLLONESHOT, or none for level-triggered behaviour) all behave the same. In other words, the choice of flag has no effect.
Could you confirm this bug?
I have an application with multiple threads. Each thread waits for new events with epoll_wait on the same epollfd. To terminate the application gracefully, all threads have to be woken up. My idea was to use the eventfd counter (EFD_SEMAPHORE|EFD_NONBLOCK) together with level-triggered epoll behaviour to wake them all up at once. (The thundering herd problem can be disregarded for a small number of file descriptors.)
E.g. for 4 threads you write 4 to the eventfd. I expected epoll_wait to return immediately, again and again, until the counter has been decremented (read) 4 times. Instead, epoll_wait only returns once for every write.
Yep, I read all related manuals carefully ;)
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>
#include <pthread.h>

static int event_fd = -1;
static int epoll_fd = -1;

void *thread(void *arg)
{
    (void) arg;
    for(;;) {
        struct epoll_event event;
        epoll_wait(epoll_fd, &event, 1, -1);
        /* handle events */
        if(event.data.fd == event_fd && event.events & EPOLLIN) {
            uint64_t val = 0;
            eventfd_read(event_fd, &val);
            break;
        }
    }
    return NULL;
}

int main(void)
{
    epoll_fd = epoll_create1(0);
    event_fd = eventfd(0, EFD_SEMAPHORE| EFD_NONBLOCK);
    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.fd = event_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &event);
    enum { THREADS = 4 };
    pthread_t thrd[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&thrd[i], NULL, &thread, NULL);
    /* let threads park internally (kernel does readiness check before sleeping) */
    usleep(100000);
    eventfd_write(event_fd, THREADS);
    for (int i = 0; i < THREADS; i++)
        pthread_join(thrd[i], NULL);
}
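The semaphore semantics the question relies on can be checked in isolation, without epoll or threads. The following is only a minimal sketch: `count_semaphore_reads` is a hypothetical helper name, not part of any library. It writes a value to an `EFD_SEMAPHORE|EFD_NONBLOCK` eventfd and counts how many reads succeed before the counter reaches zero. Each successful read should return 1 and decrement the counter by 1.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): write `tokens` to a fresh
 * semaphore-mode eventfd, then count the non-blocking reads that succeed
 * before the counter is drained. Returns the count, or -1 on error. */
int count_semaphore_reads(unsigned int tokens)
{
    int fd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    if (fd < 0)
        return -1;
    if (eventfd_write(fd, tokens) < 0) {
        close(fd);
        return -1;
    }

    int reads = 0;
    int saved_errno = 0;
    for (;;) {
        eventfd_t val = 0;
        if (eventfd_read(fd, &val) < 0) {
            saved_errno = errno;   /* EAGAIN once the counter is zero */
            break;
        }
        if (val != 1) {            /* semaphore mode hands out 1 per read */
            close(fd);
            return -1;
        }
        reads++;
    }
    close(fd);
    return saved_errno == EAGAIN ? reads : -1;
}
```

With `tokens` set to 4, four reads succeed and the fifth fails with `EAGAIN`, which matches the "one token per thread" idea in the question.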
Upvotes: 7
Views: 5102
Reputation: 2954
With a current Linux version (e.g. Ubuntu 22.04 LTS), the code from the question works as intended. I have edited it a bit and added some error checking and time reporting. In particular, the return code of eventfd_read() should always be checked in order to detect spurious wakeups:
#include <sys/time.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>
#include <pthread.h>
#include <stdio.h>

static int event_fd = -1;
static int epoll_fd = -1;

struct thread_data {
    int id;
};

void *thread(void *arg)
{
    struct thread_data* data = (struct thread_data *) arg;
    struct timeval tv;
    gettimeofday(&tv, NULL);
    printf("Thread %d started at %ld.%06ld\n", data->id, tv.tv_sec, tv.tv_usec);
    for(;;) {
        struct epoll_event event;
        int rc = epoll_wait(epoll_fd, &event, 1, -1);
        /* handle events */
        if(rc == 1 && event.data.fd == event_fd && event.events & EPOLLIN) {
            uint64_t val = 0;
            if(eventfd_read(event_fd, &val) >= 0) {
                gettimeofday(&tv, NULL);
                printf("Thread %d received stop signal at %ld.%06ld\n",
                       data->id, tv.tv_sec, tv.tv_usec);
                break;
            } else {
                gettimeofday(&tv, NULL);
                printf("Thread %d received spurious wake up at %ld.%06ld\n",
                       data->id, tv.tv_sec, tv.tv_usec);
            }
        }
    }
    return NULL;
}

int main(void)
{
    enum { THREADS = 4 };
    enum { WAKE_FIRST = 1 };
    epoll_fd = epoll_create1(0);
    event_fd = eventfd(0, EFD_SEMAPHORE| EFD_NONBLOCK);
    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.fd = event_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &event);
    pthread_t thrd[THREADS];
    struct thread_data data[THREADS];
    for(int i = 0; i < THREADS; i++) {
        data[i].id = i;
        pthread_create(&thrd[i], NULL, &thread, (void *) &data[i]);
    }
    /* let threads reach epoll_wait() : */
    usleep(100000);
    struct timeval tv;
    gettimeofday(&tv, NULL);
    printf("\nSending wake signal to %d threads at %ld.%06ld\n",
           WAKE_FIRST, tv.tv_sec, tv.tv_usec);
    eventfd_write(event_fd, WAKE_FIRST);
    if(THREADS > WAKE_FIRST) {
        usleep(100000);
        gettimeofday(&tv, NULL);
        printf("\nSending wake signal to %d threads at %ld.%06ld\n",
               THREADS - WAKE_FIRST, tv.tv_sec, tv.tv_usec);
        eventfd_write(event_fd, THREADS - WAKE_FIRST);
    }
    for(int i = 0; i < THREADS; i++) {
        pthread_join(thrd[i], NULL);
    }
}
Typical output:
Thread 0 started at 1679048746.554414
Thread 1 started at 1679048746.554440
Thread 2 started at 1679048746.554455
Thread 3 started at 1679048746.554492
Sending wake signal to 1 threads at 1679048746.655088
Thread 3 received stop signal at 1679048746.655170
Sending wake signal to 3 threads at 1679048746.755238
Thread 2 received stop signal at 1679048746.755341
Thread 1 received stop signal at 1679048746.755414
Thread 0 received stop signal at 1679048746.755479
A few more observations:
- [...] THREADS and WAKE_FIRST.
- [...] eventfd_write(event_fd, WAKE_FIRST) is performed before the threads are created.
- [...] epoll_wait() again a few times before it finally performs eventfd_read(). Those repeated calls to epoll_wait() will return immediately.
- [...] usleep() before calling eventfd_read(), this leads to spurious wakeups of other threads. It seems that the kernel promotes the eventfd readiness to other threads if the thread(s) that was/were signalled first engage in blocking system calls. That's a good feature, not a bug, in my opinion. And yes, as with all locking primitives, one should always check for spurious wakeups.
Upvotes: 0
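The spurious-wakeup check mentioned in the observations can be factored into a small helper. This is only a sketch under assumptions: `try_consume_stop_token` is a hypothetical name, and it assumes the eventfd was created with `EFD_NONBLOCK`, so that a read on a zero counter fails with `EAGAIN` instead of blocking.

```c
#include <errno.h>
#include <sys/eventfd.h>

/* Hypothetical helper (illustration only), assuming `efd` is a non-blocking
 * eventfd. Returns:
 *    1  a token was consumed, so the caller should stop
 *    0  spurious wakeup: the counter was already zero (EAGAIN)
 *   -1  any other error */
int try_consume_stop_token(int efd)
{
    eventfd_t val;
    if (eventfd_read(efd, &val) == 0)
        return 1;
    return errno == EAGAIN ? 0 : -1;
}
```

A worker would call this after epoll_wait() reports EPOLLIN on the eventfd and go back to epoll_wait() whenever it returns 0.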
Reputation:
When you write to an eventfd, a function eventfd_signal is called. It contains the following line which does the wake up:
wake_up_locked_poll(&ctx->wqh, EPOLLIN);
With wake_up_locked_poll being a macro:
#define wake_up_locked_poll(x, m) \
    __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
With __wake_up_locked_key being defined as:
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
    __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
And finally, __wake_up_common is declared as:
/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
                            int nr_exclusive, int wake_flags, void *key,
                            wait_queue_entry_t *bookmark)
Note the nr_exclusive argument (passed as 1 above) and you will see that writing to an eventfd wakes only one exclusive waiter.
What does exclusive mean? Reading the epoll_ctl man page gives us some insight:
EPOLLEXCLUSIVE (since Linux 4.5):
Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2).
You do not use EPOLLEXCLUSIVE when adding your event, but to wait with epoll_wait every thread has to put itself on a wait queue. The function do_epoll_wait performs the wait by calling ep_poll. By following the code you can see that it adds the current thread to a wait queue at line #1903:
__add_wait_queue_exclusive(&ep->wq, &wait);
Which is the explanation for what is going on - epoll waiters are exclusive, so only a single thread is woken up. This behavior has been introduced in v2.6.22-rc1 and the relevant change has been discussed here.
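Note that this only concerns threads that are already asleep inside epoll_wait. A fresh call to epoll_wait still sees the level-triggered readiness, because the kernel checks for ready events before sleeping. The following single-threaded sketch illustrates that: `count_level_triggered_reports` is a hypothetical helper name, and a timeout of 0 is used so the call polls without ever blocking.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): write `tokens` to a semaphore
 * eventfd registered level-triggered, then count how often a non-blocking
 * epoll_wait() reports it while one token is read per report.
 * Returns the number of reports, or -1 on setup failure. */
int count_level_triggered_reports(unsigned int tokens)
{
    int reports = -1;
    int efd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };

    if (efd >= 0 && epfd >= 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0 &&
        eventfd_write(efd, tokens) == 0) {
        reports = 0;
        for (;;) {
            struct epoll_event got;
            /* timeout 0: only check readiness, never sleep */
            if (epoll_wait(epfd, &got, 1, 0) != 1)
                break;
            eventfd_t val;
            if (eventfd_read(efd, &val) < 0)
                break;
            reports++;
        }
    }
    if (epfd >= 0)
        close(epfd);
    if (efd >= 0)
        close(efd);
    return reports;
}
```

Here every call reports the eventfd as long as the counter is non-zero, so the level-triggered flag does have an effect. What the question observes is the separate matter of how many sleeping waiters a single write wakes.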
To me this looks like a bug in the eventfd_signal function: in semaphore mode it should perform a wake-up with nr_exclusive equal to the value written.
So your options are:
- [...] poll, probably on both eventfd and epoll
- [...] eventfd_write 4 times (probably the best you can do).
Upvotes: 4
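The last option, one eventfd_write per thread, might look roughly like the sketch below. It is written under assumptions rather than as a definitive implementation: `stop_ctx`, `worker` and `stop_workers_one_by_one` are hypothetical names, and the claim that each separate write wakes one sleeping waiter follows from the analysis above rather than from a documented guarantee.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical context shared by all workers (illustration only). */
struct stop_ctx { int epoll_fd; int event_fd; };

static void *worker(void *arg)
{
    struct stop_ctx *ctx = arg;
    for (;;) {
        struct epoll_event ev;
        int rc = epoll_wait(ctx->epoll_fd, &ev, 1, -1);
        if (rc < 0 && errno == EINTR)
            continue;
        if (rc != 1)
            return NULL;                    /* unexpected error */
        if (ev.data.fd == ctx->event_fd && (ev.events & EPOLLIN)) {
            eventfd_t val;
            if (eventfd_read(ctx->event_fd, &val) == 0)
                return ctx;                 /* non-NULL: token consumed */
            /* EAGAIN: another thread took the token, wait again */
        }
    }
}

/* Starts `nthreads` workers, writes 1 to the eventfd once per worker and
 * returns how many workers stopped after consuming a token (-1 on error). */
int stop_workers_one_by_one(int nthreads)
{
    enum { MAX_THREADS = 64 };
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    struct stop_ctx ctx;
    ctx.epoll_fd = epoll_create1(0);
    ctx.event_fd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = ctx.event_fd };
    int stopped = -1;

    if (ctx.epoll_fd >= 0 && ctx.event_fd >= 0 &&
        epoll_ctl(ctx.epoll_fd, EPOLL_CTL_ADD, ctx.event_fd, &ev) == 0) {
        pthread_t thr[MAX_THREADS];
        int started = 0;
        while (started < nthreads &&
               pthread_create(&thr[started], NULL, worker, &ctx) == 0)
            started++;

        usleep(100000);                     /* let threads reach epoll_wait() */
        for (int i = 0; i < started; i++)
            eventfd_write(ctx.event_fd, 1); /* one separate write per thread */

        stopped = 0;
        for (int i = 0; i < started; i++) {
            void *res = NULL;
            pthread_join(thr[i], &res);
            if (res != NULL)
                stopped++;
        }
    }
    if (ctx.epoll_fd >= 0)
        close(ctx.epoll_fd);
    if (ctx.event_fd >= 0)
        close(ctx.event_fd);
    return stopped;
}
```

Calling `stop_workers_one_by_one(4)` should return 4 once all four workers have been joined. Depending on the toolchain, the program may need to be built with `-pthread`.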