Rey
Rey

Reputation: 43

Identifying bug in linux kernel module

I am marking Michael's as he was the first. Thank you to osgx and employee of the month for additional information and assistance.

I am attempting to identify a bug in a consumer/produce kernel module. This is a problem being given to me for a course in university. My teaching assistant was not able to figure it out, and my professor said it was okay if I uploaded online (he doesn't think Stack can figure it out!).

File: final.c

#include <linux/completion.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/module.h>

static int actor_kthread(void *);
static int writer_kthread(void *);

static DECLARE_COMPLETION(episode_cv);
static DEFINE_SPINLOCK(lock);
static int episodes_written;
static const int MAX_EPISODES = 21;
static bool show_over;
static struct task_info {
    struct task_struct *task;
    const char *name;
    int (*threadfn) (void *);
} task_info[] = {
    {.name = "Liz", .threadfn = writer_kthread},
    {.name = "Tracy", .threadfn = actor_kthread},
    {.name = "Jenna", .threadfn = actor_kthread},
    {.name = "Josh", .threadfn = actor_kthread},
};

static int actor_kthread(void *data) {
    struct task_info *actor_info = (struct task_info *)data;
    spin_lock(&lock);
    while (!show_over) {
        spin_unlock(&lock);
        wait_for_completion_interruptible(&episode_cv); //Line 30
        spin_lock(&lock);
        while (episodes_written) {
            pr_info("%s is in a skit\n", actor_info->name);
            episodes_written--;
        }
        reinit_completion(&episode_cv); // Line 36
    }

    pr_info("%s is done for the season\n", actor_info->name);
    complete(&episode_cv); //Why do we need this line?
    actor_info->task = NULL;
    spin_unlock(&lock);
    return 0;
}

static int writer_kthread(void *data) {
    struct task_info *writer_info = (struct task_info *)data;
    size_t ep_num;

    spin_lock(&lock);
    for (ep_num = 0; ep_num < MAX_EPISODES && !show_over; ep_num++) {
        spin_unlock(&lock);

        /* spend some time writing the next episode */
        schedule_timeout_interruptible(2 * HZ);

        spin_lock(&lock);
        episodes_written++;
        complete_all(&episode_cv);
    }

    pr_info("%s wrote the last episode for the season\n", writer_info->name);
    show_over = true;
    complete_all(&episode_cv);
    writer_info->task = NULL;
    spin_unlock(&lock);
    return 0;
}

static int __init tgs_init(void) {
    size_t i;
    for (i = 0; i < ARRAY_SIZE(task_info); i++) {
        struct task_info *info = &task_info[i];
        info->task = kthread_run(info->threadfn, info, info->name);
    }
    return 0;
}

static void __exit tgs_exit(void) {
    size_t i;
    spin_lock(&lock);
    show_over = true;
    spin_unlock(&lock);
    for (i = 0; i < ARRAY_SIZE(task_info); i++)
        if (task_info[i].task)
            kthread_stop(task_info[i].task);
}

module_init(tgs_init);
module_exit(tgs_exit);
MODULE_DESCRIPTION("CS421 Final");
MODULE_LICENSE("GPL");

File: kbuild

Kobj-m := final.o

File: Makefile

# Basic Makefile to pull in kernel's KBuild to build an out-of-tree
# kernel module

KDIR ?= /lib/modules/$(shell uname -r)/build

all: modules

clean modules:

Upvotes: 4

Views: 427

Answers (2)

user4822941
user4822941

Reputation:

I'm quite confused here.

You claim this is a question from an upcoming exam and it was released by the person delivering the course. Why would they do that? Then you say that TA failed to solve the problem. If TA can't do it, who can expect students to pass?

(professor) doesn't think Stack can figure it out

If the claim is that the level on this website is bad I definitely agree. But still, claiming it is below a level to be expected from a random university is a stretch. If there is no claim of the sort, I once more ask how are students expected to do it. What if the problem gets solved?

The code itself is imho unsuitable for teaching as it deviates too much from common idioms.

Another answer here noted one side effect of the actual problem. Namely, it was stated that the loop in tgs_exit can race with threads exiting on their own and test the ->task pointer to be non-NULL, while it becomes NULL just afterwards. The discussion whether this can result in a kthread_stop(NULL) call is not really relevant.

Either a kernel thread exiting on its own will clear everything up OR kthread_stop (and maybe something else) is necessary to do it.

If the former is true, the code suffers from a possible use-after-free. After tgs_exit tests that the pointer, the target thread could have exited. Maybe prior to kthread_stop call or maybe just as it was executed. Either way, it is possible that the passed pointer is stale as the area was already freed by the thread which was exiting.

If the latter is true, the code suffers from resource leaks due to insufficient cleanup - there are no kthread_stop calls if tgs_exit is executed after all threads exit.

The kthread_* api allows threads to just exit, hence effects are as described in the first variant.

For the sake of argument let's say the code is compiled in into the kernel (as opposed to being loaded as a module). Say the exit func is called on shutdown.

There is a design problem that there are 2 exit mechanisms and it transforms into a bug as they are not coordinated. A possible solution for this case would set a flag for writers to stop and would wait for a writer counter to drop to 0.

The fact that the code is in a module makes the problem more acute: unless you kthread_stop, you can't tell if the target thread is gone. In particular "actor" threads do:

actor_info->task = NULL;

So the thread is skipped in the exit handler, which can now finish and let the kernel unload the module itself...

spin_unlock(&lock);
return 0;

... but this code (located in the module!) possibly was not executed yet.

This would not have happened if the code followed an idiom if always using kthread_stop.

Other issue is that writers wake everyone up (so-called "thundering herd problem"), as opposed to at most one actor.

Perhaps the bug one is supposed to find is that each episode has at most one actor? Maybe that the module can exit when there are episodes written but not acted out yet?

The code is extremely weird and if you were shown a reasonable implementation of a thread-safe queue in userspace, you should see how what's presented here does not fit. For instance, why does it block instantly without checking for episodes?

Also a fun fact that locking around the write to show_over plays no role in correctness.

There are more issues and it is quite likely I missed some. As it is, I think the question is of poor quality. It does not look like anything real-world.

Upvotes: 2

Michael Burr
Michael Burr

Reputation: 340188

When cleaning up in tgs_exit() the function executes the following without holding the spinlock:

    if (task_info[i].task)
        kthread_stop(task_info[i].task);

It's possible for a thread that's ending to set it's task_info[i].task to NULL between the check and call to kthread_stop().

Upvotes: 2

Related Questions