Understanding the PID hash table in the linux kernel

Question

This section of the book "Understanding Linux Kernel 3rd edition" explains that instead of searching the process list in order to find a PID, the kernel introduces 4 hash tables, one for each type of PID.

As I understood, each element of a table is the hash of a PID. But how does that makes searching easy? For example, given a PID, is the existence of 4 hash tables because it's faster to search just in the hash of that PID type instead of searching in the lit of all PIDs? Also, why hashing helps? Isn't searching for a hash harder than searching for a simple number?

So, what exactly is an entry in one of these 4 tables? Are them process descriptors? I understood them as it. And in each process descriptor, there is a structure which links to the other similar processes in the same state, that is, for example, the processes that are in the same group and same state.

Is this it?

David LaPorte · Accepted Answer

The kernel stores all processes in the task list. The task list is a circular doubly linked list. Meaning that each element in the list has pointers to the next element and the previous element. The first item links to the last item and vice versa. It can be conceptually thought of as a circle.

Within each task is a process descriptor which holds the PID information you're interested in. What they're saying is that, normally, to find the process you're trying to kill, you have to traverse the task list, looking at the PID field of each process descriptor for each task, until you find the one you're looking for. You can't reference them directly by PID because you have no idea where it is in memory. That's what the links are for. So that the task list does not need to occupy contiguous memory space. Making inserts and deletions possible simply through re-linking. Each process knows where IT is in memory. And it can use it's position in memory to follow the links until it finds the process it's looking for.

That's called linear time search. Worst case, given N elements, it will take you N operations to find your result. And you ALWAYS assume average worst case when describing efficiency. Linear time is pretty inefficient when it comes to searching if you have any significant amount of data.

What they're saying is that there are 4 tables where you can put your PID (depending on your intended target) through a hash function, check the table at the location of your result, and know EXACTLY the memory address of the task in the task list. That's one operation. It's the hash function's job to mitigate collisions. But average worst case, that's called constant time. Much faster.

There is no such thing as searching for a simple number. You're traversing data structures that are in disparate memory locations. If you had a C array, those are pre-allocated in the stack in contiguous memory space. So, in that case, your array index number would point you to the chunk of memory you need immediately. But that's not the case here. These data structures are neither static nor pre-allocated. So you need some way to jump from memory address to memory address. Which is what these data structures address.

I hope that clears things up.

Sources: https://en.wikipedia.org/wiki/Hash_table http://www.makelinux.net/books/lkd2/app01lev1sec1

Understanding the PID hash table in the linux kernel

Answers (1)

Related Questions