Why do Processes "block"?


You have all heard of the panaceas for the problem of “blocking” calls, which are somehow blamed for poor scalability and performance: Node.js uses an event loop, Go uses goroutines, Java introduced Virtual Threads, whereas Erlang has the BEAM. These magical solutions somehow solve the scalability problem and allow platforms to scale endlessly, saving us from the evil that is blocking calls.

But what does being “blocked” actually mean, and why is it bad?

To understand this, we need to dive a bit deeper into how Processes and Threads work and how an OS schedules them. All of my research is Linux-specific, but the concepts are similar on other OSes; with some changes in the implementation details, they can be generalized to other platforms.

An Executable is a file on disk that contains instructions (and data) describing what is to be run (say, a Desktop App, a CLI tool or a web server). A Process is a running instance of that Executable, with resources allocated to it: an entry in the runqueue, memory pages, open files, etc.

CPU time is limited. If we allowed processes an unlimited amount of time to execute their code, any one of them could take all the CPU time for itself, starving other processes of time to execute their code. As an example, imagine a terminal program waiting for input from you, the user, while in the background you are downloading some game files. Before you can type anything in, you get an urgent call of nature and leave your station. If the waiting terminal program were allowed to hog all the CPU time, the downloader would never get a chance to run: it could make neither the network calls to fetch the game files nor the disk calls to save them.

This is why Linux is a preemptible kernel/OS. A running process can be interrupted (stopped) and stashed away with its state intact, another process scheduled in its place, and the previous process restored later when it needs to be.

This abstraction of which processes need to be run is called a runqueue, where processes wait in line to be scheduled (just like a queue). A scheduler picks the next process to run based on some algorithm. Some tasks (like responding to hardware events) have a higher priority than user-run processes, and to deal with these priorities the Linux Kernel uses multiple runqueues, each handled by a different scheduler class. SCHED_FIFO, for example, is the real-time scheduling policy used for tasks that service some hardware interrupts.

Most “normal” user-executed processes are handled by the CFS, aka the Completely Fair Scheduler. Interestingly, the best blog post I have read on understanding how the CFS works is by Phuong Le of VictoriaMetrics, writing about setting the right GOMAXPROCS for Go applications running in Kubernetes containers. Yes, Kubernetes CPU resource requests and limits leak the CFS abstraction into developer space, so to tune your application for optimal performance you need to understand things all the way down to how Linux Kernel scheduling works!

More recent Kernel versions (post v6.5) use a new scheduler called EEVDF, which builds upon CFS, but for our current discussion the differences between CFS and EEVDF aren’t really that important.

Let us get back to the topic. What does it mean to be “blocked”? A process can be in one of many states; processes that are running or can be run are in the “TASK_RUNNING” state. During the scheduler’s loop, it picks the process to run next based on the priority of each process, the amount of time it has run so far, and the scheduler class to which the process is assigned. All this information is part of the process’s struct task_struct, which the Linux Kernel uses to hold the state of a process.

Let’s look at the flow of a “blocking” call, such as reading data from a file.

  1. The process uses glibc’s read() function with a file descriptor to read from a file.
  2. The read() system call dispatches into Linux’s Virtual File System (VFS) subsystem, which routes the actual file reading to the code that handles the filesystem of the storage where the file is located (ext4, XFS etc.)
  3. The filesystem then calls the device driver code that sends the actual instructions to the SSD/spinning disk/network file system to read the bytes off the disk.

This process can take an arbitrary amount of time. The word “arbitrary” here means “unpredictable”: it can take 2 microseconds, 2 milliseconds or even 2 seconds. While it waits for the bytes from the file, the process can do one of two things:

  1. Busy-loop until the VFS sets some flag indicating that the data from the file has been read. This involves the read() call executing something like

while (flag != true) {
	/* spin: e.g. a `nop` or `xor eax, eax` instruction */
}

The busy loop executes instructions that are ultimately useless, but it keeps the process on the CPU. We are now busy burning energy doing nothing until the data arrives, while also preventing other runnable processes from getting the CPU (say you are opening a large video file while simultaneously browsing the internet: the browser may have code ready to run, but the busy loop in your video editor, waiting for the file to load, keeps it off the CPU).

  2. Set up some sort of mechanism to say, “Hey, wake me up when this file I am interested in is ready to be read or has data copied into memory”, and then yield to other processes that can now be run on the CPU.

This second option is what most “blocking” calls do. System calls that need to “block” use a wait_queue to signal that the task is waiting for an event to happen and should be woken up once the event occurs.

For example, here is the implementation of the wait_for_partner function, which is used to wait until the other end of a Unix pipe is ready:

static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
{
	DEFINE_WAIT(rdwait);
	int cur = *cnt;

	while (cur == *cnt) {
		prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
		pipe_unlock(pipe);
		schedule();
		finish_wait(&pipe->rd_wait, &rdwait);
		pipe_lock(pipe);
		if (signal_pending(current))
			break;
	}
	return cur == *cnt ? -ERESTARTSYS : 0;
}

The task sets its state to TASK_INTERRUPTIBLE, acquires a wait entry, waits on the rd_wait queue, and yields to the scheduler by calling schedule(). When the pipe’s partner is ready, it wakes the task through that wait queue, and the scheduler resumes the process from where it yielded, unblocking it to proceed further.

Here is the kernel-side implementation behind the nanosleep system call which, well, as the name says, sleeps for the specified number of nanoseconds.

The lines to watch out for, here again, are where the state of the current task is set to TASK_INTERRUPTIBLE and the subsequent schedule() call.

// code as of a recent kernel version, in kernel/time/hrtimer.c
static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
{
	struct restart_block *restart;

	do {
		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
		hrtimer_sleeper_start_expires(t, mode);

		if (likely(t->task))
			schedule();

		hrtimer_cancel(&t->timer);
		mode = HRTIMER_MODE_ABS;

	} while (t->task && !signal_pending(current));

	__set_current_state(TASK_RUNNING);

	if (!t->task)
		return 0;

	restart = &current->restart_block;
	if (restart->nanosleep.type != TT_NONE) {
		ktime_t rem = hrtimer_expires_remaining(&t->timer);
		struct timespec64 rmt;

		if (rem <= 0)
			return 0;
		rmt = ktime_to_timespec64(rem);

		return nanosleep_copyout(restart, &rmt);
	}
	return -ERESTART_RESTARTBLOCK;
}

The kernel sets the state of the current task (process) to TASK_INTERRUPTIBLE, starts the timer and yields to the scheduler so that another process can be scheduled. The timer can be so short that it fires and the wakeup runs before the call to schedule() just two lines below, so there are additional checks on t->task, in both the if and the while condition, to catch this case.

The timer setup call hrtimer_sleeper_start_expires(t, mode) also arranges for the process to be woken up after the timer expires, via hrtimer_wakeup:

static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
{
	struct hrtimer_sleeper *t =
		container_of(timer, struct hrtimer_sleeper, timer);
	struct task_struct *task = t->task;

	t->task = NULL;
	if (task)
		wake_up_process(task);

	return HRTIMER_NORESTART;
}

int wake_up_process(struct task_struct *p)
{
	return try_to_wake_up(p, TASK_NORMAL, 0);
}

try_to_wake_up is insanely complicated, so I have glossed over its implementation.

So now we have an idea of what happens when a process “blocks”. But what happens if we make no system calls at all? Let us imagine that our code is running a very large matrix multiplication that can take seconds to complete. Should all other programs be paused until this large and slow computation is done?

Fortunately (or unfortunately for power consumption), most systems have a periodic system timer that raises an interrupt every 1/HZ of a second (thus HZ interrupts a second). This system timer often has the highest priority of all interrupts on a system and is used by the kernel for various housekeeping functions. One of the most important jobs of this timer interrupt is to update the runtime accounting of each running task: if a task exceeds its scheduled timeslice, it is marked as needing to be rescheduled.

// The actual function that sets the flag that the task passed as an arg 
// needs to be rescheduled
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
	set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

I traced the path through Kernel v6.4, as it was one of the last few versions where the rescheduling logic was simple enough for me to grok. From v6.6 onward, the kernel moved to the EEVDF scheduler, which aims to distribute CPU time equally among all runnable tasks of the same priority.

scheduler_tick() // The fn run by the system timer interrupt handler
    curr->sched_class->task_tick(rq, curr, 0)
        task_tick_fair(rq, curr, 0)
            entity_tick(cfs_rq, curr_se, queued)
                check_preempt_tick(cfs_rq, curr_se)
                    if (delta_exec > ideal_runtime) resched_curr(rq)
                        resched_curr(rq) -> __resched_curr(rq, TIF_NEED_RESCHED)
                            __resched_curr() sets current->thread_info.TIF_NEED_RESCHED

Note that the system timer interrupt handler only marks the task as needing to be rescheduled. The actual rescheduling (by calling schedule()) is done when the interrupt handler finishes and the kernel switches out of it, via the irqentry_exit function:

noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
	lockdep_assert_irqs_disabled();

	/* Check whether this returns to user mode */
	if (user_mode(regs)) {
		irqentry_exit_to_user_mode(regs);
	} else if (!regs_irqs_disabled(regs)) {
		/* ... */
		if (IS_ENABLED(CONFIG_PREEMPTION))
			irqentry_exit_cond_resched();

		/* ... */
	} else {
		
		if (state.exit_rcu)
			ct_irq_exit();
	}
}

Since a task can be preempted while in user-mode context (as in our matrix multiplication example) or in kernel-mode context (such as when it is serving a read system call and a system timer interrupt arrives), irqentry_exit must handle both modes. irqentry_exit_to_user_mode eventually ends up calling exit_to_user_mode_loop, which is where the current task is rescheduled while exiting the interrupt handler to user mode. irqentry_exit_cond_resched is how the current task is rescheduled in kernel mode (if it is safe to do so). For kernel-mode preemption, additional checks have to be done to make sure the task isn’t holding any spinlocks that could otherwise deadlock the kernel.

Conclusion

OSes also give developers more control over how they deal with blocking calls. Networking calls have had polling/non-blocking APIs for a while (select, epoll, kqueue etc.). Linux more recently introduced io_uring, a framework that turns many system calls into non-blocking operations by submitting them as requests into a submission queue and later grabbing their results from a completion queue. This lets developers orchestrate work and gives them more control over concurrency and the scheduling of blocking calls.