What I learnt in August 2025
August was a slow month. A lot happened personally that I cannot share online; suffice to say that of all the characteristics a person can have, the two most important are optimism and courage: optimism that we will wake up to a better tomorrow, and courage to deal with the trials of everyday life.
I finally got around to setting up a Kubernetes cluster from scratch, following the Kubernetes the Hard Way guide from Kelsey Hightower. Funnily enough, one of the ways I learnt which components are critical was by messing up the setup. I forgot to set the execute (x) permission bit on the binaries for the `kube-scheduler` and the `kube-controller-manager`, and as a result my pods weren't scheduled to run on the worker nodes. It was confounding to see `kubectl apply -f` succeed, but the corresponding pods not show up at all.
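The fix itself was a one-liner. A hedged sketch, assuming the Kubernetes the Hard Way style layout with binaries under `/usr/local/bin` and systemd units named after each component:

```bash
# Assumed paths and unit names from a Kubernetes the Hard Way style setup
sudo chmod +x /usr/local/bin/kube-scheduler /usr/local/bin/kube-controller-manager
sudo systemctl restart kube-scheduler kube-controller-manager

# Pods accepted by the api-server but never scheduled sit in Pending
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```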
Digging deeper, I figured out how modular the design of Kubernetes is. That modularity separates the concerns of the component that accepts manifests and writes their definitions to the `etcd` store (the `api-server`) from core components like the `controller-manager`, which uses the same Kubernetes APIs that operators use (for more on that, read my Anatomy of a Kubernetes Operator post) to learn when manifests are created or updated and to decide whether pods have to be re-created. The task of scheduling those pods falls to the `kube-scheduler`, the component actually responsible for placing pods on a worker node.
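You can watch this division of labour from the outside. A rough sketch, where the manifest and pod names are placeholders and `--output-watch-events` needs a reasonably recent kubectl:

```bash
# Stream the same watch events the controller-manager and scheduler react to
kubectl get pods --watch --output-watch-events &

# Apply a manifest: the api-server stores it, and shortly after,
# kube-scheduler binds the pod to a node
kubectl apply -f nginx-pod.yaml   # placeholder manifest

# The pod's event log records the scheduling decision
kubectl describe pod nginx | grep -A 3 Events
```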
How are pods actually run?
Each Kubernetes cluster has a container runtime that is responsible for creating the containers of a pod. Currently, Kubernetes clusters typically use containerd as the container runtime. However, even containerd doesn't actually create the container; that task belongs to `runc`, another piece of software that creates the running processes for a container.
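You can see this chain without Kubernetes in the picture at all. A minimal sketch, assuming containerd and its `ctr` CLI are installed on the node:

```bash
# Run a container straight through containerd, no Kubernetes involved
sudo ctr images pull docker.io/library/nginx:latest
sudo ctr run -d docker.io/library/nginx:latest web

# containerd hands the actual creation off to runc via a shim process
ps aux | grep containerd-shim
```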
To understand what `runc` does, we need to take a step back and understand what a container actually is.
Deep down, on Linux a container is nothing but a process. When you run an `nginx` or a `redis` container, it is nothing but an `nginx` or `redis` process running on the worker node. However, that by itself isn't sufficient. The `nginx` or `redis` process must be isolated so that it doesn't have permission to access or see other processes, network devices, the hostname, the directory structure of the worker node, and so on. All of this is enforced through a Linux kernel feature called namespaces. When the container is created, the container runtime creates new `pid`, `net`, and `mount` namespaces for the container and then attaches the process to them. The Docker image that is used to run the container is basically a snapshot of the directory structure of a system, as output by `docker build` or by taking a snapshot of a running container with `containerd`. This snapshot is mounted as `/` in the `mount` namespace for the process, thus providing the container with an isolated filesystem that doesn't touch the files of the worker node.
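A runtime does this with syscalls, but the same moving parts can be poked at from a shell. A minimal sketch, assuming `/tmp/rootfs` holds an unpacked image filesystem with a busybox-style shell inside:

```bash
# Create new pid, mount, and net namespaces and enter the snapshot as /
# (/tmp/rootfs is a placeholder for an unpacked image filesystem)
sudo unshare --pid --fork --mount --net chroot /tmp/rootfs /bin/sh

# --- now inside the new namespaces ---
mount -t proc proc /proc   # needed so ps can read process info
ps aux                     # sees only this process tree
ip link                    # no host network devices, just a (down) loopback
```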
When we launch a process on our Linux desktop, it has access to networking devices such as the local loopback device and the ethernet or wifi device, allowing it to access the network. We probably don't want that for a container process, so we create a `net` namespace and explicitly share devices with the container process, or set up a bridge that allows only certain containers to talk to each other (as in the case of a Pod, for example). All of this is orchestrated by the container runtime as per our Pod and Docker image configuration. If you want to dig deeper, check out Iximiuz's courses to learn how these tools work.
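"Explicitly sharing devices" looks roughly like this under the hood. A sketch with illustrative names and addresses:

```bash
# Create a net namespace and wire it up with one end of a veth pair
sudo ip netns add demo
sudo ip link add veth-host type veth peer name veth-ctr
sudo ip link set veth-ctr netns demo

# Give the "container" side an address and bring it up
sudo ip -n demo addr add 10.0.0.2/24 dev veth-ctr
sudo ip -n demo link set veth-ctr up
sudo ip -n demo link set lo up
```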
`cgroups` is a Linux kernel feature that restricts resource usage (like CPU and memory) for a process or a family of parent/child processes. A Pod container's requests and limits are translated into cgroup specifications (in fact, the syntax of limit specifications like `500Mi` maps pretty much directly onto cgroup settings) that are set up before the container process runs, restricting the amount of resources the container process can use. If you want to know more about cgroups and how they are used by the Completely Fair Scheduler on Linux, go read this excellent blog post by Phuong Le on the Victoria Metrics blog.
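The kernel interface behind those limits is just a filesystem. A minimal sketch, assuming cgroup v2 mounted at `/sys/fs/cgroup`:

```bash
# Cap a cgroup at half a CPU and 256 MiB, roughly what a container
# with limits of cpu: 500m / memory: 256Mi ends up with
sudo mkdir /sys/fs/cgroup/demo
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max    # quota period
echo $((256 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/demo/memory.max
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs           # enrol this shell
```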
Connecting tools
Another piece I learnt about is the CNI protocol. CNI, or the Container Network Interface, is a protocol that specifies how a container runtime invokes CLI commands to enable various networking features for containers, and how arguments must be passed to the tools provided by CNI plugins. As an example, the CNI project provides plugins like `bridge`, which creates a network bridge and attaches a container to one end of it. Vendors like Azure and AWS also provide CNI plugins customized for running Kubernetes on their machines, which may have restrictions that our desktops do not. I haven't explored CNI in depth, but one of my goals is to rewrite some of these tools in Zig to better understand when and how containerd takes a Pod spec with one or more containers and orchestrates them on a worker node.
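The protocol itself is pleasantly low-tech: the runtime executes the plugin binary with a handful of `CNI_*` environment variables and the network config on stdin. A hedged sketch against the reference `bridge` plugin, where the paths, names, and subnet are all assumptions:

```bash
# Ask the bridge plugin to attach an existing netns to a bridge network
echo '{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipam": { "type": "host-local", "subnet": "10.22.0.0/16" }
}' | sudo CNI_COMMAND=ADD CNI_CONTAINERID=demo1 \
      CNI_NETNS=/var/run/netns/demo CNI_IFNAME=eth0 \
      CNI_PATH=/opt/cni/bin /opt/cni/bin/bridge
```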
Rounding up the Kubernetes learning: how do Kubernetes Services work? A Kubernetes Service is a `ClusterIP`, a sort of virtual IP address that can resolve to any of the n replicas behind a single service, via a domain name like `servicename.namespace.svc.cluster.local`. The service name resolves to the ClusterIP, and when a client pod tries to reach that IP address, the request can be routed to a backing pod on the same worker node as the client, or to one on another node.
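Seen from a client pod, it is just a DNS name. A sketch with placeholder names:

```bash
# From inside any pod, the service name resolves to its ClusterIP
nslookup servicename.namespace.svc.cluster.local

# Which matches what the control plane reports
kubectl get svc servicename -n namespace   # see the CLUSTER-IP column
```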
All of this routing magic happens through a Linux kernel feature called `iptables` (or another called IPVS). Think of iptables as a set of tables plus rules that allow us to intercept or reroute packets. When a Kubernetes Service is created, the `kube-proxy` that runs on each worker node sets up iptables rules to reroute packets destined for the ClusterIP to the right container process, on the host or outside of it, thus allowing us to transparently access pods with just the DNS name of the service. I found quite a few blog posts on this topic already, which I will continue reading, and I may write more later if I feel some aspect could be better explained or is missing.
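A grossly simplified version of what kube-proxy programs, with illustrative IPs (the real rules live in dedicated `KUBE-*` chains and load-balance across endpoints with probability matches):

```bash
# DNAT traffic for a ClusterIP to one backing pod
sudo iptables -t nat -A PREROUTING -p tcp -d 10.96.0.10 --dport 80 \
     -j DNAT --to-destination 10.244.1.5:8080

# Inspect the chains kube-proxy actually manages on a node
sudo iptables -t nat -L KUBE-SERVICES -n | head
```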
We can see that a lot of the support infrastructure for containers is baked into the Linux kernel as kernel features. So how are containers run on macOS via Docker or Podman? It turns out that macOS runs Linux virtual machines when containers have to be run on Macs: Apple provides a Containerization framework that sets up a virtual machine, configures networking and so on, and container engines like Docker make use of this kind of virtualization support to run containers. The fact that macOS has to run VMs in order to run container workloads, as opposed to containers being normal processes, explains why there is a large overhead for containerized workloads on a Mac.
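A quick way to see the VM hiding in plain sight:

```bash
# On a Mac, the host kernel is Darwin, but the container sees Linux,
# because it is actually running inside a lightweight Linux VM
uname -s                           # Darwin
docker run --rm alpine uname -s    # Linux
```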
Speaking of Hey and DHH, 37signals is experimenting with Harbor to self-host a Docker registry on-prem. I came across Harbor a few months ago as part of my org's challenge to optimize our Docker registry and Artifactory usage, as we were paying an eye-watering amount in SaaS costs (the kind vendors love marking up 30-40% at every contract renewal). Although I was very tempted to try it out, the constraints on talent, time, and effort in my org would never have allowed us to experiment with self-hosting our own Harbor instance, so I am glad that someone is running that experiment and blazing the trail for the rest of us.