Hongli LaiOn Coding, Startups & Lifehttps://www.joyfulbikeshedding.com/blog2023-04-20T00:00:00+00:00Hongli LaiCure Docker volume permission pains with MatchHostFsOwnerhttps://www.joyfulbikeshedding.com/blog/2023-04-20-cure-docker-volume-permission-pains-with-matchhostfsowner.html2023-04-20T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>Run a container with a host directory mount, and it either leaves root-owned files behind or it runs into "permission denied" errors. Welcome to the dreadful <a href="/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html">container host filesystem owner matching problem</a>. These issues <a href="https://www.reddit.com/r/docker/comments/hjsipd/permission_denied_with_volumes/">confuse</a> <a href="https://medium.com/@nielssj/docker-volumes-and-file-system-permissions-772c1aee23ca">and</a> <a href="https://mydeveloperplanet.com/2022/10/19/docker-files-and-volumes-permission-denied/">irritate</a> <a href="https://blog.gougousis.net/file-permissions-the-painful-side-of-docker/">people</a>, and they happen because apps in the container run as a different user than the host user.</p>
<p>There are <a href="/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html#solution-strategies-overiew">various strategies to solve this issue</a>, but they are all non-trivial (requiring complex logic) and/or have significant caveats (e.g., requiring privileged containers). Here's where my new tool <a href="https://github.com/FooBarWidget/matchhostfsowner">MatchHostFsOwner</a> comes in.</p>
<h2 id="how-does-matchhostfsowner-solve-container-file-permission-pains">How does MatchHostFsOwner solve container file permission pains?</h2>
<p>MatchHostFsOwner implements <a href="/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html#strategy-1-matching-the-containers-uidgid-with-the-hosts">solution strategy number 1</a>. It ensures that the container runs as the same user (UID/GID) as the host's user. In short, it:</p>
<ul>
<li>modifies a user account inside the container so that the account's UID/GID matches that of the host user.</li>
<li>executes the actual container command as the aforementioned user account (instead of, e.g., letting it execute as root).</li>
</ul>
<p>This strategy is easier said than done: the linked article documents the many caveats involved. Fortunately, MatchHostFsOwner addresses all of these caveats for you.</p>
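<p>To make the strategy concrete, here's a deliberately simplified shell sketch of what such an entrypoint boils down to. This is illustrative only, not MatchHostFsOwner's actual code (which is written in Rust and handles many more edge cases); it assumes the <code>app</code> account convention and the <code>MHF_HOST_UID</code>/<code>MHF_HOST_GID</code> environment variables from usage mode 2, both described later in this post.</p>

```shell
# Illustrative sketch only: remap the "app" account to the host user's
# UID/GID, then drop root privileges before running the real command.
# We write it to a file here so it can be inspected; in a real image it
# would be COPY'd in and set as the ENTRYPOINT.
cat > entrypoint-sketch.sh <<'EOF'
#!/bin/sh
set -e
# MHF_HOST_UID/MHF_HOST_GID are supplied by the host user at "docker run" time.
groupmod -g "$MHF_HOST_GID" app
usermod  -u "$MHF_HOST_UID" -g "$MHF_HOST_GID" app
chown -R "$MHF_HOST_UID:$MHF_HOST_GID" /home/app
# Drop root privileges and execute the actual container command.
exec setpriv --reuid "$MHF_HOST_UID" --regid "$MHF_HOST_GID" --init-groups "$@"
EOF
chmod +x entrypoint-sketch.sh
```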
<h2 id="using-matchhostfsowner">Using MatchHostFsOwner</h2>
<p>Here are some core concepts to understand:</p>
<ul>
<li>
<p><strong>It's an entrypoint</strong> — Install MatchHostFsOwner as the container entrypoint program. It <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#combining-other-entrypoint-programs-with-matchhostfsowner">should be the first program to run in the container</a>. When it runs, it modifies the container's environment, then executes the next command with the proper UID/GID.</p>
</li>
<li>
<p><strong>It requires host user input</strong> — when starting a container, the host user must tell MatchHostFsOwner what the host user's UID/GID is. How the user passes this information depends on what tool the user uses to start the container (e.g., Docker CLI, Docker Compose, Kubernetes, etc).</p>
</li>
<li>
<p><strong>It requires an extra user account in the container</strong> — MatchHostFsOwner tries to execute the next command under a user account in the container whose UID equals the host user's UID. If no such account exists (which is common), then MatchHostFsOwner will take a specific account and modify its UID/GID to match that of the host user.</p>
<p>The account MatchHostFsOwner will take and modify is called the <strong>"app account"</strong>. MatchHostFsOwner won't create this account for you — you have to supply it. It won't always be used, but often it will.</p>
<p>By default, MatchHostFsOwner assumes that the app account is named <code>app</code>. But this is <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#custom-usergroup-account-name">customizable</a>.</p>
</li>
<li>
<p><strong>It requires root privileges</strong> — MatchHostFsOwner itself requires root privileges to modify the container's environment. It drops these privileges later before executing the next command.</p>
<p>How exactly MatchHostFsOwner is granted root privileges depends on how one is supposed to start the container. This brings us to the two <em>usage modes</em>.</p>
</li>
</ul>
<h2 id="usage-mode-1-start-container-without-root-privileges">Usage mode 1: start container without root privileges</h2>
<p>This mode is most suitable when the container is started without root privileges. For example:</p>
<ul>
<li>When your Dockerfile sets a default user account using <code>USER</code>.</li>
<li>When your container is supposed to be started with <code>docker run --user</code>.</li>
<li>When your Kubernetes spec makes use of securityContext's <code>runAsUser</code>/<code>runAsGroup</code>.</li>
</ul>
<p>In this mode, you must grant MatchHostFsOwner the setuid root bit. MatchHostFsOwner drops its setuid root bit as soon as possible after it has done its work.</p>
<p>This mode has some limitations:</p>
<ul>
<li>The container cannot be started a second time (e.g., via <code>docker stop</code> followed by <code>docker start</code>). Upon the second start, MatchHostFsOwner no longer has the setuid root bit, so it can't do its job. Thus, mode 1 is only useful for ephemeral containers.</li>
<li>It's incompatible with Docker Compose, because Compose may start the container a second time.</li>
<li>The container filesystem on which MatchHostFsOwner is located must be writable, because MatchHostFsOwner must be able to drop its setuid root bit. Thus, you cannot run the container in read-only mode (e.g., <code>docker run --read-only</code>).</li>
</ul>
<h3 id="usage-mode-1-in-action">Usage mode 1 in action</h3>
<p>Begin by preparing the container.</p>
<ul>
<li>Create an account in your container for running your app. It doesn't matter what you name it (it's <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#custom-usergroup-account-name">customizable</a>), but let's call it "app" in this demo because MatchHostFsOwner assumes by default that that's the name. Set this account up as the default account for the container.</li>
<li>Place the MatchHostFsOwner executable in a root-owned directory (e.g., <code>/sbin</code>) and ensure that the executable is owned by root, and has the setuid root bit.</li>
<li>Set up the MatchHostFsOwner executable as the container entrypoint.</li>
</ul>
<p>For example:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> ubuntu:22.04</span>
<span class="c"># Install MatchHostFsOwner. Replace X.X.X with an actual version.</span>
<span class="c"># See https://github.com/FooBarWidget/matchhostfsowner/releases</span>
<span class="k">ADD</span><span class="s"> https://github.com/FooBarWidget/matchhostfsowner/releases/download/vX.X.X/matchhostfsowner-X.X.X-x86_64-linux.gz /sbin/matchhostfsowner.gz</span>
<span class="k">RUN </span><span class="nb">gunzip</span> /sbin/matchhostfsowner.gz <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chown </span>root: /sbin/matchhostfsowner <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chmod</span> +x,+s /sbin/matchhostfsowner
<span class="k">RUN </span>addgroup <span class="nt">--gid</span> 9999 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 9999 <span class="nt">--gid</span> 9999 <span class="nt">--disabled-password</span> <span class="nt">--gecos</span> App app
<span class="c">## Or, on RHEL-based images:</span>
<span class="c"># RUN groupadd --gid 9999 app && \</span>
<span class="c"># useradd --uid 9999 --gid 9999 app</span>
<span class="c">## Or, on Alpine-based images:</span>
<span class="c"># RUN addgroup -g 9999 app && \</span>
<span class="c"># adduser -G app -u 9999 -D app</span>
<span class="k">USER</span><span class="s"> app</span>
<span class="k">ENTRYPOINT</span><span class="s"> ["/sbin/matchhostfsowner"]</span>
</code></pre></div>
<div class="highlight"><pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">-t</span> my-example-image
</code></pre></div>
<p>Next, start the container using a user and group ID that matches the host user's. For example, using the Docker CLI. (See <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#kubernetes">the documentation</a> for a Kubernetes-based example.)</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">--user</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">:</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> my-example-image <span class="nb">id</span> <span class="nt">-a</span>
<span class="c"># Output (assuming host UID/GID is 501/20):</span>
<span class="c"># uid=501(app) gid=20(app) groups=20(app)</span>
</code></pre></div>
<p>Success! Here's what happened under the hood:</p>
<ul>
<li>MatchHostFsOwner (the entrypoint) runs before the container command (<code>id -a</code>) does.</li>
<li>MatchHostFsOwner sees the container is running as UID/GID 501/20. So it modifies the "app" account's UID/GID to 501/20. It can do that because it has setuid root privileges.</li>
<li>MatchHostFsOwner drops its setuid root privileges, then executes the command <code>id -a</code> under the container's "app" account.</li>
</ul>
<h2 id="usage-mode-2-start-container-with-root-privileges">Usage mode 2: start container with root privileges</h2>
<p>In this mode, MatchHostFsOwner obtains root privileges simply because the container itself is started with root privileges; no setuid root bit is required. MatchHostFsOwner drops these privileges as soon as possible after it has done its work.</p>
<p>This mode is most suitable if any of the following is applicable:</p>
<ul>
<li>You're using Docker Compose.</li>
<li>The container could be started a second time, as happens with, e.g., Docker Compose.</li>
<li>The container filesystem in which MatchHostFsOwner is located is read-only.</li>
</ul>
<h3 id="usage-mode-2-in-action">Usage mode 2 in action</h3>
<p>Begin by preparing the container:</p>
<ul>
<li>Create an account in your container for running your app. It doesn't matter what you name it (it's <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#custom-usergroup-account-name">customizable</a>), but let's call it "app" in this demo because MatchHostFsOwner assumes by default that that's the name. Set this account up as the default account for the container.</li>
<li>Place the MatchHostFsOwner executable in a root-owned directory (e.g., <code>/sbin</code>) and ensure that the executable is owned by root.</li>
<li>Set up the MatchHostFsOwner executable as the container entrypoint.</li>
<li>Don't set a default user account with <code>USER</code>.</li>
</ul>
<p>Example:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> ubuntu:22.04</span>
<span class="c"># Install MatchHostFsOwner. Replace X.X.X with an actual version.</span>
<span class="c"># See https://github.com/FooBarWidget/matchhostfsowner/releases</span>
<span class="k">ADD</span><span class="s"> https://github.com/FooBarWidget/matchhostfsowner/releases/download/vX.X.X/matchhostfsowner-X.X.X-x86_64-linux.gz /sbin/matchhostfsowner.gz</span>
<span class="k">RUN </span><span class="nb">gunzip</span> /sbin/matchhostfsowner.gz <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chown </span>root: /sbin/matchhostfsowner <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chmod</span> +x /sbin/matchhostfsowner
<span class="k">RUN </span>addgroup <span class="nt">--gid</span> 9999 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 9999 <span class="nt">--gid</span> 9999 <span class="nt">--disabled-password</span> <span class="nt">--gecos</span> App app
<span class="c">## Or, on RHEL-based images:</span>
<span class="c"># RUN groupadd --gid 9999 app && \</span>
<span class="c"># useradd --uid 9999 --gid 9999 app</span>
<span class="c">## Or, on Alpine-based images:</span>
<span class="c"># RUN addgroup -g 9999 app && \</span>
<span class="c"># adduser -G app -u 9999 -D app</span>
<span class="k">ENTRYPOINT</span><span class="s"> ["/sbin/matchhostfsowner"]</span>
</code></pre></div>
<div class="highlight"><pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">-t</span> my-example-image
</code></pre></div>
<p>Next, start the container while setting the environment variables <code>MHF_HOST_UID</code> and <code>MHF_HOST_GID</code> to the host user's UID/GID like this:</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">-e</span> <span class="s2">"MHF_HOST_UID=</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">"</span> <span class="nt">-e</span> <span class="s2">"MHF_HOST_GID=</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> my-example-image <span class="nb">id</span> <span class="nt">-a</span>
<span class="c"># Output (assuming host UID/GID is 501/20):</span>
<span class="c"># uid=501(app) gid=20(app) groups=20(app)</span>
</code></pre></div>
<p>Here's what happened under the hood:</p>
<ul>
<li>MatchHostFsOwner (the entrypoint) runs before the container command (<code>id -a</code>) does.</li>
<li>MatchHostFsOwner sees that <code>MHF_HOST_UID</code>/<code>MHF_HOST_GID</code> is set to 501/20. So it modifies the "app" account's UID/GID to 501/20.</li>
<li>MatchHostFsOwner drops its root privileges, then executes the command <code>id -a</code> under the container's "app" account.</li>
</ul>
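<p>Since mode 2 is the mode that works with Docker Compose, here's what a minimal Compose setup might look like. The service name and volume path are made up for illustration; the image is the <code>my-example-image</code> built above.</p>

```shell
# Hypothetical docker-compose.yml for usage mode 2. The image already has
# MatchHostFsOwner as its ENTRYPOINT; we only pass the host UID/GID through.
cat > docker-compose.yml <<'EOF'
services:
  myapp:
    image: my-example-image
    environment:
      MHF_HOST_UID: "${MHF_HOST_UID}"
      MHF_HOST_GID: "${MHF_HOST_GID}"
    volumes:
      - ./data:/app/data
EOF
# Then start it with the host user's UID/GID:
#   MHF_HOST_UID="$(id -u)" MHF_HOST_GID="$(id -g)" docker compose up
```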
<figure>
<a href="https://github.com/FooBarWidget/matchhostfsowner"><img src="/images/2023/matchhostfsowner-mascot-small-8c393772.jpg" alt="MatchHostFsOwner mascot: dog with glasses" /></a>
<figcaption>MatchHostFsOwner project mascot</figcaption>
</figure>
<h2 id="conclusion">Conclusion</h2>
<p>MatchHostFsOwner is an excellent way to solve Docker volume permission problems (more precisely: the container host filesystem owner matching problem). Please have a look at its <a href="https://github.com/FooBarWidget/matchhostfsowner">source code</a> (it's written in Rust!) and check out <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md">its documentation</a> for customization, advanced usage, and troubleshooting instructions.</p>
<p>Stay cured!</p>
Ubuntu 22.04 support for Fullstaq Ruby is herehttps://www.joyfulbikeshedding.com/blog/2022-04-30-ubuntu-22-04-support-for-fullstaq-ruby-is-here.html2022-04-30T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>Ubuntu 22.04 was released a couple of days ago. Fullstaq Ruby now provides packages for this distribution!</p>
<blockquote>
<p><a href="https://fullstaqruby.org">Fullstaq Ruby</a> distributes server-optimized Ruby binaries. <a href="https://github.com/fullstaq-ruby/server-edition/blob/main/README.md#installation">Install</a> the latest Ruby versions with APT/YUM instead of compiling. Easily keep Ruby <a href="https://github.com/fullstaq-ruby/server-edition/blob/main/README.md#minor-version-packages-a-great-way-to-keep-ruby-security-patched">security patched</a> via auto-tiny version updates. Combat memory bloat (<a href="https://dev.to/evilmartians/fullstaq-ruby-first-impressions-and-how-to-migrate-your-docker-kubernetes-ruby-apps-today-4fm7">save as much as 50%</a>) with <a href="https://github.com/fullstaq-ruby/server-edition/blob/main/README.md#key-features">memory allocator improvements</a>.</p>
</blockquote>
<p>Here's the corresponding pull request: #96.</p>
<p>Note that we only provide Ruby 3.1 packages for Ubuntu 22.04. This is because Ubuntu 22.04 ships with OpenSSL v3, and only Ruby 3.1 is compatible with that OpenSSL version.</p>
<p>Want to install or upgrade? Check <a href="https://github.com/fullstaq-ruby/server-edition/blob/master/README.md#installation">the installation instructions</a>, or run <code>apt upgrade</code>/<code>yum update</code>.</p>
Ruby gem: distributed locking on Google Cloudhttps://www.joyfulbikeshedding.com/blog/2021-09-14-ruby-gem-distributed-locking-on-google-cloud.html2021-09-14T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>I previously designed a robust <a href="2021-05-19-robust-distributed-locking-algorithm-based-on-google-cloud-storage.html.md">distributed locking algorithm based on Google Cloud</a>. Now I'm releasing a Ruby implementation of this algorithm: <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby">distributed-lock-google-cloud-storage-ruby</a>.</p>
<p>To use this, add to your Gemfile:</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="n">gem</span> <span class="s1">'distributed-lock-google-cloud-storage'</span>
</code></pre></div>
<p>Its typical usage is as follows. Initialize a Lock instance. It must be backed by a Google Cloud Storage bucket and object. Then do your work within a <code>#synchronize</code> block.</p>
<p><strong>Important:</strong> If your work is a long-running operation, then also be sure to call <code>#check_health!</code> <em>periodically</em> to check whether the lock is still healthy. This call throws an exception if it's not healthy. Learn more in <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby/blob/main/README.md#long-running-operations-lock-refreshing-and-lock-health-checking">Long-running operations, lock refreshing and lock health checking</a>.</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="nb">require</span> <span class="s1">'distributed-lock-google-cloud-storage'</span>
<span class="n">lock</span> <span class="o">=</span> <span class="no">DistributedLock</span><span class="o">::</span><span class="no">GoogleCloudStorage</span><span class="o">::</span><span class="no">Lock</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span>
<span class="ss">bucket_name: </span><span class="s1">'your bucket name'</span><span class="p">,</span>
<span class="ss">path: </span><span class="s1">'locks/mywork'</span><span class="p">)</span>
<span class="n">lock</span><span class="p">.</span><span class="nf">synchronize</span> <span class="k">do</span>
<span class="n">do_some_work</span>
<span class="c1"># IMPORTANT: _periodically_ call this!</span>
<span class="n">lock</span><span class="p">.</span><span class="nf">check_health!</span>
<span class="n">do_more_work</span>
<span class="k">end</span>
</code></pre></div>
<p>To learn more about this gem, please check out <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby/blob/main/README.md">its README</a> and its <a href="https://foobarwidget.github.io/distributed-lock-google-cloud-storage-ruby/DistributedLock/GoogleCloudStorage/Lock.html">full API docs</a>.</p>
A robust distributed locking algorithm based on Google Cloud Storagehttps://www.joyfulbikeshedding.com/blog/2021-05-19-robust-distributed-locking-algorithm-based-on-google-cloud-storage.html2021-05-19T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>Many workloads nowadays involve many systems that operate concurrently. This ranges from microservice fleets to workflow orchestration to CI/CD pipelines. Sometimes it's important to coordinate these systems so that concurrent operations don't step on each other. One way to do that is by using <em>distributed locks</em> that work across multiple systems.</p>
<p>Distributed locks used to require complex algorithms or complex-to-operate infrastructure, making them expensive both in terms of costs as well as in upkeep. With the emergence of fully managed and serverless cloud systems, this reality has changed.</p>
<p>In this post I'll look into a distributed locking algorithm based on Google Cloud. I'll discuss several existing implementations and suggest algorithmic improvements in terms of performance and robustness.</p>
<p><strong>Update</strong>: there is now a <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby">Ruby implementation</a> of this algorithm!</p>
<h2 id="use-cases-for-distributed-locks">Use cases for distributed locks</h2>
<p>Distributed locks are useful in any situation in which multiple systems may operate on the same state concurrently. Concurrent modifications may corrupt the state, so one needs a mechanism to ensure that only one system can modify the state at the same time.</p>
<p>A good example is Terraform. When you store the Terraform state in the cloud, and you run multiple Terraform instances concurrently, then Terraform guarantees that only one Terraform instance can modify the infrastructure concurrently. This is done through a distributed lock. In contrast to a regular (local system) lock, a distributed lock works across multiple systems. So even if you run two Terraform instances on two different machines, then Terraform still protects you from concurrent modifications.</p>
<p>More generally, distributed locks are useful for <strong>ad-hoc system/cloud automation scripts and CI/CD pipelines</strong>. Sometimes you want your script or pipeline to perform non-trivial modifications that take many steps. It can easily happen that multiple instances of the script or pipeline are run. When that happens, you don't want those multiple instances to perform the modification at the same time, because that can corrupt things. You can use a distributed lock to make concurrent runs safe.</p>
<p>Here's a concrete example involving a CI/CD pipeline. <a href="https://fullstaqruby.org">Fullstaq Ruby</a> had an APT and YUM repository hosted on <a href="https://bintray.com/">Bintray</a>. A few months ago, Bintray announced that it would shut down in the near future, so <a href="https://github.com/fullstaq-labs/fullstaq-ruby-server-edition/blob/main/dev-handbook/apt-yum-repo.md">we had to migrate to a different solution</a>. We chose to self-host our APT and YUM repository on a cloud object store.</p>
<figure>
<img src="/images/2021/distributed-lock-arch-9433f803.svg" alt="" />
<figcaption>The Fullstaq Ruby package publishing pipeline uses a distributed lock to guarantee concurrency-safety. Learn more: <a href="https://github.com/fullstaq-labs/fullstaq-ruby-server-edition/blob/main/dev-handbook/apt-yum-repo.md">Fullstaq Ruby's APT and YUM repository setup</a></figcaption>
</figure>
<p>APT and YUM repositories consist of a bunch of .deb and .rpm packages, plus a bunch of metadata. Package updates are published through Fullstaq Ruby's CI/CD system. This CI/CD system directly modifies multiple files on the cloud object store. We want this publication process to be <strong>concurrency-safe</strong>, because if we commit too quickly then multiple CI/CD runs may occur at the same time. The easiest way to achieve this is by using a distributed lock, so that only one CI/CD pipeline may operate on the cloud object bucket concurrently.</p>
<h2 id="why-building-on-google-cloud-storage">Why build on Google Cloud Storage?</h2>
<p>Distributed locks used to be hard to implement. In the past they required complicated <a href="https://en.wikipedia.org/wiki/Consensus_(computer_science)">consensus protocols</a> such as <a href="https://en.wikipedia.org/wiki/Paxos_(computer_science)">Paxos</a> or <a href="https://en.wikipedia.org/wiki/Raft_(algorithm)">Raft</a>, as well as the hassle of hosting yet another service. See <a href="https://en.wikipedia.org/wiki/Distributed_lock_manager">Distributed lock manager</a>.</p>
<p>More recently, people started implementing distributed locks on top of other distributed systems, such as transactional databases and Redis. This significantly reduced the complexity of the algorithms, but operational complexity was still significant. A big issue is that these systems aren't "serverless": operating and maintaining a database instance or a Redis instance is not cheap. It's not cheap in terms of effort, and it's not cheap in terms of costs: you pay for a database/Redis instance based on its uptime, not based on how many operations you perform.</p>
<p>Luckily, there are many cloud systems nowadays which not only provide the building blocks necessary to build a distributed lock, but are also fully managed and serverless. Google Cloud Storage is a great system to build a distributed lock on. It's cheap, it's popular, it's highly available and it's maintenance-free. You only pay for the amount of operations you perform on it.</p>
<h2 id="basic-challenges-of-distributed-locking">Basic challenges of distributed locking</h2>
<p>One of the problems that distributed locking algorithms need to solve, is the fact that participants in the algorithm need to <strong>communicate</strong> with each other. Distributed systems may run in different networks that aren't directly connected.</p>
<p>Another problem is that of <strong>concurrency control</strong>. This is made difficult by communication lag. If two participants request ownership of a lock simultaneously, then we want both of them to agree on a single outcome even though it takes time for each participant to hear the other.</p>
<p>Finally, there is the problem of <strong>state consistency</strong>. When you write to a storage system, then next time you read from that system you want to read what you just wrote. This is called <em>strong consistency</em>. Some storage systems are <em>eventually consistent</em>, which means that it takes a while before you read what you just wrote. Storage systems that are eventually consistent are not suitable for implementing distributed locks.</p>
<p>This is why we leverage Google Cloud Storage as both a communication channel and a "referee". Everyone can connect to Cloud Storage, and access control is simple and well-understood. Cloud Storage <a href="https://cloud.google.com/storage/docs/consistency">is also a strongly consistent system</a> and has <a href="https://cloud.google.com/storage/docs/generations-preconditions">concurrency control features</a>. The latter allows Cloud Storage to make a single, final decision when two participants try to take ownership of the lock simultaneously.</p>
<h2 id="building-blocks-generation-numbers-and-atomic-operations">Building blocks: generation numbers and atomic operations</h2>
<p>Every Cloud Storage object has two separate <a href="https://cloud.google.com/storage/docs/generations-preconditions#_Generations">generation numbers</a>.</p>
<ul>
<li>The normal generation number changes every time the object's data is modified.</li>
<li>The metageneration number changes every time the object's metadata is modified.</li>
</ul>
<p>When you perform a modification operation, you can use the <a href="https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch">x-goog-if-generation-match</a>/<a href="https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifmetagenerationmatch">x-goog-if-metageneration-match</a> headers in the Cloud Storage API to say: "only perform this operation if the generation/metageneration equals this value". Cloud Storage guarantees that this effect is atomic and free of race conditions. These headers are called <strong>precondition headers</strong>.</p>
<p>The special value 0 for x-goog-if-generation-match means "only perform this operation if the object does not exist".</p>
<p>This feature — the ability to specify preconditions to operations — is key to concurrency control.</p>
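<p>As a concrete illustration: in Cloud Storage's JSON API, the same precondition is expressed as the <code>ifGenerationMatch</code> query parameter. Here's a sketch of an only-if-absent object creation; the bucket and object names are hypothetical, and the <code>curl</code> call is shown commented out because it needs real credentials.</p>

```shell
# Create an object only if it doesn't exist yet. ifGenerationMatch=0 is the
# JSON API's equivalent of the x-goog-if-generation-match: 0 header.
BUCKET="my-bucket"          # hypothetical bucket name
LOCK_OBJECT="locks/mywork"  # hypothetical object name
CREATE_URL="https://storage.googleapis.com/upload/storage/v1/b/${BUCKET}/o?uploadType=media&name=${LOCK_OBJECT}&ifGenerationMatch=0"
# curl -X POST -d 'locked' \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H 'Content-Type: text/plain' \
#   "$CREATE_URL"
# HTTP 200 => object created; HTTP 412 Precondition Failed => already exists.
echo "$CREATE_URL"
```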
<h2 id="existing-implementations">Existing implementations</h2>
<p>Several implementations of a distributed lock based on Google Cloud Storage already exist. A prominent one is <a href="https://github.com/mco-gh/gcslock">gcslock</a> by <a href="https://mco.dev/">Marc Cohen</a>, who works at Google. Gcslock leverages the <a href="https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch">x-goog-if-generation-match</a> header, as described in the previous section. Its algorithm is simple, as we'll discuss in the next section.</p>
<p>Most other implementations, such as <a href="https://github.com/thinkingmachines/gcs-mutex-lock">gcs-mutex-lock</a> and <a href="https://github.com/XaF/gcslock-ruby">gcslock-ruby</a>, use the gcslock algorithm though with minor adaptations.</p>
<p>I've been able to find one implementation that's significantly different and more advanced: HashiCorp Vault's leader election algorithm. Though it's not functionally meant to be used as a lock, technically it boils down to a lock. We'll discuss this algorithm in a later section.</p>
<h2 id="gcslock-a-basic-locking-algorithm">Gcslock: a basic locking algorithm</h2>
<p>The gcslock algorithm is as follows:</p>
<ul>
<li>Taking the lock means creating an object with <code>x-goog-if-generation-match: 0</code>.
<ul>
<li>The content of the object does not matter.</li>
<li>If creation is successful, then it means we've taken the lock.</li>
<li>If creation fails with a 412 Precondition Failed error, then it means the object already exists. This means the lock was already taken. We retry later. The retry sleep time increases exponentially every time taking the lock fails.</li>
</ul>
</li>
<li>Releasing the lock means deleting the object.</li>
</ul>
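<p>The algorithm above can be sketched as follows. This is illustrative Python against a minimal in-memory stand-in for the bucket, not the actual gcslock code, which talks to the Cloud Storage API:</p>

```python
import time

class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """Minimal stand-in for a bucket that honors if_generation_match=0."""
    def __init__(self):
        self.objects = set()

    def create(self, name, if_generation_match=None):
        if if_generation_match == 0 and name in self.objects:
            raise PreconditionFailed()
        self.objects.add(name)

    def delete(self, name):
        self.objects.discard(name)

def take_lock(bucket, name, max_tries=5, base_sleep=0.01):
    sleep = base_sleep
    for _ in range(max_tries):
        try:
            bucket.create(name, if_generation_match=0)  # atomic "create if absent"
            return True
        except PreconditionFailed:
            time.sleep(sleep)   # lock is already taken: back off...
            sleep *= 2          # ...exponentially, as gcslock does
    return False

def release_lock(bucket, name):
    bucket.delete(name)

bucket = FakeBucket()
assert take_lock(bucket, "my-lock")                    # first taker wins
assert not take_lock(bucket, "my-lock", max_tries=3)   # contender gives up
release_lock(bucket, "my-lock")
assert take_lock(bucket, "my-lock")                    # free again after release
```

The key point is that <code>if_generation_match=0</code> turns object creation into an atomic test-and-set.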
<p>This algorithm is very simple. It is also relatively high-latency, because Cloud Storage's response time is measured in tens to hundreds of milliseconds, and because it utilizes retries with exponential backoff. Relatively high latency may or may not be a problem depending on your use case. It's probably fine for most batch operations, but it's probably unacceptable for applications that require pseudo-realtime responsiveness.</p>
<p>There are bigger issues though:</p>
<ul>
<li>
<p><strong>Prone to crashes</strong>. If a process crashes while holding the lock, then the lock remains stuck until an administrator manually deletes it.</p>
</li>
<li>
<p><strong>Hard to find out who the owner is</strong>. The lock object records nothing about its owner. The only way to find out who holds the lock is by querying the processes themselves.</p>
</li>
<li>
<p><strong>Unbounded backoff</strong>. The exponential backoff has no upper limit. If the lock stays taken for a long time (e.g., because a process crashed while holding it), then the exponential backoff grows without bound. This means that an administrator may need to restart all sorts of processes after deleting a stale lock.</p>
<p><a href="https://github.com/thinkingmachines/gcs-mutex-lock">gcs-mutex-lock</a> and <a href="https://github.com/XaF/gcslock-ruby">gcslock-ruby</a> address this by setting an upper bound to the exponential backoff.</p>
</li>
<li>
<p><strong>Retry contention</strong>. If multiple processes start taking the lock at the same time, then they all back off at the same rate. This means that they end up retrying at the same time. This causes spikes in API requests towards Google Cloud Storage. This can cause network contention issues.</p>
<p><a href="https://github.com/thinkingmachines/gcs-mutex-lock">gcs-mutex-lock</a> addresses this by adding jitter to the backoff time.</p>
</li>
<li>
<p><strong>Unintended releases</strong>. A lock release request may be delayed by the network. Imagine the following scenario:</p>
<ol>
<li>The lock owner sends a release (delete) request, which gets delayed by the network.</li>
<li>An administrator thinks the lock is stale, and deletes it.</li>
<li>Another process takes the lock.</li>
<li>The original release request finally arrives, inadvertently releasing the new owner's lock.</li>
</ol>
<p>This sort of network-delay-based problem is even <a href="https://cloud.google.com/storage/docs/generations-preconditions#special-case">documented in the Cloud Storage documentation as a potential risk</a>.</p>
</li>
</ul>
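<p>The two refinements mentioned above, an upper bound and jitter, can be expressed as a small helper. The base delay, cap and jitter fraction below are illustrative choices, not values prescribed by any of these libraries:</p>

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0, jitter=0.5):
    """Exponential backoff clamped at `cap` seconds, with up to +/-50%
    randomization so that contenders don't all retry in lockstep."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(1 - jitter, 1 + jitter)

delays = [backoff_delay(n) for n in range(12)]
assert all(0 < d <= 30.0 * 1.5 for d in delays)  # never exceeds cap + jitter
```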
<h2 id="resisting-stuck-locks-via-ttls">Resisting stuck locks via TTLs</h2>
<p>One way to avoid stuck locks left behind by crashed processes is to consider locks <strong>stale</strong> if they are "too old". We can use the timestamps that Cloud Storage manages, which change every time an object is modified.</p>
<p>What should be considered "too old" really depends on the specific operation. So this should be a configurable parameter, which we call the <strong>time-to-live (TTL)</strong>.</p>
<p>What's more, the same TTL value should be agreed upon by all processes. Otherwise we risk one process considering the lock stuck even though its owner thinks it isn't. One way to ensure that all processes agree on the same TTL is to configure them all with the same TTL value, but this approach is error-prone. A better way is to store the TTL value in the lock object itself.</p>
<p>Here's the updated locking algorithm:</p>
<ol>
<li>Create the object with <code>x-goog-if-generation-match: 0</code>.
<ul>
<li>Store the TTL in a metadata header.</li>
<li>The content of the object does not matter.</li>
</ul>
</li>
<li>If creation is successful, then it means we've taken the lock.</li>
<li>If creation fails with a 412 Precondition Failed error (meaning the object already exists), then:
<ol>
<li>Fetch from its metadata the update timestamp, generation number and TTL.</li>
<li>If the update timestamp is older than the TTL, then delete the object, with <code>x-goog-if-generation-match: [generation]</code>. Specifying this header is important, because if someone else takes the lock concurrently (meaning the lock is no longer stale), then we don't want to delete their lock.</li>
<li>Retry the locking algorithm after an exponential backoff (potentially with an upper limit and jitter).</li>
</ol>
</li>
</ol>
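<p>The TTL-aware algorithm can be sketched like this (again illustrative Python against an in-memory stand-in; a real implementation would read the object's metadata and issue the conditional delete via the Cloud Storage API):</p>

```python
import time

class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """In-memory stand-in; each lock stores a generation, a TTL and an
    update timestamp, mimicking the metadata we'd read from Cloud Storage."""
    def __init__(self):
        self.objects = {}
        self.next_generation = 1

    def create(self, name, ttl, now, if_generation_match=None):
        if if_generation_match == 0 and name in self.objects:
            raise PreconditionFailed()
        self.objects[name] = {
            "generation": self.next_generation,
            "ttl": ttl,            # stored in the lock so all processes agree
            "updated": now,
        }
        self.next_generation += 1

    def metadata(self, name):
        return dict(self.objects[name])

    def delete(self, name, if_generation_match=None):
        obj = self.objects.get(name)
        if obj is None:
            return
        if (if_generation_match is not None
                and obj["generation"] != if_generation_match):
            raise PreconditionFailed()  # someone retook the lock: don't delete
        del self.objects[name]

def try_take_lock(bucket, name, ttl, now):
    """One pass of the TTL-aware algorithm. Returns True if we got the lock;
    on False the caller retries after an exponential backoff."""
    try:
        bucket.create(name, ttl, now, if_generation_match=0)
        return True
    except PreconditionFailed:
        meta = bucket.metadata(name)
        if now - meta["updated"] > meta["ttl"]:        # lock looks stale
            try:
                bucket.delete(name, if_generation_match=meta["generation"])
            except PreconditionFailed:
                pass                                   # lost the race; fine
        return False

bucket = FakeBucket()
t0 = time.time()
assert try_take_lock(bucket, "lock", ttl=300, now=t0)           # taken
assert not try_take_lock(bucket, "lock", ttl=300, now=t0 + 10)  # held, fresh
assert not try_take_lock(bucket, "lock", ttl=300, now=t0 + 301) # stale: deleted
assert try_take_lock(bucket, "lock", ttl=300, now=t0 + 301)     # now free
```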
<p>What's a good value for the TTL?</p>
<ul>
<li>Cloud Storage's latency is relatively high, in the order of tens to hundreds of milliseconds. So the TTL should be at least a few seconds.</li>
<li>If you perform Cloud Storage operations via the <code>gsutil</code> CLI, then you should be aware that gsutil takes a few seconds to start. Thus, the TTL should be at least a few tens of seconds.</li>
<li>A distributed lock like this is best suited for batch workloads. Such workloads typically take seconds to tens or even hundreds of seconds. Your TTL should be a safe multiple of the time your operation is expected to take. We'll discuss this further in the next section, "long-running operations".</li>
</ul>
<p>As a general rule, I'd say that a safe TTL should be in the order of minutes. It should be at least 1 minute. I think a <strong>good default is 5 minutes</strong>.</p>
<h2 id="long-running-operations">Long-running operations</h2>
<p>If an operation takes longer than the TTL, then another process could take ownership of the lock even though the original owner is still operating. Increasing the TTL addresses this issue somewhat, but this approach has drawbacks:</p>
<ul>
<li>If the operation's completion time is unknown, then it's impossible to pick a TTL.</li>
<li>A larger TTL means that it takes longer for processes to detect stale locks.</li>
</ul>
<p>A better approach is to <strong>refresh</strong> the object's update timestamp regularly as long as the operation is still in progress. Keep the TTL relatively short, so that if the process crashes then it won't take too much time for others to detect the lock as stale.</p>
<p>We implement refreshing via a <a href="https://cloud.google.com/storage/docs/json_api/v1/objects/patch">PATCH object API call</a>. The exact data to patch doesn't matter: we only care about the fact that Cloud Storage will change the update timestamp.</p>
<p>We call the time between refreshes the <strong>refresh interval</strong>. A proper value for the refresh interval depends on the TTL. It must be much shorter than the TTL, otherwise refreshing the lock is pointless. Its value should take into consideration that a refresh operation is subject to network delays, or even local CPU scheduling delays.</p>
<p>As a general rule, <strong>I recommend a refresh interval that's at most 1/8th of the TTL</strong>. Given a default TTL of 5 minutes, I recommend a <strong>default refresh interval of ~37 seconds</strong>. This recommendation takes into consideration that refreshes can fail, which we'll discuss in the next section.</p>
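<p>A quick sanity check of these recommendations, using the default values:</p>

```python
TTL = 5 * 60.0                    # 300 s: recommended default TTL
refresh_interval = TTL / 8        # 37.5 s: the "at most 1/8th" rule

# A refresh is declared failed only after 3 consecutive tries
# (1 try + 2 retries), so the worst-case time between the last successful
# refresh and the moment we decide to abort is about 3 intervals:
worst_case_detection = 3 * refresh_interval   # 112.5 s

# That still leaves well over half the TTL as time budget for aborting
# cleanly before other processes consider the lock stale:
abort_budget = TTL - worst_case_detection     # 187.5 s
assert refresh_interval == 37.5
assert abort_budget > TTL / 2
```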
<h2 id="refresh-failures">Refresh failures</h2>
<p>Refreshing the lock can fail. There are two failure categories:</p>
<ul>
<li>
<p><strong>Unexpected state</strong></p>
<ul>
<li>The lock object could have been unexpectedly modified by someone else.</li>
<li>The lock object could be unexpectedly deleted.</li>
</ul>
</li>
<li>
<p><strong>Network problems</strong></p>
<ul>
<li>If this means that the refresh operation is arbitrarily delayed by the network, then we can end up refreshing a lock that we don't own. While this is unintended, it won't cause any real problems.</li>
<li>But if this means that the operation failed to reach Cloud Storage, and such failures persist, then the lock can become stale even though the operation is still in progress.</li>
</ul>
</li>
</ul>
<p>How should we respond to refresh failures?</p>
<ul>
<li>Upon encountering unexpected state, we should abort the operation immediately.</li>
<li>
<p>Upon encountering network problems, there's a chance that the failure is just temporary. So we should retry a couple of times. Only if retrying fails too many times consecutively do we abort the operation.</p>
<p>I think <strong>retrying 2 times</strong> (so 3 tries in total) is reasonable. In order to abort way before the TTL expires, the refresh interval must be shorter than 1/3rd of the TTL.</p>
</li>
</ul>
<p>When we conclude that we should abort the operation, we declare that the lock is in an <em>unhealthy state</em>.</p>
<p>Aborting should happen in a manner that leaves the system in a consistent state. Furthermore, aborting takes time, so it should be initiated well before the TTL expires; this is another reason why, in the previous section, I recommended a refresh interval of 1/8th of the TTL.</p>
<h2 id="dealing-with-inconsistent-operation-states">Dealing with inconsistent operation states</h2>
<p>Aborting the operation could itself fail, for example because of network problems. This may leave the system in an inconsistent state. There are ways to deal with this issue:</p>
<ul>
<li>
<p>Next time a process takes the lock, detect whether the state is inconsistent, and then deal with it somehow, for example by fixing the inconsistency.</p>
<p>This means that the operation must be written in such a way that inconsistency <em>can</em> be detected and fixed. Fixing arbitrary inconsistencies is quite hard, so you should carefully design the operation's algorithm to limit <em>how</em> inconsistent the state can become.</p>
<p>This is a difficult topic and is outside the scope of this article. But you could take inspiration from how <a href="https://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf">journaling filesystems work</a> to recover the filesystem state after a crash.</p>
</li>
<li>
<p>An easier approach that's sometimes viable is to treat existing state as immutable. Your operation makes a copy of the existing state, performs operations on the copy, then atomically (or at least nearly so) declares the copy to be the new state.</p>
</li>
</ul>
<h2 id="detecting-unexpected-releases-or-ownership-changes">Detecting unexpected releases or ownership changes</h2>
<p>The lock <em>could</em> be released, or its ownership <em>could</em> change, at any time, either because of a faulty process or because of an unexpected administrator operation. While such things <em>shouldn't</em> happen, it's still a good idea to be able to handle them somehow.</p>
<p>When these things happen, we also say that the lock is in an <em>unhealthy state</em>.</p>
<p>We make the following changes to the algorithm:</p>
<ul>
<li>Right after having taken the lock, take note of its generation number.</li>
<li>When refreshing the lock, use the <code>x-goog-if-generation-match: &lt;last known generation number&gt;</code> header.
<ul>
<li>If it succeeds, take note of the new generation number.</li>
<li>If it fails because the object does not exist, then it means the lock was deleted. We abort the operation.</li>
<li>If it fails with a 412 Precondition Failed error, then it means the ownership unexpectedly changed. We abort the operation without releasing the lock.</li>
</ul>
</li>
<li>When releasing the lock, use the <code>x-goog-if-generation-match: &lt;last known generation number&gt;</code> header, so that we're sure we're releasing the lock we owned and not one that was taken over by another process. We can ignore any 412 Precondition Failed errors.</li>
</ul>
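<p>Here's a sketch of the generation-checked refresh logic, using an in-memory stand-in in which every successful refresh yields a new generation number (in real Cloud Storage, the number comes from the API response):</p>

```python
class PreconditionFailed(Exception):
    pass

class NotFound(Exception):
    pass

class FakeBucket:
    """In-memory stand-in that hands out increasing generation numbers."""
    def __init__(self):
        self.generations = {}
        self.counter = 0

    def create(self, name):
        self.counter += 1
        self.generations[name] = self.counter
        return self.counter

    def refresh(self, name, if_generation_match):
        gen = self.generations.get(name)
        if gen is None:
            raise NotFound()
        if gen != if_generation_match:
            raise PreconditionFailed()
        self.counter += 1
        self.generations[name] = self.counter
        return self.counter

class LockHandle:
    def __init__(self, bucket, name):
        self.bucket, self.name = bucket, name
        self.generation = bucket.create(name)   # note the generation right away

    def refresh(self):
        """Returns 'ok', 'deleted' or 'stolen'; on the latter two the caller
        must abort the operation (without releasing the lock if stolen)."""
        try:
            self.generation = self.bucket.refresh(self.name, self.generation)
            return "ok"
        except NotFound:
            return "deleted"    # the lock was unexpectedly released
        except PreconditionFailed:
            return "stolen"     # ownership unexpectedly changed

bucket = FakeBucket()
lock = LockHandle(bucket, "lock")
assert lock.refresh() == "ok"
bucket.create("lock")                   # simulate a takeover by another process
assert lock.refresh() == "stolen"
del bucket.generations["lock"]          # simulate an unexpected delete
assert lock.refresh() == "deleted"
```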
<h2 id="studying-hashicorp-vaults-leader-election-algorithm">Studying HashiCorp Vault's leader election algorithm</h2>
<p><a href="https://www.vaultproject.io/">HashiCorp Vault</a> is a secrets management system. Its <a href="https://www.vaultproject.io/docs/concepts/ha">high availability setup</a> involves leader election. This is done by taking ownership of a distributed lock. The instance that succeeds in taking ownership is considered the leader.</p>
<p>The leader election algorithm is implemented in <a href="https://github.com/hashicorp/vault/blob/cba7abc64e4d1cb20129b534e3b1a255fbc18977/physical/gcs/gcs_ha.go">physical/gcs/gcs_ha.go</a> and was originally written by <a href="https://twitter.com/sethvargo">Seth Vargo</a> at Google. This algorithm was also <a href="https://cloud.google.com/blog/topics/developers-practitioners/implementing-leader-election-google-cloud-storage">discussed</a> by <a href="https://twitter.com/ahmetb">Ahmet Alp Balkan</a> at the Google Cloud blog.</p>
<figure>
<img src="/images/2021/hashicorp_vault-5d3cb5d7.svg" alt="HashiCorp Vault logo" class="img-xx-smallwidth" />
<figcaption><a href="https://www.vaultproject.io/">HashiCorp Vault</a>'s leader election protocol is actually also a distributed lock! We can draw many interesting lessons from it.</figcaption>
</figure>
<p>Here are the similarities between Vault's algorithm and what we've discussed so far:</p>
<ul>
<li>Vault utilizes Cloud Storage's precondition headers to find out whether it was successful in taking a lock.</li>
<li>When Vault fails to take a lock, it also retries later until it succeeds.</li>
<li>Vault detects stale locks via a TTL.</li>
<li>Vault refreshes locks regularly. A Vault instance holds on to the lock for as long as it's willing to be the leader, so we can consider this to be one gigantic long-running operation, making lock refreshing essential.</li>
<li>Vault checks regularly whether the lock was unexpectedly released or changed ownership.</li>
<li>When Vault releases the lock, it also uses a precondition header to ensure it doesn't delete a lock that someone else took ownership of concurrently.</li>
</ul>
<p>Notable differences:</p>
<ol>
<li>Vault checks whether the lock is stale, <em>before</em> trying to create the lock object. Whereas we check for staleness <em>after</em> trying to do so. Checking for staleness afterwards is a more optimistic approach. If the lock is unlikely to be stale, then checking afterwards is faster.</li>
<li>When Vault fails to take the lock, it backs off linearly instead of exponentially.</li>
<li>Instead of checking the generation number, and refreshing the lock by updating its data, Vault operates purely on <a href="https://cloud.google.com/storage/docs/metadata">object <em>metadata</em></a> because it's less costly to read frequently. This means the algorithm checks the <em>metageneration</em> number, and refreshes the lock by updating metadata fields.</li>
<li>Vault stores its unique instance identity name in the lock. This way administrators can easily find out who owns the lock.</li>
<li>Vault's TTL is a runtime configuration parameter. Its value is not stored in the object.</li>
<li>
<p>If Vault's leader election system crashes non-fatally (e.g. it detected an unhealthy lock, aborted, then tried again later from the same Vault instance), and the lock hasn't been taken over by another Vault instance at the same time, then Vault is able to retake the lock instantly.</p>
<p>In contrast, our approach so far requires waiting until the lock becomes stale per the TTL.</p>
</li>
</ol>
<p>I think points 3, 4 and 6 are worth learning from.</p>
<h2 id="instant-recovery-from-stale-locks--thread-safety">Instant recovery from stale locks & thread-safety</h2>
<p>HashiCorp Vault's ability to retake the lock instantly after a non-fatal crash is worthy of further discussion. It's a desirable feature, but what are the implications?</p>
<p>Upon closer inspection, we see that this feature works by assigning an <em>identity</em> to the lock object. This identity is a random string that's generated during Vault startup. When Vault attempts to take a lock, it checks whether the object already exists and whether its identity equals the Vault instance's own identity. If so, then Vault concludes that it's safe to retake the lock immediately.</p>
<p><strong>This identity string must be chosen with some care</strong>, because it determines the level of mutual exclusion. Vault generates a random identity string that's unique on a per-Vault-instance basis. This results in the lock being multi-process safe, but — perhaps counter-intuitively — not thread-safe!</p>
<p>We can make the lock object thread-safe by including the thread ID in the identity as well. The tradeoff is that an abandoned lock can only be quickly recovered by the same thread that abandoned it in the first place. All other threads still have to wait for the TTL timeout.</p>
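<p>A minimal sketch of such an identity scheme (the exact format below is my own illustration, not Vault's):</p>

```python
import os
import secrets
import threading

# Generated once per process at startup, similar in spirit to Vault's
# per-instance identity.
PROCESS_IDENTITY = f"{os.getpid()}-{secrets.token_hex(8)}"

def lock_identity(thread_safe=True):
    """The identity string to store in the lock object. Appending the thread
    ID makes the lock thread-safe, at the cost that only that exact thread
    can instantly recover its own abandoned lock."""
    if thread_safe:
        return f"{PROCESS_IDENTITY}/{threading.get_ident()}"
    return PROCESS_IDENTITY

main_id = lock_identity()
other = []
t = threading.Thread(target=lambda: other.append(lock_identity()))
t.start()
t.join()
assert main_id == lock_identity()   # stable within one thread
assert main_id != other[0]          # differs across threads
```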
<p>In the next section we'll put together everything we've discussed and learned so far.</p>
<h2 id="putting-the-final-algorithm-together">Putting the final algorithm together</h2>
<h3 id="taking-the-lock">Taking the lock</h3>
<p>Parameters:</p>
<ul>
<li>Object URL</li>
<li>TTL</li>
<li>An identity that's unique on a per-process basis, and optionally on a per-thread basis as well
<ul>
<li>Example format: "[process identity]". If thread-safety is desired, append "/[thread identity]".</li>
<li>Interpret the concept "thread" liberally. For example, if your language is single-threaded with cooperative multitasking using coroutines/fibers, then use the coroutine/fiber identity.</li>
</ul>
</li>
</ul>
<p>Steps:</p>
<ol>
<li>Create the object at the given URL.
<ul>
<li>Use the <code>x-goog-if-generation-match: 0</code> header.</li>
<li>Set <code>Cache-Control: no-store</code>.</li>
<li>Set the following metadata values:
<ul>
<li>Expiration timestamp (based on TTL)</li>
<li>Identity</li>
</ul>
</li>
<li>Empty contents.</li>
</ul>
</li>
<li>If creation is successful, then it means we've taken the lock.
<ul>
<li>Start refreshing the lock in the background.</li>
</ul>
</li>
<li>If creation fails with a 412 Precondition Failed error (meaning the object already exists), then:
<ol>
<li>Fetch from the object's metadata:
<ul>
<li>Update timestamp</li>
<li>Metageneration number</li>
<li>Expiration timestamp</li>
<li>Identity</li>
</ul>
</li>
<li>If the fetch in the previous step fails because the object no longer exists, then restart the algorithm from step 1 immediately.</li>
<li>If the identity equals our own, then delete the object, and immediately restart the algorithm from step 1.
<ul>
<li>When deleting, use the <code>x-goog-if-metageneration-match: [metageneration]</code> header.</li>
</ul>
</li>
<li>If the update timestamp is older than the expiration timestamp then delete the object.
<ul>
<li>Use the <code>x-goog-if-metageneration-match: [metageneration]</code> header.</li>
</ul>
</li>
<li>Otherwise, restart the algorithm from step 1 after an exponential backoff (potentially with an upper limit and jitter).</li>
</ol>
</li>
</ol>
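<p>The steps above can be condensed into the following sketch. It runs against an in-memory stand-in and omits the Cache-Control header, the empty contents, and the background refresher, but it shows the stale-lock and retake-own-lock paths:</p>

```python
import time

class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """In-memory stand-in; each lock stores a metageneration plus metadata."""
    def __init__(self):
        self.objects = {}

    def create(self, name, metadata, if_generation_match=None):
        if if_generation_match == 0 and name in self.objects:
            raise PreconditionFailed()
        self.objects[name] = {"metageneration": 1, "metadata": metadata}

    def get(self, name):
        return self.objects.get(name)

    def delete(self, name, if_metageneration_match=None):
        obj = self.objects.get(name)
        if obj is None:
            return
        if (if_metageneration_match is not None
                and obj["metageneration"] != if_metageneration_match):
            raise PreconditionFailed()
        del self.objects[name]

def take_lock(bucket, name, ttl, identity, max_tries=10):
    for attempt in range(max_tries):
        try:
            # Step 1: create with expiration timestamp and identity metadata.
            bucket.create(
                name,
                {"expires_at": time.time() + ttl, "identity": identity},
                if_generation_match=0,
            )
            return True                        # step 2: we've taken the lock
        except PreconditionFailed:
            obj = bucket.get(name)             # step 3.1: fetch the metadata
            if obj is None:
                continue                       # step 3.2: vanished; retry now
            meta = obj["metadata"]
            if meta["identity"] == identity or time.time() > meta["expires_at"]:
                # Steps 3.3/3.4: our own abandoned lock, or a stale one.
                # Delete conditionally, then retry immediately.
                try:
                    bucket.delete(
                        name, if_metageneration_match=obj["metageneration"])
                except PreconditionFailed:
                    pass
                continue
            time.sleep(0.01 * (2 ** attempt))  # step 3.5: back off (abridged)
    return False

bucket = FakeBucket()
assert take_lock(bucket, "lock", ttl=300, identity="proc-a")
assert take_lock(bucket, "lock", ttl=300, identity="proc-a")  # instant retake
assert not take_lock(bucket, "lock", ttl=300, identity="proc-b", max_tries=3)
```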
<h3 id="releasing-the-lock">Releasing the lock</h3>
<p>Parameters:</p>
<ul>
<li>Object URL</li>
<li>Identity</li>
</ul>
<p>Steps:</p>
<ol>
<li>Stop refreshing the lock in the background.</li>
<li>Delete the lock object at the given URL.
<ul>
<li>Use the <code>x-goog-if-metageneration-match: [last known metageneration]</code> header.</li>
<li>Ignore the 412 Precondition Failed error, if any.</li>
</ul>
</li>
</ol>
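<p>A sketch of the release step, showing why the 412 Precondition Failed error can be safely ignored (in-memory stand-in again):</p>

```python
class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """In-memory stand-in: maps an object name to its metageneration number."""
    def __init__(self):
        self.objects = {}

    def delete(self, name, if_metageneration_match=None):
        gen = self.objects.get(name)
        if gen is None:
            return                      # already gone; nothing to do
        if gen != if_metageneration_match:
            raise PreconditionFailed()  # the lock now belongs to someone else
        del self.objects[name]

def release_lock(bucket, name, last_known_metageneration):
    try:
        bucket.delete(name, if_metageneration_match=last_known_metageneration)
    except PreconditionFailed:
        pass    # another process took the lock over: leave their lock alone

bucket = FakeBucket()
bucket.objects["lock"] = 7              # lock taken over: metageneration is 7
release_lock(bucket, "lock", 3)         # our stale handle: delete is refused
assert "lock" in bucket.objects         # the new owner's lock survives
release_lock(bucket, "lock", 7)         # matching metageneration: deleted
assert "lock" not in bucket.objects
```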
<h3 id="refreshing-the-lock">Refreshing the lock</h3>
<p>Parameters:</p>
<ul>
<li>Object URL</li>
<li>TTL</li>
<li>Refresh interval</li>
<li>Max number of times the refresh may fail consecutively</li>
<li>Identity</li>
</ul>
<p>Every <code>refresh_interval</code> seconds (until a lock release is requested, or until an unhealthy state is detected):</p>
<ol>
<li>Update the object metadata (which also updates the update timestamp).
<ul>
<li>Use the <code>x-goog-if-metageneration-match: [last known metageneration]</code> header.</li>
<li>Update the expiration timestamp metadata value, based on the TTL.</li>
</ul>
</li>
<li>If the operation succeeds, check the response, which contains the latest object metadata.
<ol>
<li>Take note of the latest metageneration number.</li>
<li>If the identity does not equal our own, then declare that the lock is unhealthy.</li>
</ol>
</li>
<li>If the operation fails because the object does not exist or because of a 412 Precondition Failed error, then declare that the lock is unhealthy.</li>
<li>If the operation fails for some other reason, then check whether this is the maximum number of times that we may fail consecutively. If so, then declare that the lock is unhealthy.</li>
</ol>
<h3 id="recommended-default-values">Recommended default values</h3>
<ul>
<li>TTL: 5 minutes</li>
<li>Refresh interval: 37 seconds</li>
<li>Max number of times the refresh may fail consecutively: 3</li>
</ul>
<h3 id="lock-usage">Lock usage</h3>
<p>Steps:</p>
<ol>
<li>Take the lock</li>
<li>Try:
<ul>
<li>If applicable:
<ul>
<li>Check whether state is consistent, and fix it if it isn't</li>
<li>Check whether lock is healthy, abort if not</li>
</ul>
</li>
<li>Perform a part of the operation</li>
<li>Check whether lock is healthy, abort if not</li>
<li>…etc…</li>
<li>If applicable: commit the operation's effects as atomically as possible</li>
</ul>
</li>
<li>Finally:
<ul>
<li>Release the lock</li>
</ul>
</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p>Distributed locks are very useful for ad-hoc system/cloud automation scripts and CI/CD pipelines. Or more generally, they're useful in any situation in which multiple systems may operate on the same state concurrently. Concurrent modifications may corrupt the state, so one needs a mechanism to ensure that only one system can modify the state at the same time.</p>
<p>Google Cloud Storage is a good system to build a distributed lock on, as long as you don't care about latency that much. By leveraging Cloud Storage's capabilities, we can build a robust distributed locking algorithm that's not too complex. What's more: it's cheap to operate, cheap to maintain, and can be used from almost anywhere.</p>
<p>The distributed locking algorithm proposed by this article builds upon existing algorithms found in other systems, and makes locking more robust.</p>
<p>Eager to use this algorithm in your next system or pipeline? Check out <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby">the Ruby implementation</a>. In the near future I also plan on releasing implementations in other languages.</p>
Docker and the host filesystem owner matching problem
https://www.joyfulbikeshedding.com/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html
2021-03-15T00:00:00+00:00 (updated 2023-04-22T18:01:13+00:00)
Hongli Lai
<p>Containers are no longer only used on servers. They are increasingly used on the desktop: as CLI apps or as development environments. I call this the <em>"container-as-OS-app"</em> use case. Within this use case, containerized apps often generate files that are not owned by your local machine's user account. Sometimes they can't access files on the host machine at all. This is the <em>host filesystem owner matching problem</em>.</p>
<ul>
<li>This is bad for security. Containers shouldn't run as root in the first place!</li>
<li>This is a potential productivity killer. It's annoying having to deal with wrong file permissions!</li>
</ul>
<p>Solutions are available, but they have major caveats. As a result it's easy to implement a solution that only works for some, but not everyone. "It works on my machine" is kind of embarrassing when you distribute a development environment to a coworker, who then runs into issues.</p>
<p>This post describes what causes the host filesystem owner matching problem, and analyzes various solutions and their caveats.</p>
<p><em><strong>Update</strong>: introducing <a href="/blog/2023-04-20-cure-docker-volume-permission-pains-with-matchhostfsowner.html">MatchHostFsOwner: a cure for the host filesystem owner matching problem</a>!</em></p>
<h2 id="what-is-the-container-as-os-app-use-case">What is the "container-as-OS-app" use case?</h2>
<p>An "OS app" is an app that:</p>
<ul>
<li>Runs on your machine (as opposed to in the browser or on a server).</li>
<li><strong>Reads or writes files from/to the host OS filesystem.</strong> Files which may later be read/written by other (non-Docker-packaged) apps, such as your text editor.</li>
</ul>
<figure>
<a href="/images/2021/host-fs-owner-matching-problem-975a5c4f.png"><img src="/images/2021/host-fs-owner-matching-problem-975a5c4f.png" class="img-largecover" alt="" /></a>
<figcaption>Traditional containerized apps vs container-as-OS-apps, and how the host filesystem matching problem only affects the latter</figcaption>
</figure>
<p>An OS app doesn't have to be graphical in nature. In fact, the kinds of OS apps that are most often containerized are CLIs. Examples of OS apps:</p>
<ul>
<li>bash</li>
<li>ls</li>
<li>Git</li>
<li>The C/Go/Rust compiler</li>
<li>Your text editor</li>
</ul>
<p>Increasingly, Docker is used to package such apps. Here are a few examples:</p>
<ul>
<li><a href="https://github.com/emk/rust-musl-builder">rust-musl-builder</a> — compilation environment for Rust that allows generating statically-linked binaries.</li>
<li><a href="http://phusion.github.io/holy-build-box/">Holy Build Box</a> — compilation environment for C/C++ that allows generating portable Linux binaries that run on any Linux distribution.</li>
</ul>
<p>Both of these examples read or write files from/to the host OS filesystem.</p>
<p>Perhaps a little counter-intuitively, <strong>many development environments also fall under this category</strong>. Let's say that you set up a development environment for your Ruby, Node.js or Go app using Docker Compose. Here's what such a Docker Compose environment often does:</p>
<ol>
<li>It mounts the project directory (on the host filesystem) into the container.</li>
<li>(In case of compiled languages:) Inside the container, it compiles the source code located in the project directory. The compilation products, or cache files, are stored under the project directory.</li>
<li>Inside the container, it launches the app, which runs until the user aborts it.</li>
<li>(For frameworks/languages where this is applicable:) If the source code on the host OS changes, then the app inside the container live-reloads the new code.</li>
<li>The app inside the container writes to log files, located under the project directory.</li>
</ol>
<p>This problem has been <a href="https://blog.gougousis.net/file-permissions-the-painful-side-of-docker/">documented before</a> by <a href="https://mydeveloperplanet.com/2022/10/19/docker-files-and-volumes-permission-denied/">other</a> <a href="https://www.reddit.com/r/docker/comments/hjsipd/permission_denied_with_volumes/">authors</a> as well.</p>
<p>Key takeaway: development environments often read or write files from/to the host OS filesystem. Files which may be read/written by other apps later.</p>
<h2 id="mismatching-filesystem-owners">Mismatching filesystem owners</h2>
<p>Many containers run apps as root. When they write to files on the host filesystem, they <em>create root-owned files on your host filesystem</em>. You can't modify these files with your host text editor without jumping through some hoops.</p>
<figure>
<a href="/images/2021/permission-denied-9704ef2c.png"><img src="/images/2021/permission-denied-9704ef2c.png" alt="" /></a>
<figcaption>Many containers run as root, creating root-owned files on the host OS's filesystem. These files cannot be accessed by normal apps on the host OS because of permission problems.</figcaption>
</figure>
<p>For example: on a Linux machine (not on macOS; see below), let's run a root container which writes a file on the host:</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">--rm</span> <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">:/host"</span> busybox <span class="nb">touch</span> /host/foo.txt
</code></pre></div>
<p>This file is owned by root, and we can't modify it:</p>
<div class="highlight"><pre class="highlight plaintext"><code>$ ls -l foo.txt
-rw-r--r-- 1 root root 0 Jan 17 10:36 foo.txt
$ echo hi > foo.txt
-bash: foo.txt: Permission denied
</code></pre></div>
<p>Some containers adhere to better security practices, and run under a normal user account. However, this creates a new problem: <em>they can't write to the host filesystem anymore</em>! This is because the host directory is only writable by the user who owns it, not by the user inside the container.</p>
<p>Here's an example container that runs under a normal user account instead of root:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> debian:10</span>
<span class="k">RUN </span>addgroup <span class="nt">--gid</span> 1234 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 1234 <span class="nt">--gid</span> 1234 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
<span class="k">USER</span><span class="s"> app</span>
</code></pre></div>
<p>We then build and run it, telling it to create a file on the host:</p>
<div class="highlight"><pre class="highlight plaintext"><code>$ docker build . -t usercontainer
$ docker run --rm -v "$(pwd):/host" usercontainer touch /host/foo.txt
touch: cannot touch `/host/foo.txt': Permission denied
</code></pre></div>
<h3 id="only-on-linux-not-on-macos">Only on Linux, not on macOS</h3>
<p>These problems are <strong>only applicable when using Docker on Linux</strong>. macOS users don't experience these problems at all, because Docker for Mac actually runs a Linux VM, and inside that VM it mounts host filesystems into the container as a network volume. It ensures that:</p>
<ul>
<li>Inside the container, all mounted files look as if they're owned by the container user.</li>
<li>On the host, all files written by the container become owned by the host user.</li>
</ul>
<p>But the fact that macOS users don't get this problem, is in itself a problem. It means that when someone creates a container-as-OS-app on macOS, and hands it over to a Linux user, then that app may not work because of the permission problems described above.</p>
<h2 id="solution-strategies-overview">Solution strategies overview</h2>
<p>There are two major strategies to solve the host filesystem owner matching problem:</p>
<ol>
<li>Matching the container's UID/GID with the host's UID/GID.</li>
<li>Remounting the host path in the container using BindFS.</li>
</ol>
<p>Each strategy has significant caveats. Let's take a look at how each strategy is implemented, and what the caveats are.</p>
<h2 id="strategy-1-matching-the-containers-uidgid-with-the-hosts">Strategy 1: matching the container's UID/GID with the host's</h2>
<p>The kernel distinguishes users and groups by two numbers: the user ID (UID) and the group ID (GID). Accounts (usernames and group names) are implemented outside the kernel, via user/group account databases that map UIDs and GIDs to usernames and group names. These databases exist in /etc/passwd and /etc/group.</p>
<p>The kernel doesn't care about names, only about UIDs and GIDs. Even files on the filesystem are not owned by <em>usernames</em> or <em>group names</em>, but by UIDs and GIDs.</p>
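<p>You can see this numeric ownership for yourself on any Linux shell, no container required. <code>ls -ln</code> prints the raw UID/GID that the kernel stores, while <code>ls -l</code> resolves those numbers through the account database (a quick sketch; <code>stat -c</code> is the GNU coreutils variant):</p>

```shell
# Create a file and inspect its ownership both ways.
tmpdir="$(mktemp -d)"
touch "$tmpdir/demo.txt"
ls -l "$tmpdir/demo.txt"   # owner/group shown as resolved names
ls -ln "$tmpdir/demo.txt"  # owner/group shown as the raw UID/GID
# stat prints just the numbers; for a file you created, they equal your own UID/GID:
owner="$(stat -c '%u:%g' "$tmpdir/demo.txt")"
echo "demo.txt is owned by UID:GID $owner"
```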
<p>So if we run an app in a container using the same UID/GID as the host account's UID/GID, then the files created by that app will be owned by the host's user and group.</p>
<ul>
<li>If the container's user/group account database already has accounts with that UID/GID, but with a different name, then that's no problem.</li>
<li>If the container's account database doesn't have accounts with that UID/GID, then that's <em>also</em> no problem.</li>
</ul>
<p>The simplest way to do this is by running <code>docker run --user &lt;HOST UID&gt;:&lt;HOST GID&gt;</code>. This works even if the container has no accounts with this UID/GID.</p>
<p>However, if there are no matching accounts in the container, then many applications won't behave well. This can range from cosmetic problems to crashes. A lot of library code assumes that the username can be queried, and aborts on failure. Another problem is that the lack of accounts means that there's no corresponding home directory, while a lot of application and library code assumes that it can read from or write to the home directory.</p>
<p>So a better way would be to create accounts inside the container with a UID/GID that matches the host's UID/GID. These accounts could have <em>any</em> names: the kernel doesn't care.</p>
<p>Let's go through a practical example to learn how container UIDs/GIDs and accounts work, and how to implement this strategy.</p>
<h3 id="example-creating-a-container-account-with-the-same-uidgid-as-the-host-account">Example: creating a container account with the same UID/GID as the host account</h3>
<p>Here's an example which shows how UIDs and GIDs work. This example must be run on Linux (because you won't run into the host filesystem owner matching problem on macOS). Let's start by figuring out what the host user's UID/GID is by running this command:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ id
uid=1000(hongli) gid=1000(hongli) groups=1000(hongli),27(sudo),999(docker)
</code></pre></div>
<p>My UID and GID are both 1000.</p>
<p>Now let's start an interactive Debian shell session. We mount the host's current working directory into the container, under /host.</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">-ti</span> <span class="nt">--rm</span> <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">:/host"</span> debian:10
</code></pre></div>
<p>Inside the Debian container's root shell, let's create two things:</p>
<ul>
<li>A group called <code>matchinguser</code>, with GID 1000.</li>
<li>A user account called <code>matchinguser</code> with UID 1000. We disable the password because it's not relevant in this example.</li>
</ul>
<div class="highlight"><pre class="highlight shell"><code>addgroup <span class="nt">--gid</span> 1000 matchinguser
adduser <span class="nt">--uid</span> 1000 <span class="nt">--gid</span> 1000 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> matchinguser
</code></pre></div>
<p>Let's use this user account to create a file in the host directory:</p>
<div class="highlight"><pre class="highlight shell"><code>apt update
apt <span class="nb">install</span> <span class="nt">-y</span> <span class="nb">sudo
sudo</span> <span class="nt">-u</span> matchinguser <span class="nt">-H</span> <span class="nb">touch</span> /host/foo2.txt
</code></pre></div>
<p>If we inspect the file permissions of /host/foo2.txt from inside the container, then we see that it's owned by <code>matchinguser</code>:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -l /host/foo2.txt
-rw-r--r-- 1 matchinguser matchinguser 0 Mar 15 09:45 /host/foo2.txt
</code></pre></div>
<p>But if we inspect the same file from the host, then we see that it's owned by the host user:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ ls -l foo2.txt
-rw-r--r-- 1 hongli hongli 0 Mar 15 09:45 foo2.txt
</code></pre></div>
<p>This is because the file has the UID and GID 1000, which in the container maps to <code>matchinguser</code>, but on the host maps to <code>hongli</code>.</p>
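<p>The names shown by <code>ls</code> are nothing more than lookups in the account database of whichever system is doing the displaying. You can perform the same lookup manually with <code>getent</code>, which queries /etc/passwd (this is a generic illustration, not specific to containers):</p>

```shell
# Print the account database entry for a given UID, if one exists.
getent passwd 0   # UID 0 maps to root on virtually every system
# Your own UID normally maps to your own account; if no entry exists,
# getent prints nothing and exits non-zero.
getent passwd "$(id -u)" || echo "(no account for UID $(id -u))"
```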
<h3 id="example-modifying-existing-container-accounts-uid">Example: modifying existing container account's UID</h3>
<p>You don't even have to create new container accounts. You can actually modify the UID/GID of existing accounts.</p>
<p>For example, let's delete <code>matchinguser</code>, and recreate it with UID/GID 1500:</p>
<div class="highlight"><pre class="highlight shell"><code>apt <span class="nb">install</span> <span class="nt">-y</span> perl <span class="c"># needed by deluser on Debian</span>
deluser <span class="nt">--remove-home</span> matchinguser
addgroup <span class="nt">--gid</span> 1500 matchinguser
adduser <span class="nt">--uid</span> 1500 <span class="nt">--gid</span> 1500 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> matchinguser
</code></pre></div>
<p>We can then use <code>usermod</code> and <code>groupmod</code> to change those accounts' UID/GID to 1000:</p>
<div class="highlight"><pre class="highlight shell"><code>groupmod <span class="nt">--gid</span> 1000 matchinguser
usermod <span class="nt">--uid</span> 1000 matchinguser
</code></pre></div>
<h3 id="implementation-and-caveats">Implementation and caveats</h3>
<p>Here's a simple implementation strategy. If your container doesn't need precreated accounts, then you can do it as follows:</p>
<ul>
<li>Add an entrypoint script which creates a user/group account, whose UID/GID equal the host account's UID/GID.</li>
<li>The entrypoint script requires two environment variables, <code>HOST_UID</code> and <code>HOST_GID</code>, which specify what the host account's UID and GID are.</li>
<li>The entrypoint then executes the next container command, under the newly created user/group accounts.</li>
<li>Users must run the container with root privileges, with the environment variables <code>HOST_UID</code> and <code>HOST_GID</code>. The container is responsible for dropping privileges.</li>
</ul>
<p>If your container requires a precreated account, then you need to modify the strategy a little bit:</p>
<ul>
<li>Instead of creating a new account, the entrypoint script modifies the UID/GID of the precreated user account, to the host account's UID/GID.</li>
</ul>
<p>Here's an example of a naive entrypoint script. The container account that we want to use is called <code>app</code>.</p>
<div class="highlight"><pre class="highlight shell"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_UID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi
if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_GID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi</span>
<span class="c"># Use this code if you want to create a new user account:</span>
addgroup <span class="nt">--gid</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> app
adduser <span class="nt">--uid</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="nt">--gid</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
<span class="c"># -OR-</span>
<span class="c"># Use this code if you want to modify an existing user account:</span>
groupmod <span class="nt">--gid</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> app
usermod <span class="nt">--uid</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> app
<span class="c"># Drop privileges and execute next container command, or 'bash' if not specified.</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nv">$# </span><span class="nt">-gt</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">else
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> bash
<span class="k">fi</span>
</code></pre></div>
<p>The above entrypoint script is a good attempt, but fails to consider these significant caveats:</p>
<ol>
<li>
<p><strong>What if there's already another container user/group, with the same UID/GID as the host UID/GID?</strong></p>
<p>Then it's not possible to create a new user account/group with the host UID/GID.</p>
<p>One way to deal with this is by deleting the conflicting container user/group. However, depending on which account exactly is deleted (and what that account is used for inside the container), this could degrade the behavior of the container in unpredictable ways.</p>
<p>As a general rule of thumb, accounts with UID &lt; 1000, and groups with GID &lt; 1000, are considered system accounts and groups. System accounts/groups are managed by the OS maintainers, and are not supposed to be messed with by users of the OS.</p>
<p>In contrast, accounts/groups with UID/GID >= 1000 are "normal accounts"/"normal groups", not managed by the OS maintainers. Users of the OS are free to do whatever they like with those accounts. But here you have to ask yourself: who, in this context, are "users of the OS"? If it's only yourself, and you have full control over which normal accounts go into your container: then there's no problem. But if you're using a base image supplied by someone else, and the base image already comes with precreated normal accounts, then you have to ask yourself whether it's safe to modify them.</p>
</li>
<li>
<p><strong>What if the host user is root?</strong></p>
<p>The host user being root (with UID 0) is a special case that you need to deal with. It's not a good idea to delete the existing root account in the container and replace it with another account. So if the entrypoint script detects that the host UID is 0, then it should run the next command as root.</p>
<p>But on weird systems, the host's root user could have a non-zero GID! So if the entrypoint script detects that the host UID is 0 but the host GID is non-zero, then it should modify the root group's GID. This in turn could run into the problem described by (1): what if there's already another group with the same GID?</p>
</li>
<li>
<p><strong>In case of precreated accounts: what about the files they own?</strong></p>
<p>If your container makes use of a precreated account, then after you modify that account's UID and GID, you should ask yourself what you should do about files that were owned by that account. Should those files' UID/GID be updated to match the new UID/GID?</p>
<p>The Debian version of <code>usermod --uid</code> automatically updates the UIDs of all files in that account's home directory (recursively). However, <code>groupmod</code> does not update the GIDs, so you need to do that yourself from your entrypoint script.</p>
<p><code>usermod --uid</code> does not update the UIDs of files outside that account's home directory. It's up to your entrypoint script to update those files, if any.</p>
<p>Furthermore, you should ask yourself whether it's a good idea to update the UIDs of those files. If those files are world-readable, and your container never writes to them, then updating their UIDs/GIDs is unnecessary. If there are <em>many</em> files, then updating their UIDs/GIDs can take a significant amount of time. I ran into this very problem when using <a href="https://github.com/emk/rust-musl-builder">rust-musl-builder</a>. Rust was installed via <code>rustup</code> into the home directory, and updating the UIDs/GIDs of <code>~/.rustup</code> took a lot of time.</p>
<p>Perhaps it's only necessary to update the UIDs/GIDs of specific files. For example, only the files immediately in the home directory, not recursively. This must be judged on a per-container basis.</p>
<p>Finally, some Linux kernel versions have bugs in OverlayFS. Updating the UIDs/GIDs of existing files doesn't always work. This can be worked around by making a copy of those files, removing the original files, and renaming the copies to their original names.</p>
</li>
<li>
<p><strong>Requires root privileges</strong></p>
<p>The simple example entrypoint script is responsible for creating and modifying accounts, which requires root privileges. It's also responsible for dropping privileges to a normal account. However, this means that we can't use the <code>USER</code> instruction in the Dockerfile. Furthermore, users can't run the container with the <code>--user</code> flag, which is counter-intuitive and may make some users wary about the container's security.</p>
<p>One solution is to make the entrypoint program a <em>setuid root executable</em>. This means turning on the "setuid filesystem bit" on the entrypoint program, so that when the entrypoint program runs, it gains root privileges, even if the program was started by a non-root user.</p>
<p>The setuid bit is only used by a few select programs that are involved in privilege escalation. For example, Sudo uses the setuid bit. As you can imagine, the setuid bit is very dangerous: without sufficient care, anyone could gain root privileges without authentication. A setuid root program must be specifically written to make abuse impossible.</p>
<p>Another complication is that the setuid root bit does not work on shell scripts, only on "real" executables! So if you want to make use of this bit, you'll have to write the entrypoint program in a language that compiles to native executables, like C, C++, Rust or Go.</p>
<p>Under what conditions is it safe to run a setuid root entrypoint program? One answer is: if the entrypoint's PID is 1. This means it's the very first program run in the container. This indicates that the entrypoint program is run directly by <code>docker run</code>, so we can assume that it's a safe-ish environment.</p>
<p>But checking for PID 1 doesn't work in combination with <code>docker run --init</code>, which spawns an init process (whose job is to solve the <a href="https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/">PID 1 zombie reaping problem</a>). The init process can perform arbitrary work, and execute arbitrary processes before it executes our entrypoint program. So we can't assume that our PID is 2 either. Instead, we can check whether we're a direct child of the init process, because after the init process executes the next command, it won't execute any further commands.</p>
</li>
<li>
<p><strong>Requires extra environment variables</strong></p>
<p>In the ideal world, we want users to be able to run our container with <code>docker run --user HOST_UID:HOST_GID</code>, and have the container's entrypoint automatically figure out that the values passed to <code>--user</code> are the host UID/GID.</p>
<p>But our example entrypoint script requires the user to specify that information through environment variables. So users have to pass redundant parameters, like this:</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">HOST_UID</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">HOST_GID</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">--user</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">:</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> <span class="se">\</span>
...
</code></pre></div>
<p>This is not a good user experience.</p>
</li>
</ol>
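<p>The GID fix-up from caveat 3 (which <code>groupmod</code> doesn't do for you) can be sketched as a small shell function. The function name and arguments are my own invention; treat it as a starting point rather than a complete solution:</p>

```shell
# Sketch: re-own files that still carry a group's old GID after groupmod.
# Usage: fix_group_ownership <dir> <old_gid> <new_gid>
fix_group_ownership() {
  local dir="$1" old_gid="$2" new_gid="$3"
  # find's -group also accepts a numeric GID; chgrp -h avoids following symlinks.
  find "$dir" -group "$old_gid" -exec chgrp -h "$new_gid" '{}' +
}
```

<p>Running this over a large directory tree has the same performance cost as the recursive <code>usermod --uid</code> behavior described above, so apply it selectively.</p>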
<p>With the above caveats, the entrypoint script is no longer trivial. If you want to solve caveat 4, then the entrypoint can't even be a shell script anymore.</p>
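<p>Caveat 5 can be softened somewhat: when users start the container with <code>docker run --user</code>, the entrypoint can read its own UID/GID directly instead of demanding environment variables. A sketch of that idea (my own simplification; a real entrypoint still needs the privilege handling from caveat 4):</p>

```shell
# Prefer explicitly passed values; otherwise fall back to the UID/GID
# that this process was started with (as set by `docker run --user`).
HOST_UID="${HOST_UID:-$(id -u)}"
HOST_GID="${HOST_GID:-$(id -g)}"
echo "Matching host account ${HOST_UID}:${HOST_GID}"
```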
<h2 id="strategy-2-remounting-the-host-path-in-the-container-using-bindfs">Strategy 2: remounting the host path in the container using BindFS</h2>
<p><a href="https://bindfs.org/">BindFS</a> is a <a href="http://en.wikipedia.org/wiki/Filesystem_in_Userspace">FUSE</a> filesystem that allows us to mount a directory in another path, with different filesystem permissions. BindFS doesn't change the original filesystem permissions: it just exposes an alternative view that looks as if all the permissions are different.</p>
<p>So a container can use BindFS to create an alternative view of the host directory. In this alternative view, everything is owned by a normal account in the container (whose UID/GID doesn't have to match the host's). When the container uses that account to write to the alternative view, then the created files are still owned by the original directory's owner.</p>
<p>Thus, BindFS allows two-way mapping between the host's UID/GID and the container's UID/GID, in a way that's transparent to applications.</p>
<figure>
<a href="/images/2021/bindfs-c1f3b36c.png"><img src="/images/2021/bindfs-c1f3b36c.png" alt="" /></a>
<figcaption>BindFS provides an alternative view of an existing mount. This alternative view can have any permissions, specified by mount options.</figcaption>
</figure>
<h3 id="bindfs-in-action">BindFS in action</h3>
<p>Let's take a look at how BindFS works. Remember: this example must be run on Linux, because the host filesystem owner matching problem does not appear on macOS.</p>
<p>First, let's figure out what the host user's UID/GID is:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ id
uid=1000(hongli) gid=1000(hongli) groups=1000(hongli),27(sudo),999(docker)
</code></pre></div>
<p>Next, run a Debian 10 container that mounts the current working directory into <code>/host</code> in the container. Be sure to pass <code>--privileged</code> so that FUSE works.</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">-ti</span> <span class="nt">--rm</span> <span class="nt">--privileged</span> <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">:/host"</span> debian:10
</code></pre></div>
<p>Once you're in the container, install BindFS:</p>
<div class="highlight"><pre class="highlight shell"><code>apt update
apt <span class="nb">install</span> <span class="nt">-y</span> bindfs
</code></pre></div>
<p>Next, create a user account in the container to play with:</p>
<div class="highlight"><pre class="highlight shell"><code>addgroup <span class="nt">--gid</span> 1234 app
adduser <span class="nt">--uid</span> 1234 <span class="nt">--gid</span> 1234 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
</code></pre></div>
<p>Let's use BindFS to mount <code>/host</code> to <code>/host.writable-by-app</code>.</p>
<div class="highlight"><pre class="highlight shell"><code><span class="nb">mkdir</span> /host.writable-by-app
bindfs <span class="nt">--force-user</span><span class="o">=</span>app <span class="nt">--force-group</span><span class="o">=</span>app <span class="nt">--create-for-user</span><span class="o">=</span>1000 <span class="nt">--create-for-group</span><span class="o">=</span>1000 <span class="nt">--chown-ignore</span> <span class="nt">--chgrp-ignore</span> /host /host.writable-by-app
</code></pre></div>
<p>Here's what the flags mean:</p>
<ul>
<li><code>--force-user=app</code> and <code>--force-group=app</code> mean: make everything in /host look as if they're owned by the user/group named <code>app</code>.</li>
<li><code>--create-for-user=1000</code> and <code>--create-for-group=1000</code> mean: when a new file is created, make it owned by UID/GID 1000 (the host's UID/GID).</li>
<li><code>--chown-ignore</code> and <code>--chgrp-ignore</code> mean: ignore requests to change a file's owner/group. Because we want all files to be owned by the host's UID/GID.</li>
</ul>
<p>When you look at the permissions of the two directories, you see that one is owned by the host's UID/GID, and the other by <code>app</code>:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -ld /host /host.writable-by-app
drwxr-xr-x 18 1000 1000 4096 Mar 15 10:10 /host
drwxr-xr-x 18 app app 4096 Mar 15 10:10 /host.writable-by-app
</code></pre></div>
<p>Let's see what happens if we use the <code>app</code> account to create a file in both directories. First, install sudo:</p>
<div class="highlight"><pre class="highlight shell"><code>apt <span class="nb">install</span> <span class="nt">-y</span> <span class="nb">sudo</span>
</code></pre></div>
<p>Then:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# sudo -u app -H touch /host/foo3.txt
touch: cannot touch '/host/foo3.txt': Permission denied
root@container:/# sudo -u app -H touch /host.writable-by-app/foo3.txt
</code></pre></div>
<p>Creating a file in /host doesn't work: <code>app</code> doesn't have permissions. But creating a file in /host.writable-by-app <em>does</em> work.</p>
<p>If you look at the file in /host.writable-by-app, then you see that it's owned by <code>app</code>:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -l /host.writable-by-app/foo3.txt
-rw-r--r-- 1 app app 0 Mar 16 11:06 /host.writable-by-app/foo3.txt
</code></pre></div>
<p>But if you look at the file in /host, then you see that it's owned by the host's UID/GID:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -l /host/foo3.txt
-rw-r--r-- 1 1000 1000 0 Mar 16 11:06 /host/foo3.txt
</code></pre></div>
<p>This is corroborated by the host. If you exit the container and look at foo3.txt, then you see that it's owned by the host's user:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ ls -l foo3.txt
-rw-r--r-- 1 hongli hongli 0 Mar 16 12:06 foo3.txt
</code></pre></div>
<h3 id="implementation">Implementation</h3>
<p>A container that wishes to use the BindFS strategy should have the necessary tools installed, and should include a precreated normal user account. For example:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> debian:10</span>
<span class="k">ADD</span><span class="s"> entrypoint.sh /</span>
<span class="k">RUN </span>apt update <span class="o">&&</span> <span class="se">\
</span> apt <span class="nb">install</span> <span class="nt">-y</span> bindfs <span class="nb">sudo</span> <span class="o">&&</span> <span class="se">\
</span> addgroup <span class="nt">--gid</span> 1234 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 1234 <span class="nt">--gid</span> 1234 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
<span class="k">ENTRYPOINT</span><span class="s"> ["/entrypoint.sh"]</span>
</code></pre></div>
<p>Then:</p>
<div class="highlight"><pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">-t</span> bindfstest
</code></pre></div>
<p>The entrypoint script could be as follows. In this example, the entrypoint script assumes that the container is started with <code>/host</code> being mounted to a host directory.</p>
<div class="highlight"><pre class="highlight shell"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_UID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi
if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_GID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi
</span><span class="nb">mkdir</span> /host.writable-by-app
bindfs <span class="nt">--force-user</span><span class="o">=</span>app <span class="nt">--force-group</span><span class="o">=</span>app <span class="se">\</span>
<span class="nt">--create-for-user</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="nt">--create-for-group</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">--chown-ignore</span> <span class="nt">--chgrp-ignore</span> <span class="se">\</span>
/host /host.writable-by-app
<span class="c"># Drop privileges and execute next container command, or 'bash' if not specified.</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nv">$# </span><span class="nt">-gt</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">else
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> bash
<span class="k">fi</span>
</code></pre></div>
<p>The container is then run as follows:</p>
<div class="highlight"><pre class="highlight plaintext"><code>docker run -ti --rm --privileged \
-v "/some-host-path:/host" \
-e "HOST_UID=$(id -u)" \
-e "HOST_GID=$(id -g)" \
bindfstest
</code></pre></div>
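<p>To spare users from typing this UID/GID plumbing every time, you could ship a small wrapper script next to the image. The function and image names below are my own; adapt them to your setup:</p>

```shell
# Hypothetical wrapper around the `docker run` invocation above: it fills
# in the calling user's UID/GID so the entrypoint can set up BindFS.
run_bindfstest() {
  docker run -ti --rm --privileged \
    -v "/some-host-path:/host" \
    -e "HOST_UID=$(id -u)" \
    -e "HOST_GID=$(id -g)" \
    bindfstest "$@"
}
```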
<h3 id="caveats">Caveats</h3>
<p>BindFS works very well. But there are two caveats:</p>
<ul>
<li>It requires privileged mode, because FUSE requires it. This might be a security concern.</li>
<li>The container cannot be started as non-root, although it's possible to work around this problem with a setuid root entrypoint program, as described in strategy 1, caveat 4.</li>
</ul>
<p>Some Internet sources say that <code>--privileged</code> can be replaced with <code>--device /dev/fuse --cap-add SYS_ADMIN</code>. However:</p>
<ul>
<li>The <code>SYS_ADMIN</code> capability is not much better than <code>--privileged</code> from a security perspective.</li>
<li>This trick doesn't work on Docker for Mac. It results in an error.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>There are two major strategies to solve the host filesystem owner matching problem:</p>
<ol>
<li>Matching the container's UID/GID with the host's UID/GID.</li>
<li>Remounting the host path in the container using BindFS.</li>
</ol>
<p>Both strategies have their own benefits and drawbacks.</p>
<ul>
<li>Using BindFS is easy to implement by yourself, but requires starting the container with root privileges, and in privileged mode.</li>
<li>Running the container in a matching UID/GID does not require privileged mode. It also allows the container to run without root privileges. But it is hard to implement if you want to address all caveats.</li>
</ul>
<p>BindFS's caveats can't be solved. But the caveats related to "matching the container UID/GID with the host's" <em>can</em> be solved, even if it takes quite a lot of engineering.</p>
<p>Armed with the knowledge provided by this article, you'll be able to build a solution yourself. But wouldn't it be nice if you could use a solution already made by someone else, especially one that uses strategy 1, which is hard to implement? I wrote exactly such a tool. Check out <a href="/blog/2023-04-20-cure-docker-volume-permission-pains-with-matchhostfsowner.html">MatchHostFsOwner: a cure for the host filesystem owner matching problem</a>!</p>
<p><small><i>The Docker icon used in this article's illustrations is made by <a href="https://www.iconfinder.com/icons/4373190/docker_logo_logos_icon">Flatart</a>.</i></small></p>
Traveling Ruby 20210206: maintenance update featuring Ruby 2.4
https://www.joyfulbikeshedding.com/blog/2021-02-06-traveling-ruby-20210206-released.html
2021-02-06T00:00:00+00:00 (updated 2023-04-22T18:01:13+00:00)
Hongli Lai<p><a href="http://phusion.github.io/traveling-ruby">Traveling Ruby</a> allows you to easily ship Ruby apps to end users. It lets you create self-contained Ruby app packages that run on multiple versions of Windows, Linux and macOS.</p>
<p>Today I’ve released version 20210206. This release supports Ruby 2.4, bumps all the gem versions, bumps the minimum supported macOS and Linux versions, and fixes some bugs.</p>
<p>It has been a <em>long</em> time since the last release. So this post also addresses an elephant in the room: is Traveling Ruby back?</p>
<blockquote>
<p><a href="http://phusion.github.io/traveling-ruby">Traveling Ruby</a> allows you to easily ship Ruby apps to end users. It lets you create self-contained Ruby app packages that run on multiple versions of Windows, Linux and macOS.</p>
</blockquote>
<p>Today I've released <a href="http://phusion.github.io/traveling-ruby">Traveling Ruby</a> version 20210206. This release supports Ruby 2.4, bumps all the gem versions, bumps the minimum supported macOS and Linux versions, and fixes some bugs. You can find the exact changelog below.</p>
<h2 id="the-elephant-in-the-room">The elephant in the room</h2>
<p>A more interesting question that the community will probably ask is: is Traveling Ruby back? After all, it has been a <em>long</em> time since the last release.</p>
<p>The answer is no. I <a href="/blog/2021-01-06-the-future-of-traveling-ruby.html">blogged earlier about why Traveling Ruby stopped being maintained</a>, and what a potential way forward would look like. Reviving Traveling Ruby is an effort that takes much more energy than just this release, and right now I do not have the resources to push such an effort.</p>
<p>So this release is meant to be a quick, conservative maintenance release. It was supposed to contain the minimal set of changes needed to make Traveling Ruby releasable again on modern Linux and macOS systems, though these changes ended up being <a href="https://github.com/phusion/traveling-ruby/commits/rel-20210206">pretty extensive</a>.</p>
<p>The previous release was based on Ruby 2.2. Because of the conservative nature of this latest release, I upgraded to the oldest Ruby version (newer than 2.2) that is compilable on modern Linux and macOS systems. And that's Ruby 2.4.</p>
<p>This choice has a bunch of downsides. Besides missing the latest Ruby features, not all gems are compatible with Ruby 2.4, so I didn't upgrade the gems to their very latest versions. This has security implications. For example, we ship nokogiri 1.10, but this version <a href="https://github.com/phusion/traveling-ruby/pull/108">has a vulnerability</a> that's fixed in 1.11. Unfortunately, 1.11 requires Ruby 2.5.</p>
<p>This release is mainly meant for <a href="https://github.com/phusion/traveling-ruby/pull/94#issuecomment-754371791">existing Traveling Ruby users</a>, to address their most urgent needs. But more effort is needed to <em>really</em> bring Traveling Ruby to a good state.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>On Linux, dropped support for x86. Only x86_64 is now supported.</li>
<li>On Windows, dropped support for x86. Only x64 is now supported.</li>
<li>The minimum supported macOS version is now 10.14 Mojave.</li>
<li>The minimum supported Linux version is now RHEL 7 / CentOS 7 / Debian 8 / Ubuntu 14.04 / glibc 2.17.</li>
<li>Fixed support for paths containing spaces. Contributed by Ville Immonen (@fson) in <a href="https://github.com/phusion/traveling-ruby/pull/94">PR #94</a>. Closes <a href="https://github.com/phusion/traveling-ruby/issues/38">issue #38</a>.</li>
<li>Upgraded CA certificates from that of CentOS 5 to that of CentOS 8.</li>
<li>Upgraded OpenSSL to 1.1.1i.</li>
<li>Upgraded GMP to 6.2.1.</li>
<li>Upgraded libssh2 to 1.9.0.</li>
<li>Upgraded bundler gem to version 1.17.3.</li>
<li>Upgraded bcrypt gem to 3.1.16.</li>
<li>Upgraded charlock_holmes gem to 0.7.7.</li>
<li>Upgraded curses gem to 1.4.0.</li>
<li>Upgraded escape_utils gem to 1.2.1.</li>
<li>Upgraded fast-stemmer gem to 1.0.2.</li>
<li>Upgraded ffi gem to 1.14.2.</li>
<li>Upgraded hitimes gem to 2.0.0.</li>
<li>Upgraded json gem to 2.5.1.</li>
<li>Upgraded kgio gem to 2.11.3.</li>
<li>Upgraded mysql2 gem to 0.5.3.</li>
<li>Upgraded nokogiri gem to 1.10.10.
<ul>
<li>On macOS: upgraded libxml2 to 2.9.10.</li>
<li>On macOS: upgraded libxslt to 1.1.34.</li>
</ul>
</li>
<li>Upgraded nokogumbo gem to 1.5.0.</li>
<li>Upgraded pg gem to 1.2.3.
<ul>
<li>Upgraded libpq to 13.1.</li>
</ul>
</li>
<li>Upgraded posix-spawn gem to 0.3.15.</li>
<li>Upgraded puma gem to 5.1.1.</li>
<li>Upgraded raindrops gem to 0.19.1.</li>
<li>Upgraded redcarpet gem to 3.5.1.</li>
<li>Upgraded RedCloth gem to 4.3.2.</li>
<li>Upgraded rugged gem to 1.1.0.</li>
<li>Upgraded sqlite3 gem to 1.4.2.
<ul>
<li>Upgraded libsqlite3 to 2020-3340000.</li>
</ul>
</li>
<li>Upgraded thin gem to 1.8.0.</li>
<li>Upgraded unf_ext gem to 0.0.7.7.</li>
<li>Upgraded unicorn gem to 5.8.0.</li>
<li>Upgraded yajl-ruby gem to 1.4.1.</li>
<li>Dropped github-markdown gem.</li>
</ul>