Hongli LaiOn Coding, Startups & Lifehttps://www.joyfulbikeshedding.com/blog2023-04-20T00:00:00+00:00Hongli LaiCure Docker volume permission pains with MatchHostFsOwnerhttps://www.joyfulbikeshedding.com/blog/2023-04-20-cure-docker-volume-permission-pains-with-matchhostfsowner.html2023-04-20T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>Run a container with a host directory mount, and it either leaves root-owned files behind or it runs into "permission denied" errors. Welcome to the dreadful <a href="/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html">container host filesystem owner matching problem</a>. These issues <a href="https://www.reddit.com/r/docker/comments/hjsipd/permission_denied_with_volumes/">confuse</a> <a href="https://medium.com/@nielssj/docker-volumes-and-file-system-permissions-772c1aee23ca">and</a> <a href="https://mydeveloperplanet.com/2022/10/19/docker-files-and-volumes-permission-denied/">irritate</a> <a href="https://blog.gougousis.net/file-permissions-the-painful-side-of-docker/">people</a>, and they happen because apps in the container run as a different user than the host user.</p>
<p>There are <a href="/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html#solution-strategies-overiew">various strategies to solve this issue</a>, but they are all non-trivial (requiring complex logic) and/or have significant caveats (e.g., requiring privileged containers). Here's where my new tool <a href="https://github.com/FooBarWidget/matchhostfsowner">MatchHostFsOwner</a> comes in.</p>
<h2 id="how-does-matchhostfsowner-solve-container-file-permission-pains">How does MatchHostFsOwner solve container file permission pains?</h2>
<p>MatchHostFsOwner implements <a href="/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html#strategy-1-matching-the-containers-uidgid-with-the-hosts">solution strategy number 1</a>. It ensures that the container runs as the same user (UID/GID) as the host's user. In short, it:</p>
<ul>
<li>modifies a user account inside the container so that the account's UID/GID matches that of the host user.</li>
<li>executes the actual container command as the aforementioned user account (instead of, e.g., letting it execute as root).</li>
</ul>
<p>This strategy is easier said than done: the linked article documents the many caveats involved. Fortunately, MatchHostFsOwner addresses all of these caveats for you.</p>
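<p>To make the strategy concrete, here's a deliberately simplified shell sketch of what such an entrypoint boils down to. This is illustrative only, not MatchHostFsOwner's actual code (which is written in Rust and handles many more edge cases); it assumes the <code>app</code> account convention and the <code>MHF_HOST_UID</code>/<code>MHF_HOST_GID</code> environment variables from usage mode 2, both described later in this post.</p>

```shell
# Illustrative sketch only: remap the "app" account to the host user's
# UID/GID, then drop root privileges before running the real command.
# We write it to a file here so it can be inspected; in a real image it
# would be COPY'd in and set as the ENTRYPOINT.
cat > entrypoint-sketch.sh <<'EOF'
#!/bin/sh
set -e
# MHF_HOST_UID/MHF_HOST_GID are supplied by the host user at "docker run" time.
groupmod -g "$MHF_HOST_GID" app
usermod  -u "$MHF_HOST_UID" -g "$MHF_HOST_GID" app
chown -R "$MHF_HOST_UID:$MHF_HOST_GID" /home/app
# Drop root privileges and execute the actual container command.
exec setpriv --reuid "$MHF_HOST_UID" --regid "$MHF_HOST_GID" --init-groups "$@"
EOF
chmod +x entrypoint-sketch.sh
```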
<h2 id="using-matchhostfsowner">Using MatchHostFsOwner</h2>
<p>Here are some core concepts to understand:</p>
<ul>
<li>
<p><strong>It's an entrypoint</strong> — Install MatchHostFsOwner as the container entrypoint program. It <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#combining-other-entrypoint-programs-with-matchhostfsowner">should be the first program to run in the container</a>. When it runs, it modifies the container's environment, then executes the next command with the proper UID/GID.</p>
</li>
<li>
<p><strong>It requires host user input</strong> — when starting a container, the host user must tell MatchHostFsOwner what the host user's UID/GID is. How the user passes this information depends on what tool the user uses to start the container (e.g., Docker CLI, Docker Compose, Kubernetes, etc).</p>
</li>
<li>
<p><strong>It requires an extra user account in the container</strong> — MatchHostFsOwner tries to execute the next command under a user account in the container whose UID equals the host user's UID. If no such account exists (which is common), then MatchHostFsOwner will take a specific account and modify its UID/GID to match that of the host user.</p>
<p>The account MatchHostFsOwner will take and modify is called the <strong>"app account"</strong>. MatchHostFsOwner won't create this account for you — you have to supply it. It won't always be used, but often it will.</p>
<p>By default, MatchHostFsOwner assumes that the app account is named <code>app</code>. But this is <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#custom-usergroup-account-name">customizable</a>.</p>
</li>
<li>
<p><strong>It requires root privileges</strong> — MatchHostFsOwner itself requires root privileges to modify the container's environment. It drops these privileges later before executing the next command.</p>
<p>How exactly MatchHostFsOwner is granted root privileges depends on how one is supposed to start the container. This brings us to the two <em>usage modes</em>.</p>
</li>
</ul>
<h2 id="usage-mode-1-start-container-without-root-privileges">Usage mode 1: start container without root privileges</h2>
<p>This mode is most suitable when the container is started without root privileges. For example:</p>
<ul>
<li>When your Dockerfile sets a default user account using <code>USER</code>.</li>
<li>When your container is supposed to be started with <code>docker run --user</code>.</li>
<li>When your Kubernetes spec makes use of securityContext's <code>runAsUser</code>/<code>runAsGroup</code>.</li>
</ul>
<p>In this mode, you must grant MatchHostFsOwner the setuid root bit. MatchHostFsOwner drops its setuid root bit as soon as possible after it has done its work.</p>
<p>This mode has some limitations:</p>
<ul>
<li>The container cannot be started a second time (e.g., via <code>docker stop</code> followed by <code>docker start</code>). Upon the second start, MatchHostFsOwner no longer has the setuid root bit, so it can't do its job. Thus, mode 1 is only useful for ephemeral containers.</li>
<li>It's incompatible with Docker Compose, because Compose may start the container a second time.</li>
<li>The container filesystem on which MatchHostFsOwner is located must be writable, because MatchHostFsOwner must be able to drop its setuid root bit. Thus, you cannot run the container in read-only mode (e.g., <code>docker run --read-only</code>).</li>
</ul>
<h3 id="usage-mode-1-in-action">Usage mode 1 in action</h3>
<p>Begin by preparing the container.</p>
<ul>
<li>Create an account in your container for running your app. It doesn't matter what you name it (it's <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#custom-usergroup-account-name">customizable</a>), but let's call it "app" in this demo because MatchHostFsOwner assumes by default that that's the name. Set this account up as the default account for the container.</li>
<li>Place the MatchHostFsOwner executable in a root-owned directory (e.g., <code>/sbin</code>) and ensure that the executable is owned by root, and has the setuid root bit.</li>
<li>Set up the MatchHostFsOwner executable as the container entrypoint.</li>
</ul>
<p>For example:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> ubuntu:22.04</span>
<span class="c"># Install MatchHostFsOwner. Replace X.X.X with an actual version.</span>
<span class="c"># See https://github.com/FooBarWidget/matchhostfsowner/releases</span>
<span class="k">ADD</span><span class="s"> https://github.com/FooBarWidget/matchhostfsowner/releases/download/vX.X.X/matchhostfsowner-X.X.X-x86_64-linux.gz /sbin/matchhostfsowner.gz</span>
<span class="k">RUN </span><span class="nb">gunzip</span> /sbin/matchhostfsowner.gz <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chown </span>root: /sbin/matchhostfsowner <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chmod</span> +x,+s /sbin/matchhostfsowner
<span class="k">RUN </span>addgroup <span class="nt">--gid</span> 9999 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 9999 <span class="nt">--gid</span> 9999 <span class="nt">--disabled-password</span> <span class="nt">--gecos</span> App app
<span class="c">## Or, on RHEL-based images:</span>
<span class="c"># RUN groupadd --gid 9999 app && \</span>
<span class="c"># useradd --uid 9999 --gid 9999 app</span>
<span class="c">## Or, on Alpine-based images:</span>
<span class="c"># RUN addgroup -g 9999 app && \</span>
<span class="c"># adduser -G app -u 9999 -D app</span>
<span class="k">USER</span><span class="s"> app</span>
<span class="k">ENTRYPOINT</span><span class="s"> ["/sbin/matchhostfsowner"]</span>
</code></pre></div>
<div class="highlight"><pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">-t</span> my-example-image
</code></pre></div>
<p>Next, start the container using a user and group ID that matches the host user's. For example, using the Docker CLI. (See <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#kubernetes">the documentation</a> for a Kubernetes-based example.)</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">--user</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">:</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> my-example-image <span class="nb">id</span> <span class="nt">-a</span>
<span class="c"># Output (assuming host UID/GID is 501/20):</span>
<span class="c"># uid=501(app) gid=20(app) groups=20(app)</span>
</code></pre></div>
<p>Success! Here's what happened under the hood:</p>
<ul>
<li>MatchHostFsOwner (the entrypoint) runs before the container command (<code>id -a</code>) does.</li>
<li>MatchHostFsOwner sees the container is running as UID/GID 501/20. So it modifies the "app" account's UID/GID to 501/20. It can do that because it has setuid root privileges.</li>
<li>MatchHostFsOwner drops its setuid root privileges, then executes the command <code>id -a</code> under the container's "app" account.</li>
</ul>
<h2 id="usage-mode-2-start-container-with-root-privileges">Usage mode 2: start container with root privileges</h2>
<p>In this mode, MatchHostFsOwner obtains root privileges simply because the container itself is started with root privileges; no setuid root bit is required. MatchHostFsOwner drops these privileges as soon as possible after it has done its work.</p>
<p>This mode is most suitable if any of the following is applicable:</p>
<ul>
<li>You're using Docker Compose.</li>
<li>The container could be started a second time, as happens with, e.g., Docker Compose.</li>
<li>The container filesystem in which MatchHostFsOwner is located is read-only.</li>
</ul>
<h3 id="usage-mode-2-in-action">Usage mode 2 in action</h3>
<p>Begin by preparing the container:</p>
<ul>
<li>Create an account in your container for running your app. It doesn't matter what you name it (it's <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md#custom-usergroup-account-name">customizable</a>), but let's call it "app" in this demo because MatchHostFsOwner assumes by default that that's the name. Set this account up as the default account for the container.</li>
<li>Place the MatchHostFsOwner executable in a root-owned directory (e.g., <code>/sbin</code>) and ensure that the executable is owned by root.</li>
<li>Set up the MatchHostFsOwner executable as the container entrypoint.</li>
<li>Don't set a default user account with <code>USER</code>.</li>
</ul>
<p>Example:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> ubuntu:22.04</span>
<span class="c"># Install MatchHostFsOwner. Replace X.X.X with an actual version.</span>
<span class="c"># See https://github.com/FooBarWidget/matchhostfsowner/releases</span>
<span class="k">ADD</span><span class="s"> https://github.com/FooBarWidget/matchhostfsowner/releases/download/vX.X.X/matchhostfsowner-X.X.X-x86_64-linux.gz /sbin/matchhostfsowner.gz</span>
<span class="k">RUN </span><span class="nb">gunzip</span> /sbin/matchhostfsowner.gz <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chown </span>root: /sbin/matchhostfsowner <span class="o">&&</span> <span class="se">\
</span> <span class="nb">chmod</span> +x /sbin/matchhostfsowner
<span class="k">RUN </span>addgroup <span class="nt">--gid</span> 9999 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 9999 <span class="nt">--gid</span> 9999 <span class="nt">--disabled-password</span> <span class="nt">--gecos</span> App app
<span class="c">## Or, on RHEL-based images:</span>
<span class="c"># RUN groupadd --gid 9999 app && \</span>
<span class="c"># useradd --uid 9999 --gid 9999 app</span>
<span class="c">## Or, on Alpine-based images:</span>
<span class="c"># RUN addgroup -g 9999 app && \</span>
<span class="c"># adduser -G app -u 9999 -D app</span>
<span class="k">ENTRYPOINT</span><span class="s"> ["/sbin/matchhostfsowner"]</span>
</code></pre></div>
<div class="highlight"><pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">-t</span> my-example-image
</code></pre></div>
<p>Next, start the container while setting the environment variables <code>MHF_HOST_UID</code> and <code>MHF_HOST_GID</code> to the host user's UID/GID like this:</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">-e</span> <span class="s2">"MHF_HOST_UID=</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">"</span> <span class="nt">-e</span> <span class="s2">"MHF_HOST_GID=</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> my-example-image <span class="nb">id</span> <span class="nt">-a</span>
<span class="c"># Output (assuming host UID/GID is 501/20):</span>
<span class="c"># uid=501(app) gid=20(app) groups=20(app)</span>
</code></pre></div>
<p>Here's what happened under the hood:</p>
<ul>
<li>MatchHostFsOwner (the entrypoint) runs before the container command (<code>id -a</code>) does.</li>
<li>MatchHostFsOwner sees that <code>MHF_HOST_UID</code>/<code>MHF_HOST_GID</code> is set to 501/20. So it modifies the "app" account's UID/GID to 501/20.</li>
<li>MatchHostFsOwner drops its root privileges, then executes the command <code>id -a</code> under the container's "app" account.</li>
</ul>
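<p>Since mode 2 is the mode that works with Docker Compose, here's what a minimal Compose setup might look like. The service name and volume path are made up for illustration; the image is the <code>my-example-image</code> built above.</p>

```shell
# Hypothetical docker-compose.yml for usage mode 2. The image already has
# MatchHostFsOwner as its ENTRYPOINT; we only pass the host UID/GID through.
cat > docker-compose.yml <<'EOF'
services:
  myapp:
    image: my-example-image
    environment:
      MHF_HOST_UID: "${MHF_HOST_UID}"
      MHF_HOST_GID: "${MHF_HOST_GID}"
    volumes:
      - ./data:/app/data
EOF
# Then start it with the host user's UID/GID:
#   MHF_HOST_UID="$(id -u)" MHF_HOST_GID="$(id -g)" docker compose up
```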
<figure>
<a href="https://github.com/FooBarWidget/matchhostfsowner"><img src="/images/2023/matchhostfsowner-mascot-small-8c393772.jpg" alt="MatchHostFsOwner mascot: dog with glasses" /></a>
<figcaption>MatchHostFsOwner project mascot</figcaption>
</figure>
<h2 id="conclusion">Conclusion</h2>
<p>MatchHostFsOwner is an excellent way to solve Docker volume permission problems (more precisely: the container host filesystem owner matching problem). Please have a look at its <a href="https://github.com/FooBarWidget/matchhostfsowner">source code</a> (it's written in Rust!) and check out <a href="https://github.com/FooBarWidget/matchhostfsowner/blob/main/README.md">its documentation</a> for customization, advanced usage, and troubleshooting instructions.</p>
<p>Stay cured!</p>
Ubuntu 22.04 support for Fullstaq Ruby is herehttps://www.joyfulbikeshedding.com/blog/2022-04-30-ubuntu-22-04-support-for-fullstaq-ruby-is-here.html2022-04-30T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>Ubuntu 22.04 was released a couple of days ago. Fullstaq Ruby now provides packages for this distribution!</p>
<blockquote>
<p><a href="https://fullstaqruby.org">Fullstaq Ruby</a> distributes server-optimized Ruby binaries. <a href="https://github.com/fullstaq-ruby/server-edition/blob/main/README.md#installation">Install</a> the latest Ruby versions with APT/YUM instead of compiling. Easily keep Ruby <a href="https://github.com/fullstaq-ruby/server-edition/blob/main/README.md#minor-version-packages-a-great-way-to-keep-ruby-security-patched">security patched</a> via auto-tiny version updates. Combat memory bloat (<a href="https://dev.to/evilmartians/fullstaq-ruby-first-impressions-and-how-to-migrate-your-docker-kubernetes-ruby-apps-today-4fm7">save as much as 50%</a>) with <a href="https://github.com/fullstaq-ruby/server-edition/blob/main/README.md#key-features">memory allocator improvements</a>.</p>
</blockquote>
<p>Here's the corresponding pull request: #96.</p>
<p>Note that we only provide Ruby 3.1 packages for Ubuntu 22.04. This is because Ubuntu 22.04 ships with OpenSSL v3, and only Ruby 3.1 is compatible with that OpenSSL version.</p>
<p>Want to install or upgrade? Check <a href="https://github.com/fullstaq-ruby/server-edition/blob/master/README.md#installation">the installation instructions</a>, or run <code>apt upgrade</code>/<code>yum update</code>.</p>
Ruby gem: distributed locking on Google Cloudhttps://www.joyfulbikeshedding.com/blog/2021-09-14-ruby-gem-distributed-locking-on-google-cloud.html2021-09-14T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>I previously designed a robust <a href="2021-05-19-robust-distributed-locking-algorithm-based-on-google-cloud-storage.html.md">distributed locking algorithm based on Google Cloud</a>. Now I'm releasing a Ruby implementation of this algorithm: <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby">distributed-lock-google-cloud-storage-ruby</a>.</p>
<p>To use this, add to your Gemfile:</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="n">gem</span> <span class="s1">'distributed-lock-google-cloud-storage'</span>
</code></pre></div>
<p>Its typical usage is as follows. Initialize a Lock instance. It must be backed by a Google Cloud Storage bucket and object. Then do your work within a <code>#synchronize</code> block.</p>
<p><strong>Important:</strong> If your work is a long-running operation, then also be sure to call <code>#check_health!</code> <em>periodically</em> to check whether the lock is still healthy. This call throws an exception if it's not healthy. Learn more in <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby/blob/main/README.md#long-running-operations-lock-refreshing-and-lock-health-checking">Long-running operations, lock refreshing and lock health checking</a>.</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="nb">require</span> <span class="s1">'distributed-lock-google-cloud-storage'</span>
<span class="n">lock</span> <span class="o">=</span> <span class="no">DistributedLock</span><span class="o">::</span><span class="no">GoogleCloudStorage</span><span class="o">::</span><span class="no">Lock</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span>
<span class="ss">bucket_name: </span><span class="s1">'your bucket name'</span><span class="p">,</span>
<span class="ss">path: </span><span class="s1">'locks/mywork'</span><span class="p">)</span>
<span class="n">lock</span><span class="p">.</span><span class="nf">synchronize</span> <span class="k">do</span>
<span class="n">do_some_work</span>
<span class="c1"># IMPORTANT: _periodically_ call this!</span>
<span class="n">lock</span><span class="p">.</span><span class="nf">check_health!</span>
<span class="n">do_more_work</span>
<span class="k">end</span>
</code></pre></div>
<p>To learn more about this gem, please check out <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby/blob/main/README.md">its README</a> and its <a href="https://foobarwidget.github.io/distributed-lock-google-cloud-storage-ruby/DistributedLock/GoogleCloudStorage/Lock.html">full API docs</a>.</p>
A robust distributed locking algorithm based on Google Cloud Storagehttps://www.joyfulbikeshedding.com/blog/2021-05-19-robust-distributed-locking-algorithm-based-on-google-cloud-storage.html2021-05-19T00:00:00+00:002023-04-22T18:01:13+00:00Hongli Lai<p>Many workloads nowadays involve many systems that operate concurrently. This ranges from microservice fleets to workflow orchestration to CI/CD pipelines. Sometimes it's important to coordinate these systems so that concurrent operations don't step on each other. One way to do that is by using <em>distributed locks</em> that work across multiple systems.</p>
<p>Distributed locks used to require complex algorithms or complex-to-operate infrastructure, making them expensive both in terms of costs as well as in upkeep. With the emergence of fully managed and serverless cloud systems, this reality has changed.</p>
<p>In this post I'll look into a distributed locking algorithm based on Google Cloud. I'll discuss several existing implementations and suggest algorithmic improvements in terms of performance and robustness.</p>
<p><strong>Update</strong>: there is now a <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby">Ruby implementation</a> of this algorithm!</p>
<h2 id="use-cases-for-distributed-locks">Use cases for distributed locks</h2>
<p>Distributed locks are useful in any situation in which multiple systems may operate on the same state concurrently. Concurrent modifications may corrupt the state, so one needs a mechanism to ensure that only one system can modify the state at the same time.</p>
<p>A good example is Terraform. When you store the Terraform state in the cloud, and you run multiple Terraform instances concurrently, then Terraform guarantees that only one Terraform instance can modify the infrastructure concurrently. This is done through a distributed lock. In contrast to a regular (local system) lock, a distributed lock works across multiple systems. So even if you run two Terraform instances on two different machines, then Terraform still protects you from concurrent modifications.</p>
<p>More generally, distributed locks are useful for <strong>ad-hoc system/cloud automation scripts and CI/CD pipelines</strong>. Sometimes you want your script or pipeline to perform non-trivial modifications that take many steps. It can easily happen that multiple instances of the script or pipeline are run. When that happens, you don't want those multiple instances to perform the modification at the same time, because that can corrupt things. You can use a distributed lock to make concurrent runs safe.</p>
<p>Here's a concrete example involving a CI/CD pipeline. <a href="https://fullstaqruby.org">Fullstaq Ruby</a> had an APT and YUM repository hosted on <a href="https://bintray.com/">Bintray</a>. A few months ago, Bintray announced that it would shut down in the near future, so <a href="https://github.com/fullstaq-labs/fullstaq-ruby-server-edition/blob/main/dev-handbook/apt-yum-repo.md">we had to migrate to a different solution</a>. We chose to self-host our APT and YUM repository on a cloud object store.</p>
<figure>
<img src="/images/2021/distributed-lock-arch-9433f803.svg" alt="" />
<figcaption>The Fullstaq Ruby package publishing pipeline uses a distributed lock to guarantee concurrency-safety. Learn more: <a href="https://github.com/fullstaq-labs/fullstaq-ruby-server-edition/blob/main/dev-handbook/apt-yum-repo.md">Fullstaq Ruby's APT and YUM repository setup</a></figcaption>
</figure>
<p>APT and YUM repositories consist of a bunch of .deb and .rpm packages, plus a bunch of metadata. Package updates are published through Fullstaq Ruby's CI/CD system. This CI/CD system directly modifies multiple files on the cloud object store. We want this publication process to be <strong>concurrency-safe</strong>, because if we commit too quickly then multiple CI/CD runs may occur at the same time. The easiest way to achieve this is by using a distributed lock, so that only one CI/CD pipeline may operate on the cloud object bucket concurrently.</p>
<h2 id="why-building-on-google-cloud-storage">Why build on Google Cloud Storage?</h2>
<p>Distributed locks used to be hard to implement. In the past they required complicated <a href="https://en.wikipedia.org/wiki/Consensus_(computer_science)">consensus protocols</a> such as <a href="https://en.wikipedia.org/wiki/Paxos_(computer_science)">Paxos</a> or <a href="https://en.wikipedia.org/wiki/Raft_(algorithm)">Raft</a>, as well as the hassle of hosting yet another service. See <a href="https://en.wikipedia.org/wiki/Distributed_lock_manager">Distributed lock manager</a>.</p>
<p>More recently, people started implementing distributed locks on top of other distributed systems, such as transactional databases and Redis. This significantly reduced the complexity of the algorithms, but operational complexity was still significant. A big issue is that these systems aren't "serverless": operating and maintaining a database instance or a Redis instance is not cheap. It's not cheap in terms of effort, and it's not cheap in terms of costs: you pay for a database/Redis instance based on its uptime, not based on how many operations you perform.</p>
<p>Luckily, there are many cloud systems nowadays which not only provide the building blocks necessary to build a distributed lock, but are also fully managed and serverless. Google Cloud Storage is a great system to build a distributed lock on. It's cheap, it's popular, it's highly available and it's maintenance-free. You only pay for the amount of operations you perform on it.</p>
<h2 id="basic-challenges-of-distributed-locking">Basic challenges of distributed locking</h2>
<p>One of the problems that distributed locking algorithms need to solve, is the fact that participants in the algorithm need to <strong>communicate</strong> with each other. Distributed systems may run in different networks that aren't directly connected.</p>
<p>Another problem is that of <strong>concurrency control</strong>. This is made difficult by communication lag. If two participants request ownership of a lock simultaneously, then we want both of them to agree on a single outcome even though it takes time for each participant to hear the other.</p>
<p>Finally, there is the problem of <strong>state consistency</strong>. When you write to a storage system, then next time you read from that system you want to read what you just wrote. This is called <em>strong consistency</em>. Some storage systems are <em>eventually consistent</em>, which means that it takes a while before you read what you just wrote. Storage systems that are eventually consistent are not suitable for implementing distributed locks.</p>
<p>This is why we leverage Google Cloud Storage as both a communication channel and a "referee". Everyone can connect to Cloud Storage, and access control is simple and well-understood. Cloud Storage <a href="https://cloud.google.com/storage/docs/consistency">is also a strongly consistent system</a> and has <a href="https://cloud.google.com/storage/docs/generations-preconditions">concurrency control features</a>. The latter allows Cloud Storage to make a single, final decision when two participants try to take ownership of the lock simultaneously.</p>
<h2 id="building-blocks-generation-numbers-and-atomic-operations">Building blocks: generation numbers and atomic operations</h2>
<p>Every Cloud Storage object has two separate <a href="https://cloud.google.com/storage/docs/generations-preconditions#_Generations">generation numbers</a>.</p>
<ul>
<li>The normal generation number changes every time the object's data is modified.</li>
<li>The metageneration number changes every time the object's metadata is modified.</li>
</ul>
<p>When you perform a modification operation, you can use the <a href="https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch">x-goog-if-generation-match</a>/<a href="https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifmetagenerationmatch">x-goog-if-metageneration-match</a> headers in the Cloud Storage API to say: "only perform this operation if the generation/metageneration equals this value". Cloud Storage guarantees that this effect is atomic and free of race conditions. These headers are called <strong>precondition headers</strong>.</p>
<p>The special value 0 for x-goog-if-generation-match means "only perform this operation if the object does not exist".</p>
<p>This feature — the ability to specify preconditions to operations — is key to concurrency control.</p>
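<p>As a concrete illustration: in Cloud Storage's JSON API, the same precondition is expressed as the <code>ifGenerationMatch</code> query parameter. Here's a sketch of an only-if-absent object creation; the bucket and object names are hypothetical, and the <code>curl</code> call is shown commented out because it needs real credentials.</p>

```shell
# Create an object only if it doesn't exist yet. ifGenerationMatch=0 is the
# JSON API's equivalent of the x-goog-if-generation-match: 0 header.
BUCKET="my-bucket"          # hypothetical bucket name
LOCK_OBJECT="locks/mywork"  # hypothetical object name
CREATE_URL="https://storage.googleapis.com/upload/storage/v1/b/${BUCKET}/o?uploadType=media&name=${LOCK_OBJECT}&ifGenerationMatch=0"
# curl -X POST -d 'locked' \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H 'Content-Type: text/plain' \
#   "$CREATE_URL"
# HTTP 200 => object created; HTTP 412 Precondition Failed => already exists.
echo "$CREATE_URL"
```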
<h2 id="existing-implementations">Existing implementations</h2>
<p>Several implementations of a distributed lock based on Google Cloud Storage already exist. A prominent one is <a href="https://github.com/mco-gh/gcslock">gcslock</a> by <a href="https://mco.dev/">Marc Cohen</a>, who works at Google. Gcslock leverages the <a href="https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch">x-goog-if-generation-match</a> header, as described in the previous section. Its algorithm is simple, as we'll discuss in the next section.</p>
<p>Most other implementations, such as <a href="https://github.com/thinkingmachines/gcs-mutex-lock">gcs-mutex-lock</a> and <a href="https://github.com/XaF/gcslock-ruby">gcslock-ruby</a>, use the gcslock algorithm though with minor adaptations.</p>
<p>I've been able to find one implementation that's significantly different and more advanced: HashiCorp Vault's leader election algorithm. Though it's not functionally meant to be used as a lock, technically it boils down to a lock. We'll discuss this algorithm in a later section.</p>
<h2 id="gcslock-a-basic-locking-algorithm">Gcslock: a basic locking algorithm</h2>
<p>The gcslock algorithm is as follows:</p>
<ul>
<li>Taking the lock means creating an object with <code>x-goog-if-generation-match: 0</code>.
<ul>
<li>The content of the object does not matter.</li>
<li>If creation is successful, then it means we've taken the lock.</li>
<li>If creation fails with a 412 Precondition Failed error, then it means the object already exists. This means the lock was already taken. We retry later. The retry sleep time increases exponentially every time taking the lock fails.</li>
</ul>
</li>
<li>Releasing the lock means deleting the object.</li>
</ul>
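<p>The algorithm above can be sketched as follows. This is illustrative Python against a minimal in-memory stand-in for the bucket, not the actual gcslock code, which talks to the Cloud Storage API:</p>

```python
import time

class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """Minimal stand-in for a bucket that honors if_generation_match=0."""
    def __init__(self):
        self.objects = set()

    def create(self, name, if_generation_match=None):
        if if_generation_match == 0 and name in self.objects:
            raise PreconditionFailed()
        self.objects.add(name)

    def delete(self, name):
        self.objects.discard(name)

def take_lock(bucket, name, max_tries=5, base_sleep=0.01):
    sleep = base_sleep
    for _ in range(max_tries):
        try:
            bucket.create(name, if_generation_match=0)  # atomic "create if absent"
            return True
        except PreconditionFailed:
            time.sleep(sleep)   # lock is already taken: back off...
            sleep *= 2          # ...exponentially, as gcslock does
    return False

def release_lock(bucket, name):
    bucket.delete(name)

bucket = FakeBucket()
assert take_lock(bucket, "my-lock")                    # first taker wins
assert not take_lock(bucket, "my-lock", max_tries=3)   # contender gives up
release_lock(bucket, "my-lock")
assert take_lock(bucket, "my-lock")                    # free again after release
```

The key point is that <code>if_generation_match=0</code> turns object creation into an atomic test-and-set.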
<p>This algorithm is very simple. It is also relatively high-latency, because Cloud Storage's response time is measured in tens to hundreds of milliseconds, and because it utilizes retries with exponential backoff. Relatively high latency may or may not be a problem depending on your use case. It's probably fine for most batch operations, but it's probably unacceptable for applications that require pseudo-realtime responsiveness.</p>
<p>There are bigger issues though:</p>
<ul>
<li>
<p><strong>Prone to crashes</strong>. If a process crashes while holding the lock, then the lock remains stuck until an administrator manually deletes it.</p>
</li>
<li>
<p><strong>Hard to find out who the owner is</strong>. The lock object records nothing about its owner. The only way to find out who holds the lock is by querying the processes themselves.</p>
</li>
<li>
<p><strong>Unbounded backoff</strong>. The exponential backoff has no upper limit. If the lock stays taken for a long time (e.g., because a process crashed while holding it), then the exponential backoff grows without bound. This means that an administrator may need to restart all sorts of processes after deleting a stale lock.</p>
<p><a href="https://github.com/thinkingmachines/gcs-mutex-lock">gcs-mutex-lock</a> and <a href="https://github.com/XaF/gcslock-ruby">gcslock-ruby</a> address this by setting an upper bound to the exponential backoff.</p>
</li>
<li>
<p><strong>Retry contention</strong>. If multiple processes start taking the lock at the same time, then they all back off at the same rate. This means that they end up retrying at the same time. This causes spikes in API requests towards Google Cloud Storage. This can cause network contention issues.</p>
<p><a href="https://github.com/thinkingmachines/gcs-mutex-lock">gcs-mutex-lock</a> addresses this by adding jitter to the backoff time.</p>
</li>
<li>
<p><strong>Unintended releases</strong>. A lock release request may be delayed by the network. Imagine the following scenario:</p>
<ol>
<li>The lock owner sends a release (delete) request, which gets delayed by the network.</li>
<li>An administrator thinks the lock is stale, and deletes it.</li>
<li>Another process takes the lock.</li>
<li>The original release request finally arrives, inadvertently releasing the new owner's lock.</li>
</ol>
<p>This sort of network-delay-based problem is even <a href="https://cloud.google.com/storage/docs/generations-preconditions#special-case">documented in the Cloud Storage documentation as a potential risk</a>.</p>
</li>
</ul>
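<p>The two refinements mentioned above, an upper bound and jitter, can be expressed as a small helper. The base delay, cap and jitter fraction below are illustrative choices, not values prescribed by any of these libraries:</p>

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0, jitter=0.5):
    """Exponential backoff clamped at `cap` seconds, with up to +/-50%
    randomization so that contenders don't all retry in lockstep."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(1 - jitter, 1 + jitter)

delays = [backoff_delay(n) for n in range(12)]
assert all(0 < d <= 30.0 * 1.5 for d in delays)  # never exceeds cap + jitter
```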
<h2 id="resisting-stuck-locks-via-ttls">Resisting stuck locks via TTLs</h2>
<p>One way to avoid stuck locks left behind by crashed processes is to consider locks <strong>stale</strong> if they are "too old". We can use the timestamps that Cloud Storage manages, which change every time an object is modified.</p>
<p>What should be considered "too old" really depends on the specific operation. So this should be a configurable parameter, which we call the <strong>time-to-live (TTL)</strong>.</p>
<p>What's more, the same TTL value should be agreed upon by all processes. Otherwise we risk one process considering the lock stuck even though its owner thinks it isn't. One way to ensure that all processes agree on the same TTL is to configure them all with the same TTL value, but this approach is error-prone. A better way is to store the TTL value in the lock object itself.</p>
<p>Here's the updated locking algorithm:</p>
<ol>
<li>Create the object with <code>x-goog-if-generation-match: 0</code>.
<ul>
<li>Store the TTL in a metadata header.</li>
<li>The content of the object does not matter.</li>
</ul>
</li>
<li>If creation is successful, then it means we've taken the lock.</li>
<li>If creation fails with a 412 Precondition Failed error (meaning the object already exists), then:
<ol>
<li>Fetch from its metadata the update timestamp, generation number and TTL.</li>
<li>If the update timestamp is older than the TTL, then delete the object, with <code>x-goog-if-generation-match: [generation]</code>. Specifying this header is important, because if someone else takes the lock concurrently (meaning the lock is no longer stale), then we don't want to delete their lock.</li>
<li>Retry the locking algorithm after an exponential backoff (potentially with an upper limit and jitter).</li>
</ol>
</li>
</ol>
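<p>The TTL-aware algorithm can be sketched like this (again illustrative Python against an in-memory stand-in; a real implementation would read the object's metadata and issue the conditional delete via the Cloud Storage API):</p>

```python
import time

class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """In-memory stand-in; each lock stores a generation, a TTL and an
    update timestamp, mimicking the metadata we'd read from Cloud Storage."""
    def __init__(self):
        self.objects = {}
        self.next_generation = 1

    def create(self, name, ttl, now, if_generation_match=None):
        if if_generation_match == 0 and name in self.objects:
            raise PreconditionFailed()
        self.objects[name] = {
            "generation": self.next_generation,
            "ttl": ttl,            # stored in the lock so all processes agree
            "updated": now,
        }
        self.next_generation += 1

    def metadata(self, name):
        return dict(self.objects[name])

    def delete(self, name, if_generation_match=None):
        obj = self.objects.get(name)
        if obj is None:
            return
        if (if_generation_match is not None
                and obj["generation"] != if_generation_match):
            raise PreconditionFailed()  # someone retook the lock: don't delete
        del self.objects[name]

def try_take_lock(bucket, name, ttl, now):
    """One pass of the TTL-aware algorithm. Returns True if we got the lock;
    on False the caller retries after an exponential backoff."""
    try:
        bucket.create(name, ttl, now, if_generation_match=0)
        return True
    except PreconditionFailed:
        meta = bucket.metadata(name)
        if now - meta["updated"] > meta["ttl"]:        # lock looks stale
            try:
                bucket.delete(name, if_generation_match=meta["generation"])
            except PreconditionFailed:
                pass                                   # lost the race; fine
        return False

bucket = FakeBucket()
t0 = time.time()
assert try_take_lock(bucket, "lock", ttl=300, now=t0)           # taken
assert not try_take_lock(bucket, "lock", ttl=300, now=t0 + 10)  # held, fresh
assert not try_take_lock(bucket, "lock", ttl=300, now=t0 + 301) # stale: deleted
assert try_take_lock(bucket, "lock", ttl=300, now=t0 + 301)     # now free
```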
<p>What's a good value for the TTL?</p>
<ul>
<li>Cloud Storage's latency is relatively high, in the order of tens to hundreds of milliseconds. So the TTL should be at least a few seconds.</li>
<li>If you perform Cloud Storage operations via the <code>gsutil</code> CLI, then you should be aware that gsutil takes a few seconds to start. Thus, the TTL should be at least a few tens of seconds.</li>
<li>A distributed lock like this is best suited for batch workloads. Such workloads typically take seconds to tens or even hundreds of seconds. Your TTL should be a safe multiple of the time your operation is expected to take. We'll discuss this further in the next section, "long-running operations".</li>
</ul>
<p>As a general rule, I'd say that a safe TTL should be in the order of minutes. It should be at least 1 minute. I think a <strong>good default is 5 minutes</strong>.</p>
<h2 id="long-running-operations">Long-running operations</h2>
<p>If an operation takes longer than the TTL, then another process could take ownership of the lock even though the original owner is still operating. Increasing the TTL addresses this issue somewhat, but this approach has drawbacks:</p>
<ul>
<li>If the operation's completion time is unknown, then it's impossible to pick a TTL.</li>
<li>A larger TTL means that it takes longer for processes to detect stale locks.</li>
</ul>
<p>A better approach is to <strong>refresh</strong> the object's update timestamp regularly as long as the operation is still in progress. Keep the TTL relatively short, so that if the process crashes then it won't take too much time for others to detect the lock as stale.</p>
<p>We implement refreshing via a <a href="https://cloud.google.com/storage/docs/json_api/v1/objects/patch">PATCH object API call</a>. The exact data to patch doesn't matter: we only care about the fact that Cloud Storage will change the update timestamp.</p>
<p>We call the time between refreshes the <strong>refresh interval</strong>. A proper value for the refresh interval depends on the TTL. It must be much shorter than the TTL, otherwise refreshing the lock is pointless. Its value should take into consideration that a refresh operation is subject to network delays, or even local CPU scheduling delays.</p>
<p>As a general rule, <strong>I recommend a refresh interval that's at most 1/8th of the TTL</strong>. Given a default TTL of 5 minutes, I recommend a <strong>default refresh interval of ~37 seconds</strong>. This recommendation takes into consideration that refreshes can fail, which we'll discuss in the next section.</p>
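<p>A quick sanity check of these recommendations, using the default values:</p>

```python
TTL = 5 * 60.0                    # 300 s: recommended default TTL
refresh_interval = TTL / 8        # 37.5 s: the "at most 1/8th" rule

# A refresh is declared failed only after 3 consecutive tries
# (1 try + 2 retries), so the worst-case time between the last successful
# refresh and the moment we decide to abort is about 3 intervals:
worst_case_detection = 3 * refresh_interval   # 112.5 s

# That still leaves well over half the TTL as time budget for aborting
# cleanly before other processes consider the lock stale:
abort_budget = TTL - worst_case_detection     # 187.5 s
assert refresh_interval == 37.5
assert abort_budget > TTL / 2
```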
<h2 id="refresh-failures">Refresh failures</h2>
<p>Refreshing the lock can fail. There are two failure categories:</p>
<ul>
<li>
<p><strong>Unexpected state</strong></p>
<ul>
<li>The lock object could have been unexpectedly modified by someone else.</li>
<li>The lock object could be unexpectedly deleted.</li>
</ul>
</li>
<li>
<p><strong>Network problems</strong></p>
<ul>
<li>If this means that the refresh operation is arbitrarily delayed by the network, then we can end up refreshing a lock that we don't own. While this is unintended, it won't cause any real problems.</li>
<li>But if this means that the operation failed to reach Cloud Storage, and such failures persist, then the lock can become stale even though the operation is still in progress.</li>
</ul>
</li>
</ul>
<p>How should we respond to refresh failures?</p>
<ul>
<li>Upon encountering unexpected state, we should abort the operation immediately.</li>
<li>
<p>Upon encountering network problems, there's a chance that the failure is just temporary. So we should retry a couple of times. Only if retrying fails too many times consecutively do we abort the operation.</p>
<p>I think <strong>retrying 2 times</strong> (so 3 tries in total) is reasonable. In order to abort way before the TTL expires, the refresh interval must be shorter than 1/3rd of the TTL.</p>
</li>
</ul>
<p>When we conclude that we should abort the operation, we declare that the lock is in an <em>unhealthy state</em>.</p>
<p>Aborting should happen in a manner that leaves the system in a consistent state. Furthermore, aborting takes time, so it should be initiated well before the TTL expires; this is another reason why, in the previous section, I recommended a refresh interval of 1/8th of the TTL.</p>
<h2 id="dealing-with-inconsistent-operation-states">Dealing with inconsistent operation states</h2>
<p>Aborting the operation could itself fail, for example because of network problems. This may leave the system in an inconsistent state. There are ways to deal with this issue:</p>
<ul>
<li>
<p>Next time a process takes the lock, detect whether the state is inconsistent, and then deal with it somehow, for example by fixing the inconsistency.</p>
<p>This means that the operation must be written in such a way that inconsistency <em>can</em> be detected and fixed. Fixing arbitrary inconsistencies is quite hard, so you should carefully design the operation's algorithm to limit <em>how</em> inconsistent the state can become.</p>
<p>This is a difficult topic and is outside the scope of this article. But you could take inspiration from how <a href="https://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf">journaling filesystems work</a> to recover the filesystem state after a crash.</p>
</li>
<li>
<p>An easier approach that's sometimes viable is to treat existing state as immutable. Your operation makes a copy of the existing state, performs operations on the copy, then atomically (or at least nearly so) declares the copy to be the new state.</p>
</li>
</ul>
<h2 id="detecting-unexpected-releases-or-ownership-changes">Detecting unexpected releases or ownership changes</h2>
<p>The lock <em>could</em> be released, or its ownership <em>could</em> change, at any time, either because of a faulty process or because of an unexpected administrator operation. While such things <em>shouldn't</em> happen, it's still a good idea to be able to handle them somehow.</p>
<p>When these things happen, we also say that the lock is in an <em>unhealthy state</em>.</p>
<p>We make the following changes to the algorithm:</p>
<ul>
<li>Right after having taken the lock, take note of its generation number.</li>
<li>When refreshing the lock, use the <code>x-goog-if-generation-match: &lt;last known generation number&gt;</code> header.
<ul>
<li>If it succeeds, take note of the new generation number.</li>
<li>If it fails because the object does not exist, then it means the lock was deleted. We abort the operation.</li>
<li>If it fails with a 412 Precondition Failed error, then it means the ownership unexpectedly changed. We abort the operation without releasing the lock.</li>
</ul>
</li>
<li>When releasing the lock, use the <code>x-goog-if-generation-match: &lt;last known generation number&gt;</code> header, so that we're sure we're releasing the lock we owned and not one that was taken over by another process. We can ignore any 412 Precondition Failed errors.</li>
</ul>
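<p>Here's a sketch of the generation-checked refresh logic, using an in-memory stand-in in which every successful refresh yields a new generation number (in real Cloud Storage, the number comes from the API response):</p>

```python
class PreconditionFailed(Exception):
    pass

class NotFound(Exception):
    pass

class FakeBucket:
    """In-memory stand-in that hands out increasing generation numbers."""
    def __init__(self):
        self.generations = {}
        self.counter = 0

    def create(self, name):
        self.counter += 1
        self.generations[name] = self.counter
        return self.counter

    def refresh(self, name, if_generation_match):
        gen = self.generations.get(name)
        if gen is None:
            raise NotFound()
        if gen != if_generation_match:
            raise PreconditionFailed()
        self.counter += 1
        self.generations[name] = self.counter
        return self.counter

class LockHandle:
    def __init__(self, bucket, name):
        self.bucket, self.name = bucket, name
        self.generation = bucket.create(name)   # note the generation right away

    def refresh(self):
        """Returns 'ok', 'deleted' or 'stolen'; on the latter two the caller
        must abort the operation (without releasing the lock if stolen)."""
        try:
            self.generation = self.bucket.refresh(self.name, self.generation)
            return "ok"
        except NotFound:
            return "deleted"    # the lock was unexpectedly released
        except PreconditionFailed:
            return "stolen"     # ownership unexpectedly changed

bucket = FakeBucket()
lock = LockHandle(bucket, "lock")
assert lock.refresh() == "ok"
bucket.create("lock")                   # simulate a takeover by another process
assert lock.refresh() == "stolen"
del bucket.generations["lock"]          # simulate an unexpected delete
assert lock.refresh() == "deleted"
```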
<h2 id="studying-hashicorp-vaults-leader-election-algorithm">Studying HashiCorp Vault's leader election algorithm</h2>
<p><a href="https://www.vaultproject.io/">HashiCorp Vault</a> is a secrets management system. Its <a href="https://www.vaultproject.io/docs/concepts/ha">high availability setup</a> involves leader election. This is done by taking ownership of a distributed lock. The instance that succeeds in taking ownership is considered the leader.</p>
<p>The leader election algorithm is implemented in <a href="https://github.com/hashicorp/vault/blob/cba7abc64e4d1cb20129b534e3b1a255fbc18977/physical/gcs/gcs_ha.go">physical/gcs/gcs_ha.go</a> and was originally written by <a href="https://twitter.com/sethvargo">Seth Vargo</a> at Google. This algorithm was also <a href="https://cloud.google.com/blog/topics/developers-practitioners/implementing-leader-election-google-cloud-storage">discussed</a> by <a href="https://twitter.com/ahmetb">Ahmet Alp Balkan</a> at the Google Cloud blog.</p>
<figure>
<img src="/images/2021/hashicorp_vault-5d3cb5d7.svg" alt="HashiCorp Vault logo" class="img-xx-smallwidth" />
<figcaption><a href="https://www.vaultproject.io/">HashiCorp Vault</a>'s leader election protocol is actually also a distributed lock! We can draw many interesting lessons from it.</figcaption>
</figure>
<p>Here are the similarities between Vault's algorithm and what we've discussed so far:</p>
<ul>
<li>Vault utilizes Cloud Storage's precondition headers to find out whether it was successful in taking a lock.</li>
<li>When Vault fails to take a lock, it also retries later until it succeeds.</li>
<li>Vault detects stale locks via a TTL.</li>
<li>Vault refreshes locks regularly. A Vault instance holds on to the lock for as long as it's willing to be the leader, so we can consider this to be one gigantic long-running operation, making lock refreshing essential.</li>
<li>Vault checks regularly whether the lock was unexpectedly released or changed ownership.</li>
<li>When Vault releases the lock, it also uses a precondition header to ensure it doesn't delete a lock that someone else took ownership of concurrently.</li>
</ul>
<p>Notable differences:</p>
<ol>
<li>Vault checks whether the lock is stale, <em>before</em> trying to create the lock object. Whereas we check for staleness <em>after</em> trying to do so. Checking for staleness afterwards is a more optimistic approach. If the lock is unlikely to be stale, then checking afterwards is faster.</li>
<li>When Vault fails to take the lock, it backs off linearly instead of exponentially.</li>
<li>Instead of checking the generation number, and refreshing the lock by updating its data, Vault operates purely on <a href="https://cloud.google.com/storage/docs/metadata">object <em>metadata</em></a> because it's less costly to read frequently. This means the algorithm checks the <em>metageneration</em> number, and refreshes the lock by updating metadata fields.</li>
<li>Vault stores its unique instance identity name in the lock. This way administrators can easily find out who owns the lock.</li>
<li>Vault's TTL is a runtime configuration parameter. Its value is not stored in the object.</li>
<li>
<p>If Vault's leader election system crashes non-fatally (e.g. it detected an unhealthy lock, aborted, then tried again later from the same Vault instance), and the lock hasn't been taken over by another Vault instance at the same time, then Vault is able to retake the lock instantly.</p>
<p>In contrast, our approach so far requires waiting until the lock becomes stale per the TTL.</p>
</li>
</ol>
<p>I think points 3, 4 and 6 are worth learning from.</p>
<h2 id="instant-recovery-from-stale-locks--thread-safety">Instant recovery from stale locks & thread-safety</h2>
<p>HashiCorp Vault's ability to retake the lock instantly after a non-fatal crash is worthy of further discussion. It's a desirable feature, but what are the implications?</p>
<p>Upon closer inspection, we see that this feature works by assigning an <em>identity</em> to the lock object. This identity is a random string that's generated during Vault startup. When Vault attempts to take a lock, it checks whether the object already exists and whether its identity equals the Vault instance's own identity. If so, then Vault concludes that it's safe to retake the lock immediately.</p>
<p><strong>This identity string must be chosen with some care</strong>, because it determines the level of mutual exclusion. Vault generates a random identity string that's unique on a per-Vault-instance basis. This results in the lock being multi-process safe, but — perhaps counter-intuitively — not thread-safe!</p>
<p>We can make the lock object thread-safe by including the thread ID in the identity as well. The tradeoff is that an abandoned lock can only be quickly recovered by the same thread that abandoned it in the first place. All other threads still have to wait for the TTL timeout.</p>
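<p>A minimal sketch of such an identity scheme (the exact format below is my own illustration, not Vault's):</p>

```python
import os
import secrets
import threading

# Generated once per process at startup, similar in spirit to Vault's
# per-instance identity.
PROCESS_IDENTITY = f"{os.getpid()}-{secrets.token_hex(8)}"

def lock_identity(thread_safe=True):
    """The identity string to store in the lock object. Appending the thread
    ID makes the lock thread-safe, at the cost that only that exact thread
    can instantly recover its own abandoned lock."""
    if thread_safe:
        return f"{PROCESS_IDENTITY}/{threading.get_ident()}"
    return PROCESS_IDENTITY

main_id = lock_identity()
other = []
t = threading.Thread(target=lambda: other.append(lock_identity()))
t.start()
t.join()
assert main_id == lock_identity()   # stable within one thread
assert main_id != other[0]          # differs across threads
```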
<p>In the next section we'll put together everything we've discussed and learned so far.</p>
<h2 id="putting-the-final-algorithm-together">Putting the final algorithm together</h2>
<h3 id="taking-the-lock">Taking the lock</h3>
<p>Parameters:</p>
<ul>
<li>Object URL</li>
<li>TTL</li>
<li>An identity that's unique on a per-process basis, and optionally on a per-thread basis as well
<ul>
<li>Example format: "[process identity]". If thread-safety is desired, append "/[thread identity]".</li>
<li>Interpret the concept "thread" liberally. For example, if your language is single-threaded with cooperative multitasking using coroutines/fibers, then use the coroutine/fiber identity.</li>
</ul>
</li>
</ul>
<p>Steps:</p>
<ol>
<li>Create the object at the given URL.
<ul>
<li>Use the <code>x-goog-if-generation-match: 0</code> header.</li>
<li>Set <code>Cache-Control: no-store</code>.</li>
<li>Set the following metadata values:
<ul>
<li>Expiration timestamp (based on TTL)</li>
<li>Identity</li>
</ul>
</li>
<li>Empty contents.</li>
</ul>
</li>
<li>If creation is successful, then it means we've taken the lock.
<ul>
<li>Start refreshing the lock in the background.</li>
</ul>
</li>
<li>If creation fails with a 412 Precondition Failed error (meaning the object already exists), then:
<ol>
<li>Fetch from the object's metadata:
<ul>
<li>Update timestamp</li>
<li>Metageneration number</li>
<li>Expiration timestamp</li>
<li>Identity</li>
</ul>
</li>
<li>If the fetch in the previous step fails because the object no longer exists, then restart the algorithm from step 1 immediately.</li>
<li>If the identity equals our own, then delete the object, and immediately restart the algorithm from step 1.
<ul>
<li>When deleting, use the <code>x-goog-if-metageneration-match: [metageneration]</code> header.</li>
</ul>
</li>
<li>If the update timestamp is older than the expiration timestamp then delete the object.
<ul>
<li>Use the <code>x-goog-if-metageneration-match: [metageneration]</code> header.</li>
</ul>
</li>
<li>Otherwise, restart the algorithm from step 1 after an exponential backoff (potentially with an upper limit and jitter).</li>
</ol>
</li>
</ol>
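<p>The steps above can be condensed into the following sketch. It runs against an in-memory stand-in and omits the Cache-Control header, the empty contents, and the background refresher, but it shows the stale-lock and retake-own-lock paths:</p>

```python
import time

class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """In-memory stand-in; each lock stores a metageneration plus metadata."""
    def __init__(self):
        self.objects = {}

    def create(self, name, metadata, if_generation_match=None):
        if if_generation_match == 0 and name in self.objects:
            raise PreconditionFailed()
        self.objects[name] = {"metageneration": 1, "metadata": metadata}

    def get(self, name):
        return self.objects.get(name)

    def delete(self, name, if_metageneration_match=None):
        obj = self.objects.get(name)
        if obj is None:
            return
        if (if_metageneration_match is not None
                and obj["metageneration"] != if_metageneration_match):
            raise PreconditionFailed()
        del self.objects[name]

def take_lock(bucket, name, ttl, identity, max_tries=10):
    for attempt in range(max_tries):
        try:
            # Step 1: create with expiration timestamp and identity metadata.
            bucket.create(
                name,
                {"expires_at": time.time() + ttl, "identity": identity},
                if_generation_match=0,
            )
            return True                        # step 2: we've taken the lock
        except PreconditionFailed:
            obj = bucket.get(name)             # step 3.1: fetch the metadata
            if obj is None:
                continue                       # step 3.2: vanished; retry now
            meta = obj["metadata"]
            if meta["identity"] == identity or time.time() > meta["expires_at"]:
                # Steps 3.3/3.4: our own abandoned lock, or a stale one.
                # Delete conditionally, then retry immediately.
                try:
                    bucket.delete(
                        name, if_metageneration_match=obj["metageneration"])
                except PreconditionFailed:
                    pass
                continue
            time.sleep(0.01 * (2 ** attempt))  # step 3.5: back off (abridged)
    return False

bucket = FakeBucket()
assert take_lock(bucket, "lock", ttl=300, identity="proc-a")
assert take_lock(bucket, "lock", ttl=300, identity="proc-a")  # instant retake
assert not take_lock(bucket, "lock", ttl=300, identity="proc-b", max_tries=3)
```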
<h3 id="releasing-the-lock">Releasing the lock</h3>
<p>Parameters:</p>
<ul>
<li>Object URL</li>
<li>Identity</li>
</ul>
<p>Steps:</p>
<ol>
<li>Stop refreshing the lock in the background.</li>
<li>Delete the lock object at the given URL.
<ul>
<li>Use the <code>x-goog-if-metageneration-match: [last known metageneration]</code> header.</li>
<li>Ignore the 412 Precondition Failed error, if any.</li>
</ul>
</li>
</ol>
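<p>A sketch of the release step, showing why the 412 Precondition Failed error can be safely ignored (in-memory stand-in again):</p>

```python
class PreconditionFailed(Exception):
    pass

class FakeBucket:
    """In-memory stand-in: maps an object name to its metageneration number."""
    def __init__(self):
        self.objects = {}

    def delete(self, name, if_metageneration_match=None):
        gen = self.objects.get(name)
        if gen is None:
            return                      # already gone; nothing to do
        if gen != if_metageneration_match:
            raise PreconditionFailed()  # the lock now belongs to someone else
        del self.objects[name]

def release_lock(bucket, name, last_known_metageneration):
    try:
        bucket.delete(name, if_metageneration_match=last_known_metageneration)
    except PreconditionFailed:
        pass    # another process took the lock over: leave their lock alone

bucket = FakeBucket()
bucket.objects["lock"] = 7              # lock taken over: metageneration is 7
release_lock(bucket, "lock", 3)         # our stale handle: delete is refused
assert "lock" in bucket.objects         # the new owner's lock survives
release_lock(bucket, "lock", 7)         # matching metageneration: deleted
assert "lock" not in bucket.objects
```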
<h3 id="refreshing-the-lock">Refreshing the lock</h3>
<p>Parameters:</p>
<ul>
<li>Object URL</li>
<li>TTL</li>
<li>Refresh interval</li>
<li>Max number of times the refresh may fail consecutively</li>
<li>Identity</li>
</ul>
<p>Every <code>refresh_interval</code> seconds (until a lock release is requested, or until an unhealthy state is detected):</p>
<ol>
<li>Update the object metadata (which also updates the update timestamp).
<ul>
<li>Use the <code>x-goog-if-metageneration-match: [last known metageneration]</code> header.</li>
<li>Update the expiration timestamp metadata value, based on the TTL.</li>
</ul>
</li>
<li>If the operation succeeds, check the response, which contains the latest object metadata.
<ol>
<li>Take note of the latest metageneration number.</li>
<li>If the identity does not equal our own, then declare that the lock is unhealthy.</li>
</ol>
</li>
<li>If the operation fails because the object does not exist or because of a 412 Precondition Failed error, then declare that the lock is unhealthy.</li>
<li>If the operation fails for some other reason, then check whether this is the maximum number of times that we may fail consecutively. If so, then declare that the lock is unhealthy.</li>
</ol>
<h3 id="recommended-default-values">Recommended default values</h3>
<ul>
<li>TTL: 5 minutes</li>
<li>Refresh interval: 37 seconds</li>
<li>Max number of times the refresh may fail consecutively: 3</li>
</ul>
<h3 id="lock-usage">Lock usage</h3>
<p>Steps:</p>
<ol>
<li>Take the lock</li>
<li>Try:
<ul>
<li>If applicable:
<ul>
<li>Check whether state is consistent, and fix it if it isn't</li>
<li>Check whether lock is healthy, abort if not</li>
</ul>
</li>
<li>Perform a part of the operation</li>
<li>Check whether lock is healthy, abort if not</li>
<li>…etc…</li>
<li>If applicable: commit the operation's effects as atomically as possible</li>
</ul>
</li>
<li>Finally:
<ul>
<li>Release the lock</li>
</ul>
</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p>Distributed locks are very useful for ad-hoc system/cloud automation scripts and CI/CD pipelines. Or more generally, they're useful in any situation in which multiple systems may operate on the same state concurrently. Concurrent modifications may corrupt the state, so one needs a mechanism to ensure that only one system can modify the state at the same time.</p>
<p>Google Cloud Storage is a good system to build a distributed lock on, as long as you don't care about latency that much. By leveraging Cloud Storage's capabilities, we can build a robust distributed locking algorithm that's not too complex. What's more: it's cheap to operate, cheap to maintain, and can be used from almost anywhere.</p>
<p>The distributed locking algorithm proposed by this article builds upon existing algorithms found in other systems, and makes locking more robust.</p>
<p>Eager to use this algorithm in your next system or pipeline? Check out <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby">the Ruby implementation</a>. In the near future I also plan on releasing implementations in other languages.</p>
Docker and the host filesystem owner matching problem
https://www.joyfulbikeshedding.com/blog/2021-03-15-docker-and-the-host-filesystem-owner-matching-problem.html
2021-03-15T00:00:00+00:00 (updated 2023-04-22T18:01:13+00:00)
Hongli Lai
<p>Containers are no longer only used on servers. They are increasingly used on the desktop: as CLI apps or as development environments. I call this the <em>"container-as-OS-app"</em> use case. Within this use case, containerized apps often generate files that are not owned by your local machine's user account. Sometimes they can't access files on the host machine at all. This is the <em>host filesystem owner matching problem</em>.</p>
<ul>
<li>This is bad for security. Containers shouldn't run as root in the first place!</li>
<li>This is a potential productivity killer. It's annoying having to deal with wrong file permissions!</li>
</ul>
<p>Solutions are available, but they have major caveats. As a result it's easy to implement a solution that only works for some, but not everyone. "It works on my machine" is kind of embarrassing when you distribute a development environment to a coworker, who then runs into issues.</p>
<p>This post describes what causes the host filesystem owner matching problem, and analyzes various solutions and their caveats.</p>
<p><em><strong>Update</strong>: introducing <a href="/blog/2023-04-20-cure-docker-volume-permission-pains-with-matchhostfsowner.html">MatchHostFsOwner: a cure for the host filesystem owner matching problem</a>!</em></p>
<h2 id="what-is-the-container-as-os-app-use-case">What is the "container-as-OS-app" use case?</h2>
<p>An "OS app" is an app that:</p>
<ul>
<li>Runs on your machine (as opposed to in the browser or on a server).</li>
<li><strong>Reads or writes files from/to the host OS filesystem.</strong> Files which may later be read/written by other (non-Docker-packaged) apps, such as your text editor.</li>
</ul>
<figure>
<a href="/images/2021/host-fs-owner-matching-problem-975a5c4f.png"><img src="/images/2021/host-fs-owner-matching-problem-975a5c4f.png" class="img-largecover" alt="" /></a>
<figcaption>Traditional containerized apps vs container-as-OS-apps, and how the host filesystem matching problem only affects the latter</figcaption>
</figure>
<p>An OS app doesn't have to be graphical in nature. In fact, the kinds of OS apps that are most often containerized are CLIs. Examples of OS apps:</p>
<ul>
<li>bash</li>
<li>ls</li>
<li>Git</li>
<li>The C/Go/Rust compiler</li>
<li>Your text editor</li>
</ul>
<p>Increasingly, Docker is used to package such apps. Here are a few examples:</p>
<ul>
<li><a href="https://github.com/emk/rust-musl-builder">rust-musl-builder</a> — compilation environment for Rust that allows generating statically-linked binaries.</li>
<li><a href="http://phusion.github.io/holy-build-box/">Holy Build Box</a> — compilation environment for C/C++ that allows generating portable Linux binaries that run on any Linux distribution.</li>
</ul>
<p>Both of these examples read or write files from/to the host OS filesystem.</p>
<p>Perhaps a little counter-intuitively, <strong>many development environments also fall under this category</strong>. Let's say that you set up a development environment for your Ruby, Node.js or Go app using Docker Compose. Here's what such a Docker Compose environment often does:</p>
<ol>
<li>It mounts the project directory (on the host filesystem) into the container.</li>
<li>(In case of compiled languages:) Inside the container, it compiles the source code located in the project directory. The compilation products, or cache files, are stored under the project directory.</li>
<li>Inside the container, it launches the app, which runs until the user aborts it.</li>
<li>(For frameworks/languages where this is applicable:) If the source code on the host OS changes, then the app inside the container live-reloads the new code.</li>
<li>The app inside the container writes to log files, located under the project directory.</li>
</ol>
<p>This problem has been <a href="https://blog.gougousis.net/file-permissions-the-painful-side-of-docker/">documented before</a> by <a href="https://mydeveloperplanet.com/2022/10/19/docker-files-and-volumes-permission-denied/">other</a> <a href="https://www.reddit.com/r/docker/comments/hjsipd/permission_denied_with_volumes/">authors</a> as well.</p>
<p>Key takeaway: development environments often read or write files from/to the host OS filesystem. Files which may be read/written by other apps later.</p>
<h2 id="mismatching-filesystem-owners">Mismatching filesystem owners</h2>
<p>Many containers run apps as root. When they write to files on the host filesystem, they <em>create root-owned files on your host filesystem</em>. You can't modify these files with your host text editor without jumping through some hoops.</p>
<figure>
<a href="/images/2021/permission-denied-9704ef2c.png"><img src="/images/2021/permission-denied-9704ef2c.png" alt="" /></a>
<figcaption>Many containers run as root, creating root-owned files on the host OS's filesystem. These files cannot be accessed by normal apps on the host OS because of permission problems.</figcaption>
</figure>
<p>For example: on a Linux machine (not on macOS; see below), let's run a root container which writes a file on the host:</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">--rm</span> <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">:/host"</span> busybox <span class="nb">touch</span> /host/foo.txt
</code></pre></div>
<p>This file is owned by root, and we can't modify it:</p>
<div class="highlight"><pre class="highlight plaintext"><code>$ ls -l foo.txt
-rw-r--r-- 1 root root 0 Jan 17 10:36 foo.txt
$ echo hi > foo.txt
-bash: foo.txt: Permission denied
</code></pre></div>
<p>Some containers adhere to better security practices, and run under a normal user account. However, this creates a new problem: <em>they can't write to the host filesystem anymore</em>! This is because the host directory is only writable by the user who owns it, not by the user inside the container.</p>
<p>Here's an example container that runs under a normal user account instead of root:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> debian:10</span>
<span class="k">RUN </span>addgroup <span class="nt">--gid</span> 1234 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 1234 <span class="nt">--gid</span> 1234 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
<span class="k">USER</span><span class="s"> app</span>
</code></pre></div>
<p>We then build and run it, telling it to create a file on the host:</p>
<div class="highlight"><pre class="highlight plaintext"><code>$ docker build . -t usercontainer
$ docker run --rm -v "$(pwd):/host" usercontainer touch /host/foo.txt
touch: cannot touch `/host/foo.txt': Permission denied
</code></pre></div>
<h3 id="only-on-linux-not-on-macos">Only on Linux, not on macOS</h3>
<p>These problems are <strong>only applicable when using Docker on Linux</strong>. macOS users don't experience these problems at all, because Docker for Mac actually runs a Linux VM, and inside that VM it mounts host filesystems into the container as a network volume. It ensures that:</p>
<ul>
<li>Inside the container, all mounted files look as if they're owned by the container user.</li>
<li>On the host, all files written by the container become owned by the host user.</li>
</ul>
<p>But the fact that macOS users don't get this problem, is in itself a problem. It means that when someone creates a container-as-OS-app on macOS, and hands it over to a Linux user, then that app may not work because of the permission problems described above.</p>
<h2 id="solution-strategies-overview">Solution strategies overview</h2>
<p>There are two major strategies to solve the host filesystem owner matching problem:</p>
<ol>
<li>Matching the container's UID/GID with the host's UID/GID.</li>
<li>Remounting the host path in the container using BindFS.</li>
</ol>
<p>Each strategy has significant caveats. Let's take a look at how each strategy is implemented, and what the caveats are.</p>
<h2 id="strategy-1-matching-the-containers-uidgid-with-the-hosts">Strategy 1: matching the container's UID/GID with the host's</h2>
<p>The kernel distinguishes users and groups by two numbers: the user ID (UID) and the group ID (GID). Accounts (usernames and group names) are implemented outside the kernel, via user/group account databases that map UIDs and GIDs to usernames and group names. These databases exist in /etc/passwd and /etc/group.</p>
<p>The kernel doesn't care about names, only about UIDs and GIDs. Even files on the filesystem are not owned by <em>usernames</em> or <em>group names</em>, but by UIDs and GIDs.</p>
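<p>You can see this numeric ownership for yourself on any Linux shell, no container required. <code>ls -ln</code> prints the raw UID/GID that the kernel stores, while <code>ls -l</code> resolves those numbers through the account database (a quick sketch; <code>stat -c</code> is the GNU coreutils variant):</p>

```shell
# Create a file and inspect its ownership both ways.
tmpdir="$(mktemp -d)"
touch "$tmpdir/demo.txt"
ls -l "$tmpdir/demo.txt"   # owner/group shown as resolved names
ls -ln "$tmpdir/demo.txt"  # owner/group shown as the raw UID/GID
# stat prints just the numbers; for a file you created, they equal your own UID/GID:
owner="$(stat -c '%u:%g' "$tmpdir/demo.txt")"
echo "demo.txt is owned by UID:GID $owner"
```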
<p>So if we run an app in a container using the same UID/GID as the host account's UID/GID, then the files created by that app will be owned by the host's user and group.</p>
<ul>
<li>If the container's user/group account database already has accounts with that UID/GID, but with a different name, then that's no problem.</li>
<li>If the container's account database doesn't have accounts with that UID/GID, then that's <em>also</em> no problem.</li>
</ul>
<p>The simplest way to do this is by running <code>docker run --user &lt;HOST UID&gt;:&lt;HOST GID&gt;</code>. This works even if the container has no accounts with this UID/GID.</p>
<p>However, if there are no matching accounts in the container, then many applications won't behave well. This can range from cosmetic problems to crashes. A lot of library code assumes that the username can be queried, and aborts on failure. Another problem is that the lack of accounts means that there's no corresponding home directory, while a lot of application and library code assumes that it can read from or write to the home directory.</p>
<p>So a better way would be to create accounts inside the container with a UID/GID that matches the host's UID/GID. These accounts could have <em>any</em> names: the kernel doesn't care.</p>
<p>Let's go through a practical example to learn how container UIDs/GIDs and accounts work, and how to implement this strategy.</p>
<h3 id="example-creating-a-container-account-with-the-same-uidgid-as-the-host-account">Example: creating a container account with the same UID/GID as the host account</h3>
<p>Here's an example which shows how UIDs and GIDs work. This example must be run on Linux (because you won't run into the host filesystem owner matching problem on macOS). Let's start by figuring out what the host user's UID/GID is by running this command:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ id
uid=1000(hongli) gid=1000(hongli) groups=1000(hongli),27(sudo),999(docker)
</code></pre></div>
<p>My UID and GID are both 1000.</p>
<p>Now let's start an interactive Debian shell session. We mount the host's current working directory into the container, under /host.</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">-ti</span> <span class="nt">--rm</span> <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">:/host"</span> debian:10
</code></pre></div>
<p>Inside the Debian container's root shell, let's create two things:</p>
<ul>
<li>A group called <code>matchinguser</code>, with GID 1000.</li>
<li>A user account called <code>matchinguser</code> with UID 1000. We disable the password because it's not relevant in this example.</li>
</ul>
<div class="highlight"><pre class="highlight shell"><code>addgroup <span class="nt">--gid</span> 1000 matchinguser
adduser <span class="nt">--uid</span> 1000 <span class="nt">--gid</span> 1000 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> matchinguser
</code></pre></div>
<p>Let's use this user account to create a file in the host directory:</p>
<div class="highlight"><pre class="highlight shell"><code>apt update
apt <span class="nb">install</span> <span class="nt">-y</span> <span class="nb">sudo
sudo</span> <span class="nt">-u</span> matchinguser <span class="nt">-H</span> <span class="nb">touch</span> /host/foo2.txt
</code></pre></div>
<p>If we inspect the file permissions of /host/foo2.txt from inside the container, then we see that it's owned by <code>matchinguser</code>:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -l /host/foo2.txt
-rw-r--r-- 1 matchinguser matchinguser 0 Mar 15 09:45 /host/foo2.txt
</code></pre></div>
<p>But if we inspect the same file from the host, then we see that it's owned by the host user:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ ls -l foo2.txt
-rw-r--r-- 1 hongli hongli 0 Mar 15 09:45 foo2.txt
</code></pre></div>
<p>This is because the file has the UID and GID 1000, which in the container maps to <code>matchinguser</code>, but on the host maps to <code>hongli</code>.</p>
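<p>The names shown by <code>ls</code> are nothing more than lookups in the account database of whichever system is doing the displaying. You can perform the same lookup manually with <code>getent</code>, which queries /etc/passwd (this is a generic illustration, not specific to containers):</p>

```shell
# Print the account database entry for a given UID, if one exists.
getent passwd 0   # UID 0 maps to root on virtually every system
# Your own UID normally maps to your own account; if no entry exists,
# getent prints nothing and exits non-zero.
getent passwd "$(id -u)" || echo "(no account for UID $(id -u))"
```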
<h3 id="example-modifying-existing-container-accounts-uid">Example: modifying existing container account's UID</h3>
<p>You don't even have to create new container accounts. You can actually modify the UID/GID of existing accounts.</p>
<p>For example, let's delete <code>matchinguser</code>, and recreate it with UID/GID 1500:</p>
<div class="highlight"><pre class="highlight shell"><code>apt <span class="nb">install</span> <span class="nt">-y</span> perl <span class="c"># needed by deluser on Debian</span>
deluser <span class="nt">--remove-home</span> matchinguser
addgroup <span class="nt">--gid</span> 1500 matchinguser
adduser <span class="nt">--uid</span> 1500 <span class="nt">--gid</span> 1500 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> matchinguser
</code></pre></div>
<p>We can then use <code>usermod</code> and <code>groupmod</code> to change those accounts' UID/GID to 1000:</p>
<div class="highlight"><pre class="highlight shell"><code>groupmod <span class="nt">--gid</span> 1000 matchinguser
usermod <span class="nt">--uid</span> 1000 matchinguser
</code></pre></div>
<h3 id="implementation-and-caveats">Implementation and caveats</h3>
<p>Here's a simple implementation strategy. If your container doesn't need precreated accounts, then you can do it as follows:</p>
<ul>
<li>Add an entrypoint script which creates a user/group account, whose UID/GID equal the host account's UID/GID.</li>
<li>The entrypoint script requires two environment variables, <code>HOST_UID</code> and <code>HOST_GID</code>, which specify what the host account's UID and GID are.</li>
<li>The entrypoint then executes the next container command, under the newly created user/group accounts.</li>
<li>Users must run the container with root privileges, with the environment variables <code>HOST_UID</code> and <code>HOST_GID</code>. The container is responsible for dropping privileges.</li>
</ul>
<p>If your container requires a precreated account, then you need to modify the strategy a little bit:</p>
<ul>
<li>Instead of creating a new account, the entrypoint script modifies the UID/GID of the precreated user account, to the host account's UID/GID.</li>
</ul>
<p>Here's an example of a naive entrypoint script. The container account that we want to use is called <code>app</code>.</p>
<div class="highlight"><pre class="highlight shell"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_UID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi
if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_GID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi</span>
<span class="c"># Use this code if you want to create a new user account:</span>
addgroup <span class="nt">--gid</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> app
adduser <span class="nt">--uid</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="nt">--gid</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
<span class="c"># -OR-</span>
<span class="c"># Use this code if you want to modify an existing user account:</span>
groupmod <span class="nt">--gid</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> app
usermod <span class="nt">--uid</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> app
<span class="c"># Drop privileges and execute next container command, or 'bash' if not specified.</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nv">$# </span><span class="nt">-gt</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">else
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> bash
<span class="k">fi</span>
</code></pre></div>
<p>The above entrypoint script is a good attempt, but fails to consider these significant caveats:</p>
<ol>
<li>
<p><strong>What if there's already another container user/group, with the same UID/GID as the host UID/GID?</strong></p>
<p>Then it's not possible to create a new user account/group with the host UID/GID.</p>
<p>One way to deal with this is by deleting the conflicting container user/group. However, depending on which account exactly is deleted (and what that account is used for inside the container), this could degrade the behavior of the container in unpredictable ways.</p>
<p>As a general rule of thumb, accounts with UID &lt; 1000, and groups with GID &lt; 1000, are considered system accounts and groups. System accounts/groups are managed by the OS maintainers, and are not supposed to be messed with by users of the OS.</p>
<p>In contrast, accounts/groups with UID/GID >= 1000 are "normal accounts"/"normal groups", not managed by the OS maintainers. Users of the OS are free to do whatever they like with those accounts. But here you have to ask yourself: who, in this context, are "users of the OS"? If it's only yourself, and you have full control over which normal accounts go into your container: then there's no problem. But if you're using a base image supplied by someone else, and the base image already comes with precreated normal accounts, then you have to ask yourself whether it's safe to modify them.</p>
</li>
<li>
<p><strong>What if the host user is root?</strong></p>
<p>The host user being root (with UID 0) is a special case that you need to deal with. It's not a good idea to delete the existing root account in the container and replace it with another account. So if the entrypoint script detects that the host UID is 0, then it should run the next command as root.</p>
<p>But on weird systems, the host's root user could have a non-zero GID! So if the entrypoint script detects that the host UID is 0 but the host GID is non-zero, then it should modify the root group's GID. This in turn could run into the problem described by (1): what if there's already another group with the same GID?</p>
</li>
<li>
<p><strong>In case of precreated accounts: what about the files they own?</strong></p>
<p>If your container makes use of a precreated account, then after you modify that account's UID and GID, you should ask yourself what you should do about files that were owned by that account. Should those files' UID/GID be updated to match the new UID/GID?</p>
<p>The Debian version of <code>usermod --uid</code> automatically updates the UIDs of all files in that account's home directory (recursively). However, <code>groupmod</code> does not update the GIDs, so you need to do that yourself from your entrypoint script.</p>
<p><code>usermod --uid</code> does not update the UIDs of files outside that account's home directory. It's up to your entrypoint script to update those files, if any.</p>
<p>Furthermore, you should ask yourself whether it's a good idea to update the UIDs of those files. If those files are world-readable, and your container never writes to them, then updating their UIDs/GIDs is unnecessary. If there are <em>many</em> files, then updating their UIDs/GIDs can take a significant amount of time. I ran into this very problem when using <a href="https://github.com/emk/rust-musl-builder">rust-musl-builder</a>. Rust was installed via <code>rustup</code> into the home directory, and updating the UIDs/GIDs of <code>~/.rustup</code> took a lot of time.</p>
<p>Perhaps it's only necessary to update the UIDs/GIDs of specific files. For example, only the files immediately in the home directory, not recursively. This must be judged on a per-container basis.</p>
<p>Finally, some Linux kernel versions have bugs in OverlayFS. Updating the UIDs/GIDs of existing files doesn't always work. This can be worked around by making a copy of those files, removing the original files, and renaming the copies to their original names.</p>
</li>
<li>
<p><strong>Requires root privileges</strong></p>
<p>The simple example entrypoint script is responsible for creating and modifying accounts, which requires root privileges. It's also responsible for dropping privileges to a normal account. However, this means that we can't use the <code>USER</code> instruction in the Dockerfile. Furthermore, users can't run the container with the <code>--user</code> flag, which is counter-intuitive and may make some users wary about the container's security.</p>
<p>One solution is to make the entrypoint program a <em>setuid root executable</em>. This means turning on the "setuid filesystem bit" on the entrypoint program, so that when the entrypoint program runs, it gains root privileges, even if the program was started by a non-root user.</p>
<p>The setuid bit is only used by a few select programs that are involved in privilege escalation. For example, Sudo uses the setuid bit. As you can imagine, the setuid bit is very dangerous: without sufficient care, anyone could gain root privileges without authentication. A setuid root program must be specifically written to make abuse impossible.</p>
<p>Another complication is that the setuid root bit does not work on shell scripts, only on "real" executables! So if you want to make use of this bit, you'll have to write the entrypoint program in a language that compiles to native executables, like C, C++, Rust or Go.</p>
<p>Under what conditions is it safe to run a setuid root entrypoint program? One answer is: if the entrypoint's PID is 1. This means it's the very first program run in the container. This indicates that the entrypoint program is run directly by <code>docker run</code>, so we can assume that it's a safe-ish environment.</p>
<p>But checking for PID 1 doesn't work in combination with <code>docker run --init</code>, which spawns an init process (whose job is to solve the <a href="https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/">PID 1 zombie reaping problem</a>). The init process can perform arbitrary work, and execute arbitrary processes before it executes our entrypoint program. So we can't assume that our PID is 2 either. Instead, we can check whether we're a direct child of the init process, because after the init process executes the next command, it won't execute any further commands.</p>
</li>
<li>
<p><strong>Requires extra environment variables</strong></p>
<p>In the ideal world, we want users to be able to run our container with <code>docker run --user HOST_UID:HOST_GID</code>, and have the container's entrypoint automatically figure out that the values passed to <code>--user</code> are the host UID/GID.</p>
<p>But our example entrypoint script requires the user to specify that information through environment variables. So users have to pass redundant parameters, like this:</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">HOST_UID</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">HOST_GID</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">--user</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-u</span><span class="si">)</span><span class="s2">:</span><span class="si">$(</span><span class="nb">id</span> <span class="nt">-g</span><span class="si">)</span><span class="s2">"</span> <span class="se">\</span>
...
</code></pre></div>
<p>This is not a good user experience.</p>
</li>
</ol>
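<p>The GID fix-up from caveat 3 (which <code>groupmod</code> doesn't do for you) can be sketched as a small shell function. The function name and arguments are my own invention; treat it as a starting point rather than a complete solution:</p>

```shell
# Sketch: re-own files that still carry a group's old GID after groupmod.
# Usage: fix_group_ownership <dir> <old_gid> <new_gid>
fix_group_ownership() {
  local dir="$1" old_gid="$2" new_gid="$3"
  # find's -group also accepts a numeric GID; chgrp -h avoids following symlinks.
  find "$dir" -group "$old_gid" -exec chgrp -h "$new_gid" '{}' +
}
```

<p>Running this over a large directory tree has the same performance cost as the recursive <code>usermod --uid</code> behavior described above, so apply it selectively.</p>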
<p>With the above caveats, the entrypoint script is no longer trivial. If you want to solve caveat 4, then the entrypoint can't even be a shell script anymore.</p>
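<p>Caveat 5 can be softened somewhat: when users start the container with <code>docker run --user</code>, the entrypoint can read its own UID/GID directly instead of demanding environment variables. A sketch of that idea (my own simplification; a real entrypoint still needs the privilege handling from caveat 4):</p>

```shell
# Prefer explicitly passed values; otherwise fall back to the UID/GID
# that this process was started with (as set by `docker run --user`).
HOST_UID="${HOST_UID:-$(id -u)}"
HOST_GID="${HOST_GID:-$(id -g)}"
echo "Matching host account ${HOST_UID}:${HOST_GID}"
```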
<h2 id="strategy-2-remounting-the-host-path-in-the-container-using-bindfs">Strategy 2: remounting the host path in the container using BindFS</h2>
<p><a href="https://bindfs.org/">BindFS</a> is a <a href="http://en.wikipedia.org/wiki/Filesystem_in_Userspace">FUSE</a> filesystem that allows us to mount a directory in another path, with different filesystem permissions. BindFS doesn't change the original filesystem permissions: it just exposes an alternative view that looks as if all the permissions are different.</p>
<p>So a container can use BindFS to create an alternative view of the host directory. In this alternative view, everything is owned by a normal account in the container (whose UID/GID doesn't have to match the host's). When the container uses that account to write to the alternative view, then the created files are still owned by the original directory's owner.</p>
<p>Thus, BindFS allows two-way mapping between the host's UID/GID and the container's UID/GID, in a way that's transparent to applications.</p>
<figure>
<a href="/images/2021/bindfs-c1f3b36c.png"><img src="/images/2021/bindfs-c1f3b36c.png" alt="" /></a>
<figcaption>BindFS provides an alternative view of an existing mount. This alternative view can have any permissions, specified by mount options.</figcaption>
</figure>
<h3 id="bindfs-in-action">BindFS in action</h3>
<p>Let's take a look at how BindFS works. Remember: this example must be run on Linux, because the host filesystem owner matching problem does not appear on macOS.</p>
<p>First, let's figure out what the host user's UID/GID is:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ id
uid=1000(hongli) gid=1000(hongli) groups=1000(hongli),27(sudo),999(docker)
</code></pre></div>
<p>Next, run a Debian 10 container that mounts the current working directory into <code>/host</code> in the container. Be sure to pass <code>--privileged</code> so that FUSE works.</p>
<div class="highlight"><pre class="highlight shell"><code>docker run <span class="nt">-ti</span> <span class="nt">--rm</span> <span class="nt">--privileged</span> <span class="nt">-v</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span><span class="s2">:/host"</span> debian:10
</code></pre></div>
<p>Once you're in the container, install BindFS:</p>
<div class="highlight"><pre class="highlight shell"><code>apt update
apt <span class="nb">install</span> <span class="nt">-y</span> bindfs
</code></pre></div>
<p>Next, create a user account in the container to play with:</p>
<div class="highlight"><pre class="highlight shell"><code>addgroup <span class="nt">--gid</span> 1234 app
adduser <span class="nt">--uid</span> 1234 <span class="nt">--gid</span> 1234 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
</code></pre></div>
<p>Let's use BindFS to mount <code>/host</code> to <code>/host.writable-by-app</code>.</p>
<div class="highlight"><pre class="highlight shell"><code><span class="nb">mkdir</span> /host.writable-by-app
bindfs <span class="nt">--force-user</span><span class="o">=</span>app <span class="nt">--force-group</span><span class="o">=</span>app <span class="nt">--create-for-user</span><span class="o">=</span>1000 <span class="nt">--create-for-group</span><span class="o">=</span>1000 <span class="nt">--chown-ignore</span> <span class="nt">--chgrp-ignore</span> /host /host.writable-by-app
</code></pre></div>
<p>Here's what the flags mean:</p>
<ul>
<li><code>--force-user=app</code> and <code>--force-group=app</code> mean: make everything in /host look as if they're owned by the user/group named <code>app</code>.</li>
<li><code>--create-for-user=1000</code> and <code>--create-for-group=1000</code> mean: when a new file is created, make it owned by UID/GID 1000 (the host's UID/GID).</li>
<li><code>--chown-ignore</code> and <code>--chgrp-ignore</code> mean: ignore requests to change a file's owner/group. Because we want all files to be owned by the host's UID/GID.</li>
</ul>
<p>When you look at the permissions of the two directories, you see that one is owned by the host's UID/GID, and the other by <code>app</code>:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -ld /host /host.writable-by-app
drwxr-xr-x 18 1000 1000 4096 Mar 15 10:10 /host
drwxr-xr-x 18 app app 4096 Mar 15 10:10 /host.writable-by-app
</code></pre></div>
<p>Let's see what happens if we use the <code>app</code> account to create a file in both directories. First, install sudo:</p>
<div class="highlight"><pre class="highlight shell"><code>apt <span class="nb">install</span> <span class="nt">-y</span> <span class="nb">sudo</span>
</code></pre></div>
<p>Then:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# sudo -u app -H touch /host/foo3.txt
touch: cannot touch '/host/foo3.txt': Permission denied
root@container:/# sudo -u app -H touch /host.writable-by-app/foo3.txt
</code></pre></div>
<p>Creating a file in /host doesn't work: <code>app</code> doesn't have permissions. But creating a file in /host.writable-by-app <em>does</em> work.</p>
<p>If you look at the file in /host.writable-by-app, then you see that it's owned by <code>app</code>:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -l /host.writable-by-app/foo3.txt
-rw-r--r-- 1 app app 0 Mar 16 11:06 /host.writable-by-app/foo3.txt
</code></pre></div>
<p>But if you look at the file in /host, then you see that it's owned by the host's UID/GID:</p>
<div class="highlight"><pre class="highlight plaintext"><code>root@container:/# ls -l /host/foo3.txt
-rw-r--r-- 1 1000 1000 0 Mar 16 11:06 /host/foo3.txt
</code></pre></div>
<p>This is corroborated by the host. If you exit the container and look at foo3.txt, then you see that it's owned by the host's user:</p>
<div class="highlight"><pre class="highlight plaintext"><code>hongli@host$ ls -l foo3.txt
-rw-r--r-- 1 hongli hongli 0 Mar 16 12:06 foo3.txt
</code></pre></div>
<h3 id="implementation">Implementation</h3>
<p>A container that wishes to use the BindFS strategy should have the necessary tools installed, and should include a precreated normal user account. For example:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> debian:10</span>
<span class="k">ADD</span><span class="s"> entrypoint.sh /</span>
<span class="k">RUN </span>apt update <span class="o">&&</span> <span class="se">\
</span> apt <span class="nb">install</span> <span class="nt">-y</span> bindfs <span class="nb">sudo</span> <span class="o">&&</span> <span class="se">\
</span> addgroup <span class="nt">--gid</span> 1234 app <span class="o">&&</span> <span class="se">\
</span> adduser <span class="nt">--uid</span> 1234 <span class="nt">--gid</span> 1234 <span class="nt">--gecos</span> <span class="s2">""</span> <span class="nt">--disabled-password</span> app
<span class="k">ENTRYPOINT</span><span class="s"> ["/entrypoint.sh"]</span>
</code></pre></div>
<p>Then:</p>
<div class="highlight"><pre class="highlight shell"><code>docker build <span class="nb">.</span> <span class="nt">-t</span> bindfstest
</code></pre></div>
<p>The entrypoint script could be as follows. In this example, the entrypoint script assumes that the container is started with <code>/host</code> being mounted to a host directory.</p>
<div class="highlight"><pre class="highlight shell"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_UID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi
if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"ERROR: please set HOST_GID"</span> <span class="o">></span>&2
<span class="nb">exit </span>1
<span class="k">fi
</span><span class="nb">mkdir</span> /host.writable-by-app
bindfs <span class="nt">--force-user</span><span class="o">=</span>app <span class="nt">--force-group</span><span class="o">=</span>app <span class="se">\</span>
<span class="nt">--create-for-user</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOST_UID</span><span class="s2">"</span> <span class="nt">--create-for-group</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOST_GID</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">--chown-ignore</span> <span class="nt">--chgrp-ignore</span> <span class="se">\</span>
/host /host.writable-by-app
<span class="c"># Drop privileges and execute next container command, or 'bash' if not specified.</span>
<span class="k">if</span> <span class="o">[[</span> <span class="nv">$# </span><span class="nt">-gt</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">else
</span><span class="nb">exec sudo</span> <span class="nt">-u</span> app <span class="nt">-H</span> <span class="nt">--</span> bash
<span class="k">fi</span>
</code></pre></div>
<p>The container is then run as follows:</p>
<div class="highlight"><pre class="highlight plaintext"><code>docker run -ti --rm --privileged \
-v "/some-host-path:/host" \
-e "HOST_UID=$(id -u)" \
-e "HOST_GID=$(id -g)" \
bindfstest
</code></pre></div>
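<p>To spare users from typing this UID/GID plumbing every time, you could ship a small wrapper script next to the image. The function and image names below are my own; adapt them to your setup:</p>

```shell
# Hypothetical wrapper around the `docker run` invocation above: it fills
# in the calling user's UID/GID so the entrypoint can set up BindFS.
run_bindfstest() {
  docker run -ti --rm --privileged \
    -v "/some-host-path:/host" \
    -e "HOST_UID=$(id -u)" \
    -e "HOST_GID=$(id -g)" \
    bindfstest "$@"
}
```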
<h3 id="caveats">Caveats</h3>
<p>BindFS works very well. But there are two caveats:</p>
<ul>
<li>It requires privileged mode, because FUSE requires it. This might be a security concern.</li>
<li>The container cannot be started as non-root, although it's possible to work around this problem with a setuid root entrypoint program, as described in strategy 1, caveat 4.</li>
</ul>
<p>Some Internet sources say that <code>--privileged</code> can be replaced with <code>--device /dev/fuse --cap-add SYS_ADMIN</code>. However:</p>
<ul>
<li>The <code>SYS_ADMIN</code> capability is not much better than <code>--privileged</code> from a security perspective.</li>
<li>This trick doesn't work on Docker for Mac. It results in an error.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>There are two major strategies to solve the host filesystem owner matching problem:</p>
<ol>
<li>Matching the container's UID/GID with the host's UID/GID.</li>
<li>Remounting the host path in the container using BindFS.</li>
</ol>
<p>Both strategies have their own benefits and drawbacks.</p>
<ul>
<li>Using BindFS is easy to implement by yourself, but requires starting the container with root privileges, and in privileged mode.</li>
<li>Running the container in a matching UID/GID does not require privileged mode. It also allows the container to run without root privileges. But it is hard to implement if you want to address all caveats.</li>
</ul>
<p>BindFS's caveats can't be solved. But the caveats related to "matching the container UID/GID with the host's" <em>can</em> be solved, even if it takes quite a lot of engineering.</p>
<p>Armed with the knowledge provided by this article, you'll be able to build a solution yourself. But wouldn't it be nice if you could use a solution already made by someone else, especially one that uses strategy 1, which is hard to implement? I wrote exactly such a tool. Check out <a href="/blog/2023-04-20-cure-docker-volume-permission-pains-with-matchhostfsowner.html">MatchHostFsOwner: a cure for the host filesystem owner matching problem</a>!</p>
<p><small><i>The Docker icon used in this article's illustrations is made by <a href="https://www.iconfinder.com/icons/4373190/docker_logo_logos_icon">Flatart</a>.</i></small></p>
Traveling Ruby 20210206: maintenance update featuring Ruby 2.4
https://www.joyfulbikeshedding.com/blog/2021-02-06-traveling-ruby-20210206-released.html
2021-02-06T00:00:00+00:00 (updated 2023-04-22T18:01:13+00:00)
Hongli Lai<p><a href="http://phusion.github.io/traveling-ruby">Traveling Ruby</a> allows you to easily ship Ruby apps to end users. It lets you create self-contained Ruby app packages that run on multiple versions of Windows, Linux and macOS.</p>
<p>Today I’ve released version 20210206. This release supports Ruby 2.4, bumps all the gem versions, bumps the minimum supported macOS and Linux versions, and fixes some bugs.</p>
<p>It has been a <em>long</em> time since the last release. So this post also addresses an elephant in the room: is Traveling Ruby back?</p>
<blockquote>
<p><a href="http://phusion.github.io/traveling-ruby">Traveling Ruby</a> allows you to easily ship Ruby apps to end users. It lets you create self-contained Ruby app packages that run on multiple versions of Windows, Linux and macOS.</p>
</blockquote>
<p>Today I've released <a href="http://phusion.github.io/traveling-ruby">Traveling Ruby</a> version 20210206. This release supports Ruby 2.4, bumps all the gem versions, bumps the minimum supported macOS and Linux versions, and fixes some bugs. You can find the exact changelog below.</p>
<h2 id="the-elephant-in-the-room">The elephant in the room</h2>
<p>A more interesting question that the community will probably ask is: is Traveling Ruby back? After all, it has been a <em>long</em> time since the last release.</p>
<p>The answer is no. I <a href="/blog/2021-01-06-the-future-of-traveling-ruby.html">blogged earlier about why Traveling Ruby stopped being maintained</a>, and what a potential way forward would look like. Reviving Traveling Ruby is an effort that takes much more energy than just this release, and right now I do not have the resources to push such an effort.</p>
<p>So this release is meant to be a quick, conservative maintenance release. It was supposed to contain the minimal set of changes needed to make Traveling Ruby releasable again on modern Linux and macOS systems, though these changes ended up being <a href="https://github.com/phusion/traveling-ruby/commits/rel-20210206">pretty extensive</a>.</p>
<p>The previous release was based on Ruby 2.2. Because of the conservative nature of this latest release, I upgraded to the oldest Ruby version (newer than 2.2) that is compilable on modern Linux and macOS systems. And that's Ruby 2.4.</p>
<p>This choice has a bunch of downsides. Besides missing the latest Ruby features, not all gems are compatible with Ruby 2.4, so I didn't upgrade the gems to their very latest versions. This has security implications. For example, we ship nokogiri 1.10, but this version <a href="https://github.com/phusion/traveling-ruby/pull/108">has a vulnerability</a> that's fixed in 1.11. Unfortunately, 1.11 requires Ruby 2.5.</p>
<p>This release is mainly meant for <a href="https://github.com/phusion/traveling-ruby/pull/94#issuecomment-754371791">existing Traveling Ruby users</a>, to address their most urgent needs. But more effort is needed to <em>really</em> bring Traveling Ruby to a good state.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>On Linux, dropped support for x86. Only x86_64 is now supported.</li>
<li>On Windows, dropped support for x86. Only x64 is now supported.</li>
<li>The minimum supported macOS version is now 10.14 Mojave.</li>
<li>The minimum supported Linux version is now RHEL 7 / CentOS 7 / Debian 8 / Ubuntu 14.04 / glibc 2.17.</li>
<li>Fixed support for paths containing spaces. Contributed by Ville Immonen (@fson) in <a href="https://github.com/phusion/traveling-ruby/pull/94">PR #94</a>. Closes <a href="https://github.com/phusion/traveling-ruby/issues/38">issue #38</a>.</li>
<li>Upgraded CA certificates from that of CentOS 5 to that of CentOS 8.</li>
<li>Upgraded OpenSSL to 1.1.1i.</li>
<li>Upgraded GMP to 6.2.1.</li>
<li>Upgraded libssh2 to 1.9.0.</li>
<li>Upgraded bundler gem to version 1.17.3.</li>
<li>Upgraded bcrypt gem to 3.1.16.</li>
<li>Upgraded charlock_holmes gem to 0.7.7.</li>
<li>Upgraded curses gem to 1.4.0.</li>
<li>Upgraded escape_utils gem to 1.2.1.</li>
<li>Upgraded fast-stemmer gem to 1.0.2.</li>
<li>Upgraded ffi gem to 1.14.2.</li>
<li>Upgraded hitimes gem to 2.0.0.</li>
<li>Upgraded json gem to 2.5.1.</li>
<li>Upgraded kgio gem to 2.11.3.</li>
<li>Upgraded mysql2 gem to 0.5.3.</li>
<li>Upgraded nokogiri gem to 1.10.10.
<ul>
<li>On macOS: upgraded libxml2 to 2.9.10.</li>
<li>On macOS: upgraded libxslt to 1.1.34.</li>
</ul>
</li>
<li>Upgraded nokogumbo gem to 1.5.0.</li>
<li>Upgraded pg gem to 1.2.3.
<ul>
<li>Upgraded libpq to 13.1.</li>
</ul>
</li>
<li>Upgraded posix-spawn gem to 0.3.15.</li>
<li>Upgraded puma gem to 5.1.1.</li>
<li>Upgraded raindrops gem to 0.19.1.</li>
<li>Upgraded redcarpet gem to 3.5.1.</li>
<li>Upgraded RedCloth gem to 4.3.2.</li>
<li>Upgraded rugged gem to 1.1.0.</li>
<li>Upgraded sqlite3 gem to 1.4.2.
<ul>
<li>Upgraded libsqlite3 to 2020-3340000.</li>
</ul>
</li>
<li>Upgraded thin gem to 1.8.0.</li>
<li>Upgraded unf_ext gem to 0.0.7.7.</li>
<li>Upgraded unicorn gem to 5.8.0.</li>
<li>Upgraded yajl-ruby gem to 1.4.1.</li>
<li>Dropped github-markdown gem.</li>
</ul>