One of the projects I'm working on has a CI/CD pipeline that builds Docker images. The Dockerfile runs yarn install, then yarn build. The latter runs the TypeScript compiler tsc. Everything was working fine, but one day the build failed with the following error:

tsc: command not found

But TypeScript is still part of package.json. Nobody touched package.json or yarn.lock recently. Nobody could reproduce the problem locally with Docker: it only happened in the CI/CD pipeline. What is going on? We needed to debug the Docker build on the CI/CD server.

How Docker build works

docker build works by running each Dockerfile step in a container. At the end of each step, that container is committed to a new image. The first step's container is created from the image specified in FROM. Subsequent steps' containers are created from the image produced by the previous step.

For example take a look at the following docker build session:

A docker build session without cache hits

(1) means that step 4 is being run in a container with ID 6d1e7bc3e824.

(2) means that at the end of step 4, the container was committed to an image with ID b5184c19fdf0.

Caching

docker build supports caching. If the same build step had already run successfully in the past, then that step will be skipped (no container will be spawned), and the image produced last time will be selected.

If a certain step does not hit the cache, then all subsequent steps won't hit the cache either.

Here's an example of what happens if we re-run the docker build shown in the previous section.

Sending build context to Docker daemon  7.168kB
Step 1/6 : FROM node:10.15.1-alpine
 ---> fe6ff768f798
Step 2/6 : WORKDIR /app
 ---> Using cache
 ---> d9b21154b260
Step 3/6 : COPY package.json yarn.lock /app/
 ---> Using cache
 ---> da8e5d5d9c3e
Step 4/6 : RUN yarn install --frozen-lockfile
 ---> Using cache
 ---> b5184c19fdf0
Step 5/6 : COPY hello.ts /app/
 ---> Using cache
 ---> 6cc3ceaed6db
Step 6/6 : RUN yarn run tsc hello.ts && node hello.js
 ---> Using cache
 ---> 70eeee8d9ca6
Successfully built 70eeee8d9ca6

As you can see, there are no container IDs in the above output.

Inspecting the contents of a step

Image that one day, your CI build fails like this:

Sending build context to Docker daemon  7.168kB
Step 1/6 : FROM node:10.15.1-alpine
 ---> fe6ff768f798
Step 2/6 : WORKDIR /app
 ---> Using cache
 ---> d9b21154b260
Step 3/6 : COPY package.json yarn.lock /app/
 ---> Using cache
 ---> f72815b0addf
Step 4/6 : RUN yarn install --frozen-lockfile
 ---> Using cache
 ---> 9ac88c8d0b9d
Step 5/6 : COPY hello.ts /app/
 ---> Using cache
 ---> 27132fdd7a27
Step 6/6 : RUN yarn run tsc hello.ts && node hello.js
 ---> Running in 20f3e5cebec5
yarn run v1.13.0
$ tsc hello.ts
tsc: command not found
error Command failed with exit code 127.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
The command '/bin/sh -c yarn run tsc hello.ts && node hello.js' returned a non-zero code: 1

And yet you cannot reproduce this locally. Perhaps we can find out the cause of this issue by inspecting the environment in which yarn run tsc hello.ts is run. To do this, we should login to the CI server and start a shell in the container that executed step 6.

We can see in the output that the step ran in container 20f3e5cebec5. That step failed, so it didn't produce an image. The container has already exited, so we can't directly start a shell in that container. So we should commit the container to a temporary image, and then start a shell in that temporary image.

On the CI server, we run:

ci-server /ci-workspace$ docker commit 20f3e5cebec5 tempimagename
ci-server /ci-workspace$ docker run -ti --rm tempimagename sh
container /app$ _

We entered the container! Let's see whether we can reproduce the problem there:

container /app$ yarn run tsc hello.ts
yarn run v1.13.0
$ tsc hello.ts
tsc: not found
error Command failed with exit code 127.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Yes we can. But why would tsc not be found when yarn install was run earlier? Let's inspect node_modules:

container /app$ ls node_modules
ls: node_modules: No such file or directory

Uh oh, node_modules isn't there. Maybe we can find a clue if we inspect the environment in which yarn install was run. We saw in the docker build output that step 4 matched cached image 27132fdd7a27. So let's exit the previous container and start a shell in image 27132fdd7a27:

ci-server /ci-workspace$ docker run -ti --rm 27132fdd7a27 sh
container /app$ ls node_modules
ls: node_modules: No such file or directory

node_modules isn't there either!

Analyzing the problem

Based on the observed state, I suspect that when yarn install was run, it actually failed, but did not report an error. It didn't produce a node_modules directory, but docker build thought the step succeeded, and so it cached the (corrupt) result of the step. This explains why the CI build kept failing and why it couldn't be reproduced locally.

The problem was solved by clearing the Docker cache on the CI server. This was done by removing all Docker images:

docker rmi -f $(docker images -aq)

Why Yarn didn't report an error is still a mystery. Maybe we'll find out one day.

Conclusion

We can debug a docker build by inspecting the environment that the build steps were being run in.

If a step produced an image or matched a cached image, then launch a shell container from that image directly:

docker run -ti --rm <IMAGE ID> sh

If a step didn't produce an image (e.g. it failed), then commit that step's container to an image and launch a shell container from that temporary image:

docker commit <CONTAINER ID> tempimagename
docker run -ti --rm tempimagename sh

Good luck debugging!

Discuss this on Reddit