Why piping to 'cat' can stuck a CI job 🙀

What's the difference between command and command | cat? There shouldn't be any, right? The first prints the output directly, and the latter prints output via cat, but they should have the same effect. Not so: the latter can cause the command to get stuck indefinitely. This has given a particular CI pipeline of mine quite some headache.

To learn why this happens, and how we can mitigate this problem, we need to dive into the arcane magic that is Unix process management. Join me on this journey.

It started with an innocent pipe 👼

I have a CI job that uses Cypress — an end-to-end Javascript testing framework that works by controlling a headless browser — to perform integration tests.

At some point, I wanted to parse the Cypress output in order to extract information about the Cypress run, such as how many times failed tests were retried. I also wanted the Cypress output to show up in my CI pipeline in real time, rather than only after the Cypress run was completed.

So I chose to pipe Cypress's output to tee. Tee is a tool similar to cat: it prints everything consumed from standard input, but also writes a copy to a file, which we can inspect later. My code looked like this:

cypress run | tee cypress.log
parse cypress.log

After this change, my CI job started to get stuck occasionally. 🤷‍♂️ Thinking I went insane, I double checked a couple of things:

Was the OS upgraded recently? Nope.
Was a dependency upgraded recently? Nope.
Does upgrading Cypress and Headless Chrome make the problem go away? Nope.
Does removing the tee pipe make the problem go away? Yes.

I noticed a pattern: when the CI job gets stuck, then Cypress was always in a complete state: all tests have finished, and Cypress has written its final output. In other words, it was as if Cypress was stuck during some shutdown routine.

I started poking around in the CI server and observed a few peculiar things:

The main Cypress process is gone, but tee is still alive.
A few Cypress subprocesses, such as Headless Chrome and a bunch of Cypress helper processes, are still alive.

When I put all of this together, and thought about how Unix process management works, a lightbulb started to shine. 💡 How could I not have seen it coming?

What's going on

When we run cypress run | tee cypress.log, Bash waits for cypress and for tee to finish. The caveat lies in the latter.

Tee finishes when it is done reading all its standard input.

Question: where does tee's standard input come from?

Answer: from cypress run's standard output. But not only cypress run. When a Unix process creates subprocesses, all subprocesses inherit the standard output channel. In other words, all cypress run subprocesses' standard output are connected to tee.

Here's how the situation looks like right after we ran cypress run | tee cypress.log:

Bash waits for 'cypress run' and 'tee'. Cypress and all its subprocesses' stdout are connected to tee's stdin.

When Cypress has finished running all its tests, cypress run exits. But sometimes, for reasons that are still not clear to me, Cypress' subprocesses don't exit. But because the subprocesses' standard outputs are still connected to tee's standard input, tee does not exit either, and so bash ends up waiting for tee indefinitely:

Tee keeps waiting for all Cypress subprocesses to go away.

Of course, it's not good that Cypress subprocesses don't exit. And the cause for that should be investigated. But this case shows how insidious Unix can be, and how small, seemingly-innocent changes can end up behaving very differently from expectations.

Reproducing with 'cat' 🐱

In order to confirm that I was right, I ran the following experiment, which recreates the above situation without using Cypress. This experiment only uses bash and sleep.

(sleep 999 & echo hi)

The above command spawns a subshell, which spawns sleep 999 in the background, then prints 'hi' and exits. When I ran the above, I observed that the bash prompt returns immediately:

$ date
Tue Jul 21 18:19:04 CEST 2020
$ (sleep 999 & echo hi)
hi
$ date
Tue Jul 21 18:19:07 CEST 2020

I then reran the same experiment, piped to cat:

$ (sleep 999 & echo hi) | cat
hi
<...never exits...>

When we run ps aux in another terminal, we see that both 'sleep' and 'cat' are still alive:

hongli     19145    0.0   sleep 999
hongli     19144    0.0   cat

If not 'tee', then what? 🤔

So far, I haven't been able to find out why Cypress subprocesses get stuck. But the CI job runs inside an ephemeral Docker container anyway, and all processes will be cleaned up after the CI job finishes anyway, so figuring out why the subprocesses get stuck wasn't terribly important. A more important question was: if I could not pipe to tee, then what's the alternative?

The solution that I came up with was this:

# Run Cypress in the background, redirecting its output to a file.
cypress run > cypress.log &

# Capture the PID of 'cypress run'.
local CYPRESS_PID=$!

# Watch the Cypress log file, print (in near-real-time) any new
# data written to the file.
# --pid means: exit tail when 'cypress run' exits.
tail -f --pid=$CYPRESS_PID cypress.log

# `cypress run` has how exited, but we still need to explicitly
# wait for it from Bash, otherwise its dead process entry will stay
# behind in the kernel process list.
wait $CYPRESS_PID

Congratulations, you are now master of the arcane 🧙‍♂️

Lots of people praise Unix for its "simplicity", and Unix pipes for their "elegance". I disagree. If you know how things actually work behind the scenes, then you will see that there are a lot of weird corner cases. Unix can be very insidious: small, seemingly-innocent changes can end up behaving very differently from expectations.

But don't let that stop you from doing your job. At least now, you know how to avoid this particular caveat.