What's the difference between
command | cat? There shouldn't be any, right? The first prints the output directly, and the latter prints output via cat, but they should have the same effect. Not so: the latter can cause the command to get stuck indefinitely. This has given a particular CI pipeline of mine quite some headache.
To learn why this happens, and how we can mitigate this problem, we need to dive into the arcane magic that is Unix process management. Join me on this journey.
It started with an innocent pipe 👼
At some point, I wanted to parse the Cypress output in order to extract information about the Cypress run, such as how many times failed tests were retried. I also wanted the Cypress output to show up in my CI pipeline in real time, rather than only after the Cypress run was completed.
So I chose to pipe Cypress's output to
tee. Tee is a tool similar to cat: it prints everything consumed from standard input, but also writes a copy to a file, which we can inspect later. My code looked like this:
cypress run | tee cypress.log parse cypress.log
After this change, my CI job started to get stuck occasionally. 🤷♂️ Thinking I went insane, I double checked a couple of things:
- Was the OS upgraded recently? Nope.
- Was a dependency upgraded recently? Nope.
- Does upgrading Cypress and Headless Chrome make the problem go away? Nope.
- Does removing the
teepipe make the problem go away? Yes.
I noticed a pattern: when the CI job gets stuck, then Cypress was always in a complete state: all tests have finished, and Cypress has written its final output. In other words, it was as if Cypress was stuck during some shutdown routine.
I started poking around in the CI server and observed a few peculiar things:
- The main Cypress process is gone, but
teeis still alive.
- A few Cypress subprocesses, such as Headless Chrome and a bunch of Cypress helper processes, are still alive.
When I put all of this together, and thought about how Unix process management works, a lightbulb started to shine. 💡 How could I not have seen it coming?
What's going on
When we run
cypress run | tee cypress.log, Bash waits for
cypress and for
tee to finish. The caveat lies in the latter.
Tee finishes when it is done reading all its standard input.
Question: where does tee's standard input come from?
cypress run's standard output. But not only
cypress run. When a Unix process creates subprocesses, all subprocesses inherit the standard output channel. In other words, all
cypress run subprocesses' standard output are connected to
Here's how the situation looks like right after we ran
cypress run | tee cypress.log:
When Cypress has finished running all its tests,
cypress run exits. But sometimes, for reasons that are still not clear to me, Cypress' subprocesses don't exit. But because the subprocesses' standard outputs are still connected to tee's standard input, tee does not exit either, and so bash ends up waiting for tee indefinitely:
Of course, it's not good that Cypress subprocesses don't exit. And the cause for that should be investigated. But this case shows how insidious Unix can be, and how small, seemingly-innocent changes can end up behaving very differently from expectations.
Reproducing with 'cat' 🐱
In order to confirm that I was right, I ran the following experiment, which recreates the above situation without using Cypress. This experiment only uses bash and
(sleep 999 & echo hi)
The above command spawns a subshell, which spawns
sleep 999 in the background, then prints 'hi' and exits. When I ran the above, I observed that the bash prompt returns immediately:
$ date Tue Jul 21 18:19:04 CEST 2020 $ (sleep 999 & echo hi) hi $ date Tue Jul 21 18:19:07 CEST 2020
I then reran the same experiment, piped to cat:
$ (sleep 999 & echo hi) | cat hi <...never exits...>
When we run
ps aux in another terminal, we see that both 'sleep' and 'cat' are still alive:
hongli 19145 0.0 sleep 999 hongli 19144 0.0 cat
If not 'tee', then what? 🤔
So far, I haven't been able to find out why Cypress subprocesses get stuck. But the CI job runs inside an ephemeral Docker container anyway, and all processes will be cleaned up after the CI job finishes anyway, so figuring out why the subprocesses get stuck wasn't terribly important. A more important question was: if I could not pipe to
tee, then what's the alternative?
The solution that I came up with was this:
# Run Cypress in the background, redirecting its output to a file. cypress run > cypress.log & # Capture the PID of 'cypress run'. local CYPRESS_PID=$! # Watch the Cypress log file, print (in near-real-time) any new # data written to the file. # --pid means: exit tail when 'cypress run' exits. tail -f --pid=$CYPRESS_PID cypress.log # `cypress run` has how exited, but we still need to explicitly # wait for it from Bash, otherwise its dead process entry will stay # behind in the kernel process list. wait $CYPRESS_PID
Congratulations, you are now master of the arcane 🧙♂️
Lots of people praise Unix for its "simplicity", and Unix pipes for their "elegance". I disagree. If you know how things actually work behind the scenes, then you will see that there are a lot of weird corner cases. Unix can be very insidious: small, seemingly-innocent changes can end up behaving very differently from expectations.
But don't let that stop you from doing your job. At least now, you know how to avoid this particular caveat.