🔗 Advanced Shell Pipelines

🧠 Overview

Pipelines are one of the most powerful and misunderstood features of POSIX shells. They create multi‑process data flows, isolate environments, propagate (or hide) failures, and interact with process groups and signals. This document explains pipelines as execution graphs, not syntax.


🎓 Who this is for

  • Engineers writing complex data‑processing flows.
  • DevOps/SRE working with CI/CD, logs, and streaming pipelines.
  • Anyone debugging pipeline failures, hangs, or unexpected exit codes.
  • People who want predictable, production‑grade pipeline behavior.

🧩 Internals / Mechanics

🧩 What a pipeline really is

A pipeline:

cmd1 | cmd2 | cmd3

is not a single command. It is a process graph:

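  • the shell creates N-1 pipes for N commands
  • forks a child process for each stage
  • connects the stdout of each stage to the stdin of the next
  • places the stages in one process group (a job, when job control is active)
  • waits for all of them (unless the pipeline is backgrounded)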
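
One way to see that the stages are separate processes running at the same time is to time a pipeline whose stages each sleep. This is a minimal sketch: the one-second sleeps and the `elapsed` variable are illustrative choices, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
#!/bin/sh
# Each stage sleeps for one second. Because the shell starts all three
# stages at once, the pipeline finishes in about one second, not three.
start=$(date +%s)
sleep 1 | sleep 1 | sleep 1
end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

If the stages ran one after another, the elapsed time would be at least three seconds.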

🧩 Subshell behavior

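In most shells:

  • each pipeline stage runs in a subshell (ksh and zsh run the last stage in the current shell, as does bash with the lastpipe option)
  • subshells work on a copy of the shell's state
  • variable changes made inside a stage do not propagate back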

Example:

count=0
echo "a b c" | while read _; do
  count=$((count+1))
done
echo "$count"   # prints 0 in bash and dash; 1 in ksh and zsh
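
A portable way to get a result out of a pipeline stage is to have the stage print it and capture that output in the parent. The sketch below assumes nothing beyond POSIX sh; the names `n` and `item` are illustrative.

```shell
#!/bin/sh
# The loop still runs in a subshell, but it prints its result and the
# parent captures that output instead of relying on a shared variable.
count=$(
  printf '%s\n' a b c | {
    n=0
    while read -r item; do
      n=$((n + 1))
    done
    echo "$n"
  }
)
echo "$count"
```

The braces group the loop and the final `echo` into one stage, so the counter is still in scope when it is printed.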

🧩 Exit code propagation

By default:

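  • $? is the exit status of the last command in the pipeline
  • failures in earlier stages are ignored unless set -o pipefail is enabled
  • with pipefail, $? is the status of the rightmost stage that exited non-zero, or 0 if every stage succeeded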

Example:

false | true
echo $?   # 0 without pipefail, 1 with pipefail
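
In bash, the status of every stage can also be inspected through the `PIPESTATUS` array. The following is a small sketch that assumes bash specifically (`PIPESTATUS` is not part of POSIX sh); the `statuses` variable is an illustrative name.

```shell
#!/usr/bin/env bash
# The middle stage fails, but the pipeline as a whole reports success
# because the last stage succeeds.
true | false | true

# Copy the array at once: the next command resets PIPESTATUS.
statuses=("${PIPESTATUS[@]}")
echo "per-stage: ${statuses[*]}"   # per-stage: 0 1 0
```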

🧩 Process groups

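With job control active (interactive shells), each pipeline runs in its own process group. In scripts, where job control is normally off, the stages stay in the script's process group. Either way:

  • terminal signals such as SIGINT (Ctrl-C) are delivered to every stage in the foreground process group
  • foreground/background behavior applies to the pipeline as a whole

This matters in CI and containers, where there is often no controlling terminal and a signal reaches only the process it is sent to.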


🔧 Techniques

🔧 Use pipefail for safe pipelines

set -o pipefail

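With pipefail, the pipeline's exit status is non-zero if any stage fails. The option is supported by bash, ksh, and zsh, and was added to POSIX in the 2024 edition.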
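
When only one pipeline needs the stricter behavior, the option can be limited to a subshell. This is a sketch under the assumption that the shell supports pipefail (bash is used here); `scoped` and `unscoped` are illustrative variable names.

```shell
#!/usr/bin/env bash
# Enable pipefail only inside a subshell so the setting does not leak
# into the rest of the script.
scoped=0
( set -o pipefail; false | true ) || scoped=$?

# The same pipeline without pipefail reports only the last stage.
unscoped=0
false | true || unscoped=$?

echo "scoped=$scoped unscoped=$unscoped"   # scoped=1 unscoped=0
```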

🔧 Use read -r to avoid mangling input

printf '%s\n' "$data" | while IFS= read -r line; do
  ...
done
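
What `IFS=` and `-r` protect against can be seen by reading the same line both ways. A minimal sketch; the sample value in `data` and the variables `plain` and `safe` are made up for illustration.

```shell
#!/bin/sh
data='  path\to\file  '

# Plain read: IFS trims the surrounding spaces, and backslashes are
# treated as escape characters, so they disappear.
plain=$(printf '%s\n' "$data" | { read line; printf '[%s]' "$line"; })

# IFS= keeps the whitespace, -r keeps the backslashes.
safe=$(printf '%s\n' "$data" | { IFS= read -r line; printf '[%s]' "$line"; })

printf '%s\n' "$plain" "$safe"
```

The first result is `[pathtofile]`; the second preserves the input exactly, brackets aside.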

🔧 Use process substitution for cleaner graphs

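Instead of writing intermediate results to temporary files:

diff <(sort a.txt) <(sort b.txt)

This avoids temporary files and keeps the pipeline readable. Process substitution is a bash, ksh, and zsh feature; it is not available in POSIX sh.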
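
For comparison, here is a sketch of the same comparison done both ways. It assumes bash, creates its own sample files in a throwaway directory (`tmpdir` and the file contents are illustrative), and cleans up afterwards.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)
printf '%s\n' b a c > "$tmpdir/a.txt"
printf '%s\n' c b a > "$tmpdir/b.txt"

# Temporary-file version: two extra files to create and clean up.
sort "$tmpdir/a.txt" > "$tmpdir/a.sorted"
sort "$tmpdir/b.txt" > "$tmpdir/b.sorted"
diff "$tmpdir/a.sorted" "$tmpdir/b.sorted"
with_files=$?

# Process substitution: diff reads both sorted streams directly.
diff <(sort "$tmpdir/a.txt") <(sort "$tmpdir/b.txt")
with_procsub=$?

rm -rf "$tmpdir"
echo "with_files=$with_files with_procsub=$with_procsub"
```

Both invocations report the same result; the second one leaves nothing on disk.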

🔧 Use xargs or parallel for fan‑out pipelines

# -n 1 starts one gzip per file; without -n, xargs may pass every file to a single gzip
printf '%s\0' *.log | xargs -0 -n 1 -P"$(nproc)" gzip
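
How `-n` interacts with `-P` can be checked with a harmless command. This sketch uses `echo` as a stand-in for real work and assumes an xargs that supports `-0` and `-P` (GNU and BSD versions do); `batched` and `per_item` are illustrative names.

```shell
#!/bin/sh
# Without -n, xargs batches all three arguments into a single echo,
# so -P has nothing to run in parallel: one invocation, one line.
batched=$(printf '%s\0' a b c | xargs -0 -P 2 echo | wc -l)

# With -n 1, xargs starts one echo per argument: three lines.
per_item=$(printf '%s\0' a b c | xargs -0 -n 1 -P 2 echo | wc -l)

echo "batched=$((batched)) per_item=$((per_item))"
```

With `-n 1` the order of the output lines is not guaranteed, because the invocations run concurrently.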

⚠️ Pitfalls

⚠️ Pipeline swallowing errors

docker build . | tee build.log

If docker build fails, the pipeline still exits with tee's status, which is normally 0, unless pipefail is set.

⚠️ Subshell variable loss

total=0
ls | while read f; do
  total=$((total+1))
done
echo "$total"   # prints 0 in bash: the loop ran in a subshell

⚠️ Deadlocks from unconsumed pipe output

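If a command writes more than the pipe buffer holds (64 KiB by default on Linux) and the next stage is slow or blocked, the writer blocks until the reader drains the pipe. If the reader never reads, the pipeline hangs.

A reader that exits early is a different case:

cmd1 | head -n 1

When head exits, the read end of the pipe closes and cmd1 receives SIGPIPE on its next write. The pipeline keeps running only if cmd1 ignores SIGPIPE or does not write again for a long time.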

⚠️ Mixing stdout and stderr incorrectly

cmd1 2>&1 | cmd2

This merges stderr into the same stream as stdout, so diagnostics are interleaved with data and may break parsing in cmd2.
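
The order of redirections decides which stream enters the pipe. A short sketch, where `emit` is a hypothetical command that writes one line to each stream and the variable names are illustrative:

```shell
#!/bin/sh
# emit is a hypothetical command that writes one line to each stream.
emit() {
  echo "data"
  echo "warning" >&2
}

# Both streams enter the pipe: two lines reach wc.
merged=$(emit 2>&1 | wc -l)

# Only stderr enters the pipe: stdout is sent to /dev/null after
# stderr has been pointed at the pipe.
stderr_only=$(emit 2>&1 >/dev/null | wc -l)

# Only stdout enters the pipe: stderr is discarded.
stdout_only=$(emit 2>/dev/null | wc -l)

echo "merged=$((merged)) stderr_only=$((stderr_only)) stdout_only=$((stdout_only))"
```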


🚨 Real‑World Failures

🚨 Failure: CI job passes despite build failure

docker build . | tee build.log

docker build fails → tee succeeds → pipeline exit = 0 → CI passes.

Fix:

set -o pipefail
docker build . | tee build.log
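
The effect can be reproduced without Docker. In this sketch, `build_step` is a hypothetical stand-in for a failing build command, and bash is assumed for pipefail; the log path comes from `mktemp`.

```shell
#!/usr/bin/env bash
# build_step is a hypothetical stand-in for a failing build command.
build_step() { echo "compiling"; return 3; }
log=$(mktemp)

# Without pipefail, tee's success hides the failure.
masked=0
build_step | tee "$log" >/dev/null || masked=$?

# With pipefail (scoped to a subshell here), the failure surfaces
# with the build's own exit status.
reported=0
( set -o pipefail; build_step | tee "$log" >/dev/null ) || reported=$?

rm -f "$log"
echo "masked=$masked reported=$reported"   # masked=0 reported=3
```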

🚨 Failure: Pipeline hangs due to unconsumed output

long_running_cmd | head -n 1

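head exits early → long_running_cmd is killed by SIGPIPE only when it next writes → if it writes nothing more (or ignores SIGPIPE), the shell keeps waiting for it.

Fix:

  • use timeout to bound the producer
  • or redesign the pipeline so the producer terminates on its own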
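
A sketch of the timeout approach, assuming GNU coreutils `timeout` and bash (for `PIPESTATUS`). Here `sleep 5` stands in for a producer that stays alive without writing, and the one-second limit is an arbitrary choice.

```shell
#!/usr/bin/env bash
# sleep 5 stands in for a producer that never writes; timeout ends it
# after one second so head sees end-of-file instead of waiting.
timeout 1 sleep 5 | head -n 1
producer_status=${PIPESTATUS[0]}
echo "producer status: $producer_status"   # 124 means the time limit was hit
```

GNU timeout exits with status 124 when it has to stop the command, which lets the caller tell a timeout apart from an ordinary failure.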

🚨 Failure: Lost variables in subshell

count=0
printf '%s\n' *.txt | while read f; do
  count=$((count+1))
done
echo "$count"   # prints 0 in bash: the loop ran in a subshell

Fix:

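Use redirection from a process substitution (bash, ksh, zsh) instead of a pipeline, so the loop runs in the current shell:

count=0
while IFS= read -r f; do
  count=$((count+1))
done < <(printf '%s\n' *.txt)
echo "$count"   # now reflects the loop's count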
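
Another option in bash 4.2 and later is the `lastpipe` shell option, which keeps the pipeline syntax. This is a sketch assuming a non-interactive bash script; the input lines and the `item` variable are illustrative.

```shell
#!/usr/bin/env bash
# lastpipe runs the final stage in the current shell. It only takes
# effect when job control is off, which is the default in scripts.
shopt -s lastpipe
count=0
printf '%s\n' a b c | while IFS= read -r item; do
  count=$((count + 1))
done
echo "$count"
```

In an interactive shell job control is on, so the option has no effect there and the counter would stay at 0.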

🛠️ Patterns

🛠️ Pattern: Fail‑fast pipelines

Always:

set -euo pipefail

🛠️ Pattern: Use process substitution for clarity

diff <(sort a) <(sort b)

🛠️ Pattern: Use redirection to avoid subshells

while IFS= read -r line; do
  ...
done < file

🛠️ Pattern: Use xargs for parallel fan‑out

# -n 1 starts one gzip per file so that -P can actually run them in parallel
find . -name '*.log' -print0 | xargs -0 -n 1 -P"$(nproc)" gzip

❌ Anti‑Patterns

❌ Anti‑pattern: Using pipelines for state mutation

Pipelines are for data flow, not state changes. Variables set inside a stage are lost when that stage's subshell exits.

❌ Anti‑pattern: Ignoring stderr

cmd | grep pattern

If cmd prints errors to stderr, they bypass grep and go straight to the terminal or log, so later stages can neither filter nor detect them.

❌ Anti‑pattern: Using cat unnecessarily

cat file | grep foo

Use:

grep foo file

🔍 Debugging

🔍 Trace pipeline execution

set -x

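Prints each simple command to stderr after expansion, prefixed by PS4 (default "+ "). It shows:

  • one trace line per pipeline stage
  • expanded arguments and variable assignments

It does not show forks or pipe setup; use strace for those.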
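
To look at the trace of a single pipeline without cluttering the rest of the output, the trace can be captured from stderr. A minimal sketch; the traced pipeline and the `trace` variable are illustrative.

```shell
#!/bin/sh
# Run a traced pipeline in a subshell, capture the trace from stderr,
# and discard the pipeline's normal output.
trace=$( (set -x; echo hi | tr a-z A-Z) 2>&1 >/dev/null )
printf '%s\n' "$trace"
```

Because `set -x` is enabled inside the subshell, tracing stops on its own when the subshell exits.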

🔍 Inspect process tree

ps f
pstree -p

🔍 Debug pipe behavior with strace

strace -f -e trace=process,desc sh script.sh

⚙️ Performance

⚙️ Minimize forks

Use builtins where possible.
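
As an illustration, stripping the directory from a path can be done with a pipeline or with parameter expansion alone. The path and variable names below are made up for the example.

```shell
#!/bin/sh
path=/var/log/app/server.log

# Forks a subshell and runs sed just to strip the directory part.
forked=$(echo "$path" | sed 's|.*/||')

# Same result with parameter expansion, no new process at all.
builtin_only=${path##*/}

echo "$forked $builtin_only"
```

Inside a loop over thousands of items, the difference between these two forms can dominate the runtime.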

⚙️ Use parallelism

xargs -n 1 -P"$(nproc)" cmd

⚙️ Avoid unnecessary pipelines

grep foo file | wc -l

can be replaced with:

grep -c foo file

🧵 Process Control

🧵 Process groups

The stages of a pipeline share a process group, so a signal sent to the group (for example, Ctrl-C from the terminal) reaches every stage.

🧵 Foreground/background

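Foreground pipelines receive terminal signals. Background pipelines do not receive Ctrl-C, and are stopped with SIGTTIN if they try to read from the terminal.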

🧵 Handling SIGPIPE

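When a downstream command exits early, the upstream command receives SIGPIPE the next time it writes to the pipe. The default action terminates it, and the shell reports exit status 128 + 13 = 141.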
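
This can be observed with a producer that never stops on its own. The sketch assumes bash (for `PIPESTATUS`) and that SIGPIPE has not been set to ignored by the parent environment.

```shell
#!/usr/bin/env bash
# yes writes "y" forever; head exits after one line, and the next write
# by yes fails, which normally kills it with SIGPIPE.
yes | head -n 1
producer_status=${PIPESTATUS[0]}
echo "producer status: $producer_status"   # typically 141 (128 + 13)
```

With pipefail enabled, this non-zero producer status becomes the status of the whole pipeline, which is worth remembering when combining pipefail with head or grep -q.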


🐳 Containers

🐳 Pipelines inside PID 1 shells

If the shell is PID 1:

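  • signals left at their default disposition (SIGPIPE and SIGTERM among them) are not delivered to PID 1 itself, so the shell may not stop or fail the way it would elsewhere
  • orphaned children are reparented to PID 1 and must be reaped
  • pipelines that leave orphaned processes behind can leak zombies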

🐳 Logging pipelines

Common pattern:

app | tee /var/log/app.log

Set pipefail so that a crash in app is not masked by tee's exit status.


🛰️ CI/CD

🛰️ Deterministic pipelines

CI pipelines must:

  • fail fast
  • avoid interactive commands
  • log clearly

🛰️ Use tee safely

set -o pipefail
command | tee output.log

🧠 Summary

Pipelines are multi‑process execution graphs with:

  • subshells
  • process groups
  • redirections
  • exit code propagation
  • signal behavior

Mastering them makes your scripts predictable, safe, and production‑ready.