🧵 Advanced Pipelines

🧠 Overview

This module goes deep into how pipelines really work:

  • how the shell builds the process + FD graph
  • how exit codes are chosen (pipefail, last command, etc.)
  • how buffering and backpressure work
  • how SIGINT/SIGPIPE propagate
  • why some pipelines hang “randomly”
  • how to design production‑grade pipelines for CI/CD and data processing

The goal: when you see cmd1 | cmd2 | cmd3, you don’t think “three commands in a row”, you think “three processes + two pipes + specific signal/exit semantics”.


🎓 Who this is for

  • DevOps/SRE building data pipelines, log processing, or CI chains.
  • Engineers who rely on grep | awk | sed | jq | ... in critical scripts.
  • Anyone who has ever seen:
    • a pipeline hang forever,
    • a partial result,
    • or a “broken pipe” at the wrong place.

You should already understand:


🧩 Internals / Mechanics

🧩 How the shell builds a pipeline

For:

cmd1 | cmd2 | cmd3

The shell typically:

  1. Creates N‑1 pipes (here: 2 pipes).
  2. Forks N children (here: 3 processes).
  3. In each child:
     • wires stdin/stdout via dup2() to the appropriate pipe ends,
     • closes unused FDs,
     • execve()s the target command.
  4. In the parent:
     • closes all pipe FDs,
     • tracks PIDs,
     • waits according to its pipeline semantics.

Key point: each stage is its own process, with its own buffering, signals, and exit code.
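
One way to see the separate processes is to let every stage report its own PID. The sketch below is only an illustration: the sh -c stages are stand-ins for real commands, each forwards its input with cat and then appends its own PID, so the last stage emits all three.

```shell
# Each `sh -c` stage runs as a separate process; $$ expands to that stage's PID.
pids=$(
  sh -c 'echo $$' |
  sh -c 'cat; echo $$' |
  sh -c 'cat; echo $$'
)
printf 'stage PIDs:\n%s\n' "$pids"

# Three stages should yield three distinct PIDs.
distinct=$(printf '%s\n' "$pids" | sort -u | wc -l | tr -d ' ')
echo "distinct processes: $distinct"
```

The two pipes are not visible here, but they are what carries the PIDs from one stage to the next.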


🧩 Exit status of a pipeline

Default (POSIX‑ish, many shells):

cmd1 | cmd2 | cmd3
echo $?
  • $? is the exit code of the last command (cmd3).

Bash with set -o pipefail:

  • $? is the exit code of the last (rightmost) stage that exited non‑zero,
  • or 0 if all stages succeeded.

This is critical in CI/CD:

  • without pipefail, cmd1 can fail silently if cmd3 succeeds.
  • with pipefail, the pipeline fails if any stage fails.
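
The difference is easy to check with true, false, and subshells that exit with a chosen code. This is a minimal sketch; each case runs in its own throwaway bash -c so the option does not leak into your session. When several stages fail, Bash reports the rightmost non‑zero status.

```shell
# Default: only the last stage counts, so the failure of `false` is invisible.
default_status=$(bash -c 'false | true; echo $?')

# pipefail: any failing stage makes the pipeline fail.
pipefail_status=$(bash -c 'set -o pipefail; false | true; echo $?')

# Several failures: the rightmost non-zero status (5) wins, not the first (3).
rightmost_status=$(bash -c 'set -o pipefail; (exit 3) | (exit 5) | true; echo $?')

echo "default=$default_status pipefail=$pipefail_status rightmost=$rightmost_status"
# prints: default=0 pipefail=1 rightmost=5
```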

🧩 Buffering and backpressure

Pipes are bounded kernel buffers (64 KiB by default on Linux; other systems differ).

  • If cmd1 writes faster than cmd2 reads:
    • the pipe fills,
    • cmd1 blocks on write,
    • backpressure propagates upstream.
  • If cmd2 is slow or stuck:
    • the whole pipeline can appear “hung”.

Also:

  • many tools (e.g. grep, awk, python) buffer differently depending on whether stdout is a TTY or a pipe.
  • line buffering vs block buffering can change perceived latency.
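
Backpressure can be observed directly. In the sketch below (an illustration, assuming the pipe buffer is smaller than 1 MiB, which holds for the usual defaults), the producer tries to write 1 MiB while the consumer sleeps for a second before reading. The producer records how long it took to finish; it cannot be faster than the consumer's delay.

```shell
marker=$(mktemp)
start=$(date +%s)
{
  head -c 1048576 /dev/zero                     # 1 MiB, more than the pipe can hold
  echo $(( $(date +%s) - start )) > "$marker"   # seconds until all writes went through
} | {
  sleep 1                                       # slow consumer: reads nothing at first
  wc -c > /dev/null                             # then drains the pipe
}
producer_elapsed=$(cat "$marker")
rm -f "$marker"
echo "producer was blocked for about ${producer_elapsed}s"
```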

🧩 SIGPIPE and early termination

If a downstream process exits early:

producer | consumer
  • consumer exits.
  • producer writes to a pipe with no reader.
  • kernel sends SIGPIPE to producer.
  • default behavior: producer is killed by the signal, and the shell reports its status as 141 (128 + 13, SIGPIPE being signal 13).

This is normal, but can be surprising.

Example:

yes | head -n 1
  • head exits after 1 line.
  • yes gets SIGPIPE and dies.
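
To confirm what happened to each stage, Bash's PIPESTATUS array can be printed right after the pipeline. A small sketch, run in a throwaway bash -c; it assumes SIGPIPE has its default disposition (some environments start processes with SIGPIPE ignored, in which case yes exits with an ordinary error instead).

```shell
# Collect the exit codes of both stages from a separate Bash process.
statuses=$(bash -c 'yes | head -n 1 > /dev/null; echo "${PIPESTATUS[*]}"')
producer_status=${statuses%% *}   # first entry: yes
consumer_status=${statuses##* }   # last entry: head

# With the default SIGPIPE disposition this prints "yes=141 head=0".
echo "yes=$producer_status head=$consumer_status"
```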

🧩 Pipelines and process groups

In interactive shells:

  • the whole pipeline is usually placed in one process group.
  • Ctrl‑C (SIGINT) goes to the foreground process group → all stages.

In non‑interactive scripts:

  • job control is usually disabled,
  • so pipeline stages normally stay in the script’s own process group instead of getting a new one; a signal sent to that group still reaches every stage.

This matters for:

  • signal propagation,
  • clean shutdown,
  • CI behavior.

See also: Advanced Process Control.


🔧 Techniques

🔧 Use set -o pipefail in non‑trivial pipelines

In scripts:

set -euo pipefail

This ensures:

  • pipelines fail if any stage fails,
  • not just the last one.
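
The interaction with set -e is what makes this useful in scripts. The sketch below runs the same two‑line script twice in a throwaway bash -c, once without and once with pipefail, and records both the output and the exit code.

```shell
# Without pipefail: `cat` succeeds, the pipeline "succeeds", the script carries on.
lenient=$(bash -c 'set -eu; false | cat; echo "continued"'; echo "exit=$?")

# With pipefail: the pipeline returns 1, and `set -e` stops the script right there.
strict=$(bash -c 'set -euo pipefail; false | cat; echo "continued"'; echo "exit=$?")

printf 'lenient run:\n%s\n\nstrict run:\n%s\n' "$lenient" "$strict"
```

The lenient run prints "continued" and exits 0; the strict run never reaches the echo and exits 1.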

🔧 Make failure explicit in middle stages

Example:

build | tee build.log | deploy

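If build fails, deploy still runs, because all stages start concurrently; without pipefail the pipeline also reports only deploy’s exit code.

Better (the failure is at least reported, although deploy still runs):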

set -o pipefail

build | tee build.log | deploy

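Or, to keep deploy from running at all after a failed build:

set -o pipefail
build | tee build.log || exit 1   # stop here if build (or tee) failed
deploy < build.log

So that failure in build is clearly separated from deploy.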
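
A variant that needs no pipefail at all is to drop the pipe between build and its status check. The sketch below uses hypothetical build and deploy functions as stand‑ins for the real commands; it trades tee's live output for an exit code that cannot be masked.

```shell
# Hypothetical stand-ins for the real build/deploy commands.
build()  { echo "compiling..."; return 1; }        # this build fails
deploy() { cat > /dev/null; echo "deployed"; }     # consumes the log, then "deploys"

log=$(mktemp)
# build writes straight to the log, so `if` sees build's own exit code.
if build > "$log" 2>&1; then
  deploy < "$log"
  result="deployed"
else
  result="deploy skipped, build exited $?"
fi
rm -f "$log"
echo "$result"   # prints: deploy skipped, build exited 1
```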


🔧 Use xargs / parallel instead of naive loops

Instead of:

for f in *.json; do
  jq '.foo' "$f"
done

Consider:

printf '%s\0' *.json | xargs -0 -P"$(nproc)" -I{} jq '.foo' "{}"

Architecturally:

  • you move from “shell‑driven loop” to “data‑driven worker pool”.
  • but you must understand how exit codes propagate (xargs has its own semantics).
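
The exit‑code caveat is worth seeing once. In this sketch the worker is a hypothetical one‑line sh -c script that fails for the item named bad; xargs keeps processing the other items and folds the failure into its own exit code (GNU xargs uses 123 when any invocation exits with 1 to 125; other implementations may use a different non‑zero value).

```shell
xargs_status=0
printf '%s\0' ok-1 bad ok-2 |
  xargs -0 -n 1 -P 2 sh -c 'case "$1" in bad) exit 1 ;; esac' worker ||
  xargs_status=$?

# The individual worker's exit code (1) is gone; only xargs' summary remains.
echo "xargs exit status: $xargs_status"
```

If you need to know which item failed, have the worker log it itself; xargs will not tell you.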

🔧 Use tee to branch pipelines

To both log and process:

producer | tee raw.log | consumer

Or to split:

producer | tee >(consumer1) >(consumer2) >/dev/null

(Process substitution is a Bash/Zsh/ksh feature, not POSIX sh; the consumers run asynchronously and may still be running when the pipeline returns.)
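
When you need to be sure every branch has finished before reading its results, a named pipe with an explicit wait is one portable alternative. This is a different technique from process substitution, shown here only as a sketch; the two consumers (grep -c and wc -l) are placeholders.

```shell
dir=$(mktemp -d)
mkfifo "$dir/branch"

# consumer1: counts lines containing "1"; runs in the background, reading the FIFO.
grep -c 1 < "$dir/branch" > "$dir/ones" &
consumer1_pid=$!

# tee sends one copy into the FIFO and the other down the pipe to consumer2 (wc -l).
seq 1 10 | tee "$dir/branch" | wc -l > "$dir/lines"

wait "$consumer1_pid"              # do not read results before consumer1 is done
ones=$(tr -d ' ' < "$dir/ones")
lines=$(tr -d ' ' < "$dir/lines")
echo "lines=$lines, lines containing 1: $ones"   # lines=10, lines containing 1: 2
rm -rf "$dir"
```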


⚠️ Pitfalls

⚠️ Silent failures in early stages

generate | filter | transform | upload

If generate fails but upload exits 0, you might:

  • upload partial data,
  • or nothing at all, but still “succeed”.

Without pipefail, $? only reflects upload.


⚠️ Hanging pipelines due to open FDs

If any process keeps a write end of a pipe open:

  • readers never see EOF,
  • pipeline appears hung.

Common causes:

  • parent shell not closing pipe FDs,
  • extra processes inheriting FDs (no CLOEXEC),
  • tools that fork internally and keep FDs open.
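
The effect is easy to reproduce with a background sleep that inherits the pipe. A minimal sketch: in the first pipeline cat has to wait for the sleep to exit before it sees EOF; in the second the background process gets its own stdout and cat returns at once.

```shell
# Leak: the background sleep inherits stdout, i.e. the write end of the pipe.
start=$(date +%s)
{ echo "data"; sleep 2 & } | cat > /dev/null
leaky_elapsed=$(( $(date +%s) - start ))

# Fix: redirect the background process so it no longer holds the pipe open.
start=$(date +%s)
{ echo "data"; sleep 2 > /dev/null & } | cat > /dev/null
fixed_elapsed=$(( $(date +%s) - start ))

echo "leaky: ${leaky_elapsed}s, fixed: ${fixed_elapsed}s"
```

The same applies to daemons started from a pipeline stage: redirect their stdout and stderr, or the reader downstream waits for them.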

⚠️ Mixing TTY‑dependent behavior

Some tools:

  • behave differently when stdout is a TTY vs a pipe,
  • change buffering,
  • change formatting (colors, progress bars).

This can break scripts when moved from interactive use to CI.
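
In your own scripts the check behind this behavior is a one‑liner, test -t. The helper below is hypothetical and only shows the decision; real tools additionally switch buffering and formatting based on the same test.

```shell
# Hypothetical helper: report what stdout (FD 1) is connected to.
stdout_kind() {
  if [ -t 1 ]; then
    echo "tty"         # interactive terminal: colors and progress bars are fine
  else
    echo "not-a-tty"   # pipe or file: keep output plain and machine-readable
  fi
}

stdout_kind            # result depends on how this script is run
stdout_kind | cat      # always "not-a-tty": stdout is a pipe here
```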


⚠️ Over‑pipelining

Deep chains like:

cat file | grep foo | awk '{print $2}' | sort | uniq -c | sort -nr

are:

  • harder to debug,
  • more fragile,
  • more sensitive to buffering and partial failures.

Sometimes a small script in Python/Go/Rust is clearer and safer.
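
Short of switching languages, several stages can often be merged into one awk program. The sketch below is one possible consolidation of the chain above, not the only one; the sample data is made up for the illustration.

```shell
data=$(mktemp)
printf '%s\n' 'foo alpha' 'bar beta' 'foo gamma' 'foo alpha' 'foo alpha' 'foo gamma' > "$data"

# One awk process filters on "foo", extracts field 2, and counts occurrences;
# sort is only needed to order the final, much smaller result.
counts=$(awk '/foo/ { n[$2]++ } END { for (k in n) print n[k], k }' "$data" | sort -nr)
echo "$counts"
rm -f "$data"
```

Two processes instead of six means fewer pipes, fewer exit codes to track, and no cat.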


🚨 Real‑world failures

🚨 Failure: CI pipeline “hangs randomly”

Scenario:

producer | consumer | uploader
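  • uploader exits early on error.
  • consumer exits on its next write, once uploader’s end of the pipe is closed.
  • producer keeps writing, but:
    • some stray process still holds the read end of its pipe open, so no SIGPIPE arrives and the write blocks once the pipe is full,
    • or SIGPIPE is ignored/handled badly.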

Result: CI job hangs.

Root causes:

  • FDs not closed properly.
  • No pipefail.
  • No explicit error handling.

🚨 Failure: Partial deploy with green status

build | tee build.log | deploy
  • build fails halfway.
  • deploy reads partial log, still exits 0.
  • CI marks job as success.

Fix:

  • set -o pipefail, so the job at least goes red.
  • Or split stages and stop before deploy when build fails:

set -o pipefail
build | tee build.log || exit 1
deploy < build.log

🚨 Failure: “Broken pipe” spam in logs

producer | head -n 10
  • head exits after 10 lines.
  • producer gets SIGPIPE, logs stack traces or errors.

Fix:

  • treat SIGPIPE as normal in this context,
  • or adjust logging to ignore it when expected.
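
One way to treat SIGPIPE as normal is a small wrapper around the producer. The helper name allow_sigpipe is hypothetical; it maps status 141 to success and passes every other exit code through unchanged.

```shell
# Hypothetical helper: run a command, treat "killed by SIGPIPE" (141) as success.
allow_sigpipe() {
  "$@"
  status=$?
  if [ "$status" -eq 141 ]; then
    return 0
  fi
  return "$status"
}

# Typical use as a pipeline stage: head cutting the producer off is expected.
allow_sigpipe yes | head -n 3
```

With pipefail enabled this keeps the pipeline green when the producer is merely cut short, while real producer failures still surface.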

🛠️ Patterns

🛠️ Pattern: Short, named pipelines

Instead of:

cat big.log | grep ERROR | awk '{print $5}' | sort | uniq -c | sort -nr

Use:

extract_errors() {
  grep ERROR |
  awk '{print $5}' |
  sort |
  uniq -c |
  sort -nr
}

extract_errors < big.log

Benefits:

  • easier to test,
  • easier to reuse,
  • easier to extend.
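
"Easier to test" can be taken literally: a named pipeline reads stdin, so a few sample lines are enough. The log lines below are made up for the illustration; the function is repeated so the snippet runs on its own.

```shell
extract_errors() {
  grep ERROR |
  awk '{print $5}' |
  sort |
  uniq -c |
  sort -nr
}

# Made-up sample lines; field 5 is the error kind.
actual=$(printf '%s\n' \
  '2024-01-01 10:00:00 ERROR api timeout' \
  '2024-01-01 10:00:01 INFO api ok' \
  '2024-01-01 10:00:02 ERROR db timeout' \
  '2024-01-01 10:00:03 ERROR db refused' |
  extract_errors | awk '{print $1, $2}')   # normalize the padding added by uniq -c

expected='2 timeout
1 refused'
[ "$actual" = "$expected" ] && echo "extract_errors: ok"
```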

🛠️ Pattern: Validate inputs before pipelines

Before:

cat "$INPUT" | ...

Do:

[ -r "$INPUT" ] || {
  echo "Input not readable: $INPUT" >&2
  exit 1
}

Architecturally: fail fast before building complex process graphs.


🛠️ Pattern: Use logs as first‑class artifacts

Instead of:

producer | consumer

Consider:

producer | tee producer.log | consumer | tee consumer.log

So you can:

  • debug failures post‑mortem,
  • replay data through later stages.
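
Replaying is then just a redirect. In the sketch below producer and consumer are hypothetical stand‑ins; the captured producer log is fed through the consumer alone and compared with what the original run produced.

```shell
# Hypothetical stages, stand-ins for real commands.
producer() { printf '%s\n' 3 1 2; }
consumer() { sort -n; }

dir=$(mktemp -d)
producer | tee "$dir/producer.log" | consumer | tee "$dir/consumer.log" > /dev/null

# Post-mortem: rerun only the consumer on the captured producer output.
replayed=$(consumer < "$dir/producer.log")
original=$(cat "$dir/consumer.log")
[ "$replayed" = "$original" ] && echo "replay matches the original run"
rm -rf "$dir"
```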

❌ Anti‑patterns

  • giant, unreadable one‑liner pipelines in critical scripts
  • relying on default exit‑code semantics without pipefail
  • not planning for SIGPIPE and then treating it as an “unexpected error”
  • using pipelines where a small script would be clearer
  • mixing interactive and non‑interactive assumptions (colors, prompts, paging)

🔍 Debugging

🔍 Trace processes and FDs

Use:

strace -f -e trace=process,read,write sh script.sh

to see:

  • which processes are spawned,
  • who reads/writes which FDs,
  • where things block.

🔍 Inspect process tree live

ps f
pstree -p

to see:

  • which stages are still running,
  • whether something is stuck upstream.

🔍 Check exit codes of all stages

In Bash:

set -o pipefail
cmd1 | cmd2 | cmd3
echo "${PIPESTATUS[@]}"

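PIPESTATUS is a Bash array holding the exit code of each stage, in order. Read it immediately after the pipeline: the next command overwrites it.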
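
A short sketch with stages that exit with known codes shows the layout of the array. It runs in a throwaway bash -c and copies PIPESTATUS into a variable first, since any later command would reset it.

```shell
statuses=$(bash -c '
  true | (exit 2) | (exit 7) | true
  saved=("${PIPESTATUS[@]}")   # copy right away; the next command resets PIPESTATUS
  echo "${saved[*]}"
')
echo "per-stage exit codes: $statuses"   # per-stage exit codes: 0 2 7 0
```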


🧠 Summary

Advanced pipelines are not “just a bunch of commands with | between them”. They are:

  • process graphs (multiple PIDs),
  • FD graphs (pipes, redirections),
  • exit‑code semantics (last vs pipefail),
  • buffering and backpressure,
  • signal propagation (SIGINT, SIGPIPE).

Once you think of pipelines this way, you can design:

  • non‑hanging,
  • correctly failing,
  • observable,
  • production‑grade shell pipelines for CI/CD and data processing.