🧵 Advanced Pipelines

🧠 Overview

This module goes deep into how pipelines really work:

how the shell builds the process + FD graph
how exit codes are chosen (pipefail, last command, etc.)
how buffering and backpressure work
how SIGINT/SIGPIPE propagate
why some pipelines hang “randomly”
how to design production‑grade pipelines for CI/CD and data processing

The goal: when you see cmd1 | cmd2 | cmd3, you don’t think “three commands in a row”, you think “three processes + two pipes + specific signal/exit semantics”.

🎓 Who this is for

DevOps/SRE building data pipelines, log processing, or CI chains.
Engineers who rely on grep | awk | sed | jq | ... in critical scripts.
Anyone who has ever seen:
a pipeline hang forever,
a partial result,
or a “broken pipe” at the wrong place.

You should already understand:

basic shell scripting
what fork/exec do (see: Execve & Fork Internals)
basic process control (see: Advanced Process Control)

🧩 Internals / Mechanics

🧩 How the shell builds a pipeline

For:

cmd1 | cmd2 | cmd3

The shell typically:

Creates N‑1 pipes (here: 2 pipes).
Forks N children (here: 3 processes).
In each child:
wires stdin/stdout via dup2() to the appropriate pipe ends,
closes unused FDs,
execve()s the target command.
In the parent:
closes all pipe FDs,
tracks PIDs,
waits according to its pipeline semantics.

Key point: each stage is its own process, with its own buffering, signals, and exit code.
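
A quick way to observe this from Bash itself is sketched below. The stage helper is a made-up name for illustration; it relies on Bash's $BASHPID, which reports the PID of the process actually running the code. Each stage should print a different PID, and none should match the parent shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which process runs this stage, then pass data through.
stage() {
  echo "stage $1 pid=$BASHPID" >&2   # stderr, so the report bypasses the pipes
  cat
}

parent_pid=$BASHPID

# Capture only the stderr reports; the data itself is discarded at the end.
stage_report=$( { echo data | stage 1 | stage 2 | stage 3 >/dev/null; } 2>&1 )

echo "parent pid=$parent_pid"
echo "$stage_report"                 # three lines, three different PIDs (order may vary)
```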

🧩 Exit status of a pipeline

Default (POSIX‑ish, many shells):

cmd1 | cmd2 | cmd3
echo $?

$? is the exit code of the last command (cmd3).

Bash with set -o pipefail:

$? is the exit code of the last (rightmost) stage that exited non‑zero,
or 0 if all succeeded.

This is critical in CI/CD:

without pipefail, cmd1 can fail silently if cmd3 succeeds.
with pipefail, the pipeline fails if any stage fails.
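
The difference is easy to check in Bash. The sketch below uses only false, true, and subshells with fixed exit codes; the variable names are illustrative. Note that Bash documents the pipefail status as that of the rightmost failing command, which the third pipeline shows.

```shell
#!/usr/bin/env bash
# Default: only the last stage counts, so the failure of `false` is invisible.
set +o pipefail
false | true
default_status=$?                      # 0

# pipefail: a failing stage makes the whole pipeline fail.
set -o pipefail
pipefail_status=0
false | true || pipefail_status=$?     # 1

# With several failures, the rightmost failing stage wins.
rightmost_status=0
( exit 2 ) | ( exit 3 ) | true || rightmost_status=$?   # 3

set +o pipefail                        # restore the default for later code
echo "default=$default_status pipefail=$pipefail_status rightmost=$rightmost_status"
```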

🧩 Buffering and backpressure

Pipes are bounded kernel buffers (64 KiB by default on Linux; other systems differ).

If cmd1 writes faster than cmd2 reads:
the pipe fills,
cmd1 blocks on write,
backpressure propagates upstream.
If cmd2 is slow or stuck:
the whole pipeline can appear “hung”.

Also:

many tools (e.g. grep, awk, python) buffer differently depending on whether stdout is a TTY or a pipe.
line buffering vs block buffering can change perceived latency.
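
Backpressure can be made visible with a deliberately slow consumer. The sketch below assumes the pipe capacity is well under 1 MiB (true for the common defaults) and uses Bash's SECONDS counter. The producer cannot finish until the consumer starts draining the pipe.

```shell
#!/usr/bin/env bash
# The producer wants to write 1 MiB at once; the consumer starts reading after 2 s.
SECONDS=0
producer_done_at=$(
  {
    { head -c 1048576 /dev/zero; echo "$SECONDS" >&2; } |
      { sleep 2; cat >/dev/null; }
  } 2>&1
)

# With typical pipe sizes the producer was blocked in write() until the
# consumer woke up, so this reports at least 2.
echo "producer finished after ${producer_done_at}s"
```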

🧩 SIGPIPE and early termination

If a downstream process exits early:

producer | consumer

consumer exits.
producer writes to a pipe with no reader.
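kernel sends SIGPIPE to producer (if the signal is ignored, the write fails with EPIPE instead).
default behavior: producer is killed by the signal, which the shell reports as exit status 141 (128 + 13).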

This is normal, but can be surprising.

Example:

yes | head -n 1

head exits after 1 line.
yes gets SIGPIPE and dies.
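
To see the statuses involved, Bash's PIPESTATUS array can be copied right after the pipeline. This is a minimal sketch; the exact status of yes assumes SIGPIPE has its default disposition in the environment running the script.

```shell
#!/usr/bin/env bash
yes | head -n 1 >/dev/null
statuses=("${PIPESTATUS[@]}")          # copy immediately, the next command resets it

echo "yes exited with ${statuses[0]}, head with ${statuses[1]}"
# Typically: yes exited with 141, head with 0
```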

🧩 Pipelines and process groups

In interactive shells:

the whole pipeline is usually placed in one process group.
Ctrl‑C (SIGINT) goes to the foreground process group → all stages.

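In non‑interactive scripts:

job control is normally disabled,
so pipeline stages do not get a process group of their own: they stay in the process group the script itself runs in.
Ctrl‑C from a terminal still reaches them, but a signal sent only to the script's PID does not.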

This matters for:

signal propagation,
clean shutdown,
CI behavior.
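
The grouping can be inspected with ps. The sketch below assumes it runs as a non‑interactive Bash script (job control off) and that a POSIX‑style ps is installed; in that setting the pipeline stage is expected to share the script's process group.

```shell
#!/usr/bin/env bash
# Process group of the shell running this script.
shell_pgid=$(ps -o pgid= -p "$$" | tr -d ' ')

# Start a short pipeline in the background; $! is the PID of its last stage.
sleep 1 | sleep 1 &
stage_pgid=$(ps -o pgid= -p "$!" | tr -d ' ')
wait

echo "shell pgid=$shell_pgid, pipeline stage pgid=$stage_pgid"
```

Running the same lines in an interactive shell with job control should show a different PGID for the pipeline.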

🔧 Techniques

🔧 Use `set -o pipefail` in non‑trivial pipelines

In scripts:

set -euo pipefail

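This ensures:

the script aborts on an unhandled command failure (-e),
expanding an unset variable is an error (-u),
pipelines fail if any stage fails, not just the last one (pipefail).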
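
The combined effect can be tried safely in a child Bash, so that the abort does not end the current shell. This is only a sketch; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Run a tiny strict-mode script in a child Bash so its abort does not end this shell.
strict_status=0
strict_output=$(bash -c '
  set -euo pipefail
  false | cat            # fails only because of pipefail
  echo "not reached"     # skipped: -e aborted the script on the line above
') || strict_status=$?

echo "output='$strict_output' status=$strict_status"
```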

🔧 Make failure explicit in middle stages

Example:

build | tee build.log | deploy

If build fails but deploy still runs, you’re in trouble.

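Better (the failure now turns the pipeline red, although deploy has still consumed whatever build emitted):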

set -o pipefail

build | tee build.log | deploy

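Or even:

set -o pipefail
build | tee build.log || exit 1
deploy < build.log

So that failure in build is clearly separated from deploy. The explicit check matters: without pipefail the first line reports tee's status, and without || exit (or set -e) deploy runs regardless.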
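
A self‑contained sketch of the gated variant follows. The build and deploy functions are hypothetical stand‑ins (the build is hard‑coded to fail), and the log goes to a temporary file rather than build.log.

```shell
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-ins for the real commands.
build()  { echo "compiling..."; return 1; }        # simulates a failing build
deploy() { cat >/dev/null; echo "deployed"; }

log=$(mktemp)
if build | tee "$log"; then
  deploy < "$log"
  outcome=deployed
else
  outcome=skipped
  echo "build failed, see $log; deploy skipped" >&2
fi

set +o pipefail
echo "outcome=$outcome"
```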

🔧 Use `xargs` / `parallel` instead of naive loops

Instead of:

for f in *.json; do
  jq '.foo' "$f"
done

Consider:

printf '%s\0' *.json | xargs -0 -P"$(nproc)" -I{} jq '.foo' "{}"

Architecturally:

you move from “shell‑driven loop” to “data‑driven worker pool”.
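but you must understand how exit codes propagate: xargs reports its own summary status (GNU xargs exits 123 if any invocation exits with status 1 to 125), and output from parallel workers can interleave.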
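
The exit status behavior can be checked with a worker that fails for one item. In this sketch the item names and the worker label are made up; the exact non‑zero value depends on the xargs implementation.

```shell
#!/usr/bin/env bash
# Three work items; the worker fails for the one named "bad".
xargs_status=0
printf '%s\0' good bad good |
  xargs -0 -n 1 sh -c '[ "$1" != bad ]' worker || xargs_status=$?

echo "xargs exit status: $xargs_status"   # non-zero; GNU xargs uses 123 here
```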

🔧 Use `tee` to branch pipelines

To both log and process:

producer | tee raw.log | consumer

Or to split:

producer | tee >(consumer1) >(consumer2) >/dev/null

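(Process substitution is a Bash/zsh/ksh feature, not POSIX sh. The consumers run asynchronously: the shell does not wait for them by default, and their exit codes do not appear in $? or PIPESTATUS.)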
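
One way to collect the results of both branches without racing them is to let the consumers write to the stdout of a command substitution. This is a sketch under that assumption, with awk and wc as arbitrary example consumers.

```shell
#!/usr/bin/env bash
# Both consumers inherit the stdout of the command substitution, so $(...)
# only returns once each of them has exited and closed it.
branch_results=$(
  printf '%s\n' 1 2 3 4 5 |
    tee >(awk '{ sum += $1 } END { print "sum=" sum }') \
        >(wc -l | awk '{ print "lines=" $1 }') \
        >/dev/null
)

printf '%s\n' "$branch_results" | sort   # the two lines may arrive in either order
```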

⚠️ Pitfalls

⚠️ Silent failures in early stages

generate | filter | transform | upload

If generate fails but upload exits 0, you might:

upload partial data,
or nothing at all, but still “succeed”.

Without pipefail, $? only reflects upload.

⚠️ Hanging pipelines due to open FDs

If any process keeps a write end of a pipe open:

readers never see EOF,
pipeline appears hung.

Common causes:

parent shell not closing pipe FDs,
extra processes inheriting FDs (no CLOEXEC),
tools that fork internally and keep FDs open.
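
The inherited‑FD case is easy to reproduce. In the sketch below a background sleep stands in for any helper process that a stage leaves behind; timings use Bash's SECONDS and are approximate.

```shell
#!/usr/bin/env bash
# Case 1: the background sleep inherits the pipe's write end, so cat sees no EOF
# until sleep exits, even though the stage itself finished immediately.
SECONDS=0
{ sleep 2 & echo "stage finished"; } | cat >/dev/null
held_open_seconds=$SECONDS

# Case 2: the background job's stdout points elsewhere, so EOF arrives at once.
SECONDS=0
{ sleep 2 >/dev/null & echo "stage finished"; } | cat >/dev/null
released_seconds=$SECONDS

echo "inherited FD: ${held_open_seconds}s, redirected FD: ${released_seconds}s"
```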

⚠️ Mixing TTY‑dependent behavior

Some tools:

behave differently when stdout is a TTY vs a pipe,
change buffering,
change formatting (colors, progress bars).

This can break scripts when moved from interactive use to CI.
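
Scripts can make the same decision explicitly with the standard test -t check. The output_mode helper below is a hypothetical name used only for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick an output mode based on what stdout is connected to.
output_mode() {
  if [ -t 1 ]; then
    echo "interactive"      # terminal: colors and progress bars are fine
  else
    echo "plain"            # pipe or file: stable, machine-readable output
  fi
}

piped_mode=$(output_mode | cat)
echo "mode when piped: $piped_mode"
```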

⚠️ Over‑pipelining

Deep chains like:

cat file | grep foo | awk '{print $2}' | sort | uniq -c | sort -nr

are:

harder to debug,
more fragile,
more sensitive to buffering and partial failures.

Sometimes a small script in Python/Go/Rust is clearer and safer.
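
Short of switching languages, a single awk program can sometimes absorb several stages. The sketch below is one possible consolidation of the chain above, run against made‑up sample data; it is not the only way to write it.

```shell
#!/usr/bin/env bash
# Made-up sample input: a keyword followed by a value.
sample=$(mktemp)
printf '%s\n' 'foo alpha' 'bar beta' 'foo beta' 'foo alpha' > "$sample"

# One awk process filters, extracts, and counts; sort only orders the result.
counts=$(
  awk '/foo/ { seen[$2]++ } END { for (key in seen) print seen[key], key }' "$sample" |
    sort -nr
)

echo "$counts"
rm -f "$sample"
```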

🚨 Real‑world failures

🚨 Failure: CI pipeline “hangs randomly”

Scenario:

producer | consumer | uploader

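uploader exits early on error.
consumer dies with SIGPIPE on its next write to uploader (or lingers if it never writes again).
producer keeps writing, but:
some other process still holds the read end of its pipe open without reading, so the pipe fills and the write blocks forever,
or SIGPIPE is ignored/handled badly, so producer retries instead of exiting.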

Result: CI job hangs.

Root causes:

FDs not closed properly.
No pipefail.
No explicit error handling.

🚨 Failure: Partial deploy with green status

build | tee build.log | deploy

build fails halfway.
deploy reads partial log, still exits 0.
CI marks job as success.

Fix:

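set -o pipefail, so the job at least goes red.
Or split stages so deploy never starts after a failed build:

set -o pipefail
build | tee build.log || exit 1
deploy < build.log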

🚨 Failure: “Broken pipe” spam in logs

producer | head -n 10

head exits after 10 lines.
producer gets SIGPIPE, logs stack traces or errors.

Fix:

treat SIGPIPE (producer status 141) as normal in this context,
or adjust the producer to exit quietly on the EPIPE write error when it is expected.
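
In a Bash wrapper, the first option might look like the sketch below. The tolerate_sigpipe helper is a hypothetical name, and yes stands in for the real producer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map "killed by SIGPIPE" (128 + 13 = 141) to success
# and pass every other status through unchanged.
tolerate_sigpipe() {
  local status=$1
  if [ "$status" -eq 141 ]; then
    return 0
  fi
  return "$status"
}

yes | head -n 10 >/dev/null
producer_status=${PIPESTATUS[0]}

if tolerate_sigpipe "$producer_status"; then
  echo "producer ok (raw status $producer_status)"
else
  echo "producer really failed (status $producer_status)" >&2
fi
```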

🛠️ Patterns

🛠️ Pattern: Short, named pipelines

Instead of:

cat big.log | grep ERROR | awk '{print $5}' | sort | uniq -c | sort -nr

Use:

extract_errors() {
  grep ERROR |
  awk '{print $5}' |
  sort |
  uniq -c |
  sort -nr
}

extract_errors < big.log

Benefits:

easier to test,
easier to reuse,
easier to extend.
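
"Easier to test" can be taken literally: a named pipeline accepts fixture data on stdin. The log lines below are invented for this sketch, with the component name placed in field 5.

```shell
#!/usr/bin/env bash
extract_errors() {
  grep ERROR |
  awk '{print $5}' |
  sort |
  uniq -c |
  sort -nr
}

# Made-up log lines; field 5 is the component name.
top_error=$(
  printf '%s\n' \
    'Jan01 10:00 ERROR svc db timeout' \
    'Jan01 10:01 INFO svc api ok' \
    'Jan01 10:02 ERROR svc db timeout' \
    'Jan01 10:03 ERROR svc cache miss' |
    extract_errors | head -n 1 | awk '{ print $1, $2 }'   # normalize uniq -c padding
)

echo "$top_error"
```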

🛠️ Pattern: Validate inputs before pipelines

Before:

cat "$INPUT" | ...

Do:

[ -r "$INPUT" ] || {
  echo "Input not readable: $INPUT" >&2
  exit 1
}

Architecturally: fail fast before building complex process graphs.

🛠️ Pattern: Use logs as first‑class artifacts

Instead of:

producer | consumer

Consider:

producer | tee producer.log | consumer | tee consumer.log

So you can:

debug failures post‑mortem,
replay data through later stages.
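
A replay might look like the sketch below. The producer and consumer functions are hypothetical stand‑ins, and the logs are written to a temporary directory.

```shell
#!/usr/bin/env bash
# Hypothetical stages standing in for the real producer and consumer.
producer() { printf '%s\n' 3 1 2; }
consumer() { sort -n; }

workdir=$(mktemp -d)

# Live run: keep a copy of what each stage emitted.
producer | tee "$workdir/producer.log" | consumer | tee "$workdir/consumer.log" >/dev/null

# Later: replay the captured producer output through the consumer alone.
replayed=$(consumer < "$workdir/producer.log")

echo "$replayed"
```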

❌ Anti‑patterns

giant, unreadable one‑liner pipelines in critical scripts
relying on default exit‑code semantics without pipefail
treating an expected SIGPIPE (status 141) as an “unexpected error”
using pipelines where a small script would be clearer
mixing interactive and non‑interactive assumptions (colors, prompts, paging)

🔍 Debugging

🔍 Trace processes and FDs

Use:

strace -f -e trace=process,read,write sh script.sh

to see:

which processes are spawned,
who reads/writes which FDs,
where things block.

🔍 Inspect process tree live

ps f
pstree -p

to see:

which stages are still running,
whether something is stuck upstream.

🔍 Check exit codes of all stages

In Bash:

set -o pipefail
cmd1 | cmd2 | cmd3
echo "${PIPESTATUS[@]}"

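PIPESTATUS holds the exit code of each stage of the most recent foreground pipeline. Read it immediately: the next command overwrites it, and under set -e a failing pipeline aborts the script before the echo runs.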
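
A minimal sketch of capturing the array safely, using subshells with fixed exit codes in place of real commands:

```shell
#!/usr/bin/env bash
( exit 0 ) | ( exit 3 ) | ( exit 0 )
statuses=("${PIPESTATUS[@]}")     # must be the very next command

for i in "${!statuses[@]}"; do
  echo "stage $((i + 1)) exited with ${statuses[i]}"
done
```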

🧠 Summary

Advanced pipelines are not “just a bunch of commands with | between them”. They are:

process graphs (multiple PIDs),
FD graphs (pipes, redirections),
exit‑code semantics (last vs pipefail),
buffering and backpressure,
signal propagation (SIGINT, SIGPIPE).

Once you think of pipelines this way, you can design:

non‑hanging,
correctly failing,
observable,
production‑grade shell pipelines for CI/CD and data processing.
