
SOAR Playbook Design: Reactive Data Flow, Conditional Branching, and the Case for Dry-Run

Most SOAR playbooks are brittle scripts in a pretty wrapper. The ones that survive contact with production share four traits: a visual designer with reactive data flow, a typed node contract, real conditional branching, and a dry-run mode that runs the logic without firing the side effects. Here is how each one works and what changes when you have them.

By Krasper Engineering · 10 Jun 2026 · 9 min read

TL;DR — A SOAR playbook is supposed to be the codification of what your best analyst does at 3 AM. In practice, most playbooks drift into one of two failure modes: they are too rigid to handle the variation real incidents produce, or they are so generic that they automate nothing of consequence. Four design choices — reactive data flow, a typed node contract, first-class conditional branching, and a non-destructive dry-run mode — separate playbooks that get used from playbooks that get rewritten every quarter.

The pitch for security orchestration is straightforward: codify the runbook, fire it on the alert, let the platform handle the mechanical work, free the analyst to think. The reality is messier. Real incidents do not match the runbook because real incidents vary: the same phishing campaign reaches three users with slightly different mail headers; half the URLs are already neutralized by the time triage starts; the EDR has flagged two of the endpoints but not the third; the user in finance has approver privileges that change the containment calculus.

A playbook that cannot handle variation is a script. A script that cannot be safely modified is a liability. A liability that touches production systems is the reason SOAR projects quietly stop being trusted after the first year.

This post walks through the four design choices that, in our experience building Krasper Suite's playbook designer, separate playbooks that hold up under real traffic from the ones that get bypassed. The framing is technical; the examples are deliberately concrete.

Contents

  1. Why most playbooks decay
  2. Reactive data flow — why the editor matters
  3. The node contract — typed ports, explicit failure modes
  4. Conditional branching as a first-class primitive
  5. A worked example — phishing response, end to end
  6. Dry-run mechanics — testing without the blast radius
  7. What changes when you have all four

1. Why most playbooks decay

The decay pattern is consistent across organizations. A playbook is authored against a clean mental model of the incident. It runs well in the demo, well in the first week, well on the incidents that look like the one the author had in mind.

Then the variations start arriving.

The mail header parser was written for the cloud mail gateway and breaks on the on-prem appliance. The IP-to-asset enrichment returns unknown for cloud workloads. The containment step assumes endpoint isolation works, which it does, until the device is offline. The notification step assumes the channel responder is on call, which they are not at 3 AM on a Saturday.

Each variation produces an exception. Exceptions accumulate as bolted-on conditional logic, deeply nested and increasingly unreviewable. After six months the playbook is a 2,000-line YAML file that one engineer understands and is reluctant to modify. The team starts working around it.

The root cause is rarely the logic itself. It is the representation. A linear script with growing conditionals fights its own evolution. A typed directed graph with reactive data flow does not.

2. Reactive data flow — why the editor matters

The visual editor is often dismissed as a UX nicety. It is not. It is a forcing function for a specific data-flow model that linear scripts do not enforce.

In a reactive node-graph editor, every node declares its inputs and outputs as named, typed ports. Connections between nodes are explicit. The runtime walks the graph, evaluates a node when all its inputs are satisfied, and propagates the outputs to downstream nodes. There is no hidden state; the graph is the program.

   ┌──────────────┐       ┌──────────────────┐       ┌──────────────┐
   │  Alert       │       │  Enrich asset    │       │  Severity    │
   │  ingress     ├──────▶│                  ├──────▶│  classifier  │
   │              │       │  in: ip          │       │              │
   │  out: alert  │       │  out: asset      │       │  out: level  │
   └──────────────┘       └──────────────────┘       └──────┬───────┘
                                                            │
                          ┌─────────────────────────────────┴───┐
                          │                                     │
                          ▼                                     ▼
                   ┌────────────┐                       ┌────────────┐
                   │  Contain   │                       │  Notify    │
                   │  (high)    │                       │  (medium)  │
                   └────────────┘                       └────────────┘
Reactive node graph — typed ports, explicit edges

What this representation gives you that a script does not:

  • The graph is the contract. A reviewer can read it. A junior engineer can modify one node without understanding the whole graph.
  • Side effects are localized. A node that touches production is a single, identifiable shape on the canvas. Surrounding nodes are pure transformations.
  • Diffability is structural. A change to the graph is a change to a set of nodes and edges, not a change to a thousand-line script with shifting indentation. Pull requests against playbooks become reviewable.

The reactive part matters because the editor can show the current state of the data flowing through each port as the author builds the playbook against a sample alert. The author sees, in real time, what shape asset will have when it reaches the classifier — not what shape the documentation claims it will have. Most playbook bugs come from drift between documented and actual data shapes; reactive feedback collapses that gap.
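The evaluation rule described earlier in this section (a node runs once all of its inputs are satisfied, then pushes its outputs downstream) can be sketched in a few lines. This is a minimal illustration, not the Krasper Suite runtime: the `Graph` class, its method names, and the sample node functions are all hypothetical. The returned trace holds the value on every output port, which is roughly what a reactive editor would display against a sample alert.

```python
from collections import deque
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical model: a node is a function of named inputs that returns
# a dict of named outputs. None of these names come from a real product.
NodeFn = Callable[..., Dict[str, Any]]


class Graph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Tuple[NodeFn, List[str]]] = {}
        # (src_node, src_port) -> list of (dst_node, dst_port)
        self.edges: Dict[Tuple[str, str], List[Tuple[str, str]]] = {}

    def add_node(self, name: str, fn: NodeFn, inputs: List[str]) -> None:
        self.nodes[name] = (fn, inputs)

    def connect(self, src: str, src_port: str, dst: str, dst_port: str) -> None:
        self.edges.setdefault((src, src_port), []).append((dst, dst_port))

    def run(self, seed: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
        """Evaluate each node once all of its declared inputs have arrived."""
        pending = {name: dict(seed.get(name, {})) for name in self.nodes}
        results: Dict[str, Dict[str, Any]] = {}
        ready = deque(
            name for name, (_, ins) in self.nodes.items()
            if all(i in pending[name] for i in ins)
        )
        while ready:
            name = ready.popleft()
            fn, ins = self.nodes[name]
            outputs = fn(**{i: pending[name][i] for i in ins})
            results[name] = outputs
            # Propagate each output value along its explicit edges.
            for port, value in outputs.items():
                for dst, dst_port in self.edges.get((name, port), []):
                    pending[dst][dst_port] = value
                    dst_ins = self.nodes[dst][1]
                    satisfied = all(i in pending[dst] for i in dst_ins)
                    if satisfied and dst not in results and dst not in ready:
                        ready.append(dst)
        return results


g = Graph()
g.add_node("ingress", lambda alert: {"ip": alert["src_ip"]}, ["alert"])
g.add_node("enrich", lambda ip: {"asset": {"ip": ip, "criticality": "high"}}, ["ip"])
g.add_node("classify", lambda asset: {"level": asset["criticality"]}, ["asset"])
g.connect("ingress", "ip", "enrich", "ip")
g.connect("enrich", "asset", "classify", "asset")

# The trace maps every node to the values on its output ports.
trace = g.run({"ingress": {"alert": {"src_ip": "10.0.0.5"}}})
```

Because the only state is the set of values sitting on ports, the trace is a complete picture of the run; there is nothing else to inspect.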

3. The node contract — typed ports, explicit failure modes

The graph metaphor only buys you anything if the nodes inside it follow a tight contract. Our node contract has four required properties.

Typed inputs and outputs. Every port has a declared schema. The editor refuses to connect a port that outputs IpAddress to one that expects EmailMessage. This sounds obvious. Half of all playbook failures we have observed in audits trace back to runtime type mismatches that the editor could have prevented at design time.
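A connection-time type check of this kind might look like the sketch below. The `Port` dataclass, the `connect` function, and the marker classes are assumptions made for illustration; only the type names IpAddress and EmailMessage come from the text above, and a real schema system would be richer than Python class identity.

```python
from dataclasses import dataclass
from typing import Tuple


class IpAddress(str):
    """Marker type for a port carrying an IP address."""


class EmailMessage(dict):
    """Marker type for a port carrying a parsed email."""


@dataclass(frozen=True)
class Port:
    node: str
    name: str
    type_: type


class PortTypeError(TypeError):
    """Raised when two ports with incompatible types are connected."""


def connect(src: Port, dst: Port) -> Tuple[Port, Port]:
    """Return an edge only if the source type is acceptable to the destination."""
    if not issubclass(src.type_, dst.type_):
        raise PortTypeError(
            f"{src.node}.{src.name} emits {src.type_.__name__}, "
            f"but {dst.node}.{dst.name} expects {dst.type_.__name__}"
        )
    return (src, dst)


ip_out = Port("enrich", "ip", IpAddress)
ip_in = Port("geo_lookup", "ip", IpAddress)
mail_in = Port("parse_headers", "message", EmailMessage)

edge = connect(ip_out, ip_in)  # compatible: accepted

try:
    connect(ip_out, mail_in)   # incompatible: rejected before anything runs
    rejected = False
except PortTypeError:
    rejected = True
```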

Explicit failure outputs. Every node that touches an external system has a success port and an error port. The error port is not optional. If the author does not wire it, the editor flags the node as incomplete. This is the single design choice that has prevented the most production incidents — it forces the author to think about the unhappy path at design time, not after the first 2 AM page.
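The "incomplete node" check can be thought of as a lint pass over the graph. The sketch below assumes a hypothetical `Node` record with a `wiring` map from output port to downstream node; the node names are borrowed from the phishing example later in this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    touches_external_system: bool = False
    # Maps an output port name to the downstream node it is wired to.
    wiring: Dict[str, str] = field(default_factory=dict)


def incomplete_nodes(nodes: List[Node]) -> List[str]:
    """Names of side-effect nodes whose error port is not wired."""
    return [
        n.name
        for n in nodes
        if n.touches_external_system and not n.wiring.get("error")
    ]


playbook = [
    # Pure transformation: no error port required in this sketch.
    Node("parse_headers", wiring={"success": "enrich"}),
    # Side-effect node with both ports wired.
    Node("isolate_endpoint", True,
         {"success": "open_ticket", "error": "page_analyst"}),
    # Side-effect node with the unhappy path forgotten.
    Node("quarantine_mail", True, {"success": "identify_recipients"}),
]

flagged = incomplete_nodes(playbook)
```

An editor could run a check like this on every save and refuse to publish while `flagged` is non-empty.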

Idempotency by default. Side-effect nodes carry an auto-generated idempotency key derived from the playbook execution ID and the node ID. Replays of the same execution against the same node deduplicate at the runtime level. Authors do not have to think about it; the platform guarantees it.
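As a rough sketch of the deduplication idea, the key can be a hash over the execution ID and node ID, with the runtime remembering completed keys. The hashing scheme, the `SideEffectRunner` class, and the in-memory store are assumptions for illustration; a production runtime would persist keys durably. The text above does not say how nodes inside an iterator are keyed, so that case is not covered here.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def idempotency_key(execution_id: str, node_id: str) -> str:
    """Derive a stable key from the execution ID and node ID."""
    # JSON-encoding the pair avoids ambiguity between ("a:b", "c") and ("a", "b:c").
    raw = json.dumps([execution_id, node_id]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class SideEffectRunner:
    """Hypothetical runtime wrapper that deduplicates replays in memory."""

    def __init__(self) -> None:
        self._completed: Dict[str, Any] = {}

    def run(self, execution_id: str, node_id: str, action: Callable[[], Any]) -> Any:
        key = idempotency_key(execution_id, node_id)
        if key in self._completed:
            # Replay of the same execution and node: return the stored result.
            return self._completed[key]
        result = action()
        self._completed[key] = result
        return result


api_calls = []


def quarantine() -> str:
    api_calls.append("quarantine")  # stands in for a real API call
    return "quarantined"


runner = SideEffectRunner()
first = runner.run("exec-42", "quarantine_mail", quarantine)
replay = runner.run("exec-42", "quarantine_mail", quarantine)
```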

Deterministic timeouts. Every external call has a declared timeout. Nodes that exceed their timeout route to the error port, not to a hung execution. There is no "I am stuck waiting on the EDR API forever" failure mode, because the runtime kills the node before it gets there.
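One way to illustrate timeout-to-error-port routing is to run the external call in a worker and stop waiting when the deadline passes. The sketch below uses Python's `concurrent.futures`; the function name and the `("success" | "error", payload)` return convention are assumptions. Note one limitation: Python cannot forcibly kill a thread, so this sketch only abandons the stuck call, whereas the text above describes a runtime that actually terminates the node.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Tuple


def run_with_timeout(call: Callable[[], Any], timeout_s: float) -> Tuple[str, Any]:
    """Return ("success", value) or ("error", reason); never wait past timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return ("success", future.result(timeout=timeout_s))
    except FutureTimeout:
        return ("error", f"timed out after {timeout_s}s")
    except Exception as exc:
        # Any other adapter failure is also routed to the error port.
        return ("error", repr(exc))
    finally:
        # Do not block on a stuck worker; a real runtime needs a harder kill.
        pool.shutdown(wait=False)


def slow_edr_call() -> str:
    time.sleep(0.5)  # simulates an EDR API that is slow to answer
    return "isolated"


slow_port, slow_payload = run_with_timeout(slow_edr_call, timeout_s=0.05)
fast_port, fast_payload = run_with_timeout(lambda: "isolated", timeout_s=1.0)
```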

A node that satisfies all four properties is composable. A node that satisfies none is the reason playbooks decay.

4. Conditional branching as a first-class primitive

Most early SOAR products treated conditionals as an afterthought — a node with an "if" expression on it, output routed to a single downstream node. That works for one or two branches. It collapses at scale.

A first-class branching primitive looks like this:

  • A switch node with N declared outputs, each guarded by a typed predicate over the inputs.
  • The runtime evaluates predicates in declared order, routes to the first match, and short-circuits the rest.
  • A required default output catches anything that matches no predicate. The editor refuses to save the playbook without it.
  • The output ports of the switch carry narrowed types — downstream of the "alert is from cloud mail gateway" branch, the alert port type narrows to CloudMailAlert, and the downstream nodes only see the fields they can rely on.
                     ┌────────────────────────────┐
                     │  switch on alert.source    │
                     │                            │
                     │  case cloud_mail  ─────────┼──▶ cloud-mail subgraph
                     │  case onprem_mail ─────────┼──▶ onprem subgraph
                     │  case api_gateway ─────────┼──▶ api subgraph
                     │  default          ─────────┼──▶ unknown-source handler
                     └────────────────────────────┘
Switch node with required default and type-narrowed outputs

Type narrowing is the part that distinguishes a real branching primitive from a cosmetic one. With it, each subgraph can be authored against a tighter contract and the editor catches integration mistakes at design time. Without it, every subgraph has to defensively re-check every field, which is exactly the failure mode that bloats playbooks into unreadable conditionals.

The default branch is not optional for a reason. Unhandled cases in production at 3 AM produce no value. Either the playbook decides what to do, or it explicitly escalates to a human. Both are acceptable; silent fall-through is not.
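The switch semantics above (ordered predicates, first match wins, mandatory default, narrowed payload per branch) can be sketched as follows. Everything here is hypothetical: the `Case` and `SwitchNode` names, and the fields on `CloudMailAlert`, are invented for illustration. Python only narrows at run time in this sketch; an editor with static port types would enforce the same thing at design time.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Alert = Dict[str, Any]


@dataclass(frozen=True)
class CloudMailAlert:
    # Hypothetical narrowed shape; field names are illustrative only.
    message_id: str
    tenant: str


@dataclass(frozen=True)
class Case:
    port: str
    predicate: Callable[[Alert], bool]
    narrow: Callable[[Alert], Any]  # builds the narrowed payload for this branch


class SwitchNode:
    def __init__(self, cases: List[Case], default_port: str) -> None:
        if not default_port:
            # Mirrors the rule that a playbook cannot be saved without a default.
            raise ValueError("switch node requires a default output")
        self.cases = cases
        self.default_port = default_port

    def route(self, alert: Alert) -> Tuple[str, Any]:
        """Evaluate predicates in declared order; the first match wins."""
        for case in self.cases:
            if case.predicate(alert):
                return case.port, case.narrow(alert)
        return self.default_port, alert  # unmatched alerts stay un-narrowed


switch = SwitchNode(
    cases=[
        Case("cloud_mail",
             lambda a: a.get("source") == "cloud_mail",
             lambda a: CloudMailAlert(a["message_id"], a["tenant"])),
        Case("onprem_mail", lambda a: a.get("source") == "onprem_mail", dict),
    ],
    default_port="unknown_source",
)

port, payload = switch.route(
    {"source": "cloud_mail", "message_id": "m-1", "tenant": "t-9"}
)
fallback_port, fallback_payload = switch.route({"source": "fax"})
```

Downstream of the `cloud_mail` port, nodes receive a `CloudMailAlert` and can rely on `message_id` and `tenant` without re-checking them.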

5. A worked example — phishing response, end to end

Phishing is the standard example because every SOC runs it and the variation surface is large. The skeleton of a real playbook looks like this:

[Alert ingress: mail-security signal]
        │
        ▼
[Parse headers + extract URLs/attachments]
        │
        ▼
[Enrich: sender reputation, domain age, URL reputation]
        │
        ▼
[switch: phishing_confidence]
   ├── high   → [Quarantine mail across all mailboxes]
   │              │
   │              ▼
   │           [Identify all recipients] ── for each ──▶
   │              ├── [User in privileged group?] ── yes ──▶ [Force password reset]
   │              │                                 └── no  ──▶ [Notify user via approved channel]
   │              ▼
   │           [Endpoint check: any URL clicked?] ── yes ──▶ [Isolate endpoint + open IR ticket]
   │
   ├── medium → [Quarantine for sender, flag for analyst review]
   │
   ├── low    → [Tag for trend analysis, no user impact]
   │
   └── default → [Hold for analyst, no automatic action]
Phishing-response playbook — skeleton

A few things to notice about this shape.

The enrichment step is one node, not three — the platform handles parallel fan-out internally. The author does not write concurrency code; they declare that the classifier needs sender_reputation, domain_age, and url_reputation, and the runtime resolves the dependencies in parallel where it can.
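To make the fan-out concrete, here is a minimal sketch of what a runtime might do on the author's behalf: run the three declared lookups concurrently and hand the classifier a single result keyed by name. The `enrich` function and the stub lookups are assumptions; only the names sender_reputation, domain_age, and url_reputation come from the text, and the return values are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

Alert = Dict[str, Any]
Lookup = Callable[[Alert], Any]


def enrich(alert: Alert, lookups: Dict[str, Lookup]) -> Dict[str, Any]:
    """Run every declared lookup concurrently and gather results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(lookups))) as pool:
        futures = {name: pool.submit(fn, alert) for name, fn in lookups.items()}
        return {name: future.result() for name, future in futures.items()}


# Stub lookups standing in for real reputation services.
def sender_reputation(alert: Alert) -> str:
    time.sleep(0.05)
    return "poor"


def domain_age(alert: Alert) -> int:
    time.sleep(0.05)
    return 3  # days since registration, invented for the example


def url_reputation(alert: Alert) -> str:
    time.sleep(0.05)
    return "malicious"


enriched = enrich(
    {"sender": "billing@example.test"},
    {
        "sender_reputation": sender_reputation,
        "domain_age": domain_age,
        "url_reputation": url_reputation,
    },
)
```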

The "for each recipient" loop is a single iterator node with a subgraph attached. The subgraph is itself reviewable, diffable, testable in isolation. The author does not write loops; they declare "this subgraph runs once per recipient."

The privileged-user check is a switch with two outcomes, both explicitly handled. There is no implicit fall-through where a privileged user accidentally gets the standard notification path.

The endpoint isolation step has an error port (not shown above) that routes to an analyst alert if isolation fails — for instance, because the endpoint is offline. The playbook never silently assumes its containment action succeeded.

6. Dry-run mechanics — testing without the blast radius

The single feature that determines whether engineers actually iterate on playbooks is dry-run.

A dry-run executes the entire graph against a real or sampled alert: it walks every node, evaluates every predicate, and propagates every value, but does not call the side-effect APIs. Nodes that would quarantine mail, isolate endpoints, force password resets, or open tickets instead record what they would have done and emit a synthetic result on their success port.

The implementation pattern is straightforward: every side-effect adapter checks an execution_mode flag at entry. In live mode it calls the upstream API. In dry_run mode it returns a deterministic synthetic response that matches the schema of a real success, and records the intended call into the execution trace.

┌────────────────────────────────────┐
│ Side-effect adapter                │
│                                    │
│ if mode == "dry_run":              │
│   record_intended_call(args)       │
│   return synthetic_success(args)   │
│                                    │
│ else:                              │
│   return call_real_api(args)       │
└────────────────────────────────────┘
Side-effect adapter — mode-aware entry point
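The boxed pseudocode can be expanded into a runnable sketch. The `ExecutionContext` and `SideEffectAdapter` names are assumptions, as is the choice to put the mode check in a shared wrapper rather than in each adapter's own code; the point of that choice is that an individual adapter author cannot forget the check. An unrecognized mode raises instead of falling through to the live path.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ExecutionContext:
    mode: str  # "live" or "dry_run"
    trace: List[Dict[str, Any]] = field(default_factory=list)


class SideEffectAdapter:
    """Hypothetical wrapper that owns the mode check for every adapter."""

    def __init__(
        self,
        name: str,
        call_real_api: Callable[..., Dict[str, Any]],
        synthetic_success: Callable[..., Dict[str, Any]],
    ) -> None:
        self.name = name
        self._call_real_api = call_real_api
        self._synthetic_success = synthetic_success

    def __call__(self, ctx: ExecutionContext, **args: Any) -> Dict[str, Any]:
        if ctx.mode == "dry_run":
            # Record the intended call, then return a schema-compatible fake.
            ctx.trace.append({"adapter": self.name, "intended_call": args})
            return self._synthetic_success(**args)
        if ctx.mode == "live":
            return self._call_real_api(**args)
        # Fail closed: never guess that an unknown mode means "live".
        raise ValueError(f"unknown execution mode: {ctx.mode!r}")


real_calls: List[str] = []


def quarantine_api(message_id: str) -> Dict[str, Any]:
    real_calls.append(message_id)  # stands in for the production API
    return {"status": "quarantined", "message_id": message_id}


quarantine = SideEffectAdapter(
    "quarantine_mail",
    quarantine_api,
    lambda message_id: {"status": "quarantined", "message_id": message_id},
)

ctx = ExecutionContext(mode="dry_run")
result = quarantine(ctx, message_id="m-1")
```

After the dry run, `real_calls` is still empty and `ctx.trace` holds the one intended call, which is the record an author would review before shipping.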

The benefits compound:

  • Authors can run a playbook against last week's actual phishing alert and see the full execution trace before pushing the change.
  • CI can replay a corpus of historical alerts against the new playbook on every change, and fail the build if any execution diverges from the previous version in unexpected ways.
  • Incident retros can replay the exact alert against the current playbook to confirm the fix lands.
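The CI replay in the second bullet reduces to comparing dry-run traces between two playbook versions over a corpus. The sketch below models a playbook as a function from alert to a list of intended actions; that model, the function names, and the idea of an author-declared `expected` set for intentional changes are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Set

Alert = Dict[str, Any]
Trace = List[str]                    # intended side effects, in order
Playbook = Callable[[Alert], Trace]  # a dry-run of one playbook version


def diverging_alerts(
    corpus: List[Alert], baseline: Playbook, candidate: Playbook
) -> Set[str]:
    """IDs of alerts whose dry-run trace differs between the two versions."""
    return {a["id"] for a in corpus if baseline(a) != candidate(a)}


def check_build(
    corpus: List[Alert], baseline: Playbook, candidate: Playbook, expected: Set[str]
) -> bool:
    """Pass only if every divergence was declared as expected by the author."""
    return diverging_alerts(corpus, baseline, candidate) <= expected


# Two stub versions that differ only in the quarantine threshold.
def playbook_v1(alert: Alert) -> Trace:
    return ["quarantine_mail"] if alert["confidence"] >= 0.9 else ["hold_for_analyst"]


def playbook_v2(alert: Alert) -> Trace:
    return ["quarantine_mail"] if alert["confidence"] >= 0.8 else ["hold_for_analyst"]


corpus = [
    {"id": "a1", "confidence": 0.95},
    {"id": "a2", "confidence": 0.85},
    {"id": "a3", "confidence": 0.2},
]

changed = diverging_alerts(corpus, playbook_v1, playbook_v2)
build_ok = check_build(corpus, playbook_v1, playbook_v2, expected={"a2"})
```

Only the alert between the old and new thresholds changes behavior, so the build passes when the author has declared that divergence and fails when they have not.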

There is one design discipline required: dry-run integrity is a platform guarantee, not a per-node one. Every side-effect adapter that ships must honor the mode flag, and the runtime enforces it at the adapter boundary, not at the author's discretion. A single adapter that ignores dry-run and quietly calls production turns the entire mode into a liability. Adapter test suites verify both modes on every release.

7. What changes when you have all four

The four pieces — reactive data flow, the typed node contract, first-class branching, dry-run — are individually useful and collectively transformative. The shift shows up in two measurable places.

Mean time to respond. When the playbook handles variation natively, the manual handoff steps that previously gated execution disappear. For a phishing playbook of the shape above, the automatable portion of triage typically drops from tens of minutes (human-driven) to seconds (automated end to end) in the environments we have observed, because the analyst no longer has to make the early classification decision by hand. The remaining human time is concentrated on the genuinely ambiguous cases the playbook escalates intentionally, which is where analyst attention belongs.

Mean time to change. This is the metric most teams forget to measure. A playbook that takes a week to safely modify is a playbook the team avoids modifying. With dry-run plus structural diffs, the modification loop typically shrinks from days to under an hour: edit the graph, dry-run against a corpus, review the structural diff, ship. The downstream effect is that the playbook actually keeps pace with the threat landscape instead of falling further behind it.

The numbers depend heavily on the baseline and the integration quality of the surrounding stack — anyone quoting a single before/after multiple is selling a slide, not measuring an outcome. The honest framing is: the direction is consistent across environments we have observed; the magnitude varies.

Closing

A SOAR product is not the playbooks it ships with. It is the authoring substrate — the editor, the node contract, the branching primitives, the dry-run guarantee. The playbooks themselves are how that substrate gets used.

If the substrate is sound, the playbooks evolve. If the substrate is not, the playbooks ossify, the team works around them, and the investment quietly stops returning value.

The next post in this series goes one layer down — into the runtime itself, how executions are persisted, how partial failures are recovered, and why the execution log is the most important audit artifact the platform produces.

Further reading

  • NIST SP 800-61r3: Incident Response Recommendations and Considerations for Cybersecurity Risk Management
  • MITRE D3FEND: Detection, Denial, and Disruption Framework Empowering Network Defense
  • Eric Evans: Domain-Driven Design, section on intention-revealing interfaces