on-call automation · feature overview

nanny 🍼

A triage brain bolted onto a Prometheus/Icinga substrate — it investigates alerts, writes the ticket, and decides whether and when to wake a human.

adaptive paging · read-only & audited · interactive ntfy / Slack pages · LLM optional · live demo you can try

Viktor von Drakk · Director of Operations · DomainTools

the problem

On-call is noisy, slow, and human-expensive

🌙

3 a.m. pages for nothing

Many alerts clear themselves in minutes — but a human already got woken.

🔁

Repetitive triage

Same runbook lookup, same "is the host down or one check?", every time.

📋

Thin incident records

Tickets opened late, half-filled; the "what did we check?" trail is lost.

nanny's bet: a machine can do the first 10 minutes of triage and only escalate when a human is genuinely needed.

architecture

One server between your monitoring and your tools

Ships as one localhost/nanny:bundle podman image. Validated against a real Icinga: 576 hosts / 4,714 checks. Notifications fan out to ntfy, Slack, and PagerDuty — pick any (or none).

the core loop

What happens when an alert fires

An operator who claims or acks during the hold — by tapping the page on their phone or clicking in the console — cancels the auto-page. nanny sees a human is on it.
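The decision at the end of the hold can be sketched as below. This is a minimal illustration, not nanny's actual code: the `Incident` fields, the `Outcome` names, and the choice to check recovery before claim/ack are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SELF_RESOLVED = "self_resolved"  # alert cleared during the hold, no page
    HUMAN_ON_IT = "human_on_it"      # claimed or acked during the hold, auto-page cancelled
    PAGED = "paged"                  # hold expired with nobody on it


@dataclass
class Incident:
    signature: str                    # e.g. "db-02/replication-lag" (format is hypothetical)
    resolved: bool = False
    claimed_by: Optional[str] = None
    acked: bool = False


def outcome_after_hold(incident: Incident) -> Outcome:
    """Decide what happens when the hold timer expires."""
    if incident.resolved:
        return Outcome.SELF_RESOLVED
    if incident.claimed_by is not None or incident.acked:
        return Outcome.HUMAN_ON_IT
    return Outcome.PAGED
```

Only the last branch wakes anyone up; the first two are logged and end without a page.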

the key idea

Adaptive paging: learn what self-resolves

Each incident's outcome is recorded per alertname/service signature.
If a signature historically clears itself — ≥3 samples, ≥60% self-resolve — nanny holds for the p90 self-resolve time + 30s before paging.
Unknown or flaky signatures get the default 120s hold; no hold exceeds the 900s cap.
The page fires only for things that actually need a human.

Purely statistical — the hold is learned from history, not the model's opinion. Transparent and explainable.
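The hold rule above fits in a few lines. The thresholds are the ones on the slide; the function names, the nearest-rank definition of p90, and applying the 900s cap to the learned hold are assumptions for this sketch.

```python
from typing import List, Tuple

MIN_SAMPLES = 3               # at least 3 recorded outcomes
MIN_SELF_RESOLVE_RATE = 0.60  # at least 60% of them self-resolved
MARGIN_S = 30                 # safety margin added to the p90
DEFAULT_HOLD_S = 120          # unknown or flaky signatures
MAX_HOLD_S = 900              # cap on any hold


def p90(values: List[float]) -> float:
    """Nearest-rank 90th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) in integer math
    return ordered[rank - 1]


def hold_seconds(outcomes: List[Tuple[bool, float]]) -> float:
    """outcomes: (self_resolved, seconds_until_resolved) per past incident
    of one alertname/service signature."""
    if len(outcomes) < MIN_SAMPLES:
        return DEFAULT_HOLD_S
    self_resolve_times = [secs for self_resolved, secs in outcomes if self_resolved]
    if len(self_resolve_times) / len(outcomes) < MIN_SELF_RESOLVE_RATE:
        return DEFAULT_HOLD_S
    return min(p90(self_resolve_times) + MARGIN_S, MAX_HOLD_S)
```

For example, a signature with three self-resolves at 100s, 200s, and 300s plus one that needed a human has a 75% self-resolve rate, so the hold is the p90 (300s) plus 30s.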

how triage gathers evidence

Four read-only tools — with or without an LLM

📡

get_icinga_state

Live host + every service check — "is the whole box down, or one check?"

📚

search_confluence

Finds the runbook; the service's notes_url is also fed in up front.

🗂️

search_jira

Prior incidents for this service and how they resolved.

📜

fetch_logs (LLM mode)

Tails allow-listed logs over a locked-down SSH forced command. No shell, ever.

With an LLM, an agent decides which tools to call and reasons over the results to reach a probable cause. With no LLM (none), the core sources run in a fixed order: live state, runbook, prior incidents, self-resolve history. nanny then reports the evidence verbatim, with no reasoning, and skips the log tail. Either brain still tickets & pages.

Guardrails (always on): the log reader is a locked-down, read-only SSH forced command — no shell, ever; every tool call lands on the per-incident audit trail; the LLM loop is turn-capped. nanny proposes what a human should do — it never remediates.
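A rough sketch of the no-LLM path with the audit guardrail: every tool call is recorded before it runs, and the tools run in the fixed order described above. The tool names `get_icinga_state`, `search_confluence`, and `search_jira` come from the slides; `self_resolve_history`, the alert shape, and the audit record format are hypothetical.

```python
from typing import Any, Callable, Dict, List

Tool = Callable[[Dict[str, Any]], Any]

# Fixed order of the no-LLM path; fetch_logs is deliberately absent.
FIXED_ORDER = [
    "get_icinga_state",      # live state
    "search_confluence",     # runbook
    "search_jira",           # prior incidents
    "self_resolve_history",  # hypothetical name for the history lookup
]


def triage_without_llm(
    alert: Dict[str, Any],
    tools: Dict[str, Tool],
    audit: List[Dict[str, str]],
) -> Dict[str, Any]:
    """Run each read-only tool once, in order, and return the raw evidence."""
    evidence: Dict[str, Any] = {}
    for name in FIXED_ORDER:
        # Record the call first so a failing tool still leaves a trace.
        audit.append({"incident": alert["id"], "tool": name})
        evidence[name] = tools[name](alert)
    return evidence
```

The evidence dictionary is returned as gathered, which matches the "verbatim, no reasoning" behavior of the none brain.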

the brain is a choice

Hosted, local, or no LLM at all

☁️

LLM API

Default — a hosted model (Claude). Strongest reasoning; per-incident token cost; data leaves the network.

🖥️

Local model

Any OpenAI-compatible server (Ollama/vLLM). Fully offline, in-network.

🚫

No LLM (none)

Deterministic triage: gather live state, runbook, prior incidents, self-resolve history — still tickets & pages. Zero tokens.

Same pipeline, three brains. Adaptive paging and ticketing don't depend on an LLM — so cost and data-egress are a dial, not a requirement.

notifications & interactive ops

The page lands on your phone — and you drive it from there

ntfy — a free, no-account push channel. The page arrives as an urgent notification with Ack / Claim / Console buttons.
Tapping Ack or Claim calls straight back into nanny — the same claim/ack workflow as the web console, from your lock screen.
Slack — interactive Block Kit cards with the same buttons, updated in place as the incident moves.
PagerDuty — Events API v2 trigger / resolve for teams already living there.
Every level is styled: investigating, paged (urgent), recovered — so the phone tells the whole story.
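As an illustration of the ntfy page, the sketch below builds (but does not send) a publish request using ntfy's `Title`, `Priority`, `Tags`, and `Actions` headers. The topic URL, the callback paths under `callback_base`, and the function name are hypothetical; whatever credentials the callbacks carry are omitted here.

```python
import urllib.request


def build_ntfy_page(topic_url: str, incident_id: str, summary: str,
                    callback_base: str) -> urllib.request.Request:
    """Build an urgent ntfy notification with Ack / Claim / Console buttons."""
    # ntfy action syntax: "<action>, <label>, <url>[, key=value ...]", joined by ";".
    actions = "; ".join([
        f"http, Ack, {callback_base}/incidents/{incident_id}/ack, method=POST",
        f"http, Claim, {callback_base}/incidents/{incident_id}/claim, method=POST",
        f"view, Console, {callback_base}/console#{incident_id}",
    ])
    headers = {
        "Title": f"nanny page: {incident_id}",
        "Priority": "urgent",      # ntfy's highest priority level
        "Tags": "rotating_light",  # emoji shortcode shown with the title
        "Actions": actions,
    }
    return urllib.request.Request(
        topic_url, data=summary.encode("utf-8"), headers=headers, method="POST"
    )


# Sending is one call once the topic is real:
#   urllib.request.urlopen(build_ntfy_page(...), timeout=10)
```

The two `http` actions make the phone call back into the server directly, while `view` just opens the console in a browser.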

One incident, one source of truth: the phone, the Slack card, and the console all act on the same shared state with idempotent claim/ack.
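Idempotent claim/ack means a double tap on the phone, or the same button pressed in Slack and the console, leaves the incident in the same state. A minimal in-memory sketch, with hypothetical class and method names:

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentState:
    claimed_by: Optional[str] = None
    acked: bool = False
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def claim(self, operator: str) -> str:
        """First claim wins; repeats and later claims change nothing.
        Returns the current owner so every channel can render the same state."""
        with self._lock:
            if self.claimed_by is None:
                self.claimed_by = operator
            return self.claimed_by

    def ack(self) -> bool:
        """Returns True only for the call that actually changed the state."""
        with self._lock:
            changed = not self.acked
            self.acked = True
            return changed
```

Returning the current owner rather than an error keeps retries from a flaky mobile connection harmless.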

A live incident: triage analysis, audit trail, and the Claim / Ack / Release actions mirrored on phone & console.

the operator console

A zero-dependency web console

Dashboard — fleet health, clickable counters, 24h MTTR + paged-vs-self-resolved, live auto-refresh.
Triage — per-incident live state, runbook, analysis + audit, claim/ack/release.
Fleet — live state grouped by host, filterable + paginated.
Controls — one-click integration self-tests, dry-run incident simulator.
Log — everything nanny does, streamed (and forwardable as JSON/syslog).

Hand-written HTML/CSS/JS — no framework, no build step.
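The 24h dashboard numbers could be derived roughly as follows. This is a sketch under assumptions: MTTR is taken as mean time from open to resolve, an incident resolved without a page is counted as self-resolved, and the record fields are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional


def last_24h_stats(incidents: List[Dict[str, Any]], now: datetime) -> Dict[str, Optional[float]]:
    """Summarize incidents resolved in the 24 hours before `now`."""
    cutoff = now - timedelta(hours=24)
    closed = [
        i for i in incidents
        if i["resolved_at"] is not None and i["resolved_at"] >= cutoff
    ]
    durations = [(i["resolved_at"] - i["opened_at"]).total_seconds() for i in closed]
    paged = sum(1 for i in closed if i["paged"])
    return {
        "mttr_seconds": sum(durations) / len(durations) if durations else None,
        "paged": paged,
        "self_resolved": len(closed) - paged,
    }
```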

Live dashboard — fleet health, 24h paged-vs-self-resolved, and the rolling incident history.

try it in 60 seconds

A self-contained demo mode — no integrations needed

nanny demo spins up the whole lifecycle with synthetic incidents on a loop — fire → triage → hold → page → resolve.
No Slack, PagerDuty, Jira, Icinga, or API keys. Side effects run dry; triage uses the offline no-LLM path; pages go to ntfy.
One incident self-resolves before paging; another holds, pages your phone, then recovers — the thesis, live.
Pace is a dial (NANNY_DEMO_SPEED); run it on a loop or fire on demand.

It's how a prospect tries nanny — clone, docker compose up, subscribe a phone to the ntfy topic, watch it run. Runs anywhere, leaks nothing.

Demo mode fabricates a fleet so every tab is alive — here a host firing CRIT mid-scenario.

interactive

Fire a real incident right now

🔔 Alert fires: db-02 · replication-lag

🔎 Triage + ticket: gathers state, runbook, history

⏳ Adaptive hold: waiting to see if it clears

📲 Pages on-call 📳: ntfy push · Ack / Claim buttons

✅ Self-resolved: no page, logged & learned

why it's valuable

Fewer pages, faster triage, a real paper trail

↓

needless 3 a.m. pages — held until proven real

~10 min

of first-responder triage done before a human is involved

100%

of incidents get a ticket + audit trail, automatically

The win isn't "AI ops magic" — it's removing the repetitive first 10 minutes and not waking people for self-healing blips.

deployment & scale

From one image to stateless on Kubernetes

Single-node mode: in-memory state + local files — the dev / one-team default. Live now on a public cloud host (HTTPS, auth, signed webhooks).
Built: stateless replicas on OpenShift/RHOS, all shared state in a CloudNativePG Postgres cluster — any pod serves any request.
The hard part is the delayed page. It is a page_at column plus a reconciler that every replica runs: each due row is claimed with FOR UPDATE SKIP LOCKED, so one replica sends each page exactly once with no leader election. Unit-tested on the in-memory backend.
Packaged as a Helm chart + CNPG Cluster CR (helm-lint/template clean); logs forward as JSON/syslog.
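The reconciler idea can be sketched as below. Only `page_at` and `FOR UPDATE SKIP LOCKED` come from the slides; the `incidents` table, the `paged_at` column, the batch size, and the in-memory stand-in are assumptions for illustration. The SQL is shown as a string and not executed here.

```python
import threading
import time
from typing import Callable, Dict, List, Optional

# Postgres shape of the claim step (hypothetical schema). Rows locked by
# another replica are skipped, so two replicas never claim the same page.
CLAIM_DUE_PAGES_SQL = """
WITH due AS (
    SELECT id FROM incidents
    WHERE page_at <= now() AND paged_at IS NULL
    ORDER BY page_at
    LIMIT 20
    FOR UPDATE SKIP LOCKED
)
UPDATE incidents AS i SET paged_at = now()
FROM due WHERE i.id = due.id
RETURNING i.id;
"""


class InMemoryBackend:
    """Single-process stand-in: a lock plays the role of row locking."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._rows: Dict[str, Dict[str, Optional[float]]] = {}

    def schedule(self, incident_id: str, page_at: float) -> None:
        with self._lock:
            self._rows[incident_id] = {"page_at": page_at, "paged_at": None}

    def claim_due(self, now: float) -> List[str]:
        """Atomically mark every due, unpaged incident and return its id."""
        with self._lock:
            due = [
                incident_id for incident_id, row in self._rows.items()
                if row["page_at"] <= now and row["paged_at"] is None
            ]
            for incident_id in due:
                self._rows[incident_id]["paged_at"] = now
            return due


def reconcile_once(backend: InMemoryBackend, send_page: Callable[[str], None],
                   now: Optional[float] = None) -> List[str]:
    """One reconciler tick; safe to run concurrently on every replica."""
    claimed = backend.claim_due(time.time() if now is None else now)
    for incident_id in claimed:
        send_page(incident_id)
    return claimed
```

One design note: marking the row before sending means a crash between the two steps loses that page rather than duplicating it, so a real backend would need a retry or confirmation step to get from at-most-once to exactly-once.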

Status: all 4 phases built. Not yet proven on a live cluster: the Postgres backend + a real multi-replica exactly-once paging run — the step between here and production HA.

honest status

What's solid, what's still open

Hardened & working

✓ Auth on every endpoint — 256-bit bearer tokens, idle+absolute expiry, rate-limited login
✓ Webhooks require a token (HMAC or bearer) — verified live
✓ Read-only SSH log reader — forced command, no shell, allow-listed
✓ Interactive ntfy + Slack paging — shipped & deployed
✓ Audit trail forwards as JSON / syslog
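For a sense of what 256-bit tokens with idle + absolute expiry involve, here is a small sketch. The timeout values and function names are hypothetical; the slides state only the token size and the two kinds of expiry.

```python
import secrets

IDLE_TIMEOUT_S = 30 * 60           # hypothetical: expire after 30 min without use
ABSOLUTE_TIMEOUT_S = 12 * 60 * 60  # hypothetical: expire 12 h after login regardless


def new_bearer_token() -> str:
    """32 random bytes (256 bits) from the OS CSPRNG, hex-encoded."""
    return secrets.token_hex(32)


def session_valid(issued_at: float, last_seen_at: float, now: float) -> bool:
    """A session must pass both the idle and the absolute check."""
    idle_ok = (now - last_seen_at) <= IDLE_TIMEOUT_S
    absolute_ok = (now - issued_at) <= ABSOLUTE_TIMEOUT_S
    return idle_ok and absolute_ok
```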

Open / in flight

⚠ F6 TLS verification (Icinga TLS can be disabled or downgraded)
⚠ F7 CSRF / Origin check on mutating endpoints
⚠ F10 prompt-injection guarding (alert text → LLM)
⚠ HA: built, not yet cluster-proven
✗ Not for internet exposure until F6/F7 close · not auto-remediation (by design)

Full OWASP/SOC-2 audit — hardened items, remaining findings, and fit — in the appendix (last slides) and SECURITY_AUDIT.md.

where it's going

Roadmap & the honest bottom line

☸️

Prove HA on a cluster

Verify the CNPG backend + multi-replica exactly-once paging on real OpenShift.

🔐

Close the audit

F6 TLS, F7 CSRF, F10 prompt-guarding, tamper-evident SIEM audit trail.

📲

Self-hosted ntfy

Bundle an ntfy server for an always-on, quota-free notification channel.

Bottom line: nanny already does the repetitive first-responder work, stops needless pages, and pages you interactively on your phone — auth-hardened, with a live demo you can fire yourself. You know exactly which gaps remain between here and production HA.

appendix · security audit — fixed

Auth is now hardened

#	Was	Now
F2	32-bit token in the URL, never expired	256-bit bearer token in a header; idle + absolute expiry; logout
F3	Endpoints open when LDAP off (default)	Auth required on every endpoint; ldap / password / opt-in name-only
F4	Anyone could forge pages & tickets	Webhooks require a token (HMAC-SHA256 or bearer) — verified live
F5	Login brute-forceable	Per-IP rate-limit → 429 lockout
F8	Stored XSS via a runbook URL	safeUrl() http(s)-only + escaping
F1	Secrets exposed in nanny.env	Rotated/revoked; file locked to 0600 + sanitized .env.example
F9 / F11	Unbounded body · ephemeral audit log	Body cap (partial) · audit forwards to stdout/syslog
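The F4 fix (webhooks signed with HMAC-SHA256) generally looks like the sketch below. How nanny transports the signature (header name, hex vs. base64) is not stated in the slides, so the hex encoding and function names here are assumptions.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison, so timing does not leak a partial match."""
    expected = sign_body(secret, body).encode("ascii")
    return hmac.compare_digest(expected, signature_hex.encode("utf-8"))
```

The signature must be computed over the exact bytes received, before any JSON parsing, or a re-serialized body will fail verification.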

appendix · security audit — open & fit

Remaining items & where it fits

🔐

F6 · TLS verification

Icinga TLS can be disabled/downgraded. Needs default-verify + pinned CA.

🎭

F7 · CSRF · 🧠 F10 · Prompt injection

No Origin check on JSON mutations yet; attacker-influenced alert text reaches the LLM.

Plus low-severity F12–F15 (error leakage, query escaping, dep pinning, scan-pivot). Full report: SECURITY_AUDIT.md.

Good fit today

✓ A single ops team's on-call, behind a VPN
✓ Cutting page-fatigue + auto-documenting incidents
✓ Legible, self-hosted — cloud, local, or no LLM

Not yet a fit

✗ Internet-exposed without F6/F7 closed
✗ Multi-replica HA until CNPG is cluster-verified
✗ "Set & forget" auto-remediation (by design)

wrapping up

nanny 🍼

Automates the first-responder's first ten minutes, holds back the blips that self-heal, and pages a human — interactively, on their phone — only when one is genuinely needed. Every incident leaves a ticket and an audit trail.

🌙

Fewer pages

the adaptive hold suppresses what self-resolves

⚡

Faster triage

evidence gathered before a human is involved — LLM optional

📒

Always audited

100% of incidents ticketed, every action logged

Questions? Let's talk.

Try it yourself: nanny demo (self-contained, no integrations)