on-call automation · feature overview

nanny 🍼

A triage brain bolted onto a Prometheus/Icinga substrate: it investigates alerts, writes the ticket, and decides whether and when to wake a human.

adaptive paging · read-only & audited · interactive ntfy / Slack pages · LLM optional · live demo you can try

Viktor von Drakk · Director of Operations · DomainTools

the problem

On-call is noisy, slow, and human-expensive

🌙

3 a.m. pages for nothing

Many alerts clear themselves in minutes, but a human already got woken.

🔁

Repetitive triage

Same runbook lookup, same "is the host down or one check?", every time.

📋

Thin incident records

Tickets opened late, half-filled; the "what did we check?" trail is lost.

nanny's bet: a machine can do the first 10 minutes of triage and only escalate when a human is genuinely needed.

architecture

One server between your monitoring and your tools

  • Icinga2 / Alertmanager: state of truth + alerts (signed)
  • nanny: incident brain + API (triage via LLM or none · adaptive page hold · shared state + audit)
  • Jira: tickets
  • Confluence: runbooks
  • ntfy: push + buttons
  • PagerDuty / Slack: page · interactive cards
  • Hosts: read-only logs (SSH)
  • Operators: web console + phone

Ships as one localhost/nanny:bundle podman image. Validated against a real Icinga: 576 hosts / 4,714 checks. Notifications fan out to ntfy, Slack, and PagerDuty; pick any (or none).

the core loop

What happens when an alert fires

  • 🔔 Alert: signed webhook
  • 🔎 Triage: gather + write ticket
  • ⏳ Adaptive hold: wait the learned window
  • ✅ Self-resolved: no page · learn
  • 📲 Page on-call: ntfy / PagerDuty
  • 🙋 Claim / Ack: from phone or console

An operator who claims or acks during the hold, by tapping the page on their phone or clicking in the console, cancels the auto-page. nanny sees a human is on it.
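A minimal sketch of that rule, assuming a single in-memory incident record. The field names (claimed_by, paged, resolved) and the method names are hypothetical, not nanny's actual API; only the behavior (a claim during the hold cancels the auto-page, and repeated claims are harmless) comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    signature: str
    page_at: float                      # epoch seconds at which the hold expires
    claimed_by: Optional[str] = None    # hypothetical field name
    paged: bool = False
    resolved: bool = False

    def claim(self, operator: str) -> str:
        """Idempotent claim: the first operator wins, repeats change nothing."""
        if self.claimed_by is None:
            self.claimed_by = operator
        return self.claimed_by

    def should_page(self, now: float) -> bool:
        """Page only when the hold has expired and no human or recovery got there first."""
        return (
            not self.paged
            and not self.resolved
            and self.claimed_by is None
            and now >= self.page_at
        )
```

Because claim() returns the current owner instead of failing, a double tap on the phone and a click in the console converge on the same state.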

the key idea

Adaptive paging: learn what self-resolves

  • Each incident's outcome is recorded per alertname/service signature.
  • If a signature historically clears itself (≥3 samples, ≥60% self-resolve), nanny holds for the p90 self-resolve time + 30s before paging.
  • Unknown/flaky signatures get the default 120s; cap is 900s.
  • The page fires only for things that actually need a human.
Purely statistical: the hold is learned from history, not the model's opinion. Transparent and explainable.
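The rule above can be sketched in a few lines. The thresholds (3 samples, 60%, p90 + 30s, 120s default, 900s cap) are the ones on the slide. Two things are assumptions: the percentile uses the nearest-rank method, and "≥3 samples" is read as three recorded outcomes of any kind for the signature. The function names are hypothetical.

```python
import math
from typing import Optional, Sequence

DEFAULT_HOLD_S = 120          # unknown or flaky signatures
MAX_HOLD_S = 900              # hard cap on any hold
MARGIN_S = 30                 # added on top of the p90
MIN_SAMPLES = 3
MIN_SELF_RESOLVE_RATE = 0.60


def p90(values: Sequence[float]) -> float:
    """Nearest-rank 90th percentile (assumed method; the deck does not specify one)."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))   # 1-based rank
    return ordered[rank - 1]


def hold_seconds(outcomes: Sequence[Optional[float]]) -> float:
    """outcomes: one entry per past incident of this signature.

    A number is the seconds it took to self-resolve; None means a human was needed.
    """
    if len(outcomes) < MIN_SAMPLES:
        return DEFAULT_HOLD_S
    resolved = [t for t in outcomes if t is not None]
    if len(resolved) / len(outcomes) < MIN_SELF_RESOLVE_RATE:
        return DEFAULT_HOLD_S
    return min(p90(resolved) + MARGIN_S, MAX_HOLD_S)
```

For example, a signature with outcomes [40, 60, 50, None] self-resolves 75% of the time, so the hold is the p90 of the three resolve times (60s) plus 30s, i.e. 90s.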
[Figure: timeline of the hold. Axis: time →. Labels: "page?", "hold window", "most clear here", "→ page fires".]
how triage gathers evidence

Four read-only tools, with or without an LLM

📡

get_icinga_state

Live host + every service check: "is the whole box down, or one check?"

📚

search_confluence

Finds the runbook; the service's notes_url is also fed in up front.

🗂️

search_jira

Prior incidents for this service and how they resolved.

📜

fetch_logs (LLM mode)

Tails allow-listed logs over a locked-down SSH forced command. No shell, ever.
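One way such a log reader could look, as a sketch only. When sshd runs a forced command it exposes the client's requested command in the SSH_ORIGINAL_COMMAND environment variable; everything else here (the "name lines" request format, the allow-list entries, the 500-line limit) is a hypothetical illustration, not nanny's actual reader.

```python
import os
import subprocess
import sys

# Hypothetical allow-list: short name -> absolute path on the host.
ALLOWED_LOGS = {
    "nginx-error": "/var/log/nginx/error.log",
    "syslog": "/var/log/syslog",
}
MAX_LINES = 500  # assumed upper bound on a single tail


def parse_request(original_command: str) -> tuple[str, int]:
    """Accept only '<log-name> <line-count>'; reject everything else."""
    parts = original_command.split()
    if len(parts) != 2 or parts[0] not in ALLOWED_LOGS or not parts[1].isdigit():
        raise ValueError("request not allowed")
    return ALLOWED_LOGS[parts[0]], min(int(parts[1]), MAX_LINES)


def main() -> int:
    """Entry point to wire up as the forced command for the reader's SSH key."""
    try:
        path, lines = parse_request(os.environ.get("SSH_ORIGINAL_COMMAND", ""))
    except ValueError as exc:
        print(exc, file=sys.stderr)
        return 1
    # Argument list, no shell: the client's text is never interpreted as a command.
    return subprocess.run(["tail", "-n", str(lines), path]).returncode
```

The client can only pick a name from the allow-list and a line count, so a request like "syslog 50; rm -rf /" is rejected before anything runs.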

With an LLM, an agent decides which tools to call and reasons the results into a probable cause. With no LLM (none), the core sources run in a fixed order (live state, runbook, prior incidents, self-resolve history) and nanny reports the evidence verbatim, with no reasoning (it skips the log tail). Either brain still tickets & pages.
Guardrails (always on): the log reader is a locked-down, read-only SSH forced command with no shell, ever; every tool call lands on the per-incident audit trail; the LLM loop is turn-capped. nanny proposes what a human should do; it never remediates.
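The no-LLM path can be pictured as a plain loop over the sources in the stated order. This is a sketch under assumptions: the first three tool names come from the slide, "self_resolve_history" is a made-up name for the fourth source, and carrying on when one source fails is an illustrative choice rather than documented behavior.

```python
from typing import Callable, Dict, List, Tuple

# Fixed order from the slide; "self_resolve_history" is a hypothetical name.
FIXED_ORDER = [
    "get_icinga_state",
    "search_confluence",
    "search_jira",
    "self_resolve_history",
]


def triage_without_llm(
    service: str,
    tools: Dict[str, Callable[[str], str]],
    audit: List[Tuple[str, str]],
) -> Dict[str, str]:
    """Run every source in FIXED_ORDER and return its output verbatim."""
    evidence: Dict[str, str] = {}
    for name in FIXED_ORDER:
        audit.append((name, service))        # every tool call lands on the audit trail
        try:
            evidence[name] = tools[name](service)
        except Exception as exc:             # assumption: one failed source does not stop triage
            evidence[name] = f"unavailable: {exc}"
    return evidence
```

fetch_logs is deliberately absent from the list, matching the note that the no-LLM path skips the log tail.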
the brain is a choice

Hosted, local, or no LLM at all

☁️

LLM API

Default: a hosted model (Claude). Strongest reasoning; per-incident token cost; data leaves the network.

🖥️

Local model

Any OpenAI-compatible server (Ollama/vLLM). Fully offline, in-network.

🚫

No LLM (none)

Deterministic triage: gather live state, runbook, prior incidents, self-resolve history; still tickets & pages. Zero tokens.

Same pipeline, three brains. Adaptive paging and ticketing don't depend on an LLM, so cost and data-egress are a dial, not a requirement.

notifications & interactive ops

The page lands on your phone, and you drive it from there

  • ntfy: a free, no-account push channel. The page arrives as an urgent notification with Ack / Claim / Console buttons.
  • Tapping Ack or Claim calls straight back into nanny: the same claim/ack workflow as the web console, from your lock screen.
  • Slack: interactive Block Kit cards with the same buttons, updated in place as the incident moves.
  • PagerDuty: Events API v2 trigger / resolve for teams already living there.
  • Every level is styled: investigating, paged (urgent), recovered. The phone tells the whole story.
One incident, one source of truth: the phone, the Slack card, and the console all act on the same shared state with idempotent claim/ack.
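A sketch of what an ntfy page with buttons could look like on the wire. The Title, Priority, Tags, and Actions headers and the "http" / "view" action types are ntfy's documented publish API. The callback paths (/incidents/<id>/ack, /claim, /console), the hostnames, and the function name are hypothetical, and the authentication the callbacks would need is left out.

```python
import urllib.request


def build_ntfy_page(
    ntfy_base: str,
    topic: str,
    incident_id: str,
    summary: str,
    nanny_base: str,
) -> urllib.request.Request:
    """Build (but do not send) an urgent ntfy notification with three buttons."""
    actions = "; ".join([
        # "http" actions make the phone call back into the server when tapped.
        f"http, Ack, {nanny_base}/incidents/{incident_id}/ack, method=POST",
        f"http, Claim, {nanny_base}/incidents/{incident_id}/claim, method=POST",
        # "view" opens a URL, here the web console.
        f"view, Console, {nanny_base}/console",
    ])
    return urllib.request.Request(
        f"{ntfy_base}/{topic}",
        data=summary.encode("utf-8"),
        method="POST",
        headers={
            "Title": f"PAGE {incident_id}",
            "Priority": "urgent",
            "Tags": "rotating_light",
            "Actions": actions,
        },
    )
```

Passing the returned request to urllib.request.urlopen would publish it; building it separately keeps the example free of network calls.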
[Figure: live incident in the triage console with claim/ack/runbook actions]
A live incident: triage analysis, audit trail, and the Claim / Ack / Release actions mirrored on phone & console.
the operator console

A zero-dependency web console

  • Dashboard: fleet health, clickable counters, 24h MTTR + paged-vs-self-resolved, live auto-refresh.
  • Triage: per-incident live state, runbook, analysis + audit, claim/ack/release.
  • Fleet: live state grouped by host, filterable + paginated.
  • Controls: one-click integration self-tests, dry-run incident simulator.
  • Log: everything nanny does, streamed (and forwardable as JSON/syslog).

Hand-written HTML/CSS/JS: no framework, no build step.

[Figure: nanny dashboard with fleet health, MTTR donuts, recent incidents]
Live dashboard: fleet health, 24h paged-vs-self-resolved, and the rolling incident history.
try it in 60 seconds

A self-contained demo mode, no integrations needed

  • nanny demo spins up the whole lifecycle with synthetic incidents on a loop: fire → triage → hold → page → resolve.
  • No Slack, PagerDuty, Jira, Icinga, or API keys. Side effects run dry; triage uses the offline no-LLM path; pages go to ntfy.
  • One incident self-resolves before paging; another holds, pages your phone, then recovers: the thesis, live.
  • Pace is a dial (NANNY_DEMO_SPEED); run it on a loop or fire on demand.
It's how a prospect tries nanny: clone, docker compose up, subscribe a phone to the ntfy topic, watch it run. Runs anywhere, leaks nothing.
[Figure: synthetic fleet view with a firing host]
Demo mode fabricates a fleet so every tab is alive; here a host is firing CRIT mid-scenario.
interactive

Fire a real incident right now

  • 🔔 Alert fires: db-02 · replication-lag
  • 🔎 Triage + ticket: gathers state, runbook, history
  • ⏳ Adaptive hold: waiting to see if it clears
  • 📲 Pages on-call: 📳 ntfy push · Ack / Claim buttons
  • ✅ Self-resolved: no page, logged & learned
why it's valuable

Fewer pages, faster triage, a real paper trail

↓
needless 3 a.m. pages, held until proven real
~10 min
of first-responder triage done before a human is involved
100%
of incidents get a ticket + audit trail, automatically

The win isn't "AI ops magic"; it's removing the repetitive first 10 minutes and not waking people for self-healing blips.

deployment & scale

From one image to stateless on Kubernetes

  • Single-node mode: in-memory state + local files, the dev / one-team default. Live now on a public cloud host (HTTPS, auth, signed webhooks).
  • Built: stateless replicas on OpenShift/RHOS, all shared state in a CloudNativePG Postgres cluster; any pod serves any request.
  • The hard part, the delayed page, is a page_at column + a reconciler every replica runs (FOR UPDATE SKIP LOCKED) → paged exactly once, no leader election. Unit-tested on the in-memory backend.
  • Packaged as a Helm chart + CNPG Cluster CR (helm-lint/template clean); logs forward as JSON/syslog.
Status: all 4 phases built. Not yet proven on a live cluster: the Postgres backend + a real multi-replica exactly-once paging run, the step between here and production HA.
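To make the reconciler idea concrete, here is a sketch. The SQL shows one common shape of the FOR UPDATE SKIP LOCKED pattern in Postgres; the table and column names other than page_at are assumptions. The Python part is an in-memory stand-in (a lock plays the role of the row locks) so the once-only behavior can be exercised without a database. None of this is nanny's actual code.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Assumed schema: incidents(id, page_at, paged, claimed_by). Rows locked by
# another replica are skipped, so two replicas never claim the same incident.
CLAIM_DUE_SQL = """
UPDATE incidents SET paged = TRUE
WHERE id IN (
    SELECT id FROM incidents
    WHERE NOT paged AND claimed_by IS NULL AND page_at <= now()
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""


@dataclass
class Row:
    id: str
    page_at: float
    paged: bool = False
    claimed_by: Optional[str] = None


class InMemoryStore:
    """Single-process stand-in for the shared table."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = {r.id: r for r in rows}
        self._lock = threading.Lock()

    def claim_due(self, now: float) -> List[str]:
        """Atomically mark due, unclaimed incidents as paged and return their ids."""
        with self._lock:
            due = [
                r for r in self._rows.values()
                if not r.paged and r.claimed_by is None and r.page_at <= now
            ]
            for r in due:
                r.paged = True
            return [r.id for r in due]


def reconcile(store: InMemoryStore, now: float, send_page: Callable[[str], None]) -> List[str]:
    """One reconciler pass; every replica can run this on a timer."""
    ids = store.claim_due(now)
    for incident_id in ids:
        send_page(incident_id)
    return ids
```

In this sketch the row is marked before the page is sent, so a crash between the two steps would drop a page rather than duplicate one; how the real reconciler handles that window is not stated in the deck.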
[Figure: OpenShift / Kubernetes topology. Route → Service → three nanny replicas → CNPG Postgres (HA · single source of truth).]
honest status

What's solid, what's still open

Hardened & working

  • ✓ Auth on every endpoint: 256-bit bearer tokens, idle+absolute expiry, rate-limited login
  • ✓ Webhooks require a token (HMAC or bearer), verified live
  • ✓ Read-only SSH log reader: forced command, no shell, allow-listed
  • ✓ Interactive ntfy + Slack paging: shipped & deployed
  • ✓ Audit trail forwards as JSON / syslog
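For the webhook item, a minimal sketch of HMAC-SHA256 verification over the raw request body. The shared secret, the hex encoding of the signature, and the function names are assumptions; the deck only says webhooks require an HMAC or bearer token.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body, as a sender would compute it."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(secret, body).encode("ascii")
    received = signature_header.strip().lower().encode("utf-8")
    return hmac.compare_digest(expected, received)
```

The comparison uses hmac.compare_digest rather than == so that the check does not leak how many leading characters matched.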

Open / in flight

  • ⚠ F6 TLS verification (Icinga can be downgraded)
  • ⚠ F7 CSRF / Origin check on mutating endpoints
  • ⚠ F10 prompt-injection guarding (alert text → LLM)
  • ⚠ HA: built, not yet cluster-proven
  • ✗ Not for internet exposure until F6/F7 close · not auto-remediation (by design)
Full OWASP/SOC-2 audit (hardened items, remaining findings, and fit) in the appendix (last slides) and SECURITY_AUDIT.md.
where it's going

Roadmap & the honest bottom line

☸️

Prove HA on a cluster

Verify the CNPG backend + multi-replica exactly-once paging on real OpenShift.

🔐

Close the audit

F6 TLS, F7 CSRF, F10 prompt-guarding, tamper-evident SIEM audit trail.

📲

Self-hosted ntfy

Bundle an ntfy server for an always-on, quota-free notification channel.

Bottom line: nanny already does the repetitive first-responder work, stops needless pages, and pages you interactively on your phone, auth-hardened, with a live demo you can fire yourself. You know exactly which gaps remain between here and production HA.

appendix · security audit: fixed

Auth is now hardened

# | Was | Now
F2 | 32-bit token in the URL, never expired | 256-bit bearer token in a header; idle + absolute expiry; logout
F3 | Endpoints open when LDAP off (default) | Auth required on every endpoint; ldap / password / opt-in name-only
F4 | Anyone could forge pages & tickets | Webhooks require a token (HMAC-SHA256 or bearer), verified live
F5 | Login brute-forceable | Per-IP rate-limit → 429 lockout
F8 | Stored XSS via a runbook URL | safeUrl() http(s)-only + escaping
F1 | Secrets exposed in nanny.env | Rotated/revoked; file locked to 0600 + sanitized .env.example
F9 / F11 | Unbounded body · ephemeral audit log | Body cap (partial) · audit forwards to stdout/syslog
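The F8 fix (http(s)-only plus escaping) can be illustrated with a short Python analogue. This is not the project's safeUrl(), which the table only names; the function below is a hypothetical sketch of the same idea using the standard library.

```python
import html
from urllib.parse import urlsplit


def safe_url(raw: str) -> str:
    """Return an HTML-escaped URL if it is absolute http(s), else an empty string."""
    candidate = raw.strip()
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return ""
    # Only http and https with a host are allowed; javascript:, data:, etc. are dropped.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return ""
    # Escape so the value is safe inside an HTML attribute.
    return html.escape(candidate, quote=True)
```

A runbook link such as javascript:alert(1) comes back empty, while an ordinary wiki URL passes through with &, quotes, and angle brackets escaped.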
appendix · security audit: open & fit

Remaining items & where it fits

🔐

F6 · TLS verification

Icinga TLS can be disabled/downgraded. Needs default-verify + pinned CA.

🎭

F7 · CSRF  ·  🧠 F10 · Prompt injection

No Origin check on JSON mutations yet; attacker-influenced alert text reaches the LLM.

Plus low-severity F12–F15 (error leakage, query escaping, dep pinning, scan-pivot). Full report: SECURITY_AUDIT.md.

Good fit today

  • ✓ A single ops team's on-call, behind a VPN
  • ✓ Cutting page-fatigue + auto-documenting incidents
  • ✓ Legible, self-hosted: cloud, local, or no LLM

Not yet a fit

  • ✗ Internet-exposed without F6/F7 closed
  • ✗ Multi-replica HA until CNPG is cluster-verified
  • ✗ "Set & forget" auto-remediation (by design)
wrapping up

nanny 🍼

Automates the first-responder's first ten minutes, holds back the blips that self-heal, and pages a human (interactively, on their phone) only when one is genuinely needed. Every incident leaves a ticket and an audit trail.

🌙

Fewer pages

the adaptive hold suppresses what self-resolves

⚡

Faster triage

evidence gathered before a human is involved; LLM optional

📒

Always audited

100% of incidents ticketed, every action logged

Questions? Let's talk.

Try it yourself: nanny demo (self-contained, no integrations)
