A triage brain bolted onto a Prometheus/Icinga substrate: it investigates alerts, writes the ticket, and decides whether and when to wake a human.
adaptive paging · read-only & audited · interactive ntfy / Slack pages · LLM optional · live demo you can try
Viktor von Drakk · Director of Operations · DomainTools
the problem
On-call is noisy, slow, and human-expensive
3 a.m. pages for nothing
Many alerts clear themselves in minutes, but by then a human has already been woken.
Repetitive triage
Same runbook lookup, same "is the host down or one check?", every time.
Thin incident records
Tickets opened late, half-filled; the "what did we check?" trail is lost.
nanny's bet: a machine can do the first 10 minutes of triage and only escalate when a human is genuinely needed.
architecture
One server between your monitoring and your tools
Ships as one localhost/nanny:bundle podman image. Validated against a real Icinga: 576 hosts / 4,714 checks. Notifications fan out to ntfy, Slack, and PagerDuty: pick any (or none).
the core loop
What happens when an alert fires
An operator who claims or acks during the hold, by tapping the page on their phone or clicking in the console, cancels the auto-page. nanny sees a human is on it.
the key idea
Adaptive paging: learn what self-resolves
Each incident's outcome is recorded per alertname/service signature.
If a signature historically clears itself (≥3 samples, ≥60% self-resolve), nanny holds for the p90 self-resolve time + 30s before paging.
Unknown/flaky signatures get the default 120s; cap is 900s.
The page fires only for things that actually need a human.
Purely statistical: the hold is learned from history, not the model's opinion. Transparent and explainable.
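The hold rule above can be sketched in a few lines. This is a minimal illustration, not nanny's actual code: the function names, the nearest-rank percentile method, and the reading of "≥3 samples" as total recorded outcomes are all assumptions. Only the thresholds (3 samples, 60%, p90 + 30s, 120s default, 900s cap) come from the slide.

```python
import math
from typing import Optional, Sequence

# Thresholds taken from the slide.
MIN_SAMPLES = 3
MIN_SELF_RESOLVE_RATIO = 0.60
MARGIN_S = 30
DEFAULT_HOLD_S = 120
MAX_HOLD_S = 900


def p90(values: Sequence[float]) -> float:
    """Nearest-rank 90th percentile (assumed method; the deck does not specify one)."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def hold_seconds(outcomes: Sequence[Optional[float]]) -> float:
    """Return the hold for one alert signature.

    `outcomes` has one entry per past incident: seconds until the alert
    cleared by itself, or None when a human had to step in (hypothetical
    data shape, chosen for this sketch).
    """
    resolved = [t for t in outcomes if t is not None]
    if len(outcomes) < MIN_SAMPLES:
        return DEFAULT_HOLD_S  # unknown signature
    if len(resolved) / len(outcomes) < MIN_SELF_RESOLVE_RATIO:
        return DEFAULT_HOLD_S  # flaky signature: does not reliably self-heal
    return min(p90(resolved) + MARGIN_S, MAX_HOLD_S)
```

For example, a signature that self-resolved twice (40s and 50s) out of three incidents would be held for 80s under these assumptions, while one with no history gets the 120s default.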
how triage gathers evidence
Four read-only tools, with or without an LLM
get_icinga_state
Live host + every service check: "is the whole box down, or one check?"
search_confluence
Finds the runbook; the service's notes_url is also fed in up front.
search_jira
Prior incidents for this service and how they resolved.
fetch_logs (LLM mode)
Tails allow-listed logs over a locked-down SSH forced command. No shell, ever.
With an LLM, an agent decides which tools to call and reasons the results into a probable cause. With no LLM (none), the core sources run in a fixed order (live state, runbook, prior incidents, self-resolve history) and nanny reports the evidence verbatim, with no reasoning (it skips the log tail). Either brain still tickets & pages.
Guardrails (always on): the log reader is a locked-down, read-only SSH forced command (no shell, ever); every tool call lands on the per-incident audit trail; the LLM loop is turn-capped. nanny proposes what a human should do; it never remediates.
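The no-LLM path can be pictured as a fixed loop over the core sources, with every call appended to the incident's audit trail. The sketch below is an assumption about shape only: `Incident`, `triage_without_llm`, and the `self_resolve_history` tool name are hypothetical, and the tools are passed in as plain callables rather than real Icinga/Confluence/Jira clients.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Fixed order from the slide: live state, runbook, prior incidents,
# self-resolve history. fetch_logs is deliberately absent (LLM mode only).
FIXED_ORDER = (
    "get_icinga_state",
    "search_confluence",
    "search_jira",
    "self_resolve_history",  # hypothetical name for the history lookup
)

Tool = Callable[[str], str]  # takes an alert signature, returns evidence text


@dataclass
class Incident:
    signature: str
    evidence: Dict[str, str] = field(default_factory=dict)
    audit: List[Tuple[str, str]] = field(default_factory=list)  # (tool, outcome)


def triage_without_llm(signature: str, tools: Dict[str, Tool]) -> Incident:
    """Run the core sources in order and report their output verbatim."""
    incident = Incident(signature)
    for name in FIXED_ORDER:
        try:
            result = tools[name](signature)
            outcome = "ok"
        except Exception as exc:  # a failing source must not stop triage
            result = f"unavailable: {exc}"
            outcome = "error"
        incident.evidence[name] = result
        incident.audit.append((name, outcome))  # every tool call is audited
    return incident
```

Because the loop never interprets the results, the output is deterministic for a given set of tool responses, which matches the "evidence verbatim, no reasoning" behaviour described above.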
the brain is a choice
Hosted, local, or no LLM at all
LLM API
Default: a hosted model (Claude). Strongest reasoning; per-incident token cost; data leaves the network.
Local model
Any OpenAI-compatible server (Ollama/vLLM). Fully offline, in-network.
No LLM (none)
Deterministic triage: gather live state, runbook, prior incidents, self-resolve history; still tickets & pages. Zero tokens.
Same pipeline, three brains. Adaptive paging and ticketing don't depend on an LLM, so cost and data egress are a dial, not a requirement.
notifications & interactive ops
The page lands on your phone, and you drive it from there
ntfy: a free, no-account push channel. The page arrives as an urgent notification with Ack / Claim / Console buttons.
Tapping Ack or Claim calls straight back into nanny: the same claim/ack workflow as the web console, from your lock screen.
Slack: interactive Block Kit cards with the same buttons, updated in place as the incident moves.
PagerDuty: Events API v2 trigger / resolve for teams already living there.
Every level is styled: investigating, paged (urgent), recovered. The phone tells the whole story.
One incident, one source of truth: the phone, the Slack card, and the console all act on the same shared state with idempotent claim/ack.
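Idempotent claim/ack means a repeated tap (from the phone, the Slack card, or the console) leaves the shared state unchanged. A minimal in-memory sketch follows; the class, method, and return-value names are hypothetical, and a real multi-replica deployment would need the same check done atomically in the shared store rather than under a process-local lock.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class IncidentState:
    claimed_by: Optional[str] = None
    acked: bool = False


class IncidentStore:
    """Single shared state that every channel acts on (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._incidents: Dict[str, IncidentState] = {}

    def claim(self, incident_id: str, operator: str) -> str:
        with self._lock:
            state = self._incidents.setdefault(incident_id, IncidentState())
            if state.claimed_by is None:
                state.claimed_by = operator
                return "claimed"
            if state.claimed_by == operator:
                return "already_claimed"  # repeat tap: no state change
            return "claimed_by_other"     # first claim wins

    def ack(self, incident_id: str) -> str:
        with self._lock:
            state = self._incidents.setdefault(incident_id, IncidentState())
            if state.acked:
                return "already_acked"    # repeat ack: no state change
            state.acked = True
            return "acked"
```

Returning a distinct status for the repeat case lets each channel update its own card or notification without treating the duplicate as an error.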
A live incident: triage analysis, audit trail, and the Claim / Ack / Release actions mirrored on phone & console.
of first-responder triage done before a human is involved
100%
of incidents get a ticket + audit trail, automatically
The win isn't "AI ops magic"; it's removing the repetitive first 10 minutes and not waking people for self-healing blips.
deployment & scale
From one image to stateless on Kubernetes
Single-node mode: in-memory state + local files, the dev / one-team default. Live now on a public cloud host (HTTPS, auth, signed webhooks).
Built: stateless replicas on OpenShift/RHOS, all shared state in a CloudNativePG Postgres cluster, so any pod serves any request.
The hard part, the delayed page, is a page_at column + a reconciler every replica runs (FOR UPDATE SKIP LOCKED): paged exactly once, no leader election. Unit-tested on the in-memory backend.
Packaged as a Helm chart + CNPG Cluster CR (helm-lint/template clean); logs forward as JSON/syslog.
Status: all 4 phases built. Not yet proven on a live cluster: the Postgres backend + a real multi-replica exactly-once paging run. That is the step between here and production HA.
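The reconciler idea can be sketched as "atomically claim every due, unclaimed incident, then page it". Only `page_at` and `FOR UPDATE SKIP LOCKED` come from the slide. The table name, the `paged_at` and `claimed_by` columns, and all Python names below are assumptions made for illustration. The SQL string shows one plausible Postgres form, and the class is an in-memory stand-in that uses a lock where Postgres would use row locks.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical Postgres equivalent: rows locked by another replica are
# skipped, so concurrent reconcilers never claim the same incident.
CLAIM_DUE_SQL = """
UPDATE incidents SET paged_at = now()
WHERE id IN (
    SELECT id FROM incidents
    WHERE page_at <= now() AND paged_at IS NULL AND claimed_by IS NULL
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""


@dataclass
class Row:
    page_at: float
    paged_at: Optional[float] = None
    claimed_by: Optional[str] = None  # set when a human claims during the hold


class InMemoryBackend:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.rows: Dict[str, Row] = {}

    def claim_due(self, now: float) -> List[str]:
        """Mark due, unpaged, unclaimed incidents as paged; return their ids."""
        with self._lock:
            due = [
                incident_id
                for incident_id, row in self.rows.items()
                if row.page_at <= now and row.paged_at is None and row.claimed_by is None
            ]
            for incident_id in due:
                self.rows[incident_id].paged_at = now
            return due


def reconcile_once(backend: InMemoryBackend, now: float,
                   send_page: Callable[[str], None]) -> None:
    """One reconciler tick; every replica can run this concurrently."""
    for incident_id in backend.claim_due(now):
        send_page(incident_id)
```

One design note: this sketch marks the row before sending, so a crash between the two steps would drop the page rather than duplicate it. Whether nanny orders the steps this way is not stated in the deck.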
honest status
What's solid, what's still open
Hardened & working
✓ Auth on every endpoint: 256-bit bearer tokens, idle + absolute expiry, rate-limited login
✓ Webhooks require a token (HMAC or bearer), verified live
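To make "256-bit bearer tokens, idle + absolute expiry" concrete, here is a minimal session-store sketch. The class name and both timeout values are assumptions; the deck does not state nanny's actual limits or storage.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

IDLE_TIMEOUT_S = 30 * 60           # assumed value, for illustration only
ABSOLUTE_TIMEOUT_S = 12 * 60 * 60  # assumed value, for illustration only


@dataclass
class Session:
    user: str
    created: float
    last_seen: float


class SessionStore:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sessions: Dict[str, Session] = {}

    def issue(self, user: str) -> str:
        token = secrets.token_urlsafe(32)  # 32 random bytes = 256 bits
        now = self._clock()
        self._sessions[token] = Session(user, now, now)
        return token

    def validate(self, token: str) -> Optional[str]:
        """Return the user for a live token, or None if unknown or expired."""
        session = self._sessions.get(token)
        if session is None:
            return None
        now = self._clock()
        if (now - session.created > ABSOLUTE_TIMEOUT_S
                or now - session.last_seen > IDLE_TIMEOUT_S):
            del self._sessions[token]
            return None
        session.last_seen = now  # activity resets only the idle timer
        return session.user

    def logout(self, token: str) -> None:
        self._sessions.pop(token, None)
```

The absolute limit is measured from `created` and is never refreshed, so even a continuously active session eventually has to log in again.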
Bundle an ntfy server for an always-on, quota-free notification channel.
Bottom line: nanny already does the repetitive first-responder work, stops needless pages, and pages you interactively on your phone. It is auth-hardened, with a live demo you can fire yourself. You know exactly which gaps remain between here and production HA.
appendix · security audit: fixed
Auth is now hardened
# | Was | Now
F2 | 32-bit token in the URL, never expired | 256-bit bearer token in a header; idle + absolute expiry; logout
F3 | Endpoints open when LDAP off (default) | Auth required on every endpoint; ldap / password / opt-in name-only
F4 | Anyone could forge pages & tickets | Webhooks require a token (HMAC-SHA256 or bearer), verified live
F5 | Login brute-forceable | Per-IP rate limit → 429 lockout
F8 | Stored XSS via a runbook URL | safeUrl() http(s)-only + escaping
F1 | Secrets exposed in nanny.env | Rotated/revoked; file locked to 0600 + sanitized .env.example
F9 / F11 | Unbounded body · ephemeral audit log | Body cap (partial) · audit forwards to stdout/syslog
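The F4 fix (HMAC-SHA256 webhook tokens) typically boils down to recomputing a digest over the raw request body and comparing it in constant time. The sketch below uses only the Python standard library. The `sha256=<hex>` header format and the function names are assumptions, not nanny's documented wire format.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Signature a sender would attach (assumed 'sha256=<hex>' format)."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check a webhook signature against the raw body bytes."""
    expected = sign(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), header_value.encode())
```

The check must run on the exact bytes received, before any JSON parsing or re-serialization, otherwise a valid signature can fail to verify.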
appendix · security audit: open & fit
Remaining items & where it fits
F6 · TLS verification
Icinga TLS can be disabled/downgraded. Needs default-verify + pinned CA.
F7 · CSRF · F10 · Prompt injection
No Origin check on JSON mutations yet; attacker-influenced alert text reaches the LLM.
Plus low-severity F12–F15 (error leakage, query escaping, dep pinning, scan-pivot). Full report: SECURITY_AUDIT.md.
✓ Legible, self-hosted: cloud, local, or no LLM
Not yet a fit
✗ Internet-exposed without F6/F7 closed
✗ Multi-replica HA until CNPG is cluster-verified
✗ "Set & forget" auto-remediation (by design)
wrapping up
nanny
Automates the first-responder's first ten minutes, holds back the blips that self-heal, and pages a human (interactively, on their phone) only when one is genuinely needed. Every incident leaves a ticket and an audit trail.
Fewer pages
the adaptive hold suppresses what self-resolves
Faster triage
evidence gathered before a human is involved; LLM optional
Always audited
100% of incidents ticketed, every action logged
Questions? Let's talk.
Try it yourself: nanny demo (self-contained, no integrations) · Viktor von Drakk · Director of Operations · DomainTools