# Aither — Incident Response Runbook

**Last reviewed:** 11 May 2026 · **Owner:** Max Geurtsen · **Phone:** +31 6 4502 9680 (24/7 for confirmed P0)

NIS2-aligned incident handling playbook. It lives in the public repo so clients can review it, and in the operator's offline runbook in case Aither's own systems are compromised.

## Severity classification

| Sev | Definition                                                            | Acknowledge | Mitigate | Notify customer     |
|-----|-----------------------------------------------------------------------|-------------|----------|---------------------|
| P0  | Confirmed customer data exposure / RCE / total outage > 30 min        | 1 hour      | 4 hours  | 24 hours (written)  |
| P1  | Suspected exposure / partial outage / auth bypass / 5xx > 5 min       | 2 hours     | 24 hours | 24 hours (written)  |
| P2  | Single-feature broken / missing security header / dep vuln (high)     | 24 hours    | 7 days   | If client-impacting |
| P3  | Cosmetic / dep vuln (low)                                             | 7 days      | 30 days  | No                  |

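The table can be turned into concrete deadlines once an incident is classified. The following is a minimal sketch, assuming the clocks start at detection time (the table does not say when they start) and that "If client-impacting" and "No" carry no fixed deadline. The names `SLA` and `deadlinesFor` are hypothetical and not part of any existing Aither code.

```typescript
// Hypothetical helper: names are illustrative, not part of Aither's codebase.
type Severity = "P0" | "P1" | "P2" | "P3";

interface SlaTargets {
  acknowledgeHours: number;
  mitigateHours: number;
  // null means no fixed customer-notification deadline at this severity.
  notifyCustomerHours: number | null;
}

// Values copied from the severity table (days converted to hours).
const SLA: Record<Severity, SlaTargets> = {
  P0: { acknowledgeHours: 1, mitigateHours: 4, notifyCustomerHours: 24 },
  P1: { acknowledgeHours: 2, mitigateHours: 24, notifyCustomerHours: 24 },
  P2: { acknowledgeHours: 24, mitigateHours: 7 * 24, notifyCustomerHours: null },
  P3: { acknowledgeHours: 7 * 24, mitigateHours: 30 * 24, notifyCustomerHours: null },
};

const HOUR_MS = 60 * 60 * 1000;

interface Deadlines {
  acknowledgeBy: Date;
  mitigateBy: Date;
  notifyCustomerBy: Date | null;
}

// Assumption: every clock starts when the incident is detected.
function deadlinesFor(severity: Severity, detectedAt: Date): Deadlines {
  const targets = SLA[severity];
  const at = (hours: number): Date =>
    new Date(detectedAt.getTime() + hours * HOUR_MS);
  return {
    acknowledgeBy: at(targets.acknowledgeHours),
    mitigateBy: at(targets.mitigateHours),
    notifyCustomerBy:
      targets.notifyCustomerHours === null ? null : at(targets.notifyCustomerHours),
  };
}
```

Calling `deadlinesFor("P0", detectedAt)` at the start of an incident gives the three timestamps to copy into the incident notes.
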
## P0 / P1 — first 60 minutes

1. **Contain** — pull the affected route via Vercel "Deploy Hooks" or push a kill-switch commit. Don't wait for full diagnosis.
2. **Preserve evidence** — snapshot Vercel logs (last 24h), Supabase audit log, GitHub events. Save to `incidents/<date>-<slug>/` in a private repo.
3. **Triage** — write down what was exposed, who was affected, when it started, when it was detected, and when it was contained.
4. **Notify Max** — if Max is not already the one responding, reach him via WhatsApp at +31 6 4502 9680. If Max is unreachable for more than 2 hours on a P0, contact insurance (Hiscox cyber: TBD).
5. **External help on P0** — designated incident-response counsel: TBD (engage NCSC or NCC Group on retainer when revenue warrants).

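Steps 2 and 3 above can be captured in a small record so nothing is forgotten under pressure. This is a sketch under assumptions: the field names, the `YYYY-MM-DD` date format, and the slug rules are illustrative choices, since the runbook only specifies the `incidents/<date>-<slug>/` pattern.

```typescript
// Illustrative only: field and function names are assumptions, not an existing module.
interface TriageRecord {
  whatWasExposed: string;
  whoWasAffected: string;
  startedAt: Date | null; // null while still unknown
  detectedAt: Date;
  containedAt: Date | null; // null until containment is confirmed
}

// Builds the `incidents/<date>-<slug>/` folder name from step 2.
// Assumption: <date> is the UTC detection date in YYYY-MM-DD form.
function evidenceDir(detectedAt: Date, title: string): string {
  const date = detectedAt.toISOString().slice(0, 10);
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `incidents/${date}-${slug}/`;
}

// Lists the triage questions from step 3 that are still unanswered.
function openQuestions(record: TriageRecord): string[] {
  const open: string[] = [];
  if (record.whatWasExposed.trim() === "") open.push("what was exposed");
  if (record.whoWasAffected.trim() === "") open.push("who was affected");
  if (record.startedAt === null) open.push("when it started");
  if (record.containedAt === null) open.push("when it was contained");
  return open;
}
```

The output of `openQuestions` maps directly onto the "What we don't know yet" section of the customer notification.
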
## Customer notification template

> Subject: Security incident notification — [DATE]
>
> Hi [client name],
>
> On [DATE] at [TIME CET] we detected [BRIEF DESCRIPTION OF INCIDENT] affecting [SCOPE]. Your data may have been [exposed / accessed / lost / unaffected].
>
> What we know now:
> - [Bullet 1: when it started, ended]
> - [Bullet 2: what data was involved]
> - [Bullet 3: what we did to contain it]
>
> What we don't know yet:
> - [Bullet]
>
> What you should do now:
> - [Action 1: e.g. rotate any shared credentials]
> - [Action 2: e.g. monitor [account] for unusual activity]
>
> We'll update you within 24 hours with a full written report and remediation timeline. Reply to this email or call +31 6 4502 9680 if you have questions.
>
> — Max Geurtsen, Aither

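Before a notification goes out, it helps to check that no bracketed placeholder from the template is left unfilled. The function below is a hypothetical pre-send check, not an existing tool. It flags every `[...]` span, so it will also flag legitimate bracketed text and its output should be reviewed by a person.

```typescript
// Illustrative pre-send check; the bracket convention comes from the template above.
function unfilledPlaceholders(draft: string): string[] {
  // Matches innermost [ ... ] spans that stay on a single line.
  const matches = draft.match(/\[[^\[\]\n]+\]/g);
  // Deduplicate while keeping first-seen order.
  return matches === null ? [] : Array.from(new Set(matches));
}
```

An empty result means no bracketed text remains in the draft.
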
## Detection channels

| Source                      | Coverage                              | Threshold for action                 |
|-----------------------------|---------------------------------------|--------------------------------------|
| Vercel function logs        | All serverless errors                 | >10 errors/hour or any 5xx burst     |
| Supabase audit log          | Auth + database events                | Failed admin login, mass data export |
| Better Stack uptime monitor | 5-min health check on `/`             | 2 consecutive failures               |
| `/api/csp-report`           | CSP violations                        | Any new external host                |
| Customer reports            | `security@aithergrowth.com` / chatbot | Acknowledge within 2 business hours  |
| GitHub secret scanning      | Push-time + scheduled                 | Any match                            |
| Dependabot alerts           | Weekly + on advisory                  | Critical/High immediately            |

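Two of the thresholds in the table are numeric and can be checked mechanically. The sketch below assumes error timestamps and health-check results have already been collected from the monitoring tools. The function names and input shapes are hypothetical, and no Vercel or Better Stack API is called.

```typescript
// Sketch only: function names and input shapes are assumptions for illustration.
const ONE_HOUR_MS = 60 * 60 * 1000;

// True when more than `limit` errors fall within the hour ending at `now`
// (the ">10 errors/hour" row).
function errorRateExceeded(errorTimestamps: Date[], now: Date, limit = 10): boolean {
  const windowStart = now.getTime() - ONE_HOUR_MS;
  const recent = errorTimestamps.filter(
    (t) => t.getTime() > windowStart && t.getTime() <= now.getTime(),
  );
  return recent.length > limit;
}

// True when the most recent `needed` health checks all failed
// (the "2 consecutive failures" row). `results` is oldest first, true = healthy.
function consecutiveFailures(results: boolean[], needed = 2): boolean {
  if (results.length < needed) return false;
  return results.slice(-needed).every((ok) => !ok);
}
```

Either function returning `true` is a reason to open an incident and classify it against the severity table.
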
## Service-specific recovery

### Website / chatbot down
1. Check Vercel dashboard → Deployments → most recent. Roll back if a recent deploy caused it.
2. If Vercel itself is down (status.vercel.com), wait and post a status page update.

### LLM endpoint failing
1. Check Anthropic status page (status.anthropic.com).
2. The chatbot has a `botReply()` local fallback — verify it's serving.
3. If the outage lasts more than 1 hour, set the `LLM_PROVIDER` env var to `openai` in Vercel as a backup.

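The recovery steps above imply a provider switch plus a local fallback. The following sketch shows one way that could fit together. Only `LLM_PROVIDER` and `botReply()` come from this runbook. `callProvider` is a hypothetical stand-in for the real API clients, and treating Anthropic as the default provider is an assumption.

```typescript
// Sketch under assumptions: `callProvider` is hypothetical; no real SDK is used here.
type Provider = "anthropic" | "openai";

// Assumption: any value other than "openai" means the default provider.
function resolveProvider(env: Record<string, string | undefined>): Provider {
  return env.LLM_PROVIDER === "openai" ? "openai" : "anthropic";
}

async function replyWithFallback(
  message: string,
  env: Record<string, string | undefined>,
  callProvider: (provider: Provider, message: string) => Promise<string>,
  botReply: (message: string) => string,
): Promise<string> {
  try {
    return await callProvider(resolveProvider(env), message);
  } catch {
    // Provider call failed: serve the local fallback instead of an error.
    return botReply(message);
  }
}
```

Passing `env`, `callProvider`, and `botReply` as parameters keeps the sketch testable without network access.
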
### Voice (ElevenLabs) failing
1. Check elevenlabs.io status.
2. Browser SpeechSynthesis fallback runs automatically (already implemented).
3. If the credit balance drops below 5%, have `/api/tts` return a 503 so the chat falls back gracefully (TODO: implement quota guard).

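The quota guard in step 3 is still a TODO. One possible shape is sketched below, assuming the used and total credit figures are fetched elsewhere. `QuotaSnapshot`, `GuardResult`, and `ttsQuotaGuard` are hypothetical names, and the sketch does not call the ElevenLabs API.

```typescript
// Hypothetical quota guard; how the snapshot is obtained is out of scope.
interface QuotaSnapshot {
  used: number; // credits consumed in the current period
  limit: number; // total credits available in the current period
}

interface GuardResult {
  status: 200 | 503;
  reason?: string;
}

const MIN_REMAINING_FRACTION = 0.05; // the 5% threshold from step 3

function ttsQuotaGuard(quota: QuotaSnapshot): GuardResult {
  // Fail closed when the limit is missing or invalid.
  if (quota.limit <= 0) {
    return { status: 503, reason: "no quota configured" };
  }
  const remaining = (quota.limit - quota.used) / quota.limit;
  if (remaining < MIN_REMAINING_FRACTION) {
    return { status: 503, reason: "tts quota low" };
  }
  return { status: 200 };
}
```

The `/api/tts` handler would consult the guard first and return the 503 early, which lets the browser SpeechSynthesis fallback take over.
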
### Supabase down
1. User auth and course progress are impacted and degrade to read-only.
2. Static pages still serve.
3. If the outage lasts more than 4 hours, post a banner: "course progress paused; nothing lost".

### Lead-routing email fails
1. If Brevo is down, the mailto fallback in chat (v82 form) is already in place and kicks in.
2. Escalate manually: email the lead details directly to max@aithergrowth.com.

## Lessons learned

After every P0/P1, within 30 days:
1. Write blameless post-mortem (`incidents/<date>/POST-MORTEM.md`)
2. Add the failure mode to this runbook
3. Add a regression test or monitor
4. Share sanitised summary with affected customers
