v1.3.7: Target health badges in the panel
Closes the v1.3.6 Bug 5 deferral. The operator can now see per-target health state directly on the Target group detail page instead of filtering the Logs browser for caddy_error entries.
What changed
New: Health column in the Target group page
Each row in the targets table now renders a TargetHealthBadge next to the Enabled column:
| Badge | When |
|---|---|
| healthy (green + checkmark) | Target is in Caddy's upstream pool, real traffic has succeeded (num_requests > 0), and no failed active health probe logged in the last 90 s. The hint shows the request counter (e.g. 1234 reqs). |
| unhealthy (red + X) | Caddy's active health checker logged a failure for this target in the last 90 s. The hint shows either the rejected HTTP status code (302, 401, 500, ...) or a truncated network-error string (dial tcp: connection refused, i/o timeout). |
| unknown (grey + ?) | No data. Target was just added and Caddy has not probed it yet, the target group is disabled, or the panel cannot reach Caddy's admin API. |
Hovering the badge reveals a multi-line tooltip with the full last-checked timestamp, unabbreviated error string, and lifetime counters (Requests: N / Fails: M).
New: GET /api/targets/health
The endpoint returns every target across every group in a single response:
    {
      "targets": [
        {
          "target_id": 42,
          "target_group_id": 3,
          "host": "192.0.2.156",
          "port": 8000,
          "enabled": true,
          "status": "unhealthy",
          "last_status_code": 302,
          "last_error": "unexpected status code 302",
          "last_checked_at": "2026-04-24T19:45:22Z",
          "num_requests": 1234,
          "num_fails": 12
        }
      ],
      "fetched_at": "2026-04-24T19:45:45Z"
    }
The response is cached in memory for 30 s. The reconciler path invalidates the cache after every mutation, so a freshly added or deleted target appears (or disappears) on the next poll without waiting for the TTL to lapse.
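The caching behaviour described here can be sketched as a small TTL cache with an explicit invalidation hook. This is a minimal illustration, not the shipped code: `healthCache`, `Get`, and `countFetches` are hypothetical names, and only the 30 s TTL and the "Invalidate() zeroes the timestamp" behaviour come from these notes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthCache is a hypothetical stand-in for the TargetHealthCache named in
// the Tests section; field and method names are assumptions.
type healthCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	body      []byte
	fetchedAt time.Time // zero value means "nothing cached"
}

// Get returns the cached body while it is younger than the TTL; otherwise it
// calls fetch, stores the result, and returns it.
func (c *healthCache) Get(now time.Time, fetch func() ([]byte, error)) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.fetchedAt.IsZero() && now.Sub(c.fetchedAt) < c.ttl {
		return c.body, nil
	}
	body, err := fetch()
	if err != nil {
		return nil, err
	}
	c.body, c.fetchedAt = body, now
	return body, nil
}

// Invalidate zeroes the timestamp so the next Get refetches.
func (c *healthCache) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.fetchedAt = time.Time{}
}

// countFetches replays a short poll sequence and reports how often the
// expensive fetch actually ran.
func countFetches() int {
	calls := 0
	fetch := func() ([]byte, error) {
		calls++
		return []byte(`{"targets":[]}`), nil
	}
	c := &healthCache{ttl: 30 * time.Second}
	t0 := time.Now()
	c.Get(t0, fetch)                     // cold cache: fetches
	c.Get(t0.Add(10*time.Second), fetch) // inside the TTL: served from memory
	c.Invalidate()                       // e.g. the reconciler just applied a mutation
	c.Get(t0.Add(11*time.Second), fetch) // still inside the TTL, but refetches
	return calls
}

func main() {
	fmt.Println("fetches:", countFetches()) // prints "fetches: 2"
}
```

Passing `now` in explicitly keeps the TTL logic testable without sleeping; holding the mutex across `fetch` is a simplification that also prevents concurrent polls from stampeding the upstream sources.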
Data source: why hybrid
The filing asked us to investigate Option A (Caddy admin API) first and to fall back to Option B (log parsing) only if Option A alone did not give enough detail.
Option A (Caddy admin): GET /reverse_proxy/upstreams works and returns [{"address":"host:port","num_requests":N,"fails":M}] for every upstream Caddy knows about. But it does not expose last_status_code, last_error, or last_checked_at — these are specific to the health checker's internal state and are only observable via its log output.
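As a rough illustration of what Option A yields, the sketch below decodes the array shape quoted above and indexes it by address. The `upstream` type and `decodeUpstreams` helper are hypothetical names; the HTTP call to the admin API is omitted, and only the three JSON keys come from the response shown in the text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// upstream mirrors one element of the /reverse_proxy/upstreams array.
type upstream struct {
	Address     string `json:"address"`
	NumRequests int    `json:"num_requests"`
	Fails       int    `json:"fails"`
}

// decodeUpstreams parses the admin API body and indexes it by "host:port".
// A JSON null body yields an empty map rather than an error.
func decodeUpstreams(body []byte) (map[string]upstream, error) {
	var list []upstream
	if err := json.Unmarshal(body, &list); err != nil {
		return nil, err
	}
	byAddr := make(map[string]upstream, len(list))
	for _, u := range list {
		byAddr[u.Address] = u
	}
	return byAddr, nil
}

func main() {
	body := []byte(`[{"address":"192.0.2.156:8000","num_requests":0,"fails":12}]`)
	ups, err := decodeUpstreams(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	u, known := ups["192.0.2.156:8000"]
	fmt.Println(known, u.NumRequests, u.Fails) // prints "true 0 12"
}
```

Nothing in this structure carries a status code, error string, or probe timestamp, which is the gap the log scan fills.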
Decision: hybrid Option A + Option B.
- Option A is the authoritative list of upstreams plus live counters.
- Option B is a 90-second-window scan of the ingested caddy_error log rows whose raw JSON carries "logger":"http.handlers.reverse_proxy.health_checker.active". The parse pulls the three fields Option A is missing.
The log tail was already in-process via the v1.0 logs/ingestor package — no new file tail or goroutine is added by this release.
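The Option B parse might look roughly like the sketch below. Only the logger name is taken from these notes; the JSON keys `ts`, `msg`, `host`, `status_code`, and `error`, the sample lines, and the names `parseLine` and `healthEvent` are assumptions made for illustration and should be checked against real Caddy output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"time"
)

const activeLogger = "http.handlers.reverse_proxy.health_checker.active"

// healthEvent holds the three fields the admin API does not expose, keyed by
// upstream address.
type healthEvent struct {
	Address    string
	StatusCode int
	Error      string
	CheckedAt  time.Time
}

// parseLine extracts a failure event from one raw log row. It returns false
// for malformed JSON, for rows from other loggers, and for rows without a host.
func parseLine(raw []byte) (healthEvent, bool) {
	var line struct {
		Logger     string  `json:"logger"`
		TS         float64 `json:"ts"` // assumed: Unix seconds with a fraction
		Msg        string  `json:"msg"`
		Host       string  `json:"host"`
		StatusCode int     `json:"status_code"`
		Error      string  `json:"error"`
	}
	if err := json.Unmarshal(raw, &line); err != nil {
		return healthEvent{}, false
	}
	if line.Logger != activeLogger || line.Host == "" {
		return healthEvent{}, false
	}
	sec, frac := math.Modf(line.TS)
	ev := healthEvent{
		Address:    line.Host,
		StatusCode: line.StatusCode,
		Error:      line.Error,
		CheckedAt:  time.Unix(int64(sec), int64(frac*1e9)).UTC(),
	}
	if ev.Error == "" && ev.StatusCode != 0 {
		// Status rejections carry no error key in this sketch; build one.
		ev.Error = fmt.Sprintf("%s %d", line.Msg, line.StatusCode)
	}
	return ev, true
}

func main() {
	// Hypothetical sample rows, not captured from a real deployment.
	rows := []string{
		`{"logger":"http.handlers.reverse_proxy.health_checker.active","ts":1700000000.5,"msg":"unexpected status code","status_code":302,"host":"192.0.2.156:8000"}`,
		`{"logger":"http.log.access","ts":1700000001,"msg":"handled request"}`,
	}
	for _, r := range rows {
		ev, ok := parseLine([]byte(r))
		fmt.Println(ok, ev.Address, ev.StatusCode, ev.Error)
	}
}
```

Returning a boolean instead of an error keeps the scan loop simple: rows that do not belong to the active health checker are skipped rather than treated as failures.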
Status derivation rules
The classifyTarget function in api/target_health.go reduces the three raw signals to one verdict:
| Conditions | Status |
|---|---|
| Target not in Caddy's upstreams list | unknown |
| Recent (≤90 s) health_checker failure log for this address | unhealthy (with last_status_code / last_error / last_checked_at) |
| In upstreams, no recent failure log | healthy |
Caddy's active health checker only emits a log line when a probe fails; successful probes are silent. So "in upstreams + no failure log in the window" is the healthy signal -- it covers both "probes are passing" and "no active probe is configured, passive checks have not tripped either".
num_requests from /reverse_proxy/upstreams turns out to be currently-in-flight, not cumulative, so it's surfaced as an info field in the tooltip but doesn't gate the verdict. This was a correction made mid-smoke-test (see REPORT) after observing that counters stayed at 0 across 5 deliberate requests -- the field semantics aren't documented by Caddy but the source confirms it tracks activeRequests.
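The rule table in this section reduces to a short first-match-wins function. The sketch below is an illustration under assumptions, not the body of classifyTarget: the `classify` signature, the `probeFailure` type, and the `verdict` constants are hypothetical, while the 90 s window and the rule order come from the table. The request counter is deliberately absent, since it does not gate the verdict.

```go
package main

import (
	"fmt"
	"time"
)

type verdict string

const (
	healthy   verdict = "healthy"
	unhealthy verdict = "unhealthy"
	unknown   verdict = "unknown"
)

// failureWindow is the 90 s look-back used for health checker failures.
const failureWindow = 90 * time.Second

// probeFailure is the most recent failure parsed from the logs for one
// address; nil means no failure row was found.
type probeFailure struct {
	StatusCode int
	Error      string
	CheckedAt  time.Time
}

// classify applies the rules in table order.
func classify(inUpstreams bool, last *probeFailure, now time.Time) verdict {
	if !inUpstreams {
		return unknown // Caddy does not know this target
	}
	if last != nil && now.Sub(last.CheckedAt) <= failureWindow {
		return unhealthy // a probe failed inside the window
	}
	return healthy // in the pool and silent, because successful probes log nothing
}

func main() {
	now := time.Date(2026, 4, 24, 19, 45, 45, 0, time.UTC)
	fail := &probeFailure{
		StatusCode: 302,
		Error:      "unexpected status code 302",
		CheckedAt:  now.Add(-23 * time.Second),
	}
	fmt.Println(classify(false, nil, now))                    // unknown
	fmt.Println(classify(true, fail, now))                    // unhealthy
	fmt.Println(classify(true, fail, now.Add(2*time.Minute))) // healthy: failure aged out
	fmt.Println(classify(true, nil, now))                     // healthy
}
```

One consequence worth noting: a target recovers to healthy simply by the last failure ageing past the window, without any positive success signal.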
Frontend
- New TargetHealthBadge component (frontend/src/components/TargetHealthBadge.tsx).
- TargetGroupDetail page adds a health state map, a refreshHealth() callback, and a 30 s polling interval that clears on unmount.
- Transient fetch failures do NOT clear the map to unknown: we keep the last good snapshot so a panel blip or Caddy admin hiccup doesn't cascade into every row flickering.
Tests
Backend (17 tests pass):
- caddy.Client.Upstreams: decodes array, handles null pool, propagates non-200.
- parseHealthCheckerLine: unexpected status + network failure variants, rejects wrong logger, rejects malformed JSON.
- classifyTarget: healthy from traffic, unhealthy by status code, unhealthy by network error, unknown when not in Caddy, unknown when in Caddy but zero traffic.
- recentHealthCheckerEvents: keeps most recent per address, ignores entries older than the window, ignores rows from other loggers.
- TargetHealthCache: serves cached body within TTL, Invalidate() zeroes the timestamp.
Frontend: no unit test framework set up in this repo; visual verification done in prod during smoke tests (see REPORT).
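The three recentHealthCheckerEvents behaviours listed under Backend (latest per address, window cut-off, logger filter) can be pictured with the sketch below. It is an assumption-laden illustration: `logEvent` and `recentPerAddress` are hypothetical names, and the real function reads log_entries rows rather than an in-memory slice.

```go
package main

import (
	"fmt"
	"time"
)

const activeLogger = "http.handlers.reverse_proxy.health_checker.active"

// logEvent is a hypothetical, already-parsed log row.
type logEvent struct {
	Logger     string
	Address    string
	StatusCode int
	At         time.Time
}

// recentPerAddress keeps, for each address, the newest active health checker
// row that is no older than window at time now.
func recentPerAddress(rows []logEvent, now time.Time, window time.Duration) map[string]logEvent {
	latest := make(map[string]logEvent)
	for _, r := range rows {
		if r.Logger != activeLogger {
			continue // rows from other loggers are ignored
		}
		if now.Sub(r.At) > window {
			continue // older than the look-back window
		}
		if cur, seen := latest[r.Address]; !seen || r.At.After(cur.At) {
			latest[r.Address] = r
		}
	}
	return latest
}

func main() {
	now := time.Date(2026, 4, 24, 19, 45, 45, 0, time.UTC)
	rows := []logEvent{
		{Logger: activeLogger, Address: "192.0.2.156:8000", StatusCode: 500, At: now.Add(-40 * time.Second)},
		{Logger: activeLogger, Address: "192.0.2.156:8000", StatusCode: 302, At: now.Add(-23 * time.Second)},
		{Logger: activeLogger, Address: "192.0.2.157:8000", StatusCode: 401, At: now.Add(-5 * time.Minute)},
		{Logger: "http.log.access", Address: "192.0.2.158:8000", At: now.Add(-1 * time.Second)},
	}
	got := recentPerAddress(rows, now, 90*time.Second)
	fmt.Println(len(got), got["192.0.2.156:8000"].StatusCode) // prints "1 302"
}
```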
Docs
- docs/features/reverse-proxy.md: new "Health monitoring" section explaining the three badge states, polling cadence, and common causes of red badges.
- docs/operations/troubleshooting.md: two new entries, "Target group page shows unhealthy 302" (the health_check_expect_status default mismatch) and "Target stays unknown forever".
Not changed
- No DB migrations. The endpoint reads existing log_entries rows; the only schema requirement is the raw column populated by the v1.0 ingestor, which has been in place since the initial release.
- No changes to the v1.3.6 CrowdSec auto-bootstrap flow. Bugs 1-4 from the previous release stay untouched.
- No changes to target_groups or targets table schemas.
- The "edit expect-status directly from the badge" quick action flagged in the filing as "optional / maybe v1.3.8" is not in this release. The current tooltip and troubleshooting entry cover the diagnostic need; making the edit one-click is a separate UX pass.
Upgrade
Drop-in:
The endpoint is new, so no existing call site changes behaviour. On first page load after the upgrade every badge will flash unknown for a second while the 90-s log window fills in — this is expected and clears itself on the second poll.
Related
- v1.3.6 — the release that deferred this bug.
- Reverse proxy → Health monitoring — operator-facing reference.
- Troubleshooting → Target group page shows unhealthy 302 — when the default expect-status is wrong for the backend.