v1.3.7 — Target health badges in the panel

Closes the v1.3.6 Bug 5 deferral. The operator can now see per-target health state directly on the Target group detail page instead of filtering the Logs browser for caddy_error entries.

What changed

New: Health column in the Target group page

Each row in the targets table now renders a TargetHealthBadge next to the Enabled column:

  • healthy (green + checkmark) — the target is in Caddy's upstream pool and no failed active health probe has been logged in the last 90 s. The hint shows the in-flight request counter (e.g. 1234 reqs).
  • unhealthy (red + X) — Caddy's active health checker logged a failure for this target in the last 90 s. The hint shows either the unexpected HTTP status code (302, 401, 500, ...) or a truncated network-error string (dial tcp: connection refused, i/o timeout).
  • unknown (grey + ?) — no data: the target was just added and Caddy has not probed it yet, the target group is disabled, or the panel cannot reach Caddy's admin API.

Hovering the badge reveals a multi-line tooltip with the full last-checked timestamp, unabbreviated error string, and lifetime counters (Requests: N / Fails: M).

New: GET /api/targets/health

The endpoint returns every target across every group in a single response:

{
  "targets": [
    {
      "target_id": 42,
      "target_group_id": 3,
      "host": "192.0.2.156",
      "port": 8000,
      "enabled": true,
      "status": "unhealthy",
      "last_status_code": 302,
      "last_error": "unexpected status code 302",
      "last_checked_at": "2026-04-24T19:45:22Z",
      "num_requests": 1234,
      "num_fails": 12
    }
  ],
  "fetched_at": "2026-04-24T19:45:45Z"
}

Responses are cached for 30 s in memory; the reconciler path invalidates the cache after every mutation, so a freshly added or deleted target appears (or disappears) on the next poll without waiting for the TTL to expire.

Data source — why hybrid

The original filing asked to investigate Option A (the Caddy admin API) first, falling back to Option B (log parsing) if Option A alone didn't give enough detail.

Option A (Caddy admin): GET /reverse_proxy/upstreams works and returns [{"address":"host:port","num_requests":N,"fails":M}] for every upstream Caddy knows about. But it does not expose last_status_code, last_error, or last_checked_at — these are specific to the health checker's internal state and are only observable via its log output.
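The Option A call can be sketched in Go as follows. The struct fields are inferred from the JSON shape quoted above; the helper names and the admin base URL are assumptions, not the actual caddy.Client API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// upstream mirrors one element of Caddy's /reverse_proxy/upstreams response.
type upstream struct {
	Address     string `json:"address"`      // "host:port"
	NumRequests int    `json:"num_requests"` // currently in-flight, not cumulative
	Fails       int    `json:"fails"`
}

// decodeUpstreams parses the admin API body; a JSON null pool decodes to a
// nil (empty) slice rather than an error.
func decodeUpstreams(body []byte) ([]upstream, error) {
	var pool []upstream
	if err := json.Unmarshal(body, &pool); err != nil {
		return nil, err
	}
	return pool, nil
}

// fetchUpstreams asks the Caddy admin API (commonly http://localhost:2019)
// for the live upstream pool and propagates any non-200 status.
func fetchUpstreams(adminBase string) ([]upstream, error) {
	resp, err := http.Get(adminBase + "/reverse_proxy/upstreams")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("caddy admin: status %d", resp.StatusCode)
	}
	var pool []upstream
	if err := json.NewDecoder(resp.Body).Decode(&pool); err != nil {
		return nil, err
	}
	return pool, nil
}

func main() {
	pool, _ := decodeUpstreams([]byte(`[{"address":"192.0.2.156:8000","num_requests":0,"fails":12}]`))
	fmt.Println(len(pool), pool[0].Address) // 1 192.0.2.156:8000
}
```

Splitting decoding from the HTTP call keeps the null-pool and non-200 cases trivially testable, which matches the test list below.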

Decision: hybrid Option A + Option B.

  • Option A is the authoritative list of upstreams plus live counters.
  • Option B is a 90-second-window scan of the ingested caddy_error log rows whose raw JSON carries "logger":"http.handlers.reverse_proxy.health_checker.active". The parse pulls the three fields Option A is missing.
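The Option B parse can be sketched like this. Only the logger value is taken from the text above; the JSON keys host, status_code, and error are assumptions about Caddy's zap output, and the function name differs from the real parseHealthCheckerLine:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// healthEvent is the subset of a health-checker log row the panel needs:
// exactly the three fields the admin API cannot provide.
type healthEvent struct {
	Address    string    // "host:port" of the probed target
	StatusCode int       // non-zero when the probe saw an unexpected HTTP status
	Err        string    // network-error string ("dial tcp: connection refused", ...)
	At         time.Time // when the probe failed
}

const activeLogger = "http.handlers.reverse_proxy.health_checker.active"

// parseHealthLine rejects rows from other loggers and malformed JSON, and
// converts zap's float seconds timestamp into a time.Time.
func parseHealthLine(raw []byte) (healthEvent, bool) {
	var row struct {
		Logger     string  `json:"logger"`
		TS         float64 `json:"ts"` // seconds since epoch, fractional
		Host       string  `json:"host"`
		StatusCode int     `json:"status_code"`
		Error      string  `json:"error"`
	}
	if err := json.Unmarshal(raw, &row); err != nil || row.Logger != activeLogger {
		return healthEvent{}, false
	}
	sec := int64(row.TS)
	nsec := int64((row.TS - float64(sec)) * 1e9)
	return healthEvent{
		Address:    row.Host,
		StatusCode: row.StatusCode,
		Err:        row.Error,
		At:         time.Unix(sec, nsec),
	}, true
}

func main() {
	raw := []byte(`{"logger":"` + activeLogger + `","ts":1745523922.5,"host":"192.0.2.156:8000","status_code":302}`)
	ev, ok := parseHealthLine(raw)
	fmt.Println(ok, ev.Address, ev.StatusCode) // true 192.0.2.156:8000 302
}
```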

The log tail was already in-process via the v1.0 logs/ingestor package — no new file tail or goroutine is added by this release.

Status derivation rules

The classifyTarget function in api/target_health.go reduces the three raw signals to one verdict:

  • Target not in Caddy's upstreams list → unknown
  • Recent (≤90 s) health_checker failure log for this address → unhealthy (with last_status_code / last_error / last_checked_at)
  • In upstreams, no recent failure log → healthy

Caddy's active health checker only emits a log line when a probe fails; successful probes are silent. So "in upstreams + no failure log in the window" is the healthy signal: it covers both "probes are passing" and "no active probe is configured and passive checks have not tripped either".

num_requests from /reverse_proxy/upstreams turns out to count requests currently in flight, not a cumulative total, so it is surfaced as an info field in the tooltip but does not gate the verdict. This was a correction made mid-smoke-test (see REPORT) after the counter stayed at 0 across 5 deliberate requests; Caddy does not document the field's semantics, but the source confirms it tracks activeRequests.

Frontend

  • New TargetHealthBadge component (frontend/src/components/TargetHealthBadge.tsx).
  • TargetGroupDetail page adds a health state map, a refreshHealth() callback, and a 30 s polling interval that clears on unmount.
  • Transient fetch failures do NOT clear the map to unknown — we keep the last good snapshot so a panel blip or Caddy admin hiccup doesn't cascade into every row flickering.

Tests

Backend (17 tests pass):

  • caddy.Client.Upstreams — decodes array, handles null pool, propagates non-200.
  • parseHealthCheckerLine — unexpected status + network failure variants, rejects wrong logger, rejects malformed JSON.
  • classifyTarget — healthy from traffic, unhealthy by status code, unhealthy by network error, unknown when not in Caddy, unknown when in Caddy but zero traffic.
  • recentHealthCheckerEvents — keeps most recent per address, ignores entries older than the window, ignores rows from other loggers.
  • TargetHealthCache — serves cached body within TTL, Invalidate() zeroes the timestamp.

Frontend: no unit-test framework is set up in this repo; visual verification was done in prod during the smoke tests (see REPORT).

Docs

  • docs/features/reverse-proxy.md — new "Health monitoring" section explaining the three badge states, polling cadence, and common causes of red badges.
  • docs/operations/troubleshooting.md — two new entries: "Target group page shows unhealthy 302" (the health_check_expect_status default mismatch) and "Target stays unknown forever".

Not changed

  • No DB migrations. The endpoint reads existing log_entries rows — the only schema requirement is the raw column populated by the v1.0 ingestor, which has been in place since the initial release.
  • No changes to the v1.3.6 CrowdSec auto-bootstrap flow. Bugs 1-4 from the previous release stay untouched.
  • No changes to target_groups or targets table schemas.
  • The "edit expect-status directly from the badge" quick action, flagged in the filing as optional / maybe v1.3.8, is not in this release. The current tooltip plus the new troubleshooting entry cover the diagnostic need; making the edit one-click is a separate UX pass.

Upgrade

Drop-in:

cd argos-edge
git pull
docker compose build
docker compose up -d

The endpoint is new, so no existing call site changes behaviour. On first page load after the upgrade, every badge briefly shows unknown while the 90 s log window fills in; this is expected and clears on the second poll.