v1.3.7 — Target health badges in the panel

Closes the v1.3.6 Bug 5 deferral. The operator can now see per-target health state directly on the Target group detail page instead of filtering the Logs browser for caddy_error entries.

What changed

New: Health column in the Target group page

Each row in the targets table now renders a TargetHealthBadge next to the Enabled column:

  • healthy (green + checkmark) — the target is in Caddy's upstream pool and no failed active health probe has been logged in the last 90 s. The hint shows the in-flight request counter (e.g. 1234 reqs).
  • unhealthy (red + X) — Caddy's active health checker logged a failure for this target in the last 90 s. The hint shows either the unexpected HTTP status code (302, 401, 500, ...) or a truncated network-error string (dial tcp: connection refused, i/o timeout).
  • unknown (grey + ?) — no data: the target was just added and Caddy has not probed it yet, the target group is disabled, or the panel cannot reach Caddy's admin API.

Hovering the badge reveals a multi-line tooltip with the full last-checked timestamp, unabbreviated error string, and lifetime counters (Requests: N / Fails: M).

New: GET /api/targets/health

The endpoint returns every target across every group in a single response:

{
  "targets": [
    {
      "target_id": 42,
      "target_group_id": 3,
      "host": "192.0.2.156",
      "port": 8000,
      "enabled": true,
      "status": "unhealthy",
      "last_status_code": 302,
      "last_error": "unexpected status code 302",
      "last_checked_at": "2026-04-24T19:45:22Z",
      "num_requests": 1234,
      "num_fails": 12
    }
  ],
  "fetched_at": "2026-04-24T19:45:45Z"
}

Responses are cached for 30 s in memory; the reconciler path invalidates the cache after every mutation, so a freshly added or deleted target appears (or disappears) on the next poll without waiting for the TTL to expire.

Data source — why hybrid

The original filing asked to investigate Option A (the Caddy admin API) first, falling back to Option B (log parsing) if Option A alone didn't give enough detail.

Option A (Caddy admin): GET /reverse_proxy/upstreams works and returns [{"address":"host:port","num_requests":N,"fails":M}] for every upstream Caddy knows about. But it does not expose last_status_code, last_error, or last_checked_at — these are specific to the health checker's internal state and are only observable via its log output.
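The Option A call can be sketched in Go as follows. The struct fields are inferred from the JSON shape quoted above; the helper names and the admin base URL are assumptions, not the actual caddy.Client API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// upstream mirrors one element of Caddy's /reverse_proxy/upstreams response.
type upstream struct {
	Address     string `json:"address"`      // "host:port"
	NumRequests int    `json:"num_requests"` // currently in-flight, not cumulative
	Fails       int    `json:"fails"`
}

// decodeUpstreams parses the admin API body; a JSON null pool decodes to a
// nil (empty) slice rather than an error.
func decodeUpstreams(body []byte) ([]upstream, error) {
	var pool []upstream
	if err := json.Unmarshal(body, &pool); err != nil {
		return nil, err
	}
	return pool, nil
}

// fetchUpstreams asks the Caddy admin API (commonly http://localhost:2019)
// for the live upstream pool and propagates any non-200 status.
func fetchUpstreams(adminBase string) ([]upstream, error) {
	resp, err := http.Get(adminBase + "/reverse_proxy/upstreams")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("caddy admin: status %d", resp.StatusCode)
	}
	var pool []upstream
	if err := json.NewDecoder(resp.Body).Decode(&pool); err != nil {
		return nil, err
	}
	return pool, nil
}

func main() {
	pool, _ := decodeUpstreams([]byte(`[{"address":"192.0.2.156:8000","num_requests":0,"fails":12}]`))
	fmt.Println(len(pool), pool[0].Address) // 1 192.0.2.156:8000
}
```

Splitting decoding from the HTTP call keeps the null-pool and non-200 cases trivially testable, which matches the test list below.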

Decision: hybrid Option A + Option B.

  • Option A is the authoritative list of upstreams plus live counters.
  • Option B is a 90-second-window scan of the ingested caddy_error log rows whose raw JSON carries "logger":"http.handlers.reverse_proxy.health_checker.active". The parse pulls the three fields Option A is missing.
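The Option B parse can be sketched like this. Only the logger value is taken from the text above; the JSON keys host, status_code, and error are assumptions about Caddy's zap output, and the function name differs from the real parseHealthCheckerLine:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// healthEvent is the subset of a health-checker log row the panel needs:
// exactly the three fields the admin API cannot provide.
type healthEvent struct {
	Address    string    // "host:port" of the probed target
	StatusCode int       // non-zero when the probe saw an unexpected HTTP status
	Err        string    // network-error string ("dial tcp: connection refused", ...)
	At         time.Time // when the probe failed
}

const activeLogger = "http.handlers.reverse_proxy.health_checker.active"

// parseHealthLine rejects rows from other loggers and malformed JSON, and
// converts zap's float seconds timestamp into a time.Time.
func parseHealthLine(raw []byte) (healthEvent, bool) {
	var row struct {
		Logger     string  `json:"logger"`
		TS         float64 `json:"ts"` // seconds since epoch, fractional
		Host       string  `json:"host"`
		StatusCode int     `json:"status_code"`
		Error      string  `json:"error"`
	}
	if err := json.Unmarshal(raw, &row); err != nil || row.Logger != activeLogger {
		return healthEvent{}, false
	}
	sec := int64(row.TS)
	nsec := int64((row.TS - float64(sec)) * 1e9)
	return healthEvent{
		Address:    row.Host,
		StatusCode: row.StatusCode,
		Err:        row.Error,
		At:         time.Unix(sec, nsec),
	}, true
}

func main() {
	raw := []byte(`{"logger":"` + activeLogger + `","ts":1745523922.5,"host":"192.0.2.156:8000","status_code":302}`)
	ev, ok := parseHealthLine(raw)
	fmt.Println(ok, ev.Address, ev.StatusCode) // true 192.0.2.156:8000 302
}
```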

The log tail was already in-process via the v1.0 logs/ingestor package — no new file tail or goroutine is added by this release.

Status derivation rules

The classifyTarget function in api/target_health.go reduces the three raw signals to one verdict:

  • Target not in Caddy's upstreams list → unknown
  • Recent (≤90 s) health_checker failure log for this address → unhealthy (with last_status_code / last_error / last_checked_at)
  • In upstreams, no recent failure log → healthy

Caddy's active health checker only emits a log line when a probe fails; successful probes are silent. So "in upstreams + no failure log in the window" is the healthy signal: it covers both "probes are passing" and "no active probe is configured and passive checks have not tripped either".

num_requests from /reverse_proxy/upstreams turns out to count requests currently in flight, not a cumulative total, so it is surfaced as an info field in the tooltip but does not gate the verdict. This was a correction made mid-smoke-test (see REPORT) after the counter stayed at 0 across 5 deliberate requests; Caddy does not document the field's semantics, but the source confirms it tracks activeRequests.

Frontend

  • New TargetHealthBadge component (frontend/src/components/TargetHealthBadge.tsx).
  • TargetGroupDetail page adds a health state map, a refreshHealth() callback, and a 30 s polling interval that clears on unmount.
  • Transient fetch failures do NOT clear the map to unknown — we keep the last good snapshot so a panel blip or Caddy admin hiccup doesn't cascade into every row flickering.

Tests

Backend (17 tests pass):

  • caddy.Client.Upstreams — decodes array, handles null pool, propagates non-200.
  • parseHealthCheckerLine — unexpected status + network failure variants, rejects wrong logger, rejects malformed JSON.
  • classifyTarget — healthy from traffic, unhealthy by status code, unhealthy by network error, unknown when not in Caddy, unknown when in Caddy but zero traffic.
  • recentHealthCheckerEvents — keeps most recent per address, ignores entries older than the window, ignores rows from other loggers.
  • TargetHealthCache — serves cached body within TTL, Invalidate() zeroes the timestamp.

Frontend: no unit-test framework is set up in this repo; visual verification was done in prod during the smoke tests (see REPORT).

Docs

  • docs/features/reverse-proxy.md — new "Health monitoring" section explaining the three badge states, polling cadence, and common causes of red badges.
  • docs/operations/troubleshooting.md — two new entries: "Target group page shows unhealthy 302" (the health_check_expect_status default mismatch) and "Target stays unknown forever".

Not changed

  • No DB migrations. The endpoint reads existing log_entries rows — the only schema requirement is the raw column populated by the v1.0 ingestor, which has been in place since the initial release.
  • No changes to the v1.3.6 CrowdSec auto-bootstrap flow. Bugs 1-4 from the previous release stay untouched.
  • No changes to target_groups or targets table schemas.
  • The "edit expect-status directly from the badge" quick action, flagged in the filing as optional / maybe v1.3.8, is not in this release. The current tooltip plus the new troubleshooting entry cover the diagnostic need; making the edit one-click is a separate UX pass.

Upgrade

Drop-in:

cd argos-edge
git pull
docker compose build
docker compose up -d

The endpoint is new, so no existing call site changes behaviour. On first page load after the upgrade, every badge briefly shows unknown while the 90 s log window fills in; this is expected and clears on the second poll.