v1.3.35.3 -- Demo: wire crowdsec-init sidecar (machine credentials fix)

A bugfix on top of v1.3.35.2. The demo's init.sh invoked docker compose up -d --no-deps argos crowdsec caddy, which explicitly excluded the crowdsec-init sidecar. That sidecar runs cscli machines add argos-panel and writes credentials to a shared volume the panel imports on first boot; without it, every panel-to-LAPI call returned 403 (country reconciler ticks, threats UI, AppSec metrics endpoint, system health recent_errors).

The crowdsec-init service was always defined in the demo's docker-compose.yml (mirrored from prod via the demo's compose override pattern). The bug was entirely in the script that brings the stack up.

argosVersion and frontend/package.json bumped from 1.3.35.2 to 1.3.35.3. Image rebuild required.

Symptoms (pre-fix)

  • /api/system/health returned recent_errors: null (the reconciler couldn't enumerate decisions to populate the field).
  • /api/threats/decisions returned LAPI 403: access forbidden.
  • /api/threats/appsec/metrics returned unavailable, requires machine credentials.
  • Panel logs showed lapi 403: {"message":"access forbidden"} on every country reconciler tick (one tick every 5 minutes, logged per country; at ~8 lines/tick and 12 ticks/hour, that is ~96 lines/hour of error noise).
  • cscli machines list inside argos-demo-crowdsec showed ONLY the bouncer-internal localhost machine; no argos-panel registration.

Root cause

scripts/demo/init.sh line 78 (pre-fix):

( cd "${DEMO_DIR}" && docker compose up -d --no-deps argos crowdsec caddy )

The --no-deps flag explicitly disables the depends_on chain. The base compose has the chain set up correctly:

argos:
  depends_on:
    crowdsec-init:
      condition: service_completed_successfully
crowdsec-init:
  depends_on:
    crowdsec:
      condition: service_healthy

If compose is left to drive ordering, the sidecar runs and exits before the panel even starts. The --no-deps flag short-circuited that.

The original intent of --no-deps argos crowdsec caddy was likely "be explicit about what we want running" — a defensive posture. It backfired because the explicit list omitted the init service.
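
One way to catch this class of omission is to compare an explicit service list against what the compose project actually defines. The sketch below is illustrative only and is not part of init.sh: missing_services is a hypothetical helper, and the usage line assumes docker compose config --services is run from the demo directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in init.sh): print every service the compose
# project defines that an explicit `docker compose up` list leaves out.
#   $1 = whitespace-separated services defined by the project
#   $2 = whitespace-separated services passed to `docker compose up`
missing_services() {
  local svc requested
  # Normalise the requested list to " a b c " so whole names can be matched.
  requested=" $(printf '%s ' $2)"
  for svc in $1; do
    case "$requested" in
      *" $svc "*) ;;                  # explicitly requested, nothing to report
      *) printf '%s\n' "$svc" ;;      # defined but omitted
    esac
  done
}

# Usage (assumes it is run from the demo directory):
#   missing_services "$(docker compose config --services)" "argos crowdsec caddy"
```

Any name this prints is a service that --no-deps would have left stopped, even if another service depends on it.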

Fix

scripts/demo/init.sh now uses:

( cd "${DEMO_DIR}" && docker compose up -d )

Plain docker compose up -d brings up every service the override declares, in the right order, per depends_on.

Three new wait/verify steps in init.sh

  1. Bumped the panel-healthcheck timeout from 60s to 120s. The crowdsec-init step takes 10-30s on a cold start (first-time hub update inside the sidecar's cscli invocation), so the panel's Starting window is longer than before.
  2. crowdsec-init exit-code check — the script now warns loudly if the sidecar exited non-zero, and dumps the sidecar's last 10 log lines. exit 0 = credentials written; anything else = the panel will see 403s post-boot and the operator should know.
  3. Wait-for-argos-panel-machine-registration loop — polls cscli machines list for the argos-panel row. The credential-import inside the panel happens on the next reconcile tick (a few seconds after panel boot), so this loop confirms the import landed before init.sh moves on to the seed step.
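
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the code in init.sh: the function names, the 60-second timeout, and the 2-second poll interval are assumptions. The container names are the ones shown in the live evidence later in these notes.

```shell
#!/usr/bin/env bash
# Sketch of steps 2 and 3; helper names are hypothetical, not taken from init.sh.

# Succeeds if any whitespace-separated field on stdin equals the machine name.
machine_registered() {
  awk -v name="$1" '{ for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
                    END { exit !found }'
}

# Retry a command until it succeeds or the timeout (seconds) elapses.
#   wait_for TIMEOUT INTERVAL COMMAND [ARGS...]
wait_for() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    waited=$((waited + interval))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Step 2: warn loudly if the sidecar exited non-zero and show its last log lines.
check_init_exit() {
  local code
  code="$(docker inspect argos-demo-crowdsec-init --format '{{.State.ExitCode}}')"
  if [ "$code" != "0" ]; then
    echo "WARNING: crowdsec-init exited ${code}; expect LAPI 403s in the panel" >&2
    docker logs --tail 10 argos-demo-crowdsec-init >&2
  fi
}

# Step 3: a single poll of the LAPI machine list for the argos-panel row.
panel_machine_present() {
  docker exec argos-demo-crowdsec cscli machines list | machine_registered argos-panel
}

# Usage (timeout and interval are assumed values):
#   check_init_exit
#   wait_for 60 2 panel_machine_present || echo "argos-panel not registered" >&2
```

Matching on a whole field rather than a substring keeps a similarly named machine from satisfying the check by accident.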

Smoke phase 3c — panel-LAPI integration

scripts/smoke/demo-environment.sh gains a new phase between the existing 3b and 4. Three assertions:

3c-i:  cscli machines list contains 'argos-panel'
3c-ii: zero 'lapi 403' lines in panel logs (last 30s window)
3c-iii: credentials sentinel /data/shared/crowdsec-machine-credentials.yaml
        is absent (consumed by panel import)

All three must PASS for the smoke to proceed; any failure points at a specific stage of the credentials chain (3c-i = init sidecar didn't run; 3c-ii = panel hasn't imported yet or import failed; 3c-iii = import didn't run).
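
A rough sketch of how the three assertions and their failure hints might be wired together is shown below. It is an assumption-laden illustration rather than the contents of demo-environment.sh: count_lapi_403, stage_hint, fail, and phase_3c are hypothetical names, and the real script may structure its checks differently.

```shell
#!/usr/bin/env bash
# Sketch of phase 3c; function names are hypothetical, not from the smoke script.

# Count 'lapi 403' lines on stdin. grep -c prints 0 but exits 1 when nothing
# matches, so the exit status is masked.
count_lapi_403() {
  grep -c 'lapi 403' || true
}

# Map a failed assertion to the stage of the credentials chain it implicates.
stage_hint() {
  case "$1" in
    3c-i)   echo "init sidecar didn't run" ;;
    3c-ii)  echo "panel hasn't imported yet or import failed" ;;
    3c-iii) echo "import didn't run" ;;
    *)      echo "unknown assertion" ;;
  esac
}

fail() { echo "FAIL $1: $(stage_hint "$1")" >&2; }

phase_3c() {
  # 3c-i: the argos-panel machine is registered with LAPI.
  docker exec argos-demo-crowdsec cscli machines list | grep -q 'argos-panel' \
    || { fail 3c-i; return 1; }
  # 3c-ii: no 'lapi 403' lines in the last 30 seconds of panel logs.
  [ "$(docker logs --since 30s argos-demo-panel 2>&1 | count_lapi_403)" -eq 0 ] \
    || { fail 3c-ii; return 1; }
  # 3c-iii: the credentials sentinel has been consumed by the panel import.
  docker exec argos-demo-panel sh -c \
    'test ! -f /data/shared/crowdsec-machine-credentials.yaml' \
    || { fail 3c-iii; return 1; }
  echo "PASS 3c"
}
```

Stopping at the first failed assertion keeps the reported hint pointed at the earliest broken stage of the chain.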

Live evidence (post-fix)

$ docker exec argos-demo-crowdsec cscli machines list
 localhost    127.0.0.1   2026-04-28T18:56:05Z  ✔️ ...
 argos-panel  172.20.0.4  2026-04-28T18:55:18Z  ✔️ ...

$ docker inspect argos-demo-crowdsec-init --format \
    '{{.State.Status}}: ExitCode={{.State.ExitCode}}'
exited: ExitCode=0

$ docker logs argos-demo-panel | grep -i credential
... INFO crowdsec: machine credentials imported from init sidecar
    user=argos-panel
    path=/data/shared/crowdsec-machine-credentials.yaml
... INFO crowdsec: client wired url=http://crowdsec:8081
    machine_write=true

$ docker exec argos-demo-panel sh -c \
    'test -f /data/shared/crowdsec-machine-credentials.yaml \
     && echo present || echo absent'
absent

$ docker logs argos-demo-panel --since 60s | grep -c 'lapi 403'
0

Files changed

  • scripts/demo/init.sh: drop --no-deps argos crowdsec caddy; bump panel healthcheck timeout 60s → 120s; add crowdsec-init exit-code check + machine-registered wait loop.
  • scripts/smoke/demo-environment.sh: new phase 3c (panel-LAPI integration: machines list + log scan + sentinel-consumed checks).
  • backend/cmd/argos/main.go: argosVersion 1.3.35.2 → 1.3.35.3.
  • frontend/package.json: version 1.3.35.2 → 1.3.35.3.
  • scripts/demo/docker-compose.override.yml: image pin argos-prod-argos:1.3.35.3.
  • docs/release-notes/v1.3.35.3.md (this file)
  • CHANGELOG.md, mkdocs.yml

Smoke gate

scripts/smoke/demo-environment.sh --yes passes end-to-end with the new phase 3c green. Self-executed against the live host before tagging v1.3.35.3.

Upgrade

cd ~/argos-edge
git pull
make sync-prod && make build-prod-image

scripts/demo/init.sh
# panel ready at http://localhost:9181  login: demo / demo1234

# Verify the panel-LAPI integration:
docker exec argos-demo-crowdsec cscli machines list
# expected: a row with 'argos-panel' in addition to 'localhost'

# After login, /system Health card should show 'recent_errors'
# as an empty array (or a populated array of real errors), NOT
# null. Threats tab should render decisions without 403.
# AppSec metrics tab should render counters.

If you have a v1.3.35 or v1.3.35.2 demo stack still up:

scripts/demo/teardown.sh --purge
scripts/demo/init.sh

The volume reset is required because the panel's settings DB in the existing demo will have leftover state from the broken init; cleanest path is a full reset.

What this enables

The screenshot capture session can now show fully-functional panels: every surface that depends on panel-to-LAPI calls (threats, AppSec metrics, country reconciler state, system health) will render correctly. v1.3.35.2's per-surface density expansion only paid off for surfaces that read from the panel DB directly; v1.3.35.3 closes the LAPI-integrated surfaces too.