
v1.3.28 -- LAPI latency fix (SQLite WAL mode)

A focused single-bug release. It closes the latent slowness flagged in the v1.3.26 dogfood: CrowdSec's LAPI SQLite ran in rollback-journal mode (the SQLite default), so the community-blocklist sync (~15k rows, every ~2 hours via CAPI) held an exclusive writer lock and every concurrent reader stalled for 3-4 seconds. Enabling WAL mode lets readers proceed concurrently with the writer.

Why this is its own release

The fix is one config line in crowdsec/config.yaml.local. No Go or TypeScript code changed. v1.3.28 ships the config + a smoke script that asserts WAL is active so the next regression gets caught at deploy time rather than during the next CAPI sync window.

The per-host true_detect_mode feature originally drafted as v1.3.28 has been renumbered to v1.3.29; v1.3.28 was claimed by this LAPI fix. Planning doc lives at docs/planning/v1.3.29-true-detect-mode.md (same content; shifted one cycle).

Investigation summary (the v1.3.28 spike record)

PHASE 1 -- diagnose

Four suspects were ranked from the dogfood evidence:

  1. LAPI binding mismatch -- 127.0.0.1 vs 0.0.0.0 confusion from the v1.3.19+ setup-appsec.sh changes. Ruled out: the listening socket is 0.0.0.0:8081; local_api_credentials.yaml (cscli-internal credential) correctly references localhost.
  2. SQLite WAL contention -- the v1.3.22 BUG-3 family (panel-side bulk POST hit upstream LAPI WAL contention), this time on the LAPI's own DB. Smoking gun: CrowdSec's own log emits, on every boot:

     level=warning msg="sqlite is not using WAL mode, LAPI might become unresponsive when inserting the community blocklist"

The config confirms db_config.use_wal: false.

  3. Internal lock contention -- a symptom of suspect 2, not an independent cause. The slow /v1/decisions GETs were waiting on the writer.
  4. Container resource limits -- 184MiB used of 256MiB memory, 0.5 CPU; under pressure but not the root cause. Skipped.

PHASE 2 -- apply fix

Added db_config.use_wal: true to crowdsec/config.yaml.local. CrowdSec merges .yaml.local into the upstream config.yaml defaults, so a single key in the override is enough. Applied via docker compose restart crowdsec; CrowdSec issues PRAGMA journal_mode=WAL against the existing DB file at the next open. No data migration; ~3s downtime.
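Why no migration is needed can be seen on any scratch database (a minimal sketch using the sqlite3 CLI; nothing Argos-specific is assumed):

```shell
# WAL is a property of the database file, not of a connection: one PRAGMA
# flips the file header, and every later open of the same file sees WAL.
db=$(mktemp)
sqlite3 "$db" 'CREATE TABLE t(x);'
sqlite3 "$db" 'PRAGMA journal_mode;'       # prints: delete (rollback journal)
sqlite3 "$db" 'PRAGMA journal_mode=WAL;'   # prints: wal
sqlite3 "$db" 'PRAGMA journal_mode;'       # separate open, still prints: wal
rm -f "$db" "$db-wal" "$db-shm"
```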

PHASE 3 -- verify

Measurement                      Pre-fix                Post-fix
cscli alerts list (idle)         300-932ms              294-460ms
Concurrent reads under load      3.3-3.9s (see logs)    217-314ms
PRAGMA journal_mode              delete                 wal
Startup warning                  present                absent

The empirical oracle is the CrowdSec maintainer-authored warning. They added it because they know WAL mode prevents the unresponsiveness pattern. With WAL on, the warning is gone and the documented failure mode is structurally eliminated. The real validation is observing the next CAPI sync without slow GETs in the logs; that happens automatically every ~2 hours post-deploy.

What ships

crowdsec/config.yaml.local

api:
  server:
    listen_uri: 0.0.0.0:8081

db_config:
  use_wal: true

The added stanza is the entire fix. Existing api.server listener override is untouched.

scripts/smoke/lapi-wal.sh

3-step smoke against the live argos-prod-crowdsec container:

  1. PRAGMA journal_mode returns wal (the persistent indicator that WAL is active on the DB file).
  2. docker logs since the current container's StartedAt timestamp shows zero "sqlite is not using WAL mode" warnings. Scoped to current boot via docker inspect --format '{{.State.StartedAt}}' so a previous container's pre-fix log entries don't false-positive.
  3. The .db-wal sidecar exists when writes have happened post-restart. Best-effort gate: a freshly-restarted container with no writes yet still has WAL mode active (per #1) but no sidecar file yet, which is fine.
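A sketch of what such a script can look like (the container name and warning text come from these notes; the in-container DB path and the availability of sqlite3 inside the container are assumptions):

```shell
#!/bin/sh
# Sketch of scripts/smoke/lapi-wal.sh. DB path and in-container sqlite3
# binary are assumptions, not repo facts.
CONTAINER=argos-prod-crowdsec
DB=/var/lib/crowdsec/data/crowdsec.db
WARN='sqlite is not using WAL mode'

journal_mode_ok() { [ "$1" = "wal" ]; }     # step 1 predicate
warning_count()   { grep -c "$WARN"; }      # step 2: count warnings on stdin

run_smoke() {
    mode=$(docker exec "$CONTAINER" sqlite3 "$DB" 'PRAGMA journal_mode;')
    journal_mode_ok "$mode" || { echo "FAIL: journal_mode=$mode"; return 1; }

    # Scope the log scan to this boot so a pre-fix container's old log
    # entries can't false-positive.
    started=$(docker inspect --format '{{.State.StartedAt}}' "$CONTAINER")
    n=$(docker logs --since "$started" "$CONTAINER" 2>&1 | warning_count)
    [ "$n" -eq 0 ] || { echo "FAIL: $n WAL warning(s) since boot"; return 1; }

    # Step 3, best-effort: the -wal sidecar only appears after the first write.
    docker exec "$CONTAINER" test -f "${DB}-wal" \
        || echo "NOTE: no ${DB}-wal yet (no writes since restart)"
    echo "PASS"
}

if [ "${1:-}" = "--run" ]; then run_smoke; fi
```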

Version bump

backend/cmd/argos/main.go argosVersion and frontend/package.json version bumped to 1.3.28 even though neither file's surrounding code changed. The panel binary string is the stack release identifier -- operators read it in the panel footer.

Smoke gate (3/3 PASS)

  1. make sync-prod-dry clean (or expected v1.3.28-only diff) before deploy.
  2. make sync-prod --yes + docker compose restart crowdsec applies the new config; PRAGMA journal_mode flips to wal.
  3. scripts/smoke/lapi-wal.sh PASS: WAL active, no startup warning, .db-wal sidecar created on first write.

The under-load improvement (no slow GETs during CAPI sync) is observable from docker logs argos-prod-crowdsec --since 2h 2>&1 | grep -E "GET /v1/decisions.*HTTP/1.1 200 [0-9]+\.[0-9]+s" after the next sync window (note -E: the pattern uses extended-regex +). Pre-fix: 20+ slow GETs at 3-4s each. Post-fix: 0 or a single-digit count, all under 1s.
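For a hard pass/fail instead of eyeballing the grep output, a small filter can keep only the genuinely slow requests (that the duration, e.g. "3.4s", is the last whitespace-separated field of the log line is an assumption carried over from the grep pattern above):

```shell
# Keep only /v1/decisions GETs whose trailing duration is >= 1 second.
# Log-line shape is assumed: duration like "3.4s" as the last field.
slow_gets() {
    grep 'GET /v1/decisions' | awk '{ d=$NF; sub(/s$/, "", d); if (d+0 >= 1) print }'
}

# docker logs argos-prod-crowdsec --since 2h 2>&1 | slow_gets | wc -l   # want: 0
```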

Files changed

  • crowdsec/config.yaml.local (db_config.use_wal: true)
  • scripts/smoke/lapi-wal.sh (new)
  • backend/cmd/argos/main.go (argosVersion bump)
  • frontend/package.json (version bump)
  • docs/release-notes/v1.3.28.md (this file)
  • docs/planning/v1.3.29-true-detect-mode.md (renamed from v1.3.28-...; same content, shifted one cycle)
  • CHANGELOG.md, mkdocs.yml

Upgrade

cd ~/argos-edge
git pull
make sync-prod          # picks up config.yaml.local change
docker compose -f /path/to/argos-prod/docker-compose.yml \
    restart crowdsec    # applies WAL on the existing DB

The argos panel image needs a rebuild for the version-string bump:

cd ~/argos-prod
docker build -f backend/Dockerfile -t argos-prod-argos:v1.3.28 .
# update docker-compose.override.yml: image: argos-prod-argos:v1.3.28
docker compose up -d --force-recreate --no-deps argos

If you skip the panel rebuild, the LAPI fix still takes effect (only crowdsec needs to restart for WAL); the panel footer just shows v1.3.27 instead of v1.3.28.

Not changed

  • All v1.3.27 backend / frontend / migration code unchanged.
  • The drift detector + GET /api/security/drift untouched.
  • Caddyfile, CrowdSec scenarios, AppSec configs -- all v1.3.27.
  • Migration 031 still latest.

Six-strike pattern: still six

This release does NOT add a seventh case to the upstream-behaviour pattern memo. The fix was a CrowdSec config that CrowdSec itself documents in the YAML schema; the maintainer-authored startup warning told us exactly what to flip. No upstream-vs-docs divergence; no fork; no PR. A clean win.