v1.3.28 -- LAPI latency fix (SQLite WAL mode)¶
A focused single-bug release. Closes the latent slowness flagged in the v1.3.26 dogfood: CrowdSec's LAPI SQLite ran in rollback-journal mode (the SQLite default), so the community-blocklist sync (~15k rows, every ~2 hours via CAPI) held an exclusive writer lock and every concurrent reader stalled for 3-4 seconds. Enabling WAL mode lets readers proceed concurrently with writers.
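The mechanics can be sketched with the `sqlite3` CLI against a scratch file (illustrative path, not the LAPI's real DB). WAL is a persistent property of the database file: set once, it survives reopen by every later connection.

```shell
# Scratch DB for illustration only.
DB=/tmp/demo.db

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS t (id INTEGER);'
sqlite3 "$DB" 'PRAGMA journal_mode;'        # rollback-journal default: prints "delete"

# Flip the existing file to WAL; no data migration needed.
sqlite3 "$DB" 'PRAGMA journal_mode=WAL;'    # prints "wal"
sqlite3 "$DB" 'PRAGMA journal_mode;'        # still "wal" on a fresh connection
```

This is why the fix needs only a restart: SQLite applies the pragma to the existing DB file at the next open, exactly as Phase 2 below describes.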
Why this is its own release¶
The fix is one config line in crowdsec/config.yaml.local. No Go or TypeScript code changed. v1.3.28 ships the config + a smoke script that asserts WAL is active so the next regression gets caught at deploy time rather than during the next CAPI sync window.
The per-host true_detect_mode feature originally drafted as v1.3.28 has been renumbered to v1.3.29; v1.3.28 was claimed by this LAPI fix. Planning doc lives at docs/planning/v1.3.29-true-detect-mode.md (same content; shifted one cycle).
Investigation summary (the v1.3.28 spike record)¶
PHASE 1 -- diagnose¶
Four suspects were ranked from the dogfood evidence:
- LAPI binding mismatch -- 127.0.0.1 vs 0.0.0.0 confusion from the v1.3.19+ setup-appsec.sh changes. Ruled out: the listening socket is 0.0.0.0:8081, and `local_api_credentials.yaml` (the cscli-internal credential) correctly references localhost.
- SQLite WAL contention -- the v1.3.22 BUG-3 family (panel-side bulk POST hit upstream LAPI WAL contention), this time on the LAPI's own DB. Smoking gun: CrowdSec's own log emits, on every boot:

      level=warning msg="sqlite is not using WAL mode, LAPI might
      become unresponsive when inserting the community blocklist"

  The config confirms `db_config.use_wal: false`.
- Internal lock contention -- a symptom of suspect 2, not an independent cause. The slow `/v1/decisions` GETs were waiting on the writer.
- Container resource limits -- 184MiB / 256MiB memory, 0.5 CPU; under pressure but not the root cause. Skipped.
PHASE 2 -- apply fix¶
Added db_config.use_wal: true to crowdsec/config.yaml.local. CrowdSec merges .yaml.local into the upstream config.yaml defaults so a single key in the override is enough. Applied via docker compose restart crowdsec; SQLite issues PRAGMA journal_mode=WAL on the existing DB file at the next open. No data migration; ~3s downtime.
PHASE 3 -- verify¶
| Measurement | Pre-fix | Post-fix |
|---|---|---|
| `cscli alerts list` (idle) | 300-932ms | 294-460ms |
| Concurrent reads under load | 3.3-3.9s (see logs) | 217-314ms |
| `PRAGMA journal_mode` | delete | wal |
| Startup warning | present | absent |
The empirical oracle is the CrowdSec maintainer-authored warning. They added it because they know WAL mode prevents the unresponsiveness pattern. With WAL on, the warning is gone and the documented failure mode is structurally eliminated. The real validation is observing the next CAPI sync without slow GETs in the logs; that happens automatically every ~2 hours post-deploy.
What ships¶
crowdsec/config.yaml.local¶
The added stanza is the entire fix. The existing `api.server` listener override is untouched.
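For reference, the override is a single key. A minimal sketch of the stanza (the key name matches CrowdSec's documented `db_config` schema; the comment is mine):

```yaml
db_config:
  use_wal: true   # switch the LAPI's SQLite DB to WAL journal mode at next open
```

Because CrowdSec merges `.yaml.local` over the upstream `config.yaml` defaults, nothing else from `db_config` needs to be repeated here.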
scripts/smoke/lapi-wal.sh¶
3-step smoke against the live argos-prod-crowdsec container:
1. `PRAGMA journal_mode` returns `wal` (the persistent indicator that WAL is active on the DB file).
2. `docker logs` since the current container's `StartedAt` timestamp shows zero "sqlite is not using WAL mode" warnings. Scoped to the current boot via `docker inspect --format '{{.State.StartedAt}}'` so a previous container's pre-fix log entries don't false-positive.
3. The `.db-wal` sidecar exists when writes have happened post-restart. Best-effort gate: a freshly-restarted container with no writes yet still has WAL mode active (per #1) but no sidecar file yet, which is fine.
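The three steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the shipped script: the DB path inside the container and the availability of `sqlite3` in the image are assumptions.

```shell
#!/bin/sh
# Sketch of a WAL smoke check; the shipped scripts/smoke/lapi-wal.sh may differ.
set -eu

lapi_wal_smoke() {
  container=argos-prod-crowdsec
  db=/var/lib/crowdsec/data/crowdsec.db        # assumed in-container DB path
  warn='sqlite is not using WAL mode'

  docker ps --format '{{.Names}}' | grep -qx "$container" \
    || { echo "SKIP: $container not running"; return 0; }

  # 1. Persistent indicator: journal mode on the live DB file must be "wal".
  mode=$(docker exec "$container" sqlite3 "$db" 'PRAGMA journal_mode;')
  [ "$mode" = "wal" ] || { echo "FAIL: journal_mode=$mode"; return 1; }

  # 2. No WAL warning since this container's boot (older containers' logs excluded).
  started=$(docker inspect --format '{{.State.StartedAt}}' "$container")
  if docker logs --since "$started" "$container" 2>&1 | grep -q "$warn"; then
    echo "FAIL: startup warning present since $started"; return 1
  fi

  # 3. Best-effort: the -wal sidecar appears once something has written post-restart.
  docker exec "$container" test -f "${db}-wal" \
    || echo "NOTE: no ${db}-wal sidecar yet (no writes since restart)"

  echo "PASS: WAL active, no startup warning"
}

# Run only where a docker daemon is reachable (i.e. on the prod host).
if command -v docker >/dev/null 2>&1; then
  lapi_wal_smoke
fi
```

Note the `--since "$started"` scoping in step 2: without it, a surviving pre-fix log line from an earlier boot would fail the gate even after the fix is live.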
Version bump¶
backend/cmd/argos/main.go argosVersion and frontend/package.json version bumped to 1.3.28 even though neither file's surrounding code changed. The panel binary string is the stack release identifier -- operators read it in the panel footer.
Smoke gate (3/3 PASS)¶
1. `make sync-prod --dry` clean (or expected v1.3.28-only diff) before deploy.
2. `make sync-prod --yes` + `docker compose restart crowdsec` applies the new config; `PRAGMA journal_mode` flips to `wal`.
3. `scripts/smoke/lapi-wal.sh` PASS: WAL active, no startup warning, `.db-wal` sidecar created on first write.
The under-load improvement (no slow GETs during CAPI sync) is observable from `docker logs argos-prod-crowdsec --since 2h 2>&1 | grep -E "GET /v1/decisions.*HTTP/1.1 200 [0-9]+\.[0-9]+s"` after the next sync window (`-E` because the `+` quantifier needs extended regex; `2>&1` because docker logs emits on stderr). Pre-fix: 20+ slow GETs at 3-4s each. Post-fix: should be 0 or a single-digit count, all under 1s.
Files changed¶
- `crowdsec/config.yaml.local` (db_config.use_wal: true)
- `scripts/smoke/lapi-wal.sh` (new)
- `backend/cmd/argos/main.go` (argosVersion bump)
- `frontend/package.json` (version bump)
- `docs/release-notes/v1.3.28.md` (this file)
- `docs/planning/v1.3.29-true-detect-mode.md` (renamed from v1.3.28-...; same content, shifted one cycle)
- `CHANGELOG.md`, `mkdocs.yml`
Upgrade¶
```shell
cd ~/argos-edge
git pull
make sync-prod                                   # picks up config.yaml.local change
docker compose -f /path/to/argos-prod/docker-compose.yml \
  restart crowdsec                               # applies WAL on the existing DB
```
The argos panel image needs a rebuild for the version-string bump:
```shell
cd ~/argos-prod
docker build -f backend/Dockerfile -t argos-prod-argos:v1.3.28 .
# update docker-compose.override.yml: image: argos-prod-argos:v1.3.28
docker compose up -d --force-recreate --no-deps argos
```
If you skip the panel rebuild, the LAPI fix still takes effect (only crowdsec needs to restart for WAL); the panel footer just shows v1.3.27 instead of v1.3.28.
Not changed¶
- All v1.3.27 backend / frontend / migration code unchanged.
- The drift detector + GET /api/security/drift untouched.
- Caddyfile, CrowdSec scenarios, AppSec configs -- all v1.3.27.
- Migration 031 still latest.
Six-strike pattern: still six¶
This release does NOT add a seventh case to the upstream-behaviour pattern memo. The fix was a CrowdSec config key that CrowdSec itself documents in the YAML schema; the maintainer-authored startup warning told us exactly what to flip. No upstream-vs-docs divergence; no fork; no PR. A clean win.