v1.3.33 -- Country bans alert-shape fix (eighth-strike)
A focused critical-fix release. v1.3.31 dogfood revealed silent decision loss on the operator's prod stack: 8 country bans the panel claimed as active had ZERO active LAPI decisions. Root cause was a fundamental shape mismatch between argos's per-CIDR alert emit and CrowdSec's flush.max_items: 5000 cap.
This is the eighth strike in the upstream-behaviour pattern: the canonical-CAPI shape is 1 alert with N decisions inside; argos's pre-v1.3.33 shape was N alerts each with 1 decision. Both are documented as valid /v1/alerts payloads -- only the CAPI shape survives at scale.
What broke
In v1.3.22 the panel adopted chunked batch POST to LAPI (500-CIDR chunks) so BR's 5009 ranges wouldn't deadlock the HTTP request. Each chunk POST sent an array of 500 alert envelopes, each carrying ONE decision. That's the documented /v1/alerts payload for ad-hoc operator-issued bans.
What the operator's dogfood discovered: CrowdSec's LAPI keeps at most flush.max_items: 5000 alert rows. Add a country with roughly 5000 ranges (BR has 5,009) and the alert table overflows; CrowdSec flushes the oldest alerts to bring the count back under the cap, and each flushed alert's child decisions cascade-delete with it.
Compounding effect: every NEW country expansion adds N alerts. The OLDER expansions silently lose their decisions as the cap keeps cycling. After 8 country expansions on a stack (IR, IN, RU, TR, US, VN, CN, KR, NG with ~38k CIDRs total), the LAPI's 5000-alert cap had cycled multiple times. The most recent expansion (BR) was the only one with active decisions at any given moment; the rest were silently flushed.
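The cascade described above can be reproduced with a toy model. The sketch below is not CrowdSec code: `alertRow`, `alertTable`, and `perCIDR` are hypothetical names, and the eviction happens on every insert, whereas the real flush runs as a periodic job. It only illustrates why the per-CIDR shape leaves the most recent country as the sole survivor.

```go
package main

import "fmt"

// alertRow is a simplified stand-in for one LAPI alert row; decisions is the
// number of child decisions that are deleted along with the row.
type alertRow struct {
	origin    string
	decisions int
}

// alertTable models an alert table with a row cap (a hypothetical model of
// flush.max_items; the real flush is periodic, not triggered per insert).
type alertTable struct {
	maxItems int
	rows     []alertRow
}

// insert appends rows, then evicts the oldest rows until the table is back
// under the cap. Evicted rows take their decisions with them.
func (t *alertTable) insert(rows ...alertRow) {
	t.rows = append(t.rows, rows...)
	if over := len(t.rows) - t.maxItems; over > 0 {
		t.rows = t.rows[over:]
	}
}

// activeDecisions sums the decisions still attached to surviving rows.
func (t *alertTable) activeDecisions(origin string) int {
	n := 0
	for _, r := range t.rows {
		if r.origin == origin {
			n += r.decisions
		}
	}
	return n
}

// perCIDR builds the pre-v1.3.33 shape: one alert per range, one decision each.
func perCIDR(origin string, ranges int) []alertRow {
	out := make([]alertRow, ranges)
	for i := range out {
		out[i] = alertRow{origin: origin, decisions: 1}
	}
	return out
}

func main() {
	t := &alertTable{maxItems: 5000}
	t.insert(perCIDR("NG", 471)...)
	t.insert(perCIDR("IR", 1454)...)
	t.insert(perCIDR("BR", 5009)...)
	for _, cc := range []string{"NG", "IR", "BR"} {
		fmt.Printf("%s active decisions: %d\n", cc, t.activeDecisions(cc))
	}
	// Prints:
	// NG active decisions: 0
	// IR active decisions: 0
	// BR active decisions: 5000
}
```

In this model BR itself loses 9 decisions, because its 5,009 rows alone exceed the cap.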
Why CAPI works at scale
CrowdSec's own community-blocklist sync uses a different shape:
    {
      "scenario": "update : +15000/-0 IPs",
      "source": {
        "scope": "crowdsecurity/community-blocklist",
        "value": ""
      },
      "events_count": 0,
      "decisions": [
        {"scope": "Ip", "value": "...", "origin": "CAPI"},
        ... (15000 entries)
      ]
    }
ONE alert envelope. 15,000 decisions inside. One row against the 5000-alert cap. The flush cascade never fires.
Three bundled fixes
Fix 1: AddRangeDecisions shape restructure (root cause)
The panel's LAPI client now emits one alert per call carrying all N decisions inside, mirroring the CAPI shape. v1.3.22's 500-decision chunking is preserved as a TRANSPORT-level chunking (one HTTP POST per chunk), but each chunk now produces a single alert with up to 500 decisions inside.
Concrete numbers post-v1.3.33:
| Country | Ranges | Pre-v1.3.33 alerts | Post-v1.3.33 alerts |
|---|---|---|---|
| NG | 471 | 471 | 1 |
| IR | 1,454 | 1,454 | 3 |
| BR | 5,009 | 5,009 | 11 |
| sum of 25 avg-2k-range countries | ~50k | 50,000 | ~100 |
The 5000-alert cap is never approached again under realistic operator workloads.
Mixed-origin batches now error explicitly. Each call must be homogeneous in Origin; the only existing caller (Expander.Ban) already satisfies this contract.
Fix 2: country reconciler + state column (defensive layer)
Migration 033 adds a state column on country_ban_expansions with a CHECK constraint restricting it to ('active', 'drifted'). The new country.Reconciler runs every 5 minutes via the standard goroutine + ticker pattern, compares the panel's cidr_count against the LAPI decision count per origin, and flips the state when divergence exceeds 1%.
Catches drift causes not covered by Fix 1: manual cscli mutations, panel restart during a writer window, future upstream changes that shift the LAPI semantics again.
The reconciler classifier uses tolerance scaling: 5000-CIDR country drifts at 51 missing decisions; 5-CIDR country drifts at 2 missing (tolerance floors at 1). Empirically tuned against community-blocklist re-sync noise (CAPI re-pushes every 2h with brief perturbations during the swap).
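The tolerance arithmetic can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than facts from the reconciler source: the 1% is computed with integer division, and divergence is treated as an absolute difference, so extra LAPI decisions count as well as missing ones.

```go
package main

import "fmt"

// tolerance returns the allowed divergence for a country: 1% of the panel's
// CIDR count (integer division), floored at 1.
func tolerance(cidrCount int) int {
	t := cidrCount / 100
	if t < 1 {
		t = 1
	}
	return t
}

// classify returns "drifted" when the panel and LAPI counts differ by more
// than the tolerance, and "active" otherwise.
func classify(panelCount, lapiCount int) string {
	diff := panelCount - lapiCount
	if diff < 0 {
		diff = -diff
	}
	if diff > tolerance(panelCount) {
		return "drifted"
	}
	return "active"
}

func main() {
	fmt.Println(classify(5000, 4950)) // 50 missing, tolerance 50: active
	fmt.Println(classify(5000, 4949)) // 51 missing: drifted
	fmt.Println(classify(5, 4))       // 1 missing, tolerance floored to 1: active
	fmt.Println(classify(5, 3))       // 2 missing: drifted
}
```

Without the floor, a 5-CIDR country would have a tolerance of 0 and flip to drifted on a single transient miss.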
The state field is already exposed in the API response; frontend rendering of the drift indicator is queued for a follow-up release. The API contract is in place, so the frontend work is purely visual.
Fix 3: smoke isolation¶
scripts/smoke/country-expansion-async.sh now requires explicit operator-set country codes:
The default TEST_COUNTRY=XX and FAIL_TEST_COUNTRY=YY placeholders (codes that ISO 3166-1 does not assign to any country) refuse to run. The v1.3.31 smoke shipped without this gate, and its blanket DELETE on cleanup contributed to the operator's dogfood incident.
This is the same isolation pattern the v1.3.21 country-block.sh smoke has used from the beginning. Both smokes now share a consistent contract: refuse to run with placeholders, so a bare invocation cannot accidentally mutate operator state.
Smoke gate (all PASS post-deploy)
- lapi-flush-cap.sh (new, 8 phases):
  - Pre-test alert count: 5001
  - Add NG (471 ranges): +1 alert -> 5002
  - Add IR (1454 ranges): +3 alerts -> 5005
  - Both countries' decisions remain active (no flush cascade)
  - Cleanup
- country-reconciler.sh (new, 5 phases): operator-mediated due to the 5-min reconciler interval; runs manually post-deploy.
- Existing smokes (regression sweep): lapi-wal.sh, scenario-descriptions.sh, scenarios-toggle.sh, appsec-tuning.sh, host-crud.sh, whitelist-roundtrip.sh -- all PASS unchanged.
Mid-impl gotchas (caught + fixed pre-tag)
- Test schema sync: the country package's openTestDB + jobsDB helpers needed the new state column added.
- Mutex-test flakiness: my refactor made AddRangeDecisions faster, shrinking the window where the serialisation test could observe id1=running while id2=pending. Bumped the fake's addDelay 100ms -> 250ms and the sample window 150ms -> 400ms.
- Smoke assertion too strict: initial smoke asserted delta == 1 per country. IR (1454 ranges, 3 chunks) produced delta=3. Adjusted to ceil(cidr_count/500). This is the v1.3.22 chunking still doing its job; the invariant is "alert count proportional to chunk count, not CIDR count".
Files changed
- backend/migrations/033_country_expansions_state.{up,down}.sql (new)
- backend/internal/crowdsec/client.go (AddRangeDecisions rewrite + CountDecisionsByOrigin)
- backend/internal/crowdsec/client_test.go (test asserts new shape; new test for heterogeneous-origin rejection)
- backend/internal/security/country/reconciler.go (new)
- backend/internal/security/country/reconciler_test.go (new, 4 tests)
- backend/internal/security/country/expander.go (Expansion.State field; LIST query selects state)
- backend/internal/security/country/expander_test.go, jobs_test.go (test schema sync)
- backend/internal/db/migrate_test.go (rollback chain extended)
- backend/cmd/argos/main.go (Reconciler.Start wiring)
- scripts/smoke/country-expansion-async.sh (isolation gate)
- scripts/smoke/lapi-flush-cap.sh (new)
- scripts/smoke/country-reconciler.sh (new)
- docs/release-notes/v1.3.33.md (this file)
- CHANGELOG.md, mkdocs.yml, version bump
Operator action required post-deploy
The operator's prod stack had 8 banned countries silently deactivated during v1.3.31. To restore enforcement:
    # Re-apply each country one at a time and verify decisions
    # persist across multiple expansions (the v1.3.33 fix's
    # real-world validation):
    for cc in IN IR RU TR US VN CN KR NG; do
      curl -fsS -X POST \
        -H "Cookie: argos_session=$SESSION" \
        -H "Content-Type: application/json" \
        -d '{"duration":"168h","reason":"v1.3.33 re-apply"}' \
        "http://localhost:9180/api/security/countries/${cc}/expand" \
        | jq -r '.id'
      sleep 30  # let the async job complete
      # Verify decisions persisted (NOT 0):
      docker exec argos-prod-crowdsec cscli decisions list \
        --origin "argos-country-${cc}" --limit 0 \
        | tail -n +2 | wc -l
    done
Pre-v1.3.33 each new country would have wiped the previous ones; post-v1.3.33 they all coexist.
Not changed
- All v1.3.32 backend / frontend / migration code unchanged.
- Async-job pattern (v1.3.31), drift detector (v1.3.27), true_detect_mode (v1.3.29), scenario descriptions (v1.3.30) all unchanged.
- LAPI WAL mode (v1.3.28) still active; helps reduce contention but does not address the alert-cap issue (which is row-count-driven, not lock-driven).
Eighth-strike pattern: lessons
Documented in project_four_strike_upstream_pattern.md. The key lesson for future LAPI bulk-emit code paths:
When emitting bulk LAPI data, mirror CAPI shape -- it's the only shape upstream tested at scale.
The pre-v1.3.33 shape (N alerts, 1 decision each) is documented in CrowdSec's OpenAPI spec as valid; LAPI accepts it without complaint. The cap-flush behaviour is invisible to a unit-tested emit path because the test only verifies the request shape, not the long-term state evolution after many emits.
Smoke verifies EFFECT. Unit tests verify spec. Both are necessary; neither is sufficient.