v1.3.33 -- Country bans alert-shape fix (eighth-strike)

A focused critical-fix release. v1.3.31 dogfood revealed silent decision loss on the operator's prod stack: 8 country bans the panel claimed as active had ZERO active LAPI decisions. Root cause was a fundamental shape mismatch between argos's per-CIDR alert emit and CrowdSec's flush.max_items: 5000 cap.

This is the eighth strike in the upstream-behaviour pattern: the canonical-CAPI shape is 1 alert with N decisions inside; argos's pre-v1.3.33 shape was N alerts each with 1 decision. Both are documented as valid /v1/alerts payloads -- only the CAPI shape survives at scale.

What broke

In v1.3.22 the panel adopted chunked batch POST to LAPI (500-CIDR chunks) so BR's 5009 ranges wouldn't deadlock the HTTP request. Each chunk POST sent an array of 500 alert envelopes, each carrying ONE decision. That's the documented /v1/alerts payload for ad-hoc operator-issued bans.

What the operator's dogfood discovered: CrowdSec's LAPI keeps at most flush.max_items: 5000 alert rows. Add a country with more than 5000 ranges and the alert table overflows; CrowdSec flushes the oldest alerts to bring the count back under the cap, and each flushed alert's child decisions cascade-delete with it.

Compounding effect: every NEW country expansion adds N alerts. The OLDER expansions silently lose their decisions as the cap keeps cycling. After 8 country expansions on a stack (IR, IN, RU, TR, US, VN, CN, KR, NG with ~38k CIDRs total), the LAPI's 5000-alert cap had cycled multiple times. The most recent expansion (BR) was the only one with active decisions at any given moment; the rest were silently flushed.

Why CAPI works at scale

CrowdSec's own community-blocklist sync uses a different shape:

{
  "scenario": "update : +15000/-0 IPs",
  "source": {
    "scope": "crowdsecurity/community-blocklist",
    "value": ""
  },
  "events_count": 0,
  "decisions": [
    {"scope": "Ip", "value": "...", "origin": "CAPI"},
    ...  (15000 entries)
  ]
}

ONE alert envelope. 15,000 decisions inside. One row against the 5000-alert cap. The flush cascade never fires.

Three bundled fixes

Fix 1: `AddRangeDecisions` shape restructure (root cause)

The panel's LAPI client now emits one alert per call carrying all N decisions inside, mirroring the CAPI shape. v1.3.22's 500-decision chunking is preserved as a TRANSPORT-level chunking (one HTTP POST per chunk), but each chunk now produces a single alert with up to 500 decisions inside.
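A minimal sketch of that restructure, assuming simplified payload types: the Alert and Decision structs and the buildAlerts helper are hypothetical stand-ins, not the panel's actual client code, and the real LAPI models carry more fields. It shows transport chunking in which each chunk becomes one alert holding every decision of that chunk.

```go
package main

import "fmt"

// Decision and Alert are simplified, hypothetical stand-ins for the
// LAPI payload types.
type Decision struct {
	Scope  string
	Value  string
	Origin string
}

type Alert struct {
	Scenario  string
	Decisions []Decision
}

// buildAlerts splits cidrs into chunks of at most chunkSize and returns
// one alert per chunk, each carrying all decisions of that chunk.
// chunkSize must be positive.
func buildAlerts(cidrs []string, origin string, chunkSize int) []Alert {
	var alerts []Alert
	for start := 0; start < len(cidrs); start += chunkSize {
		end := start + chunkSize
		if end > len(cidrs) {
			end = len(cidrs)
		}
		chunk := cidrs[start:end]
		decisions := make([]Decision, 0, len(chunk))
		for _, c := range chunk {
			decisions = append(decisions, Decision{Scope: "Range", Value: c, Origin: origin})
		}
		alerts = append(alerts, Alert{
			Scenario:  fmt.Sprintf("%s: +%d ranges", origin, len(chunk)),
			Decisions: decisions,
		})
	}
	return alerts
}

func main() {
	// Placeholder values standing in for an IR-sized list of 1454 ranges.
	cidrs := make([]string, 1454)
	for i := range cidrs {
		cidrs[i] = fmt.Sprintf("cidr-%d", i)
	}
	alerts := buildAlerts(cidrs, "argos-country-IR", 500)
	fmt.Println("alerts:", len(alerts), "decisions in last alert:", len(alerts[len(alerts)-1].Decisions))
}
```

Each returned alert would then go out in its own HTTP POST, so the request size stays bounded while the alert table grows by one row per chunk rather than one row per CIDR.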

Concrete numbers post-v1.3.33:

Country	Ranges	Pre-v1.3.33 alerts	Post-v1.3.33 alerts
NG	471	471	1
IR	1,454	1,454	3
BR	5,009	5,009	11
sum of 25 avg-2k-range countries	~50k	50,000	~100

The 5000-alert cap is never approached again under realistic operator workloads.

Mixed-origin batches now error explicitly. Each call must be homogeneous in Origin; the only existing caller (Expander.Ban) already satisfies this contract.

Fix 2: country reconciler + state column (defensive layer)

Migration 033 adds a state column on country_ban_expansions with a CHECK constraint limiting it to 'active' or 'drifted'. The new country.Reconciler runs every 5 minutes via the standard goroutine + ticker pattern, compares the panel's cidr_count against the LAPI decision count per origin, and flips the state when the divergence exceeds 1%.

Catches drift causes not covered by Fix 1: manual cscli mutations, panel restart during a writer window, future upstream changes that shift the LAPI semantics again.

The reconciler's classifier scales its tolerance with country size: a 5000-CIDR country drifts at 51 missing decisions; a 5-CIDR country drifts at 2 missing (the tolerance floors at 1). The threshold was tuned empirically against community-blocklist re-sync noise (CAPI re-pushes every 2h, with brief perturbations during the swap).
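The tolerance rule works out as follows. This is a sketch under stated assumptions rather than the shipped classifier: classify is a hypothetical name, and treating surplus LAPI decisions as divergence (absolute difference) is an assumption, since the text only gives examples of missing decisions.

```go
package main

import "fmt"

// classify returns "drifted" when the panel and LAPI counts diverge by
// more than the tolerance: 1% of the panel count, never below 1.
func classify(panelCount, lapiCount int) string {
	tolerance := panelCount / 100
	if tolerance < 1 {
		tolerance = 1
	}
	diff := panelCount - lapiCount
	if diff < 0 {
		diff = -diff // assumption: surplus decisions also count as divergence
	}
	if diff > tolerance {
		return "drifted"
	}
	return "active"
}

func main() {
	fmt.Println(classify(5000, 4950)) // 50 missing: within the 1% tolerance
	fmt.Println(classify(5000, 4949)) // 51 missing: exceeds the tolerance
	fmt.Println(classify(5, 4))       // 1 missing: within the floor of 1
	fmt.Println(classify(5, 3))       // 2 missing: exceeds the floor
}
```

Integer division is enough here: for whole-number differences, exceeding floor(n/100) is the same as exceeding n/100.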

The state field is already exposed in the API response; frontend rendering of the drift indicator is queued for a follow-up release. Because the API contract is in place, the remaining frontend work is purely visual.

Fix 3: smoke isolation

scripts/smoke/country-expansion-async.sh now requires explicit operator-set country codes:

TEST_COUNTRY=BR FAIL_TEST_COUNTRY=TR \
  ./scripts/smoke/country-expansion-async.sh

With the defaults TEST_COUNTRY=XX and FAIL_TEST_COUNTRY=YY (placeholder codes not assigned to any country in ISO 3166-1) the script refuses to run. The v1.3.31 smoke shipped without this gate, and its blanket DELETE on cleanup contributed to the operator's dogfood incident.

This is the same isolation pattern the v1.3.21 country-block.sh smoke has used from the beginning. Both smokes now share a consistent contract: refuse to run with placeholders, so a bare invocation cannot accidentally mutate operator state.

Smoke gate (all PASS post-deploy)

lapi-flush-cap.sh (new, 8 phases):
Pre-test alert count: 5001
Add NG (471 ranges): +1 alert -> 5002
Add IR (1454 ranges): +3 alerts -> 5005
Both countries' decisions remain active (no flush cascade)
Cleanup
country-reconciler.sh (new, 5 phases): operator-mediated due to the 5-min reconciler interval; runs manually post-deploy.
Existing smokes (regression sweep): lapi-wal.sh, scenario-descriptions.sh, scenarios-toggle.sh, appsec-tuning.sh, host-crud.sh, whitelist-roundtrip.sh -- all PASS unchanged.

Mid-impl gotchas (caught + fixed pre-tag)

Test schema sync: the country package's openTestDB + jobsDB helpers needed the new state column added.
Mutex-test flakiness: my refactor made AddRangeDecisions faster, shrinking the window where the serialisation test could observe id1=running while id2=pending. Bumped the fake's addDelay 100ms -> 250ms and the sample window 150ms -> 400ms.
Smoke assertion too strict: initial smoke asserted delta == 1 per country. IR (1454 ranges, 3 chunks) produced delta=3. Adjusted to ceil(cidr_count/500). This is the v1.3.22 chunking still doing its job; the invariant is "alert count proportional to chunk count, not CIDR count".
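The adjusted assertion reduces to an integer ceiling division. A small sketch (expectedAlerts is a hypothetical helper name) that reproduces the per-country alert counts for the 500-decision chunk size:

```go
package main

import "fmt"

// expectedAlerts returns ceil(cidrCount / chunkSize) using integer math.
// chunkSize must be positive.
func expectedAlerts(cidrCount, chunkSize int) int {
	return (cidrCount + chunkSize - 1) / chunkSize
}

func main() {
	countries := []struct {
		code   string
		ranges int
	}{{"NG", 471}, {"IR", 1454}, {"BR", 5009}}

	for _, c := range countries {
		fmt.Printf("%s: %d ranges -> %d alerts\n", c.code, c.ranges, expectedAlerts(c.ranges, 500))
	}
}
```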

Files changed

backend/migrations/033_country_expansions_state.{up,down}.sql (new)
backend/internal/crowdsec/client.go (AddRangeDecisions rewrite + CountDecisionsByOrigin)
backend/internal/crowdsec/client_test.go (test asserts new shape; new test for heterogeneous-origin rejection)
backend/internal/security/country/reconciler.go (new)
backend/internal/security/country/reconciler_test.go (new, 4 tests)
backend/internal/security/country/expander.go (Expansion.State field; LIST query selects state)
backend/internal/security/country/expander_test.go, jobs_test.go (test schema sync)
backend/internal/db/migrate_test.go (rollback chain extended)
backend/cmd/argos/main.go (Reconciler.Start wiring)
scripts/smoke/country-expansion-async.sh (isolation gate)
scripts/smoke/lapi-flush-cap.sh (new)
scripts/smoke/country-reconciler.sh (new)
docs/release-notes/v1.3.33.md (this file)
CHANGELOG.md, mkdocs.yml, version bump

Operator action required post-deploy

The operator's prod stack had 8 banned countries silently deactivated during v1.3.31. To restore enforcement:

# Re-apply each country one at a time and verify decisions
# persist across multiple expansions (the v1.3.33 fix's
# real-world validation):

for cc in IN IR RU TR US VN CN KR NG; do
    curl -fsS -X POST \
        -H "Cookie: argos_session=$SESSION" \
        -H "Content-Type: application/json" \
        -d '{"duration":"168h","reason":"v1.3.33 re-apply"}' \
        "http://localhost:9180/api/security/countries/${cc}/expand" \
    | jq -r '.id'
    sleep 30   # let the async job complete

    # Verify decisions persisted (NOT 0):
    docker exec argos-prod-crowdsec cscli decisions list \
        --origin "argos-country-${cc}" --limit 0 \
        | tail -n +2 | wc -l
done

Pre-v1.3.33 each new country would have wiped the previous ones; post-v1.3.33 they all coexist.

Not changed

All v1.3.32 backend / frontend / migration code unchanged.
Async-job pattern (v1.3.31), drift detector (v1.3.27), true_detect_mode (v1.3.29), scenario descriptions (v1.3.30) all unchanged.
LAPI WAL mode (v1.3.28) still active; helps reduce contention but does not address the alert-cap issue (which is row-count-driven, not lock-driven).

Eighth-strike pattern: lessons

Documented in project_four_strike_upstream_pattern.md. The key lesson for future LAPI bulk-emit code paths:

When emitting bulk LAPI data, mirror CAPI shape -- it's the only shape upstream tested at scale.

The pre-v1.3.33 shape (N alerts, 1 decision each) is documented in CrowdSec's OpenAPI spec as valid; LAPI accepts it without complaint. The cap-flush behaviour is invisible to a unit-tested emit path because the test only verifies the request shape, not the long-term state evolution after many emits.

Smoke verifies EFFECT. Unit tests verify spec. Both are necessary; neither is sufficient.