v1.3.22 -- Country expansion: revoke fix + supernet rollup architecture
A focused two-bug release. v1.3.21 country expansion shipped with two latent bugs that only became visible when the operator exercised the lifecycle end-to-end against a real LAPI:
- Revoke was structurally broken. The panel sent ?origins= (plural) on the DELETE path; LAPI's DELETE handler only accepts ?origin= (singular). The GET handler accepts the plural form, which obscured the asymmetry until prod-smoke. Verified Apr 25 2026 with HTTP 500 from real LAPI: 'origins' doesn't exist: invalid filter.
- Bulk insertion silently dropped most entries. LAPI's /v1/alerts batch is NOT atomic at scale. A 21,521-alert single POST committed only ~5,001 IPv6 entries before SQLite WAL lock contention silently abandoned the rest -- with 201 Created returned to the client. v1.3.21's per-CIDR loop ran a similar pattern with the same result, but the slowness masked it. Verified Apr 25 2026: panel reported 21,521 inserts, LAPI stored 5,001, zero IPv4. BR test IPs (146.70.98.104, 200.221.2.45 etc.) were NOT enforced at the Caddy edge.
The fix is two changes:
- Singular origin= on DELETE. Single-character upstream filter-name fix.
- Architecture pivot: supernet rollup at /16 floor. Instead of pushing every DB-IP-Lite CIDR (21k+ for BR, ~5k for KR), the panel runs the MMDB output through a rollup that compresses to <= ~200 supernets where possible, with a /16 floor (v4) and /28 floor (v6) preventing over-blocking of neighbouring address space. This lands within LAPI's atomic-batch comfort zone for most countries (50-200 entries, sub-second push) and keeps fragmented allocations like BR workable (~5,000 entries, ~25 seconds).
What ships
crowdsec.Client.AddRangeDecisions(ctx, []input) + WriteHTTP
backend/internal/crowdsec/client.go now exposes a batch method for Range decisions, used by the country expander. A separate WriteHTTP http.Client (5 min timeout, vs 10s for HTTP) lets batch writes hold the LAPI request open for tens of seconds without competing with the short-timeout polling reads. Single-input AddRangeDecision becomes a one-line wrapper for callers that don't have a batch shape.
DeleteDecisionsByOrigin filter fix
// before:
u := c.URL + "/v1/decisions?" + url.Values{"origins": {origin}}.Encode()
// after:
u := c.URL + "/v1/decisions?" + url.Values{"origin": {origin}}.Encode()
Cited inline in the source comment with the upstream pkg/database/decisions.go permalinks (GET handler L75 vs DELETE handler L471). TestDeleteDecisionsByOriginUsesSingularParam locks the regression: a future refactor that goes back to the plural name fails the test.
country.RollupToSupernets(cidrs, target)
New file backend/internal/security/country/rollup.go. Compresses a CIDR list to <= target supernets while never widening past MinV4Prefix=16 or MinV6Prefix=28. The walk picks the longest prefix length that fits the budget; if no length in [floor, maxBits] fits, returns the floor-length result even if it exceeds budget. Coverage is preserved: every input IP is still contained in at least one output supernet.
Expander.Ban chunked batch with continue-on-error
The Ban path now:
- Resolves CIDRs from the MMDB.
- Runs them through RollupToSupernets.
- Splits into chunks of ChunkSize=500 per /v1/alerts POST.
- On a chunk failure, logs + continues; tracks failed_chunks in the response.
- Persists the tracking row with the COMMITTED CIDRs only; decision_ids is the JSON of CIDR strings (not opaque LAPI IDs). Revoke uses the origin tag, not the IDs, so this stays robust.
Panel http.Server.WriteTimeout raised to 20 min
backend/internal/server/server.go. The previous 30s ceiling cut the country-expansion handler off mid-loop and left the operator with no signal whether the work succeeded. v1.3.23 will revisit with an async background-job path; v1.3.22's chunked sync path needs the timeout headroom for fragmented countries.
Frontend partial-success surface
The Settings page "Country bans (expanded)" toast now reads "added BR: 5009 of 5009 CIDR ranges committed" on full success, "added BR: 4500 of 5009 CIDR ranges committed (1 chunks failed -- retry to fill in)" on partial. Submit button label updated: Expanding (up to ~10 min for large countries)....
Smoke
scripts/smoke/country-block.sh PASSes against the v1.3.22 prod stack:
BR re-expansion via POST /api/security/countries/expand:
HTTP 201 in 25s
cidr_count=5009, requested_count=5009, failed_chunks=0
LAPI Range count for argos-country-BR: 5009 (cscli caps display at 5001)
Per-IP probes (XFF spoofed, run from inside docker network so
trusted_proxies honours the header):
146.70.98.104 (BR, M247) -> HTTP 403 PASS
149.102.251.103 (BR, Datacamp) -> HTTP 403 PASS
200.221.2.45 (BR, UFRJ) -> HTTP 403 PASS
177.10.0.1 (BR, Clean Net) -> HTTP 403 PASS
8.8.8.8 (US, Google) -> HTTP 302 PASS (not blocked)
1.1.1.1 (US, Cloudflare) -> HTTP 302 PASS (not blocked)
The negative controls (8.8.8.8, 1.1.1.1) prove the /16 floor is working. v1.3.22's first iteration without the floor widened supernets to /6 and over-blocked both these IPs; the floor change caught it before tag.
Tests
19 country-package tests + 3 crowdsec-client tests pass:
- 9 v1.3.21 expander tests carry over (happy path, code validation, unknown country, replace-on-conflict, revoke happy/missing, list ordering, case insensitivity).
- New TestBanChunksLargeInput -- 2500 inputs ÷ 500 chunk_size = 5 batch calls, asserts the chunked path is on.
- New TestBanContinuesOnChunkFailure -- middle chunk fails, failed_chunks=1, persisted row has the surviving CIDRs.
- New TestBanUnwindsWhenAllChunksFail -- every chunk fails, origin-tag delete cleans up, no row persisted.
- Renamed TestBanSmallInputUsesSingleBatch -- small input still produces exactly 1 batch call.
- 7 rollup tests covering small-input passthrough, adjacent-prefix collapse, coverage invariant, BR-size simulation, v4/v6 split, empty input, malformed input.
- New TestRollupRespectsV4Floor -- regression lock for the /6 over-blocking incident: 50 disjoint /24s in distant /8s, budget=10. The algorithm must NOT widen below /16; the test fails if it does.
- 3 crowdsec-client tests covering the singular-origin fix, batch-emit shape, empty-input no-op.
The four-strike upstream-behaviour pattern
v1.3.22 closes the fourth case in a recurring failure mode: bugs that pass unit tests with fakes but fail against the real upstream. The four cases, recorded for the working agreement going forward:
| Release | Bug | Where unit tests with fakes fell short |
|---|---|---|
| v1.3.18 | client_ip vs remote_ip semantics in Caddy v2.7+ | tests checked JSON shape, not Caddy runtime resolution |
| v1.3.20 | Plugin lacks scope=Country support entirely | tests verified the panel emit, not the upstream plugin behavior |
| v1.3.22 BUG-2 | LAPI filter naming asymmetry (GET origins / DELETE origin) | tests stubbed the LAPI client, never validated against real LAPI's filter-name parser |
| v1.3.22 BUG-3 | LAPI silently drops bulk inserts under SQLite WAL lock contention | tests asserted on emit shape, not on persisted state |
The lesson, written into memory:
Smoke must verify EFFECT (per-IP enforcement), not specs (row counts). Specs reflect assumptions about reality. Reality wins. Adjust specs to empirical findings, not the other way around.
The numerical smoke gate ≤200 entries was unreachable for BR (real allocation = ~5000 /16 supernets); the gate relaxes to "per-IP enforcement holds, no over-blocking, push completes in <30s". v1.3.22 ships with that adjusted gate.
Trade-offs
- Most countries fit ≤200 supernets, ~sub-second push. BR / IN / IR-class fragmented allocations need ~3-5k supernets, ~20-30s push. The Settings UI surfaces this honestly via the button label and the toast.
- /16 floor means slight over-coverage for some countries. A /16 supernet is ~65k IPs; an ISP allocation rarely fills a full /16, so some neighbouring IPs may share the supernet. Acceptable for the operator-trust model (better to slightly over-block within the same /16 than to severely over-block at /6).
- Up to ~5,000 LAPI Range decisions per fragmented country (5009 for BR). The active set scales linearly with operator-issued country bans: 10 fragmented countries x 5000 = 50,000 decisions, well within LAPI / bouncer radix-tree capacity.
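The arithmetic behind these trade-offs, as a short worked example. The helper name `v4Addrs` is illustrative; the prefix lengths and the 10 x 5000 figure are the ones quoted above.

```go
package main

import "fmt"

// v4Addrs returns how many IPv4 addresses a prefix of the given length covers.
func v4Addrs(prefixLen int) uint64 {
	return uint64(1) << uint(32-prefixLen)
}

func main() {
	fmt.Println("/16 covers", v4Addrs(16), "addresses") // the floor
	fmt.Println("/6 covers", v4Addrs(6), "addresses")   // the pre-floor incident
	fmt.Println("ratio:", v4Addrs(6)/v4Addrs(16))
	fmt.Println("10 fragmented countries:", 10*5000, "decisions")
}
```

A /6 spans 1,024 times the address space of a /16, which is why the floor turns severe over-blocking into bounded over-coverage.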
Files changed
Backend
- backend/internal/crowdsec/client.go -- WriteHTTP, AddRangeDecisions, DeleteDecisionsByOrigin filter fix, ListDecisionsByScope.
- backend/internal/crowdsec/client_test.go (new) -- 3 regression-lock tests.
- backend/internal/security/country/expander.go -- chunked Ban with continue-on-error + RequestedCount/FailedChunks in BanResult.
- backend/internal/security/country/rollup.go (new) -- RollupToSupernets + family-aware floor.
- backend/internal/security/country/rollup_test.go (new) -- 7 tests including v4-floor regression lock.
- backend/internal/security/country/expander_test.go -- updated fakeLAPI for batch shape, new chunking + continue-on-error tests.
- backend/internal/security/country/source.go -- MMDBSource applies rollup before returning.
- backend/internal/server/server.go -- WriteTimeout 20m.
Frontend
- frontend/src/api/client.ts -- requested_count and failed_chunks on CountryExpansionResult.
- frontend/src/pages/Settings.tsx -- partial-failure toast, longer button label.
Docs
- docs/release-notes/v1.3.22.md (this file).
- docs/operations/access-control.md -- per-country timing expectations + floor rationale.
- docs/planning/v1.3.23-async-country-expansion.md (new) -- scope sketch for the async background-job path.
- CHANGELOG.md, version bump.
Upgrade
No DB migration. No env vars. No compose surface change.
If you have v1.3.21-issued country expansions in country_ban_expansions: the rows describe entries that LAPI silently dropped. Recommend re-issuing the expansion (the panel's POST /api/security/countries/expand is idempotent on country_code -- INSERT OR REPLACE) so LAPI gets a clean post-rollup set.
# For each existing expansion, re-issue:
curl -X POST https://your-host/api/security/countries/expand \
-b "argos_session=..." \
-d '{"country_code":"BR","duration":"8760h","reason":"v1.3.22 re-expand"}'
Not changed
- DB schema (migration 029 still latest).
- API endpoint shapes (POST/GET/DELETE /api/security/countries/* unchanged).
- v1.3.20 enable_streaming: false emit -- still in place; required for any non-IP scope to enforce.
- v1.3.19 self-block banner, whitelist lifecycle.
- hosts.true_detect_mode schema column from v1.3.19 stays dormant.