Skip to content

v1.3.22 -- Country expansion: revoke fix + supernet rollup architecture

A focused two-bug release. v1.3.21 country expansion shipped with two latent bugs that only became visible when the operator exercised the lifecycle end-to-end against a real LAPI:

  1. Revoke was structurally broken. The panel sent ?origins= (plural) on the DELETE path; LAPI's DELETE handler only accepts ?origin= (singular). The GET handler accepts the plural form, which obscured the asymmetry until prod-smoke. Verified Apr 25 2026 with HTTP 500 from real LAPI: 'origins' doesn't exist: invalid filter.
  2. Bulk insertion silently dropped most entries. LAPI's /v1/alerts batch is NOT atomic at scale. A 21,521-alert single POST committed only ~5,001 IPv6 entries before SQLite WAL lock contention silently abandoned the rest -- with 201 Created returned to the client. v1.3.21's per-CIDR loop ran a similar pattern with the same result, but the slowness masked it. Verified Apr 25 2026: panel reported 21,521 inserts, LAPI stored 5,001, zero IPv4. BR test IPs (146.70.98.104, 200.221.2.45 etc.) were NOT enforced at the Caddy edge.

The fix is two changes:

  1. Singular origin= on DELETE. Single-character upstream filter-name fix.
  2. Architecture pivot: supernet rollup at /16 floor. Instead of pushing every DB-IP-Lite CIDR (21k+ for BR, ~5k for KR), the panel runs the MMDB output through a rollup that compresses to <= ~200 supernets where possible, with a /16 floor (v4) and /28 floor (v6) preventing over-blocking of neighbouring address space. This lands within LAPI's atomic-batch comfort zone for most countries (50-200 entries, sub-second push) and for fragmented allocations like BR (~5,000 entries, ~25 seconds).

What ships

crowdsec.Client.AddRangeDecisions(ctx, []input) + WriteHTTP

backend/internal/crowdsec/client.go now exposes a batch method for Range decisions, used by the country expander. A separate WriteHTTP http.Client (5 min timeout, vs 10s for HTTP) lets batch writes hold the LAPI request open for tens of seconds without competing with the short-timeout polling reads. Single-input AddRangeDecision becomes a one-line wrapper for callers that don't have a batch shape.

DeleteDecisionsByOrigin filter fix

// before:
u := c.URL + "/v1/decisions?" + url.Values{"origins": {origin}}.Encode()
// after:
u := c.URL + "/v1/decisions?" + url.Values{"origin": {origin}}.Encode()

Cited inline in the source comment with the upstream pkg/database/decisions.go permalinks (GET handler L75 vs DELETE handler L471). TestDeleteDecisionsByOriginUsesSingularParam locks the regression: a future refactor that goes back to the plural name fails the test.

country.RollupToSupernets(cidrs, target)

New file backend/internal/security/country/rollup.go. Compresses a CIDR list to <= target supernets while never widening past MinV4Prefix=16 or MinV6Prefix=28. The walk picks the longest prefix length that fits the budget; if no length in [floor, maxBits] fits, returns the floor-length result even if it exceeds budget. Coverage is preserved: every input IP is still contained in at least one output supernet.

Expander.Ban chunked batch with continue-on-error

The Ban path now:

  1. Resolves CIDRs from the MMDB.
  2. Runs them through RollupToSupernets.
  3. Splits into chunks of ChunkSize=500 per /v1/alerts POST.
  4. On a chunk failure, logs + continues; tracks failed_chunks in the response.
  5. Persists the tracking row with the COMMITTED CIDRs only; decision_ids is the JSON of CIDR strings (not opaque LAPI IDs). Revoke uses the origin tag, not the IDs, so this stays robust.

Panel http.Server.WriteTimeout raised to 20 min

backend/internal/server/server.go. The previous 30s ceiling cut the country-expansion handler off mid-loop and left the operator with no signal whether the work succeeded. v1.3.23 will revisit with an async background-job path; v1.3.22's chunked sync path needs the timeout headroom for fragmented countries.

Frontend partial-success surface

The Settings page "Country bans (expanded)" toast now reads "added BR: 5009 of 5009 CIDR ranges committed" on full success, "added BR: 4500 of 5009 CIDR ranges committed (1 chunks failed -- retry to fill in)" on partial. Submit button label updated: Expanding (up to ~10 min for large countries)....

Smoke

scripts/smoke/country-block.sh PASSes against the v1.3.22 prod stack:

BR re-expansion via POST /api/security/countries/expand:
  HTTP 201 in 25s
  cidr_count=5009, requested_count=5009, failed_chunks=0
  LAPI Range count for argos-country-BR: 5009 (cscli caps display at 5001)

Per-IP probes (XFF spoofed, run from inside docker network so
trusted_proxies honours the header):
  146.70.98.104 (BR, M247)        -> HTTP 403  PASS
  149.102.251.103 (BR, Datacamp)  -> HTTP 403  PASS
  200.221.2.45 (BR, UFRJ)         -> HTTP 403  PASS
  177.10.0.1 (BR, Clean Net)      -> HTTP 403  PASS
  8.8.8.8 (US, Google)            -> HTTP 302  PASS (not blocked)
  1.1.1.1 (US, Cloudflare)        -> HTTP 302  PASS (not blocked)

The negative controls (8.8.8.8, 1.1.1.1) prove the /16 floor is working. v1.3.22's first iteration without the floor widened supernets to /6 and over-blocked both these IPs; the floor change caught it before tag.

Tests

19 country-package tests + 3 crowdsec-client tests pass:

  • 9 v1.3.21 expander tests carry over (happy path, code validation, unknown country, replace-on-conflict, revoke happy/missing, list ordering, case insensitivity).
  • New TestBanChunksLargeInput -- 2500 inputs ÷ 500 chunk_size = 5 batch calls, asserts the chunked path is on.
  • New TestBanContinuesOnChunkFailure -- middle chunk fails, failed_chunks=1, persisted row has the surviving CIDRs.
  • New TestBanUnwindsWhenAllChunksFail -- every chunk fails, origin-tag delete cleans up, no row persisted.
  • Renamed TestBanSmallInputUsesSingleBatch -- small input still produces exactly 1 batch call.
  • 7 rollup tests covering small-input passthrough, adjacent- prefix collapse, coverage invariant, BR-size simulation, v4/v6 split, empty input, malformed input.
  • New TestRollupRespectsV4Floor -- regression lock for the /6 over-blocking incident: 50 disjoint /24s in distant /8s, budget=10. The algorithm must NOT widen below /16; the test fails if it does.
  • 3 crowdsec-client tests covering the singular-origin fix, batch-emit shape, empty-input no-op.

The four-strike upstream-behaviour pattern

v1.3.22 closes the fourth case in a recurring failure mode: bugs that pass unit tests with fakes but fail against real upstream. For the working agreement going forward (memorised):

Release Bug Where unit tests with fakes fell short
v1.3.18 client_ip vs remote_ip semantics in Caddy v2.7+ tests checked JSON shape, not Caddy runtime resolution
v1.3.20 Plugin lacks scope=Country support entirely tests verified the panel emit, not the upstream plugin behavior
v1.3.22 BUG-2 LAPI filter naming asymmetry (GET origins / DELETE origin) tests stubbed the LAPI client, never validated against real LAPI's filter-name parser
v1.3.22 BUG-3 LAPI silently drops bulk inserts under SQLite WAL lock contention tests asserted on emit shape, not on persisted state

The lesson, written into memory:

Smoke must verify EFFECT (per-IP enforcement), not specs (row counts). Specs reflect assumptions about reality. Reality wins. Adjust specs to empirical findings, not the other way around.

The numerical smoke gate ≤200 entries was unreachable for BR (real allocation = ~5000 /16 supernets); the gate relaxes to "per-IP enforcement holds, no over-blocking, push completes in <30s". v1.3.22 ships with that adjusted gate.

Trade-offs

  • Most countries fit ≤200 supernets, ~sub-second push. BR / IN / IR-class fragmented allocations need ~3-5k supernets, ~20-30s push. The Settings UI surfaces this honestly via the button label and the toast.
  • /16 floor means slight over-coverage for some countries. A /16 supernet is ~65k IPs; an ISP allocation rarely fills a full /16, so some neighbouring IPs may share the supernet. Acceptable for the operator-trust model (better to slightly over-block within the same /16 than to severely over-block at /6).
  • 5009 LAPI Range decisions per banned country. Active set scales linearly with operator-issued country bans. 10 fragmented countries x 5000 = 50,000 decisions, well within LAPI / bouncer radix-tree capacity.

Files changed

Backend

  • backend/internal/crowdsec/client.go -- WriteHTTP, AddRangeDecisions, DeleteDecisionsByOrigin filter fix, ListDecisionsByScope.
  • backend/internal/crowdsec/client_test.go (new) -- 3 regression-lock tests.
  • backend/internal/security/country/expander.go -- chunked Ban with continue-on-error + RequestedCount/FailedChunks in BanResult.
  • backend/internal/security/country/rollup.go (new) -- RollupToSupernets + family-aware floor.
  • backend/internal/security/country/rollup_test.go (new) -- 7 tests including v4-floor regression lock.
  • backend/internal/security/country/expander_test.go -- updated fakeLAPI for batch shape, new chunking + continue-on-error tests.
  • backend/internal/security/country/source.go -- MMDBSource applies rollup before returning.
  • backend/internal/server/server.go -- WriteTimeout 20m.

Frontend

  • frontend/src/api/client.ts -- requested_count and failed_chunks on CountryExpansionResult.
  • frontend/src/pages/Settings.tsx -- partial-failure toast, longer button label.

Docs

  • docs/release-notes/v1.3.22.md (this file).
  • docs/operations/access-control.md -- per-country timing expectations + floor rationale.
  • docs/planning/v1.3.23-async-country-expansion.md (new) -- scope sketch for the async background-job path.
  • CHANGELOG.md, version bump.

Upgrade

cd argos-edge
git pull
docker compose build
docker compose up -d

No DB migration. No env vars. No compose surface change.

If you have v1.3.21-issued country expansions in country_ban_expansions: the rows describe entries that LAPI silently dropped. Recommend re-issuing the expansion (the panel's POST /api/security/countries/expand is idempotent on country_code -- INSERT OR REPLACE) so LAPI gets a clean post-rollup set.

# For each existing expansion, re-issue:
curl -X POST https://your-host/api/security/countries/expand \
  -b "argos_session=..." \
  -d '{"country_code":"BR","duration":"8760h","reason":"v1.3.22 re-expand"}'

Not changed

  • DB schema (migration 029 still latest).
  • API endpoint shapes (POST/GET/DELETE /api/security/countries/* unchanged).
  • v1.3.20 enable_streaming: false emit -- still in place; required for any non-IP scope to enforce.
  • v1.3.19 self-block banner, whitelist lifecycle.
  • hosts.true_detect_mode schema column from v1.3.19 stays dormant.