v1.3.31 -- Async country expansion (queued from v1.3.22)

The v1.3.22 synchronous chunked-batch path was functionally correct but blocked the operator's HTTP request for ~30s on fragmented countries (BR's ~5009 ranges in 11 chunks). The panel's WriteTimeout had been bumped to 20 minutes specifically to keep that handler under the ceiling, and the UI's submit button just spun for the full window with no progress feedback.

v1.3.31 swaps the handler for an async submit + poll flow. The operator sees an immediate 202 + a live progress bar; the panel's worker goroutine drives the LAPI chunked POST in the background and writes per-chunk progress into a new country_expansion_jobs table. A panel restart mid-flight is recoverable: boot-time recovery transitions stale running jobs to failed with error_message='panel restarted' so the operator can re-submit cleanly.

Pattern: async-job

This release establishes the reusable shape for any future long-running operation (audit retention sweeps, scenario re-installs, etc.):

  1. Progress-shadow table with state enum pending|running|completed|failed, progress counters, error_message, timestamps.
  2. Single-worker mutex in the runner so concurrent submits queue (state=pending) instead of contending. Avoids the v1.3.22 LAPI WAL contention finding -- multiple parallel bulk POSTs would re-trigger it.
  3. Goroutine bound to the panel's main-context so the work outlives the request that triggered it. Cleanly exits when the panel SIGTERM cancels the context.
  4. Boot-time recovery transitions pending/running rows from a prior panel instance to failed with the standard panel restarted message. Idempotent.
  5. Polling endpoint with a state-driven contract: the client polls every 1s until state is in {completed, failed}. No WebSocket / SSE infra needed at homelab scale.

Documented in project_async_job_pattern.md for the next release that needs it.

What ships

Migration 032: country_expansion_jobs

CREATE TABLE country_expansion_jobs (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    country_code    TEXT NOT NULL,
    state           TEXT NOT NULL CHECK (state IN
                       ('pending', 'running', 'completed', 'failed')),
    chunks_total    INTEGER NOT NULL DEFAULT 0,
    chunks_done     INTEGER NOT NULL DEFAULT 0,
    chunks_failed   INTEGER NOT NULL DEFAULT 0,
    cidr_committed  INTEGER NOT NULL DEFAULT 0,
    requested_count INTEGER NOT NULL DEFAULT 0,
    duration        TEXT NOT NULL DEFAULT '',
    reason          TEXT NOT NULL DEFAULT '',
    error_message   TEXT NOT NULL DEFAULT '',
    created_at      TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    started_at      TIMESTAMP,
    completed_at    TIMESTAMP,
    created_by      TEXT NOT NULL DEFAULT ''
);
CREATE INDEX idx_country_expansion_jobs_country
    ON country_expansion_jobs(country_code, created_at DESC);
CREATE INDEX idx_country_expansion_jobs_state
    ON country_expansion_jobs(state);

Backend: country.JobRunner

backend/internal/security/country/jobs.go. Public surface:

  • NewJobRunner(ctx, db, expander, logger) -- ctx is the panel's main-context; binds the long-lived appCtx for the background goroutines.
  • Submit(reqCtx, cc, duration, reason, actor) -> (jobID, error): inserts a pending row, spawns a goroutine that waits on the global mutex and then runs Expander.BanWithProgress with a callback that updates chunks_done / cidr_committed / chunks_failed live.
  • Get(ctx, id) -> *Job: ErrJobNotFound on missing row.
  • ListByCountry(ctx, cc, limit) -> []*Job: most-recent-first. Empty cc returns the cross-country recent list.
  • RecoverOnBoot(ctx) -> error: transitions pending|running rows to failed with error_message='panel restarted'.

Backend: Expander.BanWithProgress

Refactored from the v1.3.22 chunk loop. Same logic; the only change is a ProgressFn callback fired after each chunk:

type ProgressFn func(chunkIdx, totalChunks, cidrCommitted, chunksFailed int)

The synchronous Ban() becomes a thin wrapper that calls BanWithProgress with no callback, so existing tests and callers are unaffected.

API endpoints

  • POST /api/security/countries/{cc}/expand (replaces the v1.3.21 body-based shape). Body: {"duration":"4h","reason":"..."}. Returns 202 + the new job row.
  • GET /api/security/jobs/{id} -> the job row.
  • GET /api/security/jobs?country=XX&limit=N -> recent jobs. Empty country returns cross-country recent.

Frontend: CountryBansSection

The Settings -> Country bans section now:

  1. POST -> receives the pending job row
  2. polls GET /jobs/{id} every 1s
  3. renders a progress bar driven by chunks_done / chunks_total + the running cidr_committed count
  4. toasts success or error on terminal state (or "still running after 10 min" if the polling cap fires)

Smoke

scripts/smoke/country-expansion-async.sh -- 8-phase EFFECT smoke:

[1/8] POST /api/security/countries/BR/expand        -> 202 + job_id
[2/8] poll until terminal (timeout 120s)
[3/8] assert state=completed + chunks_done=chunks_total
[4/8] cscli decisions list --origin argos-country-BR > 4000
[5/8] stop crowdsec to simulate LAPI down
[6/8] POST .../TR/expand                            -> 202 + job_id
[7/8] poll -> assert state=failed + error_message populated
[8/8] start crowdsec back; wait for healthy

Live result on prod stack:

  • BR: 11/11 chunks, 5009 ranges committed in <60s
  • TR (LAPI down): state=failed, error_message='all 5 chunks failed: Post "http://crowdsec:8081/v1/alerts": dial tcp: lookup crowdsec on 127.0.0.11:53: no such host'
  • crowdsec returns to healthy within 30s of compose start

Mid-impl gotchas

  • modernc/sqlite + :memory: connection pool. Each pool connection sees a private :memory: DB, so the goroutine's UPDATE landed on a different DB than Submit's INSERT in tests. Fix: db.SetMaxOpenConns(1) in the test helper. Harmless in production where the panel uses a real file-backed DB. Documented in jobs_test.go.
  • cscli decisions list paginates at 100 by default. The smoke initially counted 100 instead of 5009. Pass --limit 0 to remove the cap.

Files changed

  • backend/migrations/032_country_expansion_jobs.{up,down}.sql (new)
  • backend/internal/security/country/jobs.go (new) + jobs_test.go (new, 8 tests)
  • backend/internal/security/country/expander.go (refactor: Ban -> thin wrapper; BanWithProgress is the new path)
  • backend/internal/api/security_country.go (new path-based handler + GetCountryJob + ListCountryJobs)
  • backend/internal/api/handlers.go (CountryJobs field)
  • backend/internal/server/server.go (route swap + new jobs endpoints)
  • backend/cmd/argos/main.go (instantiate JobRunner + RecoverOnBoot at startup)
  • backend/internal/db/migrate_test.go (rollback chain extended for 032)
  • frontend/src/api/client.ts (CountryExpansionJob type + job endpoints)
  • frontend/src/pages/Settings.tsx (CountryBansSection async polling + progress bar)
  • scripts/smoke/country-expansion-async.sh (new, 8 phases)
  • docs/release-notes/v1.3.31.md (this file)
  • CHANGELOG.md, mkdocs.yml

Upgrade

cd ~/argos-edge
git pull
make sync-prod                  # picks up the new migration
                                # + setup-appsec.sh (none in
                                # v1.3.31) + smoke script
docker compose -f /path/to/argos-prod/docker-compose.yml \
    restart crowdsec            # bind-mount inode refresh
                                # discipline (v1.3.29 lesson;
                                # no setup-appsec.sh changes
                                # in v1.3.31, but reset is
                                # cheap)

Then rebuild + redeploy the panel:

cd ~/argos-prod
docker build -f backend/Dockerfile -t argos-prod-argos:v1.3.31 .
# update docker-compose.override.yml: image: argos-prod-argos:v1.3.31
docker compose up -d --force-recreate --no-deps argos

The migration runs at panel startup. RecoverOnBoot transitions any pending/running rows from the (just-killed) prior panel instance to failed; in a fresh upgrade where the table didn't exist before, it's a no-op.

Not changed

  • All v1.3.30 backend / frontend code unchanged outside the files listed above.
  • LAPI WAL mode (v1.3.28) untouched.
  • Drift detector (v1.3.27), true_detect_mode (v1.3.29), scenario descriptions (v1.3.30) all unchanged.
  • The synchronous country.Ban function still exists for any external caller; the API just no longer routes through it.

Future-proof: what's reusable

The async-job pattern this release establishes is intentionally generic. Future long-running operations need:

  1. A new table <thing>_jobs with the same state enum.
  2. A new <Thing>JobRunner with the same Submit / Get / ListByCountry / RecoverOnBoot API.
  3. A new POST endpoint that returns 202 + the job row.
  4. The same GET /api/security/jobs/{id} polling endpoint serves them all (state-machine is identical).

The frontend polling helper can also be lifted into a shared useJobPolling(jobId) hook when a second job type lands.