# v1.3.31 -- Async country expansion (queued from v1.3.22)
The v1.3.22 synchronous chunked-batch path was functionally correct but blocked the operator's HTTP request for ~30s on fragmented countries (BR's ~5009 ranges in 11 chunks). The panel's `WriteTimeout` had to be bumped to 20 minutes just to keep the handler under the ceiling, and the UI's submit button spun for the full window with no progress feedback.
v1.3.31 swaps the handler for an async submit + poll flow. The operator sees an immediate 202 plus a live progress bar; the panel's worker goroutine drives the chunked LAPI POST in the background and writes per-chunk progress into a new `country_expansion_jobs` table. A panel restart mid-flight is recoverable: boot-time recovery transitions stale `running` jobs to `failed` with `error_message='panel restarted'` so the operator can re-submit cleanly.
## Pattern: async-job
This release establishes the reusable shape for any future long-running operation (audit retention sweeps, scenario re-installs, etc.):
- Progress-shadow table with a state enum (`pending|running|completed|failed`), progress counters, `error_message`, and timestamps.
- Single-worker mutex in the runner so concurrent submits queue (`state=pending`) instead of contending. Avoids the v1.3.22 LAPI WAL contention finding -- multiple parallel bulk POSTs would re-trigger it.
- Goroutine bound to the panel's main context so the work outlives the request that triggered it. Cleanly exits when the panel's SIGTERM cancels the context.
- Boot-time recovery transitions `pending`/`running` rows from a prior panel instance to `failed` with the standard `panel restarted` message. Idempotent.
- Polling endpoint with an mtime/state-driven contract: the client polls every 1s until `state` is in `{completed, failed}`. No WebSocket/SSE infra needed at homelab scale.
Documented in `project_async_job_pattern.md` for the next release that needs it.
## What ships

### Migration 032: `country_expansion_jobs`
```sql
CREATE TABLE country_expansion_jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    country_code TEXT NOT NULL,
    state TEXT NOT NULL CHECK (state IN ('pending', 'running', 'completed', 'failed')),
    chunks_total INTEGER NOT NULL DEFAULT 0,
    chunks_done INTEGER NOT NULL DEFAULT 0,
    chunks_failed INTEGER NOT NULL DEFAULT 0,
    cidr_committed INTEGER NOT NULL DEFAULT 0,
    requested_count INTEGER NOT NULL DEFAULT 0,
    duration TEXT NOT NULL DEFAULT '',
    reason TEXT NOT NULL DEFAULT '',
    error_message TEXT NOT NULL DEFAULT '',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    created_by TEXT NOT NULL DEFAULT ''
);

CREATE INDEX idx_country_expansion_jobs_country
    ON country_expansion_jobs(country_code, created_at DESC);
CREATE INDEX idx_country_expansion_jobs_state
    ON country_expansion_jobs(state);
```
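For concreteness, the boot-recovery and per-chunk progress writes against this table might look like the following. This is illustrative SQL -- the shipped statements live in `jobs.go` and may differ in detail.

```sql
-- RecoverOnBoot: fail anything a prior panel instance left in flight (idempotent).
UPDATE country_expansion_jobs
   SET state = 'failed',
       error_message = 'panel restarted',
       completed_at = CURRENT_TIMESTAMP
 WHERE state IN ('pending', 'running');

-- Per-chunk progress write from the worker's progress callback.
UPDATE country_expansion_jobs
   SET chunks_done = ?, chunks_failed = ?, cidr_committed = ?
 WHERE id = ?;
```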
### Backend: `country.JobRunner`
`backend/internal/security/country/jobs.go`. Public surface:

- `NewJobRunner(ctx, db, expander, logger)` -- `ctx` is the panel's main context; binds the long-lived `appCtx` for the background goroutines.
- `Submit(reqCtx, cc, duration, reason, actor) -> (jobID, error)`: inserts a `pending` row, spawns a goroutine that waits on the global mutex and then runs `Expander.BanWithProgress` with a callback that updates `chunks_done`/`cidr_committed`/`chunks_failed` live.
- `Get(ctx, id) -> *Job`: `ErrJobNotFound` on missing row.
- `ListByCountry(ctx, cc, limit) -> []*Job`: most-recent-first. Empty `cc` returns the cross-country recent list.
- `RecoverOnBoot(ctx) -> error`: transitions `pending|running` rows to `failed` with `error_message='panel restarted'`.
### Backend: `Expander.BanWithProgress`
Refactored from the v1.3.22 chunk loop. Same logic; the only change is a ProgressFn callback fired after each chunk:
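A plausible shape of that callback plumbing is below. The `Progress` struct, its fields, and the `[][]string` chunk stand-in are assumptions; the shipped loop posts real CIDR chunks to LAPI.

```go
// Sketch of the ProgressFn refactor: same chunk loop, with an optional
// callback fired after every chunk so the job runner can persist progress.
package main

import "fmt"

type Progress struct {
	ChunksDone, ChunksTotal, ChunksFailed, CIDRCommitted int
}

type ProgressFn func(Progress)

// BanWithProgress runs the chunk loop; if fn != nil it reports after each chunk.
func BanWithProgress(chunks [][]string, fn ProgressFn) error {
	p := Progress{ChunksTotal: len(chunks)}
	for _, c := range chunks {
		// ... POST the chunk of decisions to LAPI here ...
		p.ChunksDone++
		p.CIDRCommitted += len(c)
		if fn != nil {
			fn(p)
		}
	}
	return nil
}

// Ban stays as the synchronous entry point: same loop, no callback.
func Ban(chunks [][]string) error { return BanWithProgress(chunks, nil) }

func main() {
	chunks := [][]string{{"1.2.3.0/24", "5.6.7.0/24"}, {"9.9.9.0/24"}}
	_ = BanWithProgress(chunks, func(p Progress) {
		fmt.Printf("%d/%d chunks, %d CIDRs\n", p.ChunksDone, p.ChunksTotal, p.CIDRCommitted)
	})
}
```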
The synchronous `Ban()` becomes a thin wrapper that calls `BanWithProgress` with no callback, so existing tests and callers are unaffected.
### API endpoints
- `POST /api/security/countries/{cc}/expand` (replaces the v1.3.21 body-based shape). Body: `{"duration":"4h","reason":"..."}`. Returns 202 + the new job row.
- `GET /api/security/jobs/{id}` -> the job row.
- `GET /api/security/jobs?country=XX&limit=N` -> recent jobs. Empty `country` returns the cross-country recent list.
### Frontend: `CountryBansSection`
The Settings -> Country bans section now:
- POSTs and receives the `pending` job row
- polls `GET /jobs/{id}` every 1s
- renders a progress bar driven by `chunks_done / chunks_total` plus the running `cidr_committed` count
- toasts success or error on terminal state (or "still running after 10 min" if the polling cap fires)
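The polling loop can be sketched in TypeScript with the fetcher injected, which also keeps it testable. Names like `pollJob` and `getJob` are assumptions, not the shipped `client.ts` API.

```typescript
// Sketch of the 1s-poll-until-terminal contract with a 10-minute cap.
type JobState = "pending" | "running" | "completed" | "failed";

interface Job {
  id: number;
  state: JobState;
  chunks_done: number;
  chunks_total: number;
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Polls every intervalMs until the job reaches a terminal state,
// throwing once maxMs elapses (the "still running after 10 min" toast).
async function pollJob(
  getJob: (id: number) => Promise<Job>,
  id: number,
  intervalMs = 1000,
  maxMs = 10 * 60 * 1000,
): Promise<Job> {
  const deadline = Date.now() + maxMs;
  for (;;) {
    const job = await getJob(id);
    if (job.state === "completed" || job.state === "failed") return job;
    if (Date.now() >= deadline) throw new Error("still running after polling cap");
    await sleep(intervalMs);
  }
}
```

Injecting `getJob` is what lets the component and a future shared hook reuse the same loop.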
### Smoke
`scripts/smoke/country-expansion-async.sh` -- an 8-phase EFFECT smoke:
[1/8] POST /api/security/countries/BR/expand -> 202 + job_id
[2/8] poll until terminal (timeout 120s)
[3/8] assert state=completed + chunks_done=chunks_total
[4/8] cscli decisions list --origin argos-country-BR > 4000
[5/8] stop crowdsec to simulate LAPI down
[6/8] POST .../TR/expand -> 202 + job_id
[7/8] poll -> assert state=failed + error_message populated
[8/8] start crowdsec back; wait for healthy
Live result on prod stack:

- BR: 11/11 chunks, 5009 ranges committed in <60s
- TR (LAPI down): `state=failed`, `error_message='all 5 chunks failed: Post "http://crowdsec:8081/v1/alerts": dial tcp: lookup crowdsec on 127.0.0.11:53: no such host'`
- crowdsec returns to healthy within 30s of compose start
## Mid-impl gotchas
- modernc/sqlite + `:memory:` connection pool. Each pool connection sees a private `:memory:` DB, so the goroutine's UPDATE landed on a different DB than Submit's INSERT in tests. Fix: `db.SetMaxOpenConns(1)` in the test helper. Harmless in production, where the panel uses a real file-backed DB. Documented in `jobs_test.go`.
- `cscli decisions list` paginates at 100 by default. The smoke initially counted 100 instead of 5009. Pass `--limit 0` to remove the cap.
## Files changed
- `backend/migrations/032_country_expansion_jobs.{up,down}.sql` (new)
- `backend/internal/security/country/jobs.go` (new) + `jobs_test.go` (new, 8 tests)
- `backend/internal/security/country/expander.go` (refactor: `Ban` -> thin wrapper; `BanWithProgress` is the new path)
- `backend/internal/api/security_country.go` (new path-based handler + `GetCountryJob` + `ListCountryJobs`)
- `backend/internal/api/handlers.go` (`CountryJobs` field)
- `backend/internal/server/server.go` (route swap + new jobs endpoints)
- `backend/cmd/argos/main.go` (instantiate JobRunner + `RecoverOnBoot` at startup)
- `backend/internal/db/migrate_test.go` (rollback chain extended for 032)
- `frontend/src/api/client.ts` (`CountryExpansionJob` type + job endpoints)
- `frontend/src/pages/Settings.tsx` (CountryBansSection async polling + progress bar)
- `scripts/smoke/country-expansion-async.sh` (new, 8 phases)
- `docs/release-notes/v1.3.31.md` (this file)
- `CHANGELOG.md`, `mkdocs.yml`
## Upgrade
```sh
cd ~/argos-edge
git pull
make sync-prod        # picks up the new migration + setup-appsec.sh
                      # (none in v1.3.31) + smoke script
docker compose -f /path/to/argos-prod/docker-compose.yml \
    restart crowdsec  # bind-mount inode refresh discipline (v1.3.29 lesson;
                      # no setup-appsec.sh changes in v1.3.31, but the reset
                      # is cheap)
```
Then rebuild + redeploy the panel:
```sh
cd ~/argos-prod
docker build -f backend/Dockerfile -t argos-prod-argos:v1.3.31 .
# update docker-compose.override.yml: image: argos-prod-argos:v1.3.31
docker compose up -d --force-recreate --no-deps argos
```
The migration runs at panel startup. `RecoverOnBoot` transitions any pending/running rows from the (just-killed) prior panel instance to `failed`; on a fresh upgrade where the table didn't exist before, it's a no-op.
## Not changed
- All v1.3.30 backend / frontend code unchanged outside the files listed above.
- LAPI WAL mode (v1.3.28) untouched.
- Drift detector (v1.3.27), true_detect_mode (v1.3.29), scenario descriptions (v1.3.30) all unchanged.
- The synchronous `country.Ban` function still exists for any external caller; the API just no longer routes through it.
## Future-proof: what's reusable
The async-job pattern this release establishes is intentionally generic. Future long-running operations need:
- A new table `<thing>_jobs` with the same state enum.
- A new `<Thing>JobRunner` with the same Submit / Get / ListByCountry / RecoverOnBoot API.
- A new POST endpoint that returns 202 + the job row.
- The same `GET /api/security/jobs/{id}` polling endpoint serves them all (the state machine is identical).
The frontend polling helper can also be lifted into a shared `useJobPolling(jobId)` hook when a second job type lands.