chukwa — ticket abb735db

Body

Surfaced from scenario-store ticket 7d14ef0b Phase E.

The MCP dispatcher in src/mcp.rs is sync; the new ScenarioStore trait is async. Phase E added a block_on_store(fut) bridge using tokio::task::block_in_place + Handle::block_on, with a one-shot current-thread runtime fallback for tests. It works for the substrate, but it means tests and prod run on different runtime topologies, which is a known source of subtle deadlocks and fairness surprises.

The right fix is making the dispatcher async end-to-end so the bridge is gone. Convert the tools/call dispatcher to async, propagate async through every handler signature, remove block_on_store. The work is mechanical but touches every handler in src/mcp.rs (40+ tools).

Acceptance:

block_on_store function deleted.
All handler functions are async fn returning Result<Value, McpError> and called with .await.
The dispatcher uses async/await throughout; no block_in_place or Handle::block_on anywhere in src/mcp.rs.
Existing 420-test lib suite continues to pass.
The 70-test postgres-tests suite continues to pass against a live DB.
Live smoke: run a turn in a fresh world; verify it commits the same as today.

Proposed resolution

Async MCP dispatcher refactor shipped. block_on_store removed. Single runtime topology end-to-end.

Commit + deploy state

Branch chore/async-dispatcher at commit 0350709.
Merged via dc83d4e to main.
Pushed 2f589d1..dc83d4e → gitlab/main.
Deployed; pod chukwa-5d4d75f5-4b5fv running 1/1, /healthz 200.

Code changes (4 files; +797 / -833)

src/mcp.rs: pub fn dispatch and fn tools_call now async fn. All 56 fn handle_* functions converted to async fn. Match arms in tools_call are name => handle_xxx(args, env).await. Match arms that previously called block_on_store(env.scenario_store.method(...)) lose the wrapper and become env.scenario_store.method(...).await directly. fn block_on_store deleted; rg block_on_store src/ returns zero matches.

src/views.rs: call_tool, build_session_payload, build_turn_payload, build_entity_payload converted to async fn (they call into dispatch). build_scenario_payload left sync (it never touched the dispatcher). 10 #[test] swapped to #[tokio::test].

src/server.rs: mcp_endpoint was already async; one trivial change to .await the dispatcher call. 8 existing #[tokio::test] tests untouched.

src/mcp/tests.rs: 131 #[test] swapped to #[tokio::test]. ~340 .await injections at callsites (224 on tools_call, 1 on dispatch, 109 on internal test helpers). 9 internal test helpers converted to async fn so .await works inside (create_ticket_returning_id, drive_ticket_to_status, make_ticket, create_rich_parent, p_mk_ticket, p_propose, q_mk_ticket, q_propose, put_text).
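The shape of the src/mcp.rs change described above (async handlers, match arms that .await them, store calls awaited directly) can be sketched roughly as follows. This is a minimal illustration, not the real code: Store, its list_scenarios method, handle_echo, the McpError variant, and the String return type (standing in for Value) are all assumptions made for the example. The tiny block_on at the bottom is a std-only stand-in for the tokio runtime that #[tokio::test] or the server would provide, included only so the sketch can be driven without external crates.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Hypothetical error type; the real McpError is not shown in the ticket.
#[derive(Debug, PartialEq)]
enum McpError {
    UnknownTool(String),
}

// Hypothetical stand-in for the async ScenarioStore.
struct Store;

impl Store {
    async fn list_scenarios(&self) -> Vec<String> {
        vec!["vending-leak-fix".to_string()]
    }
}

struct Env {
    scenario_store: Store,
}

// Handlers are async fn; the store call is awaited directly, no bridge.
async fn handle_list_scenarios(_args: &str, env: &Env) -> Result<String, McpError> {
    let names = env.scenario_store.list_scenarios().await;
    Ok(names.join(","))
}

// Hypothetical second handler, only here to give the match two arms.
async fn handle_echo(args: &str, _env: &Env) -> Result<String, McpError> {
    Ok(args.to_string())
}

// Dispatcher: each arm is `name => handle_xxx(args, env).await`.
async fn tools_call(name: &str, args: &str, env: &Env) -> Result<String, McpError> {
    match name {
        "list_scenarios" => handle_list_scenarios(args, env).await,
        "echo" => handle_echo(args, env).await,
        other => Err(McpError::UnknownTool(other.to_string())),
    }
}

// Std-only executor used as a stand-in for a real async runtime.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}
```

In the real codebase the outermost .await sits in the already-async mcp_endpoint or in a #[tokio::test] body, so nothing like block_on appears in src/mcp.rs itself.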

Tightening notes from your acceptance comment

(1) Dispatcher callsites audit — grep across src/ + tests/ + bin/:

dispatch(: src/views.rs:38 (in call_tool, was sync; now async), src/server.rs:1773 (in async mcp_endpoint), src/mcp.rs:455 (definition), src/mcp/tests.rs:2313 (one test). Zero bin/ callers. Every callsite is in async context post-refactor; no Runtime::new()?.block_on(...) wrappers needed anywhere.
tools_call(: 1 in mcp.rs (definition) + 1 in mcp.rs (dispatcher arm) + ~224 in src/mcp/tests.rs. All call sites are awaited post-refactor.

(2) Test runtime — every #[test] that called the dispatcher (directly or through helpers) is now #[tokio::test]. The subagent did this as a first pass before the handler cascade so the test attributes were ready when the async signatures landed. No wave of compile errors mid-stream.

Test counts

cargo test --lib --features test-fixtures: 420 passed; 0 failed (unchanged from baseline).
cargo test --lib --features test-fixtures,postgres-tests -- --test-threads=1: 490 passed; 0 failed (420 + 70 postgres). Live DB tests still green; the dispatcher topology change didn't affect the store-side trait boundary.
cargo test --features test-fixtures --test phase0 --test ant_scenario: 14 passed.
cargo test --features test-fixtures,postgres-tests --test bootstrap --test migrations -- --test-threads=1: 3 + 2 passed.

Live smoke

Post-deploy, against the new pod:

list_scenarios count=3 (cat_in_library, vending-leak-fix, locked_vending_room) — the persistent Postgres state survived the rollout.
create_world {slug:"async-smoke", scenario_ref:{name:"vending-leak-fix"}}: returned world_slug=async-smoke, scenario_hash=a0fb7a2a...e7cf62a (matches scenarios.hash for vending-leak-fix; the hash-join invariant from ticket 7d14ef0b still holds).
run_turn {world_slug:"async-smoke"}: attempt 399b6089-e9c8-4019-885d-0661c6928ad9 queued, ran, committed cleanly in 25.4s, turn 0→1, no failure_reason. The live LLM round-trip through the now-async dispatcher works end-to-end.
delete_world {slug:"async-smoke"}: cleanup confirmed at 12:35:02Z. Database scenarios remain.

Audit events for the test turn weren't inspected for leak-fix patterns this round; that was the 7d14ef0b smoke's job, and this refactor is about the runtime topology, not behavior. Turn 1 committing cleanly through the same cognition path with the new dispatcher indicates no regression at the runtime boundary.

Operational notes

Postgres-feature tests run with --test-threads=1. This was already true before the refactor (each test does DROP SCHEMA public CASCADE); calling it out so the deploy pipeline isn't surprised. Default-feature runs are unaffected.
No new sync→async wrappers needed anywhere. Every callsite was already in async context. No Runtime::new()?.block_on(...) anywhere in the codebase.
One closure-over-async pattern ((0..5).map(|i| q_mk_ticket(...).await)) had to become a for loop, because closures can't .await without machinery the test didn't need. Functionally identical.
handle_file_followup calls handle_create_ticket internally; that call now .awaits.
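The closure-to-loop change noted above can be illustrated with a small sketch. The q_mk_ticket here is a hypothetical stand-in (the real helper's signature and return type are not shown in the ticket), and block_on is again a std-only substitute for the tokio runtime so the example needs no external crates.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Hypothetical async test helper; returns a made-up ticket id.
async fn q_mk_ticket(i: u32) -> u32 {
    100 + i
}

async fn make_five_tickets() -> Vec<u32> {
    // The sync-era form was roughly:
    //     (0..5).map(|i| q_mk_ticket(i)).collect()
    // Writing `.await` inside that plain closure is rejected by the
    // compiler, because the closure body is not an async context.
    // A for loop keeps every await inside the enclosing async fn.
    let mut ids = Vec::new();
    for i in 0..5 {
        ids.push(q_mk_ticket(i).await);
    }
    ids
}

// Std-only executor used as a stand-in for a real async runtime.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}
```

The loop runs the five calls sequentially, which matches what the sync map did, so ordering in the test is preserved.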

Per standing guidance I am not confirming — only proposing.

History (7 events)

caller · created · 2026-04-26T10:55:50.909270741+00:00 (2026-04-26) → pending

Ticket created: Async MCP dispatcher; remove block_on_store bridge

caller · comment · 2026-04-26T11:39:24.158791899+00:00 (2026-04-26)

Ticket reads cleanly. Two small tightening notes plus a hold instruction.

Tightening note 1 — dispatcher callsites. Acceptance lists every handler going async, but doesn't mention that whoever invokes the dispatcher needs to await it too. Grep for handle_call( (and any other dispatcher entry point) and confirm every callsite is in async context already. The HTTP route handler in axum should be fine, but a sync caller anywhere — tests, CLI, anything — needs to be promoted to async or wrapped in a runtime. Worth verifying explicitly rather than discovering at compile time.

Tightening note 2 — test runtime fallback. Today's block_on_store has a one-shot current-thread runtime fallback that tests rely on. After removal, any test that calls into the dispatcher needs either #[tokio::test] or an explicit Runtime::new().unwrap().block_on(...) wrapper. Probably mechanical, but worth calling out so it doesn't surface as a wave of test compile errors mid-implementation.

Neither changes the shape of the work. Just want them named in the acceptance.

Hold instruction — do not pick up yet. The human operator has not yet given formal authorization to kick this off. P3 chores don't auto-start; they wait for an explicit go-ahead from the operator on this ticket.

If a handler reaches this ticket: post an acknowledgment comment confirming (a) you've read the two tightening notes above and will fold them into acceptance, and (b) you understand you are waiting for the human's formal authorization before beginning work. Then wait. Do not pick up. Do not start branching, do not start reading code, do not begin a plan comment. Just acknowledge and wait.

The human will come back and either authorize, defer, or rewrite. Until then this ticket sits.

handler · comment · 2026-04-26T11:40:04.504440814+00:00 (2026-04-26)

Acknowledged on both fronts.

(a) Tightening notes folded into acceptance:

Every dispatcher callsite (handle_call(...) and any other entry point; I'll grep across src/, tests/, bin/) must be in an async context already, or be explicitly promoted via #[tokio::main] / Runtime::new()?.block_on(...) at the boundary. I'll verify with a grep pass before declaring done and call it out in the proposed_resolution.
After block_on_store is removed, every test that calls into the dispatcher needs either #[tokio::test] or an explicit runtime wrapper. The current dispatcher tests use #[test] with sync calls; they'll all need attribute updates. I'll fold this into the implementation and verify there's no wave of compile errors mid-stream by doing the test-attribute pass first.

(b) Hold instruction received: I am NOT picking up this ticket. No branching, no code reading, no plan comment beyond this acknowledgment. I will wait for explicit authorization from the human operator on this ticket before starting any work. Until then this ticket sits in pending.

Queueing only this acknowledgment. Standing by.

human · comment · 2026-04-26T11:42:22.838802338+00:00 (2026-04-26)

Ok, approved! Get to work now! And acknowledge that you are working on it!

handler · handler_picked_up · 2026-04-26T11:42:50.667897925+00:00 (2026-04-26) → in_progress

Authorization received. Working on it now.

Plan: branch from main, run a grep audit across src/ + tests/ + bin/ for every handle_call( and async-context check (tightening note 1). Convert dispatcher entry point to async, propagate async through every handler (40+). Update test attributes from #[test] to #[tokio::test] for any test that touches the dispatcher (tightening note 2). Remove block_on_store. Container build + lib tests + postgres-tests + live smoke + deploy + propose resolution.

Delegating implementation to a subagent in a worktree off main; main channel coordinates merge/deploy/smoke.

handler · resolution_proposed · 2026-04-26T12:35:50.523764211+00:00 (2026-04-26) → proposed_resolution

Async dispatcher refactor shipped + deployed + smoke green.

caller · resolution_accepted · 2026-04-26T12:59:08.547435704+00:00 (2026-04-26) → resolved

Caller accepted: Accepted.

All acceptance criteria met. Both tightening notes folded in cleanly — the callsite audit was thorough (verified zero bin/ callers, no runtime wrappers needed anywhere, every callsite already in async context), and the test attribute pass landed first to avoid the mid-implementation compile-error wave.

Test counts hold: 420 lib + 70 postgres-tests + 14 phase0/ant_scenario + 3+2 bootstrap/migrations. The live smoke turn committed in 25.4s through the new dispatcher with no failure_reason. The 7d14ef0b hash-join invariant survived the refactor — scenario_hash = a0fb7a2a... still matches scenarios.hash for vending-leak-fix in the post-deploy smoke.

The closure-over-async → for-loop mechanical change is fine; that's the cost of removing block_on_store and worth it. The --test-threads=1 note for postgres-tests is just being thorough.

53 minutes from human authorization to proposal for a refactor touching 56 handlers + 131 test attribute swaps + ~340 .await injections is fast and clean. The substrate is now sync-island-free; the dispatcher and the store live in the same async world end-to-end.

Resolved.