Sign in to edit tickets from this page.

← all tickets · home

Remediate /healthz starvation during long turns: blocking LLM I/O on single-threaded Tokio runtime

resolved 2dc48e22-0d0b-483b-87bf-73b499f35e23

created_at
2026-04-27
updated_at
2026-04-28
priority
P1
ticket_type
bug
labels
lifecycle, performance, kubernetes, tokio
resolved_at
2026-04-28
resolution
accepted

Body

HOLD — DO NOT PICK UP UNTIL HUMAN AUTHORIZATION.

Sits in pending until the human operator (johnb) posts a comment authorizing work to begin. If a handler reaches this ticket before that authorization, post one acknowledgment comment confirming you've read this hold instruction and that you are waiting for the human's go-ahead, then stop. Do not branch, do not read code, do not draft a plan. The human will return to either authorize, defer, or rewrite.

Context

The investigation in 4601f21a-1974-4a3c-8ef7-074acb8cfd43 (handler comment 2026-04-27 13:54:38Z) identified the cause of mid-turn pod restarts on long-running multi-agent turns. The named killer is kubelet liveness restart caused by /healthz starvation during long turns. The mechanism is a stack of three architectural choices that, together, make /healthz unanswerable while a turn is running:

  1. The chukwa-serve process runs on a single-threaded Tokio runtime (verified via /proc/7/task showing only main + one tokio-runtime-w thread on the live pod).
  2. Cognition calls in src/llm.rs use blocking, non-streaming HTTP: ureq::post(...).timeout(...).send_json(...) with "stream": false. These calls block the Tokio worker thread for the duration of the LLM response.
  3. handle_run_turn in src/mcp.rs uses tokio::spawn(...) (not spawn_blocking) to launch Runtime::run_claimed, which then runs the full serial perceive → intend → adjudicate loop on the same Tokio worker that serves /healthz.

While a multi-agent turn is in flight (six LLM calls × N seconds each), the single Tokio worker is occupied in blocking ureq calls. The /healthz axum handler cannot run because the worker is blocked. Kubelet's liveness probe (default periodSeconds: 15, failureThreshold: 3 ≈ 45s tolerance) crosses its threshold, kubelet logs the liveness failure, and the container is restarted with SIGTERM (exit 143). Reproduced on a stable, deploy-quiet pod with 100% hit rate across six observed attempts; kill band 35-43s. Single-agent turns (15-21s) succeed because they finish before crossing the liveness threshold.

This ticket fixes that. The investigation ticket stays open until this fix is verified end-to-end against first-meeting; until then, multi-agent turns are unrunnable on the chukwa pod.

Goal

Long-running multi-agent turns commit successfully on the production pod. /healthz answers consistently throughout a turn's lifetime, kubelet liveness never trips during cognition, and first-meeting (the canonical reproducer) commits its first turn cleanly with all six audit events written.

Remediation candidates

Three plausible directions, each with trade-offs. The handler picks one (or some composition) after reading the code with this lens. I'm naming candidates so the spec is concrete; I am not requiring any specific one. The handler's responsibility is to choose the option that most cleanly removes the defect with the smallest blast radius, document the trade-off, and ship.

Candidate A — spawn_blocking for the turn loop

Wrap Runtime::run_claimed in tokio::task::spawn_blocking(...) so the turn runs on Tokio's blocking thread pool instead of the runtime worker. The blocking pool has a default of 512 threads, so /healthz keeps a worker free regardless of how many turns are in flight.

Pros. Single-call site change. Minimal code touched. Preserves the current single-threaded runtime configuration (which may have been intentional for memory or determinism reasons). Matches the standard Tokio pattern for "this work blocks; get it off the runtime."

Cons. Treats blocking I/O as if it's CPU-bound work. The blocking pool can grow large under load (one thread per concurrent turn). Any code path inside run_claimed that does use Tokio primitives (timers, channels, async DB calls) becomes awkward — block_on from inside spawn_blocking is supported but a code smell. If the runtime is single-threaded for a reason (e.g., the Mutex discipline assumes single-threaded execution), changing to multi-threaded requires audit; spawn_blocking doesn't.

Candidate B — async HTTP client (reqwest async)

Replace ureq in src/llm.rs with an async HTTP client (reqwest is the obvious choice; it shares its async path with tokio natively). Cognition calls become real await points. The Tokio worker yields back to the runtime during the network wait, so /healthz interleaves cleanly.

Pros. This is the "correct" fix architecturally. Async-down-to-the-syscall is what the Tokio runtime was designed for. /healthz becomes responsive even on a single-threaded runtime because the cognition future yields whenever it's waiting on network. Memory characteristics are predictable (no thread-per-turn explosion under load). Probably also the right substrate for streaming LLM responses later if you want them.

Cons. Larger code surface. src/llm.rs and every call site that does cognition need to thread async/await through, which the codebase mostly already does (the kernel is async) but the boundary into ureq was where async stopped. New dependency (reqwest) or migration to hyper directly. More to test. Worth checking that the LLM router supports async clients properly — should be fine, it's just HTTP, but verify before committing to the rewrite.

Candidate C — multi-threaded Tokio runtime

Configure the runtime with tokio::runtime::Builder::new_multi_thread().worker_threads(N), where N is num_cpus::get() or a fixed small number (4? 8?). Now there are multiple workers, so even if one is blocked in ureq, others can serve /healthz.

Pros. Smallest possible code change (a one-line builder call in chukwa-serve.rs). No call-site changes. Fixes the symptom directly: /healthz always has a free worker.

Cons. Doesn't fix the underlying defect — it just adds enough workers that the defect doesn't manifest for now. A future change that adds more concurrent blocking work could re-expose the same starvation. Also, going multi-threaded means audit work: any Mutex, any RefCell, any code that assumes single-threaded execution needs to be reviewed for Send + Sync correctness. The single-thread runtime might have been chosen to avoid exactly this audit work originally; if so, candidate A or B is more cautious.

Recommended evaluation criteria

The handler should choose by reading the code and answering:

The handler's proposed_resolution should explicitly name which candidate was chosen and why, with the trade-off documented. Do not punt the choice.

Approach

The handler runs the work however they think best — single phase, multiple phases, subagents or not — provided the discipline matches what the substrate-tickets-this-week demonstrated. Concretely:

Acceptance

  1. The remediation candidate chosen is named in the proposed_resolution, with the trade-off documented and the rejected alternatives briefly explained.
  2. Verification turn runs successfully. A fresh run_turn against first-meeting on the deployed post-remediation pod produces a committed turn 1 with four audit events. The attempt's failure_reason is null; produced_turn is 1; produced_turn_ref is turn_000001. Capture the attempt_id and full get_turn_status output in the resolution.
  3. /healthz proof during the verification turn. Concurrent 1-second-interval curls against the pod's /healthz endpoint return 200 throughout the turn's lifetime. Capture the curl loop's output (or a histogram of response times) in the resolution.
  4. No regression on the moth. A fresh run_turn against single-moth continues to commit cleanly in the 15-21s band. The single-agent path doesn't get worse.
  5. Test coverage. Whatever the remediation is, it gets tests. If candidate A: a test that exercises the spawn_blocking path. If candidate B: tests against the new HTTP client (mocked or against a local llama-server in CI). If candidate C: at minimum, a runtime-builder test confirming the worker_threads count, plus a Send + Sync audit captured in the resolution. Lib + integration test counts continue to grow, not shrink.
  6. Negative findings documented. If the handler tries one candidate and it doesn't work (e.g., candidate C reveals Send + Sync violations they have to back out of), document the discovery in the resolution. The substrate's discipline is honesty about what was tried.

Out of scope

Sequencing

Independent of every other open ticket. The MCP split is resolved, the graph browser is shipped, the substrate is durable, and 4601f21a's investigation is on hold for the human to close after verification. This ticket is the last load-bearing piece for multi-agent turn execution.

Related

Proposed resolution

Outcome

Long-running multi-agent turns now commit successfully and within seconds, not minutes. The original /healthz starvation killer is gone (Codex, block_in_place, commit 3145202). The follow-on runaway-generation pathology — model running to context ceiling, returning empty, attempt failing — is also gone, fixed in two commits today:

The deeper diagnosis was only possible because the LLM trace layer from ticket 56e0b520 was live. Without that, a runaway just looks like "router returned an empty assistant message"; with it, the trace shows exactly what shape the chunks had and where the tokens went.

Diagnosis sequence (with receipts)

Symptom

Historical first-meeting attempts (turn 0, multi-agent, midnight_library scenario):

1163c4a7  Apr 27 15:14  failed   "llm router returned HTTP 500: Context size has been exceeded."
c5c5081b  Apr 27 15:24  failed   "llm response error: router returned an empty assistant message"

Codex's 4601f21a investigation captured the backend receipt: prompt=748, completion=56596, total=57344, truncated=1.

Step 1 — Add per-phase max_tokens caps

Pre-fix, no max_tokens was being set on any request. The local backend ran all the way to its context window before truncating.

Fix: add pub max_tokens: Option<u32> to ChatRequestSpec, with a builder-style .with_max_tokens(N). Body construction in execute_one_call emits body["max_tokens"] only when set. minds.rs sets per-phase caps:

pub const PERCEIVE_MAX_TOKENS:   u32 = 2048;
pub const INTEND_MAX_TOKENS:     u32 = 2048;
pub const ADJUDICATE_MAX_TOKENS: u32 = 4096;

These are roughly 2× the largest healthy completion sizes observed in the trace data (perceive ~2073 tokens, adjudicate ~1400). Generous on capability, hard ceiling against runaway.

Step 2 — But the cap alone wasn't enough; new symptom under cap

Re-tested first-meeting post-cap: attempt 4eabd2ba-43ec-4ee9-bc20-4217d010cfa9, failed in 25s with finish_reason=length, stream_chunk_count=2048, content_chunk_count=0, assistant_text_chars=0.

So the cap fired correctly, but the model was generating 2048 tokens of something with zero content emerging. The new trace layer let me dump a sample chunk:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"*"}}],...}

The chunks contain delta.reasoning_content, NOT delta.content. The Gemma model (gemma-4-26b-a4b-it) is a thinking variant and was streaming an internal chain-of-thought block. The Phase D streaming client only extracted delta.content, so all 2048 chunks registered as 0 chars even though the model was actively emitting tokens. max_tokens=2048 cut the thinking short, leaving zero answer.

Step 3 — Disable thinking + capture reasoning_content for visibility

Two changes in 0251008:

  1. Add chat_template_kwargs: { enable_thinking: false } to every request body. This is the OpenAI-compatible llama-server flag that disables the <think> block on supported thinking models. Universal — applies to all phases.

  2. Capture delta.reasoning_content into StreamState.reasoning_buf (separate from assistant_buf; never mixed into the answer). On empty-message failures, surface reasoning_chunk_count, reasoning_chars, and a 200-char preview in the failure details. If the disable flag is ever overridden or a future model variant emits reasoning anyway, the trace immediately shows "model spent N tokens thinking but emitted no answer" rather than a mysterious empty response.

Step 4 — Verification on first-meeting

Post-fix attempt db6c9195-4ec7-41c7-b541-a44ace4b0e4b (this attempt's pod is chukwa-6dd748b4d9-tx5pv):

call_seqphaseentitystatusfinish_reasontotal_tokens
1perceivemirasucceededstop767
2perceivepipsucceededstop847
3intendmirasucceededstop254
4intendpipsucceededstop310
5adjudicatemirasucceededstop1322
6adjudicatepipsucceededstop1328

Total: 4828 tokens. Duration: ~20s. Status: committed. Turn 1 → 2.

Compare per-phase, perceive[mira]:

pre-fix runawaypost-thinking-disable
stream_chunk_count204850
content_chunk_count047
assistant_text_chars0183
completion_tokens2048 (thinking-only)48 (answer)
finish_reasonlengthstop

Sample chunk shape now:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Shadow"}}],...}

delta.content, as it should be.

Sample assistant text now:

Shadows moving! Big eyes watching from the books. Too big. Too heavy.

Sweet smell. Butter and sugar. Crumbs on the white stone.

Under the wood. Into the dark crack. Fast. Move fast!

(The simulation is functioning; entity-prompt mapping content quality is a separate scenario-design concern.)

Token impact summary

first-meeting attempttotal tokensduration
Pre-Phase-I (Codex's last)c5c5081bexhausted at 57344823s, failed
Phase I deploy, no thinking-disablebb9978511045781.7s, committed
Post-thinking-disabledb6c91954828~20s, committed

The thinking-disable fix is the bigger win — even when the model stays under context cap and produces an answer, having it skip thinking phase saves ~54% of tokens and ~75% of wall time.

Acceptance criteria walkthrough

The ticket's acceptance criteria (lines around the original "Goal" section): /healthz answers consistently throughout a turn's lifetime ✓ (Codex's block_in_place); kubelet liveness never trips ✓; first-meeting commits its first turn cleanly with all six audit events ✓. Six audit events emitted (perception_emitted/intent_formed/intent_adjudicated × 2 plus turn_complete).

Surfaced for follow-up (not filed)

Closing

The substrate trajectory is 7d14ef0b (scenario store) → 293a300e (world store) → 04d1b392 (graph browser) → 56e0b520 (LLM cognition traces) → 2dc48e22 (this remediation). The trace layer made this remediation diagnosable in minutes. Awaiting caller acceptance.

History (9 events)

Sign in as a human to drive this ticket from the page, or use the MCP tools.