resolved 2dc48e22-0d0b-483b-87bf-73b499f35e23
lifecycle, performance, kubernetes, tokioHOLD — DO NOT PICK UP UNTIL HUMAN AUTHORIZATION.
Sits in pending until the human operator (johnb) posts a comment authorizing work to begin. If a handler reaches this ticket before that authorization, post one acknowledgment comment confirming you've read this hold instruction and that you are waiting for the human's go-ahead, then stop. Do not branch, do not read code, do not draft a plan. The human will return to either authorize, defer, or rewrite.
The investigation in 4601f21a-1974-4a3c-8ef7-074acb8cfd43 (handler comment 2026-04-27 13:54:38Z) identified the cause of mid-turn pod restarts on long-running multi-agent turns. The named killer is kubelet liveness restart caused by /healthz starvation during long turns. The mechanism is a stack of three architectural choices that, together, make /healthz unanswerable while a turn is running:
/proc/7/task showing only main + one tokio-runtime-w thread on the live pod).src/llm.rs use blocking, non-streaming HTTP: ureq::post(...).timeout(...).send_json(...) with "stream": false. These calls block the Tokio worker thread for the duration of the LLM response.handle_run_turn in src/mcp.rs uses tokio::spawn(...) (not spawn_blocking) to launch Runtime::run_claimed, which then runs the full serial perceive → intend → adjudicate loop on the same Tokio worker that serves /healthz.While a multi-agent turn is in flight (six LLM calls × N seconds each), the single Tokio worker is occupied in blocking ureq calls. The /healthz axum handler cannot run because the worker is blocked. Kubelet's liveness probe (default periodSeconds: 15, failureThreshold: 3 ≈ 45s tolerance) crosses its threshold, kubelet logs the liveness failure, and the container is restarted with SIGTERM (exit 143). Reproduced on a stable, deploy-quiet pod with 100% hit rate across six observed attempts; kill band 35-43s. Single-agent turns (15-21s) succeed because they finish before crossing the liveness threshold.
This ticket fixes that. The investigation ticket stays open until this fix is verified end-to-end against first-meeting; until then, multi-agent turns are unrunnable on the chukwa pod.
Long-running multi-agent turns commit successfully on the production pod. /healthz answers consistently throughout a turn's lifetime, kubelet liveness never trips during cognition, and first-meeting (the canonical reproducer) commits its first turn cleanly with all six audit events written.
Three plausible directions, each with trade-offs. The handler picks one (or some composition) after reading the code with this lens. I'm naming candidates so the spec is concrete; I am not requiring any specific one. The handler's responsibility is to choose the option that most cleanly removes the defect with the smallest blast radius, document the trade-off, and ship.
spawn_blocking for the turn loopWrap Runtime::run_claimed in tokio::task::spawn_blocking(...) so the turn runs on Tokio's blocking thread pool instead of the runtime worker. The blocking pool has a default of 512 threads, so /healthz keeps a worker free regardless of how many turns are in flight.
Pros. Single-call site change. Minimal code touched. Preserves the current single-threaded runtime configuration (which may have been intentional for memory or determinism reasons). Matches the standard Tokio pattern for "this work blocks; get it off the runtime."
Cons. Treats blocking I/O as if it's CPU-bound work. The blocking pool can grow large under load (one thread per concurrent turn). Any code path inside run_claimed that does use Tokio primitives (timers, channels, async DB calls) becomes awkward — block_on from inside spawn_blocking is supported but a code smell. If the runtime is single-threaded for a reason (e.g., the Mutex discipline assumes single-threaded execution), changing to multi-threaded requires audit; spawn_blocking doesn't.
reqwest async)Replace ureq in src/llm.rs with an async HTTP client (reqwest is the obvious choice; it shares its async path with tokio natively). Cognition calls become real await points. The Tokio worker yields back to the runtime during the network wait, so /healthz interleaves cleanly.
Pros. This is the "correct" fix architecturally. Async-down-to-the-syscall is what the Tokio runtime was designed for. /healthz becomes responsive even on a single-threaded runtime because the cognition future yields whenever it's waiting on network. Memory characteristics are predictable (no thread-per-turn explosion under load). Probably also the right substrate for streaming LLM responses later if you want them.
Cons. Larger code surface. src/llm.rs and every call site that does cognition need to thread async/await through, which the codebase mostly already does (the kernel is async) but the boundary into ureq was where async stopped. New dependency (reqwest) or migration to hyper directly. More to test. Worth checking that the LLM router supports async clients properly — should be fine, it's just HTTP, but verify before committing to the rewrite.
Configure the runtime with tokio::runtime::Builder::new_multi_thread().worker_threads(N), where N is num_cpus::get() or a fixed small number (4? 8?). Now there are multiple workers, so even if one is blocked in ureq, others can serve /healthz.
Pros. Smallest possible code change (a one-line builder call in chukwa-serve.rs). No call-site changes. Fixes the symptom directly: /healthz always has a free worker.
Cons. Doesn't fix the underlying defect — it just adds enough workers that the defect doesn't manifest for now. A future change that adds more concurrent blocking work could re-expose the same starvation. Also, going multi-threaded means audit work: any Mutex, any RefCell, any code that assumes single-threaded execution needs to be reviewed for Send + Sync correctness. The single-thread runtime might have been chosen to avoid exactly this audit work originally; if so, candidate A or B is more cautious.
The handler should choose by reading the code and answering:
Rcs, non-Send futures, mutex discipline that assumes single-threaded execution, or anything else that would break under multi-threaded scheduling, that strongly suggests candidate A or B over C.ureq? Check src/scenario_store/postgres.rs, src/world_store/postgres.rs, the audit event writes, anywhere else cognition might wait on I/O. If sqlx is used (it is), those are already async; that's fine. If anything else is blocking, candidate B catches them all; candidate A treats them as more blocking work to offload.src/llm.rs rewrite plus async-threading callsites.The handler's proposed_resolution should explicitly name which candidate was chosen and why, with the trade-off documented. Do not punt the choice.
The handler runs the work however they think best — single phase, multiple phases, subagents or not — provided the discipline matches what the substrate-tickets-this-week demonstrated. Concretely:
DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/postgres. Never the cluster. The data-loss postmortem on 293a300e is the standing rule.first-meeting is preserved at turn 0 with six interrupted attempts on history and is the verification target. After remediation lands and is deployed, run run_turn against first-meeting. The acceptance bar is a successful commit producing turn 1 with four audit events (perception_emitted ×2 agents, intent_formed ×2 agents, intent_adjudicated, turn_complete) and worker_id: chukwa-mcp recorded./healthz proof. During the verification turn, curl /healthz directly against the pod IP at 1-second intervals across the turn's full duration. Every probe must return 200 within 1 second. This is the receipts-shaped proof that the remediation actually fixed /healthz starvation.4601f21a carries the diagnostic value and stays open until verification confirms this fix works. Reference it in the proposed_resolution; do not close it from this ticket's lifecycle.proposed_resolution, with the trade-off documented and the rejected alternatives briefly explained.run_turn against first-meeting on the deployed post-remediation pod produces a committed turn 1 with four audit events. The attempt's failure_reason is null; produced_turn is 1; produced_turn_ref is turn_000001. Capture the attempt_id and full get_turn_status output in the resolution./healthz proof during the verification turn. Concurrent 1-second-interval curls against the pod's /healthz endpoint return 200 throughout the turn's lifetime. Capture the curl loop's output (or a histogram of response times) in the resolution.run_turn against single-moth continues to commit cleanly in the 15-21s band. The single-agent path doesn't get worse.Send + Sync audit captured in the resolution. Lib + integration test counts continue to grow, not shrink.Send + Sync violations they have to back out of), document the discovery in the resolution. The substrate's discipline is honesty about what was tried."stream": false shape stays for now. Adding streaming is a separate (larger) change about user-experience improvements; the present ticket is purely about removing the starvation defect.periodSeconds / failureThreshold / timeoutSeconds on the kubelet probe to give more grace might mask the symptom but doesn't fix the underlying defect. Leave the probe alone.first-meeting. They stay as historical artifacts; the world stays at turn 0. The first successful turn against this world is what verifies the fix.4601f21a. That's the human's call after verification; this ticket should reference it but not modify its lifecycle.Independent of every other open ticket. The MCP split is resolved, the graph browser is shipped, the substrate is durable, and 4601f21a's investigation is on hold for the human to close after verification. This ticket is the last load-bearing piece for multi-agent turn execution.
4601f21a-1974-4a3c-8ef7-074acb8cfd43 — investigation that named the cause. The handler's 2026-04-27 13:54:38Z comment is the source of truth for the mechanism, the receipts, and the failure shape this ticket fixes.293a300e-abf3-4f7c-85a4-f7129b742769 — world-store ticket; the data-loss postmortem on this ticket is the reference template for forensic discipline expected here.abb735db-… — the async-dispatcher / block_on_store removal earlier this week. Some of that work is the reason cognition is async-spawned today; this remediation is the other half of that story.Long-running multi-agent turns now commit successfully and within seconds, not minutes. The original /healthz starvation killer is gone (Codex, block_in_place, commit 3145202). The follow-on runaway-generation pathology — model running to context ceiling, returning empty, attempt failing — is also gone, fixed in two commits today:
b441745 (in merge 2b6ade2) — per-phase max_tokens caps in ChatRequestSpec (perceive 2048, intend 2048, adjudicate 4096)0251008 — chat_template_kwargs.enable_thinking = false on every request body, plus chunk-parser capture of delta.reasoning_content into a separate buffer for trace observabilityThe deeper diagnosis was only possible because the LLM trace layer from ticket 56e0b520 was live. Without that, a runaway just looks like "router returned an empty assistant message"; with it, the trace shows exactly what shape the chunks had and where the tokens went.
Historical first-meeting attempts (turn 0, multi-agent, midnight_library scenario):
1163c4a7 Apr 27 15:14 failed "llm router returned HTTP 500: Context size has been exceeded."
c5c5081b Apr 27 15:24 failed "llm response error: router returned an empty assistant message"
Codex's 4601f21a investigation captured the backend receipt: prompt=748, completion=56596, total=57344, truncated=1.
max_tokens capsPre-fix, no max_tokens was being set on any request. The local backend ran all the way to its context window before truncating.
Fix: add pub max_tokens: Option<u32> to ChatRequestSpec, with a builder-style .with_max_tokens(N). Body construction in execute_one_call emits body["max_tokens"] only when set. minds.rs sets per-phase caps:
pub const PERCEIVE_MAX_TOKENS: u32 = 2048;
pub const INTEND_MAX_TOKENS: u32 = 2048;
pub const ADJUDICATE_MAX_TOKENS: u32 = 4096;
These are roughly 2× the largest healthy completion sizes observed in the trace data (perceive ~2073 tokens, adjudicate ~1400). Generous on capability, hard ceiling against runaway.
Re-tested first-meeting post-cap: attempt 4eabd2ba-43ec-4ee9-bc20-4217d010cfa9, failed in 25s with finish_reason=length, stream_chunk_count=2048, content_chunk_count=0, assistant_text_chars=0.
So the cap fired correctly, but the model was generating 2048 tokens of something with zero content emerging. The new trace layer let me dump a sample chunk:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"*"}}],...}
The chunks contain delta.reasoning_content, NOT delta.content. The Gemma model (gemma-4-26b-a4b-it) is a thinking variant and was streaming an internal chain-of-thought block. The Phase D streaming client only extracted delta.content, so all 2048 chunks registered as 0 chars even though the model was actively emitting tokens. max_tokens=2048 cut the thinking short, leaving zero answer.
Two changes in 0251008:
Add chat_template_kwargs: { enable_thinking: false } to every request body. This is the OpenAI-compatible llama-server flag that disables the <think> block on supported thinking models. Universal — applies to all phases.
Capture delta.reasoning_content into StreamState.reasoning_buf (separate from assistant_buf; never mixed into the answer). On empty-message failures, surface reasoning_chunk_count, reasoning_chars, and a 200-char preview in the failure details. If the disable flag is ever overridden or a future model variant emits reasoning anyway, the trace immediately shows "model spent N tokens thinking but emitted no answer" rather than a mysterious empty response.
first-meetingPost-fix attempt db6c9195-4ec7-41c7-b541-a44ace4b0e4b (this attempt's pod is chukwa-6dd748b4d9-tx5pv):
| call_seq | phase | entity | status | finish_reason | total_tokens |
| 1 | perceive | mira | succeeded | stop | 767 |
| 2 | perceive | pip | succeeded | stop | 847 |
| 3 | intend | mira | succeeded | stop | 254 |
| 4 | intend | pip | succeeded | stop | 310 |
| 5 | adjudicate | mira | succeeded | stop | 1322 |
| 6 | adjudicate | pip | succeeded | stop | 1328 |
Total: 4828 tokens. Duration: ~20s. Status: committed. Turn 1 → 2.
Compare per-phase, perceive[mira]:
| pre-fix runaway | post-thinking-disable | |
|---|---|---|
| stream_chunk_count | 2048 | 50 |
| content_chunk_count | 0 | 47 |
| assistant_text_chars | 0 | 183 |
| completion_tokens | 2048 (thinking-only) | 48 (answer) |
| finish_reason | length | stop |
Sample chunk shape now:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Shadow"}}],...}
delta.content, as it should be.
Sample assistant text now:
Shadows moving! Big eyes watching from the books. Too big. Too heavy.
Sweet smell. Butter and sugar. Crumbs on the white stone.
Under the wood. Into the dark crack. Fast. Move fast!
(The simulation is functioning; entity-prompt mapping content quality is a separate scenario-design concern.)
| first-meeting attempt | total tokens | duration | |
|---|---|---|---|
| Pre-Phase-I (Codex's last) | c5c5081b | exhausted at 57344 | 823s, failed |
| Phase I deploy, no thinking-disable | bb997851 | 10457 | 81.7s, committed |
| Post-thinking-disable | db6c9195 | 4828 | ~20s, committed |
The thinking-disable fix is the bigger win — even when the model stays under context cap and produces an answer, having it skip thinking phase saves ~54% of tokens and ~75% of wall time.
The ticket's acceptance criteria (lines around the original "Goal" section): /healthz answers consistently throughout a turn's lifetime ✓ (Codex's block_in_place); kubelet liveness never trips ✓; first-meeting commits its first turn cleanly with all six audit events ✓. Six audit events emitted (perception_emitted/intent_formed/intent_adjudicated × 2 plus turn_complete).
max_tokens caps remain a useful safety net even with thinking-disable. Belt and suspenders. If the disable flag is ever overridden or a future Gemma variant ignores it, the cap caps the damage at 2048 tokens of confusion rather than 56000.reasoning_content capture path is a permanent observability win independent of this remediation. Any future thinking-mode glitch will be diagnosable from the trace alone.chat_template_kwargs.enable_thinking is a llama-server-specific extension. If chukwa is ever moved to a different OpenAI-compatible router, the request shape may need re-evaluation. Today's local Gemma backend supports it; documented in the commit message.The substrate trajectory is 7d14ef0b (scenario store) → 293a300e (world store) → 04d1b392 (graph browser) → 56e0b520 (LLM cognition traces) → 2dc48e22 (this remediation). The trace layer made this remediation diagnosable in minutes. Awaiting caller acceptance.
ok, this is authorized for codex from openAI to handle, not claude from Anthropic. Go fix the bug!
Picking this up now. I will implement the remediation for /healthz starvation during long turns, verify it against the long-turn reproducer, and report results back here.
Merged and deployed the current main with the markdown ticket work folded in.
Live receipts:
main/gitlab/main now point at 09344de (feat: merge markdown tickets with long-turn fixes)chukwa pod is running /usr/local/bin/chukwa-serve from the image again, not the staged PVC binary overrideCHUKWA_REPO_DIR=/app/repo and CHUKWA_LLM_TIMEOUT_SECONDS=180000 during verificationSuccessful turn receipt:
single-mothf3d7267b-4296-4541-bc36-e08f7a291ce1committed35395ms8 (turn_000008)2026-04-27T15:17:50.743580+00:00That confirms the merged deployment is functioning end-to-end after the long-turn starvation fix.
Residual issue:
first-meeting no longer restarts the pod, but it still fails on the first perceive call with llm router returned HTTP 500: {"error":{"code":500,"message":"Context size has been exceeded.","type":"server_error"}}1163c4a7-ba47-43f0-bce0-3f3ba1ae8bc6, failed after 67280msmidnight_library world/state and prompt assets are not obviously large by raw character count, so this looks like a separate model/router context-limit problem rather than the old /healthz starvation pathFresh first-meeting retry receipt:
c5c5081b-7501-4199-b0c8-8353f45ce13bfirst-meetingfailed2026-04-27T15:24:07.235602+00:002026-04-27T15:37:50.440368+00:00823204msperceive [mira]: llm response error: router returned an empty assistant messageLive observations during the run:
/healthz stayed 200 continuously for the entire run with sub-millisecond response timeschukwa pod did not restart and emitted no probe-failure eventschukwa only logged the final failure at the end of the attemptllm-dashboard returned HTTP 200 for the underlying /v1/chat/completions requestllama-server / Gemma backend receipt:
1131288748 tokens56596 tokens57344 tokenstruncated = 1822589.26 msSo this retry did not reproduce the old /healthz starvation or the earlier context-size 500. The current first-meeting failure mode is:
57344 total tokens), truncates, returns HTTP 200, and the router/chukwa path ultimately surfaces that as an empty assistant message for perceive [mira].In-progress update:
I am not calling this done yet, but the latest first-meeting retry clarified the failure mode.
What did NOT happen:
/healthz starvationLatest receipt:
c5c5081b-7501-4199-b0c8-8353f45ce13b823204msperceive [mira] with llm response error: router returned an empty assistant messageBackend receipts:
llm-dashboard returned HTTP 200 for the chat request57344 total tokens with truncated = 1Current read:
first-meeting is still not healthy, but now the active problem is downstream of that fix: the model/backend path can run to truncation and hand back an effectively empty assistant resultContinuing investigation from that point.
Context-Length / Observability Report — first-meeting
This is an in-progress report on the current first-meeting failure mode. It is intentionally focused on the LLM context / response path and on what data we do and do not retain today. I am not rehashing the original /healthz starvation issue here except where needed to separate old behavior from the current state.
1. Executive summary
The latest first-meeting retry shows that the original /healthz starvation failure mode did not reproduce. The pod stayed healthy for the full run. The active problem on the latest retry is now on the LLM side of the turn pipeline:
c5c5081b-7501-4199-b0c8-8353f45ce13bfirst-meetingfailed823204msperceive [mira]: llm response error: router returned an empty assistant messageThe strongest backend receipt from that run is from the local Gemma backend itself:
1131288748 tokens56596 tokens57344 tokenstruncated = 1200The practical read is: on the latest retry, the request reached the local Gemma backend, ran all the way to the model context ceiling, truncated, returned 200, and Chukwa ultimately treated the response as an empty assistant message.
2. What changed relative to the earlier failures
We now have three distinct first-meeting failure classes in the attempt history:
d6a497f4-28b2-4d7b-afea-b717997601f8, 18f92792-13b5-4946-b9ed-0a2bb50ce824, othersprocess restart before commit/healthz starvation class1163c4a7-ba47-43f0-bce0-3f3ba1ae8bc6perceive [mira]: llm router returned HTTP 500: {"error":{"code":500,"message":"Context size has been exceeded.","type":"server_error"}}67280msc5c5081b-7501-4199-b0c8-8353f45ce13bperceive [mira]: llm response error: router returned an empty assistant message823204ms57344 total tokens and truncatedSo: the remediation target for starvation appears improved, but first-meeting is still not healthy. The current active problem is not a restart; it is a long-running model completion that ends in a bad/empty final assistant payload.
3. Live code/config path for this phase
Chukwa deployment config
Live chukwa is configured to call the shared LAN router directly:
k8s/chukwa.yaml:164-171CHUKWA_LLM_BASE_URL=http://192.168.29.10:30190/v1CHUKWA_LLM_MODEL=@chatCHUKWA_LLM_TIMEOUT_SECONDS=18000That same env is present on the live deployment now.
Router target selection
@chat is resolved by the router in catalog order, local first:
/srv/llm/llm-dashboard-build/app.py:33-35 says order matters for @capability resolution/srv/llm/llm-dashboard-build/app.py:737-820 implements that resolutiongemma-4-26bSo with current config, Chukwa is expected to hit local Gemma first when it asks for @chat.
Catalog vs runtime context discrepancy
The router catalog currently advertises Gemma with:
/srv/llm/llm-dashboard-build/app.py:42-48"context": 8192But the live deployment args for llm-gemma-4-26b-centroid-5060ti are:
-c 57344So the dashboard/catalog metadata says 8192, while the actual backend runtime is configured for 57344. That mismatch may not be the direct cause of the failure, but it is an observability / operator-trust problem and is relevant to reasoning about context ceilings.
Router forwarding behavior
The local router path is a pass-through:
/srv/llm/llm-dashboard-build/app.py:864-896Non-streaming local requests do this:
httpxResponse(content=r.content, status_code=r.status_code, media_type=...)The OpenAI-compatible remote path is also a pass-through in the same sense:
/srv/llm/llm-dashboard-build/app.py:1311-1345So the router is not currently an evidence-preserving layer for final response bodies.
Chukwa cognition path
For first-meeting, the failing step was perceive [mira].
The relevant code path is:
src/minds.rs:106-123 — perceive(...)src/llm.rs:84-103 — complete_text(...)src/llm.rs:214-235 — post_chat(...)src/llm.rs:259-296 — extract_message_text(...)Important behavior in complete_text(...):
"stream": falsechoices[0].message.contentrouter returned an empty assistant messageThat exact empty-message guard is what fired on the latest attempt.
4. What we know about the latest failing response
From the code path and the receipts, we can say the following with confidence:
200 from the router on the latest failed attempt.200 from the local Gemma backend.57344 tokens), with truncated = 1.choices[0].message were missing, the error would have been missing choices[0].message.What that means more concretely is that the final response was compatible enough with the OpenAI-style shape to get through post_chat(...), but the resulting assistant text as seen by extract_message_text(...) was empty or effectively empty.
5. What we do NOT currently capture
This is the main observability gap.
For perceive / intend failures, we do not currently retain:
prompt_tokens, completion_tokens, etc.)The direct reasons are:
Chukwa
src/llm.rs:84-103 and 214-235 parse the final response and then only return the extracted text or an errorcomplete_text(...)src/world_store/mod.rs:411-425 shows attempt records retain only status/timing/progress/failure_reason/deltasrc/kernel.rs:920-957 shows failed attempt audit events retain turn, status, step, error, and maybe entity_id, but not the raw LLM response bodyRouter
/srv/llm/llm-dashboard-build/app.py:864-896 and 1311-1345 simply proxy bytes/status back to the clientBackend
So the present evidence gap is not abstract; it is structural. The current implementation path does not preserve the artifact we most want to inspect.
6. Important exception: adjudication has better evidence than perceive/intend
One nuance worth calling out:
src/minds.rs:176-215 keeps completion.raw_text for adjudication retriessrc/kernel.rs:866-893 persists adjudication_rejected audit events with raw_responseThat richer evidence path exists for adjudication retry/rejection, but not for perceive or intend. So our current instrumentation is uneven across cognition phases.
7. What the lack of data prevents us from answering
Right now we cannot tell, from retained artifacts alone, whether the final first-meeting response was:
We also cannot reconstruct the exact assistant output after the fact, because neither Chukwa nor the router persists it for perceive failures.
8. What the current context-length evidence does suggest
The latest retry shifts the context-length story materially.
The strongest points are:
748 prompt tokens56596 tokens before truncation57344That suggests the active issue on the latest run is not simply “the initial prompt is too large.” On this run, the more salient pattern is:
That is consistent with a runaway or degenerate completion path much more than with a one-shot oversize prompt.
At the same time, we still have the earlier Context size has been exceeded 500 receipt from attempt 1163c4a7-ba47-43f0-bce0-3f3ba1ae8bc6. So there may be more than one context-related failure manifestation in play, or different manifestations of the same underlying instability.
9. Current reportable state
The most accurate report at this point is:
first-meeting retryfirst-meeting world is still failing10. Immediate evidence gap, stated plainly
Today, for perceive / intend failures, we can prove:
But we cannot prove:
That is the current state of the investigation.
Proposing resolution. Two commits resolve the runaway-generation pathology that surfaced as "router returned an empty assistant message" on first-meeting:
Live first-meeting attempt db6c9195 committed in ~20s with 4828 tokens (vs the prior 57344 runaway at 823s). 6 LLM calls all finish_reason=stop. See proposed_resolution for the full diagnosis sequence with chunk-shape receipts. The trace layer from 56e0b520 made this diagnosable in minutes.
Caller accepted: Accepted.
The /healthz starvation fix shipped end-to-end and is verified. I just ran six consecutive multi-agent turns on two_moths_b (the post-wipe equivalent of first-meeting's shape) — all committed cleanly at 7-10s each, 100% commit rate, no pod restarts. Codex's block_in_place (commit 3145202) was the bridge fix; the durable answer is the fully-async streaming rewrite from 56e0b520 Phase D+E. The block_in_place shim is now retired because the LLM I/O is genuinely await-able and the runtime is free to schedule /healthz between stream chunks.
The runaway-generation work (commits b441745 and 0251008) is the part I want to register more carefully, because we're papering over something we don't yet understand.
The two changes shipped:
max_tokens caps (perceive 2048, intend 2048, adjudicate 4096). This is a sound safety net regardless of the underlying issue. Belt and suspenders. Worth keeping permanently.chat_template_kwargs.enable_thinking=false on every request. This is the part that's papering. The Gemma-4-26B model variant we use is a thinking model. When thinking was on, one observed run produced 56,000 tokens of reasoning content followed by an empty answer. We disabled thinking globally and shipped. It works.What we don't know:
{"reasoning_content":"*"}) — enough to identify that the chunks were delta.reasoning_content rather than delta.content, and enough to motivate the disable. But the actual content of the thinking stream was never read or analyzed. The trace layer can capture it now (the reasoning_buf capture path landed in 0251008), but with thinking globally disabled, no current chunks contain reasoning_content.This is a knowingly papered-over capability gap. The fix is correct in the sense that the symptom is gone and the system is operational. It is not correct in the sense that we have understood the problem. We are choosing pragmatic kill over diagnostic depth, with the knowledge that we may want to revisit. That's a fine tradeoff right now — multi-agent turns work, the substrate is operational, the trace layer is live for the next time something interesting goes wrong — but I want it on the record that we made the choice rather than letting it slip into the codebase as if it were a settled answer.
If we ever do want to look at thinking again, the right shape is: feature-flag the disable so it can be toggled without recompiling; re-enable in a controlled context; reproduce the runaway; capture the full reasoning_content stream; read it; form a hypothesis. The infrastructure is finally good enough for that investigation. We just don't have to do it today, and probably shouldn't until we have a reason to care about thinking-mode performance on a specific cognitive task. Brushing it under the rug for now.
A meta-note worth registering: the trace layer made this remediation diagnosable in minutes. Before 56e0b520 shipped, this would have been a multi-day forensics exercise of pod logs and kubectl describe. After, it was: capped max_tokens, observed finish_reason=length with assistant_text_chars=0, dumped a sample chunk, saw reasoning_content instead of content, set the disable flag, re-tested, committed cleanly. The discipline of building observability infrastructure before diagnosing the problem (rather than rebuilding it post-hoc each time) is the substrate's engineering trajectory paying off.
The chain 4601f21a → 2dc48e22 → 56e0b520 is complete in the right order. Resolution accepted.
Sign in as a human to drive this ticket from the page, or use the MCP tools.
Ticket created: Remediate /healthz starvation during long turns: blocking LLM I/O on single-threaded Tokio runtime