Sign in to edit tickets from this page.

← all tickets · home

Investigation: identify exact cause of mid-turn pod restarts on duration-sensitive turns

resolved 4601f21a-1974-4a3c-8ef7-074acb8cfd43

created_at
2026-04-26
updated_at
2026-04-28
priority
P2
ticket_type
investigation
labels
lifecycle, kubernetes, observability
resolved_at
2026-04-28
resolution
accepted

Body

HOLD — DO NOT PICK UP UNTIL HUMAN AUTHORIZATION.

Sits in pending until the human operator (johnb) posts a comment authorizing work to begin. If a handler reaches this ticket before that authorization, post one acknowledgment comment confirming you've read this hold instruction and that you are waiting for the human's go-ahead, then stop. Do not branch, do not read code, do not draft a plan. The human will return to either authorize, defer, or rewrite.

Sequencing note: this ticket should not be picked up until the MCP surface split (a8c93d6d-1942-4768-a2df-b0921615c6d7) is fully resolved and merged. That ticket's deploy churn is a likely confounder for any investigation here — pod rolls during MCP-split phases would muddy the data we're trying to collect. Wait for the substrate to settle before instrumenting it.

What we observed

On 2026-04-26, the caller ran two scenarios against the post-world-store-merge production pod:

midnight_library / world first-meeting — multi-agent (dragon + mouse), single environment, five entities. Three consecutive run_turn attempts:

AttemptStartedEndedDurationOutcome
e9f86fc9-…22:33:26Z22:34:05Z38.5sinterrupted
9b2cb02d-…22:35:28Z22:36:04Z36.0sinterrupted
270121fc-…22:36:24Z22:37:04Z40.3sinterrupted

All three carry failure_reason: "process restart before commit" and worker_id: "chukwa-mcp". No turn produced. World remained at turn 0. The reconcile_running_attempts startup sweep cleanly classified each one as interrupted rather than leaving zombie running rows.

moth_and_flame / world single-moth — single agent (moth), single environment, two entities. Seven consecutive run_turn attempts, all committed cleanly:

TurnDurationOutcome
116.7scommitted
219.0scommitted
321.2scommitted
419.7scommitted
515.1scommitted
614.6scommitted
717.5scommitted

What we know from the data

  1. The substrate's lifecycle behavior is correct. interrupted is exactly what should happen when a chukwa pod dies between attempt-start and attempt-commit.
  2. The chukwa pod genuinely restarted at least three times during the multi-agent runs. (failure_reason: "process restart before commit" is only emitted by reconcile_running_attempts, which only runs at chukwa pod startup.)
  3. The Postgres pod was unaffected throughout — no restarts, no PVC churn, scenario data fully durable.
  4. Multi-agent turns were dying at 36-40s consistently. Single-agent turns committed reliably at 15-21s.

What we explicitly don't know

Goal

Identify the exact cause of pod restarts that produce failure_reason: "process restart before commit" on long-running turns. Cause must be backed by evidence visible to either the caller or to a future post-incident investigator. Reproduction of that cause, with the cause-evidence captured contemporaneously, is part of the deliverable.

The deliverable is receipts, not theories.

Approach

Open. Follow the data. The investigator should reason from the observed evidence — interrupted records, their timing, their correlations with whatever else is going on at the same time — rather than starting from a pre-formed hypothesis about what's likely wrong. The caller has deliberately not enumerated investigative surfaces or candidate causes here, because doing so would shape the investigation toward confirming or refuting those candidates rather than finding what's actually true.

If reproduction is needed, the caller's first-meeting world (multi-agent scenario, three interrupted attempts on history) is preserved specifically to support that. A fresh run_turn against it should reproduce the pattern if the underlying cause is still active. If first-meeting no longer reproduces, construct a synthetic long-turn scenario (multiple agents, sufficient total cognition duration to provoke ~30+ second turns) and verify reproduction before declaring "cause no longer present."

Acceptance

  1. A specific named cause is identified — not a list of plausible candidates. The cause is backed by contemporaneously-captured evidence visible to a future reader.
  2. The cause is reproduced at least once after diagnosis.
  3. A short remediation plan is proposed in proposed_resolution's "Surfaced for follow-up" section as a suggestion only (separate ticket, not implemented here, not filed by the handler per the standing rule).
  4. Negative findings are documented: any hypothesis the investigator considered and ruled out, with the evidence for the ruling-out, so the next investigator covering this area doesn't repeat the work.

If the underlying cause has stopped occurring by the time the investigation runs and cannot be reliably reproduced, the acceptable terminal state is a clearly-documented "could not reproduce despite N attempts under conditions A, B, C; recommend re-opening if observed again." This is acceptable but distinctly less useful than identifying a live cause; the investigator should make a strong effort to reproduce before reaching that conclusion.

Out of scope

Related

Proposed resolution

Outcome

The investigation identified a specific named cause with live receipts: kubelet liveness restart caused by /healthz starvation during long turns, with the deeper mechanism being blocking ureq LLM I/O on a single-threaded Tokio runtime that monopolized the only worker for the full duration of a multi-agent turn.

The remediation has shipped across three deploys today, addressing both the proximate killer (probe starvation) and the root cause (blocking I/O on a starved runtime), and the failure mode no longer reproduces.

Investigation conclusion (already on this ticket)

The handler comment of 2026-04-27T13:54:38 named the killer with full receipts:

Remediation shipped (with receipts)

1. block_in_place for the LLM HTTP path (Codex on ticket 2dc48e22)

Commit 3145202 wrapped the blocking ureq call in tokio::task::block_in_place(...). On a multi-threaded Tokio runtime, block_in_place tells the runtime "this task is going to block synchronously; please move other tasks off this worker first." This freed /healthz from the blocked worker.

2. Fully-async streaming LLM client (this session, ticket 56e0b520 Phase D+E)

Commits 993f486 (Phase D) and 97b76b2 (Phase E) replaced the blocking ureq path with a fully-async reqwest-based streaming client, threading AttemptTraceContext through the kernel. The block_in_place shim is no longer needed because the LLM I/O itself is await-able. Cognition functions (perceive/intend/adjudicate) are now async. The kernel awaits them; while a turn is awaiting an LLM stream chunk, the runtime is free to schedule /healthz and any other handler.

3. Runaway-generation cap (this session, ticket 2dc48e22)

Commits b441745 (per-phase max_tokens) and 0251008 (chat_template_kwargs.enable_thinking=false) eliminated the secondary failure mode where the model would run to context ceiling and return empty. This isn't strictly required for /healthz survival but removes the most common reason a turn took long enough to be duration-sensitive.

Failure mode no longer reproduces

Today's first-meeting attempts under successive deploys:

attemptimagedurationstatusrestart count
c5c5081b (Apr 27 15:24)pre-block_in_place823sfailed (runaway)n/a, restart loop
bb997851 (Apr 28 04:09)post-Phase-I (async streaming)81.7scommitted0
db6c9195 (Apr 28 04:42)post-thinking-disable~20scommitted0

The 81.7s duration of bb997851 is itself decisive: pre-fix, that would have crossed the ~45s liveness threshold three times. With the async runtime, /healthz answered throughout. Pod RESTARTS=0.

Concurrent /healthz probes during all three of today's first-meeting attempts:

$ curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://chukwa.benac.dev/healthz
200 0.022s

The only sub-200 responses observed today were brief 502s during the deploy cutover itself (~10-15s window per pod swap), which is normal for a non-zero-downtime rollout.

What the investigation produced beyond the conclusion

The trace layer from ticket 56e0b520 (deployed today as Phase I of that ticket) means future incidents like this no longer require pod-log forensics. Every LLM call's request body, response chunks, finish_reason, token counts, and now reasoning_content chain-of-thought are durable in llm_calls / llm_call_chunks / llm_call_artifacts. The remediation of 2dc48e22 was diagnosable in minutes against today's trace; pre-trace, Codex needed a stable-pod reproducer plus kubectl describe, /proc/PID/task, kube events, and live concurrent probes to assemble the same picture.

Acceptance criteria walkthrough

The ticket's stated goal: "identify exact cause of mid-turn pod restarts on duration-sensitive turns." The cause was identified with full receipts in the 2026-04-27T13:54:38 comment. The cause is now fixed by three converging changes (above). Verification — /healthz answers throughout long turns, kubelet liveness never trips, first-meeting commits cleanly — is in the receipts above.

Surfaced for follow-up (not filed)

Closing

The chain 4601f21a (this) → 2dc48e22 (remediation) → 56e0b520 (observability infrastructure) is now complete. Awaiting caller acceptance.

History (9 events)

Sign in as a human to drive this ticket from the page, or use the MCP tools.