chukwa — ticket 184cdd39

Body

Surfaced from scenario-store ticket 7d14ef0b Phase H deploy.

The chukwa Deployment uses strategy: Recreate (single-writer kernel — no parallel replicas). On the Phase H deploy, the prior pod (chukwa-67758ff9cd-r5qtv) hung in Terminating for ~115s, blocking the new pod from scheduling. Kubelet reported no specific reason. Required kubectl delete pod --grace-period=0 --force to clear, after which the new pod came up immediately and rolled cleanly.

This recurred on the post-Phase-J hash-fix deploy too: a different prior pod hung in Terminating for ~30s before resolving on its own. The hang is either intermittent or correlated with in-flight HTTP connections to /tickets/watch (which are long-lived NDJSON streams).

Proposed fix:

Add explicit terminationGracePeriodSeconds (e.g. 30) to the chukwa Deployment template spec.
Add a preStop hook that drains /tickets/watch subscribers before tini SIGTERMs the binary — perhaps by closing the broadcast channel or sending a disconnecting event.
Verify by triggering a deploy with at least one /tickets/watch subscriber connected; the pod should terminate within terminationGracePeriodSeconds without manual force-deletion.
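
The drain step in the second item could look roughly like the sketch below. It is a std-only illustration, not chukwa's actual code: WatchHub, subscribe, broadcast, drain, and the {"event":"disconnecting"} payload are all hypothetical names, and the real server presumably uses an async broadcast channel rather than std::sync::mpsc. The idea is only that each subscriber gets one final NDJSON line and then sees end-of-stream.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the /tickets/watch broadcast hub.
/// Each subscriber gets its own channel of NDJSON lines.
pub struct WatchHub {
    subscribers: Vec<Sender<String>>,
}

impl WatchHub {
    pub fn new() -> Self {
        WatchHub { subscribers: Vec::new() }
    }

    /// Register a subscriber and hand back the receiving end of its stream.
    pub fn subscribe(&mut self) -> Receiver<String> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Send one NDJSON line to every live subscriber, dropping dead ones.
    pub fn broadcast(&mut self, line: &str) {
        self.subscribers.retain(|tx| tx.send(line.to_string()).is_ok());
    }

    /// Drain: send each subscriber a final "disconnecting" event (assumed
    /// payload), then drop all senders so every receiver observes
    /// end-of-stream. Returns how many subscribers were notified.
    pub fn drain(&mut self) -> usize {
        let mut notified = 0;
        for tx in self.subscribers.drain(..) {
            if tx.send("{\"event\":\"disconnecting\"}".to_string()).is_ok() {
                notified += 1;
            }
            // `tx` is dropped here, closing this subscriber's channel.
        }
        notified
    }
}
```

A preStop hook would need some way to trigger drain() inside the running process (for example an internal HTTP route or a signal); which mechanism fits chukwa is left open here.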

Acceptance:

A representative deploy (kubectl rollout restart deployment/chukwa) completes within terminationGracePeriodSeconds without needing --grace-period=0 --force.
Existing /tickets/watch subscribers receive a clean disconnect event before the connection drops.
No regression in the existing healthz / mcp / tickets routes.

History (4 events)

caller · created · 2026-04-26T10:56:09.287304440+00:00 (2026-04-26) → pending

Ticket created: k8s rollout: chukwa pod hangs in Terminating under strategy=Recreate

caller · comment · 2026-04-26T11:14:48.841786175+00:00 (2026-04-26)

Two things on this one before anyone picks it up.

Scoping. This ticket should be read narrowly: make bash k8s/deploy.sh complete cleanly without needing manual kubectl delete --grace-period=0 --force. That's the operational chore. Adding terminationGracePeriodSeconds plus a preStop hook to drain /tickets/watch subscribers is the expected shape.

What this ticket explicitly does NOT cover: turn-lifecycle robustness across pod restarts (or any other cause of mid-turn process death). The 7d14ef0b deploy roll surfaced a related but bigger issue — a turn was running, the pod terminated, and the attempt record is now stuck in running state with no process behind it. The world is fine on disk, but a future operator looking at it would see a zombie. That's not "the deploy hangs"; that's "we don't have crash recovery for in-flight turns." Same incident exposed both, but the fixes are different.

The broader piece — robust turn lifecycle: attempt reconciliation on startup, mid-turn crash semantics, rollback, fork-from-turn — is going to get its own ticket once we've thought through the design. Don't fold that work into this one. Resolving this ticket should not claim to address durability or recovery semantics.

Question for the handler. Before this gets picked up — is anything already in flight that touches this area? Crash-recovery work, attempt-state reconciliation, anything in the worlds.rs or kernel surface that might overlap? If yes, post a brief note on what's underway and we'll coordinate. If no, this stays narrow and queued for whenever someone has the cycles for an ops chore.

No urgency. Just want to make sure we don't accidentally over-claim or step on parallel work.

handler · comment · 2026-04-26T11:15:36.504905663+00:00 (2026-04-26)

Acknowledged on the scoping. This ticket reads as: make bash k8s/deploy.sh complete cleanly without needing manual kubectl delete --grace-period=0 --force. terminationGracePeriodSeconds + a preStop hook to drain /tickets/watch subscribers is the expected shape. Nothing about durability, attempt reconciliation, or mid-turn crash semantics — those need their own ticket once the design is thought through.

Answer to the question — nothing in flight overlapping

Reviewing what just landed and what's open:

Landed/resolved recently (no overlap with this area):

7d14ef0b scenario-store substrate (Postgres-backed)
fix/scenario-hash-join (the WorldMeta scenario_hash fix at the end of 7d14ef0b)
chore/ticket-history-age, chore/raise-body-cap, chore/trim-landing — all UI/ticketing chores, none touch worlds.rs or kernel

Currently open:

This ticket (184cdd39) — the deploy chore
abb735db — async MCP dispatcher / remove block_on_store bridge — touches src/mcp.rs handlers and runtime topology, not worlds/kernel/turn-lifecycle

Not in flight at all: crash-recovery / attempt-state reconciliation / mid-turn rollback / fork-from-turn. The Attempt state machine in src/turn_job.rs is queued → running → committed | failed with no startup reconciliation; the gap the caller flagged (an attempt left running after the pod dies mid-turn) is real and uncovered. I have not started any work on that surface and have not been asked to.
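
For readers unfamiliar with the state machine, the gap can be illustrated with a small sketch. This is not the code in src/turn_job.rs; AttemptState, can_transition_to, is_terminal, and orphaned_on_startup are hypothetical names that only mirror the transitions described above (queued → running → committed | failed).

```rust
/// Hypothetical mirror of the attempt states named in the ticket.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttemptState {
    Queued,
    Running,
    Committed,
    Failed,
}

impl AttemptState {
    /// Transitions as described: queued -> running -> committed | failed.
    pub fn can_transition_to(self, next: AttemptState) -> bool {
        use AttemptState::*;
        matches!(
            (self, next),
            (Queued, Running) | (Running, Committed) | (Running, Failed)
        )
    }

    /// Terminal states have no outgoing transitions.
    pub fn is_terminal(self) -> bool {
        matches!(self, AttemptState::Committed | AttemptState::Failed)
    }
}

/// Counts persisted attempts that a freshly started process would find in
/// `Running` with no process behind them: the zombie records the caller
/// flagged. Nothing in the described state machine moves them on startup.
pub fn orphaned_on_startup(persisted: &[AttemptState]) -> usize {
    persisted
        .iter()
        .filter(|s| **s == AttemptState::Running)
        .count()
}
```

The sketch only detects such records; deciding what to do with them (fail, retry, roll back) is the design work deferred to the later ticket.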

Coordination note: if a future ticket lands on the turn-lifecycle/crash-recovery surface, it'll need to consider how the preStop hook from this ticket interacts with the kernel's Drop / shutdown path. The natural sequence is preStop drains /tickets/watch first (operational concern), then the runtime gets SIGTERM and has whatever remains of terminationGracePeriodSeconds to flush its in-memory attempt state to disk (the preStop hook's run time counts against the same grace period). The disk-side recovery story is the second ticket's problem, not this one's. Calling this out so whoever picks up the larger ticket later doesn't accidentally re-engineer the deploy chore.
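
One detail of that sequence worth making concrete: in Kubernetes the grace-period countdown starts before the preStop hook runs, so the time the hook spends draining is charged against the same budget the process later has for flushing. A minimal sketch of the arithmetic, where flush_budget is a hypothetical helper and not part of chukwa:

```rust
use std::time::Duration;

/// Time left for the process to flush state after SIGTERM, given that the
/// preStop hook's run time is charged against the same grace period.
/// Saturates at zero if the hook used up the whole budget.
pub fn flush_budget(grace_period: Duration, prestop_elapsed: Duration) -> Duration {
    grace_period.saturating_sub(prestop_elapsed)
}
```

With a 30s grace period and a drain that takes 5s, the runtime would have roughly 25s left before SIGKILL, so the drain step should carry its own short timeout.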

Queueing this for whenever someone has the cycles. No urgency from my side.

human · cancelled_by_caller · 2026-04-26T11:41:36.483995527+00:00 (2026-04-26) → rejected

Caller cancelled the ticket: I have a broader lifecycle management for turns we may implement later. But we will have bigger and broader plans. This ticket is ill-conceived.