resolved 0cac6740-2728-4e29-acec-5047e25f23f4
The key design decision: keep one attempt per attempted turn. That preserves the current lifecycle model, LLM trace visibility, failure records, list_attempts, get_turn_status, and canonical turn history. Add a higher-level “turn run” / “series” wrapper that sequences those attempts.
The relevant pieces are:
src/mcp.rs
  handle_run_turn currently calls world_store.start_attempt(...), spawns one Runtime::run_claimed(...), and returns one attempt_id.
  consumer_tool_contract.world_slug for run_turn.
src/kernel.rs
  Runtime::run_turn and Runtime::run_claimed are strictly one-attempt / one-turn paths.
src/world_store/mod.rs
  WorldStore trait has start_attempt, commit_turn, fail_attempt, reconcile_running_attempts, get_attempt_status, and list_attempts.
src/world_store/postgres.rs
  start_attempt enforces the world lease with worlds.active_attempt_id.
  commit_turn and fail_attempt clear the attempt lease.
src/world_store/memory.rs
migrations/0002_world_store.sql
  attempts has the invariant attempted_turn = turn_before + 1.
  worlds.active_attempt_id is the current lease.
src/mcp/tests.rs
run_turn support with durable turn-run lifecycle
Extend the MCP run_turn tool so callers can request more than one committed turn in a single tool call by passing an optional integer parameter named turn_count.
The implementation must preserve the existing per-turn attempt lifecycle. A multi-turn run must create and execute one normal attempt per attempted turn. The system must expose durable visibility into the multi-turn run, must allow cancellation between attempts, and must keep every individual attempt observable through the existing attempt lifecycle tools.
Do not change cognition workflow semantics.
Do not change WorldPatch.
Do not change how a single turn is committed or failed.
Do not create a single attempt that commits multiple turns.
Do not remove or weaken the existing attempts lifecycle.
Do not make multi-turn execution parallel. Multi-turn runs are strictly sequential per world.
run_turn
Update the existing run_turn MCP tool to accept these arguments:
{
"world_slug": "string",
"turn_count": "optional integer",
"max_attempts": "optional integer"
}
turn_count rules:
  minimum: 1
  maximum: 100000
  default: 1
max_attempts rules:
  minimum: 1
  maximum: 1000000
  default: turn_count
  must be greater than or equal to turn_count
The tool must reject unknown keys. The accepted keys for run_turn are exactly:
world_slug
turn_count
max_attempts
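The defaulting and bounds rules above can be sketched as a small pure function. This is only an illustration under stated assumptions: RunTurnLimits, ArgSource, and resolve_run_turn_limits are hypothetical names, not types from the repo, and the sketch returns a plain String error rather than the handler's real error type.

```rust
/// Bounds copied from the turn_count and max_attempts rules above.
const TURN_COUNT_MAX: i64 = 100_000;
const MAX_ATTEMPTS_MAX: i64 = 1_000_000;

/// Whether a value was supplied by the caller or filled in by a default.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArgSource {
    Default,
    Explicit,
}

/// Hypothetical holder for the validated limits (not a type from the repo).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RunTurnLimits {
    pub turn_count: u32,
    pub turn_count_source: ArgSource,
    pub max_attempts: u32,
    pub max_attempts_source: ArgSource,
}

/// Applies defaults first, then the range checks, then max_attempts >= turn_count.
pub fn resolve_run_turn_limits(
    turn_count: Option<i64>,
    max_attempts: Option<i64>,
) -> Result<RunTurnLimits, String> {
    // turn_count defaults to 1 when omitted.
    let (turn_count, turn_count_source) = match turn_count {
        Some(n) => (n, ArgSource::Explicit),
        None => (1, ArgSource::Default),
    };
    if !(1..=TURN_COUNT_MAX).contains(&turn_count) {
        return Err(format!(
            "turn_count must be between 1 and {}, got {}",
            TURN_COUNT_MAX, turn_count
        ));
    }

    // max_attempts defaults to the resolved turn_count when omitted.
    let (max_attempts, max_attempts_source) = match max_attempts {
        Some(n) => (n, ArgSource::Explicit),
        None => (turn_count, ArgSource::Default),
    };
    if !(1..=MAX_ATTEMPTS_MAX).contains(&max_attempts) {
        return Err(format!(
            "max_attempts must be between 1 and {}, got {}",
            MAX_ATTEMPTS_MAX, max_attempts
        ));
    }
    if max_attempts < turn_count {
        return Err(format!(
            "max_attempts ({}) must be >= turn_count ({})",
            max_attempts, turn_count
        ));
    }

    // Both values are range-checked above, so the narrowing casts cannot truncate.
    Ok(RunTurnLimits {
        turn_count: turn_count as u32,
        turn_count_source,
        max_attempts: max_attempts as u32,
        max_attempts_source,
    })
}
```

Resolving turn_count before max_attempts matters because the max_attempts default depends on the already validated turn_count.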
When turn_count is omitted and max_attempts is omitted, preserve the current behavior:
Call WorldStore::start_attempt.
Spawn Runtime::run_claimed.
Return attempt_id.
The response must add these fields:
{
"run_mode": "single_attempt",
"turn_count": 1,
"turn_count_source": "default",
"turn_count_hint": "No turn_count was supplied; run_turn defaulted to turn_count=1 and started one single-turn attempt.",
"max_attempts": 1,
"max_attempts_source": "default",
"max_attempts_hint": "No max_attempts was supplied; max_attempts defaulted to turn_count (1)."
}
The existing fields must remain:
{
"world_slug": "...",
"attempt_id": "...",
"status": "running",
"turn_before": 123,
"attempted_turn": 124,
"poll_with": {
"tool": "get_turn_status",
"args": {
"world_slug": "...",
"attempt_id": "..."
}
}
}
When turn_count is explicitly supplied as 1 and max_attempts is omitted or explicitly 1, use the same single-attempt path.
Set:
{
"run_mode": "single_attempt",
"turn_count": 1,
"turn_count_source": "explicit",
"turn_count_hint": "turn_count was supplied as 1; run_turn started one single-turn attempt."
}
When turn_count > 1, or when max_attempts > 1, run_turn must create a durable turn-run row and spawn a background coordinator task.
The response must include:
{
"run_mode": "turn_run",
"world_slug": "...",
"turn_run_id": "...",
"status": "running",
"turn_count": 40,
"turn_count_source": "explicit",
"turn_count_hint": "turn_count was supplied as 40; run_turn started a turn run targeting 40 committed turn(s).",
"max_attempts": 40,
"max_attempts_source": "default",
"max_attempts_hint": "No max_attempts was supplied; max_attempts defaulted to turn_count (40).",
"start_turn": 123,
"target_turn": 163,
"poll_with": {
"tool": "get_turn_run_status",
"args": {
"world_slug": "...",
"turn_run_id": "..."
}
},
"list_attempts_with": {
"tool": "list_attempts",
"args": {
"world_slug": "...",
"turn_run_id": "..."
}
}
}
For explicit max_attempts, the hint must be:
max_attempts was supplied as {max_attempts}; the turn run will stop after at most {max_attempts} attempt(s).
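A minimal sketch of how the two max_attempts_hint variants could be produced from one place, so the wording stays deterministic. The function name and signature are assumptions for illustration; only the two hint strings come from the spec text above.

```rust
/// Builds the max_attempts_hint string for a run_turn response.
/// `explicit` is true when the caller supplied max_attempts.
/// Hypothetical helper; the strings themselves are taken from the spec.
pub fn max_attempts_hint(max_attempts: u32, turn_count: u32, explicit: bool) -> String {
    if explicit {
        format!(
            "max_attempts was supplied as {0}; the turn run will stop after at most {0} attempt(s).",
            max_attempts
        )
    } else {
        format!(
            "No max_attempts was supplied; max_attempts defaulted to turn_count ({}).",
            turn_count
        )
    }
}
```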
A turn run must not pre-create attempts.
A turn run must create attempts lazily, one at a time.
Each attempt must still represent exactly one attempted turn.
A successful attempt increments world current_turn by exactly one.
A failed attempt does not increment world current_turn.
The turn run must continue creating attempts until one of these terminal conditions is reached:
committed_turn_count == requested_turn_count
  Turn run status becomes completed.
attempt_count == max_attempts and committed_turn_count < requested_turn_count
  Turn run status becomes failed.
  Failure reason must be exactly:
  max_attempts exhausted before requested turn_count committed
Cancellation was requested and the current active attempt has finished.
  Turn run status becomes cancelled.
The process restarts before the turn run completes.
  Turn run status becomes interrupted.
A failed individual attempt must remain visible as a normal failed attempt through get_turn_status and list_attempts.
A failed individual attempt must not automatically fail the whole turn run unless max_attempts has been exhausted.
Add two consumer MCP tools.
get_turn_run_status
Input:
{
"world_slug": "string",
"turn_run_id": "uuid",
"include_attempts": "optional boolean",
"attempt_limit": "optional integer"
}
Rules:
include_attempts default: false
attempt_limit default: 20
attempt_limit min: 1
attempt_limit max: 100
Return:
{
"message": "...",
"world_slug": "...",
"turn_run_id": "...",
"status": "running",
"requested_turn_count": 1000,
"max_attempts": 3000,
"start_turn": 20,
"target_turn": 1020,
"current_turn": 25,
"committed_turn_count": 5,
"remaining_committed_turns": 995,
"attempt_count": 3000,
"failed_attempt_count": 2995,
"interrupted_attempt_count": 0,
"active_attempt_id": null,
"last_attempt_id": "...",
"last_attempt_status": "failed",
"progress": "...",
"cancel_requested_at": null,
"cancel_reason": null,
"enqueued_at": "...",
"started_at": "...",
"ended_at": null,
"poll_active_attempt_with": null,
"list_attempts_with": {
"tool": "list_attempts",
"args": {
"world_slug": "...",
"turn_run_id": "..."
}
}
}
When include_attempts is true, include:
{
"recent_attempts": [
{
"attempt_id": "...",
"turn_run_id": "...",
"turn_run_seq": 1,
"status": "committed",
"turn_before": 20,
"attempted_turn": 21,
"produced_turn": 21
}
]
}
The recent attempts must be ordered newest first.
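As a rough illustration of the ordering rule, the sketch below sorts attempts by turn_run_seq descending and keeps at most attempt_limit entries. It assumes "newest" means the highest turn_run_seq within a turn run, and AttemptSummary here is a trimmed, hypothetical stand-in for the real attempt summary.

```rust
/// Trimmed, hypothetical stand-in for an attempt summary row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AttemptSummary {
    pub turn_run_seq: u32,
    pub status: String,
}

/// Returns up to `attempt_limit` attempts, newest first.
/// Assumes the caller has already validated attempt_limit (1..=100).
pub fn recent_attempts(
    mut attempts: Vec<AttemptSummary>,
    attempt_limit: usize,
) -> Vec<AttemptSummary> {
    // Highest sequence number first: the most recently started attempt leads.
    attempts.sort_by(|a, b| b.turn_run_seq.cmp(&a.turn_run_seq));
    attempts.truncate(attempt_limit);
    attempts
}
```

In a SQL-backed store the same effect would more likely come from an ORDER BY and LIMIT on the query than from sorting in memory.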
cancel_turn_run
Input:
{
"world_slug": "string",
"turn_run_id": "uuid",
"reason": "optional string"
}
Behavior:
If the turn run is running, set status to cancel_requested.
If there is no active attempt, immediately mark the turn run cancelled, set ended_at, and clear worlds.active_turn_run_id.
If there is an active attempt, do not interrupt it. The active attempt must finish through the normal attempt lifecycle. The coordinator must stop before starting the next attempt, mark the turn run cancelled, set ended_at, and clear worlds.active_turn_run_id.
If the turn run is already terminal, do not mutate it. Return the current status and a message saying it was already terminal.
If reason is omitted, store:
cancellation requested by caller
Return the same status shape as get_turn_run_status.
get_turn_status
Augment the existing attempt response with:
{
"turn_run_id": null,
"turn_run_seq": null
}
For attempts that belong to a turn run, these fields must be populated.
list_attempts
Accept optional turn_run_id.
Current accepted keys become:
world_slug
turn_run_id
When turn_run_id is provided, list only attempts belonging to that turn run.
Every attempt summary must include:
{
"turn_run_id": null,
"turn_run_seq": null
}
Add migration:
migrations/0008_turn_runs.sql
Create enum:
CREATE TYPE turn_run_status AS ENUM (
'running',
'cancel_requested',
'completed',
'failed',
'cancelled',
'interrupted'
);
Alter worlds:
ALTER TABLE worlds
ADD COLUMN active_turn_run_id UUID;
Create table:
CREATE TABLE turn_runs (
turn_run_id UUID PRIMARY KEY,
world_slug label_text NOT NULL REFERENCES worlds(slug),
status turn_run_status NOT NULL,
enqueued_at TIMESTAMPTZ NOT NULL DEFAULT now(),
started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
ended_at TIMESTAMPTZ,
worker_id TEXT NOT NULL,
start_turn BIGINT NOT NULL CHECK (start_turn >= 0),
target_turn BIGINT NOT NULL CHECK (target_turn >= 1),
requested_turn_count INT NOT NULL CHECK (requested_turn_count >= 1),
max_attempts INT NOT NULL CHECK (max_attempts >= 1),
turn_count_source TEXT NOT NULL CHECK (turn_count_source IN ('default', 'explicit')),
max_attempts_source TEXT NOT NULL CHECK (max_attempts_source IN ('default', 'explicit')),
committed_turn_count INT NOT NULL DEFAULT 0 CHECK (committed_turn_count >= 0),
attempt_count INT NOT NULL DEFAULT 0 CHECK (attempt_count >= 0),
failed_attempt_count INT NOT NULL DEFAULT 0 CHECK (failed_attempt_count >= 0),
interrupted_attempt_count INT NOT NULL DEFAULT 0 CHECK (interrupted_attempt_count >= 0),
active_attempt_id UUID,
last_attempt_id UUID,
progress TEXT,
cancel_requested_at TIMESTAMPTZ,
cancel_reason TEXT,
failure_reason TEXT,
CONSTRAINT turn_runs_target_check
CHECK (target_turn = start_turn + requested_turn_count),
CONSTRAINT turn_runs_max_attempts_check
CHECK (max_attempts >= requested_turn_count),
CONSTRAINT turn_runs_terminal_time_check
CHECK (
(status IN ('running', 'cancel_requested') AND ended_at IS NULL)
OR
(status IN ('completed', 'failed', 'cancelled', 'interrupted') AND ended_at IS NOT NULL)
),
CONSTRAINT turn_runs_cancel_check
CHECK (
(status IN ('cancel_requested', 'cancelled') AND cancel_requested_at IS NOT NULL)
OR
(status NOT IN ('cancel_requested', 'cancelled'))
),
CONSTRAINT turn_runs_failure_check
CHECK (
(status IN ('failed', 'interrupted') AND failure_reason IS NOT NULL)
OR
(status NOT IN ('failed', 'interrupted'))
),
CONSTRAINT turn_runs_world_run_unique UNIQUE (world_slug, turn_run_id)
);
Indexes:
CREATE INDEX turn_runs_world_enqueued_idx
ON turn_runs(world_slug, enqueued_at DESC);
CREATE INDEX turn_runs_world_status_idx
ON turn_runs(world_slug, status);
CREATE UNIQUE INDEX turn_runs_one_active_per_world_idx
ON turn_runs(world_slug)
WHERE status IN ('running', 'cancel_requested');
Alter attempts:
ALTER TABLE attempts
ADD COLUMN turn_run_id UUID,
ADD COLUMN turn_run_seq INT;
ALTER TABLE attempts
ADD CONSTRAINT attempts_turn_run_pair_check
CHECK (
(turn_run_id IS NULL AND turn_run_seq IS NULL)
OR
(turn_run_id IS NOT NULL AND turn_run_seq IS NOT NULL AND turn_run_seq >= 1)
);
ALTER TABLE attempts
ADD CONSTRAINT attempts_turn_run_fk
FOREIGN KEY (world_slug, turn_run_id)
REFERENCES turn_runs(world_slug, turn_run_id);
CREATE UNIQUE INDEX attempts_turn_run_seq_idx
ON attempts(turn_run_id, turn_run_seq)
WHERE turn_run_id IS NOT NULL;
CREATE INDEX attempts_turn_run_idx
ON attempts(turn_run_id, turn_run_seq);
Add deferred FKs after both tables exist:
ALTER TABLE turn_runs
ADD CONSTRAINT turn_runs_active_attempt_fk
FOREIGN KEY (world_slug, active_attempt_id)
REFERENCES attempts(world_slug, attempt_id)
DEFERRABLE INITIALLY DEFERRED;
ALTER TABLE turn_runs
ADD CONSTRAINT turn_runs_last_attempt_fk
FOREIGN KEY (world_slug, last_attempt_id)
REFERENCES attempts(world_slug, attempt_id)
DEFERRABLE INITIALLY DEFERRED;
ALTER TABLE worlds
ADD CONSTRAINT worlds_active_turn_run_fk
FOREIGN KEY (slug, active_turn_run_id)
REFERENCES turn_runs(world_slug, turn_run_id)
DEFERRABLE INITIALLY DEFERRED;
Add these public DTOs to src/world_store/mod.rs:
pub struct TurnRunId(pub Uuid);
pub enum TurnRunStatus {
Running,
CancelRequested,
Completed,
Failed,
Cancelled,
Interrupted,
}
pub struct StartTurnRunInput {
pub world_slug: Slug,
pub worker_id: String,
pub requested_turn_count: u32,
pub max_attempts: u32,
pub turn_count_source: String,
pub max_attempts_source: String,
}
pub struct TurnRunRecord {
pub turn_run_id: TurnRunId,
pub world_slug: Slug,
pub status: TurnRunStatus,
pub requested_turn_count: u32,
pub max_attempts: u32,
pub start_turn: u64,
pub target_turn: u64,
pub committed_turn_count: u32,
pub attempt_count: u32,
pub failed_attempt_count: u32,
pub interrupted_attempt_count: u32,
pub active_attempt_id: Option<AttemptId>,
pub last_attempt_id: Option<AttemptId>,
pub progress: Option<String>,
pub cancel_requested_at: Option<DateTime<Utc>>,
pub cancel_reason: Option<String>,
pub failure_reason: Option<String>,
pub enqueued_at: DateTime<Utc>,
pub started_at: DateTime<Utc>,
pub ended_at: Option<DateTime<Utc>>,
}
Add these methods to WorldStore:
async fn start_turn_run(
&self,
input: StartTurnRunInput,
) -> Result<TurnRunRecord, WorldStoreError>;
async fn start_attempt_for_turn_run(
&self,
turn_run_id: TurnRunId,
worker_id: &str,
) -> Result<ClaimedAttempt, WorldStoreError>;
async fn finish_turn_run_attempt(
&self,
turn_run_id: TurnRunId,
attempt_id: AttemptId,
) -> Result<TurnRunRecord, WorldStoreError>;
async fn get_turn_run_status(
&self,
turn_run_id: TurnRunId,
) -> Result<TurnRunRecord, WorldStoreError>;
async fn request_turn_run_cancel(
&self,
turn_run_id: TurnRunId,
reason: String,
) -> Result<TurnRunRecord, WorldStoreError>;
async fn finalize_cancelled_turn_run_if_idle(
&self,
turn_run_id: TurnRunId,
) -> Result<TurnRunRecord, WorldStoreError>;
The exact method names above must be used.
start_turn_run
This method must:
Reject if active_attempt_id is not null.
Reject if active_turn_run_id is not null.
Insert a turn_runs row with status running.
Set worlds.active_turn_run_id.
Return TurnRunRecord.
start_attempt
Update existing start_attempt so it rejects when worlds.active_turn_run_id is not null.
The error must be WorldStoreError::Busy.
The returned busy attempt id string should be the active_turn_run_id string when no active attempt exists.
start_attempt_for_turn_run
This method must:
Verify the turn run status is running.
Verify worlds.active_turn_run_id == turn_run_id.
Verify worlds.active_attempt_id IS NULL.
Verify attempt_count < max_attempts.
start_attempt.
Insert the attempts row.
Set attempts.turn_run_id.
Set attempts.turn_run_seq = turn_runs.attempt_count + 1.
Set worlds.active_attempt_id.
Set turn_runs.active_attempt_id.
Set turn_runs.last_attempt_id.
Increment turn_runs.attempt_count.
Return ClaimedAttempt.
finish_turn_run_attempt
This method must:
Lock the turn run row.
Lock the attempt row.
Verify the attempt belongs to the turn run.
Verify the attempt is terminal: committed, failed, or interrupted.
Clear turn_runs.active_attempt_id if it equals the attempt id.
Increment counters:
  committed_turn_count += 1 if attempt status is committed
  failed_attempt_count += 1 if attempt status is failed
  interrupted_attempt_count += 1 if attempt status is interrupted
If committed_turn_count == requested_turn_count, set status completed, set ended_at, clear worlds.active_turn_run_id.
Else if status is cancel_requested, set status cancelled, set ended_at, clear worlds.active_turn_run_id.
Else if attempt_count >= max_attempts, set status failed, set ended_at, set failure reason exactly:
max_attempts exhausted before requested turn_count committed
and clear worlds.active_turn_run_id.
Else leave status running.
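The counter update and the ordered status checks above can be sketched over an in-memory struct. This is a simplified illustration, not the store implementation: RunState, RunStatus, AttemptOutcome, and apply_finished_attempt are hypothetical names, and row locking, ended_at, ownership checks, and clearing worlds.active_turn_run_id are deliberately left out.

```rust
/// Exact failure reason required by the spec.
pub const MAX_ATTEMPTS_EXHAUSTED: &str =
    "max_attempts exhausted before requested turn_count committed";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunStatus {
    Running,
    CancelRequested,
    Completed,
    Failed,
    Cancelled,
}

/// Terminal state of the attempt being accounted for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttemptOutcome {
    Committed,
    Failed,
    Interrupted,
}

/// Hypothetical in-memory view of the turn_runs counters.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RunState {
    pub status: RunStatus,
    pub requested_turn_count: u32,
    pub max_attempts: u32,
    pub attempt_count: u32,
    pub committed_turn_count: u32,
    pub failed_attempt_count: u32,
    pub interrupted_attempt_count: u32,
    pub failure_reason: Option<String>,
}

/// Applies one finished attempt, then evaluates the terminal checks in the
/// order the spec lists them: completed, cancelled, failed, else unchanged.
pub fn apply_finished_attempt(run: &mut RunState, outcome: AttemptOutcome) {
    match outcome {
        AttemptOutcome::Committed => run.committed_turn_count += 1,
        AttemptOutcome::Failed => run.failed_attempt_count += 1,
        AttemptOutcome::Interrupted => run.interrupted_attempt_count += 1,
    }

    if run.committed_turn_count == run.requested_turn_count {
        run.status = RunStatus::Completed;
    } else if run.status == RunStatus::CancelRequested {
        run.status = RunStatus::Cancelled;
    } else if run.attempt_count >= run.max_attempts {
        run.status = RunStatus::Failed;
        run.failure_reason = Some(MAX_ATTEMPTS_EXHAUSTED.to_string());
    }
    // Otherwise the run stays in its current (running) status.
}
```

Checking completion before cancellation means a run whose final requested turn commits while a cancel is pending still ends as completed, which is what the ordering of the steps above implies.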
request_turn_run_cancel
This method must:
Lock the turn run row.
If status is terminal, return the row without mutation.
Set cancel_requested_at and cancel_reason.
If active_attempt_id is null:
  Set status cancelled.
  Set ended_at.
  Clear worlds.active_turn_run_id.
If active_attempt_id is not null:
  Set status cancel_requested.
reconcile_running_attempts
Update startup reconciliation so it also handles turn runs.
After existing running attempts are marked interrupted, mark every turn_runs row with status running or cancel_requested as interrupted.
Set failure reason exactly:
process restart before turn run completed
Set ended_at.
Clear worlds.active_turn_run_id for affected worlds.
Also clear turn_runs.active_attempt_id.
Add a background coordinator in src/kernel.rs or a new module src/turn_runs.rs.
The coordinator must:
Call start_attempt_for_turn_run.
Call Runtime::run_claimed.
Call finish_turn_run_attempt.
The coordinator must not call Runtime::run_turn, because Runtime::run_turn uses start_attempt, and start_attempt must reject while a turn run holds the world.
The coordinator must call Runtime::run_claimed with the ClaimedAttempt returned by start_attempt_for_turn_run.
When Runtime::run_claimed returns an error caused by a normal failed attempt, the coordinator must still call finish_turn_run_attempt and continue if the turn run is still eligible to continue.
When the coordinator itself encounters a store error that prevents progress, mark the turn run failed, set failure_reason to the error string, set ended_at, and clear worlds.active_turn_run_id.
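The coordinator loop described above can be pictured with a synchronous sketch in which the three store/runtime calls are passed in as closures. This is an assumption-laden simplification: the real methods are async and take typed ids, drive_turn_run is a hypothetical name, attempt ids are plain u64 values, and finish_turn_run_attempt is modelled as returning whether the run is still eligible to continue.

```rust
/// Drives one turn run to a terminal state and returns how many attempts ran.
///
/// - `start_attempt_for_turn_run` claims the next attempt or returns a store error.
/// - `run_claimed` executes the attempt; an Err here is a normal failed attempt.
/// - `finish_turn_run_attempt` accounts for the attempt and reports whether the
///   turn run is still running.
pub fn drive_turn_run(
    mut start_attempt_for_turn_run: impl FnMut() -> Result<u64, String>,
    mut run_claimed: impl FnMut(u64) -> Result<(), String>,
    mut finish_turn_run_attempt: impl FnMut(u64) -> Result<bool, String>,
) -> Result<u32, String> {
    let mut attempts_run = 0u32;
    loop {
        // Attempts are created lazily, one at a time. A store error here
        // prevents progress, so it is propagated to the caller, which would
        // then mark the turn run failed.
        let attempt_id = start_attempt_for_turn_run()?;
        attempts_run += 1;

        // A failed attempt must not stop the run by itself, so the outcome
        // is intentionally not propagated with `?`.
        let _attempt_outcome = run_claimed(attempt_id);

        // Always account for the attempt, even when it failed.
        let still_running = finish_turn_run_attempt(attempt_id)?;
        if !still_running {
            return Ok(attempts_run);
        }
    }
}
```

Because cancellation and max_attempts are both evaluated inside finish_turn_run_attempt, the loop itself needs no extra checks between attempts.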
Add these tools to CONSUMER_TOOLS and ALL_TOOLS:
get_turn_run_status
cancel_turn_run
Keep the tool partition test passing.
Update consumer_tool_contract:
run_turn schema includes turn_count and max_attempts.
get_turn_run_status schema added.
cancel_turn_run schema added.
list_attempts schema accepts optional turn_run_id.
Update validate_tool_args / unknown-key rejection accordingly.
handle_run_turn
Implement this exact branch structure:
Parse world_slug.
Parse turn_count, default 1.
Parse max_attempts, default turn_count.
Validate:
  turn_count >= 1
  turn_count <= 100000
  max_attempts >= 1
  max_attempts <= 1000000
  max_attempts >= turn_count
If turn_count == 1 and max_attempts == 1, run the existing single-attempt path.
Otherwise:
  Call world_store.start_turn_run.
handle_get_turn_run_status
Return the status shape described above.
When active_attempt_id is present, include:
{
"poll_active_attempt_with": {
"tool": "get_turn_status",
"args": {
"world_slug": "...",
"attempt_id": "..."
}
}
}
When active_attempt_id is null, poll_active_attempt_with must be null.
handle_cancel_turn_run
Call world_store.request_turn_run_cancel.
Return the same status shape as get_turn_run_status.
store_attempt_to_json
Include:
{
"turn_run_id": null,
"turn_run_seq": null
}
Populate when present.
Add and update tests in src/mcp/tests.rs, src/world_store/memory.rs, and src/world_store/postgres.rs.
run_turn_without_turn_count_preserves_single_attempt_contract
  run_turn with only world_slug.
  run_mode: "single_attempt".
  attempt_id.
  turn_run_id.
  turn_count_source: "default".
run_turn_with_turn_count_starts_turn_run
  run_turn with turn_count: 3.
  run_mode: "turn_run".
  turn_run_id.
  attempt_id.
  get_turn_run_status until terminal.
  completed.
run_turn_with_explicit_one_uses_single_attempt_path
  run_turn with turn_count: 1.
  run_mode: "single_attempt".
  turn_count_source: "explicit".
run_turn_rejects_invalid_turn_count
  turn_count: 0 rejected.
  turn_count: 100001 rejected.
  max_attempts < turn_count rejected.
get_turn_run_status_reports_active_and_recent_attempts
  list_attempts_with.
cancel_turn_run_stops_before_next_attempt
  cancel_turn_run.
  cancel_requested or cancelled.
  cancelled.
list_attempts_filters_by_turn_run_id
  turn_run_id.
start_turn_run_claims_world_run_lease
  active_turn_run_id.
  start_attempt is rejected while turn run is active.
start_attempt_for_turn_run_creates_attempt_with_sequence
  turn_run_id.
  turn_run_seq.
finish_turn_run_attempt_completes_on_requested_commits
  completed.
  worlds.active_turn_run_id cleared.
finish_turn_run_attempt_fails_on_max_attempts
  failed when max attempts exhausted.
request_turn_run_cancel_is_idempotent
reconcile_running_attempts_interrupts_turn_runs
  interrupted on reconcile.
  worlds.active_turn_run_id is cleared.
Update consumer_tool_manifest_schemas_match_handler_contracts.
Update consumer_manifest_examples_execute so the examples cover:
run_turn
run_turn
get_turn_run_status
cancel_turn_run
This ticket is complete when all of the following are true:
run_turn({"world_slug": "x"}) still starts exactly one attempt and returns an attempt_id.
run_turn({"world_slug": "x", "turn_count": 40}) starts one durable turn run and returns a turn_run_id.
get_turn_status and list_attempts.
get_turn_run_status reports committed count, attempt count, failed count, interrupted count, active attempt, last attempt, and terminal status.
cancel_turn_run prevents the turn run from starting another attempt after the active attempt finishes.
turn_count and max_attempts were supplied or defaulted.
Revised cleanup is implemented, committed, pushed, deployed, and live-smoked.
Compliance matrix by review item:
cancel_turn_run validates world_slug before mutation: fixed in src/mcp.rs; the handler now rejects mismatched turn_run_id/world_slug immediately after the initial status read and before parsing/storing cancellation reason.
get_turn_status validates world_slug: fixed in src/mcp.rs; wrong-world attempt lookups now return UNKNOWN_ATTEMPT before payload serialization.
src/server.rs and src/read_models.rs; /w/:slug/turn-run/:id and /w/:slug/attempt/:id now pass expected world_slug through DetailRequest and reject mismatches.
finish_turn_run_attempt is idempotent: fixed in src/world_store/memory.rs and src/world_store/postgres.rs; counters only mutate when the finishing attempt is still the active attempt, and duplicate finish calls for the last attempt return the current record without double-counting.
fail_turn_run no longer orphans a running active attempt: fixed in memory and Postgres stores; the active running attempt is terminalized as interrupted before world leases are cleared, and interrupted_attempt_count is incremented.
TurnRunRecord in src/world_store/mod.rs now carries turn_count_source and max_attempts_source; MCP status/cancel responses and browse JSON include those fields plus deterministic hint strings.
Files changed:
src/mcp.rs: ownership validation for cancel/status, public status metadata/hints.
src/world_store/mod.rs: TurnRunRecord source fields.
src/world_store/memory.rs: idempotent finish, failure cleanup, source-field serialization, memory regressions.
src/world_store/postgres.rs: idempotent finish, failure cleanup in transaction, source-field serialization, Postgres regressions.
src/read_models.rs: source fields/hints in browse JSON and world-scoped detail mismatch checks.
src/server.rs: passes world slug into attempt/turn-run detail read models.
src/mcp/tests.rs: MCP wrong-world and source-observability regressions.
tests/phase_g_routes.rs: browse-route wrong-world regressions and turn-run source metadata assertions.
New regression tests added:
mcp::tests::get_turn_run_status_rejects_wrong_world
mcp::tests::cancel_turn_run_rejects_wrong_world_without_mutating
mcp::tests::get_turn_status_rejects_attempt_from_wrong_world
phase_g_routes::world_scoped_turn_run_detail_rejects_wrong_world
phase_g_routes::world_scoped_attempt_detail_rejects_wrong_world
world_store::memory::tests::finish_turn_run_attempt_does_not_double_count_same_attempt
world_store::postgres::tests::finish_turn_run_attempt_does_not_double_count_same_attempt
world_store::memory::tests::turn_run_failure_path_does_not_orphan_running_attempt
world_store::postgres::tests::turn_run_failure_path_does_not_orphan_running_attempt
Verification run:
git diff --check: passed with no output.
cargo test --lib --features test-fixtures: test result: ok. 640 passed; 0 failed.
cargo test --tests --features test-fixtures: passed, exit 0; integration targets passed including phase_g_routes with 18 passed; 0 failed.
cargo test --tests --features test-fixtures,postgres-tests -- --test-threads=1: passed against DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/postgres; lib section reported 808 passed; 0 failed, migrations 7 passed; 0 failed, phase_g_routes 18 passed; 0 failed, and final command exit 0.
Deploy and artifact receipts:
main from 99f1bec to 4e5fc4b on gitlab.
bash k8s/deploy.sh from /srv/chukwa; rollout succeeded.
chukwa-84c4d5c88-ks9tw, Ready 1/1, Running, 0 restarts.
https://chukwa.benac.dev/healthz returns ok.
https://chukwa.benac.dev/chukwa-repo.zip was fetched after deploy and contains the new world_slug mismatch rejection code in repo/src/mcp.rs and the new turn_count_source fields in repo/src/world_store/mod.rs.
Live smoke against https://chukwa.benac.dev/mcp:
ice_empty_fish_seed.
run_turn({world_slug}) on smoke_single_0501130102 returned single attempt_id 4722f58d-9229-4072-aaac-8f52b086c22b with run_mode: single_attempt; correct-world get_turn_status later reported status: committed.
get_turn_status with that attempt and world_slug: smoke_cancel_b_0501130102 returned UNKNOWN_ATTEMPT.
run_turn({world_slug: smoke_multi_0501130102, turn_count: 3}) returned turn run 08120251-f21c-4e6b-a34e-f5543a985b5e; final get_turn_run_status(include_attempts=true) reported status: completed, committed_turn_count: 3, attempt_count: 3, and recent attempts newest-first with turn_run_seq 3, 2, 1.
cancel_turn_run against turn run 3366fa04-d2c7-44e8-8b4b-7a5dd1359e42 using world_slug: smoke_cancel_b_0501130102 returned UNKNOWN_TURN_RUN.
get_turn_run_status for smoke_cancel_a_0501130102 still reported status: running, cancel_requested_at: null, and cancel_reason: null, proving no mutation occurred. That run later completed normally with committed_turn_count: 3, attempt_count: 3.
I’m picking this up now. I’ll implement durable turn runs while preserving one attempt per attempted turn, update memory/Postgres stores and migrations, add the new consumer tools, update run_turn/list_attempts/get_turn_status contracts, and verify with containerized Rust and Postgres test runs.
Implemented, committed, pushed, deployed, and verified.
Compliance and verification matrix:
Verification run:
I reviewed the uploaded repo for 0cac6740-2728-4e29-acec-5047e25f23f4 against the actual implementation at 99f1bec. The feature is close and the broad architecture is right: multi-turn execution uses one normal attempt per attempted turn, Runtime::run_claimed is used rather than Runtime::run_turn, attempts are linked with turn_run_id/turn_run_seq, and the consumer surface exposes run_turn, get_turn_run_status, cancel_turn_run, and list_attempts(turn_run_id).
I am not accepting the ticket yet. Please handle the cleanup in this same ticket, not as follow-ups. The remaining issues are lifecycle/ownership correctness problems, not cosmetic polish.
cancel_turn_run mutates before validating world_slug ownership
In src/mcp.rs, turn_run_status_to_json has the ownership check:
if record.world_slug.as_str() != slug.as_str() {
return Err(McpError::new("UNKNOWN_TURN_RUN", ...));
}
That check is at src/mcp.rs:3851-3859.
But handle_cancel_turn_run does this:
let before = env.world_store.get_turn_run_status(turn_run_id).await?;
let was_terminal = before.status.is_terminal();
...
let record = if was_terminal {
before
} else {
env.world_store
.request_turn_run_cancel(turn_run_id, reason)
.await?
};
let mut payload = turn_run_status_to_json(env, &slug, record, false, 20).await?;
That is at src/mcp.rs:3976-4007.
So a caller can provide:
{
"world_slug": "wrong_world",
"turn_run_id": "<valid turn run from another world>"
}
and the handler can cancel the turn run before discovering the world mismatch.
This is the most important bug to fix. The public input includes world_slug; mutation must not happen until the handler proves the turn_run_id belongs to that world.
Please fix by validating immediately after the first get_turn_run_status read and before reason parsing or request_turn_run_cancel:
let before = env.world_store.get_turn_run_status(turn_run_id).await?;
if before.world_slug.as_str() != slug.as_str() {
return Err(McpError::new(
"UNKNOWN_TURN_RUN",
format!(
"turn_run_id {} does not belong to world_slug {:?}",
turn_run_id.as_uuid(),
slug.as_str()
),
));
}
Required regression test:
cancel_turn_run_rejects_wrong_world_without_mutating
Test shape:
Create world_a and world_b.
Start a turn run in world_a.
Call cancel_turn_run with world_slug = world_b and turn_run_id from world_a.
Assert the error is UNKNOWN_TURN_RUN or equivalent.
Read the turn run back from world_a.
Assert its status is not cancel_requested and not cancelled.
This should be covered at the MCP layer, not only store-level.
get_turn_status also does not validate world_slug
handle_get_turn_status currently parses world_slug, fetches the attempt by global UUID, and returns it without checking that the attempt belongs to the supplied world:
let slug = require_world_slug(args)?;
...
let record = env.world_store.get_attempt_status(attempt_id).await?;
let mut payload = store_attempt_to_json(&record);
This is at src/mcp.rs:4024-4083.
That means callers can query an attempt from world_a while passing world_slug = world_b, and the response will still return the attempt. The response does include the actual record.world_slug, but the request contract says world_slug identifies the world. This is the same ownership class as the cancel_turn_run bug, just non-mutating.
Please add:
if record.world_slug.as_str() != slug.as_str() {
return Err(McpError::unknown_attempt(id_str, &slug));
}
or an equivalent UNKNOWN_ATTEMPT / UNKNOWN_TURN_STATUS response before returning the payload.
Required regression test:
get_turn_status_rejects_attempt_from_wrong_world
Test shape:
Create an attempt in world_a.
Call get_turn_status with world_slug = world_b, attempt_id = world_a_attempt.
This matters more now because turn-run attempts are explicitly linked and exposed through status/polling helpers.
The resolution says the human comment was addressed by making turn_runs browseable and adding world-scoped routes:
/w/:slug/turn-runs
/w/:slug/turn-run/:turn_run_id
But the world-scoped detail route does not validate that the turn run belongs to :slug.
In src/server.rs:
async fn turn_run_detail_world(
Path((slug, turn_run_id)): Path<(String, String)>,
...
) -> Response {
...
turn_run_detail_render(state, q, turn_run_id, &instance, Some(&slug)).await
}
This is src/server.rs:2472-2482.
turn_run_detail_render builds:
let req = DetailRequest::new(ResourceKind::TurnRun).with("turn_run_id", &turn_run_id);
let result = read_models::load_detail(&env, req).await;
This is src/server.rs:2404-2413.
Then load_turn_run_detail loads only by turn_run_id:
let record = env.world_store.get_turn_run_status(turn_run_id).await?;
let attempts = env.world_store.list_attempts_for_turn_run(&record.world_slug, turn_run_id).await?;
This is src/read_models.rs:1699-1724.
So /w/wrong_world/turn-run/<valid-id> can render a turn run from another world while putting the wrong world slug in page context.
Please fix the world-scoped route to validate ownership. Options:
Pass world_slug to the DetailRequest and make load_turn_run_detail reject mismatches.
Validate in turn_run_detail_render after loading payload/record.
Required regression test:
world_scoped_turn_run_detail_rejects_wrong_world
Also check the same pattern for attempt detail:
/w/:slug/attempt/:attempt_id
attempt_detail_world similarly passes the slug only as render context (src/server.rs:2391-2401), while load_attempt_detail loads by attempt id only (src/read_models.rs:1800-1822). That preexisting pattern should be corrected while we are fixing ownership validation for turn-run browseability.
finish_turn_run_attempt is not idempotent and can double-count
Both Postgres and memory implementations increment counters every time finish_turn_run_attempt is called for a terminal attempt.
Postgres:
match attempt_status {
AttemptStatus::Committed => run.committed_turn_count += 1,
AttemptStatus::Failed => run.failed_attempt_count += 1,
AttemptStatus::Interrupted => run.interrupted_attempt_count += 1,
AttemptStatus::Running => {}
}
src/world_store/postgres.rs:1868-1873.
Memory store has the same shape at src/world_store/memory.rs:1559-1564.
There is no guard that the attempt is still the run’s active attempt, nor an “accounted” marker. A duplicate call can corrupt:
committed_turn_count
failed_attempt_count
interrupted_attempt_count
terminal status
In normal coordinator flow this may not happen, but this is a durable lifecycle subsystem. Store methods should be robust against duplicate coordinator calls, retries, or accidental re-entry.
Please make finish_turn_run_attempt idempotent or reject already-accounted attempts before incrementing counters. One reasonable approach:
Require turn_runs.active_attempt_id == attempt_id before counter mutation.
If active_attempt_id is null and last_attempt_id == attempt_id, return the current record or return a clear invalid transition without changing counters.
Required regression tests in both stores:
finish_turn_run_attempt_does_not_double_count_same_attempt
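A sketch of that guard, assuming a simplified in-memory TurnRun with numeric attempt ids. The FinishError type and the choice to treat a duplicate call as a no-op (rather than an invalid transition) are assumptions for illustration, not the store's actual API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AttemptStatus {
    Running,
    Committed,
    Failed,
    Interrupted,
}

/// Hypothetical, trimmed-down turn-run state.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct TurnRun {
    pub active_attempt_id: Option<u64>,
    pub last_attempt_id: Option<u64>,
    pub committed_turn_count: u32,
    pub failed_attempt_count: u32,
    pub interrupted_attempt_count: u32,
}

#[derive(Debug, PartialEq)]
pub enum FinishError {
    AttemptStillRunning,
    NotActiveAttempt,
}

pub fn finish_turn_run_attempt(
    run: &mut TurnRun,
    attempt_id: u64,
    status: AttemptStatus,
) -> Result<(), FinishError> {
    // Already accounted: a duplicate call leaves every counter untouched.
    if run.active_attempt_id.is_none() && run.last_attempt_id == Some(attempt_id) {
        return Ok(());
    }
    // Only the run's current active attempt may change counters.
    if run.active_attempt_id != Some(attempt_id) {
        return Err(FinishError::NotActiveAttempt);
    }
    match status {
        AttemptStatus::Running => return Err(FinishError::AttemptStillRunning),
        AttemptStatus::Committed => run.committed_turn_count += 1,
        AttemptStatus::Failed => run.failed_attempt_count += 1,
        AttemptStatus::Interrupted => run.interrupted_attempt_count += 1,
    }
    // Mark the attempt as accounted so a retry hits the no-op branch above.
    run.active_attempt_id = None;
    run.last_attempt_id = Some(attempt_id);
    Ok(())
}
```

In Postgres the equivalent would be to make the counter UPDATE conditional on active_attempt_id matching, inside the same transaction that clears it.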
Test shape:
Call finish_turn_run_attempt once.
Call finish_turn_run_attempt again with the same attempt id.
fail_turn_run can clear worlds.active_attempt_id while leaving a running attempt row
fail_turn_run is called by the coordinator when start/finish progress fails:
let _ = world_store.fail_turn_run(turn_run_id, reason.clone()).await;
See src/turn_runs.rs:51-57 and src/turn_runs.rs:81-96.
But fail_turn_run clears worlds.active_attempt_id without necessarily terminalizing the active attempt.
Postgres:
UPDATE worlds SET active_turn_run_id = NULL, active_attempt_id = NULL
WHERE slug = $1 AND active_turn_run_id = $2
src/world_store/postgres.rs:2177-2184.
Memory:
w.active_turn_run_id = None;
w.active_attempt_id = None;
src/world_store/memory.rs:1751-1757.
If Runtime::run_claimed returns due to a store error before the attempt row becomes terminal, then finish_turn_run_attempt rejects the still-running attempt, the coordinator calls fail_turn_run, and the world can be left with:
attempts.status = running
worlds.active_attempt_id = NULL
worlds.active_turn_run_id = NULL
turn_runs.status = failed
That creates an orphan running attempt. Because there is a partial unique index on running attempts, this may also block future attempts until startup reconciliation. Startup recovery is a safety net, not the normal way to repair a live-process error.
Please make the coordinator/store failure path preserve lifecycle invariants. Options:
fail_turn_run should terminalize the active attempt as interrupted or failed in the same transaction before clearing worlds.active_attempt_id.
Do not clear active_attempt_id when an active attempt remains non-terminal.
[...] fail_turn_run.
Required regression test:
turn_run_coordinator_failure_does_not_orphan_running_attempt
The test can be store-level if a full fake coordinator test is too expensive.
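A store-level sketch of the first option, assuming a single-world in-memory store. The Store layout, numeric ids, and the choice of Interrupted as the terminal status are assumptions; in Postgres the same steps would run inside one transaction.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AttemptStatus {
    Running,
    Committed,
    Failed,
    Interrupted,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TurnRunStatus {
    Running,
    Failed,
}

/// Hypothetical single-world store; the real stores track many worlds.
pub struct Store {
    pub active_turn_run_id: Option<u64>,
    pub active_attempt_id: Option<u64>,
    pub attempts: HashMap<u64, AttemptStatus>,
    pub turn_runs: HashMap<u64, TurnRunStatus>,
}

impl Store {
    /// Fails the run and releases the world lease, terminalizing any
    /// still-running active attempt first. Returns false if the run does
    /// not hold the lease, in which case nothing is changed.
    pub fn fail_turn_run(&mut self, turn_run_id: u64) -> bool {
        if self.active_turn_run_id != Some(turn_run_id) {
            return false;
        }
        // Terminalize before clearing the lease so no running row is orphaned.
        if let Some(attempt_id) = self.active_attempt_id {
            if let Some(status) = self.attempts.get_mut(&attempt_id) {
                if *status == AttemptStatus::Running {
                    *status = AttemptStatus::Interrupted;
                }
            }
        }
        self.turn_runs.insert(turn_run_id, TurnRunStatus::Failed);
        self.active_turn_run_id = None;
        self.active_attempt_id = None;
        true
    }
}
```

A regression test in this shape would start an attempt, call fail_turn_run while the attempt is still running, and assert that no attempt row remains in the running state.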
Source fields are persisted but not exposed on TurnRunRecord
The migration stores:
turn_count_source
max_attempts_source
But TurnRunRecord does not expose them:
pub struct TurnRunRecord {
    pub turn_run_id: TurnRunId,
    pub world_slug: Slug,
    pub status: TurnRunStatus,
    pub requested_turn_count: u32,
    pub max_attempts: u32,
    ...
}
src/world_store/mod.rs:388-409.
turn_run_record_from_row selects turn_count_source and max_attempts_source, but because the DTO has no fields, the values are discarded.
The initial run_turn response includes the deterministic source/hint fields, which is good. But after the initial response, get_turn_run_status cannot tell the operator whether turn_count and max_attempts were explicit or defaulted.
Please add to TurnRunRecord:
pub turn_count_source: String,
pub max_attempts_source: String,
Then surface them in:
get_turn_run_status
cancel_turn_run
turn-run browse detail/list JSON
Optional but preferable: include the same deterministic hint strings in get_turn_run_status.
This is not as severe as the ownership bugs, but the DB already persists the information, and the ticket emphasized deterministic default/explicit visibility.
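As a rough illustration of carrying the source columns through to a status payload, here is a trimmed-down sketch. The field subset, the String id, the turn_run_status_fields helper, and the "explicit"/"default" values in the usage note are all assumptions; the real record has more fields and its own serialization.

```rust
/// Trimmed-down, hypothetical view of the record; the real struct has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct TurnRunRecord {
    pub turn_run_id: String,
    pub requested_turn_count: u32,
    pub max_attempts: u32,
    pub turn_count_source: String,
    pub max_attempts_source: String,
}

/// Hypothetical payload builder: key/value pairs a status response could serialize.
/// Each source value sits next to the number it qualifies.
pub fn turn_run_status_fields(record: &TurnRunRecord) -> Vec<(&'static str, String)> {
    vec![
        ("turn_run_id", record.turn_run_id.clone()),
        ("requested_turn_count", record.requested_turn_count.to_string()),
        ("turn_count_source", record.turn_count_source.clone()),
        ("max_attempts", record.max_attempts.to_string()),
        ("max_attempts_source", record.max_attempts_source.clone()),
    ]
}
```

With a record whose turn_count_source is "explicit" and max_attempts_source is "default", the payload would tell an operator that only the turn count was chosen by the caller.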
The current test additions cover happy paths and basic cancellation, but they do not cover cross-world mismatch mutation/lookup behavior.
Please add at least:
cancel_turn_run_rejects_wrong_world_without_mutating
get_turn_run_status_rejects_wrong_world
get_turn_status_rejects_attempt_from_wrong_world
world_scoped_turn_run_detail_rejects_wrong_world
world_scoped_attempt_detail_rejects_wrong_world
finish_turn_run_attempt_does_not_double_count_same_attempt
turn_run_failure_path_does_not_orphan_running_attempt
These should run against MemoryWorldStore, and the store-level lifecycle tests should also be mirrored in Postgres where practical.
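For the mutation-side tests, the key property is that the ownership check runs before any state change. A sketch of that ordering, assuming a hypothetical in-memory map of runs and a made-up CancelRequested status (the real status names may differ):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum TurnRunStatus {
    Running,
    CancelRequested,
}

/// Hypothetical, trimmed-down turn run.
#[derive(Debug, Clone, PartialEq)]
pub struct TurnRun {
    pub world_slug: String,
    pub status: TurnRunStatus,
}

#[derive(Debug, PartialEq)]
pub enum CancelError {
    NotFound,
    WrongWorld,
}

/// Requests cancellation only if the run belongs to `world_slug`.
pub fn cancel_turn_run(
    runs: &mut HashMap<u64, TurnRun>,
    world_slug: &str,
    turn_run_id: u64,
) -> Result<TurnRun, CancelError> {
    let run = runs.get_mut(&turn_run_id).ok_or(CancelError::NotFound)?;
    // Ownership is checked before the first write, so a mismatch mutates nothing.
    if run.world_slug != world_slug {
        return Err(CancelError::WrongWorld);
    }
    run.status = TurnRunStatus::CancelRequested;
    Ok(run.clone())
}
```

A test such as cancel_turn_run_rejects_wrong_world_without_mutating would snapshot the run, call this with the wrong slug, and compare the stored run against the snapshot.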
Please move this ticket back to in_progress, fix the above in the same ticket, then re-propose resolution with:
Code changes called out by file.
New regression tests listed by exact test name.
Test run output for:
cargo test --lib --features test-fixtures
cargo test --tests --features test-fixtures
cargo test --tests --features test-fixtures,postgres-tests -- --test-threads=1
One live smoke that proves:
run_turn({world_slug}) still returns a single attempt_id.
run_turn({world_slug, turn_count: 3}) completes with three attempts.
cancel_turn_run with the wrong world_slug errors and does not mutate.
get_turn_status with the wrong world_slug errors.
get_turn_run_status(include_attempts=true) still returns newest-first recent attempts.
The implementation is close and the core architecture is good. The remaining work is about making ownership and lifecycle invariants as strict as the rest of the substrate. Once those are fixed in-ticket, this should be acceptable.
Moving this back to in_progress and fixing the ownership/lifecycle issues called out in the human review in this same ticket.
Revised cleanup is implemented, committed, pushed, deployed, and live-smoked. Commit: 4e5fc4b (fix(turn-runs): harden ownership and lifecycle).
Caller accepted the proposed resolution.
Ticket created: Add multi-turn run_turn support with durable turn-run lifecycle