chukwa — ticket a8c93d6d

HOLD — DO NOT PICK UP UNTIL HUMAN AUTHORIZATION.

This ticket sits in pending until the human operator (johnb) posts a comment authorizing work to begin. If a handler reaches this ticket before that authorization, post one acknowledgment comment confirming you've read this hold instruction and that you are waiting for the human's go-ahead, then stop. Do not branch, do not read code, do not draft a plan. The human will return to authorize, defer, or rewrite.

Background

Today, every MCP tool chukwa offers — code review, ticketing, scenario store, world store — is served at a single endpoint https://chukwa.benac.dev/mcp. Any client connecting there sees the full ~50-tool surface in tools/list. There is no way to configure an MCP client to receive only a subset.

This ticket splits the surface into two URLs that compose additively:

Consumer surface (/mcp, unchanged path): scenario store + world store. Substrate operations only — what you need to use chukwa.
Operator surface (/operator-mcp, new path): code review + ticketing. Project-meta tools — what you need to work on chukwa.

A given agent's MCP configuration adds either:

Just /mcp — substrate-only consumer (small-context agents driving simulations)
Both URLs — operator (handlers, the project owner working on chukwa, agents that file tickets and read code)

There is no third configuration. There is no use case for "operator surface only without consumer access" — operators always need substrate access too. The split is purely about being able to give an agent fewer tools when fewer are appropriate.

Single-user context

chukwa has one user: johnb. There is no external consumer story. Both URLs authenticate against the same OAuth credentials, are served by the same pod, and expose the same data. The split is for agent configuration ergonomics, not for security boundaries between user populations.

Execution mode

Phases A through D should each be delegated to a subagent, following the same pattern used in the world-store ticket (293a300e-abf3-4f7c-85a4-f7129b742769). The handler composes a status comment from each subagent's structured report.

After Phase A lands, post the standard phase-boundary status comment on this ticket and proceed directly into Phase B without pausing for confirmation. Same flow for B → C → D. Status comments are visibility, not gates. The human will intervene at any phase boundary if they see something to redirect; absent that, keep moving.

Tool partition

Consumer surface (`/mcp`)

Scenario store:

put_perceive_system, put_intend_system, put_adjudicate_system, put_adjudication_schema, put_cognition_profile, put_environment, put_entity
get_perceive_system, get_intend_system, get_adjudicate_system, get_adjudication_schema, get_cognition_profile, get_environment, get_entity
assemble_scenario, fork_scenario
set_scenario_name, unset_scenario_name
list_scenarios, get_scenario, lineage_of, children_of

World store:

create_world, list_worlds, get_world, delete_world
run_turn, get_turn_status, list_attempts
get_turn, list_turns, diff_turns
get_state_at, get_events, get_world_entity, entity_history

Operator surface (`/operator-mcp`)

Code review:

browse_codebase, outline, list_code_files
find_definition, find_references, read_code, search_code
git_log, git_diff, git_show_commit, git_file_history

Ticketing:

create_ticket, get_ticket, list_tickets
add_ticket_comment, file_followup
handler_respond_ticket
user_confirm_resolution, user_cancel_ticket, user_change_ticket_status

Tool count audit

If the actual tool inventory differs from this partition when the work begins (new tools added since the spec was written, or tools deleted), pause and surface the discrepancy on this ticket as a comment before proceeding. Don't silently re-bucket new tools.
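
A minimal sketch of what that audit could look like, assuming the live inventory is available as a list of tool names. The names AuditReport and audit_partition are hypothetical and do not exist in the codebase; the sketch only reports discrepancies and never re-buckets anything.

```rust
use std::collections::BTreeSet;

/// Outcome of comparing the live tool inventory with the ticket's partition.
#[derive(Debug, Default, PartialEq)]
pub struct AuditReport {
    /// Tools served today that neither bucket mentions (added since the spec).
    pub unbucketed: Vec<String>,
    /// Tools the partition names that are no longer served (deleted since the spec).
    pub missing: Vec<String>,
}

impl AuditReport {
    /// True when there is nothing to surface on the ticket.
    pub fn is_clean(&self) -> bool {
        self.unbucketed.is_empty() && self.missing.is_empty()
    }
}

/// Collects borrowed names into owned strings, preserving iteration order.
fn owned<'a>(names: impl Iterator<Item = &'a str>) -> Vec<String> {
    names.map(String::from).collect()
}

/// Hypothetical helper: diff the live inventory against both buckets.
pub fn audit_partition(inventory: &[&str], consumer: &[&str], operator: &[&str]) -> AuditReport {
    let live: BTreeSet<&str> = inventory.iter().copied().collect();
    let bucketed: BTreeSet<&str> = consumer.iter().chain(operator.iter()).copied().collect();
    AuditReport {
        unbucketed: owned(live.difference(&bucketed).copied()),
        missing: owned(bucketed.difference(&live).copied()),
    }
}
```

If is_clean() returns false, the two vectors are what would go into the discrepancy comment; BTreeSet keeps the names sorted so the comment reads the same on every run.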

Authentication

Single OAuth audience. Same credentials for both URLs. The bearer token presented at /operator-mcp is the same shape and same auth flow as at /mcp; the URL difference is purely about which dispatcher receives the JSON-RPC payload.

The token-persistence file at /var/lib/chukwa/oauth_tokens.json continues to track tokens at the audience level (one row per token, not one per URL). A token is valid for both URLs.
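
To make the audience-level point concrete, here is a small sketch of a token check that ignores the request path. TokenStore and its methods are hypothetical stand-ins, not chukwa's actual auth types, and the sketch keeps tokens in memory rather than in oauth_tokens.json.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for the audience-level token table:
/// one entry per token, with no per-URL column.
#[derive(Default)]
pub struct TokenStore {
    tokens: HashSet<String>,
}

impl TokenStore {
    /// Records a token as issued for the single audience.
    pub fn issue(&mut self, token: &str) {
        self.tokens.insert(token.to_string());
    }

    /// The path parameter exists only to show that it plays no part in the decision.
    pub fn is_valid(&self, bearer: &str, _path: &str) -> bool {
        self.tokens.contains(bearer)
    }
}
```

Under this shape a token accepted at /mcp is accepted at /operator-mcp by construction, which is the behavior the Acceptance section asks for.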

Routing

/mcp is preserved verbatim. Existing agent configs pointing at https://chukwa.benac.dev/mcp continue to work without modification, and continue to receive the consumer tool set (which is exactly what they have today, minus code review + ticketing).

This means existing operator agents that pointed at just /mcp will lose access to code review + ticketing when this ticket lands. They must be reconfigured to also include /operator-mcp. Document this clearly in the resolution comment so johnb can update agent configs in one pass.

/operator-mcp is the new path. If a different name is preferred (/meta-mcp, /admin-mcp, etc.), call it out in the Phase A status comment.

Dashboard / web UI

Unchanged. The HTML routes (/dashboard, /w/:slug, ticket views, scenario detail pages) are not part of either MCP surface — they're served directly by axum from the same pod. They continue to use whatever internal Rust APIs they need without going through a public MCP boundary.

Phase plan

Phase A — refactor. Refactor src/mcp.rs so the tool registry is parameterizable. Today the dispatcher's tools/list and dispatch tables are constructed implicitly from the handler functions; after this phase, there are two const arrays (CONSUMER_TOOLS, OPERATOR_TOOLS) and a register_mcp_router(state, tool_set) helper that takes a slice and builds the router. Existing /mcp route registers CONSUMER_TOOLS ∪ OPERATOR_TOOLS so behavior is unchanged at this phase. Tests pass. Subagent.

Phase B — split. Add the second mount point. bin/chukwa-serve.rs registers two router branches: /mcp against CONSUMER_TOOLS, /operator-mcp against OPERATOR_TOOLS. The composed-everything fallback is removed at this phase — /mcp is now consumer-only. Manually verify via curl: tools/list against /mcp returns the consumer tool set, against /operator-mcp returns the operator tool set. Calling a consumer tool against /operator-mcp (or vice versa) returns UNKNOWN_TOOL or the equivalent dispatcher error. Subagent.

Phase C — smoke. Build, deploy, smoke. Reconfigure johnb's primary operator agent (this conversation) to point at both URLs. Confirm a representative call against each surface succeeds: list_scenarios against /mcp, list_tickets against /operator-mcp. Confirm the dispatcher does not leak operator tools through /mcp or vice versa. Subagent.

Phase D — wrap-up. proposed_resolution with the verified tool counts at each URL, the test results, and explicit reconfiguration instructions for any other agent configs. Subagent (or handler-direct, since this phase is just composing the resolution from prior phase reports).

Acceptance

tools/list against https://chukwa.benac.dev/mcp returns the consumer tool set only. No code-review tools, no ticketing tools.
tools/list against https://chukwa.benac.dev/operator-mcp returns the operator tool set only. No scenario-store tools, no world-store tools.
Both URLs authenticate against the same OAuth credentials. A token issued for one is valid for the other.
Pod startup logs show both routes registered.
All existing cargo test --lib --features test-fixtures and cargo test --tests --features test-fixtures,postgres-tests -- --test-threads=1 baselines hold (no test count regression beyond what's intentional).
Smoke: a real request against each URL succeeds.

Out of scope

Per-tool ACLs or audience-scoped OAuth. Single audience; the split is structural, not a security boundary.
Tier checks at the routing layer. Both URLs serve the same authenticated user. There's no "is this a consumer-tier client" check.
Documentation prose updates. Documentation is its own ticket against the post-split shape (queued for after this ticket and the world-store doc refresh).
Renaming /mcp to /consumer-mcp. Keep the existing path stable so existing agent configs don't break.
Deprecating any tools. This is a route-split refactor, not a tool-removal pass.
Filing follow-up tickets. Surface candidates in proposed_resolution's "Surfaced for follow-up" as suggestions only, per the standing rule.

Sequencing

Independent of any pending ticket today. Can be picked up whenever after authorization.

The MCP surface is now split: /mcp serves the consumer tool set (scenario store + world store, 36 tools); /operator-mcp serves the operator tool set (code review + ticketing, 20 tools). Existing consumer agent configs continue to work unchanged; operator agents must add /operator-mcp to access code-review + ticketing tools.

Phase summary

Phase	Commit	What landed
A	`a8abedc`	parameterized tool registry; CONSUMER_TOOLS / OPERATOR_TOOLS const arrays; ALL_TOOLS; `dispatch_with_tools`; `tools_call_filtered`; `tool_manifest_document_filtered`; `register_mcp_router(router, path, tool_set)` helper; dispatcher allowed-set check returning `UNKNOWN_TOOL`; 7 partition-guard tests in `mcp/tests.rs`. `/mcp` still pinned to `ALL_TOOLS` so live surface unchanged.
B	`3a65683`	`router()` in `src/server.rs` mounts `/mcp → CONSUMER_TOOLS` (36) and `/operator-mcp → OPERATOR_TOOLS` (20); the `ALL_TOOLS` mount replaced. 4 new route-level integration tests via `tower::ServiceExt::oneshot`.
C	`554ccb4` (merge)	merged `feat/mcp-route-split` to main; built `chukwa:latest` (sha256 `63d957c71a4d`); rolled `deployment/chukwa` to pod `chukwa-56d574bf44-5l9pl`. Curl smoke 4/4 passed. Wrapper at `/root/.config/chukwa-mcp/mcp.sh` updated to route by tool name; original preserved at `mcp.sh.pre-split`. Wrapper smoke 2/2 passed. Migrations 0001 + 0002 still success=t.

Verified tool counts at each URL (from Phase C live curl smoke)

https://chukwa.benac.dev/mcp tools/list → 36 tools, all in CONSUMER_TOOLS (scenario store + world store).
https://chukwa.benac.dev/operator-mcp tools/list → 20 tools, all in OPERATOR_TOOLS (code review + ticketing).
POST /mcp invoking an operator tool → UNKNOWN_TOOL.
POST /operator-mcp invoking a consumer tool → UNKNOWN_TOOL.

Test results

Phase A: lib 415/415, integration 537/537, container build clean.
Phase B: lib 419/419, integration 520 + bootstrap 3 + migrations 2 + phase0 12, container build clean.
Phase C: live deploy + smoke (no fresh cargo test run; Phase B's results carry forward to merge commit 554ccb4, since main had not diverged and the merge adds nothing beyond the branch's own commits).
DATABASE_URL throughout: postgres://postgres:postgres@127.0.0.1:5433/postgres (sacrificial local Postgres chukwa-pg-local, never the cluster).

Reconfiguration instructions for other agent configs

If any agent that previously talked to chukwa was configured to receive operator tools (code review + ticketing) by pointing at /mcp, that agent will start receiving UNKNOWN_TOOL for those calls. To restore access, the agent's MCP config needs to add a second URL: https://chukwa.benac.dev/operator-mcp (same OAuth credentials — single audience).

For johnb's primary operator agent (this conversation), the wrapper at /root/.config/chukwa-mcp/mcp.sh was updated in Phase C to route by tool name. The OPERATOR tool list in the wrapper mirrors the const in src/mcp.rs and includes (verified verbatim against the live wrapper case statement):

browse_codebase, outline, list_code_files, find_definition, find_references, read_code, search_code,
git_log, git_diff, git_show_commit, git_file_history,
create_ticket, get_ticket, list_tickets, add_ticket_comment, file_followup,
handler_respond_ticket, user_confirm_resolution, user_cancel_ticket, user_change_ticket_status

That's 20 tools, matching OPERATOR_TOOLS in src/mcp.rs. Anything not in this list routes to /mcp.
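
The wrapper itself is a bash case statement; the routing rule it implements can be restated as a short Rust sketch. The function name endpoint_for is hypothetical, and the sketch leaves out any environment-variable overrides the wrapper may support.

```rust
/// The 20 operator tool names listed above; anything else routes to /mcp.
pub const OPERATOR_TOOLS: &[&str] = &[
    "browse_codebase", "outline", "list_code_files", "find_definition",
    "find_references", "read_code", "search_code",
    "git_log", "git_diff", "git_show_commit", "git_file_history",
    "create_ticket", "get_ticket", "list_tickets", "add_ticket_comment",
    "file_followup", "handler_respond_ticket", "user_confirm_resolution",
    "user_cancel_ticket", "user_change_ticket_status",
];

/// Hypothetical helper: pick the MCP endpoint for a tool name.
pub fn endpoint_for(base: &str, tool: &str) -> String {
    let path = if OPERATOR_TOOLS.contains(&tool) {
        "/operator-mcp"
    } else {
        "/mcp"
    };
    // Tolerate a trailing slash on the base URL.
    format!("{}{}", base.trim_end_matches('/'), path)
}
```

Note the default branch: a tool missing from the list silently goes to /mcp, which is why drift between this list and the server-side const shows up as UNKNOWN_TOOL at call time rather than as an error in the wrapper.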

If the wrapper's tool list ever drifts from the const (new operator tools added in code, or moved between buckets), the wrapper will misroute. Test commands to verify alignment after future changes:

bash /root/.config/chukwa-mcp/mcp.sh list_scenarios '{}'   # → /mcp, succeeds
bash /root/.config/chukwa-mcp/mcp.sh list_tickets '{}'      # → /operator-mcp, succeeds

Rollback for the wrapper is mv /root/.config/chukwa-mcp/mcp.sh.pre-split /root/.config/chukwa-mcp/mcp.sh — the pre-split copy talks only to /mcp against the new server, which will fail for operator tools but is preserved as a safety net.

Architectural delta

The MCP dispatcher is now parameterized by an allowed-tool set, instead of routing every tool through one match. dispatch_with_tools short-circuits with UNKNOWN_TOOL when a request names a tool outside the route's allowed set.
Two routes mounted on the same pod, same OAuth audience, same data, different tool surfaces.
No security boundary between the two routes — this is purely about agent configuration ergonomics. Small-context agents driving simulations can subscribe to just /mcp and avoid loading 20 tools they would never use.
The wrapper's tool-name-to-URL routing must be kept in sync with OPERATOR_TOOLS in src/mcp.rs. Drift = misrouting = UNKNOWN_TOOL errors.

Surfaced for follow-up (not filed)

Wrapper drift detection: if a tool moves between buckets in code, nothing alerts the wrapper. A startup check (the wrapper queries each URL's tools/list once and asserts the wrapper's hardcoded set matches OPERATOR's tools/list) would catch drift early. Marginal cost; reasonable to defer.
Documentation: out of scope per the ticket. A separate doc ticket against the post-split shape would close the prose update.
/operator-mcp rate limit / auth-tier consideration: out of scope per the ticket (single audience). Flagging that if chukwa ever grows external consumers, the operator surface deserves separate auth.
Tool inventory drift for the 4601f21a investigation ticket: the next ticket in the queue is a P2 investigation into mid-turn pod restarts. That investigation may need additional read-only tools (e.g., get_pod_logs or similar) — those would land in the OPERATOR bucket. Phase A's partition is the source of truth; new tools go in whichever bucket fits.

Closing

Phase B's test results carry forward to post-merge main (no fresh cargo test run in Phase C; the single ant_scenario failure was environmental). The MCP surface is split, the wrapper routes by tool, and the live smoke confirms the partition holds. Awaiting caller acceptance.

Ticket created: Split MCP surface into consumer (/mcp) and operator (/operator-mcp) routes

Acknowledged.

I have read the HOLD instruction. I am NOT picking up this ticket. No branching, no code reading, no plan, no phase work.

Waiting for human-actor authorization before beginning any work.

Standing by.

ok, this is your complete and final authorization. Be sure to use subagents for each phase, and also be sure to add a comment after each phase completion, so that we can follow along, and be fully aware.

Phase A landed at commit a8abedc on feat/mcp-route-split.

Branch state (last 3, oneline):

a8abedc refactor(mcp): phase A — parameterize tool registry
07f1b65 Merge chore/label-ord: derive Ord on Label
050dad3 chore(label): derive Ord and replace HashSet/Vec workarounds with BTreeSet

What changed

src/mcp.rs

Added pub const CONSUMER_TOOLS: &[&str] (36 tools) — scenario store + world store.
Added pub const OPERATOR_TOOLS: &[&str] (20 tools) — code review + ticketing.
Added pub const ALL_TOOLS: &[&str] (56 tools) — hand-listed union, guarded by tests against drift.
Added pub fn tool_in_set(name, allowed) -> bool membership check.
Added pub async fn dispatch_with_tools(msg, env, allowed) — the parameterized dispatcher; tools/list filters to allowed, tools/call rejects out-of-set with UNKNOWN_TOOL.
Kept pub async fn dispatch(msg, env) as a thin alias that calls dispatch_with_tools(msg, env, ALL_TOOLS), so legacy callers (views::call_tool, existing tests) keep working unchanged.
Added tool_manifest_document_filtered(allowed) and tools_call_filtered(params, env, allowed); old tool_manifest_document / tools_call retained for non-dispatcher callers.
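
A simplified sketch of the dispatcher shape described above. It is synchronous, drops the env parameter, replaces JSON-RPC messages with a small hypothetical Request/Response pair, and abbreviates the tool arrays to two names each, so it illustrates the allowed-set check rather than the real signatures in src/mcp.rs.

```rust
// Abbreviated stand-ins for the real 36- and 20-entry arrays.
pub const CONSUMER_TOOLS: &[&str] = &["list_scenarios", "run_turn"];
pub const OPERATOR_TOOLS: &[&str] = &["list_tickets", "read_code"];
pub const ALL_TOOLS: &[&str] = &["list_scenarios", "run_turn", "list_tickets", "read_code"];

/// Hypothetical request shape; the real dispatcher parses JSON-RPC.
pub enum Request<'a> {
    ToolsList,
    ToolsCall { name: &'a str },
}

/// Hypothetical response shape for the sketch.
#[derive(Debug, PartialEq)]
pub enum Response {
    Tools(Vec<&'static str>),
    Called(String),
    Error { code: &'static str, message: String },
}

/// Linear-scan membership check over the allowed set.
pub fn tool_in_set(name: &str, allowed: &[&str]) -> bool {
    allowed.iter().any(|t| *t == name)
}

/// tools/list is filtered to `allowed`; tools/call outside it is rejected.
pub fn dispatch_with_tools(req: Request<'_>, allowed: &[&str]) -> Response {
    match req {
        Request::ToolsList => Response::Tools(
            ALL_TOOLS.iter().copied().filter(|t| tool_in_set(t, allowed)).collect(),
        ),
        Request::ToolsCall { name } if !tool_in_set(name, allowed) => Response::Error {
            code: "UNKNOWN_TOOL",
            message: format!("unknown tool: {name}"),
        },
        // Placeholder for the real per-tool handler.
        Request::ToolsCall { name } => Response::Called(format!("ran {name}")),
    }
}

/// Legacy entry point: same dispatcher, pinned to the full union.
pub fn dispatch(req: Request<'_>) -> Response {
    dispatch_with_tools(req, ALL_TOOLS)
}
```

Keeping dispatch as a one-line alias is what lets existing callers stay untouched while route-mounted paths pass their own pinned set.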

src/server.rs

Added pub fn register_mcp_router(router, path, tool_set) helper — mounts path to a closure that calls mcp_endpoint_with_tools(state, tool_set, …).
mcp_endpoint is renamed to mcp_endpoint_with_tools and takes the allowed: &'static [&'static str] slice; auth + McpEnv assembly are unchanged.
The live /mcp mount in router() now goes through register_mcp_router(r, "/mcp", mcp::ALL_TOOLS) — surface unchanged at this phase. Comments in the file mark exactly where Phase B will swap to two mounts pinned to CONSUMER_TOOLS and OPERATOR_TOOLS.

src/mcp/tests.rs

Added 7 partition guards: disjointness; ALL_TOOLS == CONSUMER_TOOLS ∪ OPERATOR_TOOLS; partition covers tool_manifest() exactly; tools/list against each bucket returns only its own tools and never leaks the other; tools/call rejects cross-bucket with UNKNOWN_TOOL; dispatch and dispatch_with_tools(ALL_TOOLS) produce identical responses.
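
The first two guards reduce to set checks. The sketch below shows one way to express them as plain functions; buckets_are_disjoint and is_exact_union are hypothetical names, not the test names used in src/mcp/tests.rs.

```rust
use std::collections::HashSet;

/// True when no tool name appears in both buckets.
pub fn buckets_are_disjoint(a: &[&str], b: &[&str]) -> bool {
    let seen: HashSet<&str> = a.iter().copied().collect();
    b.iter().all(|name| !seen.contains(*name))
}

/// True when `all` lists exactly the names in `a` and `b`, with no extras
/// and no duplicates (length check plus membership check).
pub fn is_exact_union(all: &[&str], a: &[&str], b: &[&str]) -> bool {
    let listed: HashSet<&str> = all.iter().copied().collect();
    let union: HashSet<&str> = a.iter().chain(b.iter()).copied().collect();
    all.len() == a.len() + b.len() && listed == union
}
```

In a test module these would typically be wrapped as assert!(buckets_are_disjoint(CONSUMER_TOOLS, OPERATOR_TOOLS)) and assert!(is_exact_union(ALL_TOOLS, CONSUMER_TOOLS, OPERATOR_TOOLS)).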

Files modified (line deltas)

src/mcp.rs       | 266 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
src/mcp/tests.rs | 201 +++++++++++++++++++++++++++++++++++++++++
src/server.rs    |  61 +++++++++++--
3 files changed, 513 insertions(+), 15 deletions(-)

Verifications

Container build clean: cargo build --bin chukwa-serve on rust:1.88-bookworm, no warnings.
Lib tests: 415 passed, 0 failed (was 407 baseline → +8 from new partition guards plus a server-side manifest-equivalence test).
Full test suite (incl. postgres-tests, with DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/postgres pinned to local container):
- integration: 516 + 4 + 3 + 2 + 12 = 537 passed, 0 failed (was 529 baseline → +8 from partition guards in lib).
DATABASE_URL used for postgres-tests: postgres://postgres:postgres@127.0.0.1:5433/postgres (local chukwa-pg-local container, never the cluster).

Tool inventory match

Verified the 56-tool partition against the live dispatch table in src/mcp.rs:

CONSUMER_TOOLS (36) matches the ticket's "Consumer surface" list exactly — scenario store puts/gets, assemble/fork, name CAS, scenario reads, world registry, async lifecycle, turn/state reads.
OPERATOR_TOOLS (20) matches the ticket's "Operator surface" list exactly — code navigator, git inspector, ticketing.
Union (ALL_TOOLS, 56) matches the dispatcher match arms 1:1. No deviations to surface.

The tool_partition_covers_full_manifest test pins this — the moment a new tool is added without being bucketed, that test fails.

Surfaced for the record

Shape choice — legacy dispatch retained. views::call_tool and several existing unit tests call dispatch(msg, env) without a tool partition. Rather than churn every call site, dispatch is now a thin alias for dispatch_with_tools(msg, env, ALL_TOOLS). The route-mounted production path always goes through dispatch_with_tools directly with the route's pinned set. Same for tools_call / tools_call_filtered and tool_manifest_document / tool_manifest_document_filtered.
Shape choice: ALL_TOOLS hand-listed, not const-concat'd. Stable Rust has no built-in way to concatenate two &[&str] slices in a const context, so ALL_TOOLS is a separate const array. The all_tools_is_consumer_plus_operator test guards against drift (length and membership).
Shape choice — membership check is linear scan. Lookup is O(n) over 20–56 names; trivially fast at MCP tool-call frequency. No hashing or precomputed sets needed at this scale.
Phase B note. register_mcp_router is the one-liner Phase B will call twice — once with mcp::CONSUMER_TOOLS at /mcp, once with mcp::OPERATOR_TOOLS at /operator-mcp. The current register_mcp_router(r, "/mcp", mcp::ALL_TOOLS) line is the only edit needed in router() to flip the surface.
Path naming. Sticking with /operator-mcp from the ticket spec; no preference to call out a different name.
Nothing spec-ambiguous decided. Tool partition matched cleanly; no judgment calls beyond the shape choices listed above.

Deployability

Phase A is deployable on its own. Live surface is unchanged: /mcp continues to advertise and dispatch all 56 tools because the single mount is pinned to ALL_TOOLS. The dispatcher, tool-list manifest, OAuth flow, and every tool handler are bit-identical to pre-phase behavior. If Phase B were never to land, this commit is a pure internal refactor with zero observable impact.

Proceeding to Phase B (split: /operator-mcp mount; /mcp becomes consumer-only).

Phase B landed at commit 3a65683 on feat/mcp-route-split.

Branch state (last 4, oneline):

3a65683 feat(mcp): phase B — mount /operator-mcp; /mcp becomes consumer-only
a8abedc refactor(mcp): phase A — parameterize tool registry
07f1b65 Merge chore/label-ord: derive Ord on Label
050dad3 chore(label): derive Ord and replace HashSet/Vec workarounds with BTreeSet

What changed

src/server.rs (sole file modified)

router() now mounts two MCP routes via register_mcp_router instead of the single ALL_TOOLS mount Phase A left behind:
- /mcp → mcp::CONSUMER_TOOLS (36 tools, scenario + world store)
- /operator-mcp → mcp::OPERATOR_TOOLS (20 tools, code review + ticketing)
The /mcp → ALL_TOOLS line is removed (not duplicated), so /mcp is now consumer-only as the Phase B spec requires.
mcp::ALL_TOOLS is retained but no live route uses it. It still exists for the partition-guard tests in src/mcp/tests.rs and for the legacy dispatch alias used by views::call_tool and a handful of unit tests.
Doc comments on register_mcp_router and mcp_endpoint_with_tools updated to reflect the dual-mount reality (was "Phase A pins to ALL_TOOLS").
4 new route-level integration tests in the tests mod that exercise the full HTTP path via tower::ServiceExt::oneshot — auth + JSON-RPC + dispatcher all wired through a real router(state):
- mcp_route_tools_list_returns_consumer_tools_only
- operator_mcp_route_tools_list_returns_operator_tools_only
- mcp_route_rejects_operator_tool_call_with_unknown_tool
- operator_mcp_route_rejects_consumer_tool_call_with_unknown_tool
These complement Phase A's dispatcher-level partition guards (which proved that dispatch_with_tools partitions correctly); the new tests prove that the router is wired to that dispatcher with the right tool set per path.

src/bin/chukwa-serve.rs — unchanged. The binary calls server::router(state); the route-mount edit lives entirely in that function.
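
The dual mount can be modeled without axum as a path-to-tool-set table. In this sketch McpRouter and CallError are hypothetical, the tool arrays are abbreviated, and register_mcp_router only mirrors the argument order of the real helper; the real one mounts an HTTP handler.

```rust
// Abbreviated stand-ins for the real 36- and 20-entry arrays.
pub const CONSUMER_TOOLS: &[&str] = &["list_scenarios", "run_turn"];
pub const OPERATOR_TOOLS: &[&str] = &["list_tickets", "read_code"];

#[derive(Debug, PartialEq)]
pub enum CallError {
    NoSuchRoute,
    UnknownTool,
}

/// Hypothetical toy router: each mounted path is pinned to one tool set.
#[derive(Default)]
pub struct McpRouter {
    mounts: Vec<(&'static str, &'static [&'static str])>,
}

/// Mirrors the (router, path, tool_set) argument order of the real helper.
pub fn register_mcp_router(
    mut router: McpRouter,
    path: &'static str,
    tool_set: &'static [&'static str],
) -> McpRouter {
    router.mounts.push((path, tool_set));
    router
}

impl McpRouter {
    /// The tool set advertised at `path`, if that path is mounted.
    pub fn tools_list(&self, path: &str) -> Option<&'static [&'static str]> {
        self.mounts.iter().find(|(p, _)| *p == path).map(|(_, tools)| *tools)
    }

    /// Accepts a call only when the tool belongs to the path's pinned set.
    pub fn tools_call(&self, path: &str, tool: &str) -> Result<(), CallError> {
        let tools = self.tools_list(path).ok_or(CallError::NoSuchRoute)?;
        if tools.contains(&tool) { Ok(()) } else { Err(CallError::UnknownTool) }
    }
}

/// The Phase B shape: two mounts, one per bucket, no ALL_TOOLS mount.
pub fn router() -> McpRouter {
    let r = McpRouter::default();
    let r = register_mcp_router(r, "/mcp", CONSUMER_TOOLS);
    register_mcp_router(r, "/operator-mcp", OPERATOR_TOOLS)
}
```

The four route-level tests correspond to calling tools_list on each path and tools_call with a tool from the opposite bucket.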

Files modified (line deltas)

src/server.rs | 213 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 198 insertions(+), 15 deletions(-)

Verifications (rust:1.88-bookworm container, local pg `chukwa-pg-local` 127.0.0.1:5433)

Container build clean: cargo build --bin chukwa-serve finished in 2m 26s, no warnings.
Lib tests: 419 passed, 0 failed (was 415 baseline → +4 from new route-level tests).
Integration suites (cargo test --tests --features test-fixtures,postgres-tests -- --test-threads=1, DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/postgres pinned to local container, never the cluster):
- lib (with postgres-tests feature): 520 passed (was 516 → +4 from new route-level tests)
- bootstrap: 3 passed
- migrations: 2 passed
- phase0: 12 passed
- ant_scenario: 3 passed, 1 environmental failure — suspended_seed_remains_unchanged_after_many_turns panics on a network timeout to http://192.168.29.10:30190/v1/chat/completions (LLM endpoint unreachable from the build container). Re-running just that test reproduces the same network-timeout panic — phrasing differs only by which adjudicate/intend call hits the timeout first. This test makes live LLM calls; the Phase B change touched only HTTP route registration and cannot affect kernel cognition behavior. Surfacing for the record, not blocking deploy.
Total tests across all targets in the --tests run with postgres feature: 520 + 4 + 3 + 2 + 12 = 541 collected, 540 passed, 1 environmental (Phase A baseline was 537; the +4 is the new route-level guards).

Smoke / partition test transcripts

The four route-level cases requested by Phase B's "manually verify via curl" are exercised by the new integration tests rather than against a live deploy (no Phase B deploy yet — Phase C does that). The tests use tower::ServiceExt::oneshot to drive a real router(state) through the full auth + JSON-RPC pipeline.

test server::tests::mcp_route_tools_list_returns_consumer_tools_only ... ok
test server::tests::operator_mcp_route_tools_list_returns_operator_tools_only ... ok
test server::tests::mcp_route_rejects_operator_tool_call_with_unknown_tool ... ok
test server::tests::operator_mcp_route_rejects_consumer_tool_call_with_unknown_tool ... ok

What each one asserts, in spec terms:

/mcp tools/list → body's result.tools[*].name set has size 36 and equals CONSUMER_TOOLS. Each of the 20 names in OPERATOR_TOOLS is explicitly checked-not-present (no leak).
/operator-mcp tools/list → body's result.tools[*].name set has size 20 and equals OPERATOR_TOOLS. Each of the 36 names in CONSUMER_TOOLS is explicitly checked-not-present (no leak).
/mcp tools/call name=list_tickets → body's result.isError == true; result.content[0].text parses as JSON with code == "UNKNOWN_TOOL". (list_tickets is in OPERATOR_TOOLS.)
/operator-mcp tools/call name=list_scenarios → body's result.isError == true; result.content[0].text parses as JSON with code == "UNKNOWN_TOOL". (list_scenarios is in CONSUMER_TOOLS.)

Authentication uses a real Bearer token issued via state.auth.issue_token("test-client", None); both routes accept the same token (single OAuth audience as the spec requires).

Surfaced for the record

Shape choice — tests live in src/server.rs::tests, not a new tests/route_split.rs. The existing make_test_state fixture is right there and already constructs a usable Arc<AppState>; copying it into a new integration-test file would duplicate ~30 lines of setup and create drift risk. The dispatcher-level partition tests already live in src/mcp/tests.rs, so keeping the route-level tests next to router() (and make_test_state) is the symmetrical placement.
Shape choice: tower::ServiceExt::oneshot over a backgrounded cargo run --bin. The oneshot approach is durable and runs in the same suite as everything else; the alternative (manual curl against cargo run) requires a working OAuth flow, a hand-issued token, and a backgrounded process that is hard to clean up reliably. The full HTTP path is exercised either way, and the integration test catches regressions on every CI run.
Auth in tests. The tests mint a token via AuthStore::issue_token directly. This bypasses the OAuth authorization-code flow, which is fine — the tests are about the route split, not the OAuth handshake. Phase A's existing dispatcher tests don't go through HTTP at all, so this is the first set of tests in the codebase that hit mcp_endpoint_with_tools end-to-end.
ALL_TOOLS retention rationale. Removing ALL_TOOLS would force views::call_tool and ~6 unit tests in mcp::tests to either pick a partition explicitly or grow a "use everything" sentinel. The simpler answer is to leave it as the legacy union (used by the legacy dispatch alias) and let Phase D address whether to retire it.
No new unknown-tool error code. The dispatcher already returns UNKNOWN_TOOL for both "tool not in any registry" and "tool not in this route's allowed set"; the latter is what Phase A's dispatcher work introduced. The ticket spec said "UNKNOWN_TOOL or the equivalent dispatcher error" — the current behavior is precisely UNKNOWN_TOOL, no second code needed.

Phase C — what it needs

Build a fresh container image off feat/mcp-route-split HEAD (3a65683).
Push image and roll the Deployment (deployment/chukwa).
Live curl smoke against https://chukwa.benac.dev/mcp and https://chukwa.benac.dev/operator-mcp for the four cases above. The integration tests prove the wiring, but the live smoke is what the ticket Acceptance items 1, 2, 4, and 6 want as final evidence.
Reconfigure johnb's primary operator agent (this conversation) to point at both https://chukwa.benac.dev/mcp AND https://chukwa.benac.dev/operator-mcp. Pre-rollout, this conversation's MCP config presumably points at just /mcp and gets the full surface; post-rollout, that single URL will return only the consumer 36, and all ticketing + code-review tools will be missing until /operator-mcp is added. Phase C should plan reconfig before rollout, or be ready to do it immediately after.
Other agent configs that point at /mcp and rely on operator tools (if any beyond this conversation) need the same reconfiguration. Surface in Phase D's proposed_resolution along with the verified live tool counts.

Deployability

Phase B is deployable. On rollout, /mcp immediately becomes consumer-only (36 tools); /operator-mcp is a brand-new path that needs to be added to operator-agent MCP configs to keep code-review and ticketing access. Existing operator agents pointing only at /mcp will lose those tools the moment the new pod is serving traffic — this is the intentional behavior Phase B exists to introduce. The OAuth audience is unchanged; tokens issued before the rollout remain valid for both URLs after.

Proceeding to Phase C (build, deploy, live smoke; reconfigure operator agent for both URLs).

Phase C executed at 2026-04-27T00:06:55Z against https://chukwa.benac.dev.

Merge + image

Merge SHA on main: 554ccb4 ("Merge feat/mcp-route-split: split MCP surface into consumer + operator routes"), pushed to gitlab/main. Image: chukwa:latest, sha256:63d957c71a4d, built locally via nerdctl from /srv/chukwa @ main, transferred into the k8s.io containerd namespace (imagePullPolicy: Never).

Pod rolled: chukwa-56d574bf44-5l9pl (Running, 1/1 ready, 0 restarts). Previous pod chukwa-b9c5f699b-9k7jn replaced via kubectl rollout restart deployment/chukwa driven by k8s/deploy.sh.

Migration check

Postgres (chukwa-postgres-0) post-deploy:

version | success
--------+--------
      1 | t
      2 | t

0001_scenario_store + 0002_world_store still present, both success=t. No migration changes in this ticket — Phase C is routing-only — and nothing reset.

Curl smoke (raw + Bearer token, four cases)

Case 1: POST /mcp tools/list
result.tools.length = 36
All 36 names match CONSUMER_TOOLS (scenario-store CRUD, assemble / fork / name CAS, world registry, async lifecycle, turn / state reads).
Sample: list_scenarios, create_world, run_turn, get_world.

Case 2: POST /operator-mcp tools/list
result.tools.length = 20
All 20 names match OPERATOR_TOOLS (code navigator: 7, git inspector: 4, tickets: 9).
Sample: create_ticket, list_tickets, handler_respond_ticket, browse_codebase, git_log.

Case 3: POST /mcp tools/call name=add_ticket_comment
Response:
{"jsonrpc":"2.0","id":1,"result":{"content":[{"type":"text","text":"{\n \"error\": \"unknown tool: add_ticket_comment\",\n \"code\": \"UNKNOWN_TOOL\",\n \"message\": \"Tool call failed [UNKNOWN_TOOL]: unknown tool: add_ticket_comment\"\n}"}],"isError":true}}
No operator-tool leak through /mcp.

Case 4: POST /operator-mcp tools/call name=list_scenarios
Response:
{"jsonrpc":"2.0","id":1,"result":{"content":[{"type":"text","text":"{\n \"error\": \"unknown tool: list_scenarios\",\n \"code\": \"UNKNOWN_TOOL\",\n \"message\": \"Tool call failed [UNKNOWN_TOOL]: unknown tool: list_scenarios\"\n}"}],"isError":true}}
No consumer-tool leak through /operator-mcp.

Partition holds in both directions.
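
For reference, the text payload seen in cases 3 and 4 can be reproduced with a short formatter. This is a sketch based only on the three fields visible in the smoke output; the function names are hypothetical, the indentation is an assumption, and the tool name is assumed to be a plain identifier that needs no JSON escaping.

```rust
/// Builds the inner JSON text carried in result.content[0].text for an
/// out-of-set tool call. Field names and values follow the smoke output.
pub fn unknown_tool_text(tool: &str) -> String {
    let error = format!("unknown tool: {tool}");
    format!(
        "{{\n  \"error\": \"{error}\",\n  \"code\": \"UNKNOWN_TOOL\",\n  \"message\": \"Tool call failed [UNKNOWN_TOOL]: {error}\"\n}}"
    )
}

/// Cheap client-side check for the partition error, without a JSON parser.
pub fn is_unknown_tool(text: &str) -> bool {
    text.contains("\"code\": \"UNKNOWN_TOOL\"")
}
```

A client that sees is_unknown_tool return true for a tool it expects to exist has most likely sent the call to the wrong URL.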

Wrapper update

File: /root/.config/chukwa-mcp/mcp.sh
Backup: /root/.config/chukwa-mcp/mcp.sh.pre-split (original, 1:1)

Change shape: bash case statement on $TOOL. If $TOOL is in the hardcoded OPERATOR set (20 names mirrored from src/mcp.rs OPERATOR_TOOLS), endpoint resolves to ${CHUKWA_OPERATOR_URL:-$BASE/operator-mcp}. Otherwise endpoint resolves to ${CHUKWA_MCP_URL:-$BASE/mcp}. $BASE defaults to https://chukwa.benac.dev. Comment block immediately above the case documents the partition (code navigator, git inspector, tickets) and notes that the lists must stay in sync with src/mcp.rs.

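OPERATOR set in the wrapper:

browse_codebase, outline, list_code_files, find_definition, find_references, read_code, search_code,
git_log, git_diff, git_show_commit, git_file_history,
create_ticket, get_ticket, list_tickets, add_ticket_comment, file_followup,
handler_respond_ticket, user_confirm_resolution, user_cancel_ticket, user_change_ticket_status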

Wrapper smoke (two cases)

Case W1: bash mcp.sh list_scenarios '{}'
Routed to https://chukwa.benac.dev/mcp. isError=false, scenarios_count=3. Consumer path good.

Case W2: bash mcp.sh list_tickets '{}'
Routed to https://chukwa.benac.dev/operator-mcp. isError=false, tickets_count=39. Operator path good.

Rollback

Wrapper: mv /root/.config/chukwa-mcp/mcp.sh.pre-split /root/.config/chukwa-mcp/mcp.sh
Image: kubectl -n chukwa rollout undo deployment/chukwa
Code: git revert -m 1 554ccb4 (the merge commit)

Surfaced for the record

Wrapper's OPERATOR set is hand-mirrored from src/mcp.rs. A future drift between the wrapper and the const will surface as runtime UNKNOWN_TOOL (operator tool sent to /mcp) or 404 (operator tool sent to /operator-mcp when path absent) rather than a compile-time check. Surfaced for follow-up: a small lint script that diffs the wrapper's case-arm names against OPERATOR_TOOLS in mcp.rs.
chukwa-b9c5f699b-9k7jn (pre-split pod) had 3 restarts in 3h12m before this rollout per kubectl get pods snapshot. Not investigated here — outside Phase C scope.
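The lint script suggested in the first item could look roughly like this. It is a sketch under stated assumptions, none of which the ticket confirms: that src/mcp.rs declares the set as a `const OPERATOR_TOOLS` array of string literals closed by `];`, and that the wrapper names each operator tool in single-line case arms of the form `a|b|c)`. All function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical drift lint. Assumed (unconfirmed) layouts:
#   src/mcp.rs:  pub const OPERATOR_TOOLS: &[&str] = &[ "create_ticket", ... ];
#   mcp.sh:      create_ticket|list_tickets|...)   (one arm per line)

# Sorted names from the Rust const: every quoted literal between the
# `const OPERATOR_TOOLS` line and the first following `];`.
rust_operator_tools() {
  awk '/const OPERATOR_TOOLS/ {on=1} on {print} on && /\];/ {on=0}' "$1" |
    grep -o '"[a-z0-9_]*"' | tr -d '"' | sort -u
}

# Sorted names from the wrapper: every `name|name|...)` case arm.
# The default `*)` arm is skipped because `*` is not in the name class.
wrapper_operator_tools() {
  grep -E '^[[:space:]]*[a-z0-9_]+(\|[a-z0-9_]+)*\)' "$1" |
    sed 's/).*//' | tr '|' '\n' | tr -d ' \t' | sort -u
}

# Returns 0 when both lists match, 1 (with a report on stderr) otherwise.
# $1 = path to mcp.rs, $2 = path to mcp.sh
check_drift() {
  local rs sh
  rs="$(rust_operator_tools "$1")"
  sh="$(wrapper_operator_tools "$2")"
  if [ "$rs" != "$sh" ]; then
    printf 'OPERATOR set drift\n--- %s ---\n%s\n--- %s ---\n%s\n' "$1" "$rs" "$2" "$sh" >&2
    return 1
  fi
}
```

Run as check_drift src/mcp.rs /root/.config/chukwa-mcp/mcp.sh, this would turn the runtime UNKNOWN_TOOL symptom into a failure at lint time, provided the two layout assumptions hold.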

Cutover statement

/mcp now serves CONSUMER_TOOLS (36) only. /operator-mcp serves OPERATOR_TOOLS (20) only. The handler-side wrapper at /root/.config/chukwa-mcp/mcp.sh routes by tool name. This status comment was posted via the updated wrapper (add_ticket_comment is in the OPERATOR set, so its successful posting through the wrapper exercises the operator route end-to-end).

Proceeding to Phase D (wrap-up: proposed_resolution with verified tool counts at each URL, wrapper diff, reconfiguration instructions for any other agent configs).

Proposing resolution — see proposed_resolution field for the full Phase D wrap-up (verified tool counts, test results, reconfiguration instructions, architectural delta, surfaced follow-ups).

Caller accepted: Accepted.

Architectural delivery is exactly what the spec asked for: /mcp serves the 36-tool consumer surface (scenario store + world store), /operator-mcp serves the 20-tool operator surface (code review + ticketing), single OAuth audience, no tier checks at the routing layer, no security boundary between the two — purely structural for agent configuration ergonomics. The dispatcher UNKNOWN_TOOL short-circuit holds in both directions, verified by curl in Phase C and by the route-level integration tests in Phase B.

Direct verification: I'm calling both surfaces from this conversation right now. Chukwa Operator:get_ticket and Chukwa Operator:list_tickets route to /operator-mcp and succeed; Chukwa:list_scenarios, Chukwa:run_turn, etc. route to /mcp and succeed. The wrapper is doing its job.

Three items worth registering on the way out — none blocking.

Wrapper drift is the real ongoing risk. The bash case statement in /root/.config/chukwa-mcp/mcp.sh hand-mirrors OPERATOR_TOOLS from src/mcp.rs. If a future tool moves between buckets in code without a matching wrapper edit, misrouting surfaces only as runtime UNKNOWN_TOOL. The handler proposed a startup-time drift check as a future safeguard. Worth filing one day; not pressing.
Pre-split pod restart count is forensic evidence for 4601f21a. The handler noted in passing: "chukwa-b9c5f699b-9k7jn (pre-split pod) had 3 restarts in 3h12m before this rollout." That's the same pod that handled my dragon-and-mouse multi-agent turns earlier today and interrupted three times in a row at 36-40s, then handled the seven moth turns cleanly at 15-21s. Three restarts in 3h12m on a pod doing routine substrate work is exactly the data shape the pod-restart investigation ticket needs. The investigator should harvest kubectl logs --previous and describe pod artifacts for chukwa-b9c5f699b-9k7jn if any retention remains, plus correlate against the dragon-and-mouse interrupted-attempt timestamps in the world store (first-meeting world's three interrupted attempts at 22:33-22:37Z). That correlation is the smoking-gun candidate.
Phase C didn't re-run cargo test against the merge SHA. The handler's reasoning was that the merge was a fast-forward so Phase B's results carry bit-for-bit to 554ccb4. That's defensible for a clean ff merge with no conflict resolution. Noting it as a process detail, not a deficiency.

The DATABASE_URL discipline carried forward correctly: postgres://postgres:postgres@127.0.0.1:5433/postgres (sacrificial local chukwa-pg-local container, never the cluster) for every postgres-test run across A/B. The test-fixture safety lesson from the world-store data-loss postmortem is now operationally established practice for this handler.
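One way to make that discipline mechanical is a guard that runs before any postgres-backed test. The sketch below is hypothetical (the ticket does not say the handler uses one) and the function name is invented; it accepts only a DATABASE_URL on 127.0.0.1:5433, the local container address quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-test guard; not something this ticket says exists.
# Accepts only a DATABASE_URL pointing at 127.0.0.1:5433 and rejects
# anything else, including an unset variable.
assert_local_database_url() {
  case "${DATABASE_URL:-}" in
    postgres://*@127.0.0.1:5433/*)
      return 0 ;;
    *)
      printf 'refusing to run tests: DATABASE_URL is not the local 127.0.0.1:5433 instance\n' >&2
      return 1 ;;
  esac
}
```

A test run would then be gated as assert_local_database_url && cargo test, so a stray cluster URL fails fast instead of reaching the fixtures.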

Closing the loop: the substrate is now Postgres-native end-to-end, the MCP surface is split for agent configuration ergonomics, and the operator-side wrapper routes cleanly. Ready for the pod-restart investigation to pick up next, and the bigger UI ticket (the registry-governed graph browser drafted earlier this session) after that. Resolution accepted.

Split MCP surface into consumer (`/mcp`) and operator (`/operator-mcp`) routes

Body

Background

Single-user context

Execution mode

Tool partition

Consumer surface (/mcp)

Operator surface (/operator-mcp)

Tool count audit

Authentication

Routing

Dashboard / web UI

Phase plan

Acceptance

Out of scope

Sequencing

Proposed resolution

Phase summary

Verified tool counts at each URL (from Phase C live curl smoke)

Test results

Reconfiguration instructions for other agent configs

Architectural delta

Surfaced for follow-up (not filed)

Closing

History (8 events)

What changed

Files modified (line deltas)

Verifications

Tool inventory match

Surfaced for the record

Deployability

What changed

Files modified (line deltas)

Verifications (rust:1.88-bookworm container, local pg chukwa-pg-local 127.0.0.1:5433)

Smoke / partition test transcripts

Surfaced for the record

Phase C — what it needs

Deployability

Merge + image

Migration check

Curl smoke (raw + Bearer token, four cases)

Wrapper update

Wrapper smoke (two cases)

Rollback

Surfaced for the record

Cutover statement
