chukwa — ticket d3e78844

Add cognition workflow reference validation at the scenario store layer; today it happens only in the runtime.

Fix the cognition workflow reference-validation gap at the scenario store layer, while preserving the existing runtime/kernel validation as defense-in-depth.

Do not remove, weaken, bypass, or relax any runtime/kernel validation or reference-resolution checks. The runtime checks should remain in place as a final safety net. This change is about ensuring invalid schema/source hashes are caught earlier, when workflows are bound to cognition profiles or assembled into scenarios, rather than being discovered for the first time during execution.

Goal

A cognition profile must never become an executable scenario contract unless its workflow’s referenced hashes resolve.

Specifically:

put_cognition_profile with inline workflow content must reject workflows whose source_ref, tool source refs, tool schema refs, node final_schema_hash, or apply.final_schema_hash point at missing rows.
put_cognition_profile with { workflow_hash: "..." } must reject an existing workflow if that workflow’s internal schema/source hashes do not resolve.
assemble_scenario must validate profiles supplied by profile hash too, not only inline profile content.
Existing bad rows should not be allowed to pass assembly merely because the scenario uses CognitionProfileRef::Hash.
Missing external refs must surface as StoreError::NotFound, with a message naming the exact missing hash and kind, e.g. json_schema hash <hash> does not exist.
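
As a rough illustration of the reference kinds listed above, the sketch below walks a workflow and collects every external hash it mentions, tagged with the table it is expected to resolve in. The `Workflow`, `Node`, and `Tool` structs and the `collect_refs` helper are hypothetical and simplified; the real workflow type may nest these fields differently, and treating tool source refs as response_source rows is an assumption.

```rust
/// Which table a referenced hash is expected to resolve in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RefKind {
    ResponseSource,
    JsonSchema,
}

// Hypothetical, simplified workflow shape (not the crate's real types).
struct Tool {
    source_ref: String,
    schema_ref: String,
}

struct Node {
    source_ref: String,
    tools: Vec<Tool>,
    final_schema_hash: String,
    // Stands in for apply.final_schema_hash; assumed optional here.
    apply_final_schema_hash: Option<String>,
}

struct Workflow {
    nodes: Vec<Node>,
}

/// Collects every external hash the workflow mentions, tagged with its kind.
fn collect_refs(w: &Workflow) -> Vec<(RefKind, String)> {
    let mut refs = Vec::new();
    for node in &w.nodes {
        refs.push((RefKind::ResponseSource, node.source_ref.clone()));
        for tool in &node.tools {
            // Assumption: tool source refs resolve against response sources.
            refs.push((RefKind::ResponseSource, tool.source_ref.clone()));
            refs.push((RefKind::JsonSchema, tool.schema_ref.clone()));
        }
        refs.push((RefKind::JsonSchema, node.final_schema_hash.clone()));
        if let Some(h) = &node.apply_final_schema_hash {
            refs.push((RefKind::JsonSchema, h.clone()));
        }
    }
    refs
}
```

Collecting the refs first and resolving them in a second step keeps the "which hashes matter" question separate from the "does each one exist" lookup, so both stores can share the same list.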

Files to change

Primary:

src/scenario_store/postgres.rs
src/scenario_store/memory.rs

Also update comments/docs where they currently imply reference resolution only happens in assembly:

src/workflow_validation.rs

Add tests in:

src/scenario_store/postgres.rs
src/scenario_store/memory.rs
src/mcp/tests.rs

Core implementation change

Move workflow reference validation into profile resolution.

Right now resolve_profile_ref / resolve_profile allows this path:

CognitionProfileRef::Hash
→ select workflow_hash
→ return ResolvedProfile

That is the bug. The hash branch must validate the workflow it finds.

The inline branch currently goes through resolve_workflow_for_profile, which validates inline workflows only structurally. That is the second half of the bug. Replace that structural-only behavior with full reference validation.

Postgres implementation instructions

In src/scenario_store/postgres.rs, change resolve_and_validate_workflow_ref so it returns the resolved workflow hash and whether it inserted a new workflow:

async fn resolve_and_validate_workflow_ref(
    tx: &mut Transaction<'_, Postgres>,
    profile_context: &str,
    r: &super::CognitionWorkflowRef,
) -> Result<(String, bool), StoreError>

Its behavior should be:

Hash form:
  - SELECT content FROM cognition_workflows WHERE hash = $1
  - If missing:
      StoreError::NotFound("cognition_workflow hash <hash> does not exist (profile <profile_context>)")
  - Structurally validate the fetched content.
  - Resolve every referenced source/schema hash.
  - Return (hash.clone(), false)

Inline form:
  - Structurally validate the inline content.
  - Compute canonical workflow hash.
  - If it is the canonical placeholder workflow hash, skip external ref resolution.
  - Otherwise resolve every referenced source/schema hash BEFORE inserting.
  - If any referenced hash is missing, return StoreError::NotFound.
  - Only after all refs resolve, insert into cognition_workflows.
  - Return (hash, was_new)

The important ordering is: validate refs before inserting inline workflow content. This prevents put_cognition_profile from leaving behind invalid workflow rows when validation fails.
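
The ordering for the inline form can be sketched with in-memory collections standing in for the tables. This is not the sqlx code; `Tables`, `InlineWorkflow`, and `resolve_inline` are hypothetical names, structural validation and hash computation are assumed to have already happened, and the placeholder hash is passed in as a parameter purely for illustration.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum StoreError {
    NotFound(String),
}

// Hypothetical in-memory stand-in for the three tables involved.
#[derive(Default)]
struct Tables {
    response_sources: HashSet<String>,
    json_schemas: HashSet<String>,
    cognition_workflows: HashMap<String, String>, // hash -> content
}

// Hypothetical pre-parsed inline workflow: canonical hash, raw content,
// and the external hashes it references.
struct InlineWorkflow {
    hash: String,
    content: String,
    source_refs: Vec<String>,
    schema_refs: Vec<String>,
}

fn resolve_inline(
    t: &mut Tables,
    profile_context: &str,
    w: &InlineWorkflow,
    placeholder_hash: &str,
) -> Result<(String, bool), StoreError> {
    // The canonical placeholder intentionally carries fake refs, so skip it.
    if w.hash != placeholder_hash {
        for h in &w.source_refs {
            if !t.response_sources.contains(h) {
                return Err(StoreError::NotFound(format!(
                    "response_source hash {h} does not exist \
                     (referenced by cognition_workflow {}, profile {profile_context})",
                    w.hash
                )));
            }
        }
        for h in &w.schema_refs {
            if !t.json_schemas.contains(h) {
                return Err(StoreError::NotFound(format!(
                    "json_schema hash {h} does not exist \
                     (referenced by cognition_workflow {}, profile {profile_context})",
                    w.hash
                )));
            }
        }
    }
    // Insert only after every ref has resolved, so a failure leaves no row behind.
    let was_new = !t.cognition_workflows.contains_key(&w.hash);
    if was_new {
        t.cognition_workflows.insert(w.hash.clone(), w.content.clone());
    }
    Ok((w.hash.clone(), was_new))
}
```

Because every early return happens before the insert, a failed call leaves cognition_workflows untouched, and the returned flag reports a new row only on the first successful call.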

The missing-hash errors should be explicit:

StoreError::NotFound(format!(
    "response_source hash {source_hash} does not exist \
     (referenced by cognition_workflow {workflow_hash}, profile {profile_context})"
))

and:

StoreError::NotFound(format!(
    "json_schema hash {schema_hash} does not exist \
     (referenced by cognition_workflow {workflow_hash}, profile {profile_context})"
))

Keep the existing canonical placeholder workflow exception, because that placeholder intentionally uses fake source/schema hashes.

Then change resolve_workflow_for_profile to stop doing structural-only validation. Either remove it or make it a thin wrapper around the full helper:

async fn resolve_workflow_for_profile(
    tx: &mut Transaction<'_, Postgres>,
    agent_label: &str,
    r: &super::CognitionWorkflowRef,
) -> Result<(String, bool), StoreError> {
    let context = if agent_label.is_empty() {
        "put_cognition_profile"
    } else {
        agent_label
    };

    resolve_and_validate_workflow_ref(tx, context, r).await
}

Now update resolve_profile_ref.

For CognitionProfileRef::Hash, after selecting workflow_hash, validate that workflow before returning:

let workflow_hash = select_cognition_profile_row(tx, hash)
    .await?
    .ok_or_else(|| {
        StoreError::NotFound(format!(
            "cognition_profile hash {hash} does not exist"
        ))
    })?;

resolve_workflow_for_profile(
    tx,
    agent_label,
    &ContentRef::Hash {
        hash: workflow_hash.clone(),
    },
)
.await?;

For CognitionProfileRef::Inline, keep resolving through resolve_workflow_for_profile, but now that function must do full ref validation, not just structural validation.

Then remove the current assembly-only pre-pass:

for (label, r) in &input.cognition_profiles {
    if let CognitionProfileRef::Inline { content } = r {
        resolve_and_validate_workflow_ref(&mut tx, label, &content.workflow_hash)
            .await?;
    }
}

That pre-pass is now obsolete and incomplete. Validation must happen inside profile resolution for both inline and hash forms. Removing it also avoids double-upserting inline workflows and fixes new_components.cognition_workflows accounting.

Memory store implementation instructions

Mirror the same logic in src/scenario_store/memory.rs.

Change:

fn resolve_and_validate_workflow_ref_memory(...)

to return:

Result<(String, bool), StoreError>

For inline workflows, do not insert into inner.cognition_workflows until after external refs have been checked. This matters more in the memory store than in Postgres, because the memory store has no transaction rollback to undo a premature insert.

Then update:

resolve_workflow_for_profile_memory
resolve_profile
assemble_scenario
fork_scenario

the same way as Postgres.

The profile hash branch in memory must validate:

cognition_profile hash
→ workflow_hash
→ cognition_workflows[workflow_hash]
→ structural validation
→ response_sources/json_schemas reference validation

No stored profile hash should bypass this.
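
A minimal sketch of that chain, assuming a simplified `Inner` in which a profile row is reduced to its workflow_hash and a workflow row to its list of external refs. The `Inner`, `WorkflowRow`, and `resolve_profile_hash` names are hypothetical, and structural validation is elided.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum StoreError {
    NotFound(String),
}

// Hypothetical stored workflow row: only the external refs are modeled.
struct WorkflowRow {
    source_refs: Vec<String>,
    schema_refs: Vec<String>,
}

// Hypothetical, simplified view of the memory store's inner state.
#[derive(Default)]
struct Inner {
    cognition_profiles: HashMap<String, String>, // profile hash -> workflow_hash
    cognition_workflows: HashMap<String, WorkflowRow>,
    response_sources: HashSet<String>,
    json_schemas: HashSet<String>,
}

fn resolve_profile_hash(inner: &Inner, profile_hash: &str) -> Result<String, StoreError> {
    let workflow_hash = inner.cognition_profiles.get(profile_hash).ok_or_else(|| {
        StoreError::NotFound(format!("cognition_profile hash {profile_hash} does not exist"))
    })?;
    let row = inner.cognition_workflows.get(workflow_hash).ok_or_else(|| {
        StoreError::NotFound(format!("cognition_workflow hash {workflow_hash} does not exist"))
    })?;
    // Structural validation of the stored content would run here (omitted).
    if let Some(h) = row.source_refs.iter().find(|h| !inner.response_sources.contains(*h)) {
        return Err(StoreError::NotFound(format!(
            "response_source hash {h} does not exist \
             (referenced by cognition_workflow {workflow_hash})"
        )));
    }
    if let Some(h) = row.schema_refs.iter().find(|h| !inner.json_schemas.contains(*h)) {
        return Err(StoreError::NotFound(format!(
            "json_schema hash {h} does not exist \
             (referenced by cognition_workflow {workflow_hash})"
        )));
    }
    Ok(workflow_hash.clone())
}
```

The function only reads from `inner`, so the hash branch can run the full chain without any risk of leaving partial state behind.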

Do not change this behavior

Do not make put_cognition_workflow resolve external refs unless the product decision changes.

It is okay for put_cognition_workflow to remain structural-only, because an isolated workflow may be authored before its sources/schemas exist.

The validation boundary being fixed here is: when a workflow is bound to a cognition profile or used in scenario assembly/forking, it must be executable.

Do not remove or weaken runtime/kernel validation.

The runtime should continue to validate/resolve workflow schema and source references exactly as it does today. Store-level validation is an additional earlier guardrail, not a replacement. Runtime validation remains defense-in-depth for corrupted rows, manual database edits, future store bugs, migration mistakes, or any scenario artifact produced before this fix.

MCP behavior

handle_put_cognition_profile already maps store NotFound to:

McpError::from_store_error(e, "UNKNOWN_HASH")

Keep that. The desired tool response for a missing schema/source during put_cognition_profile is:

{
  "isError": true,
  "code": "UNKNOWN_HASH",
  "message": "json_schema hash <hash> does not exist ..."
}

For assemble_scenario, it is acceptable if the code remains BAD_SCENARIO, but the message must explicitly identify the missing hash and kind. Do not return a vague runtime failure.
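
To make the intended mapping concrete, here is a small sketch of how a store NotFound could become the tool response shown above. The real McpError type is not shown in this ticket, so `ToolError`, `tool_error_from_store`, the `Other` variant, and the `"INTERNAL"` fallback code are all hypothetical placeholders.

```rust
#[derive(Debug)]
enum StoreError {
    NotFound(String),
    // Hypothetical catch-all for every other store failure.
    Other(String),
}

// Hypothetical shape of the tool-facing error body.
#[derive(Debug, PartialEq)]
struct ToolError {
    is_error: bool,
    code: String,
    message: String,
}

// NotFound takes the caller-supplied code; the store message passes through
// verbatim so the missing hash and kind stay visible to the caller.
fn tool_error_from_store(e: StoreError, not_found_code: &str) -> ToolError {
    match e {
        StoreError::NotFound(message) => ToolError {
            is_error: true,
            code: not_found_code.to_string(),
            message,
        },
        StoreError::Other(message) => ToolError {
            is_error: true,
            code: "INTERNAL".to_string(), // placeholder code, not from the ticket
            message,
        },
    }
}
```

Passing the message through unchanged is what lets assemble_scenario keep a different code while still naming the missing hash and kind.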

Required regression tests

Add tests for both Postgres and memory.

1. `put_cognition_profile` rejects inline workflow with missing schema hash

Setup:

- Insert a valid response_source.
- Do not insert the schema.
- Build an inline workflow that references:
    source_ref = existing source hash
    final_schema_hash = fake 64-char hash
    apply.final_schema_hash = same fake hash
- Call put_cognition_profile with that inline workflow.

Expected:

Err(StoreError::NotFound(msg))
msg contains "json_schema"
msg contains the fake schema hash
msg contains "does not exist"

Also assert the invalid inline workflow was not inserted into cognition_workflows.
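
The three message checks recur in every regression test in this section, so a shared assertion helper may be worth having. The sketch below is one possible shape; `is_missing_ref` and the local `StoreError` stand-in (including its `Other` variant) are hypothetical.

```rust
#[derive(Debug)]
enum StoreError {
    NotFound(String),
    // Hypothetical catch-all, present only so non-NotFound errors can be modeled.
    Other(String),
}

/// Returns true only if `result` is a NotFound whose message names the
/// expected kind, the expected hash, and the phrase "does not exist".
fn is_missing_ref<T>(result: &Result<T, StoreError>, kind: &str, hash: &str) -> bool {
    match result {
        Err(StoreError::NotFound(msg)) => {
            msg.contains(kind) && msg.contains(hash) && msg.contains("does not exist")
        }
        _ => false,
    }
}
```

A test would then reduce to something like `assert!(is_missing_ref(&result, "json_schema", &fake_hash))`, followed by the separate check that no workflow row was inserted.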

2. `put_cognition_profile` rejects workflow_hash whose internal refs are missing

Setup:

- Insert a structurally valid cognition_workflow with put_cognition_workflow.
- That workflow should reference a fake source or fake schema hash.
- Call put_cognition_profile with { workflow_hash: workflow_r.hash }.

Expected:

Err(StoreError::NotFound(msg))
msg names the missing response_source/json_schema hash

This confirms the hash form is not trusted just because the workflow row exists.

3. `assemble_scenario` rejects stored profile hash pointing at invalid workflow

Because put_cognition_profile will no longer allow creation of a bad profile, seed the bad profile directly in the test.

Postgres setup:

- Insert invalid-ref workflow via put_cognition_workflow.
- Directly INSERT INTO cognition_profiles(hash, workflow_hash)
  using any valid 64-char profile hash and the workflow hash.
- Assemble a scenario whose cognition_profiles map uses:
    CognitionProfileRef::Hash { hash: seeded_profile_hash }

Expected:

Err(StoreError::NotFound(msg))
msg contains "json_schema" or "response_source"
msg contains the missing hash

Memory setup:

- Insert invalid workflow via put_cognition_workflow.
- Directly insert a CognitionProfileRow into inner.cognition_profiles.
- Assemble with CognitionProfileRef::Hash.

Expected: the same result as in the Postgres case.

This is the exact regression for the observed bug.

4. MCP test: `put_cognition_profile` reports `UNKNOWN_HASH`

In src/mcp/tests.rs, add a test that calls the tool surface:

{
  "name": "put_cognition_profile",
  "arguments": {
    "content": {
      "workflow": {
        "...": "workflow with existing source hash and fake schema hash"
      }
    }
  }
}

Expected:

out["isError"] == true
body["code"] == "UNKNOWN_HASH"
body["message"] contains "json_schema hash"
body["message"] contains fake hash
body["message"] contains "does not exist"

Acceptance criteria

This bug is fixed only when all of these are true:

put_cognition_profile inline invalid workflow → immediate UNKNOWN_HASH / NotFound
put_cognition_profile workflow_hash to invalid workflow row → immediate UNKNOWN_HASH / NotFound
assemble_scenario inline profile invalid workflow → immediate failure before manifest persistence
assemble_scenario profile hash pointing to invalid workflow → immediate failure before manifest persistence
fork_scenario inherited/upserted profile hash pointing to invalid workflow → immediate failure
runtime/kernel remains protected, but ordinary tool/store paths no longer allow runtime/kernel to be the first place this class of missing schema/source hash is discovered

The validation must be store-level and transaction-safe. Runtime should never be the first place this missing hash is discovered.
