# Guardrails

{% tabs %}
{% tab title="End users" %}
Guardrails are CrossWatch's safety systems.

They prevent bad snapshots and flaky APIs from causing destructive writes.

#### The ones you’ll notice most

* **Auth gating**: bad auth skips the pair.
* **Provider down**: writes are skipped.
* **Drop guard**: tiny snapshots don’t trigger mass deletes.
* **Mass delete protection**: large removal waves get blocked.

#### If CrossWatch “does nothing”

That is often a guardrail doing its job.

Check the run log for events like:

* `pair:skip` (auth)
* `writes:skipped` (provider down)
* `snapshot:suspect` (drop guard)
* `mass_delete:blocked`

#### If you really want big deletes

Only do this if you have backups and a restore plan.

Then explicitly enable mass deletes and removals in your pair settings.

Related:

* Two-way delete propagation: [Two-way sync](/blueprint-architecture/orchestrator/two-way-sync.md)
* Delete memory: [Tombstones](/blueprint-architecture/orchestrator/tombstones.md)
  {% endtab %}

{% tab title="Power users" %}
Guardrails are the orchestrator safety mechanisms. They prevent bad snapshots and flaky writes from becoming bad deletes.

This page documents what is actually wired today. It ignores “future knobs”.

Core modules:

* Drop guard: `_snapshots.py`
* Mass delete protection: `_pairs_massdelete.py`
* Tombstones: `_tombstones.py`
* Blackbox: `_blackbox.py`
* Unresolved: `_unresolved.py`
* PhantomGuard: `_phantoms.py`
* Health gating: `_pairs.py` + provider `health()`

***

### Overview

Guardrails run before and around writes. They prefer to block or skip instead of guessing.

Key outcomes:

* skip pairs when auth is broken
* skip writes when a provider is down
* suppress deletes when snapshots look wrong
* quarantine items that keep failing

***

### Guardrails (what they do)

#### 1) Health gating

Goal: don’t write when auth is broken or a provider is down.

**Auth failures**

If a provider's `health()` reports `auth_failed`:

* the orchestrator skips the pair entirely
* emits `pair:skip reason=auth_failed`

**Provider down**

If a provider's `health()` reports `down`:

* one-way: writes to that destination are skipped; items become unresolved
* two-way: writes to that side are skipped; **observed deletions are disabled**

{% hint style="info" %}
Observed deletions are disabled when a provider is `down`. “Down” often looks like an empty snapshot. That would otherwise look like “everything was deleted”.
{% endhint %}
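
A minimal sketch of the gating decision described above. The function name `gate_pair`, the returned dict shape, and the `"ok"` status are illustrative assumptions, not the real `_pairs.py` API. It also assumes that a `down` provider on either side disables observed deletions for the whole two-way run.

```python
def gate_pair(src_health: str, dst_health: str, mode: str) -> dict:
    """Decide what a pair may do, given each side's health status.

    Statuses are assumed to be "ok", "auth_failed", or "down".
    mode is "one-way" (src -> dst) or "two-way".
    """
    health = {"src": src_health, "dst": dst_health}

    # Auth failure on either side: skip the whole pair.
    if "auth_failed" in health.values():
        return {
            "run": False,
            "event": "pair:skip reason=auth_failed",
            "skip_writes_to": [],
            "observed_deletes": False,
        }

    # One-way only writes to dst; two-way can write to both sides.
    write_targets = ["dst"] if mode == "one-way" else ["src", "dst"]
    skip_writes_to = [side for side in write_targets if health[side] == "down"]

    # Observed deletions exist only in two-way, and only when nothing is down.
    any_down = "down" in health.values()
    observed = mode == "two-way" and not any_down

    return {
        "run": True,
        "event": None,
        "skip_writes_to": skip_writes_to,
        "observed_deletes": observed,
    }
```

For example, `gate_pair("down", "ok", "two-way")` still runs the pair, but skips writes to the source side and turns observed deletions off.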

***

#### 2) Drop guard (suspect snapshot coercion)

Goal: if a snapshot shrinks massively without checkpoint movement, treat it as bad and reuse the baseline.

**Code**

`coerce_suspect_snapshot(...)` in `_snapshots.py`

Triggers only when:

* `sync.drop_guard == True`
* provider `capabilities.index_semantics == "present"` (default)
* previous baseline is sufficiently large
* current snapshot is tiny relative to baseline
* checkpoint did not advance

Defaults (runtime):

* `runtime.suspect_min_prev = 20`
* `runtime.suspect_shrink_ratio = 0.10`

Effect:

* replaces `cur_idx` with `prev_idx` for planning
* emits `snapshot:suspect` / `snapshot.guard`

This prevents a transient 0-item response from turning into “remove 900 items”.
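
The trigger conditions can be sketched roughly as follows. This is not the real `coerce_suspect_snapshot(...)`: the name `coerce_sketch`, its signature, and the exact comparisons (`<` versus `<=`) are assumptions. Only the config names and defaults come from this page.

```python
def is_suspect(prev_size, cur_size, checkpoint_advanced, *,
               drop_guard=True, index_semantics="present",
               suspect_min_prev=20, suspect_shrink_ratio=0.10):
    """Return True when the current snapshot looks like a bad read."""
    if not drop_guard or index_semantics != "present":
        return False              # guard disabled, or a delta-style provider
    if checkpoint_advanced:
        return False              # the provider says something really changed
    if prev_size < suspect_min_prev:
        return False              # baseline too small to judge
    return cur_size <= prev_size * suspect_shrink_ratio


def coerce_sketch(prev_idx, cur_idx, checkpoint_advanced, **settings):
    """Return (index_to_plan_with, was_suspect)."""
    if is_suspect(len(prev_idx), len(cur_idx), checkpoint_advanced, **settings):
        return dict(prev_idx), True   # reuse the baseline for planning
    return cur_idx, False
```

With a 900-item baseline, an empty snapshot, and no checkpoint movement, the sketch returns the baseline and flags the snapshot as suspect, so the planner sees no removals.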

***

#### 3) Mass delete protection

Goal: block huge removal plans unless explicitly allowed.

**Code**

`_pairs_massdelete.py`

Input:

* `removes` list
* `baseline_size` (destination effective index size)
* `sync.allow_mass_delete` (default False)
* `runtime.suspect_shrink_ratio` (same ratio, default 0.10)

If mass deletes are not allowed and `len(removes) > baseline_size * ratio`, then it:

* drops the entire removal list
* emits `mass_delete:blocked`

This is intentionally blunt. If you really want large deletions, you must opt in.
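
A minimal sketch of that check. The function name and return shape are hypothetical; the threshold rule and the defaults are the ones listed above.

```python
def guard_mass_delete(removes, baseline_size, *,
                      allow_mass_delete=False, ratio=0.10):
    """Return (removes_to_apply, event_or_None)."""
    if allow_mass_delete:
        return list(removes), None           # user opted in explicitly
    if len(removes) > baseline_size * ratio:
        return [], "mass_delete:blocked"     # drop the whole removal list
    return list(removes), None
```

With a baseline of 1000 items and the default ratio, a plan with 200 removals is dropped entirely, while a plan with 50 removals passes through unchanged.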

***

#### 4) Tombstones (pair-scoped deletion memory)

Goal: avoid ping-pong deletes and re-add loops, especially in two-way.

**Code**

`_tombstones.py`

Where stored:

* `/config/.cw_state/tombstones.json`

Two-way loads tombstones scoped to:

* feature + pair (`PAIR = "-".join(sorted([A,B]))`)

Tokens recorded on successful removals:

* canonical key (e.g., `imdb:tt...`)
* all ID tokens in `item["ids"]` (imdb/tmdb/tvdb/simkl/mal/anilist/etc.)

Key format in file:

* `{feature}:{PAIR}|{token}` → `{ "at": epoch, "why": "remove" }`

TTL pruning:

* `sync.tombstone_ttl_days` (default 30)
* older tombstone entries are ignored (and can be cleaned later)

How tombstones affect planning:

* prevent re-adding an item that was intentionally removed
* allow two-way to decide “this is a real delete, propagate it”
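
The key format and TTL check can be sketched like this. The helper names are hypothetical, and the sketch assumes that ID tokens are written as `name:value` and that `item["ids"]` is a dict. The real `_tombstones.py` may normalize tokens differently.

```python
import time

DAY = 86400  # seconds


def tombstone_key(feature, a, b, token):
    """Build '{feature}:{PAIR}|{token}' with the pair sorted, as described above."""
    pair = "-".join(sorted([a, b]))
    return f"{feature}:{pair}|{token}"


def record_removal(store, feature, a, b, canonical_key, item, now=None):
    """Write one tombstone per token for a successfully removed item."""
    now = int(time.time()) if now is None else now
    tokens = {canonical_key}
    for name, value in (item.get("ids") or {}).items():
        if value:
            tokens.add(f"{name}:{value}")   # assumed token format
    for token in tokens:
        store[tombstone_key(feature, a, b, token)] = {"at": now, "why": "remove"}


def is_tombstoned(store, feature, a, b, token, ttl_days=30, now=None):
    """True if a tombstone exists for the token and is still within the TTL."""
    now = int(time.time()) if now is None else now
    entry = store.get(tombstone_key(feature, a, b, token))
    return entry is not None and now - entry["at"] <= ttl_days * DAY
```

Because the pair is sorted before joining, a removal recorded for `TRAKT` and `PLEX` is found again when the same pair is looked up as `PLEX` and `TRAKT`.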

***

#### 5) Observed deletions (two-way only)

Goal: infer “real deletes” from baseline vs live snapshot.

In two-way, observed deletions are:

* `prev.keys - cur.keys`

But they are only trusted when:

* not bootstrapping
* provider not down
* snapshot not suspect (drop guard didn’t trigger)
* `sync.include_observed_deletes == True`

Observed deletions are immediately written into tombstones (with ID tokens) and removed from effective indices so they don’t get re-added.

This is the main delete-propagation mechanism. It is also the biggest “oops” risk. That’s why it is guarded heavily.
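
As a rough sketch, the trust conditions reduce to a single boolean gate in front of a set difference. The function name and keyword arguments are illustrative, not the orchestrator's actual signature.

```python
def observed_deletions(prev, cur, *, bootstrapping, provider_down,
                       snapshot_suspect, include_observed_deletes):
    """Return keys that vanished since the baseline, or an empty set if untrusted."""
    trusted = include_observed_deletes and not (
        bootstrapping or provider_down or snapshot_suspect
    )
    if not trusted:
        return set()
    return set(prev) - set(cur)   # prev.keys - cur.keys
```

Any single failing condition yields an empty set, so one untrusted snapshot never produces delete candidates.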

***

#### 6) Blackbox (failure cooldown)

Goal: stop re-attempting flapping items after repeated failures.

**Code**

`_blackbox.py` and `apply_blocklist` usage in `_pairs_blocklist.py`

Blackbox maintains:

* `*.flap.json` counters (consecutive failures)
* `*.blackbox.json` blocked keys (cooldown)

Promotion:

* after `sync.blackbox.promote_after` consecutive failures

Applied today:

* blocks **adds only**
* and only for **non-watchlist** features (watchlist uses PhantomGuard instead)

Pruning:

* `sync.blackbox.cooldown_days` controls how long entries stay blocked

{% hint style="warning" %}
Some config flags exist (`enabled`, `block_adds`, `block_removes`) but aren’t fully enforced in wiring.

Treat blackbox as “always active if you call `record_attempts` / `record_success`”.
{% endhint %}
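
The flap counter, promotion, and cooldown can be sketched as below. All function names and the in-memory dict shapes are hypothetical stand-ins for the `*.flap.json` and `*.blackbox.json` files. The sketch assumes that a success resets the failure streak.

```python
DAY = 86400  # seconds


def note_result(flap, blocked, key, ok, *, promote_after, now):
    """Track consecutive failures and promote a key to the blocked set."""
    if ok:
        flap.pop(key, None)          # a success resets the streak
        return
    flap[key] = flap.get(key, 0) + 1
    if flap[key] >= promote_after:
        blocked[key] = {"since": now}


def prune_blocked(blocked, *, cooldown_days, now):
    """Keep only entries that are still inside the cooldown window."""
    cutoff = now - cooldown_days * DAY
    return {k: v for k, v in blocked.items() if v["since"] >= cutoff}


def filter_adds(adds, blocked, feature):
    """Block adds only, and only for non-watchlist features."""
    if feature == "watchlist":
        return list(adds)            # watchlist relies on PhantomGuard instead
    return [k for k in adds if k not in blocked]
```

With `promote_after=3`, a key is blocked on its third consecutive failure and drops out of the blocked set once `cooldown_days` have passed.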

***

#### 7) Unresolved (don’t hammer what keeps failing)

Goal: when a key repeatedly fails to apply, stop retrying it every run.

**Code**

`_unresolved.py`

How it’s recorded:

* when apply returns failures, those keys are written to:
  * `*.unresolved.pending.json`

How it’s used:

* the orchestrator loads unresolved keys from files matching the name it expects (see the naming mismatch in the warning below)
* then removes them from planned **adds**

{% hint style="warning" %}
The loader looks for `*.unresolved.json`. The recorder writes `*.unresolved.pending.json`.

If you don’t have a promotion step (pending → active), unresolved blocking is weaker.
{% endhint %}

Unresolved is still useful. It may need a small wiring pass to fully “bite”.
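
The naming mismatch is easy to reproduce with a glob check. The sketch below also shows what a promotion step could compute. `promoted_name` and the example file name are hypothetical: no such helper is documented on this page.

```python
from fnmatch import fnmatchcase

LOADER_PATTERN = "*.unresolved.json"           # what the loader looks for
PENDING_SUFFIX = ".unresolved.pending.json"    # what the recorder writes


def loader_sees(filename):
    """True if the loader's pattern would pick up this file."""
    return fnmatchcase(filename, LOADER_PATTERN)


def promoted_name(filename):
    """Hypothetical pending -> active rename target."""
    if not filename.endswith(PENDING_SUFFIX):
        raise ValueError(f"not a pending unresolved file: {filename}")
    return filename[: -len(PENDING_SUFFIX)] + ".unresolved.json"


def drop_unresolved_adds(adds, unresolved_keys):
    """Remove unresolved keys from planned adds."""
    return [k for k in adds if k not in unresolved_keys]
```

A file such as `watchlist.unresolved.pending.json` (an invented name) does not match the loader pattern, while its promoted name `watchlist.unresolved.json` does.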

***

#### 8) PhantomGuard (watchlist anti-flap)

Goal: stop re-adding watchlist items that “succeed” but don’t stick.

**Code**

`_phantoms.py`

PhantomGuard is applied in one-way and two-way for `feature == "watchlist"` (and partially for ratings).

It tracks:

* keys that were attempted as adds
* whether they were later observed in the destination snapshot (“actually stuck”)

If an item repeatedly fails to “stick”, it becomes a phantom and further adds are blocked for `ttl_days`.

Files (scoped):

* `{feature}.{src}-{dst}.{scope}.phantoms.json`
* `{feature}.{src}-{dst}.{scope}.last_success.json`

Wiring quirk:

* PhantomGuard config is read from **root-level** `cfg["blackbox"]`, not `sync.blackbox`.

So yes: there are *two* “blackbox-ish” systems:

* `_blackbox.py` (failure cooldown for non-watchlist adds)
* `_phantoms.py` (watchlist stickiness guard)
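
A rough sketch of the stickiness check and the scoped file names. The function names, the `misses_to_phantom` threshold, and the in-memory dicts are assumptions. This page does not state how many misses it takes to become a phantom.

```python
DAY = 86400  # seconds


def phantom_files(feature, src, dst, scope):
    """Return the two scoped file names listed above."""
    base = f"{feature}.{src}-{dst}.{scope}"
    return f"{base}.phantoms.json", f"{base}.last_success.json"


def update_phantoms(phantoms, misses, attempted, dst_keys, *,
                    misses_to_phantom, now):
    """Compare earlier add attempts with the destination snapshot."""
    for key in attempted:
        if key in dst_keys:
            misses.pop(key, None)            # the add actually stuck
            continue
        misses[key] = misses.get(key, 0) + 1
        if misses[key] >= misses_to_phantom:
            phantoms[key] = now              # start blocking from now


def allowed_adds(adds, phantoms, *, ttl_days, now):
    """Drop adds for keys that are still phantoms within ttl_days."""
    return [k for k in adds
            if k not in phantoms or now - phantoms[k] > ttl_days * DAY]
```

The difference from Blackbox in this sketch is the signal: PhantomGuard counts adds that reported success but never appeared in the destination snapshot, not adds that returned a failure.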

***

#### 9) Manual policy (human override)

Goal: let the user pin adds or block items regardless of snapshots.

Stored in `state.manual.json` (merged with `state.json` at load time).

Manual adds:

* merged into the source index before diffing

Manual blocks:

* applied to both adds and removes after planning
* match by canonical key, id tokens, or title-year token

This is the “I know better, do it anyway / don’t ever sync this title” escape hatch.
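
A minimal sketch of both halves of the manual policy. The token formats (`name:value` for IDs and `title|year` for the title-year token) and all function names are assumptions. Only the matching categories come from the text above.

```python
def item_tokens(key, item):
    """Collect every token a manual block could match against."""
    tokens = {key}                                   # canonical key
    for name, value in (item.get("ids") or {}).items():
        if value:
            tokens.add(f"{name}:{value}")            # assumed ID token format
    title, year = item.get("title"), item.get("year")
    if title and year:
        tokens.add(f"{title.strip().lower()}|{year}")  # assumed title-year format
    return tokens


def merge_manual_adds(src_index, manual_adds):
    """Merge manual adds into the source index before diffing."""
    merged = dict(src_index)
    for key, item in manual_adds.items():
        merged.setdefault(key, item)
    return merged


def apply_manual_blocks(planned, blocks):
    """Filter a planned adds or removes mapping after planning."""
    blocks = set(blocks)
    return {k: it for k, it in planned.items()
            if item_tokens(k, it).isdisjoint(blocks)}
```

The same `apply_manual_blocks` call would be made once for planned adds and once for planned removes, since manual blocks apply to both.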

***

### Guardrail ordering (why it matters)

In practice, the orchestrator relies on this order to stay safe:

1. Health gating (skip if auth/down)
2. Drop guard (fix suspect snapshots before planning)
3. Index semantic merge (delta providers get baseline merge)
4. Plan diffs
5. Mass delete protection (drop oversized removal plans)
6. Blocklists (tombstones/unresolved/blackbox)
7. PhantomGuard (watchlist)
8. Apply with confirmations + unresolved bookkeeping
9. Persist baselines/checkpoints

If you change ordering, you can accidentally disable a guardrail.
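
One way to catch an accidental reorder is a small sanity check against the sequence above. The stage names and the helper are invented for illustration. The orchestrator does not necessarily expose its stages as a list.

```python
GUARDRAIL_ORDER = (
    "health_gating",
    "drop_guard",
    "index_semantic_merge",
    "plan_diffs",
    "mass_delete_protection",
    "blocklists",
    "phantom_guard",
    "apply",
    "persist",
)


def order_violations(steps):
    """Return (out_of_order_pairs, missing_steps) for a proposed pipeline."""
    rank = {name: i for i, name in enumerate(GUARDRAIL_ORDER)}
    known = [s for s in steps if s in rank]
    violations = [
        (earlier, later)
        for earlier, later in zip(known, known[1:])
        if rank[earlier] > rank[later]
    ]
    missing = [s for s in GUARDRAIL_ORDER if s not in steps]
    return violations, missing
```

If `plan_diffs` is moved ahead of `drop_guard`, the check reports that pair: planning would then run on a snapshot that was never coerced.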

***

### Related pages

* Snapshot coercion details: [Snapshots](/blueprint-architecture/orchestrator/snapshots.md)
* Two-way delete propagation: [Two-way sync](/blueprint-architecture/orchestrator/two-way-sync.md)
* Delete memory: [Tombstones](/blueprint-architecture/orchestrator/tombstones.md)
* Failure suppression: [Blackbox](/blueprint-architecture/orchestrator/blackbox.md), [Phantom Guard](/blueprint-architecture/orchestrator/phantom-guard.md), [Unresolved](/blueprint-architecture/orchestrator/unresolved.md)
  {% endtab %}
  {% endtabs %}


