# Snapshots

{% tabs %}
{% tab title="End users" %}
Snapshots are CrossWatch’s “current view” of each provider.

A plan is only as good as the snapshots it is built from.

#### Why snapshots matter

* If a provider returns stale data, plans look wrong.
* If a provider returns empty data, the plan can show mass deletes.
* CrossWatch uses safety checks before trusting a big drop.

#### What protects you

* **Drop guard** can treat a tiny snapshot as “suspect”.
* It then reuses the previous baseline for planning.
* That prevents mass deletes from transient outages.

#### What to do when things look stale

1. Run again. Many providers are eventually consistent.
2. Check provider health. Auth failures skip work.
3. Disable in-run snapshot caching (`runtime.snapshot_ttl_sec = 0`).

Related:

* Where stale data comes from: [Caching layers](/blueprint-architecture/orchestrator/caching-layers.md)
* Safety model: [Guardrails](/blueprint-architecture/orchestrator/guardrails.md)
* Runtime knobs: [Runtime](/blueprint-architecture/orchestrator/runtime.md)
  {% endtab %}

{% tab title="Power users" %}
Snapshots are the orchestrator’s normalized view of a provider’s data. They are also called **indices** in code.

Code: `cw_platform/orchestrator/_snapshots.py`

Called by: `_pairs_oneway.py`, `_pairs_twoway.py`

***

### Overview

#### Terms

* **Snapshot / index**: current items for one provider + one feature.
  * Shape: `canonical_key -> item dict`
  * Type alias: `SnapIndex = dict[str, dict[str, Any]]`
* **Baseline**: last known good snapshot persisted in `state.json`.
* **Checkpoint**: provider marker used to detect stale snapshots.
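
To make the shape concrete, here is a small illustrative snapshot. The type alias is the one listed above; the item fields (`type`, `title`, `year`, `ids`) are assumptions based on fields mentioned elsewhere on this page, not a complete item schema.

```python
from typing import Any

# Type alias as given in the Terms list.
SnapIndex = dict[str, dict[str, Any]]

# Hypothetical watchlist snapshot for one provider + one feature.
watchlist: SnapIndex = {
    "imdb:tt0133093": {
        "type": "movie",
        "title": "The Matrix",
        "year": 1999,
        "ids": {"imdb": "tt0133093", "tmdb": "603"},
    },
}
```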

Snapshots are built per run. They can be memoized in memory during that run.

***

### Technical reference

#### Where snapshots come from

Every sync provider implements `build_index(config, feature=...)`.

`build_snapshots_for_feature(...)` loops all loaded providers and, per provider:

1. Verifies the provider claims it supports this feature (`ops.features()`).
2. Verifies at least one enabled pair needs this snapshot.
3. Verifies the provider is configured (`ops.is_configured(config)` if implemented).
4. Calls `ops.build_index(...)`.
5. Normalizes the output into a `SnapIndex`.
6. Runs watchlist-only post-processing (coalescing, ANILIST backfill).
7. Stores in the per-run memo cache (optional TTL).
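
The steps above can be sketched roughly as follows. This is a simplified illustration, not the actual `build_snapshots_for_feature` code: the `allowed` and `normalize` parameters are hypothetical stand-ins for pair gating and normalization, and steps 6 and 7 are left out.

```python
from typing import Any, Callable, Iterable

SnapIndex = dict[str, dict[str, Any]]


def build_snapshots_sketch(
    providers: dict[str, Any],
    config: dict[str, Any],
    feature: str,
    allowed: Iterable[str],
    normalize: Callable[[Any], SnapIndex],
) -> dict[str, SnapIndex]:
    """Illustrative gating + build loop (assumed structure)."""
    allowed_set = set(allowed)
    snaps: dict[str, SnapIndex] = {}
    for name, ops in providers.items():
        # 1. Feature gating: the provider must advertise the feature.
        if not ops.features().get(feature):
            continue
        # 2. Pair gating: at least one enabled pair must need this snapshot.
        if name not in allowed_set:
            continue
        # 3. Configuration gating, only if the provider implements the hook.
        is_configured = getattr(ops, "is_configured", None)
        if callable(is_configured) and not is_configured(config):
            continue
        # 4-5. Build the raw index and normalize it into a SnapIndex.
        snaps[name] = normalize(ops.build_index(config, feature=feature))
    # Steps 6-7 (watchlist post-processing, memo cache) are omitted here.
    return snaps
```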

***

#### “Only build what we actually need”

Snapshots aren’t built for every provider every time.

#### Feature gating

A provider must advertise the feature:

* `ops.features()` returns a mapping like `{"watchlist": True, "ratings": True, ...}`
* If `features()[feature]` is falsy → skipped.

#### Pair gating

`allowed_providers_for_feature(config, feature)` scans `config["pairs"]` and collects providers that participate in at least one enabled pair that runs the feature.

If your config includes a provider but no pair uses it for that feature, it won’t get indexed.
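
A minimal sketch of pair gating, under stated assumptions: the pair keys used here (`enabled`, `source`, `target`, `features`) and the upper-casing of provider names are guesses made for illustration. The real schema of `config["pairs"]` is not described on this page.

```python
from typing import Any


def allowed_providers_sketch(config: dict[str, Any], feature: str) -> set[str]:
    """Collect providers used by at least one enabled pair that runs `feature`.

    The pair layout is assumed, not the confirmed config schema.
    """
    allowed: set[str] = set()
    for pair in config.get("pairs") or []:
        if not pair.get("enabled", True):
            continue  # disabled pairs never request snapshots
        if not (pair.get("features") or {}).get(feature):
            continue  # this pair does not run the feature
        for side in ("source", "target"):
            name = pair.get(side)
            if name:
                allowed.add(str(name).upper())
    return allowed
```

With a layout like this, a provider that only appears in disabled pairs, or in pairs that do not run the feature, never lands in the returned set and is therefore never indexed.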

#### Configuration gating

If the provider implements `is_configured(config)`, it must return truthy, otherwise it’s skipped.

This avoids wasting time querying providers that are present in code but not configured in `config.json`.

***

#### Normalization: provider output → SnapIndex

Providers may return:

* a **list** of item dicts, or
* a **dict** mapping `provider_key -> item dict`

Both normalize to:

```
{ canonical_key: { ...item fields... } }
```

**Canonical keys**

Canonical keys are created by `cw_platform.id_map.canonical_key(item)`. The key is typically based on the best available external ID.

The priority order (`KEY_PRIORITY`) starts with: `imdb > tmdb > tvdb > trakt > mal > anilist > kitsu > anidb > simkl > plex > guid > slug`

**If the provider returns a dict**

If `build_index()` returns a dict, the orchestrator will pick the better of:

* `provider_key` (the dict key, stripped of any `@suffix`)
* `computed_key` (from `canonical_key(item)`)

Selection rule:

* Pick whichever has higher key priority (e.g., prefer `imdb:...` over `simkl:...`).
* If one is missing, use the other.

This keeps keys stable and improves cross-provider matching.
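
The selection rule can be sketched like this. The priority tuple copies the order listed above (the real `KEY_PRIORITY` may contain more entries), and `key_rank` and `pick_key` are hypothetical helper names. How the orchestrator breaks a tie between two keys of equal priority is not documented here; this sketch keeps the computed key.

```python
from typing import Optional

# Order taken from the KEY_PRIORITY list above; the real constant may be longer.
KEY_PRIORITY = ("imdb", "tmdb", "tvdb", "trakt", "mal", "anilist",
                "kitsu", "anidb", "simkl", "plex", "guid", "slug")
_RANK = {name: i for i, name in enumerate(KEY_PRIORITY)}


def key_rank(key: Optional[str]) -> int:
    """Lower is better; unknown or missing keys rank last."""
    if not key or ":" not in key:
        return len(KEY_PRIORITY)
    return _RANK.get(key.split(":", 1)[0].lower(), len(KEY_PRIORITY))


def pick_key(provider_key: Optional[str], computed_key: Optional[str]) -> Optional[str]:
    """Choose between the dict key and the key computed from the item."""
    if provider_key:
        provider_key = provider_key.split("@", 1)[0]  # strip any @suffix
    if not provider_key:
        return computed_key or None
    if not computed_key:
        return provider_key
    # Assumed tie rule: equal priority keeps the computed key.
    if key_rank(provider_key) < key_rank(computed_key):
        return provider_key
    return computed_key
```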

***

#### Coalescing duplicates (watchlist only)

Watchlist snapshots run a coalescing pass after normalization:

* `_coalesce_by_shared_ids(idx, feature="watchlist")`
* If two keys share any ID token, they are grouped.

Then the orchestrator picks a “best key” for the group:

* best key priority (imdb first, etc.)
* tie-breaker: item with the most IDs wins

Finally it merges item dicts:

* missing/empty fields are filled from the other items
* `ids` dict is merged (missing IDs filled in)

Result: fewer duplicates during diffing. This matters when providers disagree on the primary key.
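
A rough sketch of the coalescing idea, using a small union-find over ID tokens. This is not the actual `_coalesce_by_shared_ids` implementation. It assumes an item's tokens are its own key plus every non-empty `name:value` pair in its `ids` dict, and that "empty" means `None`, `""`, `[]`, or `{}`.

```python
from typing import Any

SnapIndex = dict[str, dict[str, Any]]
PRIORITY = ("imdb", "tmdb", "tvdb", "trakt", "mal", "anilist",
            "kitsu", "anidb", "simkl", "plex", "guid", "slug")


def _rank(key: str) -> int:
    prefix = key.split(":", 1)[0]
    return PRIORITY.index(prefix) if prefix in PRIORITY else len(PRIORITY)


def _tokens(key: str, item: dict[str, Any]) -> set[str]:
    """Assumed token set: the key itself plus each non-empty ID."""
    ids = item.get("ids") or {}
    return {key} | {f"{k}:{v}" for k, v in ids.items() if v}


def coalesce_sketch(idx: SnapIndex) -> SnapIndex:
    parent = {k: k for k in idx}

    def find(k: str) -> str:
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    owner: dict[str, str] = {}  # token -> first key that carried it
    for key, item in idx.items():
        for tok in _tokens(key, item):
            if tok in owner:
                parent[find(key)] = find(owner[tok])  # shared token: same group
            else:
                owner[tok] = key

    groups: dict[str, list[str]] = {}
    for key in idx:
        groups.setdefault(find(key), []).append(key)

    out: SnapIndex = {}
    for keys in groups.values():
        # Best key priority first; tie-breaker is the item with the most IDs.
        keys.sort(key=lambda k: (_rank(k), -len(idx[k].get("ids") or {})))
        merged: dict[str, Any] = dict(idx[keys[0]])
        merged_ids = dict(merged.get("ids") or {})
        for other in keys[1:]:
            for field, value in idx[other].items():
                if field != "ids" and merged.get(field) in (None, "", [], {}):
                    merged[field] = value  # fill missing/empty fields
            for name, value in (idx[other].get("ids") or {}).items():
                if not merged_ids.get(name):
                    merged_ids[name] = value  # fill missing IDs
        if merged_ids:
            merged["ids"] = merged_ids
        out[keys[0]] = merged
    return out
```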

***

#### ANILIST watchlist key backfill

If an ANILIST watchlist snapshot exists, keys may be improved:

* `_maybe_backfill_anilist_shadow(snaps, feature="watchlist")`

What it does:

* Builds a token lookup from other providers’ watchlist items.
* For each ANILIST item, finds a higher-priority matching key.
* Rekeys the ANILIST snapshot entry to that better key.
* Enriches the ANILIST item’s `ids` with missing IDs from the matched item.
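
The rekeying pass might look roughly like the sketch below. It is an illustration only, not the actual `_maybe_backfill_anilist_shadow` code: it assumes `snaps` is keyed by provider name, that tokens are the `name:value` pairs in each item's `ids`, and it skips the shadow file entirely.

```python
from typing import Any

SnapIndex = dict[str, dict[str, Any]]
PRIORITY = ("imdb", "tmdb", "tvdb", "trakt", "mal", "anilist",
            "kitsu", "anidb", "simkl", "plex", "guid", "slug")


def _rank(key: str) -> int:
    prefix = key.split(":", 1)[0]
    return PRIORITY.index(prefix) if prefix in PRIORITY else len(PRIORITY)


def backfill_anilist_sketch(snaps: dict[str, SnapIndex]) -> None:
    """Rekey ANILIST entries in place to a better key found in other providers."""
    anilist = snaps.get("ANILIST")
    if not anilist:
        return
    # Token lookup: "name:value" -> (key, item) from every other provider.
    lookup: dict[str, tuple[str, dict[str, Any]]] = {}
    for provider, idx in snaps.items():
        if provider == "ANILIST":
            continue
        for key, item in idx.items():
            for name, value in (item.get("ids") or {}).items():
                if value:
                    lookup.setdefault(f"{name}:{value}", (key, item))
    for old_key in list(anilist):
        item = anilist[old_key]
        ids = dict(item.get("ids") or {})
        tokens = [f"{n}:{v}" for n, v in ids.items() if v]
        matches = [lookup[t] for t in tokens if t in lookup]
        if not matches:
            continue
        best_key, best_item = min(matches, key=lambda m: _rank(m[0]))
        # Enrich the ANILIST item's ids with IDs it is missing.
        for name, value in (best_item.get("ids") or {}).items():
            ids.setdefault(name, value)
        enriched = {**item, "ids": ids}
        if _rank(best_key) < _rank(old_key) and best_key not in anilist:
            del anilist[old_key]
            anilist[best_key] = enriched  # rekey to the higher-priority key
        else:
            anilist[old_key] = enriched
```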

**Shadow file**

When the backfill can extract ANILIST identifiers, it writes a scoped shadow file:

* `/config/.cw_state/anilist_watchlist_shadow.<scope>.json` (scoped filename via `scoped_file(...)`)

Stored fields include:

* `anilist_id`
* optional `list_entry_id`
* optional `mal`
* optional `source_ids` (IDs from the matched “better-key” item)
* `updated_at`, and some light metadata (`type`, `title`, `year`)

Scope rules:

* Only written when a real pair scope is active.
* Not written for “unscoped/default/health” scopes.

Net effect: better key stability across providers for anime-heavy setups.

***

#### Snapshot memoization (per-run cache)

`build_snapshots_for_feature` supports an in-memory memo cache:

* `SnapCache = dict[(provider, feature), (ts, index)]`

If `snap_ttl_sec > 0` and the cache entry is fresh:

* The orchestrator reuses the cached snapshot and skips API calls.

Important:

* Empty snapshots and snapshots built during a “degraded” provider call are **not cached**.
* This cache is **not persisted**; it resets each run.
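
A minimal sketch of the caching rule, assuming a hypothetical `build` callable that returns the index together with a `degraded` flag. How the real code detects a degraded provider call is not covered on this page, and `cached_snapshot` is an illustrative name.

```python
import time
from typing import Any, Callable

SnapIndex = dict[str, dict[str, Any]]
SnapCache = dict[tuple[str, str], tuple[float, SnapIndex]]


def cached_snapshot(
    cache: SnapCache,
    provider: str,
    feature: str,
    snap_ttl_sec: float,
    build: Callable[[], tuple[SnapIndex, bool]],
    now: Callable[[], float] = time.monotonic,
) -> SnapIndex:
    """Return a fresh cached snapshot if possible, otherwise build one."""
    key = (provider, feature)
    if snap_ttl_sec > 0 and key in cache:
        ts, idx = cache[key]
        if now() - ts < snap_ttl_sec:
            return idx  # fresh entry: reuse it and skip the API call
    idx, degraded = build()
    if snap_ttl_sec > 0 and idx and not degraded:
        cache[key] = (now(), idx)  # empty or degraded results are never cached
    return idx
```

Because the cache is a plain dict created for the run, dropping it at the end of the run gives the "not persisted" behavior for free.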

***

#### Checkpoints (used to detect stale/bad snapshots)

A provider can optionally expose `activities(config)`.

`module_checkpoint(ops, config, feature)` reads that mapping and chooses a relevant marker:

* watchlist: `watchlist` or `ptw` or `updated_at`
* ratings: `ratings` or `updated_at`
* history: `history` or `updated_at`
* otherwise: `updated_at`
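
The marker selection can be sketched as a lookup table. This assumes "or" in the list above means "first non-empty value, in the order listed"; `pick_checkpoint` is a hypothetical name, not the real `module_checkpoint` signature.

```python
from typing import Any, Optional

# Candidate keys per feature, in the order listed above.
_CHECKPOINT_KEYS = {
    "watchlist": ("watchlist", "ptw", "updated_at"),
    "ratings": ("ratings", "updated_at"),
    "history": ("history", "updated_at"),
}


def pick_checkpoint(activities: Optional[dict[str, Any]], feature: str) -> Optional[str]:
    """Return the first non-empty marker for `feature`, as a string."""
    if not activities:
        return None
    for key in _CHECKPOINT_KEYS.get(feature, ("updated_at",)):
        value = activities.get(key)
        if value:
            return str(value)
    return None
```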

Previous checkpoints come from `state.json` via:

* `prev_checkpoint(state, provider, feature)`

Checkpoints are stored and compared as strings. When a time comparison is needed, ISO timestamps and numeric epoch values are parsed.
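
One way such parsing could work is shown below. `checkpoint_to_epoch` is a hypothetical helper, and treating offset-less timestamps as UTC is an assumption of this sketch, not documented behavior.

```python
from datetime import datetime, timezone
from typing import Optional


def checkpoint_to_epoch(value: Optional[str]) -> Optional[float]:
    """Best-effort conversion of a checkpoint string to epoch seconds."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    try:
        return float(text)  # numeric epoch value
    except ValueError:
        pass
    try:
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError:
        return None  # not a timestamp: callers fall back to string equality
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumed UTC
    return parsed.timestamp()
```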

***

#### Drop guard: coercing a suspect snapshot back to baseline

This prevents destructive plans caused by transient snapshot drops.

Enabled by:

* `sync.drop_guard: true` (per pair/feature config)

Implemented in:

* `coerce_suspect_snapshot(...)`

It triggers only when the provider’s `capabilities()["index_semantics"]` is `"present"` (default).

**Conditions**

Given:

* `prev_idx` = previous baseline index
* `cur_idx` = freshly fetched snapshot

The snapshot is considered **suspect** if:

1. The previous baseline is “big enough”:
   * `len(prev_idx) >= suspect_min_prev` (default: 20, set in the runtime config)
2. The new snapshot shrank too much:
   * `len(cur_idx) == 0` OR `len(cur_idx) <= len(prev_idx) * suspect_shrink_ratio`\
     (default shrink ratio: `0.10`)
3. The checkpoint did **not** progress:
   * the current checkpoint equals the previous one, or
   * the current checkpoint parses to a time at or before the previous one, or
   * a previous checkpoint exists but the current one is missing

If all match:

* the function returns `prev_idx` instead of `cur_idx`
* and marks the snapshot as coerced (with a reason string)
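
The three conditions can be combined as in the sketch below. It is a simplified illustration, not the real `coerce_suspect_snapshot`: it compares only numeric epoch checkpoints (ISO parsing is left out), it treats "both checkpoints missing" as "same checkpoint", it skips the `index_semantics` check, and the reason string format is made up.

```python
from typing import Any, Optional

SnapIndex = dict[str, dict[str, Any]]


def _as_number(value: Optional[str]) -> Optional[float]:
    try:
        return float(value) if value is not None else None
    except (TypeError, ValueError):
        return None


def checkpoint_stalled(prev_cp: Optional[str], now_cp: Optional[str]) -> bool:
    """True when the checkpoint did not progress."""
    if prev_cp == now_cp:
        return True  # same checkpoint (assumption: also when both are missing)
    if prev_cp is not None and now_cp is None:
        return True  # previous exists but current is missing
    prev_t, now_t = _as_number(prev_cp), _as_number(now_cp)
    # Only numeric epoch checkpoints are compared in this sketch.
    return prev_t is not None and now_t is not None and now_t <= prev_t


def coerce_suspect_sketch(
    prev_idx: SnapIndex,
    cur_idx: SnapIndex,
    prev_cp: Optional[str],
    now_cp: Optional[str],
    *,
    suspect_min_prev: int = 20,
    suspect_shrink_ratio: float = 0.10,
) -> tuple[SnapIndex, Optional[str]]:
    """Return (index to plan with, reason); reason is None when not coerced."""
    if len(prev_idx) < suspect_min_prev:
        return cur_idx, None  # baseline too small to judge
    shrunk = len(cur_idx) == 0 or len(cur_idx) <= len(prev_idx) * suspect_shrink_ratio
    if not shrunk or not checkpoint_stalled(prev_cp, now_cp):
        return cur_idx, None
    reason = f"suspect: {len(prev_idx)} -> {len(cur_idx)} items, checkpoint did not progress"
    return prev_idx, reason
```

Requiring a stalled checkpoint is what separates a real mass removal (the provider reports new activity) from a transient empty response (nothing changed on the provider side, yet the list vanished).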

This prevents downstream planners from seeing massive “removes” driven by a transient provider failure.

**Observability**

When `runtime.suspect_debug` is true (default), it emits:

* `snapshot:suspect` events with counts/checkpoints/reason

***

#### What counts get logged

Snapshot logging uses `_eventish_count(feature, idx)`:

* watchlist: `len(idx)`
* history: counts only entries that have `watched_at`/`last_watched_at`
* ratings: counts only entries that have rating-related fields

This keeps logs meaningful. Without it, history and ratings counts would include “shell” items that carry no watch or rating data.
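
A sketch of the counting rule follows. The `rating` and `rated_at` field names stand in for the unspecified "rating-related fields" and are assumptions, as is falling back to `len(idx)` for any feature not listed above.

```python
from typing import Any

SnapIndex = dict[str, dict[str, Any]]


def eventish_count_sketch(feature: str, idx: SnapIndex) -> int:
    """Count only items that carry data relevant to the feature."""
    if feature == "history":
        return sum(
            1 for item in idx.values()
            if item.get("watched_at") or item.get("last_watched_at")
        )
    if feature == "ratings":
        # Assumed field names for "rating-related fields".
        return sum(
            1 for item in idx.values()
            if item.get("rating") is not None or item.get("rated_at")
        )
    return len(idx)  # watchlist (and, by assumption, anything else)
```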

***

### Troubleshooting

If you’re diagnosing weird plans:

* If a provider returns a dict keyed by something low-priority (e.g., `simkl:`) but items contain `imdb:` IDs, the snapshot will likely be rekeyed to `imdb:` keys.
* If you see massive removals right after a provider outage, check if `drop_guard` is enabled and whether checkpoints progressed.
* For anime/watchlist oddities, check whether ANILIST got rekeyed and whether the shadow file exists under `/config/.cw_state/`.

### Related pages

* Snapshot TTL and debug flags: [Runtime](/blueprint-architecture/orchestrator/runtime.md)
* How suspect snapshots prevent bad deletes: [Guardrails](/blueprint-architecture/orchestrator/guardrails.md)
* Where snapshots can get stale: [Caching layers](/blueprint-architecture/orchestrator/caching-layers.md)
  {% endtab %}
  {% endtabs %}

