Snapshots
How the orchestrator builds, normalizes, and caches provider snapshots (indices), including canonical keys, checkpoints, and drop-guard safety.
Snapshots are CrossWatch’s “current view” of each provider.
Plans are only as good as the snapshots they are built from.
Why snapshots matter
If a provider returns stale data, plans look wrong.
If a provider returns empty data, deletes can look scary.
CrossWatch uses safety checks before trusting a big drop.
What protects you
Drop guard can treat a tiny snapshot as “suspect”.
It then reuses the previous baseline for planning.
That prevents mass deletes from transient outages.
What to do when things look stale
Run again. Many providers are eventually consistent.
Check provider health. Auth failures skip work.
Disable in-run snapshot caching (runtime.snapshot_ttl_sec = 0).
Related:
Where stale data comes from: Caching layers
Safety model: Guardrails
Runtime knobs: Runtime
Snapshots are the orchestrator’s normalized view of a provider’s data. They are also called indices in code.
Code: cw_platform/orchestrator/_snapshots.py
Called by: _pairs_oneway.py, _pairs_twoway.py
Overview
Terms
Snapshot / index: current items for one provider + one feature.
Shape: canonical_key -> item dict. Type alias: SnapIndex = dict[str, dict[str, Any]].
Baseline: last known good snapshot persisted in state.json.
Checkpoint: provider marker used to detect stale snapshots.
Snapshots are built per run. They can be memoized in-memory during that run.
Technical reference
Where snapshots come from
Every sync provider implements build_index(config, feature=...).
build_snapshots_for_feature(...) loops all loaded providers and, per provider:
Verifies the provider claims it supports this feature (ops.features()).
Verifies at least one enabled pair needs this snapshot.
Verifies the provider is configured (ops.is_configured(config), if implemented).
Calls ops.build_index(...).
Normalizes the output into a SnapIndex.
Runs watchlist-only post-processing (coalescing, ANILIST backfill).
Stores the result in the per-run memo cache (optional TTL).
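Putting the gating steps together, here is a hedged sketch of that loop. allowed_providers_for_feature and the ops.* calls are real names from this page; normalize_to_snap_index is a hypothetical stand-in for the normalization pass described below.

```python
# Illustrative sketch of build_snapshots_for_feature(...), not the actual
# implementation. normalize_to_snap_index is a hypothetical stand-in for
# the normalization pass covered in the next sections.
from typing import Any

SnapIndex = dict[str, dict[str, Any]]  # canonical_key -> item dict

def build_snapshots_for_feature(providers, config, feature: str) -> dict[str, SnapIndex]:
    snaps: dict[str, SnapIndex] = {}
    allowed = allowed_providers_for_feature(config, feature)  # pair gating
    for name, ops in providers.items():
        if not ops.features().get(feature):        # feature gating
            continue
        if name not in allowed:                    # no enabled pair needs this snapshot
            continue
        if hasattr(ops, "is_configured") and not ops.is_configured(config):
            continue                               # configuration gating
        raw = ops.build_index(config, feature=feature)
        snaps[name] = normalize_to_snap_index(raw) # list-or-dict -> SnapIndex
    return snaps
```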
“Only build what we actually need”
Snapshots aren’t built for every provider every time.
Feature gating
A provider must advertise the feature:
ops.features() returns a mapping like {"watchlist": True, "ratings": True, ...}.
If features()[feature] is falsy, the provider is skipped.
Pair gating
allowed_providers_for_feature(config, feature) scans config["pairs"] and collects providers that participate in at least one enabled pair that runs the feature.
If your config includes a provider but no pair uses it for that feature, it won’t get indexed.
Configuration gating
If the provider implements is_configured(config), it must return truthy; otherwise it's skipped.
This avoids wasting time querying providers that are present in code but not configured in config.json.
Normalization: provider output → SnapIndex
Providers may return:
a list of item dicts, or
a dict mapping provider_key -> item dict
Both shapes normalize to the same SnapIndex (canonical_key -> item dict).
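A minimal sketch of that normalization, assuming canonical_key from cw_platform.id_map; the function name is the same hypothetical stand-in used in the loop sketch above.

```python
# Minimal sketch: both provider output shapes collapse to a SnapIndex.
# canonical_key comes from cw_platform.id_map; this function name is a
# stand-in, not the orchestrator's actual helper.
from cw_platform.id_map import canonical_key

def normalize_to_snap_index(raw) -> dict[str, dict]:
    if isinstance(raw, dict):
        # dict form: keep the provider's key for now; a better canonical
        # key may replace it (see "If the provider returns a dict")
        return {key.split("@", 1)[0]: dict(item) for key, item in raw.items()}
    # list form: compute a canonical key from each item's IDs
    return {canonical_key(item): dict(item) for item in (raw or [])}
```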
Canonical keys
Canonical keys are created by cw_platform.id_map.canonical_key(item). The key is typically based on the best available external ID:
Priority order (KEY_PRIORITY) starts with: imdb > tmdb > tvdb > trakt > mal > anilist > kitsu > anidb > simkl > plex > guid > slug
If the provider returns a dict
If build_index() returns a dict, the orchestrator will pick the better of:
provider_key (the dict key, stripped of any @ suffix)
computed_key (from canonical_key(item))
Selection rule:
Pick whichever has higher key priority (e.g., prefer imdb:... over simkl:...).
If one is missing, use the other.
This keeps keys stable and improves cross-provider matching.
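A sketch of that selection rule, using the KEY_PRIORITY order quoted above; key_rank and choose_key are illustrative names, not the orchestrator's actual helpers.

```python
# Hedged sketch of the "pick the better key" rule for dict-shaped
# build_index() output. key_rank/choose_key are illustrative names.
KEY_PRIORITY = ["imdb", "tmdb", "tvdb", "trakt", "mal", "anilist",
                "kitsu", "anidb", "simkl", "plex", "guid", "slug"]

def key_rank(key: str) -> int:
    prefix = key.split(":", 1)[0]
    return KEY_PRIORITY.index(prefix) if prefix in KEY_PRIORITY else len(KEY_PRIORITY)

def choose_key(provider_key: str, computed_key: str | None) -> str:
    provider_key = provider_key.split("@", 1)[0]   # strip any @suffix
    if not computed_key:
        return provider_key
    # lower rank = higher priority (imdb beats simkl, etc.)
    return min((provider_key, computed_key), key=key_rank)
```

For example, choose_key("simkl:12345", "imdb:tt0111161") returns the imdb: key, which is why dict-keyed provider output often gets rekeyed.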
Coalescing duplicates (watchlist only)
Watchlist snapshots run a coalescing pass after normalization:
_coalesce_by_shared_ids(idx, feature="watchlist")
If two keys share any ID token, they are grouped.
Then the orchestrator picks a “best key” for the group:
best key priority (imdb first, etc.)
tie-breaker: item with the most IDs wins
Finally it merges item dicts:
missing/empty fields are filled from the other items
the ids dict is merged (missing IDs filled in)
Result: fewer duplicates during diffing. This matters when providers disagree on the primary key.
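A simplified sketch of that pass, reusing key_rank from the sketch above. Grouping here is single-pass and not fully transitive, unlike the real _coalesce_by_shared_ids; the function name is an assumption.

```python
# Simplified coalescing sketch: group watchlist entries that share any
# ID token, keep the best key per group, and merge missing fields/IDs.
# Single-pass grouping here is weaker than the real implementation.
def coalesce_by_shared_ids(idx: dict[str, dict]) -> dict[str, dict]:
    groups: list[dict] = []
    for key, item in idx.items():
        tokens = {f"{s}:{v}" for s, v in (item.get("ids") or {}).items() if v}
        group = next((g for g in groups if g["tokens"] & tokens), None)
        if group is None:
            group = {"keys": [], "tokens": set()}
            groups.append(group)
        group["keys"].append(key)
        group["tokens"] |= tokens

    out: dict[str, dict] = {}
    for g in groups:
        # best key: highest priority first, then the item with the most IDs
        best = min(g["keys"],
                   key=lambda k: (key_rank(k), -len(idx[k].get("ids") or {})))
        merged = dict(idx[best])
        merged["ids"] = dict(merged.get("ids") or {})
        for k in g["keys"]:
            for field, value in idx[k].items():
                if field != "ids" and not merged.get(field):
                    merged[field] = value          # fill missing/empty fields
            for s, v in (idx[k].get("ids") or {}).items():
                merged["ids"].setdefault(s, v)     # fill missing IDs
        out[best] = merged
    return out
```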
ANILIST watchlist key backfill
If an ANILIST watchlist snapshot exists, keys may be improved:
_maybe_backfill_anilist_shadow(snaps, feature="watchlist")
What it does:
Builds a token lookup from other providers’ watchlist items.
For each ANILIST item, finds a higher-priority matching key.
Rekeys the ANILIST snapshot entry to that better key.
Enriches the ANILIST item's ids with missing IDs from the matched item.
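A condensed sketch of the rekeying step, again reusing key_rank from the earlier sketch. token_lookup and the function name are assumptions; here token_lookup maps ID tokens from other providers' watchlist items to (better_key, item) pairs.

```python
# Condensed sketch of the ANILIST rekey pass. token_lookup and the
# function name are illustrative; key_rank is the helper sketched above.
def backfill_anilist_keys(anilist_idx: dict[str, dict],
                          token_lookup: dict[str, tuple[str, dict]]) -> dict[str, dict]:
    out: dict[str, dict] = {}
    for key, item in anilist_idx.items():
        tokens = {f"{s}:{v}" for s, v in (item.get("ids") or {}).items() if v}
        match = next((token_lookup[t] for t in tokens
                      if t in token_lookup
                      and key_rank(token_lookup[t][0]) < key_rank(key)),
                     None)
        if match is None:
            out[key] = item
            continue
        better_key, matched_item = match
        enriched = dict(item)
        # enrich ids: ANILIST's own values win, missing ones are filled in
        enriched["ids"] = {**(matched_item.get("ids") or {}),
                           **(item.get("ids") or {})}
        out[better_key] = enriched                 # rekey to the better key
    return out
```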
Shadow file
When it can extract ANILIST identifiers, it writes a scoped shadow file:
/config/.cw_state/anilist_watchlist_shadow.<scope>.json (scoped filename via scoped_file(...))
Stored fields include:
anilist_id (optional)
list_entry_id (optional)
mal (optional)
source_ids (IDs from the matched "better-key" item)
updated_at, plus some light metadata (type, title, year)
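An illustrative shadow entry with placeholder values; the exact field layout is inferred from the list above, not copied from the code.

```python
# Illustrative shadow-file entry (placeholder values; layout inferred
# from the stored-fields list above).
shadow_entry = {
    "anilist_id": 12345,          # optional
    "list_entry_id": 67890,       # optional
    "mal": 11111,                 # optional
    "source_ids": {"imdb": "tt0000000", "tmdb": 99999},
    "updated_at": "2024-01-01T00:00:00Z",
    "type": "show",
    "title": "Example Title",
    "year": 2020,
}
```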
Scope rules:
Only written when a real pair scope is active.
Not written for “unscoped/default/health” scopes.
Net effect: better key stability across providers for anime-heavy setups.
Snapshot memoization (per-run cache)
build_snapshots_for_feature supports an in-memory memo cache:
SnapCache = dict[(provider, feature), (ts, index)]
If snap_ttl_sec > 0 and the cache entry is fresh:
The orchestrator reuses the cached snapshot and skips API calls.
Important:
Empty snapshots and snapshots built during a “degraded” provider call are not cached.
This cache is not persisted; it resets each run.
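A sketch of that memoization logic under the shapes described above; get_or_build is a hypothetical wrapper, and build is assumed to report whether the provider call degraded.

```python
# Sketch of the per-run memo cache. Empty or degraded results are never
# cached; the cache lives only for the duration of a run.
import time
from typing import Any, Callable

SnapCache = dict[tuple[str, str], tuple[float, dict[str, dict[str, Any]]]]

def get_or_build(cache: SnapCache, provider: str, feature: str,
                 snap_ttl_sec: float,
                 build: Callable[[], tuple[dict, bool]]) -> dict:
    key = (provider, feature)
    hit = cache.get(key)
    if snap_ttl_sec > 0 and hit and (time.time() - hit[0]) < snap_ttl_sec:
        return hit[1]                   # fresh entry: skip provider API calls
    idx, degraded = build()             # build() reports a degraded provider call
    if idx and not degraded:            # don't cache empty/degraded snapshots
        cache[key] = (time.time(), idx)
    return idx
```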
Checkpoints (used to detect stale/bad snapshots)
A provider can optionally expose activities(config).
module_checkpoint(ops, config, feature) reads that mapping and chooses a relevant marker:
watchlist: watchlist, ptw, or updated_at
ratings: ratings or updated_at
history: history or updated_at
otherwise: updated_at
Previous checkpoints come from state.json via:
prev_checkpoint(state, provider, feature)
Checkpoints are treated as strings, but coercion/parsing exists for ISO timestamps and numeric epoch values.
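A hedged sketch of the marker selection, following the fallback order above:

```python
# Hedged sketch of module_checkpoint(...); the marker fallback order
# follows the list above, and checkpoints are handled as strings.
def module_checkpoint(ops, config, feature: str) -> str | None:
    acts = ops.activities(config) if hasattr(ops, "activities") else {}
    order = {
        "watchlist": ("watchlist", "ptw", "updated_at"),
        "ratings":   ("ratings", "updated_at"),
        "history":   ("history", "updated_at"),
    }.get(feature, ("updated_at",))
    for marker in order:
        if acts.get(marker):
            return str(acts[marker])   # coerce to string for comparison
    return None
```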
Drop guard: coercing a suspect snapshot back to baseline
This prevents destructive plans caused by transient snapshot drops.
Enabled by:
sync.drop_guard: true (per pair/feature config)
Implemented in:
coerce_suspect_snapshot(...)
It triggers only when the provider’s capabilities()["index_semantics"] is "present" (default).
Conditions
Given:
prev_idx = previous baseline index
cur_idx = freshly fetched snapshot
The snapshot is considered suspect if:
The previous baseline is “big enough”:
len(prev_idx) >= suspect_min_prev (default via runtime config: 20)
The new snapshot shrank too much:
len(cur_idx) == 0 OR len(cur_idx) <= len(prev_idx) * suspect_shrink_ratio (default shrink ratio: 0.10)
And the checkpoint did not progress:
same checkpoint, or
the new checkpoint parses to a time <= the previous one, or
previous exists but current is missing
If all match:
the function returns prev_idx instead of cur_idx and marks the snapshot as coerced (with a reason string)
This prevents downstream planners from seeing massive “removes” driven by a transient provider failure.
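The conditions combine into a predicate roughly like the sketch below; is_suspect is an illustrative name, and parse_ts is a small stand-in for the ISO/epoch coercion mentioned under Checkpoints.

```python
# Sketch of the suspect-snapshot test with the defaults quoted above.
from datetime import datetime

def parse_ts(value: str | None) -> float | None:
    """Stand-in coercion: ISO-8601 or numeric epoch -> epoch seconds."""
    if not value:
        return None
    try:
        return float(value)                        # numeric epoch
    except ValueError:
        try:
            return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
        except ValueError:
            return None

def is_suspect(prev_idx: dict, cur_idx: dict,
               prev_ckpt: str | None, cur_ckpt: str | None,
               suspect_min_prev: int = 20,
               suspect_shrink_ratio: float = 0.10) -> bool:
    if len(prev_idx) < suspect_min_prev:
        return False                               # baseline too small to judge
    shrank = (len(cur_idx) == 0
              or len(cur_idx) <= len(prev_idx) * suspect_shrink_ratio)
    if not shrank:
        return False
    prev_ts, cur_ts = parse_ts(prev_ckpt), parse_ts(cur_ckpt)
    return (cur_ckpt == prev_ckpt                              # same checkpoint
            or (prev_ckpt is not None and cur_ckpt is None)   # marker vanished
            or (prev_ts is not None and cur_ts is not None and cur_ts <= prev_ts))
```

If it returns True, coerce_suspect_snapshot(...) hands the planner prev_idx and records the reason.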
Observability
When runtime.suspect_debug is true (default), it emits:
snapshot:suspect events with counts, checkpoints, and the reason
What counts get logged
Snapshot logging uses _eventish_count(feature, idx):
watchlist: len(idx)
history: counts only entries that have watched_at / last_watched_at
ratings: counts only entries that have rating-related fields
This keeps logs meaningful (otherwise you’d count “shell” items).
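A sketch of that counting rule; treating "rating-related fields" as a plain rating key is an assumption for illustration.

```python
# Sketch of _eventish_count(feature, idx). The ratings branch assumes a
# plain "rating" field stands in for "rating-related fields".
def eventish_count(feature: str, idx: dict[str, dict]) -> int:
    if feature == "history":
        return sum(1 for it in idx.values()
                   if it.get("watched_at") or it.get("last_watched_at"))
    if feature == "ratings":
        return sum(1 for it in idx.values() if it.get("rating") is not None)
    return len(idx)   # watchlist (and anything else): plain size
```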
Troubleshooting
If you’re diagnosing weird plans:
If a provider returns a dict keyed by something low-priority (e.g., simkl:) but the items contain imdb: IDs, the snapshot will likely be rekeyed to imdb: keys.
If you see massive removals right after a provider outage, check whether drop_guard is enabled and whether checkpoints progressed.
For anime/watchlist oddities, check whether ANILIST got rekeyed and whether the shadow file exists under /config/.cw_state/.
Related pages
Snapshot TTL and debug flags: Runtime
How suspect snapshots prevent bad deletes: Guardrails
Where snapshots can get stale: Caching layers