Snapshots
How the orchestrator builds, normalizes, and caches provider snapshots (indices), including canonical keys, checkpoints, and drop-guard safety.
Snapshots are CrossWatch’s “current view” of each provider.
Plans are only as good as the snapshots they are built from.
Why snapshots matter
If a provider returns stale data, plans look wrong.
If a provider returns empty data, deletes can look scary.
CrossWatch uses safety checks before trusting a big drop.
What protects you
Drop guard can treat a tiny snapshot as “suspect”.
It then reuses the previous baseline for planning.
That prevents mass deletes from transient outages.
What to do when things look stale
Run again. Many providers are eventually consistent.
Check provider health. Auth failures skip work.
Disable in-run snapshot caching (runtime.snapshot_ttl_sec = 0).
Related:
Where stale data comes from: Caching layers
Safety model: Guardrails
Runtime knobs: Runtime
Snapshots are the orchestrator’s normalized view of a provider’s data. They are also called indices in code.
Code: cw_platform/orchestrator/_snapshots.py
Called by: _pairs_oneway.py, _pairs_twoway.py
Overview
Terms
Snapshot / index: current items for one provider + one feature.
Shape: canonical_key -> item dict. Type alias: SnapIndex = dict[str, dict[str, Any]].
Baseline: last known good snapshot persisted in state.json.
Checkpoint: provider marker used to detect stale snapshots.
Snapshots are built per run. They can be memoized in-memory during that run.
Technical reference
Where snapshots come from
Every sync provider implements build_index(config, feature=...).
build_snapshots_for_feature(...) loops all loaded providers and, per provider:
Verifies the provider claims it supports this feature (ops.features()).
Verifies at least one enabled pair needs this snapshot.
Verifies the provider is configured (ops.is_configured(config), if implemented).
Calls ops.build_index(...).
Normalizes the output into a SnapIndex.
Runs watchlist-only post-processing (coalescing, ANILIST backfill).
Stores the result in the per-run memo cache (optional TTL).
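Putting the gating steps together, here is a hedged sketch of that loop. allowed_providers_for_feature and the ops.* calls are real names from this page; normalize_to_snap_index is a hypothetical stand-in for the normalization pass described below.

```python
# Illustrative sketch of build_snapshots_for_feature(...), not the actual
# implementation. normalize_to_snap_index is a hypothetical stand-in for
# the normalization pass covered in the next sections.
from typing import Any

SnapIndex = dict[str, dict[str, Any]]  # canonical_key -> item dict

def build_snapshots_for_feature(providers, config, feature: str) -> dict[str, SnapIndex]:
    snaps: dict[str, SnapIndex] = {}
    allowed = allowed_providers_for_feature(config, feature)  # pair gating
    for name, ops in providers.items():
        if not ops.features().get(feature):        # feature gating
            continue
        if name not in allowed:                    # no enabled pair needs this snapshot
            continue
        if hasattr(ops, "is_configured") and not ops.is_configured(config):
            continue                               # configuration gating
        raw = ops.build_index(config, feature=feature)
        snaps[name] = normalize_to_snap_index(raw) # list-or-dict -> SnapIndex
    return snaps
```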
“Only build what we actually need”
Snapshots aren’t built for every provider every time.
Feature gating
A provider must advertise the feature:
ops.features() returns a mapping like {"watchlist": True, "ratings": True, ...}.
If features()[feature] is falsy, the provider is skipped.
Pair gating
allowed_providers_for_feature(config, feature) scans config["pairs"] and collects providers that participate in at least one enabled pair that runs the feature.
If your config includes a provider but no pair uses it for that feature, it won’t get indexed.
Configuration gating
If the provider implements is_configured(config), it must return truthy; otherwise it's skipped.
This avoids wasting time querying providers that are present in code but not configured in config.json.
Normalization: provider output → SnapIndex
Providers may return:
a list of item dicts, or
a dict mapping provider_key -> item dict
Both shapes normalize to the same SnapIndex (canonical_key -> item dict).
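A minimal sketch of that normalization, assuming canonical_key from cw_platform.id_map; the function name is the same hypothetical stand-in used in the loop sketch above.

```python
# Minimal sketch: both provider output shapes collapse to a SnapIndex.
# canonical_key comes from cw_platform.id_map; this function name is a
# stand-in, not the orchestrator's actual helper.
from cw_platform.id_map import canonical_key

def normalize_to_snap_index(raw) -> dict[str, dict]:
    if isinstance(raw, dict):
        # dict form: keep the provider's key for now; a better canonical
        # key may replace it (see "If the provider returns a dict")
        return {key.split("@", 1)[0]: dict(item) for key, item in raw.items()}
    # list form: compute a canonical key from each item's IDs
    return {canonical_key(item): dict(item) for item in (raw or [])}
```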
Canonical keys
Canonical keys are created by cw_platform.id_map.canonical_key(item). The key is typically based on the best available external ID:
Priority order (KEY_PRIORITY) starts with: imdb > tmdb > tvdb > trakt > mal > anilist > kitsu > anidb > simkl > plex > guid > slug
If the provider returns a dict
If build_index() returns a dict, the orchestrator will pick the better of:
provider_key (the dict key, stripped of any @ suffix)
computed_key (from canonical_key(item))
Selection rule:
Pick whichever has higher key priority (e.g., prefer imdb:... over simkl:...).
If one is missing, use the other.
This keeps keys stable and improves cross-provider matching.
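A sketch of that selection rule, using the KEY_PRIORITY order quoted above; key_rank and choose_key are illustrative names, not the orchestrator's actual helpers.

```python
# Hedged sketch of the "pick the better key" rule for dict-shaped
# build_index() output. key_rank/choose_key are illustrative names.
KEY_PRIORITY = ["imdb", "tmdb", "tvdb", "trakt", "mal", "anilist",
                "kitsu", "anidb", "simkl", "plex", "guid", "slug"]

def key_rank(key: str) -> int:
    prefix = key.split(":", 1)[0]
    return KEY_PRIORITY.index(prefix) if prefix in KEY_PRIORITY else len(KEY_PRIORITY)

def choose_key(provider_key: str, computed_key: str | None) -> str:
    provider_key = provider_key.split("@", 1)[0]   # strip any @suffix
    if not computed_key:
        return provider_key
    # lower rank = higher priority (imdb beats simkl, etc.)
    return min((provider_key, computed_key), key=key_rank)
```

For example, choose_key("simkl:12345", "imdb:tt0111161") returns the imdb: key, which is why dict-keyed provider output often gets rekeyed.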
Coalescing duplicates (watchlist only)
Watchlist snapshots run a coalescing pass after normalization:
_coalesce_by_shared_ids(idx, feature="watchlist")
If two keys share any ID token, they are grouped.
Then the orchestrator picks a “best key” for the group:
best key priority (imdb first, etc.)
tie-breaker: item with the most IDs wins
Finally it merges item dicts:
missing/empty fields are filled from the other items
the ids dict is merged (missing IDs filled in)
Result: fewer duplicates during diffing. This matters when providers disagree on the primary key.
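A simplified sketch of that pass, reusing key_rank from the sketch above. Grouping here is single-pass and not fully transitive, unlike the real _coalesce_by_shared_ids; the function name is an assumption.

```python
# Simplified coalescing sketch: group watchlist entries that share any
# ID token, keep the best key per group, and merge missing fields/IDs.
# Single-pass grouping here is weaker than the real implementation.
def coalesce_by_shared_ids(idx: dict[str, dict]) -> dict[str, dict]:
    groups: list[dict] = []
    for key, item in idx.items():
        tokens = {f"{s}:{v}" for s, v in (item.get("ids") or {}).items() if v}
        group = next((g for g in groups if g["tokens"] & tokens), None)
        if group is None:
            group = {"keys": [], "tokens": set()}
            groups.append(group)
        group["keys"].append(key)
        group["tokens"] |= tokens

    out: dict[str, dict] = {}
    for g in groups:
        # best key: highest priority first, then the item with the most IDs
        best = min(g["keys"],
                   key=lambda k: (key_rank(k), -len(idx[k].get("ids") or {})))
        merged = dict(idx[best])
        merged["ids"] = dict(merged.get("ids") or {})
        for k in g["keys"]:
            for field, value in idx[k].items():
                if field != "ids" and not merged.get(field):
                    merged[field] = value          # fill missing/empty fields
            for s, v in (idx[k].get("ids") or {}).items():
                merged["ids"].setdefault(s, v)     # fill missing IDs
        out[best] = merged
    return out
```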
ANILIST watchlist key backfill
If an ANILIST watchlist snapshot exists, keys may be improved:
_maybe_backfill_anilist_shadow(snaps, feature="watchlist")
What it does:
Builds a token lookup from other providers’ watchlist items.
For each ANILIST item, finds a higher-priority matching key.
Rekeys the ANILIST snapshot entry to that better key.
Enriches the ANILIST item's ids with missing IDs from the matched item.
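A condensed sketch of the rekeying step, again reusing key_rank from the earlier sketch. token_lookup and the function name are assumptions; here token_lookup maps ID tokens from other providers' watchlist items to (better_key, item) pairs.

```python
# Condensed sketch of the ANILIST rekey pass. token_lookup and the
# function name are illustrative; key_rank is the helper sketched above.
def backfill_anilist_keys(anilist_idx: dict[str, dict],
                          token_lookup: dict[str, tuple[str, dict]]) -> dict[str, dict]:
    out: dict[str, dict] = {}
    for key, item in anilist_idx.items():
        tokens = {f"{s}:{v}" for s, v in (item.get("ids") or {}).items() if v}
        match = next((token_lookup[t] for t in tokens
                      if t in token_lookup
                      and key_rank(token_lookup[t][0]) < key_rank(key)),
                     None)
        if match is None:
            out[key] = item
            continue
        better_key, matched_item = match
        enriched = dict(item)
        # enrich ids: ANILIST's own values win, missing ones are filled in
        enriched["ids"] = {**(matched_item.get("ids") or {}),
                           **(item.get("ids") or {})}
        out[better_key] = enriched                 # rekey to the better key
    return out
```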
Shadow file
When it can extract ANILIST identifiers, it writes a scoped shadow file:
/config/.cw_state/anilist_watchlist_shadow.<scope>.json (scoped filename via scoped_file(...))
Stored fields include:
anilist_id (optional)
list_entry_id (optional)
mal (optional)
source_ids (IDs from the matched "better-key" item)
updated_at, plus some light metadata (type, title, year)
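An illustrative shadow entry with placeholder values; the exact field layout is inferred from the list above, not copied from the code.

```python
# Illustrative shadow-file entry (placeholder values; layout inferred
# from the stored-fields list above).
shadow_entry = {
    "anilist_id": 12345,          # optional
    "list_entry_id": 67890,       # optional
    "mal": 11111,                 # optional
    "source_ids": {"imdb": "tt0000000", "tmdb": 99999},
    "updated_at": "2024-01-01T00:00:00Z",
    "type": "show",
    "title": "Example Title",
    "year": 2020,
}
```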
Scope rules:
Only written when a real pair scope is active.
Not written for “unscoped/default/health” scopes.
Net effect: better key stability across providers for anime-heavy setups.
Snapshot memoization (per-run cache)
build_snapshots_for_feature supports an in-memory memo cache:
SnapCache = dict[(provider, feature), (ts, index)]
If snap_ttl_sec > 0 and the cache entry is fresh:
The orchestrator reuses the cached snapshot and skips API calls.
Important:
Empty snapshots and snapshots built during a “degraded” provider call are not cached.
This cache is not persisted; it resets each run.
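A sketch of that memoization logic under the shapes described above; get_or_build is a hypothetical wrapper, and build is assumed to report whether the provider call degraded.

```python
# Sketch of the per-run memo cache. Empty or degraded results are never
# cached; the cache lives only for the duration of a run.
import time
from typing import Any, Callable

SnapCache = dict[tuple[str, str], tuple[float, dict[str, dict[str, Any]]]]

def get_or_build(cache: SnapCache, provider: str, feature: str,
                 snap_ttl_sec: float,
                 build: Callable[[], tuple[dict, bool]]) -> dict:
    key = (provider, feature)
    hit = cache.get(key)
    if snap_ttl_sec > 0 and hit and (time.time() - hit[0]) < snap_ttl_sec:
        return hit[1]                   # fresh entry: skip provider API calls
    idx, degraded = build()             # build() reports a degraded provider call
    if idx and not degraded:            # don't cache empty/degraded snapshots
        cache[key] = (time.time(), idx)
    return idx
```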
Checkpoints (used to detect stale/bad snapshots)
A provider can optionally expose activities(config).
module_checkpoint(ops, config, feature) reads that mapping and chooses a relevant marker:
watchlist: watchlist, ptw, or updated_at
ratings: ratings or updated_at
history: history or updated_at
otherwise: updated_at
Previous checkpoints come from state.json via:
prev_checkpoint(state, provider, feature)
Checkpoints are treated as strings, but coercion/parsing exists for ISO timestamps and numeric epoch values.
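A hedged sketch of the marker selection, following the fallback order above:

```python
# Hedged sketch of module_checkpoint(...); the marker fallback order
# follows the list above, and checkpoints are handled as strings.
def module_checkpoint(ops, config, feature: str) -> str | None:
    acts = ops.activities(config) if hasattr(ops, "activities") else {}
    order = {
        "watchlist": ("watchlist", "ptw", "updated_at"),
        "ratings":   ("ratings", "updated_at"),
        "history":   ("history", "updated_at"),
    }.get(feature, ("updated_at",))
    for marker in order:
        if acts.get(marker):
            return str(acts[marker])   # coerce to string for comparison
    return None
```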
Drop guard: coercing a suspect snapshot back to baseline
This prevents destructive plans caused by transient snapshot drops.
Enabled by:
sync.drop_guard: true (per pair/feature config)
Implemented in:
coerce_suspect_snapshot(...)
It triggers only when the provider’s capabilities()["index_semantics"] is "present" (default).
Conditions
Given:
prev_idx = previous baseline index
cur_idx = freshly fetched snapshot
The snapshot is considered suspect if:
The previous baseline is “big enough”:
len(prev_idx) >= suspect_min_prev (default via runtime config: 20)
The new snapshot shrank too much:
len(cur_idx) == 0 OR len(cur_idx) <= len(prev_idx) * suspect_shrink_ratio (default shrink ratio: 0.10)
And the checkpoint did not progress:
same checkpoint, or
the new checkpoint parses to a time <= the previous one, or
previous exists but current is missing
If all match:
the function returns prev_idx instead of cur_idx and marks the snapshot as coerced (with a reason string)
This prevents downstream planners from seeing massive “removes” driven by a transient provider failure.
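The conditions combine into a predicate roughly like the sketch below; is_suspect is an illustrative name, and parse_ts is a small stand-in for the ISO/epoch coercion mentioned under Checkpoints.

```python
# Sketch of the suspect-snapshot test with the defaults quoted above.
from datetime import datetime

def parse_ts(value: str | None) -> float | None:
    """Stand-in coercion: ISO-8601 or numeric epoch -> epoch seconds."""
    if not value:
        return None
    try:
        return float(value)                        # numeric epoch
    except ValueError:
        try:
            return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
        except ValueError:
            return None

def is_suspect(prev_idx: dict, cur_idx: dict,
               prev_ckpt: str | None, cur_ckpt: str | None,
               suspect_min_prev: int = 20,
               suspect_shrink_ratio: float = 0.10) -> bool:
    if len(prev_idx) < suspect_min_prev:
        return False                               # baseline too small to judge
    shrank = (len(cur_idx) == 0
              or len(cur_idx) <= len(prev_idx) * suspect_shrink_ratio)
    if not shrank:
        return False
    prev_ts, cur_ts = parse_ts(prev_ckpt), parse_ts(cur_ckpt)
    return (cur_ckpt == prev_ckpt                              # same checkpoint
            or (prev_ckpt is not None and cur_ckpt is None)   # marker vanished
            or (prev_ts is not None and cur_ts is not None and cur_ts <= prev_ts))
```

If it returns True, coerce_suspect_snapshot(...) hands the planner prev_idx and records the reason.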
Observability
When runtime.suspect_debug is true (default), it emits:
snapshot:suspect events with counts, checkpoints, and the reason
What counts get logged
Snapshot logging uses _eventish_count(feature, idx):
watchlist: len(idx)
history: counts only entries that have watched_at / last_watched_at
ratings: counts only entries that have rating-related fields
This keeps logs meaningful (otherwise you’d count “shell” items).
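A sketch of that counting rule; treating "rating-related fields" as a plain rating key is an assumption for illustration.

```python
# Sketch of _eventish_count(feature, idx). The ratings branch assumes a
# plain "rating" field stands in for "rating-related fields".
def eventish_count(feature: str, idx: dict[str, dict]) -> int:
    if feature == "history":
        return sum(1 for it in idx.values()
                   if it.get("watched_at") or it.get("last_watched_at"))
    if feature == "ratings":
        return sum(1 for it in idx.values() if it.get("rating") is not None)
    return len(idx)   # watchlist (and anything else): plain size
```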
Troubleshooting
If you’re diagnosing weird plans:
If a provider returns a dict keyed by something low-priority (e.g., simkl:) but the items contain imdb: IDs, the snapshot will likely be rekeyed to imdb: keys.
If you see massive removals right after a provider outage, check whether drop_guard is enabled and whether checkpoints progressed.
For anime/watchlist oddities, check whether ANILIST got rekeyed and whether the shadow file exists under /config/.cw_state/.
Related pages
Snapshot TTL and debug flags: Runtime
How suspect snapshots prevent bad deletes: Guardrails
Where snapshots can get stale: Caching layers