Describe the bug
When using Microsoft Fabric (TSQL) with virtual_environment_mode = "dev_only", models that have been removed from a plan continue to exist as snapshot records in the state. This causes the janitor to repeatedly attempt cleanup of non-existent (or already-dropped) models and fail.
Environment
- Engine: Microsoft Fabric / TSQL
- virtual_environment_mode = "dev_only"
Steps to reproduce
- Configure SQLMesh with a Fabric connection and virtual_environment_mode = "dev_only"
- Apply a plan that includes a model (e.g., my_schema.my_model)
- Remove the model from the project and apply a new plan
- Run the janitor (or wait for it to run automatically)
- Observe that the snapshot record for the removed model still exists in state
Expected behavior
After the janitor runs, the snapshot record for the removed model should be deleted from state.
Actual behavior
The snapshot record persists in state. The janitor logs errors on each subsequent run when attempting to clean up the model.
Root cause analysis
The issue is a two-step failure in the janitor's cleanup flow in sqlmesh/core/janitor.py:
snapshot_evaluator.cleanup(target_snapshots=batch.cleanup_tasks, ...) # step 1
state_sync.delete_expired_snapshots(...) # step 2
delete_expired_snapshots (step 2) is only called if cleanup (step 1) succeeds. If step 1 raises an exception, the snapshot records are intentionally retained for retry — but on Fabric, the retry never succeeds.
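The ordering dependency can be sketched as follows (a minimal reproduction with hypothetical class and function names, not the actual janitor code):

```python
# Simplified sketch of the two-step flow described above
# (hypothetical names; the real flow lives in sqlmesh/core/janitor.py).
class FailingEvaluator:
    """Stands in for snapshot_evaluator when physical cleanup fails."""
    def cleanup(self, batch):
        raise RuntimeError("catalog switch failed")

class RecordingStateSync:
    """Records which batches had their state records deleted."""
    def __init__(self):
        self.deleted = []

    def delete_expired_snapshots(self, batch):
        self.deleted.append(batch)

def run_janitor_batch(batch, snapshot_evaluator, state_sync):
    # Step 1: drop the physical objects. If this raises, step 2 is
    # never reached, so the snapshot records survive for the retry.
    snapshot_evaluator.cleanup(batch)
    # Step 2: only reached when step 1 succeeds.
    state_sync.delete_expired_snapshots(batch)
```

Because the evaluator raises on every run in the Fabric case, the state records are retained forever rather than retried successfully.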
The failure originates in _cleanup_snapshot in sqlmesh/core/snapshot/evaluator.py:
try:
    evaluation_strategy.delete(table_name, ...)
except Exception:
    if adapter.get_data_object(table_name) is not None:
        raise  # re-raises if the table still exists
    logger.warning("Skipping cleanup ...")
In dev_only mode, snapshot.table_name(is_deployable=True) returns the original unversioned table name (e.g., my_schema.my_model) rather than a versioned sqlmesh__-prefixed name. This table lives in the user's actual Fabric warehouse (catalog), so accessing it requires a catalog switch.
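As an aside, one of the options raised under "Possible fix" below is an existence check that never switches the active catalog. A hypothetical sketch, assuming Fabric accepts three-part names against another warehouse in the same workspace (the helper name and approach are illustrative, not SQLMesh API):

```python
def table_exists_sql(catalog: str, schema: str, table: str) -> str:
    """Build a fully-qualified INFORMATION_SCHEMA existence check.

    Hypothetical helper: relies on Fabric's support for three-part
    names across warehouses in the same workspace, so no
    set_current_catalog (and hence no pool recycle) is needed.
    """
    return (
        f"SELECT 1 FROM [{catalog}].INFORMATION_SCHEMA.TABLES "
        f"WHERE TABLE_SCHEMA = '{schema}' AND TABLE_NAME = '{table}'"
    )
```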
Fabric's set_current_catalog implementation closes the connection pool and reopens it with a new catalog configuration (sqlmesh/core/engine_adapter/fabric.py). This teardown/rebuild cycle can fail in two ways:
- Mode A: drop_table fails due to a connection/auth error triggered by the catalog switch. The fallback get_data_object call then also raises (same connection state issue), propagating the exception up and aborting the cleanup batch before delete_expired_snapshots is reached.
- Mode B: drop_table fails, get_data_object returns non-None (the drop failed, so the table still exists), and the original exception is re-raised — same result.
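The teardown/rebuild failure window can be sketched as follows (a hypothetical adapter that mimics the behavior described above, not the actual Fabric adapter code):

```python
# Hypothetical sketch of a catalog switch that recycles the connection
# pool, mirroring the failure window described above.
class PoolRecyclingAdapter:
    def __init__(self, connect, catalog="default"):
        self._connect = connect          # callable: catalog -> connection
        self._conn = connect(catalog)

    def set_current_catalog(self, catalog):
        # Tear down the current connection, then reconnect with the new
        # catalog baked into the config. If reconnecting fails (auth or
        # network error), the adapter is left with no usable connection,
        # so both drop_table and the get_data_object fallback will fail.
        self._conn = None
        self._conn = self._connect(catalog)
```

The key point is that a failed switch leaves the adapter in a state where every subsequent query raises, which is what turns a single failed drop into an aborted cleanup batch.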
This is specific to dev_only mode because in full mode the physical tables use versioned, sqlmesh__-prefixed schema names that are less likely to require a catalog switch during cleanup.
Relevant code locations
sqlmesh/core/janitor.py — sequential cleanup → delete_expired_snapshots flow
sqlmesh/core/snapshot/evaluator.py — _cleanup_snapshot exception handling
sqlmesh/core/engine_adapter/fabric.py — set_current_catalog / _drop_catalog connection teardown
sqlmesh/core/state_sync/db/snapshot.py — get_expired_snapshots (snapshot expiry detection works correctly; the problem is in the cleanup step)
Possible fix
The get_data_object fallback call in _cleanup_snapshot should be made more resilient to Fabric connection errors — either by catching exceptions from get_data_object itself and treating them as "table unknown / skip", or by ensuring the Fabric adapter properly handles catalog context before the get_data_object query. Additionally, it may be worth investigating whether the catalog switch can be avoided entirely during janitor cleanup by using a fully-qualified table name query against INFORMATION_SCHEMA that does not require switching the active catalog.
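A minimal sketch of the first option — treating a failed existence check as "unknown, skip" instead of letting it abort the batch (hypothetical signature; only illustrates the idea, not a patch against the actual SQLMesh code):

```python
import logging

logger = logging.getLogger(__name__)

def cleanup_snapshot(evaluation_strategy, adapter, table_name):
    """Sketch of a more resilient _cleanup_snapshot fallback."""
    try:
        evaluation_strategy.delete(table_name)
    except Exception as delete_exc:
        try:
            still_exists = adapter.get_data_object(table_name) is not None
        except Exception:
            # The existence check itself failed (e.g. the Fabric
            # connection was torn down by a catalog switch). Treat the
            # table state as unknown and skip, rather than abort the
            # batch and block delete_expired_snapshots.
            logger.warning(
                "Could not verify %s after failed drop; skipping", table_name
            )
            return
        if still_exists:
            raise delete_exc
        logger.warning("Skipping cleanup of %s: already dropped", table_name)
```

The trade-off is that a transient connection error could skip a table that genuinely still exists, leaving an orphaned physical table; that may be acceptable given that the current behavior leaves the state record orphaned forever.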