🏗️ Stack State Management

Overview

Stack State Management is a Terraform/Pulumi-inspired system that brings infrastructure-as-code best practices to task orchestration. It provides state locking, versioning, drift detection, and dependency tracking for your workflows.

Key Features

  • 🔒 State Locking: Prevents concurrent executions that could conflict
  • 📸 Snapshots & Versioning: Track changes over time with rollback capability
  • 🔍 Drift Detection: Compare desired vs actual state
  • 🔗 Dependency Tracking: Visualize and manage stack dependencies
  • ✅ Validation: Pre-flight checks before execution
  • 📊 Event System: Complete audit trail of all operations

System Architecture

graph TB
    subgraph ClientLayer["Client Layer"]
        CLI[CLI Client]
        API[REST API]
        SDK[SDK/Library]
    end

    subgraph StackStateSystem["Stack State System"]
        subgraph CoreServices["Core Services"]
            LockSvc[Locking Service]
            SnapshotSvc[Snapshot Service]
            DriftSvc[Drift Detection]
        end

        subgraph AdvancedServices["Advanced Services"]
            DepSvc[Dependency Tracker]
            ValidSvc[Validation Service]
            EventSvc[Event Processor]
        end
    end

    subgraph Storage["Storage Layer"]
        DB[(SQLite Database)]
        EventStore[(Event Store)]
    end

    CLI --> LockSvc
    CLI --> SnapshotSvc
    CLI --> DriftSvc
    API --> LockSvc
    SDK --> DepSvc

    LockSvc --> DB
    SnapshotSvc --> DB
    DriftSvc --> DB
    DepSvc --> DB
    ValidSvc --> DB

    LockSvc --> EventSvc
    SnapshotSvc --> EventSvc
    DriftSvc --> EventSvc

    EventSvc --> EventStore

Component Overview

Component           Purpose                    Key Features
------------------  -------------------------  -------------------------------------------------
Locking Service     Prevent concurrent access  Metadata tracking, force release, status checking
Snapshot Service    Version management         Automatic versioning, rollback, comparison
Drift Detection     State validation           Compare actual vs desired, auto-fix
Dependency Tracker  Manage relationships       Circular detection, execution ordering
Validation Service  Pre-flight checks          Resource verification, config validation
Event Processor     Audit trail                100 workers, 1000 event buffer

State Locking

Overview

State locking prevents multiple operations from modifying the same stack simultaneously, ensuring data integrity and preventing race conditions.

Lock Lifecycle

stateDiagram-v2
    [*] --> Unlocked
    Unlocked --> Acquiring: lock acquire
    Acquiring --> Locked: Success
    Acquiring --> Unlocked: Failure

    Locked --> Releasing: lock release
    Releasing --> Unlocked: Success

    Locked --> ForceReleasing: force-release
    ForceReleasing --> Unlocked: Success

    Locked --> Locked: Status Check
    Unlocked --> Unlocked: Status Check
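
The transitions in the lifecycle above can be modelled in a few lines. This is a minimal in-memory sketch in Python, not sloth-runner's implementation: `LockTable`, `Lock`, and `LockError` are hypothetical names, and the rule that only the holder may release normally is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class Lock:
    locked_by: str
    operation: str
    reason: str
    locked_at: datetime


class LockError(Exception):
    """Raised when a lock transition is not allowed."""


class LockTable:
    """In-memory stand-in for the lock store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._locks: Dict[str, Lock] = {}

    def acquire(self, stack: str, locked_by: str,
                operation: str = "", reason: str = "") -> Lock:
        # Unlocked -> Locked. A failed attempt leaves the existing lock untouched.
        if stack in self._locks:
            holder = self._locks[stack].locked_by
            raise LockError(f"stack '{stack}' is already locked by {holder}")
        lock = Lock(locked_by, operation, reason, datetime.now(timezone.utc))
        self._locks[stack] = lock
        return lock

    def release(self, stack: str, unlocked_by: str) -> None:
        # Locked -> Unlocked. Assumption: only the holder may release normally.
        lock = self._locks.get(stack)
        if lock is None:
            raise LockError(f"stack '{stack}' is not locked")
        if lock.locked_by != unlocked_by:
            raise LockError(f"lock is held by {lock.locked_by}, not {unlocked_by}")
        del self._locks[stack]

    def force_release(self, stack: str) -> None:
        # Locked -> Unlocked regardless of who holds the lock.
        self._locks.pop(stack, None)

    def status(self, stack: str) -> Optional[Lock]:
        # A status check never changes state.
        return self._locks.get(stack)
```

A second `acquire` on a locked stack raises `LockError`, which corresponds to the `Acquiring --> Unlocked: Failure` edge from the point of view of the caller that lost the race.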

Commands

Acquire Lock

sloth-runner stack lock acquire <stack-name> [options]

Options:
  • --reason <text> - Why you're acquiring the lock
  • --locked-by <identity> - Who/what is locking (default: current user)
  • --operation <name> - Operation being performed

Example:

$ sloth-runner stack lock acquire production-stack \
    --reason "Deploying v2.0.0" \
    --locked-by "deploy-bot" \
    --operation "deployment"

✓ Lock acquired for stack 'production-stack'

Lock Details:
  Locked by:    deploy-bot
  Locked at:    2025-10-10 14:41:31
  Operation:    deployment
  Reason:       Deploying v2.0.0

Check Lock Status

sloth-runner stack lock status <stack-name>

Example Output:

$ sloth-runner stack lock status production-stack

Stack: production-stack
Status: LOCKED

Lock Details:
  Locked by:    deploy-bot
  Locked at:    2025-10-10 14:41:31
  Operation:    deployment
  Reason:       Deploying v2.0.0
  Duration:     5m 23s

Release Lock

sloth-runner stack lock release <stack-name> [options]

Options:
  • --unlocked-by <identity> - Who is releasing the lock

Example:

$ sloth-runner stack lock release production-stack \
    --unlocked-by "deploy-bot"

✓ Lock released for stack 'production-stack'

Force Release Lock

⚠️ Use with caution: only for emergency situations

sloth-runner stack lock force-release <stack-name> [options]

Example:

$ sloth-runner stack lock force-release production-stack \
    --reason "Emergency maintenance"

⚠ WARNING: Force releasing lock for stack 'production-stack'
✓ Lock force-released

Use Cases

  • Long-running deployments: Prevent other deployments from starting
  • Multi-step operations: Ensure atomic execution
  • Team collaboration: Coordinate work across team members
  • Emergency maintenance: Force release stuck locks

Snapshots & Versioning

Overview

Snapshots provide point-in-time backups of your stack state, enabling rollback and version comparison.

Snapshot Lifecycle

sequenceDiagram
    participant User
    participant CLI
    participant SnapshotService
    participant Database
    participant EventStore

    User->>CLI: snapshot create
    CLI->>SnapshotService: CreateSnapshot(stack, metadata)
    SnapshotService->>Database: Query current state
    Database-->>SnapshotService: State data
    SnapshotService->>SnapshotService: Generate version (v38)
    SnapshotService->>Database: Store snapshot
    SnapshotService->>EventStore: Emit snapshot.created event
    EventStore-->>SnapshotService: Event stored
    SnapshotService-->>CLI: Snapshot created
    CLI-->>User: ✓ Snapshot v38 created
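
The "generate version, store, emit event" steps of the sequence can be sketched as follows. This is an illustrative Python model only: it assumes versions are sequential integers with a `v` prefix (as the examples on this page suggest), and `next_version`, `create_snapshot`, the dict-based store, and the event list are hypothetical stand-ins for the Snapshot Service, database, and Event Store.

```python
import json
from typing import Any, Dict, List


def next_version(existing: List[str]) -> str:
    """Return the next sequential label, e.g. ['v36', 'v37'] -> 'v38'."""
    highest = max((int(label[1:]) for label in existing), default=0)
    return f"v{highest + 1}"


def create_snapshot(store: Dict[str, bytes], events: List[dict], stack: str,
                    state: Dict[str, Any], creator: str, description: str) -> str:
    """Store a point-in-time copy of `state` and record a snapshot.created event."""
    version = next_version(list(store))
    # Serialize now so that later changes to `state` cannot alter the snapshot.
    store[version] = json.dumps(state, sort_keys=True).encode("utf-8")
    events.append({
        "type": "snapshot.created",
        "stack": stack,
        "version": version,
        "creator": creator,
        "description": description,
    })
    return version
```

Serializing at creation time is the design choice that makes a snapshot a true point-in-time copy instead of a reference to live state.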

Commands

Create Snapshot

sloth-runner stack snapshot create <stack-name> [options]

Options:
  • --description <text> - Snapshot description
  • --creator <identity> - Who created the snapshot
  • --tags <tags> - Tags for categorization

Example:

$ sloth-runner stack snapshot create production-stack \
    --description "Before v2.0 upgrade" \
    --creator "admin" \
    --tags "production,upgrade"

✓ Snapshot created for stack 'production-stack'

Snapshot Details:
  Version:      v38
  Creator:      admin
  Description:  Before v2.0 upgrade
  Tags:         production, upgrade
  Created:      2025-10-10 14:30:00

List Snapshots

sloth-runner stack snapshot list <stack-name>

Example Output:

$ sloth-runner stack snapshot list production-stack

Snapshots for stack: production-stack

Version  Creator        Description                Created At           Tags
-------  -------------  ------------------------   -------------------  ----------------
v38      admin          Before v2.0 upgrade        2025-10-10 14:30:00  production,upgrade
v37      system         Auto-snapshot              2025-10-10 14:15:00  auto
v36      admin          Pre-maintenance backup     2025-10-10 13:00:00  maintenance
v35      deploy-bot     Post-deployment            2025-10-10 12:30:00  deployment

Total: 38 snapshots

Show Snapshot Details

sloth-runner stack snapshot show <stack-name> <version>

Example:

$ sloth-runner stack snapshot show production-stack v38

Snapshot Details:
  Stack:        production-stack
  Version:      v38
  Creator:      admin
  Description:  Before v2.0 upgrade
  Created At:   2025-10-10 14:30:00
  Size:         1.2 MB

State Summary:
  Resources:    15 resources
  Tasks:        8 tasks
  Status:       completed

Restore Snapshot

sloth-runner stack snapshot restore <stack-name> <version>

Example:

$ sloth-runner stack snapshot restore production-stack v38

⚠ WARNING: This will restore stack to snapshot v38
Are you sure? (yes/no): yes

✓ Restoring snapshot v38...
✓ Snapshot restored successfully

Current State:
  Version:      v38
  Restored At:  2025-10-10 15:00:00
  Restored By:  admin

Compare Snapshots

sloth-runner stack snapshot compare <stack-name> <v1> <v2>

Example:

$ sloth-runner stack snapshot compare production-stack v37 v38

Comparing snapshots: v37 -> v38

Changes:
  + Added resource: database-server
  ~ Modified resource: web-server (replicas: 2 -> 4)
  - Removed resource: cache-server

Task Changes:
  ~ deploy_app: timeout 10m -> 15m
  + New task: configure_database

Delete Snapshot

sloth-runner stack snapshot delete <stack-name> <version>

Automatic Snapshots

Enable automatic snapshots before critical operations:

# Config: /etc/sloth-runner/config.yaml
stacks:
  auto_snapshot: true
  snapshot_retention: 30d  # Keep for 30 days
  snapshot_triggers:
    - before_deployment
    - before_destroy
    - on_drift_fix

Drift Detection

Overview

Drift detection identifies differences between the desired state (defined in your workflow) and the actual state (what's deployed).

Drift Detection Flow

graph LR
    subgraph Detection["Drift Detection Process"]
        Start[Start Drift Check]
        ReadDesired[Read Desired State]
        ReadActual[Read Actual State]
        Compare[Compare States]
        Analyze[Analyze Differences]
        Report[Generate Report]
    end

    subgraph Results["Detection Results"]
        NoDrift[No Drift Detected]
        DriftFound[Drift Detected]
        GenerateFix[Generate Fix Plan]
    end

    Start --> ReadDesired
    ReadDesired --> ReadActual
    ReadActual --> Compare
    Compare --> Analyze
    Analyze --> Report

    Report --> NoDrift
    Report --> DriftFound
    DriftFound --> GenerateFix
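
The compare step of this flow amounts to an attribute-by-attribute diff of desired against actual state. The Python sketch below shows one way to do that; the nested-dict state layout, the `Drift` tuple, and `detect_drift` are assumptions for illustration and do not reflect how sloth-runner stores state.

```python
from typing import Any, Dict, List, NamedTuple


class Drift(NamedTuple):
    resource: str
    attribute: str
    expected: Any
    actual: Any


_MISSING = "<missing>"  # marker for attributes or resources absent from actual state


def detect_drift(desired: Dict[str, Dict[str, Any]],
                 actual: Dict[str, Dict[str, Any]]) -> List[Drift]:
    """Report every desired attribute whose actual value differs."""
    drifts: List[Drift] = []
    for resource in sorted(desired):
        want = desired[resource]
        have = actual.get(resource, {})
        for attribute in sorted(want):
            found = have.get(attribute, _MISSING)
            if found != want[attribute]:
                drifts.append(Drift(resource, attribute, want[attribute], found))
    return drifts
```

Only attributes named in the desired state are checked, so extra attributes on a deployed resource are ignored in this sketch; a stricter comparison would also walk the actual state.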

Drift Types

graph TB
    subgraph DriftTypes["Types of Drift"]
        ConfigDrift[Configuration Drift]
        ResourceDrift[Resource Drift]
        StateDrift[State Drift]
        DependencyDrift[Dependency Drift]
    end

    ConfigDrift --> |Example| ConfigEx["Port changed: 8080 -> 9090"]
    ResourceDrift --> |Example| ResourceEx["Server count: 3 -> 2"]
    StateDrift --> |Example| StateEx["Status: running -> stopped"]
    DependencyDrift --> |Example| DepEx["Missing dependency: redis"]

Commands

Detect Drift

sloth-runner stack drift detect <stack-name>

Example Output:

$ sloth-runner stack drift detect production-stack

Detecting drift for stack: production-stack

✓ Drift detection completed

Summary:
  Drifted Resources:    3
  In Sync Resources:    12
  Total Resources:      15

Drifted Resources:
  • web-server: replicas (expected: 4, actual: 2)
  • database: port (expected: 5432, actual: 5433)
  • cache: status (expected: running, actual: stopped)

Run 'drift show' for detailed report
Run 'drift fix' to auto-correct drift

Show Drift Report

sloth-runner stack drift show <stack-name>

Example Output:

$ sloth-runner stack drift show production-stack

Drift Report for: production-stack
Generated: 2025-10-10 15:15:00

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

Resource: web-server
Type: Configuration Drift
Severity: HIGH

Attribute: replicas
  Expected:  4
  Actual:    2
  Impact:    Reduced capacity

Suggested Fix:
  $ kubectl scale deployment web-server --replicas=4

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

Resource: database
Type: Configuration Drift
Severity: MEDIUM

Attribute: port
  Expected:  5432
  Actual:    5433
  Impact:    Connectivity issues

Suggested Fix:
  Update database configuration to use port 5432

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

Resource: cache
Type: State Drift
Severity: HIGH

Attribute: status
  Expected:  running
  Actual:    stopped
  Impact:    Service degradation

Suggested Fix:
  $ systemctl start redis

Fix Drift

sloth-runner stack drift fix <stack-name> [options]

Options:
  • --auto-approve - Skip confirmation
  • --dry-run - Show what would be fixed
  • --resource <name> - Fix specific resource only

Example:

$ sloth-runner stack drift fix production-stack --dry-run

Drift Fix Plan for: production-stack

The following actions will be performed:

  • web-server: Scale replicas from 2 to 4
  • database: Change port from 5433 to 5432
  • cache: Start service (status: stopped -> running)

Run without --dry-run to apply fixes

$ sloth-runner stack drift fix production-stack --auto-approve

Fixing drift for stack: production-stack

✓ web-server: Scaled to 4 replicas
✓ database: Port updated to 5432
✓ cache: Service started

Summary:
  Fixed:    3 resources
  Failed:   0 resources
  Skipped:  0 resources

✓ All drift corrected

Dependency Management

Overview

Dependency management ensures stacks are executed in the correct order and prevents circular dependencies.

Dependency Graph

graph TB
    subgraph InfraLayer["Infrastructure Layer"]
        Network[network-stack]
        Storage[storage-stack]
    end

    subgraph DataLayer["Data Layer"]
        Database[database-stack]
        Cache[cache-stack]
    end

    subgraph AppLayer["Application Layer"]
        Backend[backend-stack]
        Frontend[frontend-stack]
    end

    subgraph MonitoringLayer["Monitoring Layer"]
        Metrics[metrics-stack]
        Logging[logging-stack]
    end

    Network --> Database
    Network --> Cache
    Storage --> Database

    Database --> Backend
    Cache --> Backend

    Backend --> Frontend
    Backend --> Metrics
    Backend --> Logging

Commands

Show Dependencies

sloth-runner stack deps show <stack-name>

Example Output:

$ sloth-runner stack deps show backend-stack

Dependencies for: backend-stack

Direct Dependencies:
  • database-stack (v2.1.0)
  • cache-stack (v1.5.0)
  • network-stack (v3.0.0)

Indirect Dependencies:
  • storage-stack (via database-stack)
  • monitoring-stack (via database-stack)

Dependents (stacks that depend on this):
  • frontend-stack
  • metrics-stack
  • logging-stack

Total dependency count: 6
Dependency depth: 2 levels

Generate Dependency Graph

sloth-runner stack deps graph <stack-name> [options]

Options:
  • --output <file> - Output file (PNG, SVG, or DOT)
  • --format <format> - Output format (default: PNG)
  • --show-versions - Include version numbers
  • --full-tree - Show all transitive dependencies

Example:

$ sloth-runner stack deps graph backend-stack \
    --output deps.png \
    --show-versions \
    --full-tree

✓ Dependency graph generated: deps.png

Graph Statistics:
  Total nodes:      12
  Total edges:      18
  Max depth:        4
  Circular deps:    0

Check for Circular Dependencies

sloth-runner stack deps check <stack-name>

Example Output (No circular dependencies):

$ sloth-runner stack deps check backend-stack

Checking dependencies for: backend-stack

✓ No circular dependencies detected

Dependency tree is valid

Example Output (Circular dependency detected):

$ sloth-runner stack deps check app-stack

Checking dependencies for: app-stack

✗ Circular dependency detected!

Cycle path:
  app-stack -> database-stack -> cache-stack -> app-stack

Resolution suggestions:
  1. Remove dependency: cache-stack -> app-stack
  2. Introduce intermediary stack
  3. Refactor to eliminate circular reference
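
A cycle path like the one reported above can be found with a depth-first search that tracks the current branch. The following Python sketch is one standard way to do it; `find_cycle` and the adjacency-dict input are illustrative assumptions, and the check that sloth-runner actually performs may differ.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return one dependency cycle reachable from `start`, or None if there is none."""
    path: List[str] = []   # stacks on the current DFS branch, in visit order
    on_path = set()        # same stacks, kept as a set for fast membership tests
    done = set()           # stacks fully explored without finding a cycle

    def visit(stack: str) -> Optional[List[str]]:
        if stack in on_path:
            # Back edge: the cycle is the part of the branch starting at `stack`.
            return path[path.index(stack):] + [stack]
        if stack in done:
            return None
        path.append(stack)
        on_path.add(stack)
        for dep in deps.get(stack, []):
            cycle = visit(dep)
            if cycle is not None:
                return cycle
        path.pop()
        on_path.remove(stack)
        done.add(stack)
        return None

    return visit(start)
```

For the example above, joining the returned list with ` -> ` reproduces the reported cycle path.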

Determine Execution Order

sloth-runner stack deps order <stack-names...>

Example:

$ sloth-runner stack deps order \
    frontend-stack backend-stack database-stack cache-stack network-stack

Calculating execution order...

Recommended execution order:
  1. network-stack (no dependencies)
  2. storage-stack (depends on: network-stack)
  3. cache-stack (depends on: network-stack)
  4. database-stack (depends on: network-stack, storage-stack)
  5. backend-stack (depends on: database-stack, cache-stack)
  6. frontend-stack (depends on: backend-stack)

Total execution time estimate: ~25 minutes
Parallelizable groups:
  Group 1: network-stack
  Group 2: storage-stack, cache-stack
  Group 3: database-stack
  Group 4: backend-stack
  Group 5: frontend-stack

With parallelization: ~15 minutes
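
The parallelizable groups in this output correspond to the levels of a topological sort: each group contains the stacks whose dependencies are all satisfied by earlier groups. Below is a minimal Python sketch using Kahn's algorithm; `execution_groups` and its input format are hypothetical, and it makes no claim about how `deps order` is implemented.

```python
from typing import Dict, List, Set


def execution_groups(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group stacks into levels; stacks within one level can run in parallel."""
    remaining: Dict[str, Set[str]] = {s: set(requires) for s, requires in deps.items()}
    # Make sure every dependency is also a node, even if it was not listed as a key.
    for requires in list(remaining.values()):
        for dep in requires:
            remaining.setdefault(dep, set())

    groups: List[List[str]] = []
    while remaining:
        # Stacks with no unmet dependencies are ready to run now.
        ready = sorted(s for s, requires in remaining.items() if not requires)
        if not ready:
            raise ValueError("circular dependency among: " + ", ".join(sorted(remaining)))
        groups.append(ready)
        for stack in ready:
            del remaining[stack]
        for requires in remaining.values():
            requires.difference_update(ready)
    return groups
```

Flattening the groups in order gives a valid sequential execution order; names inside each group are sorted alphabetically here only to keep the result deterministic.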


Validation

Overview

Validation performs pre-flight checks before executing workflows, catching errors early.

Validation Checklist

graph TB
    Start[Start Validation] --> ConfigCheck{Configuration Valid?}
    ConfigCheck -->|No| ConfigError[Report Config Errors]
    ConfigCheck -->|Yes| DepCheck{Dependencies Met?}

    DepCheck -->|No| DepError[Report Missing Dependencies]
    DepCheck -->|Yes| ResourceCheck{Resources Available?}

    ResourceCheck -->|No| ResourceError[Report Resource Issues]
    ResourceCheck -->|Yes| PermCheck{Permissions OK?}

    PermCheck -->|No| PermError[Report Permission Issues]
    PermCheck -->|Yes| LockCheck{Lock Available?}

    LockCheck -->|No| LockError[Report Lock Conflict]
    LockCheck -->|Yes| Success[✓ Validation Passed]

    ConfigError --> Failed[✗ Validation Failed]
    DepError --> Failed
    ResourceError --> Failed
    PermError --> Failed
    LockError --> Failed
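
A pre-flight pipeline of this kind can be expressed as an ordered list of checks. The Python sketch below is illustrative only: `run_checks` and the check tuples are hypothetical. The flowchart stops at the first failing check, whereas this sketch runs every check and collects all failures, which is an assumption chosen so that one run reports everything that needs fixing.

```python
from typing import Callable, List, Tuple

# A check is a label plus a function that returns a list of error messages.
Check = Tuple[str, Callable[[], List[str]]]


def run_checks(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check in order and return (passed, report lines)."""
    report: List[str] = []
    failed = 0
    for label, check in checks:
        errors = check()
        if errors:
            failed += 1
            report.append(f"✗ {label}: failed")
            report.extend(f"  - {message}" for message in errors)
        else:
            report.append(f"✓ {label}: ok")
    return failed == 0, report
```

To get the fail-fast behaviour of the flowchart instead, return as soon as a check yields errors.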

Commands

Validate Single Stack

sloth-runner stack validate <stack-name>

Example Output (Success):

$ sloth-runner stack validate production-stack

Validating stack: production-stack

✓ Configuration syntax valid
✓ All dependencies available
✓ Required resources exist
✓ Permissions sufficient
✓ No lock conflicts
✓ Workflow definition valid
✓ All modules available

Validation passed: production-stack is ready for execution

Example Output (Failure):

$ sloth-runner stack validate production-stack

Validating stack: production-stack

✓ Configuration syntax valid
✓ Dependencies available
✗ Resource check failed
  - Missing file: /config/app.yaml
  - Disk space insufficient: 100MB required, 50MB available
✗ Permission check failed
  - Cannot write to: /var/log/app/
✓ No lock conflicts

✗ Validation failed: 2 errors found

Fix these issues before running the stack.

Validate All Stacks

sloth-runner stack validate all

Example Output:

$ sloth-runner stack validate all

Validating all stacks...

Stack: production-stack
  Status: ✓ PASS

Stack: staging-stack
  Status: ✓ PASS

Stack: dev-stack
  Status: ✗ FAIL
  Errors:
    - Missing dependency: database-stack
    - Invalid configuration: timeout must be > 0

Summary:
  Total:    3 stacks
  Passed:   2 stacks
  Failed:   1 stack

Overall: FAILED


Stack Commands

Core Operations

List Stacks

sloth-runner stack list [options]

Options:
  • --status <status> - Filter by status (created, running, completed, failed)
  • --format <format> - Output format (table, json, yaml)

Example Output:

$ sloth-runner stack list

Workflow Stacks

NAME                STATUS      LAST RUN             DURATION    EXECUTIONS
----                ------      --------             --------    ----------
production-stack    completed   2025-10-10 14:30:15  71ms        10
staging-stack       running     2025-10-10 14:35:00  0s          5
dev-stack           created     2025-10-10 14:20:00  0s          0
database-stack      completed   2025-10-10 13:45:22  125ms       8
cache-stack         failed      2025-10-10 14:00:00  15ms        3

Total: 5 stacks

Show Stack Details

sloth-runner stack show <stack-name>

Example Output:

$ sloth-runner stack show production-stack

Stack: production-stack

General Information:
  Status:           completed
  Version:          v2.0.0
  Created:          2025-10-09 10:00:00
  Last Updated:     2025-10-10 14:30:15
  Total Executions: 10

Latest Execution:
  Started:          2025-10-10 14:30:00
  Completed:        2025-10-10 14:30:15
  Duration:         71ms
  Status:           success

Resources (15):
  • web-server (running)
  • database (running)
  • cache (running)
  • load-balancer (running)
  ... 11 more

Dependencies (3):
  • database-stack
  • cache-stack
  • network-stack

Lock Status:
  Status:           unlocked
  Last Locked:      2025-10-10 14:30:00
  Locked By:        deploy-bot
  Lock Duration:    15s

Snapshots:
  Total Snapshots:  38
  Latest Snapshot:  v38 (2025-10-10 14:30:00)

Get Stack Outputs

sloth-runner stack output <stack-name> [key]

Example Output:

$ sloth-runner stack output production-stack

Outputs for: production-stack

deployment_url    = https://app.production.example.com
database_host     = db.production.internal
api_key           = <sensitive>
load_balancer_ip  = 203.0.113.42
version           = v2.0.0

Get specific output:

$ sloth-runner stack output production-stack deployment_url

https://app.production.example.com


Database Schema

Tables

erDiagram
    STACKS ||--o{ STATE_LOCKS : has
    STACKS ||--o{ STATE_VERSIONS : has
    STACKS ||--o{ STATE_EVENTS : generates
    STACKS ||--o{ RESOURCES : contains
    RESOURCES }o--o{ RESOURCES : depends_on

    STACKS {
        int id PK
        string name UK
        string description
        string status
        string version
        datetime created_at
        datetime updated_at
        datetime last_execution
        int execution_count
    }

    STATE_LOCKS {
        int stack_id FK
        string locked_by
        datetime locked_at
        string operation
        string reason
        json metadata
    }

    STATE_VERSIONS {
        int id PK
        int stack_id FK
        string version
        string creator
        string description
        blob state_data
        datetime created_at
    }

    STATE_EVENTS {
        int id PK
        int stack_id FK
        string event_type
        string severity
        string message
        string source
        datetime created_at
    }

    RESOURCES {
        int id PK
        int stack_id FK
        string name
        string type
        string state
        json dependencies
    }

Database Location

Default location: /etc/sloth-runner/stacks.db

Features:
  • Auto-creation on first use
  • Foreign keys enforced
  • Optimized indexes
  • ACID compliance
  • Automatic backups


Event System

Event Types

graph TB
    subgraph StackEvents["Stack Events"]
        StackCreated[stack.created]
        StackUpdated[stack.updated]
        StackDestroyed[stack.destroyed]
        ExecStarted[stack.execution.started]
        ExecCompleted[stack.execution.completed]
        ExecFailed[stack.execution.failed]
    end

    subgraph LockEvents["Lock Events"]
        LockAcquired[lock.acquired]
        LockReleased[lock.released]
        LockForced[lock.force_released]
    end

    subgraph SnapshotEvents["Snapshot Events"]
        SnapCreated[snapshot.created]
        SnapRestored[snapshot.restored]
        SnapDeleted[snapshot.deleted]
    end

    subgraph DriftEvents["Drift Events"]
        DriftDetected[drift.detected]
        DriftFixed[drift.fixed]
    end

    StackEvents --> EventProcessor[Event Processor]
    LockEvents --> EventProcessor
    SnapshotEvents --> EventProcessor
    DriftEvents --> EventProcessor

    EventProcessor --> Hooks[Execute Hooks]
    EventProcessor --> Storage[(Event Store)]
    EventProcessor --> Metrics[Update Metrics]

Event Processing

  • Workers: 100 concurrent workers
  • Buffer: 1000 event capacity
  • Persistence: All events stored in database
  • Hooks: Automatic hook execution on events
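
The worker and buffer figures above describe a bounded queue drained by a pool of workers. The Python sketch below shows that general pattern; it is not sloth-runner's event processor. `EventProcessor` here is a hypothetical class, and blocking the producer when the buffer is full is an assumption (a real system might drop or spill events instead).

```python
import queue
import threading
from typing import Callable, List


class EventProcessor:
    """Bounded event buffer drained by a fixed pool of worker threads."""

    def __init__(self, handler: Callable[[dict], None],
                 workers: int = 100, buffer_size: int = 1000) -> None:
        self._queue: "queue.Queue" = queue.Queue(maxsize=buffer_size)
        self._handler = handler
        self._threads: List[threading.Thread] = [
            threading.Thread(target=self._work, daemon=True) for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def emit(self, event: dict) -> None:
        # Blocks while the buffer is full, applying back-pressure to producers.
        self._queue.put(event)

    def _work(self) -> None:
        while True:
            event = self._queue.get()
            try:
                if event is None:  # sentinel: stop this worker
                    return
                self._handler(event)
            finally:
                self._queue.task_done()

    def close(self) -> None:
        self._queue.join()          # wait until every queued event is handled
        for _ in self._threads:
            self._queue.put(None)   # one sentinel per worker
        for thread in self._threads:
            thread.join()
```

The handler runs on several threads at once, so any shared state it touches (for example a list of stored events) needs its own lock.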

Performance Metrics

Measured Performance

Operation             Average Duration  Notes
--------------------  ----------------  -------------------------
Workflow Execution    71ms              5 tasks
Lock Acquire/Release  < 50ms            Including persistence
Snapshot Creation     < 100ms           Typical stack
Stack Commands        < 50ms            List, show, etc.
Database Queries      < 10ms            Indexed lookups
Drift Detection       200-500ms         Depends on resource count
Validation            100-300ms         Comprehensive checks

System Health

✅ No memory leaks
✅ No database corruption
✅ No hanging processes
✅ Clean execution
✅ Proper cleanup


Workflow Integration

DSL Configuration

-- Define task with automatic state management
local deploy = task("deploy_app")
    :description("Deploy application with state management")
    :command(function()
        -- State is automatically managed
        state.set("deployment_version", "v2.0.0")
        state.set("deployed_at", os.time())
        state.set("deployed_by", os.getenv("USER"))

        -- Deploy logic
        exec.run("kubectl apply -f deployment.yaml")

        -- Store outputs
        state.set("deployment_url", "https://app.example.com")

        return true, "Deployment successful"
    end)
    :build()

-- Define workflow with stack configuration
workflow.define("production_deploy")
    :description("Production deployment with full state management")
    :version("2.0.0")
    :tasks({deploy})
    :config({
        timeout = "30m",
        require_lock = true,      -- Automatic locking
        create_snapshot = true,   -- Auto-snapshot before execution
        validate_before = true,   -- Pre-flight validation
        detect_drift = true,      -- Post-execution drift check
        on_failure = "rollback"   -- Auto-rollback on failure
    })

State API

-- Set state value
state.set(key, value)

-- Get state value
local value = state.get(key)

-- Delete state value
state.delete(key)

-- Get all state
local all_state = state.get_all()

-- Check if key exists
if state.has(key) then
    -- key exists
end

Configuration

Config File

Location: /etc/sloth-runner/config.yaml

stacks:
  # Database configuration
  database_path: /etc/sloth-runner/stacks.db

  # Automatic features
  auto_lock: true                    # Auto-lock during execution
  auto_snapshot: true                # Auto-snapshot before changes
  auto_drift_detect: false           # Auto-detect drift after execution

  # Timeouts and limits
  lock_timeout: 1h                   # Max lock duration
  snapshot_retention: 30d            # How long to keep snapshots
  max_concurrent_executions: 10      # Max parallel stack executions

  # Snapshot triggers
  snapshot_triggers:
    - before_deployment
    - before_destroy
    - on_drift_fix
    - manual

  # Event system
  events:
    workers: 100                     # Event processor workers
    buffer_size: 1000                # Event buffer capacity
    batch_size: 50                   # Events per batch

  # Validation
  validation:
    strict_mode: true                # Fail on warnings
    check_disk_space: true
    min_disk_space: 100MB
    check_permissions: true

Best Practices

1. Lock Management

✅ Always release locks - Use defer or finally blocks
✅ Use meaningful reasons - Help the team understand why
✅ Set appropriate timeouts - Don't lock forever
✅ Monitor lock duration - Alert on long-running locks

❌ Don't force-release casually - Only for emergencies
❌ Don't forget to check status - Verify before acquiring

2. Snapshot Strategy

✅ Snapshot before major changes - Always have a rollback point
✅ Write clear descriptions - Know what each snapshot is
✅ Tag snapshots - Categorize for easy finding
✅ Regular cleanup - Remove old snapshots (automated)

❌ Don't rely on auto-snapshots only - Take manual snapshots for important changes
❌ Don't skip comparison - Compare before restoring

3. Drift Management

✅ Regular drift checks - Schedule automated checks
✅ Fix drift promptly - Don't let it accumulate
✅ Investigate root causes - Fix the source, not just symptoms
✅ Document exceptions - Some drift may be acceptable

❌ Don't auto-fix blindly - Review drift reports first
❌ Don't ignore warnings - Small drift grows into big problems

4. Dependency Management

✅ Document dependencies - Keep dependency graph updated
✅ Version dependencies - Pin to specific versions
✅ Check for cycles regularly - Prevent circular dependencies
✅ Plan execution order - Use the deps order command

❌ Don't create tight coupling - Minimize dependencies
❌ Don't skip dependency validation - Always validate first

5. Validation

✅ Validate before execution - Catch errors early
✅ Enable strict mode in production - No warnings allowed
✅ Include in CI/CD - Validate on every commit
✅ Fix validation errors immediately - Don't bypass

❌ Don't skip validation - Even for "quick changes"
❌ Don't ignore warnings - Treat warnings as errors


Use Cases

CI/CD Pipelines

# Stop at the first failing step
set -e

# Acquire lock
sloth-runner stack lock acquire production \
    --reason "CI/CD Pipeline #$BUILD_NUMBER" \
    --locked-by "ci-bot"

# Release the lock when the script exits, whether the steps below succeed or fail
trap 'sloth-runner stack lock release production --unlocked-by "ci-bot"' EXIT

# Validate before deployment
sloth-runner stack validate production

# Create pre-deployment snapshot
sloth-runner stack snapshot create production \
    --description "Before deployment #$BUILD_NUMBER" \
    --tags "ci,deployment"

# Run deployment
sloth-runner run deploy --file deploy.sloth \
    --stack production \
    --validate

# Check for drift
sloth-runner stack drift detect production

Multi-Environment Management

# Get execution order
sloth-runner stack deps order \
    dev-network dev-db dev-app \
    staging-network staging-db staging-app \
    prod-network prod-db prod-app

# Execute in order with validation; stop at the first failure
set -e
for env in dev staging prod; do
    sloth-runner stack validate "${env}-network"
    sloth-runner run deploy --file network.sloth --stack "${env}-network"

    sloth-runner stack validate "${env}-db"
    sloth-runner run deploy --file db.sloth --stack "${env}-db"

    sloth-runner stack validate "${env}-app"
    sloth-runner run deploy --file app.sloth --stack "${env}-app"
done

Emergency Rollback

# Find last good snapshot
sloth-runner stack snapshot list production | grep "working"

# Restore to last known good state
sloth-runner stack snapshot restore production v35

# Verify restoration
sloth-runner stack show production

# Create incident snapshot
sloth-runner stack snapshot create production \
    --description "Post-incident restoration" \
    --tags "incident,rollback"

Troubleshooting

Stuck Lock

Problem: Lock won't release normally

Solution:

# Check lock status
sloth-runner stack lock status my-stack

# If process is dead, force release
sloth-runner stack lock force-release my-stack \
    --reason "Process terminated, lock stuck"

# Verify release
sloth-runner stack lock status my-stack

Snapshot Restore Failed

Problem: Restore operation fails

Solution:

# Check snapshot integrity
sloth-runner stack snapshot show my-stack v38

# Try dry-run first
sloth-runner stack snapshot restore my-stack v38 --dry-run

# Check disk space
df -h /etc/sloth-runner

# Try restore again with verbose output
sloth-runner stack snapshot restore my-stack v38 --verbose

Drift Auto-Fix Errors

Problem: Auto-fix fails to correct drift

Solution:

# Get detailed drift report
sloth-runner stack drift show my-stack

# Try dry-run to see fix plan
sloth-runner stack drift fix my-stack --dry-run

# Fix one resource at a time
sloth-runner stack drift fix my-stack \
    --resource web-server

# Manual intervention if needed
# (follow suggested fixes from drift report)

Circular Dependency

Problem: Circular dependency detected

Solution:

# Show dependency graph
sloth-runner stack deps graph my-stack

# Check for cycles
sloth-runner stack deps check my-stack

# Resolution options:
# 1. Remove unnecessary dependency
# 2. Introduce intermediary stack
# 3. Refactor to break cycle


Testing Status

Test Results

Automated Tests: 34 tests (97% pass rate)

Manual Tests: 65 tests total
  • Stack/Sysadmin: 26 tests (100%)
  • CLI Complete: 39 tests (97.4%)

Overall: 98% success rate (97/99 tests passed)

Validated Features

✅ Lock acquire/release cycle
✅ Lock persistence across restarts
✅ Snapshot creation and restoration
✅ Version management (37+ versions)
✅ Drift detection and fixing
✅ Dependency tracking
✅ Validation system
✅ Event system integration
✅ Database schema and migrations
✅ CLI commands and output


Migration Guide

From Terraform

Terraform users will find familiar concepts:

Terraform                    Sloth Runner                       Notes
---------------------------  ---------------------------------  -------------------------
terraform.tfstate            Stack state in SQLite              More structured
State locking (S3/DynamoDB)  Built-in locking                   No external dependencies
terraform plan               stack validate + drift detect      Pre-flight checks
terraform apply              Workflow execution with auto-lock  Automatic safety
Workspace                    Stack                              Similar isolation concept
Backend                      SQLite database                    Simpler, local-first

From Pulumi

Pulumi users will appreciate:

Pulumi           Sloth Runner         Notes
---------------  -------------------  --------------------
State snapshots  Stack snapshots      Same concept
Stack outputs    Stack outputs        Compatible API
Pulumi.yaml      Workflow definition  DSL-based
Policy packs     Validation system    Pre-execution checks
Secrets          Sensitive values     Encrypted storage

Roadmap

Planned Features

  • Remote state backend (S3, GCS, Azure Blob)
  • State encryption at rest
  • Distributed locking (Redis/etcd)
  • Web UI for state visualization
  • Terraform import compatibility
  • GitOps integration (auto-sync from Git)
  • Advanced drift remediation
  • Multi-region replication

Support

  • Documentation: https://docs.sloth-runner.io
  • GitHub Issues: https://github.com/chalkan3-sloth/sloth-runner/issues
  • Source Code: cmd/sloth-runner/commands/stack/
  • Test Results: /tmp/SISTEMA_100_FUNCIONAL.md


Last Updated: 2025-10-10
Version: 1.0.0
Status: Production Ready