Sloth Runner Architecture¶
Complete Technical Architecture Documentation
Table of Contents¶
- Overview
- High-Level Architecture
- Core Components
- System Architecture Diagrams
- Component Details
- Data Flow
- Distributed Execution
- State Management
- Security Architecture
- Deployment Architectures
Overview¶
Sloth Runner is a distributed task automation and orchestration platform built in Go, featuring:
- Lua-based DSL for workflow definition
- Distributed agent architecture for multi-machine execution
- Pluggable module system for extensibility
- State management with distributed locking
- Event-driven hooks for lifecycle management
- Built-in scheduler for cron-like task execution
- Web UI for visualization and monitoring
Key Characteristics¶
- Language: Go (backend), Lua (DSL), TypeScript/React (Web UI)
- Architecture Style: Microservices, Master-Agent, Event-Driven
- Communication: gRPC (agents), HTTP (API), SSH (legacy)
- State Storage: SQLite (local), Bolt (embedded), optional PostgreSQL
- Configuration: YAML, TOML, Environment Variables
High-Level Architecture¶
graph TB
subgraph UserInterface["User Interface Layer"]
CLI[CLI Client]
WebUI[Web UI]
API[REST API]
end
subgraph ControlPlane["Control Plane - Master Node"]
Master[Master Server]
Registry[Agent Registry]
Scheduler[Task Scheduler]
StateDB[(State Database)]
StackDB[(Stack Database)]
end
subgraph ExecutionEngine["Execution Engine"]
Runner[Workflow Runner]
LuaVM[Lua VM]
Modules[Lua Modules]
Hooks[Hook System]
Executor[Task Executor]
end
subgraph DataPlane["Data Plane - Agents"]
Agent1[Agent Node 1]
Agent2[Agent Node 2]
AgentN[Agent Node N]
end
subgraph ExternalSystems["External Systems"]
Git[Git Repos]
Cloud[Cloud APIs]
SSH[SSH Targets]
K8s[Kubernetes]
end
CLI --> Master
WebUI --> API
API --> Master
Master --> Registry
Master --> Scheduler
Master <--> StateDB
Master <--> StackDB
Master --> Runner
Runner --> LuaVM
LuaVM --> Modules
Runner --> Hooks
Runner --> Executor
Master -.gRPC.-> Agent1
Master -.gRPC.-> Agent2
Master -.gRPC.-> AgentN
Modules --> Git
Modules --> Cloud
Modules --> SSH
Modules --> K8s
Agent1 --> Runner
Agent2 --> Runner
AgentN --> Runner

Core Components¶
1. CLI (Command Line Interface)¶
Entry point for user interactions. Built using Cobra framework.
graph LR
CLI[sloth-runner CLI]
CLI --> Run[run]
CLI --> Agent[agent]
CLI --> Stack[stack]
CLI --> Workflow[workflow]
CLI --> Scheduler[scheduler]
CLI --> State[state]
CLI --> Secrets[secrets]
CLI --> Hook[hook]
CLI --> Events[events]
CLI --> DB[db]
CLI --> Sysadmin[sysadmin]
Agent --> AgentList[list]
Agent --> AgentStart[start]
Agent --> AgentInstall[install]
Agent --> AgentMetrics[metrics]
Stack --> StackList[list]
Stack --> StackShow[show]
Stack --> StackDelete[delete]

Location: cmd/sloth-runner/main.go, cmd/sloth-runner/commands/
Key Commands:

- run - Execute workflows
- agent - Manage distributed agents
- stack - Manage deployment stacks
- scheduler - Schedule recurring tasks
- state - Distributed state operations
- workflow - Workflow management
- sysadmin - System administration tools
2. Master Server¶
Central coordinator for distributed execution.
Responsibilities:

- Agent registration and health monitoring
- Task distribution and scheduling
- State coordination
- Metrics collection
- Event aggregation
Location: cmd/sloth-runner/agent_registry.go
Components:

- Agent Registry: Maintains active agent connections
- Task Dispatcher: Distributes tasks to appropriate agents
- Health Monitor: Tracks agent health and availability
- Metrics Aggregator: Collects performance metrics
3. Workflow Runner¶
Executes workflow definitions with dependency resolution.
graph TD
WorkflowDef[Workflow Definition Lua File] --> Parser[DSL Parser]
Parser --> DAG[DAG Builder]
DAG --> Scheduler[Task Scheduler]
Scheduler --> Executor[Task Executor]
Executor --> Hooks[Pre/Post Hooks]
Executor --> StateCheck{Check Dependencies}
StateCheck -->|Ready| Execute[Execute Task]
StateCheck -->|Wait| Queue[Task Queue]
Execute --> Results[Collect Results]
Results --> Artifacts[Save Artifacts]
Results --> NextTasks[Trigger Next Tasks]

Location: internal/runner/, internal/execution/
Key Features:

- Dependency Resolution: Builds execution DAG from task dependencies
- Parallel Execution: Runs independent tasks concurrently
- Retry Logic: Configurable retry with exponential backoff
- Timeout Management: Per-task and workflow-level timeouts
- Artifact Management: File sharing between tasks
4. Lua VM Integration¶
Embeds Lua for DSL execution and module system.
graph LR
subgraph LuaVM["Lua VM"]
DSL[DSL Code] --> LuaState[Lua State]
LuaState --> BuiltinFuncs[Built-in Functions]
LuaState --> Modules[Lua Modules]
end
subgraph GoBridge["Go Bridge"]
GoAPI[Go API]
GoAPI --> LuaState
end
subgraph ModuleSystem["Module System"]
Modules --> Core[Core Modules]
Modules --> External[External Modules]
Core --> Facts[facts]
Core --> FileOps[file_ops]
Core --> Exec[exec]
Core --> Log[log]
Core --> State[state]
External --> Git[git]
External --> Docker[docker]
External --> K8s[kubernetes]
External --> Cloud[cloud providers]
end

Location: internal/lua/, internal/luamodules/, internal/modules/
Capabilities:

- DSL Parsing: Converts Lua code to workflow structures
- Module Loading: Dynamic module registration
- Go-Lua Bridge: Bidirectional function calls
- Sandboxing: Restricted execution environment
5. Agent System¶
Distributed execution nodes for remote task execution.
sequenceDiagram
participant Master
participant Agent
participant TaskExecutor
participant Target
Agent->>Master: Register (gRPC)
Master->>Agent: Registration Confirmed
loop Heartbeat
Agent->>Master: Send Heartbeat
Master->>Agent: ACK
end
Master->>Agent: Delegate Task (gRPC)
Agent->>TaskExecutor: Execute Task
TaskExecutor->>Target: Perform Operations
Target-->>TaskExecutor: Results
TaskExecutor-->>Agent: Task Complete
Agent-->>Master: Task Results (gRPC)
Master->>Agent: Request Metrics
Agent-->>Master: Metrics Data

Location: internal/agent/, cmd/sloth-runner/commands/agent/
Features:

- Auto-Discovery: Agents register with master on startup
- Health Monitoring: Continuous heartbeat mechanism
- Task Delegation: Executes tasks on behalf of master
- Resource Reporting: CPU, memory, disk usage
- Update Mechanism: Self-update capability
6. State Management¶
Distributed state with locking for coordination.
Location: internal/state/, cmd/sloth-runner/commands/state/
Operations:

- Get/Set: Key-value storage
- Compare-and-Swap: Atomic updates
- Locking: Distributed lock acquisition
- TTL Support: Automatic expiration
- Namespaces: Isolated state spaces

Storage Backends:

- SQLite: Default embedded database
- BoltDB: High-performance key-value store
- PostgreSQL: Optional for high availability
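The Get/Set/Compare-and-Swap operations can be illustrated with a minimal in-memory sketch; the `Store` type and its method names are hypothetical, and the real backends persist to SQLite or BoltDB:

```go
package main

import (
	"fmt"
	"sync"
)

// Store is an in-memory stand-in for the state API; illustrative only.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

func NewStore() *Store { return &Store{data: map[string]string{}} }

func (s *Store) Set(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = val
}

func (s *Store) Get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

// CompareAndSwap replaces old with newVal atomically and reports whether
// the swap happened; stale writers lose without clobbering newer state.
func (s *Store) CompareAndSwap(key, old, newVal string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key] != old {
		return false
	}
	s.data[key] = newVal
	return true
}

func main() {
	s := NewStore()
	s.Set("deploy/version", "v1")
	fmt.Println(s.CompareAndSwap("deploy/version", "v1", "v2")) // prints: true
	fmt.Println(s.CompareAndSwap("deploy/version", "v1", "v3")) // prints: false (value is now v2)
}
```

Distributed locks can be built on the same primitive: a lock acquire is a compare-and-swap from "unheld" to the holder's identity.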
7. Hook System¶
Event-driven lifecycle management.
graph LR
subgraph HookTypes["Hook Types"]
PreTask[pre_task]
PostTask[post_task]
OnSuccess[on_success]
OnFailure[on_failure]
OnTimeout[on_timeout]
WorkflowStart[workflow_start]
WorkflowComplete[workflow_complete]
end
subgraph HookExecution["Hook Execution"]
Dispatcher[Event Dispatcher]
Executor[Hook Executor]
end
PreTask --> Dispatcher
PostTask --> Dispatcher
OnSuccess --> Dispatcher
OnFailure --> Dispatcher
OnTimeout --> Dispatcher
WorkflowStart --> Dispatcher
WorkflowComplete --> Dispatcher
Dispatcher --> Executor
Executor --> Actions[Execute Actions]

Location: internal/hooks/
Capabilities:

- Lifecycle Hooks: Pre/post execution hooks
- Conditional Execution: Run hooks based on conditions
- Async Execution: Non-blocking hook execution
- Error Handling: Graceful failure handling
8. Module System¶
Pluggable modules for extensibility.
Built-in Modules:

- facts - System discovery
- file_ops - File operations
- exec - Command execution
- git - Git operations
- docker - Docker management
- pkg - Package management
- systemd - Service management
- infra_test - Infrastructure testing
- state - State operations
- metrics - Metrics collection
- log - Logging
- net - HTTP/networking
- ai - AI integration
- gitops - GitOps workflows
Module API:
```lua
-- Module registration
local mymodule = {}

function mymodule.operation(args)
    -- Go function called via bridge
    return go_bridge.call("mymodule.operation", args)
end

return mymodule
```
System Architecture Diagrams¶
Deployment Architecture¶
graph TB
subgraph Workstation["User Workstation"]
DevCLI[Developer CLI]
end
subgraph MasterNode["Master Node - Primary"]
Master[Master Server :50053]
MasterDB[(State DB Stack DB)]
MasterUI[Web UI :8080]
end
subgraph AgentCluster["Agent Cluster"]
A1[Agent 1 build-01]
A2[Agent 2 build-02]
A3[Agent 3 deploy-01]
end
subgraph TargetInfra["Target Infrastructure"]
K8s[Kubernetes Cluster]
Servers[Virtual Machines]
Cloud[Cloud Resources]
end
DevCLI -->|gRPC/HTTP| Master
DevCLI -->|HTTP| MasterUI
Master <--> MasterDB
Master -.gRPC.-> A1
Master -.gRPC.-> A2
Master -.gRPC.-> A3
A1 --> K8s
A2 --> Servers
A3 --> Cloud

Task Execution Flow¶
sequenceDiagram
participant User
participant CLI
participant Master
participant Runner
participant LuaVM
participant Agent
participant Target
User->>CLI: sloth-runner run workflow.sloth
CLI->>Master: Load & Parse Workflow
Master->>Runner: Initialize Workflow
Runner->>LuaVM: Execute DSL
LuaVM-->>Runner: Parsed Tasks & DAG
Runner->>Runner: Build Execution Plan
loop For Each Task
Runner->>Master: Check if Delegated
alt Local Execution
Runner->>LuaVM: Execute Task
LuaVM->>Target: Perform Operations
Target-->>LuaVM: Results
LuaVM-->>Runner: Task Complete
else Remote Execution
Master->>Agent: Delegate Task
Agent->>LuaVM: Execute Task
LuaVM->>Target: Perform Operations
Target-->>LuaVM: Results
LuaVM-->>Agent: Task Complete
Agent-->>Master: Task Results
Master-->>Runner: Results Received
end
Runner->>Runner: Update Task Status
Runner->>Runner: Trigger Dependent Tasks
end
Runner-->>CLI: Workflow Complete
CLI-->>User: Display Results

State Management Architecture¶
graph TB
subgraph ApplicationLayer["Application Layer"]
App[Application Code]
end
subgraph StateAPI["State API"]
API[State API]
Lock[Lock Manager]
Cache[In-Memory Cache]
end
subgraph StorageLayer["Storage Layer"]
SQLite[(SQLite DB)]
Bolt[(BoltDB)]
end
subgraph DistributionLayer["Distribution Layer"]
Master[Master Node]
Agent1[Agent 1]
Agent2[Agent 2]
end
App --> API
API --> Lock
API --> Cache
Cache -.Sync.-> SQLite
Cache -.Sync.-> Bolt
Lock --> SQLite
Master <--> API
Agent1 <--> API
Agent2 <--> API

Component Details¶
CLI Command Structure¶
```
sloth-runner
├── run          Execute workflows
├── agent        Manage agents
│   ├── start      Start agent daemon
│   ├── list       List registered agents
│   ├── install    Install agent on remote
│   ├── update     Update agent version
│   ├── metrics    View agent metrics
│   └── modules    Check agent modules
├── workflow     Workflow operations
│   ├── list       List workflows
│   ├── show       Show workflow details
│   └── validate   Validate workflow syntax
├── stack        Stack management
│   ├── list       List stacks
│   ├── show       Show stack details
│   ├── delete     Delete stack
│   └── export     Export stack data
├── scheduler    Task scheduling
│   ├── add        Add scheduled task
│   ├── list       List scheduled tasks
│   ├── delete     Remove scheduled task
│   └── run        Execute scheduled tasks
├── state        State operations
│   ├── get        Get state value
│   ├── set        Set state value
│   ├── delete     Delete state key
│   ├── list       List state keys
│   └── lock       Acquire distributed lock
├── secrets      Secrets management
│   ├── set        Store secret
│   ├── get        Retrieve secret
│   ├── list       List secrets
│   └── delete     Delete secret
├── hook         Hook management
│   ├── list       List registered hooks
│   ├── add        Add hook
│   └── delete     Remove hook
├── events       Event operations
│   ├── list       List events
│   └── clear      Clear event log
├── sysadmin     System administration
│   ├── health     Health checks
│   ├── logs       Log management
│   ├── backup     Backup operations
│   ├── packages   Package management
│   └── services   Service management
├── master       Master server operations
│   └── start      Start master server
├── ui           Web UI
│   └── start      Start web interface
└── version      Show version info
```
Internal Package Structure¶
```
internal/
├── agent/           Agent implementation
│   ├── client.go      Agent gRPC client
│   ├── server.go      Agent gRPC server
│   ├── registry.go    Agent registration
│   └── health.go      Health monitoring
├── client/          Master client library
├── config/          Configuration management
├── core/            Core functionality
│   ├── workflow.go    Workflow structures
│   ├── task.go        Task structures
│   └── types.go       Common types
├── execution/       Execution engine
│   ├── executor.go    Task executor
│   ├── dag.go         DAG builder
│   └── parallel.go    Parallel execution
├── hooks/           Hook system
│   ├── dispatcher.go  Event dispatcher
│   ├── executor.go    Hook executor
│   └── types.go       Hook types
├── lua/             Lua VM integration
│   ├── state.go       Lua state management
│   ├── bridge.go      Go-Lua bridge
│   └── sandbox.go     Sandboxing
├── luamodules/      Lua module implementations
│   ├── facts/         System facts module
│   ├── fileops/       File operations module
│   ├── exec/          Execution module
│   └── ...            Other modules
├── masterdb/        Master database
│   ├── agent_db.go    Agent persistence
│   └── stack_db.go    Stack persistence
├── metrics/         Metrics collection
│   ├── collector.go   Metrics collector
│   └── exporter.go    Prometheus exporter
├── modules/         Module system
│   ├── loader.go      Module loader
│   └── registry.go    Module registry
├── runner/          Workflow runner
│   ├── runner.go      Main runner
│   ├── context.go     Execution context
│   └── results.go     Result collection
├── ssh/             SSH connectivity
│   ├── client.go      SSH client
│   └── tunnel.go      SSH tunneling
├── state/           State management
│   ├── state.go       State operations
│   ├── lock.go        Distributed locking
│   └── storage.go     Storage backends
├── taskrunner/      Task execution
│   ├── task.go        Task runner
│   └── parallel.go    Parallel tasks
├── telemetry/       Telemetry system
│   ├── metrics.go     Metrics
│   └── tracing.go     Distributed tracing
└── webui/           Web interface
    ├── server.go      HTTP server
    └── handlers/      HTTP handlers
```
Data Flow¶
Workflow Execution Data Flow¶
flowchart TD
Start[User: sloth-runner run] --> Load[Load Workflow File]
Load --> Parse[Parse Lua DSL]
Parse --> Validate[Validate Workflow]
Validate --> BuildDAG[Build Task DAG]
BuildDAG --> InitState[Initialize State]
InitState --> CheckTasks{More Tasks?}
CheckTasks -->|No| Complete[Workflow Complete]
CheckTasks -->|Yes| SelectTask[Select Ready Task]
SelectTask --> CheckDelegate{Delegated?}
CheckDelegate -->|Local| ExecLocal[Execute Locally]
CheckDelegate -->|Remote| FindAgent[Find Agent]
FindAgent --> DelegateTask[Delegate to Agent]
DelegateTask --> WaitResult[Wait for Result]
WaitResult --> CollectResult
ExecLocal --> PreHooks[Execute Pre-Hooks]
PreHooks --> RunCommand[Run Task Command]
RunCommand --> PostHooks[Execute Post-Hooks]
PostHooks --> CollectResult[Collect Results]
CollectResult --> SaveArtifacts[Save Artifacts]
SaveArtifacts --> UpdateState[Update State]
UpdateState --> TriggerNext[Trigger Dependent Tasks]
TriggerNext --> CheckTasks
Complete --> SaveStack[Save to Stack]
SaveStack --> ExportResults[Export Results]
ExportResults --> End[Return to User]

Agent Communication Flow¶
sequenceDiagram
participant Agent
participant Master
participant Database
participant TaskQueue
Note over Agent,Master: Agent Registration
Agent->>Master: gRPC: RegisterAgent(info)
Master->>Database: Store Agent Info
Database-->>Master: Agent ID
Master-->>Agent: Registration Success
Note over Agent,Master: Heartbeat Loop
loop Every 30s
Agent->>Master: gRPC: Heartbeat(agent_id, metrics)
Master->>Database: Update Last Seen
Master-->>Agent: ACK + Config Updates
end
Note over Agent,Master: Task Delegation
Master->>TaskQueue: Enqueue Task
Master->>Master: Select Agent
Master->>Agent: gRPC: ExecuteTask(task_def)
Agent->>Agent: Execute Task
Agent->>Master: gRPC: TaskProgress(status)
Agent->>Master: gRPC: TaskComplete(result)
Master->>Database: Store Result
Note over Agent,Master: Metrics Collection
Master->>Agent: gRPC: GetMetrics()
Agent-->>Master: Metrics Data
Master->>Database: Store Metrics

Distributed Execution¶
Agent Modes¶
1. Standalone Agent
    - Runs independently
    - No master required
    - Local workflow execution
2. Managed Agent
    - Registers with master
    - Receives delegated tasks
    - Reports status and metrics
3. Hybrid Mode
    - Can execute both local and delegated tasks
    - Automatic failover
    - Load balancing
Task Delegation Strategy¶
graph TD
Task[Task Definition] --> CheckDelegate{Has :delegate_to?}
CheckDelegate -->|No| LocalExec[Execute Locally]
CheckDelegate -->|Yes| CheckAgent{Agent Specified?}
CheckAgent -->|Specific Agent| FindSpecific[Find Agent by Name]
CheckAgent -->|Tag-based| FindByTags[Find Agents by Tags]
CheckAgent -->|Any| FindAvailable[Find Available Agent]
FindSpecific --> ValidateAgent{Agent Available?}
FindByTags --> SelectBest[Select Best Agent]
FindAvailable --> SelectBest
SelectBest --> ValidateAgent
ValidateAgent -->|Yes| SendTask[Delegate Task]
ValidateAgent -->|No| Fallback{Fallback to Local?}
Fallback -->|Yes| LocalExec
Fallback -->|No| Error[Task Failed]
SendTask --> Monitor[Monitor Execution]
Monitor --> Results[Collect Results]
LocalExec --> Results

Load Balancing¶
Strategies:

1. Round Robin: Distribute tasks evenly
2. Least Loaded: Send to agent with lowest load
3. Tag-based: Route by agent capabilities
4. Geographic: Route by location
5. Custom: User-defined logic
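A sketch of how the least-loaded and tag-based strategies might combine; the `Agent` struct and `pickLeastLoaded` are illustrative names, not the actual scheduler API:

```go
package main

import "fmt"

// Agent is a load snapshot, as might be reported via heartbeat metrics.
type Agent struct {
	Name string
	Load float64 // e.g. normalized CPU load
	Tags map[string]bool
}

// pickLeastLoaded returns the lowest-load agent that carries all required
// tags, and false when no agent qualifies (triggering local fallback).
func pickLeastLoaded(agents []Agent, tags ...string) (string, bool) {
	best, found := "", false
	bestLoad := 0.0
	for _, a := range agents {
		qualified := true
		for _, t := range tags {
			if !a.Tags[t] {
				qualified = false
				break
			}
		}
		if !qualified {
			continue
		}
		if !found || a.Load < bestLoad {
			best, bestLoad, found = a.Name, a.Load, true
		}
	}
	return best, found
}

func main() {
	agents := []Agent{
		{"build-01", 0.8, map[string]bool{"build": true}},
		{"build-02", 0.2, map[string]bool{"build": true}},
		{"deploy-01", 0.1, map[string]bool{"deploy": true}},
	}
	name, _ := pickLeastLoaded(agents, "build")
	fmt.Println(name) // prints: build-02
}
```

Round-robin and geographic routing differ only in the selection criterion; the tag filter stays the same.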
State Management¶
State Storage Model¶
erDiagram
STATE {
string key PK
string namespace
bytes value
timestamp created_at
timestamp updated_at
timestamp expires_at
string owner
}
LOCK {
string lock_id PK
string resource
string holder
timestamp acquired_at
timestamp expires_at
}
WORKFLOW_STATE {
string workflow_id PK
string status
json task_states
json variables
timestamp started_at
timestamp completed_at
}
STATE ||--o{ LOCK : "protects"
WORKFLOW_STATE ||--o{ STATE : "uses"

Lock Mechanism¶
sequenceDiagram
participant Task1
participant LockManager
participant Database
participant Task2
Task1->>LockManager: Acquire Lock("resource_x")
LockManager->>Database: Check Lock Status
Database-->>LockManager: Not Locked
LockManager->>Database: Create Lock Record
LockManager-->>Task1: Lock Acquired
Task2->>LockManager: Acquire Lock("resource_x")
LockManager->>Database: Check Lock Status
Database-->>LockManager: Locked by Task1
LockManager-->>Task2: Lock Denied
Task1->>Task1: Execute Critical Section
Task1->>LockManager: Release Lock("resource_x")
LockManager->>Database: Delete Lock Record
LockManager-->>Task1: Lock Released
Task2->>LockManager: Acquire Lock("resource_x")
LockManager->>Database: Check Lock Status
Database-->>LockManager: Not Locked
LockManager->>Database: Create Lock Record
LockManager-->>Task2: Lock Acquired

Stack State Management System¶
Overview¶
The Stack State Management System is a Terraform/Pulumi-inspired subsystem that provides enterprise-grade state management for workflows. It enables version control, drift detection, dependency tracking, and distributed locking for deployment stacks.
Stack State Architecture¶
graph TB
subgraph ClientLayer["Client Layer"]
CLI[CLI Client]
API[REST API]
end
subgraph StackStateSystem["Stack State System"]
subgraph CoreServices["Core Services"]
LockSvc[Locking Service]
SnapshotSvc[Snapshot Service]
DriftSvc[Drift Detection]
end
subgraph AdvancedServices["Advanced Services"]
DepSvc[Dependency Tracker]
ValidSvc[Validation Service]
EventSvc[Event Processor]
end
end
subgraph Storage["Storage Layer"]
StackDB[(Stack Database SQLite)]
EventStore[(Event Store)]
end
CLI --> LockSvc
CLI --> SnapshotSvc
CLI --> DriftSvc
API --> LockSvc
LockSvc --> StackDB
SnapshotSvc --> StackDB
DriftSvc --> StackDB
DepSvc --> StackDB
ValidSvc --> StackDB
LockSvc --> EventSvc
SnapshotSvc --> EventSvc
DriftSvc --> EventSvc
EventSvc --> EventStore

Key Components¶
| Component | Purpose | Features |
|---|---|---|
| Locking Service | Prevent concurrent executions | Metadata tracking, force release, timeout management |
| Snapshot Service | Version control and rollback | Auto-versioning (v1, v2...), point-in-time recovery |
| Drift Detection | State validation | Compare actual vs desired, auto-fix capability |
| Dependency Tracker | Manage stack relationships | Circular dependency detection, execution ordering |
| Validation Service | Pre-flight checks | Resource verification, configuration validation |
| Event Processor | Audit trail | 100 workers, 1000 event buffer |
Database Schema¶
erDiagram
STACKS ||--o{ STATE_LOCKS : has
STACKS ||--o{ STATE_VERSIONS : has
STACKS ||--o{ STATE_EVENTS : generates
STACKS ||--o{ RESOURCES : contains
RESOURCES }o--o{ RESOURCES : depends_on
STACKS {
int id PK
string name UK
string description
string status
string version
datetime created_at
datetime updated_at
datetime last_execution
int execution_count
}
STATE_LOCKS {
int stack_id FK
string locked_by
datetime locked_at
string operation
string reason
json metadata
}
STATE_VERSIONS {
int id PK
int stack_id FK
string version
string creator
string description
blob state_data
datetime created_at
}
STATE_EVENTS {
int id PK
int stack_id FK
string event_type
string severity
string message
string source
datetime created_at
}
RESOURCES {
int id PK
int stack_id FK
string name
string type
string state
json dependencies
}

Core Features¶
1. State Locking¶
Prevents concurrent modifications to the same stack:
```bash
# Acquire lock for deployment
sloth-runner stack lock acquire production-stack \
  --reason "Deploying v2.0.0" \
  --locked-by "deploy-bot"

# Check lock status
sloth-runner stack lock status production-stack

# Release lock
sloth-runner stack lock release production-stack
```
Lock Lifecycle:
stateDiagram-v2
[*] --> Unlocked
Unlocked --> Acquiring: lock acquire
Acquiring --> Locked: Success
Acquiring --> Unlocked: Failure
Locked --> Releasing: lock release
Releasing --> Unlocked: Success
Locked --> ForceReleasing: force-release
ForceReleasing --> Unlocked: Success
Locked --> Locked: Status Check
Unlocked --> Unlocked: Status Check

2. Snapshots & Versioning¶
Point-in-time backups with automatic versioning:
```bash
# Create snapshot
sloth-runner stack snapshot create production-stack \
  --description "Before v2.0 upgrade" \
  --creator "admin"

# List versions
sloth-runner stack snapshot list production-stack

# Restore to previous version
sloth-runner stack snapshot restore production-stack v35

# Compare versions
sloth-runner stack snapshot compare production-stack v35 v38
```
Testing Results: 37+ versions successfully created and managed
3. Drift Detection¶
Identifies differences between desired and actual state:
```bash
# Detect drift
sloth-runner stack drift detect production-stack

# Show detailed report
sloth-runner stack drift show production-stack

# Auto-fix drift
sloth-runner stack drift fix production-stack --auto-approve
```
Drift Types:

- Configuration drift (port changes, replica counts)
- Resource drift (missing/extra resources)
- State drift (service status)
- Dependency drift (missing dependencies)
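At its core, each drift type reduces to comparing desired against actual attribute maps. This sketch, with a hypothetical `diffState` helper, shows the idea rather than the real detector (which walks the resources table in the stack database):

```go
package main

import "fmt"

// diffState compares desired vs actual resource attributes and returns
// human-readable drift entries: changed values, missing keys, extra keys.
func diffState(desired, actual map[string]string) []string {
	var drift []string
	for k, want := range desired {
		got, ok := actual[k]
		if !ok {
			drift = append(drift, fmt.Sprintf("missing: %s (want %q)", k, want))
		} else if got != want {
			drift = append(drift, fmt.Sprintf("changed: %s (want %q, got %q)", k, want, got))
		}
	}
	for k := range actual {
		if _, ok := desired[k]; !ok {
			drift = append(drift, "extra: "+k)
		}
	}
	return drift
}

func main() {
	desired := map[string]string{"web.port": "8080", "web.replicas": "3"}
	actual := map[string]string{"web.port": "9090", "web.replicas": "3", "debug.sidecar": "on"}
	for _, d := range diffState(desired, actual) {
		fmt.Println(d)
	}
}
```

Auto-fix then becomes applying the inverse of each entry: set changed keys back, create missing resources, remove extras.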
4. Dependency Management¶
Tracks and validates stack dependencies:
```bash
# Show dependencies
sloth-runner stack deps show backend-stack

# Generate dependency graph
sloth-runner stack deps graph backend-stack --output deps.png

# Check for circular dependencies
sloth-runner stack deps check backend-stack

# Determine execution order
sloth-runner stack deps order frontend backend database cache
```
Dependency Graph Example:
graph TB
subgraph InfraLayer["Infrastructure Layer"]
Network[network-stack]
Storage[storage-stack]
end
subgraph DataLayer["Data Layer"]
Database[database-stack]
Cache[cache-stack]
end
subgraph AppLayer["Application Layer"]
Backend[backend-stack]
Frontend[frontend-stack]
end
Network --> Database
Network --> Cache
Storage --> Database
Database --> Backend
Cache --> Backend
Backend --> Frontend

5. Validation System¶
Pre-flight checks before execution:
```bash
# Validate single stack
sloth-runner stack validate production-stack

# Validate all stacks
sloth-runner stack validate all
```
Validation Checklist:

- ✅ Configuration syntax
- ✅ Dependencies availability
- ✅ Resource existence
- ✅ Permissions
- ✅ Lock availability
- ✅ Disk space
- ✅ Network connectivity
Event System Integration¶
Stack operations emit events for auditability:
Event Types:

- stack.created, stack.updated, stack.destroyed
- stack.execution.started, stack.execution.completed, stack.execution.failed
- lock.acquired, lock.released, lock.force_released
- snapshot.created, snapshot.restored, snapshot.deleted
- drift.detected, drift.fixed

Event Processing:

- 100 concurrent workers
- 1000 event buffer capacity
- Automatic hook execution
- Complete persistence
Performance Metrics¶
| Operation | Duration | Notes |
|---|---|---|
| Workflow Execution | 71ms | 5 tasks, typical stack |
| Lock Acquire/Release | <50ms | Including persistence |
| Snapshot Creation | <100ms | Typical stack size |
| Drift Detection | 200-500ms | Depends on resource count |
| Validation | 100-300ms | Comprehensive checks |
Workflow Integration¶
Automatic state management in workflows:
workflow.define("production_deploy")
:description("Production deployment with state management")
:version("2.0.0")
:tasks({deploy})
:config({
timeout = "30m",
require_lock = true, -- Automatic locking
create_snapshot = true, -- Auto-snapshot before execution
validate_before = true, -- Pre-flight validation
detect_drift = true, -- Post-execution drift check
on_failure = "rollback" -- Auto-rollback on failure
})
Use Cases¶
- CI/CD Pipelines: Prevent conflicting deployments, automatic rollback
- Multi-Environment Management: Coordinate deployments across dev/staging/prod
- Infrastructure as Code: Terraform-like state management
- Team Collaboration: Lock coordination, audit trail
- Disaster Recovery: Point-in-time restoration
Storage¶
Database Location: /etc/sloth-runner/stacks.db
Features:

- Auto-creation on first use
- Foreign key enforcement
- Optimized indexes
- ACID compliance
- Automatic backups
Tables: 5 core tables (stacks, state_locks, state_versions, state_events, resources)
Testing Status¶
Test Coverage: 98% success rate (97/99 tests passed)

- ✅ Lock operations: 100% functional
- ✅ Snapshot management: 37+ versions tested
- ✅ Drift detection: Validated
- ✅ Dependency tracking: Circular detection working
- ✅ Validation system: All checks passing
- ✅ Event system: Full integration confirmed
Comparison with Other Tools¶
| Feature | Sloth Runner | Terraform | Pulumi |
|---|---|---|---|
| State Backend | SQLite (local-first) | S3/Remote | Cloud service |
| Locking | Built-in | External (DynamoDB) | Service-based |
| Versioning | Automatic snapshots | Manual | Checkpoint |
| Drift Detection | Built-in | terraform plan | pulumi preview |
| Language | Lua DSL | HCL | Multi-language |
| Dependencies | SQLite only | Multiple backends | Cloud required |
Documentation¶
For complete Stack State Management documentation, see:

- Stack State Management Guide
Security Architecture¶
Authentication & Authorization¶
graph TB
subgraph SecurityLayers["Security Layers"]
TLS[TLS/mTLS]
Auth[Authentication]
Authz[Authorization]
Audit[Audit Logging]
end
subgraph AuthMethods["Auth Methods"]
APIKey[API Keys]
JWT[JWT Tokens]
SSH[SSH Keys]
Cert[Client Certificates]
end
subgraph Authorization["Authorization"]
RBAC[Role-Based Access]
Policy[Policy Engine]
Secrets[Secrets Management]
end
TLS --> Auth
Auth --> Authz
Authz --> Audit
APIKey --> Auth
JWT --> Auth
SSH --> Auth
Cert --> Auth
RBAC --> Authz
Policy --> Authz
Secrets --> Authz

Secrets Management¶
Features:

- Encrypted storage
- Per-environment secrets
- Secret rotation
- Audit trail
- Integration with external vaults (HashiCorp Vault, AWS Secrets Manager)
Network Security¶
graph LR
subgraph External["External"]
User[User]
Agent[Remote Agent]
end
subgraph DMZ["DMZ"]
LB[Load Balancer]
Proxy[Reverse Proxy]
end
subgraph InternalNetwork["Internal Network"]
Master[Master Server]
DB[(Database)]
Agents[Internal Agents]
end
User -->|HTTPS/TLS| LB
Agent -->|gRPC/mTLS| LB
LB --> Proxy
Proxy --> Master
Master <--> DB
Master <-.gRPC.-> Agents

Deployment Architectures¶
Single Node Deployment¶
graph TB
subgraph SingleServer["Single Server"]
CLI[CLI]
Master[Master]
Agent[Local Agent]
DB[(SQLite)]
UI[Web UI]
end
CLI --> Master
Master --> Agent
Master --> DB
UI --> Master

Use Case: Development, small teams, single-machine automation
Distributed Deployment¶
graph TB
subgraph ControlPlane["Control Plane"]
Master[Master Server]
MasterDB[(PostgreSQL)]
WebUI[Web UI]
end
subgraph BuildCluster["Build Cluster"]
B1[Build Agent 1]
B2[Build Agent 2]
B3[Build Agent 3]
end
subgraph DeployCluster["Deploy Cluster"]
D1[Deploy Agent 1]
D2[Deploy Agent 2]
end
subgraph TestCluster["Test Cluster"]
T1[Test Agent 1]
T2[Test Agent 2]
end
Master --> MasterDB
WebUI --> Master
Master -.-> B1
Master -.-> B2
Master -.-> B3
Master -.-> D1
Master -.-> D2
Master -.-> T1
Master -.-> T2

Use Case: CI/CD pipelines, enterprise deployments, multi-environment
High Availability Deployment¶
graph TB
subgraph LoadBalancer["Load Balancer"]
LB[HAProxy/Nginx]
end
subgraph MasterCluster["Master Cluster"]
M1[Master 1 Primary]
M2[Master 2 Standby]
M3[Master 3 Standby]
end
subgraph Database["Database"]
PGDB[(PostgreSQL Primary-Replica)]
end
subgraph AgentPool["Agent Pool"]
A1[Agent 1]
A2[Agent 2]
AN[Agent N]
end
LB --> M1
LB -.Failover.-> M2
LB -.Failover.-> M3
M1 --> PGDB
M2 --> PGDB
M3 --> PGDB
M1 -.-> A1
M1 -.-> A2
M1 -.-> AN

Use Case: Mission-critical, 24/7 operations, large scale
Performance Characteristics¶
Scalability¶
| Component | Scalability | Limits |
|---|---|---|
| Master | Vertical | ~10,000 agents per master |
| Agents | Horizontal | Unlimited agents |
| Workflows | Horizontal | Thousands concurrent |
| Tasks per Workflow | Limited | ~1,000 tasks recommended |
| State Operations | High | Millions of operations/sec |
Throughput¶
- Task Execution: 100+ tasks/second (single agent)
- Agent Registration: 1,000+ agents/minute
- State Operations: 10,000+ ops/second
- Workflow Parsing: 50+ workflows/second
Resource Requirements¶
Master Node:

- CPU: 2-4 cores minimum, 8+ recommended
- Memory: 2GB minimum, 8GB recommended
- Storage: 10GB minimum, 100GB+ for production
- Network: 1Gbps

Agent Node:

- CPU: 1-2 cores
- Memory: 512MB minimum, 2GB recommended
- Storage: 5GB minimum
- Network: 100Mbps
Extension Points¶
Custom Modules¶
Create custom Lua modules:
```lua
-- custom_module.lua
local module = {}

function module.my_operation(args)
    -- Your logic here
    return {
        success = true,
        data = "result"
    }
end

return module
```
Register in Go:
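One hedged, stdlib-only sketch of the Go side: the actual bridge lives in internal/lua/ and internal/modules/, and the `Register`/`Call` names below are illustrative assumptions, but a registry keyed by "module.operation" strings captures the shape of the dispatch:

```go
package main

import "fmt"

// ModuleFunc is a Go function exposed to Lua through the bridge.
type ModuleFunc func(args map[string]any) (map[string]any, error)

// registry maps "module.operation" names to Go handlers; illustrative only.
var registry = map[string]ModuleFunc{}

// Register binds a handler under a dotted name so Lua can reach it.
func Register(name string, fn ModuleFunc) { registry[name] = fn }

// Call dispatches a dotted name the way go_bridge.call does from Lua.
func Call(name string, args map[string]any) (map[string]any, error) {
	fn, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown module function: %s", name)
	}
	return fn(args)
}

func main() {
	Register("mymodule.operation", func(args map[string]any) (map[string]any, error) {
		return map[string]any{"success": true, "echo": args["msg"]}, nil
	})
	out, err := Call("mymodule.operation", map[string]any{"msg": "hi"})
	fmt.Println(err == nil, out["success"], out["echo"]) // prints: true true hi
}
```

In the real system the registered handler would additionally marshal between Lua tables and Go maps at the VM boundary.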
Custom Commands¶
Extend CLI with custom commands:
```go
import "github.com/spf13/cobra"

func NewCustomCommand() *cobra.Command {
	cmd := &cobra.Command{
		Use:   "custom",
		Short: "Custom command",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Your logic
			return nil
		},
	}
	return cmd
}
```
Hooks¶
Implement custom hooks:
```lua
workflow.define("my_workflow")
    :on_task_start(function(task_name)
        log.info("Task starting: " .. task_name)
    end)
    :on_task_complete(function(task_name, success)
        if not success then
            -- Send alert
        end
    end)
```
Best Practices¶
Architecture Guidelines¶
- Separation of Concerns: Keep control plane separate from execution
- Stateless Agents: Agents should not store state locally
- Idempotency: Design tasks to be idempotent
- Error Handling: Always handle errors gracefully
- Monitoring: Implement comprehensive monitoring
- Security: Always use TLS for network communication
Performance Optimization¶
- Parallel Execution: Use parallel() for independent tasks
- Task Granularity: Balance task size (not too small, not too large)
- State Caching: Cache frequently accessed state
- Agent Pooling: Pre-provision agent pools
- Database Tuning: Optimize database settings for workload
High Availability¶
- Master Redundancy: Run multiple master nodes
- Database Replication: Use database replication
- Agent Health Checks: Monitor agent health continuously
- Graceful Degradation: Handle partial failures
- Backup Strategy: Regular backups of state and stack databases
Related Documentation¶