Telemetry & Observability
Overview
Sloth Runner provides comprehensive telemetry and observability features through native Prometheus integration and a rich terminal-based Grafana-style dashboard. Monitor your agent fleet, track task execution metrics, analyze performance, and gain deep insights into your infrastructure automation.
Enterprise-Grade Observability
Built-in Prometheus metrics server with auto-discovery, real-time dashboards, and zero-configuration setup.
Key Features
Prometheus Integration
- Native Metrics Exporter: Built-in HTTP server exposing Prometheus-compatible metrics
- Auto-Discovery: Metrics endpoint automatically configured on agent startup
- Standard Format: Compatible with Prometheus, Grafana, and other tools that read the Prometheus text exposition format
- Zero Configuration: Telemetry enabled by default with sensible defaults
Terminal Dashboard
- Rich Visualization: Beautiful terminal-based dashboard with tables, charts, and progress bars
- Real-time Updates: Watch mode with configurable refresh intervals
- Comprehensive Metrics: System resources, task performance, gRPC stats, and error tracking
- Color-Coded Insights: Visual indicators for performance and health status
Metrics Categories
Task Metrics
- Total tasks executed (by status: success, failed, skipped)
- Currently running tasks
- Task duration histograms (P50, P99 latencies)
- Per-task and per-group performance tracking
System Metrics
- Agent uptime
- Memory allocation
- Goroutine count
- Agent version and build information
gRPC Metrics
- Request counts per method
- Request duration histograms
- Success/error rates
Error Tracking
- Error counts by type
- Failed task tracking
- System error monitoring
Quick Start
Enable Telemetry on Agent
Telemetry is enabled by default. Start your agent:
To explicitly configure telemetry:
# Enable telemetry with custom port
./sloth-runner agent start \
  --name my-agent \
  --master localhost:50053 \
  --telemetry \
  --metrics-port 9090
To disable telemetry:
Access Metrics
Get Prometheus Endpoint
Output:
Metrics Endpoint:
  URL: http://192.168.1.100:9090/metrics

Usage:
  # View metrics in browser:
  open http://192.168.1.100:9090/metrics

  # Fetch metrics via curl:
  curl http://192.168.1.100:9090/metrics

  # Configure Prometheus scraper:
  - job_name: 'sloth-runner-agents'
    static_configs:
      - targets: ['192.168.1.100:9090']
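The `curl` call above returns the Prometheus text exposition format, which is easy to consume from a script. The following is a minimal Python sketch of fetching and parsing that output. The function names `parse_metrics` and `fetch_metrics` and the `demo_*` metric names in the sample are illustrative assumptions, not part of Sloth Runner; the parser skips comment lines, ignores optional timestamps, and does not unescape label values.

```python
import re
import urllib.request

# Matches label pairs such as status="success" inside the {...} section.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            label_part, value_part = rest.rsplit("}", 1)
            labels = dict(LABEL_RE.findall(label_part))
        else:
            name, value_part = line.split(None, 1)
            labels = {}
        # The first field after the name/labels is the value; a trailing
        # timestamp, if present, is ignored.
        value = float(value_part.split()[0])
        samples.append((name.strip(), labels, value))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Fetch a /metrics URL and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample; real metric names come from the agent's /metrics output.
SAMPLE = """\
# TYPE demo_tasks_total counter
demo_tasks_total{status="success"} 12
demo_tasks_total{status="failed"} 3
demo_goroutines 42
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Against a live agent, `fetch_metrics("http://192.168.1.100:9090/metrics")` would return the same tuple structure for whatever metrics the agent exposes.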
View Snapshot
View Dashboard
Single View
Watch Mode (Auto-Refresh)
# Refresh every 5 seconds (default)
./sloth-runner agent metrics grafana my-agent --watch
# Custom refresh interval (2 seconds)
./sloth-runner agent metrics grafana my-agent --watch --interval 2
Architecture
graph LR
A[Sloth Runner Agent] --> B[Telemetry Server :9090]
B --> C["/metrics endpoint"]
B --> D["/health endpoint"]
B --> E["/info endpoint"]
C --> F[Prometheus Scraper]
C --> G[CLI: agent metrics prom]
C --> H[CLI: agent metrics grafana]
F --> I[Prometheus Server]
I --> J[Grafana Dashboards]
style B fill:#4CAF50
style C fill:#2196F3
style H fill:#FF9800
Components
- Telemetry Server (internal/telemetry/server.go)
  - HTTP server running on a configurable port (default: 9090)
  - Serves Prometheus metrics in text format
  - Provides health check and service info endpoints
- Metrics Collector (internal/telemetry/metrics.go)
  - Defines all Prometheus metrics (counters, gauges, histograms)
  - Thread-safe global singleton
  - Automatic runtime metrics collection
- Visualizer (internal/telemetry/visualizer.go)
  - Fetches and parses Prometheus metrics
  - Rich terminal dashboard rendering
  - Historical trends support
- CLI Commands
  - agent metrics prom: Get endpoint URL or raw metrics
  - agent metrics grafana: Display rich dashboard
Use Cases
Development
Monitor your tasks during development:
# Terminal 1: Watch dashboard
./sloth-runner agent metrics grafana dev-agent --watch --interval 1
# Terminal 2: Execute tasks
./sloth-runner run -f deploy.sloth --values dev.yaml
Production Monitoring
Integrate with Prometheus and Grafana:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'sloth-runner-production'
    static_configs:
      - targets:
          - 'agent1:9090'
          - 'agent2:9090'
          - 'agent3:9090'
        labels:
          environment: production
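When the agent list changes often, maintaining `static_configs` by hand can become tedious. One alternative (not described in this guide, so treat it as an assumption about your setup) is Prometheus's file-based service discovery, where a `file_sd_configs` entry points at a JSON file of target groups. The sketch below generates such a file's contents in Python; the helper name `build_file_sd` is hypothetical, and the agent hostnames mirror the example above.

```python
import json


def build_file_sd(agents, port=9090, labels=None):
    """Return a list of Prometheus file_sd target groups for the given hosts."""
    group = {"targets": [f"{host}:{port}" for host in agents]}
    if labels:
        group["labels"] = dict(labels)
    return [group]


# Hostnames taken from the static example above; adjust to your fleet.
targets = build_file_sd(
    ["agent1", "agent2", "agent3"],
    labels={"environment": "production"},
)

# Write this output to a file referenced from file_sd_configs in prometheus.yml.
print(json.dumps(targets, indent=2))
```

Prometheus re-reads file_sd files when they change, so regenerating the file is enough to add or remove agents without editing `prometheus.yml` itself.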
Performance Analysis
Identify slow tasks and bottlenecks:
# View detailed performance metrics
./sloth-runner agent metrics grafana prod-agent
# Check P99 latencies in Task Performance section
# Tasks with the 🔴 Slow indicator need optimization
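For readers wondering how a P99 figure can be derived from a Prometheus histogram, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation within the matching bucket, similar in spirit to PromQL's `histogram_quantile`. It is an illustration only: the bucket boundaries and counts are invented, and this guide does not state how the dashboard itself computes its percentiles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound being math.inf (the "+Inf" bucket).
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Cannot interpolate inside the +Inf bucket; report the
                # highest finite boundary instead.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # only reached if counts are not cumulative


# Hypothetical task-duration buckets in seconds: (le, cumulative count).
demo_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

p50 = estimate_quantile(0.50, demo_buckets)
p99 = estimate_quantile(0.99, demo_buckets)
print(f"P50 ~ {p50:.3f}s, P99 ~ {p99:.3f}s")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the real latencies.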
Debugging
Track errors and failures:
# View error counts
./sloth-runner agent metrics grafana my-agent
# Check Errors section for error types
# Cross-reference with Task Metrics for failed tasks
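Error counters are cumulative, so a single reading says little about when errors occurred; comparing two readings is more informative. The sketch below computes per-type increases between two snapshots in Python. The error type names are hypothetical, and the reset handling assumes that counters restart from zero when an agent restarts, since metrics are held in memory.

```python
def counter_increase(before, after):
    """Return the per-key increase between two counter snapshots.

    before/after: dicts mapping an error type to its counter value.
    If a counter went down, it is treated as a reset (for example after an
    agent restart) and the new value is reported as the increase.
    """
    increases = {}
    for key, new in after.items():
        old = before.get(key, 0.0)
        increases[key] = new - old if new >= old else new
    return increases


# Hypothetical snapshots taken a few minutes apart.
earlier = {"timeout": 3, "connection": 1}
later = {"timeout": 7, "connection": 1, "permission": 2}

delta = counter_increase(earlier, later)
print(delta)
```

Keys with a non-zero increase point at the error types worth cross-referencing with failed tasks.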
Next Steps
- Prometheus Metrics Reference - Complete metrics documentation
- Grafana Dashboard Guide - Dashboard features and usage
- Deployment Guide - Production deployment and integration
Supported Platforms
- Linux (amd64, arm64)
- macOS (Intel, Apple Silicon)
- Windows (via WSL2)
- Containers (Docker, Incus/LXC)
- Kubernetes (via DaemonSet)
Performance Impact
Telemetry has minimal performance overhead:
- Memory: ~10-20MB additional
- CPU: <1% under normal load
- Network: Metrics served only on-demand (pull model)
- Storage: Metrics stored in-memory, no persistence
Security Considerations
Network Exposure
The metrics endpoint is exposed on all network interfaces by default. In production:
- Use firewall rules to restrict access
- Consider binding to localhost only and using a reverse proxy
- Enable authentication at the reverse proxy; Prometheus scrape configs can supply basic-auth or bearer-token credentials
Best Practice
Run agents in private networks and expose metrics only to monitoring infrastructure.
Troubleshooting
Telemetry Not Starting
Check agent logs for errors:
Verify port availability:
Try a different port:
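As a tool-independent way to check the port, the following Python sketch tests whether anything accepts TCP connections on a given host and port. The helper name `port_is_open` is an assumption for illustration. If it returns `True` for port 9090 while the agent is stopped, another process already holds the port; if it returns `False` while the agent is running, the telemetry server is not listening there or is unreachable from where you run the check.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if port_is_open("127.0.0.1", 9090):
    print("Something is listening on 127.0.0.1:9090")
else:
    print("Nothing is accepting connections on 127.0.0.1:9090")
```

The same function can be pointed at an agent's address from the Prometheus host to distinguish a firewall problem from an agent that is not serving metrics.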
Cannot Access Metrics
Test from agent host:
Test from a remote host:
Check firewall:
# Allow port 9090
sudo ufw allow 9090/tcp
# Or use firewalld
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --reload
Dashboard Shows No Data
Verify agent is running with telemetry:
Check metrics endpoint directly:
Ensure tasks have been executed (initial metrics are zero):
Further Reading
- Prometheus Documentation
- Grafana Documentation
- pterm Library (used for terminal visualization)