Fleet Manager
The Fleet Manager is a pull-based fleet monitoring system that runs on the N100 server (user-controlled hardware) and is reached via Cloudflare tunnel. It provides reliable sprite fleet oversight independent of sprites.dev availability.
Why Fleet Manager?
Sprites are beta infrastructure. We’ve encountered issues where:
- A sprite vanished without warning
- An agent became unresponsive and couldn’t be diagnosed remotely
- Using sprites to manage sprites creates a single point of failure
Solution: Run fleet oversight from N100 hardware we control. This provides:
| Benefit | Description |
|---|---|
| Independent monitoring | Doesn’t rely on sprites.dev being up |
| User-controlled hardware | We own the N100 |
| Reliable cron checks | Every 15 minutes, automatically |
| Fleet-wide operations | Execute commands across all sprites at once |
Naming Convention
Sprites use semantic naming: {team}-{role}-{id}
| Semantic Name | Legacy API Name | Team |
|---|---|---|
| agents-review-claude-01 | hammer | agents |
| agents-review-codex-01 | anvil | agents |
| agents-synth-01 | forge | agents |
| dev-workspace-01 | mallet | dev |
| infra-bootstrap-01 | test-sprite | infra |
| mobile-conductor-01 | sprite-mobile-conductor | mobile |
| mobile-worker-01 | sprite-mobile-worker-1 | mobile |
| mobile-worker-02 | sprite-mobile-worker-2 | mobile |
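
As a rough illustration of the `{team}-{role}-{id}` scheme, the sketch below splits a semantic name into its parts and looks up the legacy API names for a team. The mapping is copied from the table above. `SpriteName`, `parse_semantic_name`, and `sprites_for_team` are hypothetical helpers, not part of `fleet.py`, and treating everything between the first and last hyphen as the role is an assumption.

```python
from dataclasses import dataclass

# Semantic name -> legacy API name, copied from the table above.
LEGACY_API_NAMES = {
    "agents-review-claude-01": "hammer",
    "agents-review-codex-01": "anvil",
    "agents-synth-01": "forge",
    "dev-workspace-01": "mallet",
    "infra-bootstrap-01": "test-sprite",
    "mobile-conductor-01": "sprite-mobile-conductor",
    "mobile-worker-01": "sprite-mobile-worker-1",
    "mobile-worker-02": "sprite-mobile-worker-2",
}


@dataclass(frozen=True)
class SpriteName:
    team: str
    role: str
    sprite_id: str


def parse_semantic_name(name: str) -> SpriteName:
    """Split {team}-{role}-{id}; the role may itself contain hyphens."""
    parts = name.split("-")
    if len(parts) < 3:
        raise ValueError(f"expected {{team}}-{{role}}-{{id}}, got {name!r}")
    return SpriteName(team=parts[0], role="-".join(parts[1:-1]), sprite_id=parts[-1])


def sprites_for_team(team: str) -> list[str]:
    """Return the legacy API names of every sprite on the given team."""
    return [
        api_name
        for semantic, api_name in LEGACY_API_NAMES.items()
        if parse_semantic_name(semantic).team == team
    ]
```

For example, `parse_semantic_name("agents-review-claude-01")` yields team `agents`, role `review-claude`, and id `01`.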
Commands
Check Fleet Status

```console
$ fleet status
Fleet Status (8 sprites)
==========================================================================================
Sprite (API Name)                               Status    Resp  ms    Claude
------------------------------------------------------------------------------------------
agents-review-claude-01 (hammer)                warm      yes   169   yes
agents-review-codex-01 (anvil)                  warm      yes   223   yes
agents-synth-01 (forge)                         warm      yes   150   yes
dev-workspace-01 (mallet)                       warm      yes   184   yes
infra-bootstrap-01 (test-sprite)                warm      yes   160   yes
mobile-conductor-01 (sprite-mobile-conductor)   warm      yes   123   yes
mobile-worker-01 (sprite-mobile-worker-1)       warm      yes   115   yes
mobile-worker-02 (sprite-mobile-worker-2)       warm      yes   190   yes
------------------------------------------------------------------------------------------
Total: 8 | Healthy: 8 | Unhealthy: 0
```

Filter by team:

```console
$ fleet status --team agents
Fleet Status (3 sprites (team: agents))
...
```

Columns explained:
- Sprite: Semantic name
- (API Name): Legacy name used by sprites.dev API
- Status: API-reported state (warm, cold, running, not_found, error)
- Resp: Whether a command executed successfully inside the sprite
- ms: Latency to the Sprites API
- Claude: Whether Claude CLI is available
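
To make the relationship between the columns concrete, here is a minimal sketch of how one status row could be reduced to a health verdict. It is not the actual logic in `fleet.py`: `SpriteStatus`, `FAILED_STATES`, and `unhealthy_reason` are hypothetical names, and the rule itself (only `not_found` and `error` count as API failures, so `cold` is tolerated) is an assumption. The reason strings mirror the `api:not_found` and `unresponsive` labels printed by `fleet health`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only these API states are treated as failures. The source lists
# the possible states but does not say how the real tool classifies `cold`.
FAILED_STATES = {"not_found", "error"}


@dataclass
class SpriteStatus:
    api_name: str
    status: str        # API-reported state (Status column)
    responsive: bool   # did a command run inside the sprite? (Resp column)
    latency_ms: int    # latency to the Sprites API (ms column)
    has_claude: bool   # is the Claude CLI available? (Claude column)


def unhealthy_reason(row: SpriteStatus) -> Optional[str]:
    """Return None for a healthy row, otherwise a short reason string."""
    if row.status in FAILED_STATES:
        return f"api:{row.status}"
    if not row.responsive:
        return "unresponsive"
    return None
```

Checking the API state before the probe result means a sprite that does not exist is reported as `api:not_found` rather than as merely unresponsive.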
Execute Fleet-Wide Commands
```console
# Check hostnames
$ fleet exec "hostname"
Executing on 8 sprites: hostname...
============================================================
[hammer] OK hammer
[anvil] OK anvil
...
Complete: 8/8 succeeded

# Check disk usage
$ fleet exec "df -h /"

# Verify secrets loaded
$ fleet exec "wc -l ~/.env.secrets"
```

Update Packages Fleet-Wide
```console
# Update Infisical CLI on all sprites
$ fleet update infisical
Updating 'infisical' on all sprites...
Complete: 8/8 succeeded

# Update Cloudflared
$ fleet update cloudflared
```

Health Check (Cron)
```console
# All healthy
$ fleet health
OK: All 8 sprites healthy
$ echo $?
0

# Issues detected
$ fleet health
UNHEALTHY sprites:
  sprite-mobile-conductor: api:not_found
  mallet: unresponsive
$ echo $?
1
```

Generate Inventory Report
```console
$ fleet report
{
  "timestamp": "2026-02-02T06:20:36.098000+00:00",
  "total": 8,
  "healthy": 8,
  "unhealthy": 0,
  "sprites": [...],
  "errors": []
}
```
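
The report's top-level fields are enough for simple scripted checks. The following sketch parses a report and prints a one-line summary. `report_is_clean` and `summarize` are hypothetical helpers, and the sample uses an empty `sprites` list only because the per-sprite entries are elided in the output above.

```python
import json

# Sample shaped like the output above; "sprites" is left empty here because
# the source elides the per-sprite entries.
SAMPLE = """
{
  "timestamp": "2026-02-02T06:20:36.098000+00:00",
  "total": 8,
  "healthy": 8,
  "unhealthy": 0,
  "sprites": [],
  "errors": []
}
"""


def report_is_clean(report: dict) -> bool:
    """True when no sprite is unhealthy and no errors were recorded."""
    return report["unhealthy"] == 0 and not report["errors"]


def summarize(raw: str) -> str:
    """Turn a JSON report string into a short human-readable summary."""
    report = json.loads(raw)
    state = "clean" if report_is_clean(report) else "needs attention"
    return f'{report["healthy"]}/{report["total"]} healthy ({state})'


if __name__ == "__main__":
    print(summarize(SAMPLE))
```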
```console
# Save to file
$ fleet report -o /var/log/fleet-report.json
```

Architecture
```text
N100 Server (/opt/fleet-manager/)
│
├── fleet.py (Python CLI)
│   ├── httpx.AsyncClient → Sprites API (async status)
│   └── sprites-py → sprite.command() (parallel exec)
│
├── config.yaml (sprite list, timeouts, webhooks)
│
└── Cron: */15 * * * * fleet health --quiet
    └── Logs to /var/log/fleet-healthcheck.log
```

Design principles:
- Pull-based: Query sprites on demand, don’t rely on push
- Parallel execution: asyncio + ThreadPoolExecutor
- Aggressive timeouts: 10s API, 15s probe, 60s commands
- Graceful degradation: Works without sprites-py (status-only mode)
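
The combination of asyncio, a thread pool, and per-call timeouts can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the code in `fleet.py`: `run_on_all` and `fake_exec` are hypothetical, and `fake_exec` stands in for whatever blocking per-sprite call the real tool makes.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

COMMAND_TIMEOUT_S = 60  # "60s commands" from the design principles


async def run_on_all(names, run_one, timeout_s=COMMAND_TIMEOUT_S, max_workers=8):
    """Run the blocking callable run_one(name) for every sprite in parallel.

    Returns (name, ok, detail) tuples in the same order as `names`.
    """
    loop = asyncio.get_running_loop()
    pool = ThreadPoolExecutor(max_workers=max_workers)

    async def guarded(name):
        try:
            future = loop.run_in_executor(pool, run_one, name)
            return name, True, await asyncio.wait_for(future, timeout_s)
        except asyncio.TimeoutError:
            return name, False, "timeout"
        except Exception as exc:  # one failing sprite must not sink the batch
            return name, False, str(exc)

    try:
        return await asyncio.gather(*(guarded(n) for n in names))
    finally:
        # Do not block on threads that overran their timeout.
        pool.shutdown(wait=False)


def fake_exec(name):
    """Hypothetical stand-in for a blocking per-sprite call."""
    if name == "slow":
        time.sleep(0.5)
    if name == "bad":
        raise RuntimeError("boom")
    return f"{name} ok"


if __name__ == "__main__":
    rows = asyncio.run(run_on_all(["hammer", "slow", "bad"], fake_exec, timeout_s=0.1))
    for row in rows:
        print(row)
```

One caveat of this pattern: `asyncio.wait_for` stops waiting for a slow call but cannot interrupt the worker thread, so a timed-out call keeps running in the background until it returns.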
Installation on N100
SSH to N100 and run the installer:

```bash
ssh n100
cd /path/to/agent-harness/scripts/fleet
sudo ./install.sh
```

The installer:
- Creates `/opt/fleet-manager/` with Python venv
- Installs dependencies (httpx, pyyaml, sprites-py)
- Creates `/usr/local/bin/fleet` wrapper
- Prompts for `SPRITES_API_TOKEN`
- Sets up cron job (every 15 minutes)
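
For orientation, a cron-invoked health check boils down to running `fleet health`, reading its exit code (0 for healthy, 1 for issues), and appending a timestamped line to the log. The sketch below shows that flow. It is not the installed `fleet-healthcheck` script: `run_health_check` is a hypothetical function, and the assumption that the command's stdout carries the failure detail is unverified.

```python
import subprocess
from datetime import datetime


def run_health_check(cmd, log_path, now=None):
    """Run the health command and append one timestamped line to the log.

    Returns the command's exit code so cron or a caller can act on it.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    if proc.returncode == 0:
        line = f"[{stamp}] OK - All sprites healthy"
    else:
        # Assumption: the failure detail is whatever the command printed.
        detail = proc.stdout.strip() or "health check failed"
        line = f"[{stamp}] ALERT - {detail}"
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return proc.returncode
```

A call such as `run_health_check(["fleet", "health", "--quiet"], "/var/log/fleet-healthcheck.log")` would match the cron line and log path used elsewhere in this document.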
Configuration
Edit `/opt/fleet-manager/config.yaml`:

```yaml
# Sprites to manage
sprites:
  - hammer
  - anvil
  - forge
  - mallet
  - test-sprite
  - sprite-mobile-conductor
  - sprite-mobile-worker-1
  - sprite-mobile-worker-2

# Timeouts in seconds
timeouts:
  api_status: 10
  health_probe: 15
  command_exec: 60

# Optional Discord/Slack alerting
alerts:
  webhook_url: ""
```

Alerting
Configure webhook alerts for unhealthy sprites:

```yaml
alerts:
  webhook_url: "https://discord.com/api/webhooks/..."
```

When issues are detected, an alert is sent with details about which sprites are unhealthy.
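
As an illustration of what such an alert might look like on the wire, the sketch below builds a webhook payload and posts it. Discord incoming webhooks read the message from a `content` field and Slack incoming webhooks from `text`. Everything else is an assumption: the message wording, the `build_alert_payload` and `send_alert` helpers, and the use of the standard library's `urllib` in place of the `httpx` client that Fleet Manager itself depends on.

```python
import json
import urllib.request


def build_alert_payload(webhook_url, unhealthy):
    """Build the JSON body for an alert; `unhealthy` maps sprite name -> reason."""
    details = ", ".join(f"{name} ({reason})" for name, reason in sorted(unhealthy.items()))
    message = f"UNHEALTHY: {details}"
    # Discord expects "content"; Slack expects "text".
    key = "content" if "discord.com" in webhook_url else "text"
    return {key: message}


def send_alert(webhook_url, unhealthy, timeout_s=10):
    """POST the alert; returns False when alerting is disabled (empty URL)."""
    if not webhook_url:
        return False
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_alert_payload(webhook_url, unhealthy)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Hypothetical identifier; some endpoints reject the default agent.
            "User-Agent": "fleet-manager-sketch/0.1",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return 200 <= response.status < 300
```

Leaving `webhook_url` empty, as in the default configuration, makes `send_alert` a no-op, so alerting stays optional.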
Health check logs are stored at:
```text
/var/log/fleet-healthcheck.log
```

Example log entries:

```text
[2026-02-02 06:15:00] OK - All sprites healthy
[2026-02-02 06:30:00] ALERT - UNHEALTHY: mallet (unresponsive)
[2026-02-02 06:45:00] OK - All sprites healthy
```

Integration with Sprite Inventory
Fleet Manager complements the push-based sprite inventory system:
| System | Direction | Purpose |
|---|---|---|
| Sprite Inventory | Push (sprite → repo) | Detailed per-sprite state, repo staleness |
| Fleet Manager | Pull (N100 → sprites) | Fleet-wide health, command execution |
Use both for comprehensive monitoring:
- Inventory for detailed diagnostics and historical tracking
- Fleet Manager for real-time health and fleet-wide operations
Troubleshooting
“sprites-py not available”

Fleet works in status-only mode without sprites-py. To enable exec:

```bash
source /opt/fleet-manager/venv/bin/activate
pip install sprites-py
```

Sprite shows “unresponsive” but API says “warm”
The sprite is running but commands aren’t executing. Debug:

```bash
# Try with longer timeout
fleet exec -t 120 "hostname"

# Connect directly
sprite console -s <sprite-name>
```

Cron not running
```bash
# Check cron
crontab -l

# Test manually
/opt/fleet-manager/fleet-healthcheck
```

Source: agent-harness/scripts/fleet/, agent-harness/docs/fleet-manager.md