Lesson 2: Infrastructure Planning

Learning Objectives

  • Design a production-grade Hermes deployment architecture
  • Choose between VPS, Docker, Modal, and Daytona hosting options
  • Plan for high availability, disaster recovery, and scaling

Priya’s Infrastructure Decision

NovaCraft runs on AWS. Priya needs Hermes to:

  • Run 24/7 with minimal downtime
  • Handle requests from 50 team members across 3 time zones
  • Connect securely to internal APIs
  • Keep all data within the company’s AWS account

2.1 Deployment Options

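| Option               | Best For           | Cost        | Complexity | Data Control |
|----------------------|--------------------|-------------|------------|--------------|
| VPS (bare metal)     | Small teams (<20)  | $20-50/mo   | Low        | Full         |
| Docker (recommended) | Mid teams (20-200) | $50-200/mo  | Medium     | Full         |
| Modal (serverless)   | Burst workloads    | Pay-per-use | Low        | Partial      |
| Daytona (cloud dev)  | Dev/test           | $30-100/mo  | Low        | Full         |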
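
The "Best For" column can be read as a simple decision rule. The sketch below is illustrative only: the function name `suggest_deployment` and the `bursty` and `dev_only` flags are assumptions introduced here, and the thresholds are copied from the table, not taken from any Hermes tooling.

```python
def suggest_deployment(team_size: int, bursty: bool = False, dev_only: bool = False) -> str:
    """Map the table's 'Best For' column to a suggestion (illustrative only)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if dev_only:
        return "Daytona"   # dev/test environments
    if bursty:
        return "Modal"     # burst workloads, pay-per-use, partial data control
    if team_size < 20:
        return "VPS"       # small teams
    if team_size <= 200:
        return "Docker"    # mid-size teams (20-200)
    # The table gives no recommendation above 200 users.
    raise ValueError("team size is outside the range covered by the table")
```

For NovaCraft (50 people, steady load, production use) this returns "Docker". Note that Modal's "Partial" data control would conflict with Priya's requirement to keep all data inside the company's AWS account, whatever the workload shape.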

2.2 Docker Deployment (Priya’s Choice)

Server Sizing

NovaCraft: 50 users, ~200 requests/day

Recommended:
  CPU:    4 vCPU
  RAM:    8 GB
  Disk:   50 GB SSD
  OS:     Ubuntu 24.04 LTS
  AWS:    t3.xlarge (4 vCPU, 16 GiB; ~$120/month on-demand)
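
A quick back-of-envelope calculation shows what "50 users, ~200 requests/day" means as a request rate. The `busy_hours` and `peak_factor` values below are assumptions chosen for illustration, not measurements from NovaCraft or guidance from Hermes.

```python
def load_profile(users: int, requests_per_day: int,
                 busy_hours: float = 8.0, peak_factor: float = 10.0) -> dict:
    """Rough request rates; busy_hours and peak_factor are illustrative assumptions."""
    per_user_per_day = requests_per_day / users
    avg_per_busy_hour = requests_per_day / busy_hours
    # Assume bursts can reach peak_factor times the average hourly rate.
    peak_per_minute = avg_per_busy_hour * peak_factor / 60
    return {
        "per_user_per_day": per_user_per_day,
        "avg_per_busy_hour": avg_per_busy_hour,
        "peak_per_minute": peak_per_minute,
    }


profile = load_profile(users=50, requests_per_day=200)
```

Under these assumptions the load is about 4 requests per user per day, 25 per busy hour, and roughly 4 per minute during a tenfold burst. Raw throughput is therefore unlikely to be the constraint; the recommended size is better understood as headroom for memory use and concurrent long-running requests.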

Docker Compose

# docker-compose.yml
services:
  hermes:
    image: nousresearch/hermes-agent:latest
    container_name: hermes-agent
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./config:/root/.hermes
      - ./data:/root/.hermes/data
    environment:
      - HERMES_API_KEY=${HERMES_API_KEY}
      - TZ=UTC
    healthcheck:
      test: ["CMD", "hermes", "doctor"]
      interval: 60s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"

Initial Setup

# 1. Install Docker
curl -fsSL https://get.docker.com | bash

# 2. Create project directory
mkdir -p /opt/hermes/{config,data}
cd /opt/hermes

# 3. Create docker-compose.yml (as above)

# 4. Set API key (restrict .env to its owner, since it holds a secret)
echo "HERMES_API_KEY=your-key-here" > .env
chmod 600 .env

# 5. Start
docker compose up -d

# 6. Verify
docker compose logs -f hermes
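
Tailing the logs requires someone to watch them. For scripted deployments, one option is to poll the health status that Docker records from the Compose `healthcheck`. This is a minimal sketch: `container_health` and `wait_until_healthy` are hypothetical helper names, and the default timeout and interval are assumptions.

```python
import subprocess
import time
from typing import Callable


def container_health(name: str = "hermes-agent") -> str:
    """Return Docker's recorded health status, or 'unknown' if it cannot be read."""
    try:
        result = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.Health.Status}}", name],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:              # docker CLI is not installed
        return "unknown"
    return result.stdout.strip() or "unknown"


def wait_until_healthy(check: Callable[[], str] = container_health,
                       timeout: float = 300.0, interval: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it reports 'healthy' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check() == "healthy":
            return True
        sleep(interval)
    return False
```

Calling `wait_until_healthy()` after `docker compose up -d` returns True once the container reports healthy and False if it never does within the timeout. The `check` and `sleep` parameters are injectable so the loop can be tested without Docker.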

2.3 Network Architecture

┌────────────────────────────────────────────┐
│                  AWS VPC                    │
│                                             │
│  ┌──────────┐     ┌──────────────────┐     │
│  │ ALB/Nginx│────→│  Hermes Agent     │     │
│  │ (HTTPS)  │     │  (Docker)         │     │
│  └──────────┘     └───────┬──────────┘     │
│                           │                 │
│  ┌─────────┐  ┌──────────▼──────────┐     │
│  │ Slack   │  │  Internal Services   │     │
│  │ Webhook │  │  Jira · GitHub ·     │     │
│  └─────────┘  │  Datadog · etc.      │     │
│               └─────────────────────┘     │
└────────────────────────────────────────────┘

Nginx Reverse Proxy

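# /etc/nginx/sites-available/hermes
# Enable with: ln -s /etc/nginx/sites-available/hermes /etc/nginx/sites-enabled/
# then: nginx -t && systemctl reload nginx
server {
    listen 443 ssl http2;
    server_name hermes.novacraft.internal;

    ssl_certificate     /etc/ssl/hermes.crt;
    ssl_certificate_key /etc/ssl/hermes.key;

    location / {
        # Hermes is published only on loopback, so all traffic must pass through this proxy
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}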

2.4 Configuration

SOUL.md for NovaCraft

# SOUL.md — NovaCraft AI Assistant

## Identity
You are NovaCraft's AI assistant, helping a 50-person B2B SaaS team.
We build project management tools for mid-size companies.

## Company Facts
- Founded: 2022
- Team: 50 people (SF, London, Bangalore)
- Stack: Python/FastAPI backend, React frontend, PostgreSQL, AWS
- Revenue: $5M ARR
- Key product: NovaCraft PM (project management SaaS)

## Communication Style
- Professional but friendly
- Use Slack-appropriate formatting (bullet points, code blocks)
- Keep responses concise for Slack (< 300 words unless asked for detail)
- When uncertain, say so—don't hallucinate internal data

LLM Provider Configuration

hermes config edit

# In the editor, set the llm section:
llm:
  provider: openrouter
  model: anthropic/claude-sonnet
  fallback: nous/hermes-3-70b
  max_tokens: 4096
  temperature: 0.3     # Lower for enterprise (more deterministic)
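
The `fallback` setting implies a try-primary-then-secondary pattern. The sketch below shows that pattern in general terms. It is not Hermes's actual implementation: `complete_with_fallback` and the `call_model` callable are hypothetical names, standing in for whatever client sends a prompt to a given model.

```python
from typing import Callable, Optional, Sequence, Tuple


def complete_with_fallback(prompt: str, models: Sequence[str],
                           call_model: Callable[[str, str], str]) -> Tuple[str, str]:
    """Try each model in order; return (model_used, response) from the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:           # a real client would catch narrower error types
            last_error = exc
    raise RuntimeError("all configured models failed") from last_error
```

With `models=["anthropic/claude-sonnet", "nous/hermes-3-70b"]`, a timeout on the first model results in the request being retried against the second, and the caller learns which model answered.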

2.5 High Availability

Health Monitoring

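# Systemd watchdog: a oneshot service triggered by a timer.
# systemd requires the [Timer] section to live in its own .timer unit.

# /etc/systemd/system/hermes-watchdog.service
[Unit]
Description=Hermes Agent Watchdog
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
# Assumption: SLACK_WEBHOOK=... is defined in this file
EnvironmentFile=/opt/hermes/.env
ExecStart=/opt/hermes/scripts/healthcheck.sh

# /etc/systemd/system/hermes-watchdog.timer
# Enable with: systemctl enable --now hermes-watchdog.timer
[Unit]
Description=Run the Hermes watchdog every 5 minutes

[Timer]
OnCalendar=*:0/5

[Install]
WantedBy=timers.target

#!/bin/bash
# /opt/hermes/scripts/healthcheck.sh
COMPOSE_FILE=/opt/hermes/docker-compose.yml

# Recover if the hermes service has no running container
if [ -z "$(docker compose -f "$COMPOSE_FILE" ps --status running -q hermes)" ]; then
    echo "Hermes is down, restarting..."
    # "up -d" also recreates the container if it was removed
    docker compose -f "$COMPOSE_FILE" up -d
    curl -fsS -X POST -H 'Content-Type: application/json' "$SLACK_WEBHOOK" \
      -d '{"text":"⚠️ Hermes Agent was down and has been restarted."}'
fi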
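
The two monitoring layers detect failures on different timescales. The arithmetic below estimates rough upper bounds from the values already shown: the Compose healthcheck (interval 60s, timeout 10s, retries 3) and the 5-minute watchdog timer. The formula is an approximation introduced here, not a figure from Docker's documentation.

```python
def healthcheck_detection_seconds(interval: int, timeout: int, retries: int) -> int:
    """Rough upper bound for Docker to mark an unresponsive container unhealthy.

    Assumes every failing check runs to its full timeout and the next check
    starts one interval later.
    """
    return retries * (interval + timeout)


def watchdog_detection_seconds(timer_minutes: int) -> int:
    """Worst case for the timer-driven watchdog to notice a stopped container."""
    return timer_minutes * 60


docker_bound = healthcheck_detection_seconds(interval=60, timeout=10, retries=3)
watchdog_bound = watchdog_detection_seconds(timer_minutes=5)
```

This gives roughly 210 seconds for Docker to flag a hung container and up to 300 seconds for the watchdog to notice a stopped one. A crashed process is different: Docker's restart policy reacts as soon as the container exits. Keep in mind that outside Swarm, Docker only reports the unhealthy state and does not restart the container by itself.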

Backup Strategy

#!/bin/bash
# Daily backup of config and data
# /opt/hermes/scripts/backup.sh
set -euo pipefail

# Compute the date once so a run that crosses midnight stays consistent
STAMP="$(date +%Y%m%d)"
BACKUP_DIR="/opt/hermes/backups/$STAMP"
mkdir -p "$BACKUP_DIR"

# Backup config
cp -r /opt/hermes/config "$BACKUP_DIR/config"

# Backup data (memory, skills, etc.)
cp -r /opt/hermes/data "$BACKUP_DIR/data"

# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://novacraft-backups/hermes/$STAMP/"

# Rotate local copies: keep 30 days
# (-mindepth 1 stops find from ever matching the backups directory itself)
find /opt/hermes/backups -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
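
The rotation rule can also be expressed against the directory names instead of file modification times, which makes it easy to test before anything is deleted. This is a sketch under that assumption; `expired_backups` is a hypothetical helper, and it only reports names without removing anything.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_backups(names: Iterable[str], today: date, keep_days: int = 30) -> List[str]:
    """Return date-stamped (YYYYMMDD) directory names older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if len(name) != 8 or not name.isdigit():
            continue                       # not a date-stamped backup directory
        try:
            stamp = datetime.strptime(name, "%Y%m%d").date()
        except ValueError:
            continue                       # eight digits, but not a valid date
        if stamp < cutoff:
            expired.append(name)
    return sorted(expired)
```

Unlike `find -mtime`, this depends only on the name, so a directory whose timestamp was changed by a later copy or restore is still rotated on schedule.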

2.6 Disaster Recovery

Recovery Plan

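| Scenario         | RTO    | RPO | Recovery Steps                 |
|------------------|--------|-----|--------------------------------|
| Container crash  | 1 min  | 0   | Auto-restart (Docker policy)   |
| Server reboot    | 5 min  | 0   | Auto-start (systemd)           |
| Data corruption  | 30 min | 24h | Restore from S3 backup         |
| Full server loss | 2h     | 24h | Provision new server + restore |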
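
A recovery plan is only useful when compared against what the business will tolerate. The sketch below encodes the table and lists the scenarios that would miss a given target. The `Scenario` class, the `violations` function, and the example targets are assumptions for illustration; the RTO and RPO values are those in the table.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Scenario:
    name: str
    rto_minutes: int   # time to restore service
    rpo_hours: int     # maximum window of lost data


PLAN = [
    Scenario("Container crash", 1, 0),
    Scenario("Server reboot", 5, 0),
    Scenario("Data corruption", 30, 24),
    Scenario("Full server loss", 120, 24),
]


def violations(plan: Sequence[Scenario], max_rto_minutes: int, max_rpo_hours: int) -> List[str]:
    """Names of scenarios whose RTO or RPO exceeds the stated targets."""
    return [s.name for s in plan
            if s.rto_minutes > max_rto_minutes or s.rpo_hours > max_rpo_hours]
```

For example, a hypothetical target of at most 12 hours of data loss flags both backup-dependent scenarios, which indicates that daily backups would need to run more often.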

Quick Restore

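# On new server:
# 1. Install Docker
# 2. Pull the most recent backup (backups are stored under date-stamped prefixes)
LATEST="$(aws s3 ls s3://novacraft-backups/hermes/ \
  | awk '$1 == "PRE" && $2 ~ /^[0-9]+\/$/ {print $2}' | sort | tail -n 1)"
aws s3 sync "s3://novacraft-backups/hermes/${LATEST}" /opt/hermes/

# 3. Recreate docker-compose.yml and .env (the backup holds only config and data), then start
cd /opt/hermes && docker compose up -d

# 4. Verify
docker compose logs -f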
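
The backup script writes to date-stamped prefixes (YYYYMMDD), so a restore needs to select the newest one. Because that format sorts chronologically as plain text, the selection is a one-line maximum. `latest_backup_prefix` is a hypothetical helper, and the input is assumed to be the prefix names listed under `s3://novacraft-backups/hermes/`.

```python
from typing import Iterable


def latest_backup_prefix(prefixes: Iterable[str]) -> str:
    """Return the newest YYYYMMDD prefix, ignoring anything not date-stamped."""
    stamped = [p.rstrip("/") for p in prefixes]
    stamped = [p for p in stamped if len(p) == 8 and p.isdigit()]
    if not stamped:
        raise ValueError("no date-stamped backups found")
    return max(stamped)    # YYYYMMDD strings sort in chronological order
```

Raising an error when nothing matches is deliberate: a restore that silently syncs from an empty or wrong prefix is harder to diagnose than one that stops early.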

2.7 Hands-On Exercise

  1. Deploy Hermes on Docker:

mkdir -p /opt/hermes && cd /opt/hermes
# Create docker-compose.yml from section 2.2
docker compose up -d

  2. Configure health check: Add the watchdog script

  3. Set up daily backup: Create the backup cron job

crontab -e
# Add: 0 3 * * * /opt/hermes/scripts/backup.sh

  4. Write your SOUL.md: Customize with your company context

  5. Test failover: Stop the container and verify that the watchdog restarts it

docker stop hermes-agent
# Docker's restart policy does not restart a manually stopped container,
# so wait for the next watchdog run (up to 5 minutes), then verify
docker ps

Lesson Summary

| Key Point  | Details                                         |
|------------|-------------------------------------------------|
| Deployment | Docker Compose recommended for 20-200 users     |
| Sizing     | 4 vCPU / 8 GB for 50 users                      |
| Network    | Nginx reverse proxy + HTTPS                     |
| HA         | Health checks + auto-restart + daily backups    |
| DR         | RTO 2h, RPO 24h with S3 backups                 |

Next Lesson: Multi-Platform Gateway—connecting Slack, Discord, Email, and more.