AI Workloads

Orchestr8 manages AI workloads on Kubernetes by integrating Llama-Stack with Orchestr8's security, multi-tenancy, and operational tooling.

What Orchestr8 Adds to AI Workloads

  • Enterprise Security: Zero-trust networking, pod security standards, secrets management
  • Multi-Tenancy: Namespace isolation with RBAC and resource quotas
  • GitOps Integration: AI workloads deployed via ArgoCD with full audit trails
  • GPU Orchestration: Intelligent GPU scheduling across multiple tenants
  • Compliance: Built-in SOC2, GDPR, and HIPAA controls for AI applications
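
To make the multi-tenancy and GPU items above concrete, here is a minimal sketch of a per-tenant quota using the standard Kubernetes ResourceQuota API. The namespace team-a, the quota name, and every number are hypothetical, and this is not necessarily what Orchestr8 generates.

```yaml
# Sketch only: "team-a" and all limits are illustrative assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-workload-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
    # Extended resources such as GPUs are quota'd with the "requests." prefix only
    requests.nvidia.com/gpu: "2"
```

With a quota like this in place, pods in the namespace that would push the total GPU request above two are rejected at admission time.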

Quick Start

1. Initialize Your First AI Workload

# Create a RAG application
o8 llama init my-rag-app --template rag --provider openai

# Create an agentic workflow
o8 llama init my-agent --template agent --provider anthropic

# Create a simple inference service
o8 llama init my-inference --template inference --provider groq

2. Configure Your Workload

Navigate to your workload directory and customize the configuration:

cd my-rag-app

The generated structure includes:

  • .o8/module.yaml - AI workload specification
  • base/ - Kubernetes manifests
  • overlays/ - Environment-specific configurations
  • tests/ - Automated testing setup

3. Deploy to Your Cluster

# Validate configuration
o8 llama validate

# Deploy to development environment
o8 llama deploy --environment dev

# Check deployment status
o8 llama status

AI Workload Templates

Orchestr8 provides pre-configured templates for common AI use cases:

🔍 RAG Applications

Retrieval-Augmented Generation applications with vector search capabilities.

Features:

  • Document ingestion and chunking
  • Vector embedding generation
  • Semantic search integration
  • Context-aware response generation

Components:

  • Vector database (ChromaDB, Qdrant, or PGVector)
  • Embedding service
  • Document processing pipeline
  • Query interface

🤖 Agentic Workflows

Multi-step reasoning applications with tool integration.

Features:

  • Multi-turn conversation handling
  • Tool calling and execution
  • Memory management
  • Safety guardrails

Components:

  • Agent orchestrator
  • Tool registry
  • Memory storage
  • Safety filters

Inference Services

High-performance model serving for real-time applications.

Features:

  • Model loading and caching
  • Auto-scaling based on demand
  • Load balancing
  • Performance monitoring

Components:

  • Model server
  • Load balancer
  • Monitoring stack
  • Caching layer

🛠️ Custom Workloads

Flexible template for building specialized AI applications.

Features:

  • Customizable AI pipeline
  • Multi-provider support
  • Resource optimization
  • Monitoring integration

AI Providers

Orchestr8 supports multiple AI providers through standardized configuration:

Cloud Providers

  • OpenAI: GPT models, embeddings, and fine-tuned models
  • Anthropic: Claude models with advanced reasoning
  • Groq: High-speed inference with specialized hardware
  • AWS Bedrock: Enterprise AI models with AWS integration

Local Deployment

  • Ollama: Local model serving with GPU acceleration
  • Hugging Face: Open-source models and transformers
  • Custom Models: Deploy your own fine-tuned models

Vector Databases

  • ChromaDB: Simple and efficient vector storage
  • Qdrant: High-performance vector search engine
  • PGVector: PostgreSQL extension for vector operations

Configuration

Provider Configuration

AI providers are configured through Kubernetes secrets managed by External Secrets Operator:

# Example: OpenAI provider configuration
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: llama-stack-api-keys
spec:
  target:
    name: llama-stack-api-keys
    template:
      data:
        OPENAI_API_KEY: "{{ .openai_api_key }}"
  data:
    - secretKey: openai_api_key
      remoteRef:
        key: /orchestr8/llama-stack/api-keys
        property: openai_api_key
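
Once the ExternalSecret has been reconciled, a workload can read the resulting Kubernetes Secret like any other. The sketch below shows one way a container might consume the key through a standard secretKeyRef. The Deployment name, labels, and image are hypothetical, and the llama-stack namespace is assumed from the troubleshooting commands in this guide.

```yaml
# Sketch only: Deployment name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-rag-app
  namespace: llama-stack
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-rag-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-rag-app
    spec:
      containers:
        - name: app
          image: registry.example.com/my-rag-app:latest  # placeholder image
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llama-stack-api-keys  # Secret produced by the ExternalSecret target
                  key: OPENAI_API_KEY
```

Referencing the Secret by name keeps the API key out of the manifest and out of Git.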

Resource Management

AI workloads declare their GPU, CPU, and memory requirements explicitly:

# Example: GPU resource configuration
spec:
  requirements:
    compute:
      gpu:
        enabled: true
        type: nvidia.com/gpu
        count: 1
        memory: 16Gi
      cpu:
        requests: 2000m
        limits: 8000m
      memory:
        requests: 4Gi
        limits: 16Gi
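
For orientation, the values above might map onto a plain Kubernetes container spec roughly as follows. This is a sketch of standard Kubernetes resource syntax, not a statement of what Orchestr8 renders. The pod name and image are hypothetical, and the gpu.memory field has no direct equivalent in the core resource model, so it is left out.

```yaml
# Sketch only: pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # placeholder image
      resources:
        requests:
          cpu: 2000m
          memory: 4Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 8000m
          memory: 16Gi
          nvidia.com/gpu: 1  # for extended resources, requests must equal limits
```

GPUs are requested as whole units and cannot be overcommitted, which is why the request and limit match.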

Storage Configuration

Model caches and vector stores are declared as persistent storage with an explicit size and storage class:

# Example: Storage configuration
spec:
  requirements:
    storage:
      modelCache:
        type: persistent
        size: 100Gi
        storageClass: fast-ssd
      vectorStore:
        type: persistent
        size: 200Gi
        storageClass: fast-ssd
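
In plain Kubernetes terms, a persistent entry such as modelCache would typically correspond to a PersistentVolumeClaim. The sketch below assumes that mapping. The claim name and access mode are hypothetical, and the fast-ssd storage class must already exist in the cluster.

```yaml
# Sketch only: claim name and access mode are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: llama-stack
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```

If several replicas need to share one model cache, an access mode such as ReadWriteMany may be required, and that depends on the storage backend.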

Security

AI workloads in Orchestr8 follow enterprise security best practices:

Network Security

  • Default-deny network policies with explicit allow rules
  • Istio service mesh for mTLS communication
  • Pod Security Standards with restricted profiles
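
The three controls above each map to a standard resource. The sketch below shows a default-deny NetworkPolicy, an Istio PeerAuthentication enforcing strict mTLS, and a namespace label enabling the restricted Pod Security Standard. The resource names and the llama-stack namespace are assumptions, and Orchestr8 may generate different manifests.

```yaml
# Sketch only: names and namespace are illustrative.
# Deny all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: llama-stack
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Require mutual TLS for all workloads in the namespace (Istio)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: llama-stack
spec:
  mtls:
    mode: STRICT
---
# Enforce the "restricted" Pod Security Standard at admission
apiVersion: v1
kind: Namespace
metadata:
  name: llama-stack
  labels:
    pod-security.kubernetes.io/enforce: restricted
```

A default-deny egress policy also blocks DNS and outbound calls to AI providers, so explicit allow rules for those destinations are needed alongside it.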

Data Protection

  • Encryption at rest for model weights and data
  • Encryption in transit for all communications
  • Secret management through External Secrets Operator

Compliance

  • GDPR compliance with data retention policies
  • SOC2 controls for access and audit logging
  • HIPAA support for healthcare applications

Monitoring

Orchestr8 provides comprehensive monitoring for AI workloads:

Metrics

  • Model Performance: Latency, throughput, accuracy
  • Resource Usage: GPU utilization, memory consumption
  • Cost Tracking: Provider API usage and costs

Dashboards

  • AI Workload Overview: High-level health and performance
  • Provider Metrics: API usage and response times
  • Resource Utilization: GPU and compute efficiency

Alerting

  • Performance Degradation: Response time increases
  • Resource Exhaustion: GPU or memory limits
  • Cost Overruns: Budget threshold violations
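
As an illustration of a resource-exhaustion alert, the sketch below assumes the cluster runs the Prometheus Operator and NVIDIA's dcgm-exporter. Neither is confirmed by this guide. The rule name, threshold, and durations are hypothetical.

```yaml
# Sketch only: assumes Prometheus Operator and dcgm-exporter are installed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-workload-alerts
  namespace: llama-stack
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUHighUtilization
          # DCGM_FI_DEV_GPU_UTIL reports GPU utilization as a percentage
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) > 95
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Average GPU utilization has been above 95% for 15 minutes"
```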

Best Practices

Development

  1. Start with templates - Begin from a provided template (rag, agent, or inference) rather than a blank configuration
  2. Validate early - Run o8 llama validate before deployment
  3. Test incrementally - Deploy to dev environment first
  4. Monitor resources - Watch GPU and memory usage

Production

  1. Resource planning - Size GPU nodes appropriately
  2. Provider redundancy - Configure multiple AI providers
  3. Scaling policies - Set up horizontal pod autoscaling
  4. Backup strategies - Backup model weights and vector data
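
For the scaling policies item above, a minimal HorizontalPodAutoscaler could look like the sketch below. It uses the standard autoscaling/v2 API. The target Deployment name, replica bounds, and CPU threshold are hypothetical.

```yaml
# Sketch only: target name, replica bounds, and threshold are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-inference
  namespace: llama-stack
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU utilization is often a weak signal for GPU-bound inference. Scaling on request rate or queue depth through custom metrics may fit better, but that requires a metrics adapter.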

Security

  1. Least privilege - Use minimal RBAC permissions
  2. Network isolation - Implement proper network policies
  3. Secret rotation - Regularly rotate API keys
  4. Audit logging - Enable comprehensive audit trails
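
As a sketch of the least privilege item above, the Role below grants read-only access to pods and their logs in a single namespace and binds it to one group. The role name, the ml-engineers group, and the namespace are hypothetical.

```yaml
# Sketch only: role name, group, and namespace are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-workload-viewer
  namespace: llama-stack
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-workload-viewer
  namespace: llama-stack
subjects:
  - kind: Group
    name: ml-engineers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ai-workload-viewer
  apiGroup: rbac.authorization.k8s.io
```

The Role deliberately omits secrets, so members of the group can debug workloads without being able to read provider API keys.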

Troubleshooting

Common Issues

GPU nodes not detected:

# Check GPU node labels
kubectl get nodes -l nvidia.com/gpu.present=true

# Verify GPU operator installation (adjust the namespace to match your install; some releases use gpu-operator)
kubectl get pods -n gpu-operator-resources

AI workload fails to start:

# Check pod events
kubectl describe pod -n llama-stack -l app.kubernetes.io/name=llama-stack

# View detailed logs
o8 llama logs --follow

Provider authentication errors:

# Verify secrets exist
kubectl get secrets -n llama-stack

# Check secret contents (base64 encoded)
kubectl get secret llama-stack-api-keys -n llama-stack -o yaml

Getting Help

  • Check logs: Use o8 llama logs for real-time debugging
  • Validate configuration: Run o8 llama validate to check setup
  • Monitor status: Use o8 llama status for health overview
  • Review documentation: Check provider-specific guides

Next Steps

  1. Provider Configuration - Set up AI providers and secrets
  2. Resource Planning - Plan GPU and storage requirements
  3. Security Configuration - Harden AI workload security
  4. Monitoring Setup - Configure AI-specific monitoring