AI Workloads
Orchestr8 provides enterprise-grade AI workload management on Kubernetes, integrating Llama-Stack with Orchestr8's security controls, multi-tenancy, and operational tooling.
What Orchestr8 Adds to AI Workloads
- Enterprise Security: Zero-trust networking, pod security standards, secrets management
- Multi-Tenancy: Namespace isolation with RBAC and resource quotas
- GitOps Integration: AI workloads deployed via ArgoCD with full audit trails
- GPU Orchestration: Intelligent GPU scheduling across multiple tenants
- Compliance: Built-in SOC 2, GDPR, and HIPAA controls for AI applications
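To make the multi-tenancy and GPU points above concrete, a per-tenant cap can be expressed as a standard Kubernetes ResourceQuota. The sketch below builds such a manifest in Python and prints it as JSON, which `kubectl apply -f` accepts. The helper name `tenant_quota`, the namespace `team-a`, and the limits are hypothetical, and this is not necessarily how Orchestr8 generates quotas internally.

```python
import json


def tenant_quota(namespace: str, gpus: int, cpu: str, memory: str) -> dict:
    """Build a Kubernetes ResourceQuota manifest for one tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                # Extended resources such as GPUs are capped with the
                # "requests.<resource-name>" key.
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }


# Hypothetical tenant: two GPUs, 16 CPU cores, 64Gi of memory.
manifest = tenant_quota("team-a", gpus=2, cpu="16", memory="64Gi")
print(json.dumps(manifest, indent=2))
```

Keeping the quota in the tenant's namespace means a single noisy tenant cannot claim every GPU in the cluster.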
Quick Start
1. Initialize Your First AI Workload
# Create a RAG application
o8 llama init my-rag-app --template rag --provider openai
# Create an agentic workflow
o8 llama init my-agent --template agent --provider anthropic
# Create a simple inference service
o8 llama init my-inference --template inference --provider groq
2. Configure Your Workload
Navigate to your workload directory and customize the configuration:
cd my-rag-app
The generated structure includes:
- .o8/module.yaml - AI workload specification
- base/ - Kubernetes manifests
- overlays/ - Environment-specific configurations
- tests/ - Automated testing setup
3. Deploy to Your Cluster
# Validate configuration
o8 llama validate
# Deploy to development environment
o8 llama deploy --environment dev
# Check deployment status
o8 llama status
AI Workload Templates
Orchestr8 provides pre-configured templates for common AI use cases:
🔍 RAG Applications
Retrieval-Augmented Generation applications with vector search capabilities.
Features:
- Document ingestion and chunking
- Vector embedding generation
- Semantic search integration
- Context-aware response generation
Components:
- Vector database (ChromaDB, Qdrant, or PGVector)
- Embedding service
- Document processing pipeline
- Query interface
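The ingestion, embedding, and search steps listed above can be sketched end to end in a few lines. This is a toy illustration only: `embed` uses word counts as a stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical rather than part of the RAG template.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 7, overlap: int = 1) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


doc = (
    "gpu nodes are labeled by the operator vector search finds similar "
    "chunks quickly secrets are synced from the store"
)
chunks = chunk(doc)
print(retrieve("vector search", chunks))
```

The retrieved chunks would then be placed in the model prompt as context. The overlap between windows reduces the chance that a relevant sentence is split across two chunks.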
🤖 Agentic Workflows
Multi-step reasoning applications with tool integration.
Features:
- Multi-turn conversation handling
- Tool calling and execution
- Memory management
- Safety guardrails
Components:
- Agent orchestrator
- Tool registry
- Memory storage
- Safety filters
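The relationship between the tool registry, memory storage, and safety filters can be sketched as follows. This is a minimal illustration under stated assumptions: in a real agent the model chooses the steps, whereas here the plan is fixed, and `ToolRegistry` and `run_plan` are hypothetical names, not Orchestr8 or Llama-Stack APIs.

```python
from typing import Callable


class ToolRegistry:
    """Maps tool names to callables and enforces an allow-list before execution."""

    def __init__(self, allowed: set[str]) -> None:
        self._tools: dict[str, Callable[..., str]] = {}
        self._allowed = allowed

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        if name not in self._allowed:
            # Safety filter: registered, but not permitted for this workload.
            raise PermissionError(f"tool not allowed: {name}")
        return self._tools[name](**kwargs)


def run_plan(
    plan: list[tuple[str, dict]], registry: ToolRegistry, memory: list[str]
) -> list[str]:
    """Run each (tool, arguments) step in order and append the result to memory."""
    for tool, args in plan:
        memory.append(f"{tool} -> {registry.call(tool, **args)}")
    return memory


registry = ToolRegistry(allowed={"add"})
registry.register("add", lambda a, b: str(a + b))
registry.register("shell", lambda cmd: "not executed")

history = run_plan([("add", {"a": 2, "b": 3})], registry, memory=[])
print(history)
```

Separating registration from permission lets a shared tool catalog be narrowed per tenant or per workload.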
⚡ Inference Services
High-performance model serving for real-time applications.
Features:
- Model loading and caching
- Auto-scaling based on demand
- Load balancing
- Performance monitoring
Components:
- Model server
- Load balancer
- Monitoring stack
- Caching layer
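For the auto-scaling feature above, it helps to see how a replica count follows from a load metric. The sketch below reproduces the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current x metric / target), skipped when the ratio is within a tolerance). Whether the inference template uses the HPA is an assumption, and the function name and default bounds are hypothetical.

```python
import math


def desired_replicas(
    current: int,
    metric: float,
    target: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Compute a replica count from the ratio of observed to target metric."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: avoid flapping.
        return current
    scaled = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, scaled))


# Two replicas at 90% utilization against a 60% target scale out.
print(desired_replicas(current=2, metric=90, target=60))
```

Clamping to a maximum matters for GPU-backed services, where each extra replica may require an additional GPU.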
🛠️ Custom Workloads
Flexible template for building specialized AI applications.
Features:
- Customizable AI pipeline
- Multi-provider support
- Resource optimization
- Monitoring integration
AI Providers
Orchestr8 supports multiple AI providers through standardized configuration:
Cloud Providers
- OpenAI: GPT models, embeddings, and fine-tuned models
- Anthropic: Claude models with advanced reasoning
- Groq: High-speed inference with specialized hardware
- AWS Bedrock: Enterprise AI models with AWS integration
Local Deployment
- Ollama: Local model serving with GPU acceleration
- Hugging Face: Open-source models and transformers
- Custom Models: Deploy your own fine-tuned models
Vector Databases
- ChromaDB: Simple and efficient vector storage
- Qdrant: High-performance vector search engine
- PGVector: PostgreSQL extension for vector operations
Configuration
Provider Configuration
AI providers are configured through Kubernetes secrets managed by External Secrets Operator:
# Example: OpenAI provider configuration
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: llama-stack-api-keys
spec:
  target:
    name: llama-stack-api-keys
    template:
      data:
        OPENAI_API_KEY: "{{ .openai_api_key }}"
  data:
    - secretKey: openai_api_key
      remoteRef:
        key: /orchestr8/llama-stack/api-keys
        property: openai_api_key
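On the consuming side, a workload typically reads the synced key at startup and fails fast if it is missing. The sketch below assumes the resulting Secret is exposed to the pod as environment variables, which the configuration above does not itself guarantee. `load_api_key` and `MissingCredential` are hypothetical names.

```python
import os
from typing import Mapping


class MissingCredential(RuntimeError):
    """Raised when an expected provider credential is absent or unusable."""


def load_api_key(
    name: str = "OPENAI_API_KEY", environ: Mapping[str, str] = os.environ
) -> str:
    """Return the provider key from the environment, or raise a clear error."""
    value = environ.get(name, "").strip()
    if not value:
        raise MissingCredential(
            f"{name} is not set; check that the Secret was synced and mounted"
        )
    if "{{" in value:
        # An unrendered template suggests the ExternalSecret did not resolve.
        raise MissingCredential(f"{name} contains an unrendered template placeholder")
    return value


# Example with an explicit mapping instead of the real process environment.
print(len(load_api_key(environ={"OPENAI_API_KEY": "sk-example"})))
```

Failing at startup surfaces a broken secret sync as a clear pod error instead of a provider authentication failure on the first request.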
Resource Management
AI workloads require careful resource planning:
# Example: GPU resource configuration
spec:
  requirements:
    compute:
      gpu:
        enabled: true
        type: nvidia.com/gpu
        count: 1
        memory: 16Gi
      cpu:
        requests: 2000m
        limits: 8000m
      memory:
        requests: 4Gi
        limits: 16Gi
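A quick sanity check for values like these is that every request is no larger than its limit. The sketch below parses a small subset of Kubernetes quantity syntax (the `m` CPU suffix and the binary memory suffixes `Ki`, `Mi`, `Gi`, `Ti`) and compares the two. The function names are hypothetical, and `o8 llama validate` may or may not perform an equivalent check.

```python
from typing import Callable

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores: '2000m' is 2.0, '8' is 8.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity with a binary suffix to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain bytes


def request_within_limit(request: str, limit: str, parse: Callable) -> bool:
    """True when the request does not exceed the limit."""
    return parse(request) <= parse(limit)


# Values taken from the example configuration above.
print(request_within_limit("2000m", "8000m", parse_cpu))
print(request_within_limit("4Gi", "16Gi", parse_memory))
```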
Storage Configuration
AI applications need optimized storage for models and data:
# Example: Storage configuration
spec:
  requirements:
    storage:
      modelCache:
        type: persistent
        size: 100Gi
        storageClass: fast-ssd
      vectorStore:
        type: persistent
        size: 200Gi
        storageClass: fast-ssd
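When choosing a size for the vector store volume, a rough estimate is vectors x dimensions x bytes per value, multiplied by an allowance for index structures and metadata. The sketch below does that arithmetic. The 1.5x overhead factor, the 20% headroom, and the example corpus size are illustrative assumptions, not measurements of any particular database.

```python
def vector_store_gib(
    num_vectors: int,
    dimensions: int,
    bytes_per_value: int = 4,  # float32
    overhead: float = 1.5,     # assumed allowance for index and metadata
) -> float:
    """Estimate vector storage in GiB."""
    raw_bytes = num_vectors * dimensions * bytes_per_value
    return raw_bytes * overhead / 2**30


def fits(required_gib: float, volume_gib: float, headroom: float = 0.2) -> bool:
    """True if the estimate fits while leaving `headroom` of the volume free."""
    return required_gib <= volume_gib * (1 - headroom)


# Hypothetical corpus: ten million 1536-dimensional float32 embeddings.
estimate = vector_store_gib(num_vectors=10_000_000, dimensions=1536)
print(f"{estimate:.1f} GiB estimated; fits in a 200Gi volume: {fits(estimate, 200)}")
```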
Security
AI workloads in Orchestr8 follow enterprise security best practices:
Network Security
- Default-deny network policies with explicit allow rules
- Istio service mesh for mTLS communication
- Pod Security Standards with restricted profiles
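The default-deny pattern above pairs one policy that blocks all traffic in a namespace with narrower policies that open specific paths. The sketch below builds both as standard `networking.k8s.io/v1` NetworkPolicy manifests. The `llama-stack` namespace and label follow the troubleshooting commands in this guide, while the `api-gateway` client and port 8080 are hypothetical, and the policies Orchestr8 actually ships may differ.

```python
import json

LABEL = "app.kubernetes.io/name"


def default_deny(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector matches all pods in the namespace.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace: str, app: str, from_app: str, port: int) -> dict:
    """Allow TCP ingress to `app` from pods labeled `from_app` on one port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {LABEL: app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {LABEL: from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policies = [
    default_deny("llama-stack"),
    allow_ingress("llama-stack", "llama-stack", "api-gateway", 8080),
]
print(json.dumps(policies, indent=2))
```

Because the deny policy also covers egress, calls to external AI providers and DNS would need their own explicit egress rules.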
Data Protection
- Encryption at rest for model weights and data
- Encryption in transit for all communications
- Secret management through External Secrets Operator
Compliance
- GDPR compliance with data retention policies
- SOC 2 controls for access and audit logging
- HIPAA support for healthcare applications
Monitoring
Orchestr8 provides comprehensive monitoring for AI workloads:
Metrics
- Model Performance: Latency, throughput, accuracy
- Resource Usage: GPU utilization, memory consumption
- Cost Tracking: Provider API usage and costs
Dashboards
- AI Workload Overview: High-level health and performance
- Provider Metrics: API usage and response times
- Resource Utilization: GPU and compute efficiency
Alerting
- Performance Degradation: Response time increases
- Resource Exhaustion: GPU or memory limits
- Cost Overruns: Budget threshold violations
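The alert categories above reduce to simple threshold checks over collected metrics. The sketch below evaluates two of them, a p95 latency regression against a baseline and a spend threshold against a budget. The alert names, the 1.5x latency factor, and the 80% budget fraction are illustrative assumptions, not Orchestr8 defaults.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def evaluate_alerts(
    latencies_ms: list[float],
    baseline_p95_ms: float,
    spend_usd: float,
    budget_usd: float,
    latency_factor: float = 1.5,
    budget_fraction: float = 0.8,
) -> list[str]:
    """Return the names of the alerts that should fire."""
    alerts = []
    if percentile(latencies_ms, 95) > latency_factor * baseline_p95_ms:
        alerts.append("PerformanceDegradation")
    if spend_usd >= budget_fraction * budget_usd:
        alerts.append("CostOverrun")
    return alerts


samples = [100.0] * 18 + [900.0, 950.0]
fired = evaluate_alerts(samples, baseline_p95_ms=200, spend_usd=85, budget_usd=100)
print(fired)
```

Alerting on a percentile rather than the mean keeps a few slow requests from being hidden by many fast ones.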
Best Practices
Development
- Start with templates - Use provided templates as starting points
- Validate early - Run o8 llama validate before deployment
- Test incrementally - Deploy to dev environment first
- Monitor resources - Watch GPU and memory usage
Production
- Resource planning - Size GPU nodes appropriately
- Provider redundancy - Configure multiple AI providers
- Scaling policies - Set up horizontal pod autoscaling
- Backup strategies - Backup model weights and vector data
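Provider redundancy usually means trying providers in a preferred order and falling back when one fails. The sketch below shows that pattern with plain callables standing in for provider clients. `complete_with_fallback` and `AllProvidersFailed` are hypothetical names, and real clients would also need timeouts and retry limits.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider returned an error."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try providers in order; return (provider name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure triggers the fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    """Stand-in for a provider that is currently unavailable."""
    raise TimeoutError("upstream timed out")


def healthy_secondary(prompt: str) -> str:
    """Stand-in for a provider that responds normally."""
    return f"echo: {prompt}"


used, answer = complete_with_fallback(
    "hello", [("primary", flaky_primary), ("secondary", healthy_secondary)]
)
print(used, answer)
```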
Security
- Least privilege - Use minimal RBAC permissions
- Network isolation - Implement proper network policies
- Secret rotation - Regularly rotate API keys
- Audit logging - Enable comprehensive audit trails
Troubleshooting
Common Issues
GPU nodes not detected:
# Check GPU node labels
kubectl get nodes -l nvidia.com/gpu.present=true
# Verify GPU operator installation
kubectl get pods -n gpu-operator-resources
AI workload fails to start:
# Check pod events
kubectl describe pod -n llama-stack -l app.kubernetes.io/name=llama-stack
# View detailed logs
o8 llama logs --follow
Provider authentication errors:
# Verify secrets exist
kubectl get secrets -n llama-stack
# Check secret contents (base64 encoded)
kubectl get secret llama-stack-api-keys -n llama-stack -o yaml
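The values under `data` in that output are base64 encoded, so they must be decoded before they can be compared with the expected key. The sketch below decodes such a mapping and redacts the result so that full keys do not end up in terminals or logs. The helper names and the sample value are hypothetical.

```python
import base64


def decode_secret_data(data: dict[str, str]) -> dict[str, str]:
    """Decode the base64 values found under a Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


def redact(value: str, visible: int = 4) -> str:
    """Keep the first few characters and mask the rest."""
    return value[:visible] + "*" * max(len(value) - visible, 0)


# Sample data shaped like the `data` section of the Secret.
sample = {"OPENAI_API_KEY": base64.b64encode(b"sk-test-123").decode("ascii")}
decoded = decode_secret_data(sample)
print({key: redact(value) for key, value in decoded.items()})
```

An empty value or a literal template string after decoding points to a sync problem in External Secrets Operator, not a provider-side issue.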
Getting Help
- Check logs: Use o8 llama logs for real-time debugging
- Validate configuration: Run o8 llama validate to check setup
- Monitor status: Use o8 llama status for health overview
- Review documentation: Check provider-specific guides
Next Steps
- Provider Configuration - Set up AI providers and secrets
- Resource Planning - Plan GPU and storage requirements
- Security Configuration - Harden AI workload security
- Monitoring Setup - Configure AI-specific monitoring