NebusAI Engineering Team Handbook
Quick Summary (for RAG)
Internal handbook for the NebusAI engineering team, covering codebase architecture (Rust, Go, Python, Flutter), development workflow (GitHub flow, PR reviews), code standards by language, testing requirements (unit, integration, E2E), deployment pipeline (staging, canary, production), on-call rotation, incident response procedures, AI agent development guidelines, the new-engineer onboarding checklist, security best practices, and Architecture Decision Records (ADRs). For internal NebusAI engineering use only.
Table of Contents
- Engineering Overview
- Codebase Architecture
- Development Workflow
- Code Standards
- Testing Requirements
- Deployment Pipeline
- AI Agent Development
- On-Call & Incidents
- New Engineer Onboarding
- Security Best Practices
- Architecture Decision Records (ADRs)
Engineering Overview
Team Structure
| Team | Focus | Tech Stack |
|---|---|---|
| Platform | Core services, gating, tenancy | Rust |
| Commerce | POS, orders, payments, inventory | Rust |
| Gateway | API gateway, orchestration | Go |
| AI/ML | Agents, predictions, analytics | Python |
| Frontend | All Flutter shells | Flutter/Dart |
| Edge | Cloudflare workers, OlympusEdge | TypeScript, Rust |
| Infrastructure | GCP, CI/CD, monitoring | Terraform, Pulumi |
Key Contacts
| Role | Slack Handle | Contact Method |
|---|---|---|
| VP Engineering | @vp-eng | Check #eng-general channel topic |
| Platform Lead | @platform-lead | DM or #eng-platform |
| AI Lead | @ai-lead | DM or #eng-ai |
| Frontend Lead | @frontend-lead | DM or #eng-frontend |
See the NebusAI Org Chart for current team members.
Communication Channels
| Channel | Purpose |
|---|---|
| #eng-general | General engineering |
| #eng-platform | Platform team |
| #eng-frontend | Flutter team |
| #eng-ai | AI/ML team |
| #eng-oncall | On-call coordination |
| #deployments | Deploy notifications |
| #incidents | Active incidents |
Codebase Architecture
Repository Structure
olympus-cloud-gcp/
├── backend/
│   ├── rust/
│   │   ├── auth/          # Authentication service
│   │   ├── platform/      # Gating, tenancy, RBAC
│   │   └── commerce/      # Orders, payments, inventory
│   ├── go/
│   │   └── gateway/       # API gateway, GraphQL
│   └── python/
│       ├── ai/            # AI agents, LangGraph
│       └── analytics/     # Predictions, ML models
├── frontend/
│   └── flutter/
│       ├── packages/      # Shared packages
│       └── shells/        # 12 deployment shells
├── workers/               # Cloudflare Workers
├── edge/                  # OlympusEdge code
├── infrastructure/        # Terraform, Pulumi
├── docs/                  # Documentation
└── scripts/               # Build & deployment scripts
Service Architecture
┌─────────────────────────────────────────────────────────────┐
│                      Edge (Cloudflare)                      │
│   Workers AI │ Vectorize │ AI Gateway │ Workers │ R2 │ D1   │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                   Go Gateway (Cloud Run)                    │
│        GraphQL │ REST │ WebSocket │ Auth Middleware         │
└─────────────────────────┬───────────────────────────────────┘
                          │
          ┌───────────────┴───────────────┐
          │                               │
┌─────────▼─────────┐           ┌─────────▼─────────┐
│   Rust Services   │           │  Python Services  │
│    (Cloud Run)    │           │    (Cloud Run)    │
├───────────────────┤           ├───────────────────┤
│ • Auth (8001)     │           │ • Analytics (8004)│
│ • Platform (8002) │◄─────────►│ • ML (8005)       │
│ • Commerce (8003) │  Events   │ • AI Predictions  │
│ • Creator (8004)  │           └─────────┬─────────┘
│ • CMS (8005)      │                     │
│ • Chat (8007)     │                     │
│ • Alerting (8080) │                     │
└─────────┬─────────┘                     │
          │                               │
          └────────────┬──────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│                         Data Layer                          │
│     Cloud Spanner │ ClickHouse │ Redis │ Pub/Sub │ GCS      │
└─────────────────────────────────────────────────────────────┘
Key Documents
- Architecture: docs/architecture/ARCHITECTURE.md
- Master Plan: docs/plans/master-plan.md
- Production Standards: docs/PRODUCTION-READY-STANDARDS.md
Development Workflow
GitHub Flow
- Create a branch from develop:
  git checkout develop
  git pull origin develop
  git checkout -b feat/epic-XXX-description
- Make changes with atomic commits:
  git commit -m "feat(component): description [#XXX]"
- Push and create a PR:
  git push origin feat/epic-XXX-description
- Get review (2 approvals required)
- Merge via squash merge
Branch Naming
| Type | Pattern | Example |
|---|---|---|
| Feature | feat/epic-XXX-description | feat/epic-944-ai-router |
| Bug Fix | fix/issue-XXX-description | fix/issue-123-login-error |
| Hotfix | hotfix/issue-XXX-description | hotfix/issue-456-payment-crash |
| Docs | docs/description | docs/update-api-reference |
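The naming patterns in the table above lend themselves to an automated check, for example in a pre-push hook. The following is a minimal sketch only: the function name `classify_branch` and the exact regular expressions (numeric IDs, lowercase hyphen-separated descriptions) are assumptions for illustration, not an existing NebusAI tool.

```python
import re
from typing import Optional

# Hypothetical slug rule: lowercase words or digits joined by hyphens.
_SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# One pattern per row of the Branch Naming table.
BRANCH_PATTERNS = {
    "feature": re.compile(rf"^feat/epic-\d+-{_SLUG}$"),
    "bug fix": re.compile(rf"^fix/issue-\d+-{_SLUG}$"),
    "hotfix": re.compile(rf"^hotfix/issue-\d+-{_SLUG}$"),
    "docs": re.compile(rf"^docs/{_SLUG}$"),
}


def classify_branch(name: str) -> Optional[str]:
    """Return the branch type if the name matches a known pattern, else None."""
    for branch_type, pattern in BRANCH_PATTERNS.items():
        if pattern.match(name):
            return branch_type
    return None
```

Calling `classify_branch("feat/epic-944-ai-router")` returns `"feature"`, while a name outside the table such as `"feature/new-thing"` returns `None`.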
Commit Messages
Follow Conventional Commits:
<type>(<scope>): <description> [#issue]
[optional body]
[optional footer]
Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation
- style: Formatting
- refactor: Code refactoring
- test: Adding tests
- chore: Maintenance
Examples:
feat(gating): add canary deployment support [#944]
fix(commerce): resolve payment timeout issue [#123]
docs(api): update order endpoint documentation
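The subject-line format above can be checked mechanically, for instance in a commit-msg hook. This is a sketch under stated assumptions: the helper name `parse_commit_subject` is hypothetical, the scope is assumed to be lowercase letters, digits, and hyphens, and the issue reference is treated as optional because the `docs(api)` example omits it.

```python
import re
from typing import Dict, Optional

COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# <type>(<scope>): <description> [#issue], with the issue reference optional.
COMMIT_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"\((?P<scope>[a-z0-9-]+)\): "
    r"(?P<description>.+?)"
    r"(?: \[#(?P<issue>\d+)\])?$"
)


def parse_commit_subject(subject: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a commit subject into its parts, or return None if it is malformed."""
    match = COMMIT_RE.match(subject)
    return match.groupdict() if match else None
```

A subject without a recognized type or without a scope, such as `"update stuff"`, yields `None`, so a hook can reject it before the commit is created.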
Pull Request Requirements
PR Template:
## Description
Brief description of changes
## Related Issue
Closes #XXX
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation
## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] E2E tests pass (if applicable)
## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings
Review Requirements:
- 2 approvals required
- All CI checks pass
- No unresolved comments
- Linked to GitHub issue
Code Standards
Rust Standards
Formatting:
cargo fmt
cargo clippy
Error Handling:
// Use Result with custom error types
pub type Result<T> = std::result::Result<T, ServiceError>;
// Use ? operator for propagation
pub async fn get_tenant(&self, id: &str) -> Result<Tenant> {
    let tenant = self.repo.find_by_id(id).await?;
    Ok(tenant)
}
Testing:
#[cfg(test)]
mod tests {
    use super::*;
    #[tokio::test]
    async fn test_get_tenant_success() {
        // Arrange
        let repo = MockTenantRepo::new();
        let service = TenantService::new(repo);
        // Act
        let result = service.get_tenant("tenant-123").await;
        // Assert
        assert!(result.is_ok());
    }
}
Go Standards
Formatting:
go fmt ./...
golangci-lint run
Error Handling:
// Always handle errors explicitly
result, err := doSomething()
if err != nil {
	return fmt.Errorf("failed to do something: %w", err)
}
Testing:
func TestGetTenant(t *testing.T) {
	// Arrange
	repo := NewMockRepo()
	service := NewTenantService(repo)
	// Act
	tenant, err := service.GetTenant("tenant-123")
	// Assert
	assert.NoError(t, err)
	assert.Equal(t, "tenant-123", tenant.ID)
}
Python Standards
Formatting:
black .
ruff check .
mypy .
Type Hints:
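from typing import Optional
from pydantic import BaseModel
class Tenant(BaseModel):
    """Minimal tenant model, shown for illustration only."""
    id: str
class TenantService:
    def __init__(self, repo) -> None:
        self.repo = repo
    async def get_tenant(self, tenant_id: str) -> Optional[Tenant]:
        """Get tenant by ID."""
        return await self.repo.find_by_id(tenant_id)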
Testing:
import pytest
from unittest.mock import AsyncMock
@pytest.mark.asyncio
async def test_get_tenant_success():
    # Arrange
    repo = AsyncMock()
    repo.find_by_id.return_value = Tenant(id="tenant-123")
    service = TenantService(repo)
    # Act
    result = await service.get_tenant("tenant-123")
    # Assert
    assert result.id == "tenant-123"
Flutter/Dart Standards
Formatting:
dart format .
dart analyze
State Management (Riverpod):
import 'package:riverpod_annotation/riverpod_annotation.dart';
part 'tenant_provider.g.dart';
// @riverpod is required so the generator emits _$TenantNotifier
@riverpod
class TenantNotifier extends _$TenantNotifier {
  @override
  FutureOr<Tenant?> build(String tenantId) async {
    return await ref.read(tenantRepositoryProvider).findById(tenantId);
  }
  Future<void> refresh() async {
    state = const AsyncLoading();
    // Reuse the build argument rather than force-unwrapping state.value,
    // which may be null while loading or when no tenant was found
    state = await AsyncValue.guard(
      () => ref.read(tenantRepositoryProvider).findById(tenantId),
    );
  }
}
Testing Requirements
Coverage Requirements
| Type | Rust | Go | Python | Flutter |
|---|---|---|---|---|
| Unit Tests | 80% | 80% | 80% | 70% |
| Integration | Required | Required | Required | - |
| E2E | Critical paths | Critical paths | AI flows | Shell flows |
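The unit-test minimums in the table above can be enforced as a CI gate. The sketch below assumes coverage percentages have already been collected per language by some earlier step; the function name `coverage_failures` and the input shape are hypothetical.

```python
from typing import Dict, List

# Unit-test coverage minimums (percent) from the Coverage Requirements table.
UNIT_COVERAGE_MIN: Dict[str, float] = {
    "rust": 80.0,
    "go": 80.0,
    "python": 80.0,
    "flutter": 70.0,
}


def coverage_failures(measured: Dict[str, float]) -> List[str]:
    """Return, in sorted order, the languages whose coverage is below the minimum."""
    return sorted(
        lang for lang, pct in measured.items() if pct < UNIT_COVERAGE_MIN[lang]
    )
```

A CI step could fail the build whenever the returned list is non-empty and print the offending languages.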
Unit Tests
- Test individual functions/methods
- Mock dependencies
- Run on every commit
- Required for PR approval
Integration Tests
- Test service interactions
- Use test containers
- Run on PR merge
- Cover API contracts
E2E Tests
- Test full user flows
- Run in staging
- Required for release
- Cover critical paths:
  - Authentication
  - Order creation
  - Payment processing
  - AI agent interactions
Running Tests
# Rust
cargo test
cargo test --test integration
# Go
go test ./...
go test -tags=integration ./...
# Python
pytest
pytest -m integration
# Flutter
flutter test
flutter test integration_test/
Deployment Pipeline
Environments
| Environment | Purpose | Deploy Trigger |
|---|---|---|
| dev | Development | PR branch push |
| staging | Pre-production | Merge to develop |
| production | Live | Merge to main |
CI/CD Pipeline
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Lint   │───▶│  Test   │───▶│  Build  │───▶│ Deploy  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
                                                  │
                              ┌───────────────────┼───────────────────┐
                              │                   │                   │
                              ▼                   ▼                   ▼
                          ┌───────┐          ┌─────────┐          ┌──────┐
                          │  Dev  │          │ Staging │          │ Prod │
                          └───────┘          └─────────┘          └──────┘
Deployment Commands
# Deploy to staging
./scripts/deploy.sh staging
# Deploy to production (requires approval)
./scripts/deploy.sh production
# Rollback
./scripts/rollback.sh production v2.4.0
Canary Deployments
For production releases:
- Start canary (1% traffic)
- Monitor metrics (30 minutes)
- Expand (5%, 25%, 50%)
- Full rollout (100%)
Automatic rollback if:
- Error rate > 1%
- P99 latency > 500ms
- Success rate < 99%
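The canary stages and rollback thresholds above can be expressed as a small decision function. This is an illustrative sketch, not the actual deployment controller: the names `CanaryMetrics`, `should_rollback`, and `next_stage` are assumptions, and returning 0 to mean "roll back" is a convention chosen for the example.

```python
from dataclasses import dataclass

# Traffic percentages from the canary steps: 1%, then 5/25/50, then 100.
CANARY_STAGES = [1, 5, 25, 50, 100]


@dataclass
class CanaryMetrics:
    error_rate_pct: float
    p99_latency_ms: float
    success_rate_pct: float


def should_rollback(m: CanaryMetrics) -> bool:
    """Apply the three automatic-rollback conditions listed above."""
    return (
        m.error_rate_pct > 1.0
        or m.p99_latency_ms > 500.0
        or m.success_rate_pct < 99.0
    )


def next_stage(current_pct: int, m: CanaryMetrics) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    if should_rollback(m):
        return 0
    idx = CANARY_STAGES.index(current_pct)
    # Stay at 100 once the rollout is complete.
    return CANARY_STAGES[min(idx + 1, len(CANARY_STAGES) - 1)]
```

In this sketch the function would be evaluated once per monitoring window (30 minutes per the steps above); the comparisons are strict, so a stage sitting exactly on a threshold is not rolled back.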
Release Process
- Create release branch from develop
- Version bump in version files
- QA sign-off in staging
- Create PR to main
- Release approval (2 approvals)
- Merge and deploy
- Tag release (v2.4.1)
AI Agent Development
Agent Architecture
┌─────────────────────────────────────────────────────────────┐
│                   LangGraph Orchestrator                    │
├─────────────────────────────────────────────────────────────┤
│    intent_router → planner → approval_checker → executor    │
└─────────────────────────────────────────────────────────────┘
Creating a New Agent
- Define agent in backend/python/app/agents/
- Create graph nodes:
  - Intent router
  - Planner
  - Tool executor
  - Response generator
- Add HITL checkpoints for sensitive actions
- Register in agent registry
- Configure model tiers
Agent Configuration
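# agents/inventory_agent.py
from typing import TypedDict
from langgraph.graph import StateGraph
class InventoryState(TypedDict):
    # Illustrative state schema (an assumption); StateGraph requires one
    inventory: dict
class InventoryAgent:
    def __init__(self):
        self.graph = StateGraph(InventoryState)
        self.allowed_tiers = ["T2", "T3", "T4"]
        self.hitl_actions = ["order_inventory", "adjust_par_levels"]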
RAG Integration
from clients.vectorize_client import VectorizeClient
async def query_knowledge_base(query: str, index: str = "docs-rag"):
    client = VectorizeClient()
    results = await client.query(
        query=query,
        index=index,
        top_k=5,
        min_score=0.7,
    )
    return results
AI Testing
import pytest
@pytest.mark.asyncio
async def test_inventory_agent_low_stock():
    agent = InventoryAgent()
    # Simulate low stock scenario
    state = {"inventory": {"ground_beef": {"current": 10, "par": 50}}}
    result = await agent.process(state, "Check inventory levels")
    assert result.suggested_action == "order_ground_beef"
    assert result.requires_hitl is True
On-Call & Incidents
On-Call Rotation
- Weekly rotation (Monday 9AM to Monday 9AM)
- Primary + Secondary on-call
- Escalation path: Primary → Secondary → Team Lead → VP Eng
On-Call Responsibilities
- Respond to alerts within 15 minutes
- Triage and assess impact
- Mitigate or escalate
- Document in incident channel
- Handoff to next engineer
Incident Severity
| Severity | Impact | Response | Examples |
|---|---|---|---|
| P1 | Business down | Immediate | Full outage |
| P2 | Major feature broken | 1 hour | Payments failing |
| P3 | Feature degraded | 4 hours | Slow reports |
| P4 | Minor issue | 24 hours | UI glitch |
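The response targets in the severity table can be turned into concrete deadlines for alert tooling. The sketch below is an assumption-laden illustration: `response_deadline` is a hypothetical helper, and "Immediate" for P1 is modeled as a zero-length window.

```python
from datetime import datetime, timedelta, timezone

# Response targets from the Incident Severity table.
RESPONSE_TARGETS = {
    "P1": timedelta(0),        # "Immediate", modeled as no grace period
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Return the latest time by which a response is expected."""
    return opened_at + RESPONSE_TARGETS[severity]
```

For an incident opened at 09:00 UTC, a P2 deadline computed this way falls at 10:00 UTC and a P4 deadline at 09:00 UTC the following day.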
Incident Response
- Acknowledge alert in PagerDuty
- Join #incidents channel
- Assess scope and impact
- Communicate status
- Mitigate - fix or rollback
- Resolve - confirm fix
- Post-mortem - document learnings
Runbooks
Key runbooks in Cockpit:
- high-error-rate: Debug elevated errors
- latency-spike: Investigate slow responses
- database-issues: Spanner troubleshooting
- ai-service-down: AI agent recovery
New Engineer Onboarding
Week 1: Setup & Learning
Day 1-2: Environment Setup
- Get laptop and accounts
- Clone repository
- Configure GCP credentials for dev environment
- Verify connectivity to dev.api.olympuscloud.ai
- Review architecture docs
Day 3-5: Codebase Familiarity
- Read ARCHITECTURE.md
- Read master-plan.md
- Explore service code
- Review recent PRs
- Complete first "good first issue"
Week 2: Deeper Dive
- Pair with team member
- Attend team standup
- Fix a real bug
- Write tests for existing code
- Review a PR
Week 3-4: Independent Work
- Own a small feature
- Present at team meeting
- Join on-call shadow rotation
- Complete security training
- Read incident post-mortems
Key Resources
| Resource | Location |
|---|---|
| Architecture | docs/architecture/ARCHITECTURE.md |
| API Reference | docs/api/ |
| Runbooks | Cockpit > Runbooks |
| Slack | #eng-general |
| Wiki | Notion/Confluence |
Buddy System
Every new engineer gets a buddy:
- Same team, 6+ months tenure
- Daily sync for first 2 weeks
- Weekly sync for month 2
- Available for questions anytime
Security Best Practices
Secure Coding Guidelines
Input Validation:
- Validate ALL user input at service boundaries
- Use parameterized queries (no string concatenation)
- Sanitize output to prevent XSS
- Validate file uploads (type, size, content)
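To make the parameterized-query rule concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in database. The production data layer is Cloud Spanner, so the table, data, and `find_tenant` helper are assumptions for illustration; the principle of binding user input as a parameter rather than concatenating it into SQL carries over.

```python
import sqlite3

# In-memory stand-in database with a single illustrative row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO tenants VALUES (?, ?)", ("tenant-123", "Acme"))


def find_tenant(db: sqlite3.Connection, tenant_id: str):
    """Look up a tenant by ID using a bound parameter."""
    # The driver passes tenant_id as data; it is never spliced into the SQL text,
    # so quotes or SQL keywords in the input cannot change the query.
    cur = db.execute("SELECT id, name FROM tenants WHERE id = ?", (tenant_id,))
    return cur.fetchone()
```

With this approach an injection attempt such as `"x' OR '1'='1"` is compared literally against the `id` column and matches no row.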
Authentication & Authorization:
- Use JWT tokens with short expiration (15 min access, 7 day refresh)
- Implement RBAC checks at API layer AND service layer
- Never expose internal IDs - use UUIDs
- Log all auth failures
Data Protection:
- Encrypt sensitive data at rest (AES-256)
- Use TLS 1.3 for all connections
- Never log PII or credentials
- Implement data retention policies
Secret Management
| Type | Storage | Rotation |
|---|---|---|
| API Keys | Secret Manager | 90 days |
| Database Creds | Secret Manager | 30 days |
| JWT Signing Keys | Secret Manager | 180 days |
| Service Accounts | IAM | 365 days |
Rules:
- NEVER commit secrets to git
- Use environment variables or Secret Manager
- Rotate compromised secrets immediately
- Audit secret access quarterly
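The rotation intervals in the table above can be audited with a simple date comparison. The sketch assumes the last-rotated date for each secret is available from some inventory; the key names in `ROTATION_DAYS` and the `rotation_due` helper are hypothetical.

```python
from datetime import date, timedelta

# Rotation intervals (days) from the Secret Management table.
ROTATION_DAYS = {
    "api_key": 90,
    "database_creds": 30,
    "jwt_signing_key": 180,
    "service_account": 365,
}


def rotation_due(secret_type: str, last_rotated: date, today: date) -> bool:
    """Return True once the secret has reached or passed its rotation interval."""
    return today - last_rotated >= timedelta(days=ROTATION_DAYS[secret_type])
```

A scheduled job could run this check daily and open a ticket for each secret that is due, which also leaves a record for the quarterly access audit.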
Vulnerability Handling
Discovery:
- Triaged within 24 hours
- Severity assigned (Critical/High/Medium/Low)
- Owner assigned
- Fix timeline established
SLA by Severity:
| Severity | Fix Timeline | Examples |
|---|---|---|
| Critical | 24 hours | Auth bypass, RCE |
| High | 7 days | SQL injection, XSS |
| Medium | 30 days | Info disclosure |
| Low | 90 days | Best practice issues |
Security Checklist
Before merging any PR:
- No hardcoded secrets
- Input validation implemented
- Auth checks in place
- Sensitive data not logged
- Dependencies scanned for CVEs
- SQL uses parameterized queries
Architecture Decision Records (ADRs)
What is an ADR?
Architecture Decision Records document significant technical decisions with context, rationale, and consequences.
When to Write an ADR
Write an ADR when:
- Choosing between technologies (e.g., Redis vs Memcached)
- Defining API contracts
- Establishing patterns or conventions
- Making trade-offs with long-term impact
- Deprecating existing approaches
ADR Template
# ADR-XXX: [Decision Title]
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-YYY]
## Context
[What is the issue? Why does it need a decision?]
## Decision
[What is the decision? Be specific.]
## Rationale
[Why this decision over alternatives?]
## Alternatives Considered
1. **Alternative A**: [Description] - Rejected because [reason]
2. **Alternative B**: [Description] - Rejected because [reason]
## Consequences
### Positive
- [Benefit 1]
- [Benefit 2]
### Negative
- [Trade-off 1]
- [Trade-off 2]
## References
- [Related docs, issues, discussions]
ADR Location
All ADRs are stored in docs/architecture/adr/:
- 001-stateful-shell-route-architecture.md
- 002-cloud-spanner-vs-postgres.md
- 003-rust-for-core-services.md
- 004-acp-ai-router-architecture.md
ADR Process
- Draft: Create PR with new ADR
- Review: Team discusses in PR comments
- Approve: 2+ senior engineers approve
- Merge: ADR becomes official record
- Supersede: Create new ADR referencing old one
Quick Reference
Common Commands
# Local development
make dev # Start all services
make test # Run tests
make lint # Run linters
make build # Build services
# Git
git checkout -b feat/epic-XXX-description
git commit -m "feat(scope): description [#XXX]"
git push origin feat/epic-XXX-description
# Deployment
./scripts/deploy.sh staging
./scripts/deploy.sh production
./scripts/rollback.sh production v2.4.0
Important URLs
| Service | URL |
|---|---|
| GitHub | github.com/OlympusCloud/olympus-cloud-gcp |
| CI/CD | GitHub Actions |
| Cockpit | cockpit.olympuscloud.ai |
| Logs | logs.olympuscloud.ai |
| Metrics | metrics.olympuscloud.ai |
Emergency Contacts
- On-Call Engineer: Via PagerDuty (auto-escalation enabled)
- Security Team: security@nebusai.com or #security-urgent in Slack
- Engineering Leadership: Page via PagerDuty "Engineering Leadership" escalation
- Executive Team: Contact via #exec-escalation (P1 incidents only)
INTERNAL - NebusAI Engineering Team Only