
NebusAI Engineering Team Handbook

Quick Summary (for RAG)

Internal handbook for the NebusAI engineering team. It covers codebase architecture (Rust, Go, Python, Flutter), development workflow (GitHub flow, PR reviews), code standards by language, testing requirements (unit, integration, E2E), the deployment pipeline (staging, canary, production), on-call rotation, incident response procedures, AI agent development guidelines, and the onboarding checklist. For internal NebusAI engineering use only.


Table of Contents

  1. Engineering Overview
  2. Codebase Architecture
  3. Development Workflow
  4. Code Standards
  5. Testing Requirements
  6. Deployment Pipeline
  7. AI Agent Development
  8. On-Call & Incidents
  9. New Engineer Onboarding
  10. Security Best Practices
  11. Architecture Decision Records (ADRs)

Engineering Overview

Team Structure

Team | Focus | Tech Stack
Platform | Core services, gating, tenancy | Rust
Commerce | POS, orders, payments, inventory | Rust
Gateway | API gateway, orchestration | Go
AI/ML | Agents, predictions, analytics | Python
Frontend | All Flutter shells | Flutter/Dart
Edge | Cloudflare workers, OlympusEdge | TypeScript, Rust
Infrastructure | GCP, CI/CD, monitoring | Terraform, Pulumi

Key Contacts

Role | Slack Handle | Contact Method
VP Engineering | @vp-eng | Check #eng-general channel topic
Platform Lead | @platform-lead | DM or #eng-platform
AI Lead | @ai-lead | DM or #eng-ai
Frontend Lead | @frontend-lead | DM or #eng-frontend
Note: See the NebusAI Org Chart for current team members.

Communication Channels

Channel | Purpose
#eng-general | General engineering
#eng-platform | Platform team
#eng-frontend | Flutter team
#eng-ai | AI/ML team
#eng-oncall | On-call coordination
#deployments | Deploy notifications
#incidents | Active incidents

Codebase Architecture

Repository Structure

olympus-cloud-gcp/
├── backend/
│   ├── rust/
│   │   ├── auth/        # Authentication service
│   │   ├── platform/    # Gating, tenancy, RBAC
│   │   └── commerce/    # Orders, payments, inventory
│   ├── go/
│   │   └── gateway/     # API gateway, GraphQL
│   └── python/
│       ├── ai/          # AI agents, LangGraph
│       └── analytics/   # Predictions, ML models
├── frontend/
│   └── flutter/
│       ├── packages/    # Shared packages
│       └── shells/      # 12 deployment shells
├── workers/             # Cloudflare Workers
├── edge/                # OlympusEdge code
├── infrastructure/      # Terraform, Pulumi
├── docs/                # Documentation
└── scripts/             # Build & deployment scripts

Service Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Edge (Cloudflare)                      │
│   Workers AI │ Vectorize │ AI Gateway │ Workers │ R2 │ D1   │
└─────────────────────────┬───────────────────────────────────┘

┌─────────────────────────▼───────────────────────────────────┐
│                   Go Gateway (Cloud Run)                    │
│        GraphQL │ REST │ WebSocket │ Auth Middleware         │
└─────────────────────────┬───────────────────────────────────┘

          ┌───────────────┴──────────────┐
          │                              │
┌─────────▼─────────┐          ┌─────────▼─────────┐
│   Rust Services   │          │  Python Services  │
│    (Cloud Run)    │          │    (Cloud Run)    │
├───────────────────┤          ├───────────────────┤
│ • Auth (8001)     │          │ • Analytics (8004)│
│ • Platform (8002) │◄────────►│ • ML (8005)       │
│ • Commerce (8003) │  Events  │ • AI Predictions  │
│ • Creator (8004)  │          └─────────┬─────────┘
│ • CMS (8005)      │                    │
│ • Chat (8007)     │                    │
│ • Alerting (8080) │                    │
└─────────┬─────────┘                    │
          │                              │
          └────────────┬─────────────────┘

┌──────────────────────▼──────────────────────────────────────┐
│                         Data Layer                          │
│     Cloud Spanner │ ClickHouse │ Redis │ Pub/Sub │ GCS      │
└─────────────────────────────────────────────────────────────┘

Key Documents

  • Architecture: docs/architecture/ARCHITECTURE.md
  • Master Plan: docs/plans/master-plan.md
  • Production Standards: docs/PRODUCTION-READY-STANDARDS.md

Development Workflow

GitHub Flow

  1. Create branch from develop

    git checkout develop
    git pull origin develop
    git checkout -b feat/epic-XXX-description
  2. Make changes with atomic commits

    git commit -m "feat(component): description [#XXX]"
  3. Push and create PR

    git push origin feat/epic-XXX-description
  4. Get review (2 approvals required)

  5. Merge via squash merge

Branch Naming

Type | Pattern | Example
Feature | feat/epic-XXX-description | feat/epic-944-ai-router
Bug Fix | fix/issue-XXX-description | fix/issue-123-login-error
Hotfix | hotfix/issue-XXX-description | hotfix/issue-456-payment-crash
Docs | docs/description | docs/update-api-reference
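
A check like the following could run in a pre-push hook or a CI step to enforce the naming patterns. It is a minimal sketch, not the team's actual tooling: the function name `classify_branch` and the character rules (digits for XXX, lowercase hyphenated descriptions) are assumptions inferred from the examples in the table.

```python
import re
from typing import Optional

# Assumed character rules: XXX is numeric, descriptions are lowercase kebab-case.
_SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"

BRANCH_PATTERNS = {
    "feature": re.compile(rf"feat/epic-\d+-{_SLUG}"),
    "bugfix": re.compile(rf"fix/issue-\d+-{_SLUG}"),
    "hotfix": re.compile(rf"hotfix/issue-\d+-{_SLUG}"),
    "docs": re.compile(rf"docs/{_SLUG}"),
}


def classify_branch(name: str) -> Optional[str]:
    """Return the branch type for a conforming name, or None otherwise."""
    for branch_type, pattern in BRANCH_PATTERNS.items():
        # fullmatch rejects names with extra leading or trailing text
        if pattern.fullmatch(name):
            return branch_type
    return None
```

Calling `classify_branch("feat/epic-944-ai-router")` yields `"feature"`, while a name such as `feature/new-thing` yields `None` and could be rejected by the hook.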

Commit Messages

Follow Conventional Commits:

<type>(<scope>): <description> [#issue]

[optional body]

[optional footer]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation
  • style: Formatting
  • refactor: Code refactoring
  • test: Adding tests
  • chore: Maintenance

Examples:

feat(gating): add canary deployment support [#944]
fix(commerce): resolve payment timeout issue [#123]
docs(api): update order endpoint documentation
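
The subject-line format can also be checked mechanically, for example in a commit-msg hook. The sketch below is illustrative only: `parse_subject` is a hypothetical helper, the scope is assumed to be lowercase, and the issue reference is treated as optional because the docs example above omits it.

```python
import re

# Commit types listed in this section.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# <type>(<scope>): <description> [#issue]
SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"\((?P<scope>[a-z0-9-]+)\): "
    r"(?P<description>.+?)"
    r"(?: \[#(?P<issue>\d+)\])?$"
)


def parse_subject(subject: str):
    """Return the parsed fields as a dict, or None if the subject is malformed."""
    match = SUBJECT_RE.match(subject)
    return match.groupdict() if match else None
```

For `feat(gating): add canary deployment support [#944]` the helper returns the type `feat`, scope `gating`, and issue `944`; a subject such as `Fixed stuff` returns `None`.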

Pull Request Requirements

PR Template:

## Description
Brief description of changes

## Related Issue
Closes #XXX

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] E2E tests pass (if applicable)

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Review Requirements:

  • 2 approvals required
  • All CI checks pass
  • No unresolved comments
  • Linked to GitHub issue

Code Standards

Rust Standards

Formatting:

cargo fmt
cargo clippy

Error Handling:

// Use Result with custom error types
pub type Result<T> = std::result::Result<T, ServiceError>;

// Use ? operator for propagation
pub async fn get_tenant(&self, id: &str) -> Result<Tenant> {
    let tenant = self.repo.find_by_id(id).await?;
    Ok(tenant)
}

Testing:

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_get_tenant_success() {
        // Arrange
        let repo = MockTenantRepo::new();
        let service = TenantService::new(repo);

        // Act
        let result = service.get_tenant("tenant-123").await;

        // Assert
        assert!(result.is_ok());
    }
}

Go Standards

Formatting:

go fmt ./...
golangci-lint run

Error Handling:

// Always handle errors explicitly
result, err := doSomething()
if err != nil {
	return fmt.Errorf("failed to do something: %w", err)
}

Testing:

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestGetTenant(t *testing.T) {
	// Arrange
	repo := NewMockRepo()
	service := NewTenantService(repo)

	// Act
	tenant, err := service.GetTenant("tenant-123")

	// Assert
	assert.NoError(t, err)
	assert.Equal(t, "tenant-123", tenant.ID)
}

Python Standards

Formatting:

black .
ruff check .
mypy .

Type Hints:

from typing import Optional

from pydantic import BaseModel

class Tenant(BaseModel):
    id: str

class TenantService:
    def __init__(self, repo) -> None:
        self.repo = repo

    async def get_tenant(self, tenant_id: str) -> Optional[Tenant]:
        """Get tenant by ID."""
        return await self.repo.find_by_id(tenant_id)

Testing:

import pytest
from unittest.mock import AsyncMock

@pytest.mark.asyncio
async def test_get_tenant_success():
    # Arrange
    repo = AsyncMock()
    repo.find_by_id.return_value = Tenant(id="tenant-123")
    service = TenantService(repo)

    # Act
    result = await service.get_tenant("tenant-123")

    # Assert
    assert result.id == "tenant-123"

Flutter/Dart Standards

Formatting:

dart format .
dart analyze

State Management (Riverpod):

import 'package:riverpod_annotation/riverpod_annotation.dart';

part 'tenant_provider.g.dart';

@riverpod
class TenantNotifier extends _$TenantNotifier {
  @override
  FutureOr<Tenant?> build(String tenantId) async {
    return await ref.read(tenantRepositoryProvider).findById(tenantId);
  }

  Future<void> refresh() async {
    state = const AsyncLoading();
    state = await AsyncValue.guard(
      // Use the build argument; state.value can be null for a missing tenant
      () => ref.read(tenantRepositoryProvider).findById(tenantId),
    );
  }
}

Testing Requirements

Coverage Requirements

Type | Rust | Go | Python | Flutter
Unit Tests | 80% | 80% | 80% | 70%
Integration | Required | Required | Required | -
E2E | Critical paths | Critical paths | AI flows | Shell flows
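
The unit-test thresholds lend themselves to an automated gate. The following is a minimal sketch of such a check, assuming coverage percentages have already been collected per language; the function name `coverage_failures` and the choice to treat a missing measurement as 0% are assumptions, not part of the existing pipeline.

```python
# Minimum unit-test coverage (percent) per language, from the table above.
COVERAGE_MINIMUMS = {"rust": 80.0, "go": 80.0, "python": 80.0, "flutter": 70.0}


def coverage_failures(measured: dict) -> dict:
    """Return {language: (measured, required)} for each language below its minimum.

    Languages missing from `measured` are treated as 0% covered (an assumption).
    """
    failures = {}
    for language, required in COVERAGE_MINIMUMS.items():
        actual = measured.get(language, 0.0)
        if actual < required:
            failures[language] = (actual, required)
    return failures
```

An empty result means every language meets its threshold; a CI job could fail the build whenever the returned dictionary is non-empty.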

Unit Tests

  • Test individual functions/methods
  • Mock dependencies
  • Run on every commit
  • Required for PR approval

Integration Tests

  • Test service interactions
  • Use test containers
  • Run on PR merge
  • Cover API contracts

E2E Tests

  • Test full user flows
  • Run in staging
  • Required for release
  • Cover critical paths:
    • Authentication
    • Order creation
    • Payment processing
    • AI agent interactions

Running Tests

# Rust
cargo test
cargo test --test integration

# Go
go test ./...
go test -tags=integration ./...

# Python
pytest
pytest -m integration

# Flutter
flutter test
flutter test integration_test/

Deployment Pipeline

Environments

Environment | Purpose | Deploy Trigger
dev | Development | PR branch push
staging | Pre-production | Merge to develop
production | Live | Merge to main
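
Expressed as code, the trigger rules in the table reduce to a small lookup. This sketch is only an illustration: `target_environment` and the event labels `"push"` and `"merge"` are hypothetical and do not correspond to any particular CI system's event names.

```python
from typing import Optional


def target_environment(event: str, branch: str) -> Optional[str]:
    """Map a Git event to the environment it deploys to, or None if no deploy occurs."""
    if event == "merge" and branch == "main":
        return "production"
    if event == "merge" and branch == "develop":
        return "staging"
    if event == "push" and branch not in ("main", "develop"):
        return "dev"  # push to a PR branch
    return None
```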

CI/CD Pipeline

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Lint   │───▶│  Test   │───▶│  Build  │───▶│ Deploy  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘

                              ┌───────────────────┼───────────────────┐
                              │                   │                   │
                              ▼                   ▼                   ▼
                          ┌───────┐          ┌─────────┐          ┌──────┐
                          │  Dev  │          │ Staging │          │ Prod │
                          └───────┘          └─────────┘          └──────┘

Deployment Commands

# Deploy to staging
./scripts/deploy.sh staging

# Deploy to production (requires approval)
./scripts/deploy.sh production

# Rollback
./scripts/rollback.sh production v2.4.0

Canary Deployments

For production releases:

  1. Start canary (1% traffic)
  2. Monitor metrics (30 minutes)
  3. Expand (5%, 25%, 50%)
  4. Full rollout (100%)

Automatic rollback if:

  • Error rate > 1%
  • P99 latency > 500ms
  • Success rate < 99%
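
The rollout steps and rollback thresholds above can be sketched as a small decision function. This is an illustration under stated assumptions rather than the deployment system's actual logic: `CanaryMetrics`, `should_rollback`, and `next_stage` are hypothetical names, and the 30-minute monitoring window is not modelled.

```python
from dataclasses import dataclass

# Traffic percentages from the rollout steps above.
CANARY_STAGES = (1, 5, 25, 50, 100)


@dataclass
class CanaryMetrics:
    error_rate_pct: float    # percentage of failed requests
    p99_latency_ms: float    # 99th percentile latency in milliseconds
    success_rate_pct: float  # percentage of successful requests


def should_rollback(m: CanaryMetrics) -> bool:
    """True if any automatic-rollback threshold is breached."""
    return (
        m.error_rate_pct > 1.0
        or m.p99_latency_ms > 500.0
        or m.success_rate_pct < 99.0
    )


def next_stage(current_pct: int, m: CanaryMetrics) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    if should_rollback(m):
        return 0
    idx = CANARY_STAGES.index(current_pct)
    # Stay at 100% once the full rollout is reached
    return CANARY_STAGES[min(idx + 1, len(CANARY_STAGES) - 1)]
```

The comparisons are strict, so metrics sitting exactly on a threshold (for example a P99 of 500 ms) do not trigger a rollback in this sketch.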

Release Process

  1. Create release branch from develop
  2. Version bump in version files
  3. QA sign-off in staging
  4. Create PR to main
  5. Release approval (2 approvals)
  6. Merge and deploy
  7. Tag release (v2.4.1)

AI Agent Development

Agent Architecture

┌─────────────────────────────────────────────────────────────┐
│                   LangGraph Orchestrator                    │
├─────────────────────────────────────────────────────────────┤
│    intent_router → planner → approval_checker → executor    │
└─────────────────────────────────────────────────────────────┘

Creating a New Agent

  1. Define agent in backend/python/app/agents/
  2. Create graph nodes:
    • Intent router
    • Planner
    • Tool executor
    • Response generator
  3. Add HITL checkpoints for sensitive actions
  4. Register in agent registry
  5. Configure model tiers

Agent Configuration

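# agents/inventory_agent.py
from typing import TypedDict

from langgraph.graph import StateGraph

class InventoryState(TypedDict, total=False):
    """Illustrative state schema; the real fields are not specified here."""
    inventory: dict

class InventoryAgent:
    def __init__(self):
        # StateGraph takes the state schema as its first argument
        self.graph = StateGraph(InventoryState)
        self.allowed_tiers = ["T2", "T3", "T4"]
        self.hitl_actions = ["order_inventory", "adjust_par_levels"]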

RAG Integration

from clients.vectorize_client import VectorizeClient

async def query_knowledge_base(query: str, index: str = "docs-rag"):
    client = VectorizeClient()
    results = await client.query(
        query=query,
        index=index,
        top_k=5,
        min_score=0.7,
    )
    return results

AI Testing

import pytest

@pytest.mark.asyncio
async def test_inventory_agent_low_stock():
    agent = InventoryAgent()

    # Simulate low stock scenario
    state = {"inventory": {"ground_beef": {"current": 10, "par": 50}}}

    result = await agent.process(state, "Check inventory levels")

    assert result.suggested_action == "order_ground_beef"
    assert result.requires_hitl is True

On-Call & Incidents

On-Call Rotation

  • Weekly rotation (Monday 9AM to Monday 9AM)
  • Primary + Secondary on-call
  • Escalation path: Primary → Secondary → Team Lead → VP Eng

On-Call Responsibilities

  1. Respond to alerts within 15 minutes
  2. Triage and assess impact
  3. Mitigate or escalate
  4. Document in incident channel
  5. Handoff to next engineer

Incident Severity

Severity | Impact | Response | Examples
P1 | Business down | Immediate | Full outage
P2 | Major feature broken | 1 hour | Payments failing
P3 | Feature degraded | 4 hours | Slow reports
P4 | Minor issue | 24 hours | UI glitch

Incident Response

  1. Acknowledge alert in PagerDuty
  2. Join #incidents channel
  3. Assess scope and impact
  4. Communicate status
  5. Mitigate - fix or rollback
  6. Resolve - confirm fix
  7. Post-mortem - document learnings

Runbooks

Key runbooks in Cockpit:

  • high-error-rate: Debug elevated errors
  • latency-spike: Investigate slow responses
  • database-issues: Spanner troubleshooting
  • ai-service-down: AI agent recovery

New Engineer Onboarding

Week 1: Setup & Learning

Day 1-2: Environment Setup

  • Get laptop and accounts
  • Clone repository
  • Configure GCP credentials for dev environment
  • Verify connectivity to dev.api.olympuscloud.ai
  • Review architecture docs

Day 3-5: Codebase Familiarity

  • Read ARCHITECTURE.md
  • Read master-plan.md
  • Explore service code
  • Review recent PRs
  • Complete first "good first issue"

Week 2: Deeper Dive

  • Pair with team member
  • Attend team standup
  • Fix a real bug
  • Write tests for existing code
  • Review a PR

Week 3-4: Independent Work

  • Own a small feature
  • Present at team meeting
  • Join on-call shadow rotation
  • Complete security training
  • Read incident post-mortems

Key Resources

Resource | Location
Architecture | docs/architecture/ARCHITECTURE.md
API Reference | docs/api/
Runbooks | Cockpit > Runbooks
Slack | #eng-general
Wiki | Notion/Confluence

Buddy System

Every new engineer gets a buddy:

  • Same team, 6+ months tenure
  • Daily sync for first 2 weeks
  • Weekly sync for month 2
  • Available for questions anytime

Security Best Practices

Secure Coding Guidelines

Input Validation:

  • Validate ALL user input at service boundaries
  • Use parameterized queries (no string concatenation)
  • Sanitize output to prevent XSS
  • Validate file uploads (type, size, content)
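
The parameterized-query rule is easiest to see in a runnable example. The sketch below uses Python's built-in `sqlite3` purely as a stand-in database (the handbook's services use Cloud Spanner); the `tenants` table and `find_tenant` helper are hypothetical.

```python
import sqlite3


def find_tenant(conn: sqlite3.Connection, tenant_id: str):
    """Look up a tenant with a bound parameter instead of string concatenation."""
    # The "?" placeholder makes the driver treat tenant_id strictly as data.
    cur = conn.execute("SELECT id, name FROM tenants WHERE id = ?", (tenant_id,))
    return cur.fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO tenants VALUES (?, ?)", ("tenant-123", "Demo Cafe"))

print(find_tenant(conn, "tenant-123"))             # ('tenant-123', 'Demo Cafe')
print(find_tenant(conn, "tenant-123' OR '1'='1"))  # None
```

The second call passes a classic injection payload. Because the value is bound rather than spliced into the SQL text, it is compared literally against `id` and matches nothing.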

Authentication & Authorization:

  • Use short-lived JWTs (15-minute access tokens, 7-day refresh tokens)
  • Implement RBAC checks at API layer AND service layer
  • Never expose internal IDs - use UUIDs
  • Log all auth failures

Data Protection:

  • Encrypt sensitive data at rest (AES-256)
  • Use TLS 1.3 for all connections
  • Never log PII or credentials
  • Implement data retention policies

Secret Management

Type | Storage | Rotation
API Keys | Secret Manager | 90 days
Database Creds | Secret Manager | 30 days
JWT Signing Keys | Secret Manager | 180 days
Service Accounts | IAM | 365 days

Rules:

  • NEVER commit secrets to git
  • Use environment variables or Secret Manager
  • Rotate compromised secrets immediately
  • Audit secret access quarterly
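
The rotation periods in the Secret Management table could be checked by a scheduled job. The following is a minimal sketch under assumptions: the dictionary keys and the `rotation_due` helper are hypothetical, and a secret is considered due on the day it reaches its rotation period.

```python
from datetime import date, timedelta

# Rotation periods in days, from the Secret Management table.
ROTATION_DAYS = {
    "api_key": 90,
    "database_creds": 30,
    "jwt_signing_key": 180,
    "service_account": 365,
}


def rotation_due(secret_type: str, last_rotated: date, today: date) -> bool:
    """True once the secret has reached or passed its rotation period."""
    return today - last_rotated >= timedelta(days=ROTATION_DAYS[secret_type])
```

For example, database credentials last rotated on 1 January are due on 31 January, exactly 30 days later.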

Vulnerability Handling

Discovery:

  1. Triaged within 24 hours
  2. Severity assigned (Critical/High/Medium/Low)
  3. Owner assigned
  4. Fix timeline established

SLA by Severity:

Severity | Fix Timeline | Examples
Critical | 24 hours | Auth bypass, RCE
High | 7 days | SQL injection, XSS
Medium | 30 days | Info disclosure
Low | 90 days | Best practice issues
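
A fix deadline follows directly from the SLA table. This sketch assumes the clock starts when the vulnerability is triaged, which the handbook does not state explicitly; `fix_deadline` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Fix timelines from the SLA table above.
FIX_SLA = {
    "critical": timedelta(hours=24),
    "high": timedelta(days=7),
    "medium": timedelta(days=30),
    "low": timedelta(days=90),
}


def fix_deadline(severity: str, triaged_at: datetime) -> datetime:
    """Return the latest time a fix should land for the given severity."""
    # Assumption: the SLA clock starts at triage time.
    return triaged_at + FIX_SLA[severity.lower()]
```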

Security Checklist

Before merging any PR:

  • No hardcoded secrets
  • Input validation implemented
  • Auth checks in place
  • Sensitive data not logged
  • Dependencies scanned for CVEs
  • SQL uses parameterized queries

Architecture Decision Records (ADRs)

What is an ADR?

Architecture Decision Records document significant technical decisions with context, rationale, and consequences.

When to Write an ADR

Write an ADR when:

  • Choosing between technologies (e.g., Redis vs Memcached)
  • Defining API contracts
  • Establishing patterns or conventions
  • Making trade-offs with long-term impact
  • Deprecating existing approaches

ADR Template

# ADR-XXX: [Decision Title]

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-YYY]

## Context
[What is the issue? Why does it need a decision?]

## Decision
[What is the decision? Be specific.]

## Rationale
[Why this decision over alternatives?]

## Alternatives Considered
1. **Alternative A**: [Description] - Rejected because [reason]
2. **Alternative B**: [Description] - Rejected because [reason]

## Consequences

### Positive
- [Benefit 1]
- [Benefit 2]

### Negative
- [Trade-off 1]
- [Trade-off 2]

## References
- [Related docs, issues, discussions]

ADR Location

All ADRs are stored in docs/architecture/adr/:

  • 001-stateful-shell-route-architecture.md
  • 002-cloud-spanner-vs-postgres.md
  • 003-rust-for-core-services.md
  • 004-acp-ai-router-architecture.md

ADR Process

  1. Draft: Create PR with new ADR
  2. Review: Team discusses in PR comments
  3. Approve: 2+ senior engineers approve
  4. Merge: ADR becomes official record
  5. Supersede: Create new ADR referencing old one

Quick Reference

Common Commands

# Local development
make dev # Start all services
make test # Run tests
make lint # Run linters
make build # Build services

# Git
git checkout -b feat/epic-XXX-description
git commit -m "feat(scope): description [#XXX]"
git push origin feat/epic-XXX-description

# Deployment
./scripts/deploy.sh staging
./scripts/deploy.sh production
./scripts/rollback.sh production v2.4.0

Important URLs

Service | URL
GitHub | github.com/OlympusCloud/olympus-cloud-gcp
CI/CD | GitHub Actions
Cockpit | cockpit.olympuscloud.ai
Logs | logs.olympuscloud.ai
Metrics | metrics.olympuscloud.ai

Emergency Contacts

  • On-Call Engineer: Via PagerDuty (auto-escalation enabled)
  • Security Team: security@nebusai.com or #security-urgent in Slack
  • Engineering Leadership: Page via PagerDuty "Engineering Leadership" escalation
  • Executive Team: Contact via #exec-escalation (for P1 incidents only)

INTERNAL - NebusAI Engineering Team Only