NebusAI Engineering Team Handbook

Quick Summary (for RAG)

Internal handbook for the NebusAI engineering team covering: codebase architecture (Rust, Go, Python, Flutter), development workflow (GitHub flow, PR reviews), code standards by language, testing requirements (unit, integration, E2E), deployment pipeline (staging, canary, production), on-call rotation, incident response procedures, AI agent development guidelines, and onboarding checklist. For internal NebusAI engineering use only.

Table of Contents

Engineering Overview
Codebase Architecture
Development Workflow
Code Standards
Testing Requirements
Deployment Pipeline
AI Agent Development
On-Call & Incidents
New Engineer Onboarding
Security Best Practices
Architecture Decision Records (ADRs)

Engineering Overview

Team Structure

Team	Focus	Tech Stack
Platform	Core services, gating, tenancy	Rust
Commerce	POS, orders, payments, inventory	Rust
Gateway	API gateway, orchestration	Go
AI/ML	Agents, predictions, analytics	Python
Frontend	All Flutter shells	Flutter/Dart
Edge	Cloudflare workers, OlympusEdge	TypeScript, Rust
Infrastructure	GCP, CI/CD, monitoring	Terraform, Pulumi

Key Contacts

Role	Slack Handle	Contact Method
VP Engineering	@vp-eng	Check #eng-general channel topic
Platform Lead	@platform-lead	DM or #eng-platform
AI Lead	@ai-lead	DM or #eng-ai
Frontend Lead	@frontend-lead	DM or #eng-frontend

Note: See the NebusAI Org Chart for current team members.

Communication Channels

Channel	Purpose
#eng-general	General engineering
#eng-platform	Platform team
#eng-frontend	Flutter team
#eng-ai	AI/ML team
#eng-oncall	On-call coordination
#deployments	Deploy notifications
#incidents	Active incidents

Codebase Architecture

Repository Structure

olympus-cloud-gcp/
├── backend/
│   ├── rust/
│   │   ├── auth/          # Authentication service
│   │   ├── platform/      # Gating, tenancy, RBAC
│   │   └── commerce/      # Orders, payments, inventory
│   ├── go/
│   │   └── gateway/       # API gateway, GraphQL
│   └── python/
│       ├── ai/            # AI agents, LangGraph
│       └── analytics/     # Predictions, ML models
├── frontend/
│   └── flutter/
│       ├── packages/      # Shared packages
│       └── shells/        # 12 deployment shells
├── workers/               # Cloudflare Workers
├── edge/                  # OlympusEdge code
├── infrastructure/        # Terraform, Pulumi
├── docs/                  # Documentation
└── scripts/               # Build & deployment scripts

Service Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Edge (Cloudflare)                        │
│  Workers AI │ Vectorize │ AI Gateway │ Workers │ R2 │ D1    │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                   Go Gateway (Cloud Run)                     │
│  GraphQL │ REST │ WebSocket │ Auth Middleware               │
└─────────────────────────┬───────────────────────────────────┘
                          │
          ┌───────────────┴───────────────┐
          │                               │
┌─────────▼─────────┐          ┌─────────▼─────────┐
│   Rust Services   │          │  Python Services  │
│   (Cloud Run)     │          │   (Cloud Run)     │
├───────────────────┤          ├───────────────────┤
│ • Auth (8001)     │          │ • Analytics (8004)│
│ • Platform (8002) │◄────────►│ • ML (8005)       │
│ • Commerce (8003) │  Events  │ • AI Predictions  │
│ • Creator (8004)  │          └─────────┬─────────┘
│ • CMS (8005)      │                    │
│ • Chat (8007)     │                    │
│ • Alerting (8080) │                    │
└─────────┬─────────┘                    │
          │                              │
          └────────────┬─────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│                    Data Layer                                │
│  Cloud Spanner │ ClickHouse │ Redis │ Pub/Sub │ GCS         │
└─────────────────────────────────────────────────────────────┘

Key Documents

Architecture: docs/architecture/ARCHITECTURE.md
Master Plan: docs/plans/master-plan.md
Production Standards: docs/PRODUCTION-READY-STANDARDS.md

Development Workflow

GitHub Flow

Create branch from develop

git checkout develop
git pull origin develop
git checkout -b feat/epic-XXX-description

Make changes with atomic commits

git commit -m "feat(component): description [#XXX]"

Push and create PR

git push origin feat/epic-XXX-description

Get review (2 approvals required)
Merge via squash merge

Branch Naming

Type	Pattern	Example
Feature	`feat/epic-XXX-description`	`feat/epic-944-ai-router`
Bug Fix	`fix/issue-XXX-description`	`fix/issue-123-login-error`
Hotfix	`hotfix/issue-XXX-description`	`hotfix/issue-456-payment-crash`
Docs	`docs/description`	`docs/update-api-reference`
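
The naming table above can be checked mechanically, for example in a pre-push hook. The following is a minimal sketch, assuming lowercase hyphen-separated descriptions and numeric epic/issue IDs; the regular expressions and the `classify_branch` helper are illustrative and not part of any existing NebusAI tooling.

```python
import re
from typing import Optional

# Description slug: lowercase words or digits separated by single hyphens (assumed).
_SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# One pattern per row of the Branch Naming table.
BRANCH_PATTERNS = {
    "feature": re.compile(rf"^feat/epic-\d+-{_SLUG}$"),
    "bugfix": re.compile(rf"^fix/issue-\d+-{_SLUG}$"),
    "hotfix": re.compile(rf"^hotfix/issue-\d+-{_SLUG}$"),
    "docs": re.compile(rf"^docs/{_SLUG}$"),
}


def classify_branch(name: str) -> Optional[str]:
    """Return the branch type if the name matches a known pattern, else None."""
    for branch_type, pattern in BRANCH_PATTERNS.items():
        if pattern.match(name):
            return branch_type
    return None
```

Calling `classify_branch("feat/epic-944-ai-router")` returns `"feature"`, while a name outside the table such as `feature/new-thing` returns `None`.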

Commit Messages

Follow Conventional Commits:

<type>(<scope>): <description> [#issue]

[optional body]

[optional footer]

Types:

feat: New feature
fix: Bug fix
docs: Documentation
style: Formatting
refactor: Code refactoring
test: Adding tests
chore: Maintenance

Examples:

feat(gating): add canary deployment support [#944]
fix(commerce): resolve payment timeout issue [#123]
docs(api): update order endpoint documentation
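
A commit header in this format can be validated with a single regular expression. The sketch below treats the `[#issue]` suffix as optional, since the third example above omits it; the scope character set and the `parse_commit_header` name are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Types listed in the Commit Messages section.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"\((?P<scope>[a-z0-9-]+)\): "      # scope charset is an assumption
    r"(?P<description>.+?)"
    r"(?: \[#(?P<issue>\d+)\])?$"       # issue reference is optional
)


def parse_commit_header(header: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a commit header into its parts, or return None if it is malformed."""
    match = HEADER_RE.match(header)
    return match.groupdict() if match else None
```

A header without a type and scope, such as `update stuff`, yields `None`, which a commit-msg hook could turn into a rejection.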

Pull Request Requirements

PR Template:

## Description
Brief description of changes

## Related Issue
Closes #XXX

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] E2E tests pass (if applicable)

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Review Requirements:

2 approvals required
All CI checks pass
No unresolved comments
Linked to GitHub issue

Code Standards

Rust Standards

Formatting:

cargo fmt
cargo clippy

Error Handling:

// Use Result with custom error types
pub type Result<T> = std::result::Result<T, ServiceError>;

// Use ? operator for propagation
pub async fn get_tenant(&self, id: &str) -> Result<Tenant> {
    let tenant = self.repo.find_by_id(id).await?;
    Ok(tenant)
}

Testing:

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_get_tenant_success() {
        // Arrange
        let repo = MockTenantRepo::new();
        let service = TenantService::new(repo);

        // Act
        let result = service.get_tenant("tenant-123").await;

        // Assert
        assert!(result.is_ok());
    }
}

Go Standards

Formatting:

go fmt ./...
golangci-lint run

Error Handling:

// Always handle errors explicitly
result, err := doSomething()
if err != nil {
    return fmt.Errorf("failed to do something: %w", err)
}

Testing:

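import (
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestGetTenant(t *testing.T) {
    // Arrange
    repo := NewMockRepo()
    service := NewTenantService(repo)

    // Act
    tenant, err := service.GetTenant("tenant-123")

    // Assert: stop on error so tenant is never dereferenced when nil
    require.NoError(t, err)
    assert.Equal(t, "tenant-123", tenant.ID)
}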

Python Standards

Formatting:

black .
ruff check .
mypy .

Type Hints:

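from typing import Optional
from pydantic import BaseModel

class Tenant(BaseModel):
    id: str

class TenantService:
    def __init__(self, repo):
        # repo: any object with an async find_by_id(tenant_id) method
        self.repo = repo

    async def get_tenant(self, tenant_id: str) -> Optional[Tenant]:
        """Get tenant by ID."""
        return await self.repo.find_by_id(tenant_id)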

Testing:

import pytest
from unittest.mock import AsyncMock

@pytest.mark.asyncio
async def test_get_tenant_success():
    # Arrange
    repo = AsyncMock()
    repo.find_by_id.return_value = Tenant(id="tenant-123")
    service = TenantService(repo)

    # Act
    result = await service.get_tenant("tenant-123")

    # Assert
    assert result.id == "tenant-123"

Flutter/Dart Standards

Formatting:

dart format .
dart analyze

State Management (Riverpod):

import 'package:riverpod_annotation/riverpod_annotation.dart';

part 'tenant_provider.g.dart';

@riverpod
class TenantNotifier extends _$TenantNotifier {
  @override
  FutureOr<Tenant?> build(String tenantId) async {
    return await ref.read(tenantRepositoryProvider).findById(tenantId);
  }

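  Future<void> refresh() async {
    state = const AsyncLoading();
    // Use the family argument directly; state.value is nullable and may be
    // null while loading or when no tenant was found.
    state = await AsyncValue.guard(
      () => ref.read(tenantRepositoryProvider).findById(tenantId),
    );
  }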
}

Testing Requirements

Coverage Requirements

Type	Rust	Go	Python	Flutter
Unit Tests	80%	80%	80%	70%
Integration	Required	Required	Required	-
E2E	Critical paths	Critical paths	AI flows	Shell flows
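
The unit-test thresholds in the table could be enforced by a small CI gate. This is a sketch only: the `coverage_failures` helper and the shape of its input (a mapping from language to measured percentage) are assumptions, and how each toolchain reports coverage is left out.

```python
from typing import Dict

# Minimum unit-test coverage per language, from the Coverage Requirements table.
UNIT_COVERAGE_MIN = {"rust": 80.0, "go": 80.0, "python": 80.0, "flutter": 70.0}


def coverage_failures(measured: Dict[str, float]) -> Dict[str, float]:
    """Return {language: shortfall in percentage points} for languages below target.

    Languages missing from `measured` are skipped rather than treated as failures.
    """
    failures = {}
    for language, minimum in UNIT_COVERAGE_MIN.items():
        actual = measured.get(language)
        if actual is not None and actual < minimum:
            failures[language] = round(minimum - actual, 2)
    return failures
```

An empty result means every reported language meets its target; a CI step could fail the build whenever the dictionary is non-empty.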

Unit Tests

Test individual functions/methods
Mock dependencies
Run on every commit
Required for PR approval

Integration Tests

Test service interactions
Use test containers
Run on PR merge
Cover API contracts

E2E Tests

Test full user flows
Run in staging
Required for release
Cover critical paths:
- Authentication
- Order creation
- Payment processing
- AI agent interactions

Running Tests

# Rust
cargo test
cargo test --test integration

# Go
go test ./...
go test -tags=integration ./...

# Python
pytest
pytest -m integration

# Flutter
flutter test
flutter test integration_test/

Deployment Pipeline

Environments

Environment	Purpose	Deploy Trigger
`dev`	Development	PR branch push
`staging`	Pre-production	Merge to `develop`
`production`	Live	Merge to `main`

CI/CD Pipeline

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Lint   │───▶│  Test   │───▶│  Build  │───▶│ Deploy  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
                                                  │
                              ┌───────────────────┼───────────────────┐
                              │                   │                   │
                              ▼                   ▼                   ▼
                          ┌───────┐          ┌─────────┐         ┌──────┐
                          │  Dev  │          │ Staging │         │ Prod │
                          └───────┘          └─────────┘         └──────┘

Deployment Commands

# Deploy to staging
./scripts/deploy.sh staging

# Deploy to production (requires approval)
./scripts/deploy.sh production

# Rollback
./scripts/rollback.sh production v2.4.0

Canary Deployments

For production releases:

Start canary (1% traffic)
Monitor metrics (30 minutes)
Expand (5%, 25%, 50%)
Full rollout (100%)

Automatic rollback if:

Error rate > 1%
P99 latency > 500ms
Success rate < 99%
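
The traffic stages and rollback thresholds above can be expressed as a small decision function. This is a minimal sketch, not the actual deployment controller: `CanaryMetrics`, `should_rollback`, and `next_stage` are hypothetical names, returning 0 to signal rollback is an arbitrary convention, and the 30-minute monitoring window is not modelled.

```python
from dataclasses import dataclass

# Traffic percentages from the Canary Deployments steps.
CANARY_STAGES = (1, 5, 25, 50, 100)


@dataclass(frozen=True)
class CanaryMetrics:
    error_rate_pct: float
    p99_latency_ms: float
    success_rate_pct: float


def should_rollback(metrics: CanaryMetrics) -> bool:
    """Apply the three automatic-rollback conditions (strict inequalities)."""
    return (
        metrics.error_rate_pct > 1.0
        or metrics.p99_latency_ms > 500.0
        or metrics.success_rate_pct < 99.0
    )


def next_stage(current_pct: int, metrics: CanaryMetrics) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    if should_rollback(metrics):
        return 0
    index = CANARY_STAGES.index(current_pct)
    # Stay at 100% once the rollout is complete.
    return CANARY_STAGES[min(index + 1, len(CANARY_STAGES) - 1)]
```

Because the thresholds are strict, a canary sitting exactly at 1% errors, 500ms P99, and 99% success is still promoted in this sketch.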

Release Process

Create release branch from develop
Version bump in version files
QA sign-off in staging
Create PR to main
Release approval (2 approvals)
Merge and deploy
Tag release (v2.4.1)

AI Agent Development

Agent Architecture

┌─────────────────────────────────────────────────────────────┐
│                    LangGraph Orchestrator                    │
├─────────────────────────────────────────────────────────────┤
│  intent_router → planner → approval_checker → executor      │
└─────────────────────────────────────────────────────────────┘

Creating a New Agent

Define agent in backend/python/app/agents/
Create graph nodes:
- Intent router
- Planner
- Tool executor
- Response generator
Add HITL checkpoints for sensitive actions
Register in agent registry
Configure model tiers
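
To make the node sequence and the HITL checkpoint concrete, the flow can be sketched as a chain of plain functions over a shared state dictionary. This deliberately avoids LangGraph so it runs on its own; the routing rule, the plan contents, the `approved` flag, and the `run` helper are all hypothetical, and the response generator node is omitted for brevity.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

# Actions that need human sign-off (illustrative values).
HITL_ACTIONS = {"order_inventory", "adjust_par_levels"}


def intent_router(state: State) -> State:
    # Toy rule; a real router would call a model.
    text = str(state["message"]).lower()
    state["intent"] = "inventory" if "inventory" in text else "general"
    return state


def planner(state: State) -> State:
    if state["intent"] == "inventory":
        state["plan"] = ["order_inventory"]
    else:
        state["plan"] = ["answer_question"]
    return state


def approval_checker(state: State) -> State:
    # Flag the run if any planned action is on the HITL list.
    state["requires_hitl"] = any(a in HITL_ACTIONS for a in state["plan"])
    return state


def executor(state: State) -> State:
    # Sensitive plans wait until a human has approved them.
    if state["requires_hitl"] and not state.get("approved", False):
        state["status"] = "pending_approval"
    else:
        state["status"] = "executed"
    return state


PIPELINE: List[Callable[[State], State]] = [
    intent_router, planner, approval_checker, executor,
]


def run(message: str, approved: bool = False) -> State:
    """Pass the state through each node in order."""
    state: State = {"message": message, "approved": approved}
    for node in PIPELINE:
        state = node(state)
    return state
```

In a LangGraph implementation each function would become a graph node and the approval check would typically be an interrupt or conditional edge rather than a status flag.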

Agent Configuration

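# agents/inventory_agent.py
from typing import TypedDict

from langgraph.graph import StateGraph


class InventoryState(TypedDict):
    # Illustrative schema; replace with the agent's real state fields
    inventory: dict


class InventoryAgent:
    def __init__(self):
        # StateGraph must be constructed with a state schema
        self.graph = StateGraph(InventoryState)
        self.allowed_tiers = ["T2", "T3", "T4"]
        self.hitl_actions = ["order_inventory", "adjust_par_levels"]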

RAG Integration

from clients.vectorize_client import VectorizeClient

async def query_knowledge_base(query: str, index: str = "docs-rag"):
    client = VectorizeClient()
    results = await client.query(
        query=query,
        index=index,
        top_k=5,
        min_score=0.7
    )
    return results

AI Testing

import pytest

@pytest.mark.asyncio
async def test_inventory_agent_low_stock():
    agent = InventoryAgent()

    # Simulate low stock scenario: current level well below par
    state = {"inventory": {"ground_beef": {"current": 10, "par": 50}}}

    result = await agent.process(state, "Check inventory levels")

    assert result.suggested_action == "order_ground_beef"
    # Ordering inventory is a HITL action, so approval must be requested
    assert result.requires_hitl is True

On-Call & Incidents

On-Call Rotation

Weekly rotation (Monday 9AM to Monday 9AM)
Primary + Secondary on-call
Escalation path: Primary → Secondary → Team Lead → VP Eng
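
The rotation boundary and escalation order can be computed directly. The sketch below assumes naive datetimes in the team's local time zone; `shift_start` and `next_escalation` are hypothetical helpers, and PagerDuty remains the source of truth for the actual schedule.

```python
from datetime import datetime, timedelta
from typing import Optional

# Escalation order from the On-Call Rotation section.
ESCALATION_PATH = ("Primary", "Secondary", "Team Lead", "VP Eng")


def shift_start(now: datetime) -> datetime:
    """Return the Monday 09:00 that began the shift containing `now`."""
    monday = (now - timedelta(days=now.weekday())).replace(
        hour=9, minute=0, second=0, microsecond=0
    )
    # Before 09:00 on a Monday the previous week's shift is still active.
    if now < monday:
        monday -= timedelta(days=7)
    return monday


def next_escalation(current: str) -> Optional[str]:
    """Return the next role in the escalation path, or None at the top."""
    index = ESCALATION_PATH.index(current)
    if index + 1 < len(ESCALATION_PATH):
        return ESCALATION_PATH[index + 1]
    return None
```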

On-Call Responsibilities

Respond to alerts within 15 minutes
Triage and assess impact
Mitigate or escalate
Document in incident channel
Handoff to next engineer

Incident Severity

Severity	Impact	Response	Examples
P1	Business down	Immediate	Full outage
P2	Major feature broken	1 hour	Payments failing
P3	Feature degraded	4 hours	Slow reports
P4	Minor issue	24 hours	UI glitch
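
The response targets in the table map naturally to a lookup that alerting or reporting scripts could share. This is a sketch under one stated assumption: "Immediate" for P1 is modelled as a zero-length window. The `response_deadline` and `is_overdue` names are illustrative.

```python
from datetime import datetime, timedelta

# Response windows from the Incident Severity table.
# P1 "Immediate" is represented as zero delay (assumption).
RESPONSE_WINDOW = {
    "P1": timedelta(0),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Return the latest time by which the incident should have a response."""
    return opened_at + RESPONSE_WINDOW[severity]


def is_overdue(severity: str, opened_at: datetime, now: datetime) -> bool:
    """True when `now` is past the response deadline."""
    return now > response_deadline(severity, opened_at)
```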

Incident Response

Acknowledge alert in PagerDuty
Join #incidents channel
Assess scope and impact
Communicate status
Mitigate - fix or rollback
Resolve - confirm fix
Post-mortem - document learnings

Runbooks

Key runbooks in Cockpit:

high-error-rate: Debug elevated errors
latency-spike: Investigate slow responses
database-issues: Spanner troubleshooting
ai-service-down: AI agent recovery

New Engineer Onboarding

Week 1: Setup & Learning

Day 1-2: Environment Setup

Get laptop and accounts
Clone repository
Configure GCP credentials for dev environment
Verify connectivity to dev.api.olympuscloud.ai
Review architecture docs

Day 3-5: Codebase Familiarity

Week 2: Deeper Dive

Week 3-4: Independent Work

Key Resources

Resource	Location
Architecture	`docs/architecture/ARCHITECTURE.md`
API Reference	`docs/api/`
Runbooks	Cockpit > Runbooks
Slack	#eng-general
Wiki	Notion/Confluence

Buddy System

Every new engineer gets a buddy:

Same team, 6+ months tenure
Daily sync for first 2 weeks
Weekly sync for month 2
Available for questions anytime

Security Best Practices

Secure Coding Guidelines

Input Validation:

Validate ALL user input at service boundaries
Use parameterized queries (no string concatenation)
Sanitize output to prevent XSS
Validate file uploads (type, size, content)
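
The parameterized-query rule is easiest to see in code. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database (the production data layer is Cloud Spanner, whose client has its own parameter syntax); the `tenants` table and `find_tenant_name` function are invented for the example.

```python
import sqlite3
from typing import Optional


def find_tenant_name(conn: sqlite3.Connection, tenant_id: str) -> Optional[str]:
    """Look up a tenant name using a bound parameter, never string concatenation."""
    # The ? placeholder makes the driver bind tenant_id as data,
    # so quotes or SQL keywords inside it cannot change the query.
    row = conn.execute(
        "SELECT name FROM tenants WHERE id = ?", (tenant_id,)
    ).fetchone()
    return row[0] if row else None


# In-memory database with one sample row for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO tenants VALUES (?, ?)", ("tenant-123", "Acme Diner"))
```

With this approach a classic injection payload such as `x' OR '1'='1` is compared literally against the `id` column and simply matches nothing.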

Authentication & Authorization:

Use JWT tokens with short expiration (15 min access, 7 day refresh)
Implement RBAC checks at API layer AND service layer
Never expose internal IDs - use UUIDs
Log all auth failures

Data Protection:

Encrypt sensitive data at rest (AES-256)
Use TLS 1.3 for all connections
Never log PII or credentials
Implement data retention policies

Secret Management

Type	Storage	Rotation
API Keys	Secret Manager	90 days
Database Creds	Secret Manager	30 days
JWT Signing Keys	Secret Manager	180 days
Service Accounts	IAM	365 days
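
The rotation intervals above lend themselves to an automated check. This is a minimal sketch: the dictionary keys and the `rotation_due` helper are illustrative names, and reading the last-rotated date from Secret Manager or IAM is left out.

```python
from datetime import date, timedelta

# Rotation intervals in days, from the Secret Management table.
ROTATION_DAYS = {
    "api_key": 90,
    "database_creds": 30,
    "jwt_signing_key": 180,
    "service_account": 365,
}


def rotation_due(secret_type: str, last_rotated: date, today: date) -> bool:
    """True once the secret has reached or passed its rotation interval."""
    return today - last_rotated >= timedelta(days=ROTATION_DAYS[secret_type])
```

A scheduled job could run this over every secret and open a ticket for each one that is due.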

Rules:

NEVER commit secrets to git
Use environment variables or Secret Manager
Rotate compromised secrets immediately
Audit secret access quarterly

Vulnerability Handling

Discovery:

Triaged within 24 hours
Severity assigned (Critical/High/Medium/Low)
Owner assigned
Fix timeline established

SLA by Severity:

Severity	Fix Timeline	Examples
Critical	24 hours	Auth bypass, RCE
High	7 days	SQL injection, XSS
Medium	30 days	Info disclosure
Low	90 days	Best practice issues

Security Checklist

Before merging any PR:

Architecture Decision Records (ADRs)

What is an ADR?

Architecture Decision Records document significant technical decisions with context, rationale, and consequences.

When to Write an ADR

Write an ADR when:

Choosing between technologies (e.g., Redis vs Memcached)
Defining API contracts
Establishing patterns or conventions
Making trade-offs with long-term impact
Deprecating existing approaches

ADR Template

# ADR-XXX: [Decision Title]

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-YYY]

## Context
[What is the issue? Why does it need a decision?]

## Decision
[What is the decision? Be specific.]

## Rationale
[Why this decision over alternatives?]

## Alternatives Considered
1. **Alternative A**: [Description] - Rejected because [reason]
2. **Alternative B**: [Description] - Rejected because [reason]

## Consequences

### Positive
- [Benefit 1]
- [Benefit 2]

### Negative
- [Trade-off 1]
- [Trade-off 2]

## References
- [Related docs, issues, discussions]

ADR Location

All ADRs are stored in docs/architecture/adr/:

001-stateful-shell-route-architecture.md
002-cloud-spanner-vs-postgres.md
003-rust-for-core-services.md
004-acp-ai-router-architecture.md
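
The file names follow a three-digit sequence number plus a hyphenated title. A helper like the one below could propose the next file name when drafting an ADR; it is a sketch that assumes this numbering convention continues, and `next_adr_filename` is a hypothetical name.

```python
import re
from typing import List


def next_adr_filename(existing: List[str], title: str) -> str:
    """Build the next ADR file name from existing names and a human-readable title."""
    # Collect the leading three-digit numbers of existing ADR files.
    numbers = [
        int(match.group(1))
        for name in existing
        if (match := re.match(r"^(\d{3})-", name))
    ]
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{max(numbers, default=0) + 1:03d}-{slug}.md"
```

Given the four files listed above and the title "Redis vs Memcached", this returns `005-redis-vs-memcached.md`.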

ADR Process

Draft: Create PR with new ADR
Review: Team discusses in PR comments
Approve: 2+ senior engineers approve
Merge: ADR becomes official record
Supersede: Create new ADR referencing old one

Quick Reference

Common Commands

# Local development
make dev           # Start all services
make test          # Run tests
make lint          # Run linters
make build         # Build services

# Git
git checkout -b feat/epic-XXX-description
git commit -m "feat(scope): description [#XXX]"
git push origin feat/epic-XXX-description

# Deployment
./scripts/deploy.sh staging
./scripts/deploy.sh production
./scripts/rollback.sh production v2.4.0

Important URLs

Service	URL
GitHub	github.com/OlympusCloud/olympus-cloud-gcp
CI/CD	GitHub Actions
Cockpit	cockpit.olympuscloud.ai
Logs	logs.olympuscloud.ai
Metrics	metrics.olympuscloud.ai

Emergency Contacts

On-Call Engineer: Via PagerDuty (auto-escalation enabled)
Security Team: security@nebusai.com or #security-urgent in Slack
Engineering Leadership: Page via PagerDuty "Engineering Leadership" escalation
Executive Team: Contact via #exec-escalation (for P1 incidents only)

INTERNAL - NebusAI Engineering Team Only
