
RAG Indexing

How documents are processed, chunked, embedded, and stored in Vectorize indexes for RAG retrieval.

RAG Architecture

The Olympus Cloud RAG system uses a hybrid architecture:

┌─────────────────────────────────────────────────────────────────────────┐
│ RAG Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Document │ │ Chunking │ │ Embedding │ │
│ │ Ingestion │───▶│ Pipeline │───▶│ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Cloudflare Vectorize │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │menu-rag│ │support │ │sales-kb│ │ops-kb │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Query Router │ │
│ │ score_threshold → top_k → re-ranking → response │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Vectorize vs Vertex AI

| Feature | Cloudflare Vectorize | GCP Vertex AI |
|---|---|---|
| Latency | Under 20ms (edge) | 50-100ms |
| Cost | Included with Workers | Per query |
| Scaling | Automatic | Manual configuration |
| Max Dimensions | 1536 | 3072 |
| Max Vectors | 5M per index | Unlimited |
| Best For | Real-time queries | Large-scale analytics |

Recommendation: Use Vectorize for real-time agent queries, Vertex AI for batch processing and training.


Vectorize Index Management

Available Indexes

| Index Name | Content | Agents | Update Frequency |
|---|---|---|---|
| menu-rag | Menu items, ingredients, allergens | Menu Assistant, Voice AI | Real-time |
| support-rag | FAQs, troubleshooting, docs | Support Agent | Daily |
| sales-rag | Pricing, ROI, competitors | Minerva | Weekly |
| ops-rag | Runbooks, monitoring, alerts | Maximus | On change |
| training-rag | Internal docs, procedures | All internal agents | Weekly |
| policy-rag | Business rules, compliance | Scheduling, Analytics | On change |

Index Configuration

```javascript
// Create a new index
const index = await vectorize.createIndex({
  name: 'support-rag',
  dimensions: 768, // BGE-base dimensions
  metric: 'cosine',
  metadata_fields: {
    doc_type: 'String',
    category: 'String',
    tenant_id: 'String',
    updated_at: 'Number',
  },
});
```

Metadata Schema

| Field | Type | Purpose |
|---|---|---|
| doc_type | String | Content classification (faq, guide, runbook) |
| category | String | Content category (orders, payments, scheduling) |
| tenant_id | String | Tenant isolation for multi-tenant queries |
| updated_at | Number | Timestamp for freshness filtering |
| language | String | Content language (en, es, fr) |
| audience | String | Target audience (staff, manager, customer) |

Document Ingestion Pipeline

Supported Formats

| Format | Processing | Chunking Strategy |
|---|---|---|
| Markdown | Native | Header-based semantic |
| PDF | PyMuPDF extraction | Page + paragraph |
| HTML | BeautifulSoup | Section-based |
| Video | Whisper transcription | Time-segment |
| FAQ JSON | Direct import | Q&A pairs |
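
In the pipeline, the format-to-strategy mapping above amounts to a dispatch on file type. A sketch under assumed names (the strategy strings and extension list are illustrative, not the pipeline's actual identifiers):

```python
# Dispatch table mirroring the format/strategy mapping above.
CHUNKERS = {
    ".md": "header_semantic",
    ".pdf": "page_paragraph",
    ".html": "section",
    ".mp4": "time_segment",
    ".json": "qa_pairs",
}

def select_chunker(filename: str) -> str:
    """Return the chunking strategy name for a file, defaulting to fixed-size
    chunking for formats with no structural cues."""
    for suffix, strategy in CHUNKERS.items():
        if filename.lower().endswith(suffix):
            return strategy
    return "fixed"
```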

Ingestion Workflow

```yaml
# GitHub Actions workflow for doc ingestion
name: RAG Document Ingestion
on:
  push:
    paths:
      - 'documentation/**/*.md'
      - 'docs/**/*.md'

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Process documents
        run: |
          python scripts/rag/process_docs.py \
            --source documentation/ \
            --index support-rag \
            --chunk-strategy semantic

      - name: Upload to Vectorize
        run: |
          python scripts/rag/upload_vectors.py \
            --index support-rag \
            --vectors output/vectors.json
```
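
Because this workflow re-runs on every push to the doc paths, vector IDs need to be deterministic so a re-ingested chunk upserts over its stale predecessor rather than accumulating duplicates. One way to get that (a sketch; the actual ID scheme used by `upload_vectors.py` is not specified here):

```python
import hashlib

def chunk_id(source_path: str, chunk_index: int) -> str:
    """Deterministic vector ID: the same document position always maps to
    the same ID, so re-running ingestion overwrites in place."""
    digest = hashlib.sha256(f"{source_path}#{chunk_index}".encode()).hexdigest()
    return digest[:32]
```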

Chunking Strategies

Strategy Comparison

| Strategy | Best For | Chunk Size | Overlap |
|---|---|---|---|
| Semantic | Documentation | Variable | 50 tokens |
| Fixed | Code, logs | 500 tokens | 100 tokens |
| Paragraph | Articles | Variable | 1 sentence |
| Q&A | FAQs | Question + Answer | None |
Semantic Chunking Implementation

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    content: str
    headers: list[str]
    metadata: dict

def semantic_chunk(markdown_content: str) -> list[Chunk]:
    """Split by headers, respecting document structure."""
    chunks = []
    current_chunk = []
    current_headers = []

    for line in markdown_content.split('\n'):
        if line.startswith('#'):
            # Save the previous chunk before starting a new section
            if current_chunk:
                chunks.append(Chunk(
                    content='\n'.join(current_chunk),
                    headers=current_headers.copy(),
                    metadata={'type': 'section'}
                ))
            # Start a new chunk, tracking the header hierarchy
            level = len(line.split()[0])  # number of leading '#' characters
            current_headers = current_headers[:level - 1] + [line]
            current_chunk = [line]
        else:
            current_chunk.append(line)

    # Don't drop the final section
    if current_chunk:
        chunks.append(Chunk(
            content='\n'.join(current_chunk),
            headers=current_headers.copy(),
            metadata={'type': 'section'}
        ))

    return chunks
```
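
The fixed strategy from the comparison table (500-token chunks, 100-token overlap) can be sketched the same way. This version approximates model tokens with whitespace-split words, which is a simplification; a production pipeline would use the embedding model's tokenizer:

```python
def fixed_chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunking with overlap, using whitespace tokens as a
    stand-in for model tokens. Defaults match the strategy table above."""
    tokens = text.split()
    if not tokens:
        return []
    step = size - overlap  # each window starts `overlap` tokens before the previous one ends
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks
```

The overlap ensures a sentence falling on a chunk boundary is fully contained in at least one chunk, at the cost of ~25% more vectors at the default settings.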

Content-Type Specific Strategies

| Content Type | Strategy | Rationale |
|---|---|---|
| API docs | Endpoint-based | One chunk per endpoint |
| Runbooks | Step-based | One chunk per procedure |
| FAQs | Q&A pairs | Question + answer together |
| Tutorials | Section-based | Logical learning units |
| Reference | Term-based | Definition + examples |
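
For the FAQ row, "Q&A pairs" means each question is embedded together with its answer so a query matching the question retrieves the answer in the same chunk. A sketch, assuming the FAQ JSON export is a list of records with `question`, `answer`, and optional `category` keys (the record shape is an assumption, not a documented schema):

```python
import json

def qa_chunks(faq_json: str) -> list[dict]:
    """Turn a FAQ JSON export into one chunk per Q&A pair, keeping the
    question and answer together in a single embedded unit."""
    chunks = []
    for i, item in enumerate(json.loads(faq_json)):
        chunks.append({
            "id": f"faq-{i}",
            "content": f"Q: {item['question']}\nA: {item['answer']}",
            "metadata": {"doc_type": "faq", "category": item.get("category", "general")},
        })
    return chunks
```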

Embedding Models

Model Comparison

| Model | Provider | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|---|
| BGE-base-en-v1.5 | Workers AI | 768 | Fast | Good | FREE |
| BGE-small-en-v1.5 | Workers AI | 384 | Fastest | Acceptable | FREE |
| text-embedding-004 | Vertex AI | 768 | Medium | Excellent | $0.025/1K |
| text-embedding-3-large | OpenAI | 3072 | Medium | Excellent | $0.13/1K |

Recommendation by Use Case

| Use Case | Recommended Model | Rationale |
|---|---|---|
| Real-time chat | BGE-small | Lowest latency |
| Support queries | BGE-base | Balance of speed/quality |
| Sales/complex | text-embedding-004 | Highest accuracy |
| Batch indexing | text-embedding-004 | Quality over speed |
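
The per-1K prices in the model table make cost planning simple arithmetic. A rough estimator (the billed unit differs by provider, tokens vs characters, so treat results as planning numbers, not invoices):

```python
# Per-1K-unit prices copied from the model comparison table.
PRICE_PER_1K = {
    "bge-base-en-v1.5": 0.0,
    "bge-small-en-v1.5": 0.0,
    "text-embedding-004": 0.025,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(model: str, units: int) -> float:
    """Estimated embedding cost in dollars for `units` tokens/characters."""
    return PRICE_PER_1K[model] / 1000 * units
```

For example, embedding a 1M-token corpus with text-embedding-004 comes to about $25, versus $0 on Workers AI, which is why batch re-indexing jobs are the place to weigh quality against cost.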

Embedding Code Example

```javascript
// Workers AI embedding
const embeddings = await ai.run('@cf/baai/bge-base-en-v1.5', {
  text: [chunk.content]
});

// Insert into Vectorize
await index.upsert([{
  id: chunk.id,
  values: embeddings.data[0],
  metadata: {
    doc_type: chunk.type,
    category: chunk.category,
    tenant_id: tenantId,
    updated_at: Date.now()
  }
}]);
```

  • Querying - Query patterns and retrieval strategies
  • Maintenance - Index maintenance and monitoring
  • Overview - RAG Knowledge Base overview