My $5/month RAG System Just Got Eyes: Adding Multimodal Search Without Breaking the Bank

**MyrinNew** · 01-25-2026, 02:40 PM

Last month, I showed you how my V2 beat $200 enterprise RAG systems with hybrid search and reranking. The response was incredible, but one comment stuck with me:

"This is great for text, but what about images? My team has thousands of screenshots and diagrams we can't search."

So I rebuilt it again. This time, I gave my RAG system vision.

Who Should Read This?

Freelancers/agencies managing client screenshots, bug reports, design files
Teams drowning in Slack screenshots that are unsearchable
Anyone tired of "what was that dashboard screenshot called again?"
Developers wanting multimodal RAG without $100/month bills

Why This Matters: The Cost Reality

Solution	Monthly cost	What you get
OpenAI Vision API	$100/month	Vision only (no search, no storage)
Google Vertex AI	$15/month	Vision only (no embeddings)
AWS Rekognition	$12/month	Labels only (no semantic search)
Pinecone + OpenAI	$120/month	Vision + Search (separate services)
This Project	$5/month	Vision + OCR + Embeddings + Hybrid Search + Storage

The hidden cost: Most solutions charge separately for vision, embeddings, and search. This project includes everything on Cloudflare's edge.

The Problem With V2

V2 could find anything in text documents. But when clients uploaded:

📸 Dashboard screenshots
📊 Technical diagrams
📋 Scanned documents
🐛 Error message screenshots

The system was blind. It could only search by filename (screenshot-2026-01-15.png - useless) or manually added descriptions (which nobody bothers writing).

In 2026, text-only search feels like using a flip phone.

The Upgrade: Making RAG "See"

I added Llama 4 Scout (Meta's multimodal mixture-of-experts model, 17B active parameters) to the stack. Now when you upload an image:

Llama 4 Scout analyzes the pixels - generates a detailed description
OCR extracts visible text - captures button labels, error messages, code
BGE creates embeddings - makes it all searchable
Stores in the same index - no separate image database needed

How It Works:

┌─────────────────────────────────────────────────────────┐
│                   Image Upload (40KB)                   │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
                 ┌───────────────────────┐
                 │     Llama 4 Scout     │
                 │  (Multimodal Model)   │
                 └───────────┬───────────┘
                             │
                    ┌────────┴────────┐
                    ▼                 ▼
             ┌──────────────┐  ┌──────────────┐
             │   Semantic   │  │   OCR Text   │
             │  Description │  │  Extraction  │
             │ (1,865 chars)│  │ (1,043 chars)│
             └──────┬───────┘  └──────┬───────┘
                    │                 │
                    └────────┬────────┘
                             ▼
                     ┌───────────────┐
                     │ BGE Embedding │
                     │  (384 dims)   │
                     └───────┬───────┘
                             ▼
                  ┌──────────────────────┐
                  │    Vectorize + D1    │
                  │    (Single Index)    │
                  └──────────┬───────────┘
                             ▼
                  ┌─────────────────────┐
                  │     Searchable!     │
                  │  • By meaning       │
                  │  • By text          │
                  │  • By similarity    │
                  └─────────────────────┘

Processing time: 7.9 seconds

Search time: 900ms (first) → 0ms (cached)

The Code:

// The magic: images become searchable text
const visionResponse = await env.AI.run('@cf/meta/llama-4-scout-17b-16e-instruct', {
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: `data:image/png;base64,${base64Image}` } },
      { type: 'text', text: 'Describe this image in detail for search indexing.' }
    ]
  }]
});

// Combine semantic description + extracted text
// (description and ocrText come from the model output; parsing omitted here)
const searchableContent = `${description}\n\nVisible Text: ${ocrText}`;

// Store in the same 384-dim index as regular documents
// (embedding is the BGE vector computed from searchableContent)
await env.VECTORIZE.upsert([{
  id: imageId,
  values: embedding,
  metadata: { content: searchableContent, isImage: true }
}]);
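
The snippet above uses `embedding` without showing where it comes from. Here is a minimal sketch of that step, assuming the BGE model returns its vectors under `data` (the usual Workers AI response shape for embedding models). `embedContent` is a hypothetical helper name, not necessarily what the repo uses:

```javascript
// Hypothetical helper: turn description + OCR text into one 384-dim vector.
// `env.AI` is the Workers AI binding used in the snippet above.
async function embedContent(env, description, ocrText) {
  const searchableContent = `${description}\n\nVisible Text: ${ocrText}`;
  const result = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
    text: searchableContent,
  });
  // Assumption: embedding models respond with { shape, data: [[...floats]] }.
  const embedding = result.data[0];
  if (!Array.isArray(embedding) || embedding.length !== 384) {
    throw new Error('Unexpected embedding shape');
  }
  return { searchableContent, embedding };
}
```

The length check is there because Vectorize rejects vectors that don't match the index dimensions, so it is cheaper to fail before the upsert.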

Why this works:

✅ Single unified index (images + text coexist)

✅ Hybrid search still applies (Vector + BM25)

✅ OCR makes screenshots searchable by visible text

✅ Same $5/month cost
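
Hybrid search needs some way to merge the vector ranking with the BM25 ranking. This post doesn't spell out how the project does it, so the sketch below shows one common option, reciprocal rank fusion (RRF), purely as an assumption. `reciprocalRankFusion` and the constant `k = 60` are illustrative and not taken from the repo:

```javascript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank starts at 1. Items ranked well in either list rise to the top.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Example: an image ("img-7") competes with text docs in the same ranking.
const vectorHits = ['img-7', 'doc-2', 'doc-9'];
const bm25Hits = ['doc-2', 'doc-4', 'img-7'];
const fused = reciprocalRankFusion([vectorHits, bm25Hits]);
// doc-2 and img-7 appear in both lists, so they end up ahead of doc-4 and doc-9.
```

Because image descriptions and regular documents share one index, nothing in the fusion step needs to know which IDs are images.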

Why Not Use Separate OCR?

Short answer: Llama 4 Scout does OCR + semantic understanding in one call.

Long answer:

Tesseract can't run on Workers - needs native binaries, breaks serverless
Context matters - Llama 4 understands table structures, headers, layouts (Tesseract just dumps text linearly)
Efficiency - One API call vs. two (vision + OCR separately)
Fallback resilience - If OCR fails, semantic description still makes it searchable

Trade-off: Dedicated OCR might be 2-3% more accurate on printed text, but Llama 4's multimodal understanding gives better search results.
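
The fallback point can be made concrete. This is a minimal sketch, assuming the ingest step ends up with two strings (a description and the OCR text), either of which may be empty. `buildSearchableContent` is a hypothetical helper, not the repo's code:

```javascript
// Hypothetical helper: build the text that gets embedded for an image.
// Either field may be missing if that part of the model output was empty.
function buildSearchableContent({ description, ocrText }) {
  const parts = [];
  if (description && description.trim()) parts.push(description.trim());
  if (ocrText && ocrText.trim()) parts.push(`Visible Text: ${ocrText.trim()}`);
  if (parts.length === 0) {
    throw new Error('Nothing to index: no description and no OCR text');
  }
  return parts.join('\n\n');
}
```

With this shape, an image with no readable text is still indexed by its description, and a text-heavy scan with a weak description is still indexed by its OCR output.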

Performance: Before vs After

Metric	V2 (before)	V3 (after)
Search Types	Text only	Text + Images + Scanned Docs
Image Ingestion	❌	✅ 7.9s (Llama 4 + OCR)
OCR Extraction	❌	✅ 1,000+ chars (receipts, forms, diagrams)
Reverse Image Search	❌	✅ 8s
Latency (text search)	~900ms	~900ms (unchanged!)
Latency (cached)	N/A	0ms (new cache)
Cost	$5/month	$5/month
Document Types	Text, Code, Markdown	+ Screenshots, Receipts, Forms, Diagrams

Yes, the cost didn't change. Cloudflare's edge deployment means you're not paying for idle GPU time.

Real-World Test: Visual Bug Reports

I tested with actual use cases from my consulting work:

Test 1: Screenshot Search

Uploaded: Dashboard screenshot with metrics cards

Search query: "Find dashboards with performance metrics"

Result: ✅ Found 3 similar screenshots in 1.1s

What it matched:

Description: "dashboard interface with metrics cards"
OCR text: "Response Time: 847ms", "Throughput: 2.4K/s"

Test 2: Error Message Recognition

Uploaded: Screenshot of React error in browser console

Search query: "TypeError undefined property"

Result: ✅ Matched via OCR text extraction

OCR captured:

TypeError: Cannot read property 'map' of undefined
at ProductList.jsx:42

Test 3: Diagram Discovery

Uploaded: Architecture diagram with boxes and arrows

Search query: "microservices architecture"

Result: ✅ Matched via semantic description

Llama 4 described it as:

"Architecture diagram showing microservices pattern with API Gateway, Service Discovery, and multiple backend services connected via message queue"

Real-World Test: Financial Document Search

To prove this isn't just for tech screenshots, I threw it a real challenge: a Nigerian bank receipt with mixed English/abbreviations, account numbers, and structured financial data (40KB JPEG).

What Llama 4 Scout Extracted:

OCR (1,043 characters):

Transaction Amount: N30,000
Transaction Type: INTER-BANK
Sender: CHUKWUDI NWANERI
Beneficiary: BOBMANUEL CECILIA OGECHI
Account: 3113880181
Bank: First Bank of Nigeria
Reference: NXG000014260102194419228984379203

Semantic Description (1,865 characters):

"Transaction receipt from Access Bank, detailing a successful inter-bank transfer. The receipt is structured into sections with header, transaction details, sender/beneficiary information..."

Processing Time: 7.9 seconds (vision + OCR + embedding)

Search Results - 5 Queries (4 Distinct + 1 Repeat):

I tested each kind of searchable element. Every query returned the receipt as the #1 result:

Query	Result	Time
"N30000 transfer"	✅ #1 match	1.2s
"BOBMANUEL CECILIA"	✅ #1 match	609ms
"Access Bank transaction"	✅ #1 match	527ms
"NXG000014260102194419228984379203"	✅ #1 match	601ms
"Access Bank transaction" (repeat)	✅ #1 match	0ms (cached!)

This proves:

✅ Semantic search - "N30000 transfer" matched without exact text

✅ Name extraction - Found partial name "BOBMANUEL CECILIA"

✅ Exact text matching - 33-character transaction reference found in 601ms

✅ Cache working - Repeat query eliminated all latency

Use Cases This Enables:

📄 Receipt Management

Upload scanned receipts, invoices, bills. Search by:

Amount: "show me transfers over N25000"
Vendor: "find all Access Bank transactions"
Date: "transactions from January 2026"

💼 Financial Audit Trails

Search transactions by reference number
Find transfers by recipient name
Track spending patterns across documents

🏦 Compliance & Bookkeeping

Searchable transaction history without manual data entry
Automated document categorization by bank/type
Audit-ready record keeping with instant retrieval

🔒 Privacy-First

Your financial documents never leave Cloudflare's network. No OpenAI API calls, no Google Cloud uploads - just edge processing.

All for $5/month.

The "I Tried CLIP and It Failed" Story

Initially, I wanted to use CLIP (OpenAI's vision-language model) for "true" visual embeddings. The plan was beautiful:

Image → CLIP → Visual embedding (512 dims) → Separate index

Problem: Cloudflare Workers AI doesn't support CLIP.

Error code 5018: "This account is not allowed to access this model."

After wasting a weekend on this, I realized something: For RAG use cases, descriptions work better than visual embeddings anyway.

Why?

Descriptions are searchable by meaning ("red button") and text ("Submit")
Visual embeddings capture overall appearance, not the exact text on screen (good for "find similar images", weak for "find the screenshot with this error message")
Single index is simpler than dual-index systems

Lesson learned: Sometimes the "clever" solution is worse than the simple one.

How This Compares to Multimodal Alternatives

Feature	OpenAI Vision	Google Vertex AI	This stack
Base cost	$0.01/image	$0.0015/image	Included in $5/month
OCR	Not included	Separate API ($1.50/1K pages)	Built-in
Hybrid search	No	No	✅ Vector + BM25
Reranking	No	No	✅ Cross-encoder
Edge latency	200-500ms	300-600ms	~900ms (first), 0ms (cached)
Data leaves network	✅ Yes	✅ Yes	❌ No (Cloudflare only)
Setup complexity	API integration	Complex SDK	wrangler deploy
Storage included	No (S3 separate)	No (GCS separate)	✅ D1 + Vectorize

At scale (10K images/month):

OpenAI Vision: $100/month (just for vision, excluding embeddings & storage)
Google Vertex AI: $15/month (vision only) + $10/month (embeddings) + storage
AWS Rekognition: $12/month (labels only) + separate search solution
This stack: $5/month (everything included)

At scale (100K images/month):

OpenAI: $1,000/month
Google: $250/month
This stack: ~$50/month (still 20x cheaper than OpenAI, 5x cheaper than Google)
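
The arithmetic behind these figures is just per-image price times volume. Here is a quick sketch using the prices quoted above. Mapping $0.01 to OpenAI and $0.0015 to Google is inferred from the 10K totals, and `monthlyVisionCost` is an illustrative helper:

```javascript
// Per-image vision prices as quoted in the comparison above (USD).
const PRICE_PER_IMAGE = { openai: 0.01, google: 0.0015 };

// Monthly vision-only cost, rounded to cents to avoid floating-point noise.
function monthlyVisionCost(provider, imagesPerMonth) {
  return Math.round(PRICE_PER_IMAGE[provider] * imagesPerMonth * 100) / 100;
}

const openai10k = monthlyVisionCost('openai', 10000);   // 100
const google10k = monthlyVisionCost('google', 10000);   // 15
const openai100k = monthlyVisionCost('openai', 100000); // 1000

// Ratios against the ~$50/month estimate for this stack at 100K images.
const vsOpenAI = openai100k / 50; // 20
// The $250 Google figure matches 10x the (vision $15 + embeddings $10) total.
const vsGoogle = 250 / 50; // 5
```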

New Features in V3

1. Image Ingestion Endpoint

curl -X POST https://your-worker.dev/ingest-image \
-F "id=dashboard-001" \
-F "image=@screenshot.png" \
-F "category=ui-screenshots"

Response:

{
  "success": true,
  "documentId": "dashboard-001",
  "description": "Dashboard interface with...",
  "extractedText": "API Key\nEnter your API key\nTest...",
  "performance": {
    "multimodalProcessing": "4852ms",
    "totalTime": "7737ms"
  }
}

2. Reverse Image Search

Upload an image, find visually similar ones:

curl -X POST https://your-worker.dev/find-similar-images \
-F "image=@query.png" \
-F "topK=5"
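
Because the index holds text embeddings of descriptions rather than CLIP-style visual vectors, one plausible way to implement this endpoint is to describe the query image first and then search with that description. The sketch below assumes that design. `findSimilarImages` is a hypothetical name, the `response` field and the topK cap are assumptions to verify, and the real handler in the repo may differ:

```javascript
// Sketch only: describe the query image, embed the description, then
// query the shared index and keep only matches flagged as images.
async function findSimilarImages(env, base64Image, topK = 5) {
  const vision = await env.AI.run('@cf/meta/llama-4-scout-17b-16e-instruct', {
    messages: [{
      role: 'user',
      content: [
        { type: 'image_url', image_url: { url: `data:image/png;base64,${base64Image}` } },
        { type: 'text', text: 'Describe this image in detail for search indexing.' },
      ],
    }],
  });
  // Assumption: the generated text is returned under `response`.
  const description = vision.response;

  const emb = await env.AI.run('@cf/baai/bge-small-en-v1.5', { text: description });

  // Over-fetch because text documents share the index with images.
  // Assumption: 20 is a conservative cap for queries that return metadata.
  const result = await env.VECTORIZE.query(emb.data[0], {
    topK: Math.min(topK * 3, 20),
    returnMetadata: 'all',
  });
  return result.matches
    .filter((m) => m.metadata && m.metadata.isImage === true)
    .slice(0, topK)
    .map((m) => ({ id: m.id, score: m.score }));
}
```

Under this design "similar" means "described similarly", which fits the reported ~8s latency, since most of that time would go to the vision call.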

Use cases:

"Find screenshots that look like this"
"Match product photos"
"Locate similar diagrams"

3. 60-Second Cache (New!)

After rebuilding, I added caching. Same query within 60s? 0ms response.

First search: 929ms
Cached search: 0ms ✨

Real log output:

POST /search - Ok @ 3:33:00 PM
POST /search - Ok @ 3:33:02 PM
(log) Cache hit!

How it works:

In-memory cache (not Workers KV - that adds 50-100ms latency)
Caches final search results (~5KB per query)
60-second TTL (queries expire after 1 minute - balances freshness vs performance)
Uses <1MB of Worker's 128MB RAM
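
A cache like this fits in a few lines. The sketch below is one way to write it, not the repo's implementation: `TtlCache` and `cachedSearch` are hypothetical names, and the cache key format is an assumption. Worker memory is per isolate, so a hit only happens when the repeat request lands on the same warm isolate.

```javascript
// Minimal in-memory TTL cache. State lives in the isolate's memory, so it is
// per-isolate and disappears whenever the isolate is recycled.
class TtlCache {
  constructor(ttlMs = 60000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
    this.entries = new Map();
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }
  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Module-scope instance, reused across requests handled by the same isolate.
const searchCache = new TtlCache();

// Wraps any search function; `runSearch` stands in for the real pipeline.
async function cachedSearch(query, topK, runSearch) {
  const key = JSON.stringify([query, topK]);
  const hit = searchCache.get(key);
  if (hit !== undefined) return hit;
  const results = await runSearch(query, topK);
  searchCache.set(key, results);
  return results;
}
```

Expired entries are only removed when they are read again, which is fine at ~5KB per query but would need a size cap if query variety grows.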

4. Batch Embeddings (Optimization)

V2 generated embeddings sequentially (slow). V3 uses Promise.all():

Before: 3 chunks → 3 seconds (sequential)

After: 3 chunks → 1.2 seconds (parallel)

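// V2: Sequential (slow): each request waits for the previous one
const vectors = [];
for (const chunk of chunks) {
  const emb = await env.AI.run('@cf/baai/bge-small-en-v1.5', { text: chunk });
  vectors.push(emb.data[0]); // response shape: { shape, data: [[...384 floats]] }
}

// V3: Parallel (fast): all requests are in flight at once
const responses = await Promise.all(
  chunks.map(chunk => env.AI.run('@cf/baai/bge-small-en-v1.5', { text: chunk }))
);
const embeddings = responses.map(r => r.data[0]);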
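
Workers AI embedding models also accept an array for `text`, so another option is to send all chunks in one request instead of one request per chunk. This sketch assumes the response keeps input order under `data` and that the per-request limit is around 100 inputs (worth checking against the current model page). `embedChunks` is a hypothetical helper:

```javascript
// Embed many chunks with as few requests as possible.
// Assumption: `text` may be an array and `data` holds one vector per input.
async function embedChunks(env, chunks, batchSize = 100) {
  const vectors = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const res = await env.AI.run('@cf/baai/bge-small-en-v1.5', { text: batch });
    vectors.push(...res.data);
  }
  return vectors;
}
```

For a handful of chunks the difference from `Promise.all()` is small. The single-request form mainly helps when a document splits into dozens of chunks.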

The Tech Stack (Updated)

All still on Cloudflare's edge:

Workers - Runtime (serverless, globally distributed)
Vectorize - Vector database (384 dims, single unified index)
D1 - SQL database for BM25 keywords
Workers AI:
- @cf/meta/llama-4-scout-17b-16e-instruct (vision + OCR)
- @cf/baai/bge-small-en-v1.5 (embeddings - 384 dims)
- @cf/baai/bge-reranker-base (cross-encoder reranking)
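
For context, this is roughly how the reranker slots in after retrieval. The sketch assumes the model takes `query` plus a `contexts` array of `{ text }` objects and returns `{ id, score }` pairs under `response`, where `id` is the position of the context in the request. Check that schema against the model's documentation page. `rerank` is a hypothetical helper:

```javascript
// Re-score retrieved candidates with the cross-encoder and keep the best few.
// Each candidate is expected to carry its text under `content`.
async function rerank(env, query, candidates, topK = 5) {
  const res = await env.AI.run('@cf/baai/bge-reranker-base', {
    query,
    contexts: candidates.map((c) => ({ text: c.content })),
  });
  // Assumption: `response` is a list of { id, score } with id = context index.
  return res.response
    .map((r) => ({ ...candidates[r.id], rerankScore: r.score }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, topK);
}
```

Image entries need no special handling here, because their `content` is already the description plus the visible text.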

Why 384 dimensions?

Tested: 384 dims achieves 66.43% MRR@5 vs 56.72% for semantic-only
Upgrading to 768 dims only improves that to ~68% (a gain of about 2 percentage points)
But doubles cost and latency
Better to use reranker (adds 9.3 percentage points for minimal cost)
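
For anyone who wants to run the same comparison on their own data, MRR@5 is simple to compute. This is a generic sketch of the metric, not the evaluation script behind the numbers above. `mrrAtK` and the toy data are illustrative:

```javascript
// MRR@k: for each query take 1/rank of the first relevant result within the
// top k (0 if none is found), then average over all queries.
function mrrAtK(runs, k = 5) {
  if (runs.length === 0) return 0;
  let total = 0;
  for (const { ranked, relevant } of runs) {
    const idx = ranked.slice(0, k).findIndex((id) => relevant.has(id));
    if (idx !== -1) total += 1 / (idx + 1);
  }
  return total / runs.length;
}

const score = mrrAtK([
  { ranked: ['a', 'b', 'c'], relevant: new Set(['a']) },  // rank 1 -> 1
  { ranked: ['x', 'y', 'z'], relevant: new Set(['y']) },  // rank 2 -> 0.5
  { ranked: ['p', 'q', 'r'], relevant: new Set(['zz']) }, // miss   -> 0
]);
// score is (1 + 0.5 + 0) / 3 = 0.5
```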

No external APIs. No data leaving Cloudflare's network.

Deployment (Still 10 Minutes)

git clone https://github.com/dannwaneri/vectorize-mcp-worker.git
cd vectorize-mcp-worker
npm install

# Create resources
wrangler vectorize create mcp-knowledge-base --dimensions=384
wrangler d1 create mcp-knowledge-db

# Update wrangler.toml with database_id, then:
wrangler d1 execute mcp-knowledge-db --remote --file=./schema.sql
wrangler deploy

Test it:

# Upload an image
curl -X POST https://your-worker.dev/ingest-image \
-F "id=test-001" \
-F "image=@screenshot.png"

# Search by text
curl -X POST https://your-worker.dev/search \
-H "Content-Type: application/json" \
-d '{"query": "dashboard metrics", "topK": 5}'

What This Can't Do (Yet)

Handwriting recognition - Llama 4 Scout struggles with cursive
Complex math equations - LaTeX rendering in images isn't perfect
Video analysis - Only processes static images (frame extraction coming in V4?)
Non-Latin scripts - Haven't tested Arabic/Chinese/Cyrillic thoroughly

If your use case needs these, let me know in the comments - might prioritize them.

What's Next?

Considering:

✅ Multimodal search (Done!)
✅ 60-second cache (Done!)
✅ Batch embeddings (Done!)
Face/object detection (if there's demand)
Video frame analysis
PDF image extraction

But honestly? For screenshots, diagrams, and scanned documents, this already covers most of the multimodal RAG use cases I run into.

Try It Live

Dashboard: https://vectorize-mcp-worker.fpl-tes....dev/dashboard
GitHub: github.com/dannwaneri/vectorize-mcp-worker

Upload a screenshot and search for it. You'll see why 2026 is the year RAG gets eyes.

Need help deploying this for your team? Hire me on Upwork

⭐ Star the repo if this helps your project!

Questions? Drop them in the comments.
