My $5/month RAG System Just Got Eyes: Adding Multimodal Search Without Breaking the Bank

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    Last month, I showed you how my V2 beat $200 enterprise RAG systems with hybrid search and reranking. The response was incredible but one comment stuck with me:


    "This is great for text, but what about images? My team has thousands of screenshots and diagrams we can't search."


    So I rebuilt it again. This time, I gave my RAG system vision.


    Who Should Read This?

    • Freelancers/agencies managing client screenshots, bug reports, design files
    • Teams drowning in Slack screenshots that are unsearchable
    • Anyone tired of "what was that dashboard screenshot called again?"
    • Developers wanting multimodal RAG without $100/month bills


    Why This Matters: The Cost Reality

    Solution           | Monthly cost | What you get
    OpenAI Vision API  | $100/month   | Vision only (no search, no storage)
    Google Vertex AI   | $15/month    | Vision only (no embeddings)
    AWS Rekognition    | $12/month    | Labels only (no semantic search)
    Pinecone + OpenAI  | $120/month   | Vision + Search (separate services)
    This Project       | $5/month     | Vision + OCR + Embeddings + Hybrid Search + Storage


    The hidden cost: Most solutions charge separately for vision, embeddings, and search. This project includes everything on Cloudflare's edge.


    The Problem With V2

    V2 could find anything in text documents. But when clients uploaded:
    • 📸 Dashboard screenshots
    • 📊 Technical diagrams
    • 📋 Scanned documents
    • 🐛 Error message screenshots


    The system was blind. It could only search by filename (screenshot-2026-01-15.png - useless) or manually added descriptions (which nobody bothers writing).


    In 2026, text-only search feels like using a flip phone.


    The Upgrade: Making RAG "See"

    I added Llama 4 Scout (Meta's multimodal model with 17B active parameters) to the stack. Now when you upload an image:

    1. Llama 4 Scout analyzes the pixels - generates a detailed description
    2. OCR extracts visible text - captures button labels, error messages, code
    3. BGE creates embeddings - makes it all searchable
    4. Stores in the same index - no separate image database needed


    How It Works:





          ┌─────────────────────┐
          │ Image Upload (40KB) │
          └──────────┬──────────┘
                     ▼
          ┌─────────────────────┐
          │    Llama 4 Scout    │
          │ (Multimodal Model)  │
          └──────────┬──────────┘
            ┌────────┴────────┐
            ▼                 ▼
    ┌───────────────┐ ┌───────────────┐
    │   Semantic    │ │   OCR Text    │
    │  Description  │ │  Extraction   │
    │ (1,865 chars) │ │ (1,043 chars) │
    └───────┬───────┘ └───────┬───────┘
            └────────┬────────┘
                     ▼
          ┌─────────────────────┐
          │    BGE Embedding    │
          │     (384 dims)      │
          └──────────┬──────────┘
                     ▼
          ┌─────────────────────┐
          │   Vectorize + D1    │
          │   (Single Index)    │
          └──────────┬──────────┘
                     ▼
          ┌─────────────────────┐
          │     Searchable!     │
          │  • By meaning       │
          │  • By text          │
          │  • By similarity    │
          └─────────────────────┘







    Processing time: 7.9 seconds


    Search time: 900ms (first) → 0ms (cached)

    The Code:




    // The magic: images become searchable text
    const visionResponse = await env.AI.run('@cf/meta/llama-4-scout-17b-16e-instruct', {
      messages: [{
        role: 'user',
        content: [
          { type: 'image_url', image_url: { url: `data:image/png;base64,${base64Image}` } },
          { type: 'text', text: 'Describe this image in detail for search indexing.' }
        ]
      }]
    });

    // description and ocrText come from the model output (extraction omitted in this excerpt)
    // Combine semantic description + extracted text
    const searchableContent = `${description}\n\nVisible Text: ${ocrText}`;

    // embedding is the BGE vector for searchableContent (embedding call omitted in this excerpt)
    // Store in the same 384-dim index as regular documents
    await env.VECTORIZE.upsert([{
      id: imageId,
      values: embedding,
      metadata: { content: searchableContent, isImage: true }
    }]);
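
The snippet above uses a `base64Image` variable without showing where it comes from, and it hardcodes `image/png` even though the receipt test later in the post is a JPEG. A minimal sketch of one way to build the data URL from uploaded bytes follows. `toDataUrl` is a hypothetical helper, not something taken from the project, and it assumes a runtime with a global `btoa` (Workers and recent Node versions have one).

```javascript
// Hypothetical helper: turn raw image bytes into the data URL used in the vision prompt.
function toDataUrl(bytes, mimeType = 'image/png') {
  let binary = '';
  const CHUNK = 0x8000; // convert in slices so String.fromCharCode never gets a huge argument list
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// In a Worker the bytes would typically come from the multipart upload, for example:
//   const file = (await request.formData()).get('image');
//   const url = toDataUrl(new Uint8Array(await file.arrayBuffer()), file.type);
```

Passing the upload's own MIME type through means PNG screenshots and JPEG scans both get a correctly labeled data URL.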






    Why this works:


    ✅ Single unified index (images + text coexist)


    ✅ Hybrid search still applies (Vector + BM25)


    ✅ OCR makes screenshots searchable by visible text


    ✅ Same $5/month cost
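
The post does not show how the vector and BM25 result lists are merged. One common approach is reciprocal rank fusion (RRF), sketched below as an illustration only. The function name, the sample IDs, and the constant `k = 60` are assumptions, not the project's actual code.

```javascript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.
// Illustrative sketch; the project's real fusion logic may differ.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Hypothetical result lists: images and text documents share one ID space.
const vectorHits = ['img-receipt', 'doc-pricing', 'img-dashboard'];
const bm25Hits = ['img-dashboard', 'img-receipt'];
const fused = reciprocalRankFusion([vectorHits, bm25Hits]);
// fused is ['img-receipt', 'img-dashboard', 'doc-pricing']
```

Because image entries are stored as text (description plus OCR), they can appear in both lists and benefit from fusion exactly like regular documents.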

    Why Not Use Separate OCR?

    Short answer: Llama 4 Scout does OCR + semantic understanding in one call.


    Long answer:
    • Tesseract can't run on Workers - needs native binaries, breaks serverless
    • Context matters - Llama 4 understands table structures, headers, layouts (Tesseract just dumps text linearly)
    • Efficiency - One API call vs. two (vision + OCR separately)
    • Fallback resilience - If OCR fails, semantic description still makes it searchable
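
The fallback point above can be sketched as a small function that builds the indexed text from whatever the model returned. This is a hypothetical helper (`buildSearchableContent` is not a name from the project); it only assumes the "description, blank line, Visible Text" layout shown in the code section.

```javascript
// Hypothetical helper: combine description and OCR text, tolerating a missing half.
function buildSearchableContent(description, ocrText) {
  const desc = (description || '').trim();
  const text = (ocrText || '').trim();
  if (!desc && !text) {
    throw new Error('Nothing to index: description and OCR text are both empty');
  }
  if (!text) return desc;                    // OCR failed: the description alone stays searchable
  if (!desc) return `Visible Text: ${text}`; // description failed: fall back to the OCR text
  return `${desc}\n\nVisible Text: ${text}`;
}
```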


    Trade-off: Dedicated OCR might be 2-3% more accurate on printed text, but Llama 4's multimodal understanding gives better search results.

    Performance: Before vs After

    Metric                | V2 (before)          | V3 (after)
    Search Types          | Text only            | Text + Images + Scanned Docs
    Image Ingestion       | ❌                   | ✅ 7.9s (Llama 4 + OCR)
    OCR Extraction        | ❌                   | ✅ 1,000+ chars (receipts, forms, diagrams)
    Reverse Image Search  | ❌                   | ✅ 8s
    Latency (text search) | ~900ms               | ~900ms (unchanged!)
    Latency (cached)      | N/A                  | 0ms (new cache)
    Cost                  | $5/month             | $5/month
    Document Types        | Text, Code, Markdown | + Screenshots, Receipts, Forms, Diagrams


    Yes, the cost didn't change. Cloudflare's edge deployment means you're not paying for idle GPU time.

    Real-World Test: Visual Bug Reports

    I tested with actual use cases from my consulting work:

    Test 1: Screenshot Search

    Uploaded: Dashboard screenshot with metrics cards


    Search query: "Find dashboards with performance metrics"


    Result: ✅ Found 3 similar screenshots in 1.1s


    What it matched:
    • Description: "dashboard interface with metrics cards"
    • OCR text: "Response Time: 847ms", "Throughput: 2.4K/s"

    Test 2: Error Message Recognition

    Uploaded: Screenshot of React error in browser console


    Search query: "TypeError undefined property"


    Result: ✅ Matched via OCR text extraction


    OCR captured:






    TypeError: Cannot read property 'map' of undefined
    at ProductList.jsx:42







    Test 3: Diagram Discovery

    Uploaded: Architecture diagram with boxes and arrows


    Search query: "microservices architecture"


    Result: ✅ Matched via semantic description


    Llama 4 described it as:


    "Architecture diagram showing microservices pattern with API Gateway, Service Discovery, and multiple backend services connected via message queue"

    Real-World Test: Financial Document Search

    To prove this isn't just for tech screenshots, I threw it a real challenge: a Nigerian bank receipt with mixed English/abbreviations, account numbers, and structured financial data (40KB JPEG).

    What Llama 4 Scout Extracted:

    OCR (1,043 characters):






    Transaction Amount: N30,000
    Transaction Type: INTER-BANK
    Sender: CHUKWUDI NWANERI
    Beneficiary: BOBMANUEL CECILIA OGECHI
    Account: 3113880181
    Bank: First Bank of Nigeria
    Reference: NXG000014260102194419228984379203







    Semantic Description (1,865 characters):


    "Transaction receipt from Access Bank, detailing a successful inter-bank transfer. The receipt is structured into sections with header, transaction details, sender/beneficiary information..."


    Processing Time: 7.9 seconds (vision + OCR + embedding)


    Search Results - 5 Queries (4 distinct, 1 repeat):

    I tested every searchable element. Every query found the receipt as the #1 result:


    Query                               | Result      | Latency
    "N30000 transfer"                   | ✅ #1 match | 1.2s
    "BOBMANUEL CECILIA"                 | ✅ #1 match | 609ms
    "Access Bank transaction"           | ✅ #1 match | 527ms
    "NXG000014260102194419228984379203" | ✅ #1 match | 601ms
    "Access Bank transaction" (repeat)  | ✅ #1 match | 0ms (cached!)


    This proves:
    • Semantic search - "N30000 transfer" matched even though the receipt prints the amount as "N30,000"
    • Name extraction - Found partial name "BOBMANUEL CECILIA"
    • Exact text matching - 33-character transaction reference found in about 600ms
    • Cache working - Repeat query returned in 0ms

    Use Cases This Enables:

    📄 Receipt Management

    Upload scanned receipts, invoices, bills. Search by:
    • Amount: "show me transfers over N25000"
    • Vendor: "find all Access Bank transactions"
    • Date: "transactions from January 2026"
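
A query like "transfers over N25000" is a numeric comparison, which embedding or keyword similarity alone does not reliably evaluate. One option is to parse the amount out of the OCR text at ingestion time and store it as numeric metadata that can be filtered on. A minimal sketch, assuming the OCR output contains a line shaped like the "Transaction Amount: N30,000" example above; `parseAmount` is a hypothetical helper, not part of the project.

```javascript
// Hypothetical helper: pull a numeric amount out of OCR text such as
// "Transaction Amount: N30,000". Returns null when no amount line is found.
function parseAmount(ocrText) {
  const match = /Amount:\s*[N₦]?\s*([\d,]+(?:\.\d+)?)/i.exec(ocrText || '');
  if (!match) return null;
  return parseFloat(match[1].replace(/,/g, ''));
}

const sampleOcr = 'Transaction Amount: N30,000\nTransaction Type: INTER-BANK';
const amount = parseAmount(sampleOcr); // 30000
const isOver25k = amount !== null && amount > 25000; // true
```

The parsed value could then be stored next to `isImage` in the vector metadata, so a range check runs on a number instead of on text similarity.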

    💼 Financial Audit Trails

    • Search transactions by reference number
    • Find transfers by recipient name
    • Track spending patterns across documents

    🏦 Compliance & Bookkeeping

    • Searchable transaction history without manual data entry
    • Automated document categorization by bank/type
    • Audit-ready record keeping with instant retrieval

    🔒 Privacy-First

    Your financial documents never leave Cloudflare's network. No OpenAI API calls, no Google Cloud uploads - just edge processing.


    All for $5/month.

    The "I Tried CLIP and It Failed" Story

    Initially, I wanted to use CLIP (OpenAI's vision-language model) for "true" visual embeddings. The plan was beautiful:






    Image → CLIP → Visual embedding (512 dims) → Separate index







    Problem: Cloudflare Workers AI doesn't support CLIP.


    Error code 5018: "This account is not allowed to access this model."


    After wasting a weekend on this, I realized something: For RAG use cases, descriptions work better than visual embeddings anyway.


    Why?

    1. Descriptions are searchable by meaning ("red button") and text ("Submit")
    2. Visual embeddings capture overall visual similarity but are weak at reading on-screen text (good for "find similar images", less reliable for "find the screenshot with this error message")
    3. Single index is simpler than dual-index systems


    Lesson learned: Sometimes the "clever" solution is worse than the simple one.


    How This Compares to Multimodal Alternatives

    Feature             | OpenAI Vision     | Google Vertex AI              | This stack
    Base cost           | $0.01/image       | $0.0015/image                 | Included in $5/month
    OCR                 | Not included      | Separate API ($1.50/1K pages) | Built-in
    Hybrid search       | No                | No                            | ✅ Vector + BM25
    Reranking           | No                | No                            | ✅ Cross-encoder
    Edge latency        | 200-500ms         | 300-600ms                     | ~900ms (first), 0ms (cached)
    Data leaves network | ✅ Yes            | ✅ Yes                        | ❌ No (Cloudflare only)
    Setup complexity    | API integration   | Complex SDK                   | wrangler deploy
    Storage included    | No (S3 separate)  | No (GCS separate)             | ✅ D1 + Vectorize


    At scale (10K images/month):
    • OpenAI Vision: $100/month (just for vision, excluding embeddings & storage)
    • Google Vertex AI: $15/month (vision only) + $10/month (embeddings) + storage
    • AWS Rekognition: $12/month (labels only) + separate search solution
    • This stack: $5/month (everything included)


    At scale (100K images/month):
    • OpenAI: $1,000/month
    • Google: $250/month
    • This stack: ~$50/month (still 20x cheaper than OpenAI and 5x cheaper than Google)
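
The scale figures above follow directly from the per-image prices in the comparison table. A short worked check, using per-1K-image prices ($0.01/image is $10 per 1K, $0.0015/image is $1.50 per 1K); the variable names are illustrative only.

```javascript
// Per-1K-image vision prices implied by the comparison table (USD).
const PRICE_PER_1K = { openaiVision: 10, vertexVision: 1.5 };

function visionCost(images, pricePer1K) {
  return (images / 1000) * pricePer1K;
}

const openai10k = visionCost(10000, PRICE_PER_1K.openaiVision);   // 100
const vertex10k = visionCost(10000, PRICE_PER_1K.vertexVision);   // 15
const openai100k = visionCost(100000, PRICE_PER_1K.openaiVision); // 1000

// The $250 Google figure adds embeddings ($10 per 10K images, so $100 per 100K) to vision.
const vertex100k = visionCost(100000, PRICE_PER_1K.vertexVision) + 100; // 250

// Against the ~$50/month estimate for this stack at 100K images:
const ratios = { vsOpenAI: openai100k / 50, vsGoogle: vertex100k / 50 }; // 20 and 5
```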


    New Features in V3

    1. Image Ingestion Endpoint





    curl -X POST https://your-worker.dev/ingest-image \
      -F "id=dashboard-001" \
      -F "image=@screenshot.png" \
      -F "category=ui-screenshots"







    Response:






    {
      "success": true,
      "documentId": "dashboard-001",
      "description": "Dashboard interface with...",
      "extractedText": "API Key\nEnter your API key\nTest...",
      "performance": {
        "multimodalProcessing": "4852ms",
        "totalTime": "7737ms"
      }
    }







    2. Reverse Image Search

    Upload an image, find similar ones. Similarity is computed on the generated description and OCR text, not on raw pixels:






    curl -X POST https://your-worker.dev/find-similar-images \
      -F "image=@query.png" \
      -F "topK=5"







    Use cases:
    • "Find screenshots that look like this"
    • "Match product photos"
    • "Locate similar diagrams"


    3. 60-Second Cache (New!)

    After rebuilding, I added caching. Same query within 60s? 0ms response.






    First search: 929ms
    Cached search: 0ms ✨







    Real log output:






    POST /search - Ok @ 3:33:00 PM
    POST /search - Ok @ 3:33:02 PM
    (log) Cache hit!







    How it works:
    • In-memory cache (not Workers KV - that adds 50-100ms latency)
    • Caches final search results (~5KB per query)
    • 60-second TTL (queries expire after 1 minute - balances freshness vs performance)
    • Uses <1MB of Worker's 128MB RAM
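
The cache described above can be sketched as a small TTL map. This is a minimal illustration, not the project's code: `TtlCache` and `cacheKey` are hypothetical names, and the injectable clock exists only to make expiry easy to test. Note that module-level memory in a Worker lives per isolate, so a repeat query only hits the cache when it lands on the same warm isolate.

```javascript
// Minimal in-memory TTL cache (hypothetical sketch).
class TtlCache {
  constructor(ttlMs = 60000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Normalize the query so trivial variations share one cache entry.
function cacheKey(query, topK) {
  return `${query.trim().toLowerCase()}|${topK}`;
}

// Usage inside a search handler (sketch):
//   const key = cacheKey(query, topK);
//   const cached = searchCache.get(key);
//   if (cached) return cached;      // cache hit
//   const results = await runSearch(query, topK);
//   searchCache.set(key, results);
```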


    4. Batch Embeddings (Optimization)

    V2 generated embeddings sequentially (slow). V3 issues the requests in parallel with Promise.all():


    Before: 3 chunks → 3 seconds (sequential)


    After: 3 chunks → 1.2 seconds (parallel)






    // V2: Sequential (slow)
    for (const chunk of chunks) {
      const emb = await env.AI.run('@cf/baai/bge-small-en-v1.5', { text: chunk });
      vectors.push(emb);
    }

    // V3: Parallel (fast)
    const embeddings = await Promise.all(
      chunks.map(chunk => env.AI.run('@cf/baai/bge-small-en-v1.5', { text: chunk }))
    );
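
`Promise.all()` runs the requests concurrently but still sends one request per chunk. The BGE embedding models on Workers AI are documented as also accepting an array of strings in `text`, so several chunks can share a single call. A minimal sketch of splitting chunks into batches follows; `toBatches` is a hypothetical helper, and the cap of 100 items per call is an assumption to verify against the current model page.

```javascript
// Hypothetical helper: split chunks into fixed-size batches for one-call-per-batch embedding.
function toBatches(items, batchSize = 100) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Inside a Worker (not executed here):
//   const vectors = [];
//   for (const batch of toBatches(chunks)) {
//     const { data } = await env.AI.run('@cf/baai/bge-small-en-v1.5', { text: batch });
//     vectors.push(...data); // one 384-dim vector per input string
//   }
```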







    The Tech Stack (Updated)

    All still on Cloudflare's edge:
    • Workers - Runtime (serverless, globally distributed)
    • Vectorize - Vector database (384 dims, single unified index)
    • D1 - SQL database for BM25 keywords
    • Workers AI:
      • @cf/meta/llama-4-scout-17b-16e-instruct (vision + OCR)
      • @cf/baai/bge-small-en-v1.5 (embeddings - 384 dims)
      • @cf/baai/bge-reranker-base (cross-encoder reranking)
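
For readers unfamiliar with the keyword side of the stack, here is a sketch of the standard BM25 scoring formula that the D1 keyword data feeds. How the project actually computes it (in SQL or in the Worker) is not shown in the post; the data shapes, the Lucene-style IDF variant, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions.

```javascript
// BM25 score of one document for a list of query terms (illustrative sketch).
// doc:   { termFreq: Map<term, count>, length: number of tokens }
// stats: { docCount, avgDocLength, docFreq: Map<term, number of docs containing term> }
function bm25Score(queryTerms, doc, stats, k1 = 1.2, b = 0.75) {
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.termFreq.get(term) || 0;
    if (tf === 0) continue; // term absent from this document
    const n = stats.docFreq.get(term) || 0;
    const idf = Math.log(1 + (stats.docCount - n + 0.5) / (n + 0.5));
    const lengthNorm = 1 - b + b * (doc.length / stats.avgDocLength);
    score += idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
  }
  return score;
}
```

Rare terms, such as a transaction reference or an error name, get a high IDF, which is why exact strings captured by OCR rank strongly on the keyword side.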


    Why 384 dimensions?
    • Tested: 384 dims achieves 66.43% MRR@5 vs 56.72% for semantic-only
    • Upgrading to 768 dims only improves to ~68% (a gain of under 2 percentage points), but doubles cost and latency
    • Better to use reranker (adds 9.3 percentage points for minimal cost)
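
For reference, MRR@5 is the mean, over all test queries, of 1/rank of the first relevant result within the top 5 (0 when it is not in the top 5). A minimal sketch of the metric follows; the input shape and sample data are made up for illustration and are not the benchmark used in the post.

```javascript
// Mean reciprocal rank at k. Each entry: { ranked: [ids in result order], relevant: id }.
function mrrAtK(results, k = 5) {
  if (results.length === 0) return 0;
  let total = 0;
  for (const { ranked, relevant } of results) {
    const position = ranked.slice(0, k).indexOf(relevant);
    if (position !== -1) total += 1 / (position + 1); // ranks are 1-based
  }
  return total / results.length;
}

// Hypothetical evaluation set: hit at rank 1, hit at rank 2, miss.
const sampleRuns = [
  { ranked: ['a', 'b', 'c'], relevant: 'a' },
  { ranked: ['x', 'y', 'z'], relevant: 'y' },
  { ranked: ['p', 'q', 'r'], relevant: 'missing' },
];
const sampleMrr = mrrAtK(sampleRuns); // (1 + 0.5 + 0) / 3 = 0.5
```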


    No external APIs. No data leaving Cloudflare's network.


    Deployment (Still 10 Minutes)





    git clone https://github.com/dannwaneri/vectorize-mcp-worker.git
    cd vectorize-mcp-worker
    npm install

    # Create resources
    wrangler vectorize create mcp-knowledge-base --dimensions=384
    wrangler d1 create mcp-knowledge-db

    # Update wrangler.toml with database_id, then:
    wrangler d1 execute mcp-knowledge-db --remote --file=./schema.sql
    wrangler deploy







    Test it:






    # Upload an image
    curl -X POST https://your-worker.dev/ingest-image \
      -F "id=test-001" \
      -F "image=@screenshot.png"

    # Search by text
    curl -X POST https://your-worker.dev/search \
      -H "Content-Type: application/json" \
      -d '{"query": "dashboard metrics", "topK": 5}'







    What This Can't Do (Yet)

    • Handwriting recognition - Llama 4 Scout struggles with cursive
    • Complex math equations - Transcribing rendered equations from images isn't reliable
    • Video analysis - Only processes static images (frame extraction coming in V4?)
    • Non-Latin scripts - Haven't tested Arabic/Chinese/Cyrillic thoroughly


    If your use case needs these, let me know in the comments - might prioritize them.


    What's Next?

    Considering:
    • ✅ Multimodal search (Done!)
    • ✅ 60-second cache (Done!)
    • ✅ Batch embeddings (Done!)
    • Face/object detection (if there's demand)
    • Video frame analysis
    • PDF image extraction


    But honestly? This covers 95% of multimodal RAG use cases.


    Try It Live



    Upload a screenshot and search for it. You'll see why 2026 is the year RAG gets eyes.


    Need help deploying this for your team? Hire me on Upwork


    Star the repo if this helps your project!


    Questions? Drop them in the comments.



