CASE STUDY · /rag-chatbot

Multi-Tenant RAG Platform

A retrieval-augmented chat backend — each tenant indexes their own knowledge base, end users get grounded answers via an embeddable widget.

ROLE

Solo Lead Engineer · Pipeline + Inference

TIMELINE

2025

TEAM

1 (solo)

STATUS

Production

Overview

A production RAG (Retrieval-Augmented Generation) service that lets any tenant point at their own documentation, knowledge base, or website — and surfaces an embeddable chat widget that answers user questions grounded in that content.

Two cooperating pipelines do the work. The indexing pipeline (offline, scheduled) crawls → chunks → embeds → upserts into MongoDB Atlas. The query pipeline (per message, <2s) embeds the question → vector-searches → assembles context → streams a GPT-4o-mini answer back through the widget.

The problem

Off-the-shelf chatbots (Intercom AI, generic GPT bubbles) are ungrounded in private data and hallucinate confidently. Clients needed a chatbot that strictly answers from their own docs, with citations, and respects per-tenant isolation.

Architecture

RAG chatbot widget — flow wireframes

How a crawled corpus becomes vectors in MongoDB Atlas, and how the widget question becomes a grounded OpenAI answer.

ARCH · lo-fi

Lane A · Indexing / Crawl · run: offline / scheduled

1 Seed URLs

abc.com sitemap + manual list of help / pricing / docs pages.

sitemap.xml · config.json

2 Crawler

Fetch pages, follow internal links, dedupe by URL hash.

Puppeteer · Cheerio · node-fetch
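
A minimal sketch of "dedupe by URL hash", not the production crawler. The normalization rules (drop the fragment, trim a trailing slash, sort query parameters), the SHA-256 choice, and the names `normalizeUrl`, `urlHash`, and `SeenUrls` are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Normalize a URL so trivially different spellings of the same page collide.
// These specific rules are assumptions, not the project's actual policy.
function normalizeUrl(raw: string): string {
  const url = new URL(raw); // the parser already lowercases the host
  url.hash = ""; // fragments never change the fetched document
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1); // treat /docs/ and /docs as one page
  }
  url.searchParams.sort(); // make query-parameter order irrelevant
  return url.toString();
}

// Hash of the normalized URL; SHA-256 is an assumed choice.
function urlHash(raw: string): string {
  return createHash("sha256").update(normalizeUrl(raw)).digest("hex");
}

class SeenUrls {
  private readonly hashes = new Set<string>();

  // Returns true the first time a URL is seen, false for duplicates.
  markIfNew(raw: string): boolean {
    const h = urlHash(raw);
    if (this.hashes.has(h)) return false;
    this.hashes.add(h);
    return true;
  }
}
```

The crawler's link-following loop would call `markIfNew` before enqueueing a page, so each distinct page is fetched once per run.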

3 Clean & chunk

Strip nav/footer, extract main text, split into ~500-token chunks with a 50-token overlap.

readability · tiktoken
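
The windowing arithmetic can be sketched as a plain sliding window over an already-tokenized page. This is a simplified illustration: `chunkTokens` is a hypothetical helper, it takes tokens from whatever tokenizer is in use, and it does not attempt the heading-aware splitting a real chunker would need.

```typescript
// Split a token sequence into windows of `size` tokens, where each window
// starts `size - overlap` tokens after the previous one.
function chunkTokens<T>(tokens: readonly T[], size = 500, overlap = 50): T[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}
```

With the defaults, a 1,200-token page yields windows starting at tokens 0, 450, and 900, so the last 50 tokens of one chunk reappear as the first 50 of the next.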

4 Embed chunks

Generate a 1536-dim vector per chunk; batch & retry on 429.

OpenAI text-embedding-3-small
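
One way "batch & retry on 429" could look, kept independent of any SDK. The embedding call is passed in as `embedBatch`, and the sketch assumes a rate-limit error carries a numeric `status` of 429. The batch size, attempt limit, and delays are illustrative defaults, and all names here are hypothetical.

```typescript
type RetryOptions = {
  maxAttempts?: number;
  baseDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
};

function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (batchSize <= 0) throw new RangeError("batchSize must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) out.push(items.slice(i, i + batchSize));
  return out;
}

// Retry only on HTTP 429; any other error is rethrown immediately.
async function retryOn429<T>(call: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const { maxAttempts = 5, baseDelayMs = 500 } = opts;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number } | null)?.status;
      if (status !== 429 || attempt >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // exponential backoff: 1x, 2x, 4x, ...
    }
  }
}

// Embed all texts batch by batch, preserving input order.
async function embedAll(
  texts: readonly string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 100,
  opts: RetryOptions = {},
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    vectors.push(...(await retryOn429(() => embedBatch(batch), opts)));
  }
  return vectors;
}
```

The sketch has no jitter and ignores any retry hint the server may send; both would be reasonable additions when many workers share one rate limit.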

5 Upsert to vector DB

{ text, embedding, url, title, crawledAt } → collection w/ vector index.

MongoDB Atlas · $vectorSearch index
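
A sketch of how a page's chunks could be turned into upsert operations, so that re-crawling a page overwrites its old chunks instead of duplicating them. The `chunkIndex` key is an assumption (the document shape above lists only text, embedding, url, title, and crawledAt), and `buildUpsertOps` is a hypothetical helper that only builds plain objects in the shape MongoDB bulk writes accept.

```typescript
interface ChunkInput {
  text: string;
  embedding: number[];
}

interface UpsertOp {
  updateOne: {
    filter: { url: string; chunkIndex: number }; // chunkIndex is an assumed key
    update: { $set: { text: string; embedding: number[]; title: string; crawledAt: Date } };
    upsert: true;
  };
}

function buildUpsertOps(
  page: { url: string; title: string },
  chunks: readonly ChunkInput[],
  crawledAt: Date = new Date(),
  dims = 1536, // must match the dimension declared on the vector index
): UpsertOp[] {
  return chunks.map((chunk, chunkIndex): UpsertOp => {
    if (chunk.embedding.length !== dims) {
      throw new Error(
        `chunk ${chunkIndex} of ${page.url}: expected ${dims} dims, got ${chunk.embedding.length}`,
      );
    }
    return {
      updateOne: {
        filter: { url: page.url, chunkIndex },
        update: { $set: { text: chunk.text, embedding: chunk.embedding, title: page.title, crawledAt } },
        upsert: true,
      },
    };
  });
}
```

The resulting array can be handed to the driver's `bulkWrite`. One gap worth noting: if a page shrinks between crawls, chunks with higher indexes linger, so a cleanup pass keyed on `crawledAt` would still be needed.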

Lane B · Query / Chat · run: per-message, <2s

1 Widget

User types a question in the floating iframe widget on abc.com.

iframe · postMessage · sessionStorage

2 POST /api/ask

Express endpoint; rate-limit, sanitize, attach userId & session.

Express · helmet · express-rate-limit

3 Embed question

Same model as ingest. 1536-dim query vector.

OpenAI text-embedding-3-small

4 Vector search

$vectorSearch → top-K (k=5) chunks by cosine similarity, score filter >0.75.

Atlas $vectorSearch · k=5
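
The retrieval step maps onto a three-stage aggregation pipeline, sketched here as a pure builder function. The index name `vector_index`, the `embedding` path, and the `numCandidates = 20 × k` default are assumptions; `buildVectorSearchPipeline` is a hypothetical helper.

```typescript
interface VectorSearchOptions {
  index?: string; // assumed index name
  path?: string; // field that holds the stored vector
  k?: number; // top-K results to return
  minScore?: number; // drop matches at or below this score
  numCandidates?: number; // ANN candidates considered before the top-K cut
  dims?: number; // expected query vector dimension
}

type VectorSearchPipeline = [
  { $vectorSearch: { index: string; path: string; queryVector: number[]; numCandidates: number; limit: number } },
  { $project: Record<string, unknown> },
  { $match: { score: { $gt: number } } },
];

function buildVectorSearchPipeline(
  queryVector: number[],
  opts: VectorSearchOptions = {},
): VectorSearchPipeline {
  const { index = "vector_index", path = "embedding", k = 5, minScore = 0.75, dims = 1536 } = opts;
  const numCandidates = opts.numCandidates ?? k * 20;
  if (queryVector.length !== dims) {
    throw new Error(`expected a ${dims}-dim query vector, got ${queryVector.length}`);
  }
  return [
    // Stage 1: approximate nearest-neighbour search against the vector index.
    { $vectorSearch: { index, path, queryVector, numCandidates, limit: k } },
    // Stage 2: keep the fields needed for the prompt and expose the similarity score.
    { $project: { _id: 0, text: 1, url: 1, title: 1, score: { $meta: "vectorSearchScore" } } },
    // Stage 3: apply the score filter from the step above.
    { $match: { score: { $gt: minScore } } },
  ];
}
```

The pipeline would be passed to `collection.aggregate(...)`. As far as I know, Atlas reports a normalized score in the 0 to 1 range for cosine indexes rather than the raw cosine, so the 0.75 cutoff applies to that normalized value. Because the filter runs after the top-K cut, fewer than five chunks (possibly none) can come back.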

5 LLM answer

Build prompt = system rules + retrieved chunks + user question. Stream to widget.

OpenAI gpt-4o-mini · SSE
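
A sketch of the two small pieces this step needs: assembling the prompt from retrieved chunks, and framing streamed tokens as Server-Sent Events. The system-rule wording, the `[n]` citation format, and the names `buildMessages` and `sseFrame` are assumptions for illustration, not the production prompt.

```typescript
interface RetrievedChunk {
  text: string;
  url: string;
  title: string;
  score: number;
}

type ChatMessage = { role: "system" | "user"; content: string };

// Prompt = system rules + numbered sources + the user's question.
function buildMessages(question: string, chunks: readonly RetrievedChunk[]): ChatMessage[] {
  const sources = chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.url})\n${c.text}`)
    .join("\n\n");
  const rules = [
    "Answer only from the numbered sources below.",
    "Cite sources inline like [1].",
    "If the sources do not contain the answer, say you do not know.",
  ].join(" ");
  return [
    { role: "system", content: `${rules}\n\nSources:\n${sources || "(none)"}` },
    { role: "user", content: question },
  ];
}

// Serialize one SSE frame. JSON encoding escapes newlines, so the payload
// always fits on a single `data:` line.
function sseFrame(data: string | Record<string, unknown>, event?: string): string {
  const head = event ? `event: ${event}\n` : "";
  return `${head}data: ${JSON.stringify(data)}\n\n`;
}
```

On the server, each token from the model stream would be written as `sseFrame({ token })` to a response sent with `Content-Type: text/event-stream`, followed by a final frame that tells the widget to stop listening.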

Legend: process node · OpenAI call · Atlas vector op · ›› arrows = data hand-off

How to read it: Two timelines that only meet at the Atlas vector index. Top lane is offline indexing; bottom lane is per-message retrieval. The slow, batchy work is done before any user is waiting.

What I built

Crawler: Puppeteer + Cheerio, follows internal links, dedupes by URL hash, respects robots.txt
Chunker: ~500-token semantic chunks with 50-token overlap, preserves heading context
Embeddings: OpenAI text-embedding-3-small (1536-dim), batched with 429-retry
Vector store: MongoDB Atlas $vectorSearch index, cosine similarity, per-tenant namespace
Query path: embed question → top-K=5 retrieval with score filter >0.75 → context assembly with citations
Inference: GPT-4o-mini, temperature 0.3, streamed via SSE to the embeddable widget
Tenant isolation: vector collections scoped per tenant, API key auth, usage tracking
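
To make the tenant-isolation item concrete, here is one hedged way an API key could resolve to a tenant-scoped collection. The `chunks_<tenantId>` naming scheme, the tenant-id pattern, hashing keys with SHA-256 before storing them, and the `TenantRegistry` class are all assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Restrict tenant ids so they are always safe to embed in a collection name.
const TENANT_ID = /^[a-z0-9_]{1,40}$/;

function collectionFor(tenantId: string): string {
  if (!TENANT_ID.test(tenantId)) throw new Error(`invalid tenant id: ${tenantId}`);
  return `chunks_${tenantId}`; // assumed naming scheme
}

function hashKey(apiKey: string): string {
  return createHash("sha256").update(apiKey).digest("hex");
}

class TenantRegistry {
  // Only key hashes are held, so a dump of this map does not reveal usable keys.
  private readonly byKeyHash = new Map<string, string>();

  register(tenantId: string, apiKey: string): void {
    collectionFor(tenantId); // validates the id up front
    this.byKeyHash.set(hashKey(apiKey), tenantId);
  }

  // Returns the tenant's collection name, or null for a missing or unknown key.
  resolve(apiKey: string | undefined): string | null {
    if (!apiKey) return null;
    const tenantId = this.byKeyHash.get(hashKey(apiKey));
    return tenantId ? collectionFor(tenantId) : null;
  }
}
```

Resolving the collection from the key on the server, rather than accepting a tenant id from the widget, means a request can never name another tenant's data.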

Tech stack

Node.js · TypeScript · Express · OpenAI GPT-4o-mini · OpenAI text-embedding-3 · MongoDB Atlas Vector Search · Puppeteer · Cheerio · SSE · Web Components

Outcome

Sub-2-second grounded answers across tested corpora. No hallucinations observed on test questions whose answers exist in the source data, and a clean refusal when they do not. The indexing pipeline scales to tens of thousands of pages per tenant.

Want this in your stack? Let's talk.

Email Athar