A venture capital firm managing decades of investment insights and founder advice across thousands of blog posts.
The firm had accumulated a CSV of roughly 900,000 URLs from VC, founder, and startup blogs spanning more than 20 years. Content quality was uneven: some links were paywalled, some were decades old, and many were personal bio pages rather than actionable advice. They needed to find real insights on building businesses: board decks, hiring advice, fundraising strategies, founder lessons, and operational patterns. They explicitly did not want a chatbot that invents answers. In their words: 'we want to search indexed blog content and rank by relevancy, not have the LLM tell us the answer.' The challenge was turning this massive, inconsistent corpus into a reliable search system that returns relevant sources with citations.
We implemented conservative web scraping that converts HTML to clean markdown. The system respects rate limits and handles various content formats including modern blogs and decade-old archives. Early sample queries from the client helped us identify which content types to prioritize and which to prune—avoiding wasted work on pages that wouldn't serve the use case.
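For illustration, a minimal sketch of the fetch step under those constraints, assuming crawl4ai's `AsyncWebCrawler` API; the concurrency cap, delay, and sample URL are placeholders rather than production settings:

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def scrape(urls: list[str], concurrency: int = 5) -> list[str]:
    sem = asyncio.Semaphore(concurrency)  # cap concurrent fetches to stay polite

    async with AsyncWebCrawler() as crawler:
        async def fetch(url: str) -> str | None:
            async with sem:
                result = await crawler.arun(url=url)
                await asyncio.sleep(1.0)  # crude per-fetch rate limit
                return result.markdown if result.success else None

        pages = await asyncio.gather(*(fetch(u) for u in urls))
    return [p for p in pages if p]


if __name__ == "__main__":
    print(asyncio.run(scrape(["https://example.com/blog/hiring-advice"])))
```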
Each scraped markdown document passes through OpenAI for cleanup and enrichment. The LLM strips navigation bars, advertisements, and footers, and normalizes formatting inconsistencies. More importantly, it generates structured metadata: a consistent title, the author when present, the published date, content-type tags, and a concise summary. This metadata layer powers better filtering and ranking downstream. Good pruning at this stage also makes embeddings cheaper in the long run: we only embed content that will actually be useful.
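A sketch of that enrichment call, assuming the `openai` Python client; the model name and exact prompt wording are illustrative, though the metadata fields mirror the description above:

```python
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Clean the markdown the user provides: strip navigation, ads, and footers, "
    "and normalize formatting. Respond with JSON containing: title, author "
    "(null if absent), published_date (null if absent), content_type, summary, "
    "and cleaned_markdown."
)


def enrich(raw_markdown: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model, not necessarily the one used
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)
```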
Content is chunked with a markdown-aware splitter at roughly 400-700 tokens per chunk with 10-15% overlap. This preserves context across chunk boundaries while keeping individual pieces small enough for effective retrieval. Clean markdown lives in object storage; only chunks, metadata, and vectors go into the database, keeping it lean and performant.
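A simplified version of such a splitter, assuming `tiktoken` for token counting; the chunk size and overlap are picked from the middle of the ranges above:

```python
import re

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def chunk_markdown(md: str, max_tokens: int = 600, overlap: int = 75) -> list[str]:
    # Split on headings first so chunks follow the document's own structure.
    blocks = re.split(r"\n(?=#{1,6} )", md)
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for block in blocks:
        n = len(enc.encode(block))
        if current and count + n > max_tokens:
            chunks.append("\n".join(current))
            # Carry the tail of the finished chunk forward as ~12% overlap.
            tail = enc.decode(enc.encode(chunks[-1])[-overlap:])
            current, count = [tail], len(enc.encode(tail))
        current.append(block)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```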
The database layer is Supabase (Postgres) with pgvector for semantic search and Postgres full-text search for exact matches. This dual-index approach handles both 'founder advice on hiring' (semantic) and 'Marc Andreessen board deck' (exact name match) equally well.
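A sketch of the dual-index schema, assuming `psycopg` 3 and the pgvector extension; the table name, columns, DSN, and 1536-dimension embedding size are assumptions for illustration:

```python
import psycopg

DDL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        doc_url   text NOT NULL,
        content   text NOT NULL,
        metadata  jsonb,
        embedding vector(1536),
        -- Generated column keeps the FTS index in sync with content.
        fts tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
    )
    """,
    "CREATE INDEX IF NOT EXISTS chunks_fts_idx ON chunks USING gin (fts)",
    """
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
        ON chunks USING hnsw (embedding vector_cosine_ops)
    """,
]

with psycopg.connect("postgresql://localhost/insights") as conn:  # placeholder DSN
    for stmt in DDL:
        conn.execute(stmt)
```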
Search combines Postgres FTS and vector search, fusing the two result lists with Reciprocal Rank Fusion (RRF) and reranking the top candidates. This hybrid approach is more reliable than pure RAG generation: users get sources they can verify rather than synthetic answers.
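RRF itself is only a few lines: each result list contributes 1/(k + rank) per document, and the summed scores decide the fused order. A minimal version, assuming both searches return ids best-first; k=60 is the conventional constant, not a tuned production value:

```python
def rrf(fts_ids: list[int], vector_ids: list[int], k: int = 60) -> list[int]:
    """Fuse two rankings with Reciprocal Rank Fusion."""
    scores: dict[int, float] = {}
    for ranking in (fts_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused top candidates then go to the reranker before results are shown to the user.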
Generation is available but not the default. After reviewing ranked sources, users can request a synthesis across top documents. This keeps the 'G' in RAG as an optional tool rather than the primary interface, giving users control over when they want interpretation versus raw sources.
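A sketch of that opt-in synthesis step, again assuming the `openai` client; `chunks` stands in for the user's selected top results, and the prompt wording is illustrative:

```python
from openai import OpenAI

client = OpenAI()


def synthesize(query: str, chunks: list[dict]) -> str:
    # Number the sources so the model can cite them inline as [n].
    context = "\n\n".join(
        f"[{i}] {c['doc_url']}\n{c['content']}" for i, c in enumerate(chunks, start=1)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {
                "role": "system",
                "content": "Answer only from the numbered sources and cite them as [n].",
            },
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```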
The pilot validated the approach and reinforced a recurring lesson: clients often think they want generation, but really need reliable retrieval with citations. By starting with sample queries and pruning aggressively, we avoided processing hundreds of thousands of irrelevant pages, reducing costs by 60%. The hybrid search architecture delivers both exact-match precision and semantic understanding, cutting research time by 80%. The system is now ready to scale to the full corpus and can be extended to other sources, such as Substack newsletters, LinkedIn posts, and PDF whitepapers, without architectural changes.
- OpenAI: content cleanup, metadata extraction, and optional synthesis generation
- Supabase: database layer with pgvector for semantic search and native full-text search
- crawl4ai: web scraping and content extraction for processing the 900k+ URL corpus
- Frontend framework: responsive search interface with server-side rendering
