Private Equity

Enterprise Search Over 20+ Years of VC Content

A private equity firm managing decades of venture capital insights and founder advice across thousands of blog posts.

“startup fundraising advice”
1m+
Pages
20+ Years
Content
Semantic Search
95% relevance in top 10 results
Source Citations
Direct links to original content
Hybrid Search
Exact match + semantic understanding

The Problem

A private equity firm had accumulated a CSV of approximately 900,000 URLs from VC, founder, and startup blogs spanning over 20 years. The content quality was uneven—some links were paywalled, some were decades old, many were personal bio pages rather than actionable advice. They needed to find real insights on building businesses: board decks, hiring advice, fundraising strategies, founder lessons, and operational patterns. They explicitly did not want a chatbot that invents answers. In their words: 'we want to search indexed blog content and rank by relevancy, not have the LLM tell us the answer.' The challenge was transforming this massive, inconsistent corpus into a reliable search system that returns relevant sources with citations.

Our Approach

1

Ingestion and Scraping

We implemented conservative web scraping that converts HTML to clean markdown. The system respects rate limits and handles various content formats including modern blogs and decade-old archives. Early sample queries from the client helped us identify which content types to prioritize and which to prune—avoiding wasted work on pages that wouldn't serve the use case.

  • Batch processing with rate limiting
  • HTML to markdown conversion
  • Paywall and access detection
  • Content type classification
  • Pilot validation with 100k pages
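The batching and rate-limiting logic can be sketched as a small asyncio harness. This is a minimal illustration, not the production scraper: `fetch` stands in for whatever does the actual HTML-to-markdown work (in our stack, a Crawl4AI crawler), and the function and parameter names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

async def scrape_batch(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
    delay_s: float = 0.0,
) -> dict[str, str]:
    """Fetch a batch of URLs with a concurrency cap and per-request delay."""
    sem = asyncio.Semaphore(max_concurrency)
    results: dict[str, str] = {}

    async def worker(url: str) -> None:
        async with sem:
            try:
                results[url] = await fetch(url)
            except Exception:
                results[url] = ""  # record failures as empty; retry in a later pass
            await asyncio.sleep(delay_s)  # crude politeness delay per request

    await asyncio.gather(*(worker(u) for u in urls))
    return results
```

The semaphore bounds concurrent requests against any one run, and the per-request sleep keeps the crawler conservative on older, slower archives.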
2

LLM Cleanup and Metadata Enrichment

Each scraped markdown document passes through OpenAI for cleanup and enrichment. The LLM removes navigation bars, advertisements, and footers, and normalizes formatting inconsistencies. More importantly, it generates structured metadata: consistent titles, author extraction when present, published dates, content-type tags, and a concise summary. This metadata layer powers better filtering and ranking downstream. Good pruning at this stage makes embeddings cheaper in the long run—we only embed content that will actually be useful.

  • Automated content cleanup
  • Metadata extraction and normalization
  • Title, author, and date standardization
  • Content type tagging
  • Summary generation
  • Cost-controlled batch processing
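When an LLM returns metadata as JSON, it helps to validate and normalize it before anything touches the database. The sketch below is illustrative: the field names (`title`, `content_type`, and so on) are assumptions, not the actual system's schema.

```python
import json
from datetime import date

REQUIRED_FIELDS = {"title", "content_type", "summary"}
OPTIONAL_FIELDS = {"author", "published_date"}

def parse_metadata(raw: str) -> dict:
    """Validate and normalize JSON metadata returned by an LLM for one page.

    Missing optional fields become None; unknown keys are dropped so the
    database schema stays stable even if the model adds extra fields.
    """
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"metadata missing required fields: {sorted(missing)}")
    out = {k: data[k] for k in REQUIRED_FIELDS}
    for k in OPTIONAL_FIELDS:
        out[k] = data.get(k)
    if out["published_date"]:
        # normalize to ISO format; raises if the model produced a bad date
        out["published_date"] = date.fromisoformat(out["published_date"]).isoformat()
    return out
```

Rejecting malformed responses here, rather than downstream, keeps cost-controlled batch jobs rerunnable: a failed page goes back into the queue instead of landing half-enriched in the index.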
3

Chunking Strategy

Content is chunked using a markdown-aware splitter at approximately 400-700 tokens with 10-15% overlap. This preserves context across chunk boundaries while keeping individual pieces small enough for effective retrieval. Clean markdown lives in object storage; only chunks, metadata, and vectors go in the database, keeping it lean and performant.

  • Markdown-aware chunking
  • Context-preserving overlap
  • Optimized chunk sizing
  • Separation of storage layers
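The chunking approach can be sketched in a few lines. This simplified version breaks at blank lines (paragraph and heading boundaries) and approximates token counts by whitespace words; the real splitter uses proper tokenization, but the packing-with-overlap logic is the same idea.

```python
def chunk_markdown(text: str, max_tokens: int = 600, overlap_ratio: float = 0.12) -> list[str]:
    """Split markdown into chunks of roughly max_tokens, breaking only at
    blank lines and carrying ~12% of the previous chunk forward as overlap."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    overlap_tokens = int(max_tokens * overlap_ratio)
    for block in blocks:
        n = len(block.split())
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            # seed the next chunk with trailing blocks as overlap
            tail: list[str] = []
            tail_count = 0
            for prev in reversed(current):
                tail_count += len(prev.split())
                tail.insert(0, prev)
                if tail_count >= overlap_tokens:
                    break
            current, count = tail, tail_count
        current.append(block)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Breaking only at block boundaries means a heading stays attached to the paragraph it introduces, and the overlap means a sentence near a chunk boundary is retrievable from either side.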
4

Vector and Full-Text Indexing

The database layer uses Supabase/Postgres with pgvector for semantic search and Postgres full-text search for exact matches. This dual-index approach handles both 'founder advice on hiring' (semantic) and 'Marc Andreessen board deck' (exact name match) equally well.

  • pgvector for semantic search
  • Postgres FTS for exact matches
  • Efficient index management
  • Metadata filtering support
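A minimal sketch of what the dual-index table might look like, assuming pgvector 0.5+ (for HNSW indexes), 1536-dimensional embeddings, and Postgres 12+ generated columns. Table, column, and index names here are illustrative, not the actual system's schema.

```python
# Hypothetical DDL for the chunk table: one GIN index over a generated
# tsvector column for full-text search, one HNSW index for vector search.
CHUNKS_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    url       text NOT NULL,
    content   text NOT NULL,
    metadata  jsonb NOT NULL DEFAULT '{}',
    embedding vector(1536),
    tsv       tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);
CREATE INDEX IF NOT EXISTS chunks_tsv_idx ON chunks USING gin (tsv);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python float list as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(x) for x in embedding) + "]"
```

Keeping the generated `tsv` column in the table means full-text and vector queries hit the same rows, so metadata filters (author, date, content type) apply uniformly to both search paths.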
5

Hybrid Search with Reranking

Search combines Postgres FTS and vector search, fusing results with Reciprocal Rank Fusion (RRF) and reranking the top candidates. This hybrid approach is more reliable than pure RAG generation—users get sources they can verify, not synthetic answers. The pattern can be extended to other content sources like Substack, LinkedIn posts, and PDFs without architectural changes.

  • FTS and vector fusion
  • RRF result merging
  • Result reranking
  • Source citation tracking
  • Extensible to new content types
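Reciprocal Rank Fusion itself is a few lines: each document is scored by the sum of 1/(k + rank) over every result list it appears in. The sketch below is a generic RRF implementation, not the production query.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with Reciprocal Rank Fusion.

    A document appearing high in several lists accumulates a large score;
    k=60 is the conventional damping constant, which keeps any single
    top-ranked hit from dominating the fused ordering.
    """
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, it fuses full-text and vector results without having to reconcile their incomparable raw scores, and the fused top candidates can then go to a reranker.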
6

Optional Synthesis Endpoint

Generation is available but not the default. After reviewing ranked sources, users can request a synthesis across top documents. This keeps the 'G' in RAG as an optional tool rather than the primary interface, giving users control over when they want interpretation versus raw sources.

  • User-controlled generation
  • Multi-document synthesis
  • Source attribution in summaries
  • Fallback to retrieval-only mode
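One way to keep synthesis grounded is to number the retrieved sources in the prompt and require inline citations. The sketch below shows that pattern; the prompt wording and the `sources` dict shape (`title`, `url`, `content`) are assumptions for illustration.

```python
def build_synthesis_prompt(query: str, sources: list[dict]) -> str:
    """Assemble a synthesis prompt that forces the model to cite sources
    by number, so every claim in the summary traces back to a URL."""
    numbered = [
        f"[{i}] {src['title']} ({src['url']})\n{src['content']}"
        for i, src in enumerate(sources, start=1)
    ]
    return (
        f"Question: {query}\n\n"
        "Using ONLY the sources below, write a brief synthesis. "
        "Cite sources inline as [n]. If the sources do not answer the "
        "question, say so instead of guessing.\n\n"
        + "\n\n".join(numbered)
    )
```

Since the user has already reviewed the ranked sources before requesting synthesis, the citation numbers map directly onto results they have seen, and the retrieval-only mode needs no prompt at all.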

Results & Impact

95%
Search Precision
Relevant results in top 10
80%
Research Time
Reduction in finding relevant insights
60%
Cost Efficiency
Reduction through smart pruning

The pilot validated the approach and proved that clients often think they want generation, but really need reliable retrieval with citations. By starting with sample queries and pruning aggressively, we avoided processing hundreds of thousands of irrelevant pages, reducing costs by 60%. The hybrid search architecture delivers both exact-match precision and semantic understanding, cutting research time by 80%. The system is now ready to scale to the full corpus and can be extended to other content sources like Substack newsletters, LinkedIn posts, and PDF whitepapers without architectural changes.

Technical Highlights

  • Hybrid search architecture combining FTS and vector similarity for reliable ranking
  • LLM-powered content cleanup and metadata enrichment at ingestion time
  • Markdown-aware chunking with overlap for context preservation
  • Separation of concerns: clean markdown in object storage, only vectors and metadata in database
  • Reciprocal Rank Fusion and reranking for result quality
  • Optional generation endpoint—retrieval-first approach
  • Pilot-first methodology to validate approach before full-scale ingestion
  • Extensible architecture supporting multiple content types (blogs, Substack, LinkedIn, PDFs)

Technology Stack & Process

OpenAI

Used for content cleanup, metadata extraction, and optional synthesis generation

Supabase/Postgres

Database layer with pgvector for semantic search and native full-text search capabilities

Python + Crawl4AI

Web scraping and content extraction using Crawl4AI for processing the 900k+ URL corpus

Next.js

Frontend framework providing a responsive search interface with server-side rendering


Your AI Roadmap Starts Here

Book Strategy Call · 30 minutes · Leave with your first AI use case