The easy part of enterprise RAG is the chat interface. The hard part, still in 2026, is everything upstream: which documents to index, how to chunk, which embedding model and re-ranker to trust, and how to know when the system is wrong. Since Jina AI's September 2024 late chunking paper [1] and Voyage 4's Mixture-of-Experts release in January 2026 [2], every part of this pipeline has moved. This pillar walks the whole thing and skips what every tutorial already covers.
Two editorial positions up front. Chunk strategy matters more than model choice on most corpora. And evaluation, specifically a named, versioned, nightly-run RAGAS-style eval set [3], is what separates RAG teams that ship from RAG teams that argue about which embedding model is best for six months. Everything below assumes the eval discipline is in place.
Chunk strategy: fixed, semantic, and late
In the public RAG post-mortems we have read (engineering blogs from LinkedIn, Databricks, Pinecone, LlamaIndex, and dozens of smaller teams), chunking is by a wide margin the most-cited root cause of poor retrieval: splits that broke semantic units, splits that were too large and diluted relevance scores, splits that orphaned a reference from its antecedent. The fix is almost never a different embedding model. The fix is a better splitter.
Three families of chunking dominate in 2026. Fixed-size recursive-character is the default in most frameworks and works fine for pure prose. Semantic chunking (cluster adjacent sentences by cosine distance and cut at local maxima) improves retrieval on technical documents and is the upgrade most teams take first. Late chunking, published by Jina AI in September 2024 as arXiv 2409.04701 [1], is the interesting new primitive: embed the entire long document first with a long-context model, then split the token sequence into chunks and mean-pool each slice. Because every chunk embedding saw the full surrounding context, long-distance dependencies survive.
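The mechanics of late chunking reduce to a mean-pool over slices of already-contextualised token embeddings. A minimal sketch, with a numpy array standing in for the output of a real long-context embedding model (the function and variable names are illustrative, not from the paper's reference code):

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray,
               boundaries: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualised token embeddings over each chunk span.

    `token_embeddings` is the (n_tokens, dim) output of a long-context
    embedding model run over the WHOLE document, so every row already
    carries full-document context before any split happens.
    """
    return np.stack([token_embeddings[start:end].mean(axis=0)
                     for start, end in boundaries])

# Toy example: 10 "tokens" in 4 dimensions, split into two chunks.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
chunks = late_chunk(tokens, [(0, 6), (6, 10)])
print(chunks.shape)  # (2, 4)
```

The contrast with classic chunk-then-embed is entirely in where the split happens: after the model has seen the document, not before.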
Format-aware splitting is still table stakes. PDFs with tables get table-aware chunking; code gets AST-aware chunking; markdown gets section-aware chunking. Generic recursive-character splitters work for prose and only prose. The practical recipe: semantic chunking for prose, late chunking when retrieval quality is the bottleneck and you can afford a long-context embedding model, format-specialised splitters everywhere else.
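The semantic-chunking cut rule mentioned above (cut where the cosine distance between adjacent sentence embeddings spikes) fits in a few lines. A sketch under the assumption of pre-computed sentence embeddings; the percentile threshold is one common heuristic, not the only one:

```python
import numpy as np

def semantic_breakpoints(sentence_embs: np.ndarray,
                         percentile: float = 90.0) -> list[int]:
    """Return sentence indices after which to cut: positions where the
    cosine distance between adjacent sentence embeddings spikes."""
    normed = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    # Cosine distance between each sentence and the next one.
    dists = 1.0 - np.sum(normed[:-1] * normed[1:], axis=1)
    threshold = np.percentile(dists, percentile)
    return [i for i, d in enumerate(dists) if d >= threshold]

# Six toy sentence embeddings with an obvious topic shift after index 2.
topic_a = np.array([[1.0, 0.05], [0.98, 0.10], [1.0, 0.0]])
topic_b = np.array([[0.05, 1.0], [0.0, 0.97], [0.10, 1.0]])
embs = np.vstack([topic_a, topic_b])
print(semantic_breakpoints(embs))  # [2]: cut after sentence 2, at the shift
```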
"Late chunking is the first genuinely new idea in RAG retrieval in two years. Everything else is tuning."
The embedding model landscape in April 2026
The embedding field consolidated around three vendors plus a strong open-source tail. Voyage AI shipped Voyage 4 on January 15, 2026. It is a Mixture-of-Experts architecture with a shared embedding space across 4-large, 4, and 4-lite, which means you can index documents with the large model and query with the lite model without re-embedding [2]. Voyage 3-large already topped Cohere v4 and OpenAI 3-large on a number of MTEB tasks in January 2025 [8].
Cohere Embed v4 is multimodal, embedding text, images, and mixed-modality content into the same vector space, with a 128,000-token context window, which is the feature that makes it competitive for enterprise document RAG on long PDFs and scanned tables [4]. OpenAI text-embedding-3-large ($0.13 per million tokens) remains a common default, and the Matryoshka variable-dimension support introduced in 2024 is now standard in the category [5].
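In practice, Matryoshka support means you can keep a prefix of the vector and re-normalise, trading a little recall for a lot of storage. A minimal numpy sketch under that assumption (the 3072-dimension figure matches text-embedding-3-large's default output size; the helper name is ours):

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalise to unit length,
    the usage pattern Matryoshka-trained embeddings are designed for."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

full = np.random.default_rng(1).normal(size=3072)
full /= np.linalg.norm(full)        # embedding APIs return unit vectors
short = truncate_matryoshka(full, 256)
print(short.shape)  # (256,)
```

The same trick works at query time, which is what makes variable-dimension indexes cheap to experiment with.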
Open-source defaults for air-gapped deployments: BGE-M3 (multilingual, hybrid sparse+dense) and the Jina v3 family (long-context, late-chunking-ready). MTEB scores as of Q1 2026 cluster Voyage 4 > Cohere v4 > OpenAI 3-large > BGE-M3; enterprise workloads vary, but the ordering is consistent enough to use as a starting default.
| Model | Vendor / release | Context | Key differentiator |
|---|---|---|---|
| Voyage 4-large | Voyage AI, Jan 15 2026 | 32K | MoE architecture; shared embedding space with lite variants |
| Cohere Embed v4 | Cohere, 2025 | 128K | Multimodal text+image in one vector space |
| OpenAI text-embedding-3-large | OpenAI, 2024 | 8K | Matryoshka variable dimensions; $0.13/M tokens |
| Jina Embeddings v3 | Jina AI, 2024 | 8K | Late-chunking-ready; open source weights |
| BGE-M3 | BAAI, 2024 | 8K | Open source; hybrid sparse + dense; multilingual |
Re-rankers are the cheapest big improvement
The re-ranker is a second pass over retrieved candidates: take the top 50 from your vector search, re-score them with a cross-encoder, and keep the top 5. It is not glamorous. It is, empirically, the single highest-ROI component most teams skip.
Cohere Rerank 3.5 is the default procurement path as of April 2026, available in Amazon Bedrock through the Rerank API and in Pinecone and Elasticsearch integrations [6]. Cohere's own benchmarks show the largest improvements over Rerank v2 on constrained queries and semi-structured JSON; on generic prose benchmarks the lift is more modest. Rerank 3 Nimble is 3-5x faster than Rerank 3 with comparable accuracy on BEIR, worth evaluating when latency budget is tight.
Alternatives: Jina reranker-v2 (open weights, competitive on short queries), mixedbread-ai/mxbai-rerank-large-v1 (open source, strong on multilingual), and custom fine-tuned cross-encoders where the corpus is weird enough to justify the training investment. Latency cost for all commercial re-rankers is measured in a few hundred milliseconds per query, invisible to the end user on a conversational interface.
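The two-pass shape is the same regardless of vendor: over-retrieve, jointly score each (query, document) pair, keep the head. A sketch with a toy token-overlap scorer standing in for a real cross-encoder or Rerank API call (everything here is illustrative):

```python
from typing import Callable

def rerank(query: str,
           candidates: list[str],
           score: Callable[[str, str], float],
           keep: int = 5) -> list[str]:
    """Second-pass re-ranking: score each (query, doc) pair jointly
    and keep only the best `keep` documents."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:keep]

# Placeholder scorer: token overlap stands in for a cross-encoder here.
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["late chunking embeds the whole document first",
        "pricing page for the enterprise tier",
        "chunking strategy for long documents"]
print(rerank("how does late chunking work", docs, overlap_score, keep=2))
```

Swapping `overlap_score` for a cross-encoder (or an API call) changes nothing about the pipeline shape, which is why the upgrade is so cheap to trial.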
"Model choice is the conversation everyone wants to have. Chunking and re-ranking are the changes that actually move the benchmark."
Evaluation with RAGAS is the whole game
Teams ship or stall on evals. Teams with a named eval set (a specific corpus of queries with expected answers, run nightly) can answer "is this better?" with a number. Teams without one argue and wait for a product manager to decide.
The open-source framework most teams standardise on is RAGAS (arXiv 2309.15217). It offers reference-free metrics including Faithfulness, Context Precision, Context Recall, Context Entities Recall, Response Relevancy, Answer Accuracy, Factual Correctness, and Semantic Similarity, plus agentic metrics like Tool Call F1 and Agent Goal Accuracy [3]. Faithfulness and Context Precision are the two that disagree most often in practice. A response can be fully grounded in retrieved context that was itself irrelevant. Track both or fool yourself.
Scale the eval set with the program. Minimum viable: 100 queries with graded answers. Mature: 2,000 queries covering every failure mode observed in production. Update weekly, run nightly, post the delta to a Slack channel. Every model change, prompt change, and chunker change ships with an eval-delta comment or it does not ship. LLM-as-judge saves annotation cost but drifts; calibrate quarterly against human graders.
1. **Start with failures.** Seed the eval set with queries users complained about, not queries that worked. Bias towards the corpus edges.
2. **Use four tiers, not binary.** Binary correct/incorrect loses signal. Four tiers (incorrect / partial / correct-with-caveat / correct) calibrates in two hours of annotator training.
3. **Track Faithfulness and Context Precision separately.** A high-Faithfulness / low-Precision pair flags groundedness against the wrong evidence. Common early-stage failure.
4. **LLM-as-judge is fine, once calibrated.** Calibrate quarterly against human graders. Model drift between GPT-4-class versions is real and can flip your sign.
5. **Publish the delta.** Every change posts an eval-delta to the channel. No change ships silently; that rule is what separates real programs from theatre.
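The publish-the-delta gate is a small piece of code: compare two eval runs metric by metric and block the ship on any regression beyond tolerance. A sketch with illustrative metric names and tolerance:

```python
def eval_delta(baseline: dict[str, float],
               candidate: dict[str, float],
               tolerance: float = 0.01) -> tuple[bool, list[str]]:
    """Compare two eval runs metric-by-metric.
    Returns (ship_ok, human-readable report lines)."""
    report, ship_ok = [], True
    for metric in sorted(baseline):
        delta = candidate[metric] - baseline[metric]
        report.append(f"{metric}: {baseline[metric]:.3f} -> "
                      f"{candidate[metric]:.3f} ({delta:+.3f})")
        if delta < -tolerance:  # regression beyond tolerance blocks the ship
            ship_ok = False
    return ship_ok, report

ok, lines = eval_delta(
    {"faithfulness": 0.91, "context_precision": 0.78},
    {"faithfulness": 0.93, "context_precision": 0.74},
)
print(ok)  # False: faithfulness improved, but context_precision regressed
```

The example is deliberately the Faithfulness-up / Precision-down case: a change can look like a win on one metric while quietly shipping a retrieval regression.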
Vector databases in production, April 2026
The managed shortlist settled on three. Pinecone is the managed-simplicity option (SOC 2 Type II, ISO 27001, GDPR-aligned, with an external HIPAA attestation), and in BYOC mode clusters run inside the customer's own AWS, Azure, or GCP account for hard isolation [7]. Weaviate is the flexible hybrid-search powerhouse; Weaviate Enterprise Cloud gained HIPAA compliance on AWS in 2025 and ships tenant-aware classes, lifecycle endpoints, and ACLs for multitenant deployments.
Qdrant is the performance-first option: Rust-native, SOC 2 Type II, with a markedly advanced filtering engine that lets complex metadata queries execute before the vector search. At 10M+ vectors with concurrent filtered queries, third-party benchmarks put Qdrant at 2-5x higher QPS than Weaviate on equivalent hardware at the same recall target. Open-source alternatives worth evaluating: Milvus (CNCF-graduated), pgvector for teams already on Postgres, and Chroma for prototyping.
Hybrid search, which combines vector similarity, keyword search, and metadata filters in one query, is the feature to verify in procurement. Weaviate and Qdrant include it by default; Pinecone added sparse-dense hybrid support through native integrations. If you need HIPAA, the shortlist narrows quickly; if you need on-prem isolation, Qdrant and Milvus are the common answers.
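One common way hybrid search fuses its component rankings is reciprocal rank fusion (RRF); whether a given database uses RRF or weighted score fusion internally is vendor-specific, so treat this as a sketch of the idea rather than any product's implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each ranked list contributes
    1 / (k + rank) per document; documents high in ANY list win."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector-similarity order
sparse = ["doc_b", "doc_d", "doc_a"]   # keyword (BM25) order
print(rrf_fuse([dense, sparse]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it places well in both lists, which is exactly the behaviour hybrid search buys you over either retriever alone.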
| Database | Model | Compliance | Strength |
|---|---|---|---|
| Pinecone | Fully managed; BYOC | SOC 2 II, ISO 27001, HIPAA attested, GDPR | Zero-ops managed; fast time-to-value |
| Weaviate | Managed + self-host | SOC 2 II, HIPAA on AWS (2025) | Hybrid search native; strong multitenancy |
| Qdrant | Managed + self-host | SOC 2 II; HIPAA-ready | Rust performance; best filtered-query QPS |
| Milvus | Self-host (CNCF) | Customer-configurable | Open source; GPU-accelerated at scale |
| pgvector | Postgres extension | Inherited from Postgres | Stay on existing database; simple ops |
When agentic RAG earns its seat
Agentic RAG is retrieval placed inside a planning loop. The agent decides what to retrieve, judges whether the retrieval was good enough, and retrieves again (or chooses a different index) if not. The cost is latency and token spend; the benefit is accuracy on multi-hop questions and ambiguous intent.
Add the agentic layer when the workload asks for it. Four triggers. Users ask follow-ups that depend on the previous retrieval. The correct answer requires joining two corpora. Intent is ambiguous and you would rather the system clarify than guess. Retrieval must decide among multiple typed sources (structured DB, unstructured document store, API). For simple lookups, agentic RAG is a tax: three extra seconds and a bigger token bill for no accuracy gain. The trigger list is short, which is the part most steering committees forget to check before shipping.
The governance implication: agentic RAG exposes new failure modes (retrieval loops, prompt injection via retrieved content) that OWASP LLM06 Excessive Agency and LLM08 Vector and Embedding Weaknesses specifically call out. Wire agentic RAG through the MCP gateway (see our MCP and AI Governance pillars) so tool calls are auditable and policy-governed.
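The planning loop itself is small; the governance-relevant detail is the hard cap on rounds, which is what prevents the retrieval-loop failure mode. A toy sketch in which every callable is a stand-in for a real component (retriever, LLM judge, query rewriter):

```python
from typing import Callable

def agentic_retrieve(query: str,
                     retrieve: Callable[[str], list[str]],
                     good_enough: Callable[[str, list[str]], bool],
                     reformulate: Callable[[str], str],
                     max_rounds: int = 3) -> list[str]:
    """Retrieval inside a planning loop: judge each result set and retry
    with a reformulated query. The hard cap guards against the retrieval
    loops flagged under OWASP LLM06 Excessive Agency."""
    docs: list[str] = []
    for _ in range(max_rounds):
        docs = retrieve(query)
        if good_enough(query, docs):
            return docs
        query = reformulate(query)
    return docs  # best effort after the cap

# Toy wiring: the judge only accepts results mentioning "expanded".
corpus = {"q1": ["unrelated doc"],
          "q1 expanded": ["doc about expanded topic"]}
docs = agentic_retrieve(
    "q1",
    retrieve=lambda q: corpus.get(q, []),
    good_enough=lambda q, d: any("expanded" in x for x in d),
    reformulate=lambda q: q + " expanded",
)
print(docs)  # ['doc about expanded topic']
```

For a simple lookup the loop exits on round one and you have paid the tax for nothing, which is the steering-committee point above in executable form.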
Frequently asked
**What is late chunking in RAG?**
Late chunking flips the order of two steps: embed the whole long document first with a long-context model, then split the token sequence into chunks and mean-pool each slice. Jina AI published the technique on September 7, 2024 as arXiv 2409.04701 [1]. Because every chunk embedding saw the full surrounding context, long-distance dependencies survive; losing them is the exact failure mode that sinks classic chunk-then-embed pipelines. It is the first genuinely new idea in retrieval in two years, which is why it was the one people argued about at conferences.

**Which embedding model should I use for enterprise RAG in 2026?**
Three commercial defaults, plus open source for the air-gapped case. Voyage 4-large (January 15, 2026; MoE with a shared embedding space across the family) [2] is the current MTEB leader. Cohere Embed v4 wins if you need multimodal (text plus image) at a 128K context window [4]. OpenAI text-embedding-3-large remains the safe default at $0.13 per million tokens [5]. For air-gapped: BGE-M3 or Jina v3. The ordering is consistent enough across enterprise corpora to use as a starting point, which is the step teams skip when they argue about model choice for six months.

**Do I need a re-ranker?**
Almost certainly yes. It is the highest-ROI upgrade most RAG teams skip. Take the top 50 from your vector search, re-score with a cross-encoder, keep the top 5. Cohere Rerank 3.5 (via Amazon Bedrock's Rerank API) is the default procurement path; the lift is largest on constrained and semi-structured queries [6]. Latency cost is a few hundred milliseconds per query, invisible on a conversational interface. The part RAG teams admit only privately: they spent a quarter arguing about embedding models before trying a re-ranker.

**How should I evaluate a production RAG system?**
Use RAGAS as the baseline framework (arXiv 2309.15217) [3]. Track four metrics at minimum: Faithfulness, Context Precision, Context Recall, and Response Relevancy. A minimum-viable eval set is 100 graded queries; mature looks more like 2,000, covering every observed failure mode. Run nightly and post deltas to a public channel. No change ships without an eval-delta, the one rule that separates real programs from theatre.

**Pinecone, Weaviate, or Qdrant?**
Three answers, one question each. Pinecone for zero-ops managed with SOC 2 Type II, ISO 27001, and HIPAA attestation. Weaviate for hybrid-search-native with strong multitenancy and HIPAA-on-AWS added in 2025. Qdrant for raw performance: Rust-native, 2 to 5x higher QPS than Weaviate at 10M+ vectors on filtered queries per third-party 2025 benchmarks [7]. Milvus and pgvector are the open-source names worth a short-list. If you need HIPAA, the list narrows fast. If you need on-prem isolation, the two open-source names are usually where teams land.

**When does agentic RAG earn its seat?**
Four situations. Users ask follow-ups that depend on previous retrievals. The correct answer requires joining two corpora. Intent is ambiguous and you would rather the system clarify than guess. Retrieval must choose among multiple typed sources (structured DB, unstructured store, API). For simple lookups, agentic RAG is a tax: three extra seconds and a bigger token bill for no accuracy gain. The trap teams walk into: wiring agentic retrieval on top of a single store that never needed it. See the agentic RAG glossary entry.