The easy part of enterprise RAG is the chat interface. The hard part, still in 2026, is everything upstream: which documents to index, how to chunk, which embedding model and re-ranker to trust, and how to know when the system is wrong. Since Jina AI's September 2024 late chunking paper [1] and Voyage 4's Mixture-of-Experts release in January 2026 [2], every part of this pipeline has moved. This pillar walks the whole thing and skips what every tutorial already covers.
Two editorial positions up front. Chunk strategy matters more than model choice on most corpora. And evaluation, specifically a named, versioned, nightly-run RAGAS-style eval set [3], is what separates RAG teams that ship from RAG teams that argue about which embedding model is best for six months. Everything below assumes the eval discipline is in place.
Chunk strategy: fixed, semantic, and late
In the public RAG post-mortems we have read (engineering blogs from LinkedIn, Databricks, Pinecone, LlamaIndex, and dozens of smaller teams), chunking is by a wide margin the most-cited root cause of poor retrieval: splits that broke semantic units, splits that were too large and diluted relevance scores, splits that orphaned a reference from its antecedent. The fix is almost never a different embedding model. The fix is a better splitter.
Three families of chunking dominate in 2026. Fixed-size recursive-character is the default in most frameworks and works fine for pure prose. Semantic chunking (cluster adjacent sentences by cosine distance and cut at local maxima) improves retrieval on technical documents and is the upgrade most teams take first. Late chunking, published by Jina AI in September 2024 as arXiv 2409.04701 [1], is the interesting new primitive: embed the entire long document first with a long-context model, then split the token sequence into chunks and mean-pool each slice. Because every chunk embedding saw the full surrounding context, long-distance dependencies survive.
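The mechanics of late chunking reduce to a mean-pool over slices of already-contextualised token embeddings. A minimal sketch, with a numpy array standing in for the output of a real long-context embedding model (the function and variable names are illustrative, not from the paper's reference code):

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray,
               boundaries: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualised token embeddings over each chunk span.

    `token_embeddings` is the (n_tokens, dim) output of a long-context
    embedding model run over the WHOLE document, so every row already
    carries full-document context before any split happens.
    """
    return np.stack([token_embeddings[start:end].mean(axis=0)
                     for start, end in boundaries])

# Toy example: 10 "tokens" in 4 dimensions, split into two chunks.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
chunks = late_chunk(tokens, [(0, 6), (6, 10)])
print(chunks.shape)  # (2, 4)
```

The contrast with classic chunk-then-embed is entirely in where the split happens: after the model has seen the document, not before.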
Format-aware splitting is still table stakes. PDFs with tables get table-aware chunking; code gets AST-aware chunking; markdown gets section-aware chunking. Generic recursive-character splitters work for prose and only prose. The practical recipe: semantic chunking for prose, late chunking when retrieval quality is the bottleneck and you can afford a long-context embedding model, format-specialised splitters everywhere else.
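The semantic-chunking cut rule mentioned above (cut where the cosine distance between adjacent sentence embeddings spikes) fits in a few lines. A sketch under the assumption of pre-computed sentence embeddings; the percentile threshold is one common heuristic, not the only one:

```python
import numpy as np

def semantic_breakpoints(sentence_embs: np.ndarray,
                         percentile: float = 90.0) -> list[int]:
    """Return sentence indices after which to cut: positions where the
    cosine distance between adjacent sentence embeddings spikes."""
    normed = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    # Cosine distance between each sentence and the next one.
    dists = 1.0 - np.sum(normed[:-1] * normed[1:], axis=1)
    threshold = np.percentile(dists, percentile)
    return [i for i, d in enumerate(dists) if d >= threshold]

# Six toy sentence embeddings with an obvious topic shift after index 2.
topic_a = np.array([[1.0, 0.05], [0.98, 0.10], [1.0, 0.0]])
topic_b = np.array([[0.05, 1.0], [0.0, 0.97], [0.10, 1.0]])
embs = np.vstack([topic_a, topic_b])
print(semantic_breakpoints(embs))  # [2]: cut after sentence 2, at the shift
```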
"Late chunking is the first genuinely new idea in RAG retrieval in two years. Everything else is tuning."
The embedding model landscape in April 2026
The embedding field consolidated around three vendors plus a strong open-source tail. Voyage AI shipped Voyage 4 on January 15, 2026. It is a Mixture-of-Experts architecture with a shared embedding space across 4-large, 4, and 4-lite, which means you can index documents with the large model and query with the lite model without re-embedding [2]. Voyage 3-large already topped Cohere v4 and OpenAI 3-large on a number of MTEB tasks in January 2025 [8].
Cohere Embed v4 is multimodal, embedding text, images, and mixed-modality content into the same vector space, with a 128,000-token context window, which is the feature that makes it competitive for enterprise document RAG on long PDFs and scanned tables [4]. OpenAI text-embedding-3-large ($0.13 per million tokens) remains a common default, and the Matryoshka variable-dimension support introduced in 2024 is now standard in the category [5].
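In practice, Matryoshka support means you can keep a prefix of the vector and re-normalise, trading a little recall for a lot of storage. A minimal numpy sketch under that assumption (the 3072-dimension figure matches text-embedding-3-large's default output size; the helper name is ours):

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalise to unit length,
    the usage pattern Matryoshka-trained embeddings are designed for."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

full = np.random.default_rng(1).normal(size=3072)
full /= np.linalg.norm(full)        # embedding APIs return unit vectors
short = truncate_matryoshka(full, 256)
print(short.shape)  # (256,)
```

The same trick works at query time, which is what makes variable-dimension indexes cheap to experiment with.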
Open-source defaults for air-gapped deployments: BGE-M3 (multilingual, hybrid sparse+dense) and the Jina v3 family (long-context, late-chunking-ready). MTEB scores as of Q1 2026 cluster Voyage 4 > Cohere v4 > OpenAI 3-large > BGE-M3; enterprise workloads vary, but the ordering is consistent enough to use as a starting default.
| Model | Vendor / release | Context | Key differentiator |
|---|---|---|---|
| Voyage 4-large | Voyage AI, Jan 15 2026 | 32K | MoE architecture; shared embedding space with lite variants |
| Cohere Embed v4 | Cohere, 2025 | 128K | Multimodal text+image in one vector space |
| OpenAI text-embedding-3-large | OpenAI, 2024 | 8K | Matryoshka variable dimensions; $0.13/M tokens |
| Jina Embeddings v3 | Jina AI, 2024 | 8K | Late-chunking-ready; open source weights |
| BGE-M3 | BAAI, 2024 | 8K | Open source; hybrid sparse + dense; multilingual |
Re-rankers are the cheapest big improvement
The re-ranker is a second pass over retrieved candidates: take the top 50 from your vector search, re-score them with a cross-encoder, and keep the top 5. It is not glamorous. It is, empirically, the single highest-ROI component most teams skip.
Cohere Rerank 3.5 is the default procurement path as of April 2026, available in Amazon Bedrock through the Rerank API and in Pinecone and Elasticsearch integrations [6]. Cohere's own benchmarks show the largest improvements over Rerank v2 on constrained queries and semi-structured JSON; on generic prose benchmarks the lift is more modest. Rerank 3 Nimble is 3-5x faster than Rerank 3 with comparable accuracy on BEIR, worth evaluating when latency budget is tight.
Alternatives: Jina reranker-v2 (open weights, competitive on short queries), mixedbread-ai/mxbai-rerank-large-v1 (open source, strong on multilingual), and custom fine-tuned cross-encoders where the corpus is weird enough to justify the training investment. Latency cost for all commercial re-rankers is measured in a few hundred milliseconds per query, invisible to the end user on a conversational interface.
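The two-pass shape is the same regardless of vendor: over-retrieve, jointly score each (query, document) pair, keep the head. A sketch with a toy token-overlap scorer standing in for a real cross-encoder or Rerank API call (everything here is illustrative):

```python
from typing import Callable

def rerank(query: str,
           candidates: list[str],
           score: Callable[[str, str], float],
           keep: int = 5) -> list[str]:
    """Second-pass re-ranking: score each (query, doc) pair jointly
    and keep only the best `keep` documents."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:keep]

# Placeholder scorer: token overlap stands in for a cross-encoder here.
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["late chunking embeds the whole document first",
        "pricing page for the enterprise tier",
        "chunking strategy for long documents"]
print(rerank("how does late chunking work", docs, overlap_score, keep=2))
```

Swapping `overlap_score` for a cross-encoder (or an API call) changes nothing about the pipeline shape, which is why the upgrade is so cheap to trial.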
"Model choice is the conversation everyone wants to have. Chunking and re-ranking are the changes that actually move the benchmark."
Evaluation with RAGAS is the whole game
Teams ship or stall on evals. Teams with a named eval set (a specific corpus of queries with expected answers, run nightly) can answer "is this better?" with a number. Teams without one argue and wait for a product manager to decide.
The open-source framework most teams standardise on is RAGAS (arXiv 2309.15217). It offers reference-free metrics including Faithfulness, Context Precision, Context Recall, Context Entities Recall, Response Relevancy, Answer Accuracy, Factual Correctness, and Semantic Similarity, plus agentic metrics like Tool Call F1 and Agent Goal Accuracy [3]. Faithfulness and Context Precision are the two that disagree most often in practice. A response can be fully grounded in retrieved context that was itself irrelevant. Track both or fool yourself.
Scale the eval set with the program. Minimum viable: 100 queries with graded answers. Mature: 2,000 queries covering every failure mode observed in production. Update weekly, run nightly, post the delta to a Slack channel. Every model change, prompt change, and chunker change ships with an eval-delta comment or it does not ship. LLM-as-judge saves annotation cost but drifts; calibrate quarterly against human graders.
1. **Start with failures.** Seed the eval set with queries users complained about, not queries that worked. Bias towards the corpus edges.
2. **Use four tiers, not binary.** Binary correct/incorrect loses signal. Four tiers (incorrect / partial / correct-with-caveat / correct) calibrates in two hours of annotator training.
3. **Track Faithfulness and Context Precision separately.** A high-Faithfulness / low-Precision pair flags groundedness against the wrong evidence. Common early-stage failure.
4. **LLM-as-judge is fine, once calibrated.** Calibrate quarterly against human graders. Model drift between GPT-4-class versions is real and can flip your sign.
5. **Publish the delta.** Every change posts an eval-delta to the channel. No change ships silently; that rule is what separates real programs from theatre.
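The publish-the-delta gate is a small piece of code: compare two eval runs metric by metric and block the ship on any regression beyond tolerance. A sketch with illustrative metric names and tolerance:

```python
def eval_delta(baseline: dict[str, float],
               candidate: dict[str, float],
               tolerance: float = 0.01) -> tuple[bool, list[str]]:
    """Compare two eval runs metric-by-metric.
    Returns (ship_ok, human-readable report lines)."""
    report, ship_ok = [], True
    for metric in sorted(baseline):
        delta = candidate[metric] - baseline[metric]
        report.append(f"{metric}: {baseline[metric]:.3f} -> "
                      f"{candidate[metric]:.3f} ({delta:+.3f})")
        if delta < -tolerance:  # regression beyond tolerance blocks the ship
            ship_ok = False
    return ship_ok, report

ok, lines = eval_delta(
    {"faithfulness": 0.91, "context_precision": 0.78},
    {"faithfulness": 0.93, "context_precision": 0.74},
)
print(ok)  # False: faithfulness improved, but context_precision regressed
```

The example is deliberately the Faithfulness-up / Precision-down case: a change can look like a win on one metric while quietly shipping a retrieval regression.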
Vector databases in production, April 2026
The managed shortlist settled on three. Pinecone is the managed-simplicity option (SOC 2 Type II, ISO 27001, GDPR-aligned, with an external HIPAA attestation), and in BYOC mode clusters run inside the customer's own AWS, Azure, or GCP account for hard isolation [7]. Weaviate is the flexible hybrid-search powerhouse; Weaviate Enterprise Cloud gained HIPAA compliance on AWS in 2025 and ships tenant-aware classes, lifecycle endpoints, and ACLs for multitenant deployments.
Qdrant is the performance-first option: Rust-native, SOC 2 Type II, with a markedly advanced filtering engine that lets complex metadata queries execute before the vector search. At 10M+ vectors with concurrent filtered queries, third-party benchmarks put Qdrant at 2-5x higher QPS than Weaviate on equivalent hardware at the same recall target. Open-source alternatives worth evaluating: Milvus (CNCF-graduated), pgvector for teams already on Postgres, and Chroma for prototyping.
Hybrid search, which combines vector similarity, keyword search, and metadata filters in one query, is the feature to verify in procurement. Weaviate and Qdrant include it by default; Pinecone added sparse-dense hybrid support through native integrations. If you need HIPAA, the shortlist narrows quickly; if you need on-prem isolation, Qdrant and Milvus are the common answers.
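One common way hybrid search fuses its component rankings is reciprocal rank fusion (RRF); whether a given database uses RRF or weighted score fusion internally is vendor-specific, so treat this as a sketch of the idea rather than any product's implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each ranked list contributes
    1 / (k + rank) per document; documents high in ANY list win."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector-similarity order
sparse = ["doc_b", "doc_d", "doc_a"]   # keyword (BM25) order
print(rrf_fuse([dense, sparse]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it places well in both lists, which is exactly the behaviour hybrid search buys you over either retriever alone.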
| Database | Model | Compliance | Strength |
|---|---|---|---|
| Pinecone | Fully managed; BYOC | SOC 2 II, ISO 27001, HIPAA attested, GDPR | Zero-ops managed; fast time-to-value |
| Weaviate | Managed + self-host | SOC 2 II, HIPAA on AWS (2025) | Hybrid search native; strong multitenancy |
| Qdrant | Managed + self-host | SOC 2 II; HIPAA-ready | Rust performance; best filtered-query QPS |
| Milvus | Self-host (CNCF) | Customer-configurable | Open source; GPU-accelerated at scale |
| pgvector | Postgres extension | Inherited from Postgres | Stay on existing database; simple ops |
When agentic RAG earns its seat
Agentic RAG is retrieval placed inside a planning loop. The agent decides what to retrieve, judges whether the retrieval was good enough, and retrieves again (or chooses a different index) if not. The cost is latency and token spend; the benefit is accuracy on multi-hop questions and ambiguous intent.
Add the agentic layer when the workload asks for it. Four triggers. Users ask follow-ups that depend on the previous retrieval. The correct answer requires joining two corpora. Intent is ambiguous and you would rather the system clarify than guess. Retrieval must decide among multiple typed sources (structured DB, unstructured document store, API). For simple lookups, agentic RAG is a tax: three extra seconds and a bigger token bill for no accuracy gain. The trigger list is short, which is the part most steering committees forget to check before shipping.
The governance implication: agentic RAG exposes new failure modes (retrieval loops, prompt injection via retrieved content) that OWASP LLM06 Excessive Agency and LLM08 Vector and Embedding Weaknesses specifically call out. Wire agentic RAG through the MCP gateway (see our MCP and AI Governance pillars) so tool calls are auditable and policy-governed.
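The planning loop itself is small; the governance-relevant detail is the hard cap on rounds, which is what prevents the retrieval-loop failure mode. A toy sketch in which every callable is a stand-in for a real component (retriever, LLM judge, query rewriter):

```python
from typing import Callable

def agentic_retrieve(query: str,
                     retrieve: Callable[[str], list[str]],
                     good_enough: Callable[[str, list[str]], bool],
                     reformulate: Callable[[str], str],
                     max_rounds: int = 3) -> list[str]:
    """Retrieval inside a planning loop: judge each result set and retry
    with a reformulated query. The hard cap guards against the retrieval
    loops flagged under OWASP LLM06 Excessive Agency."""
    docs: list[str] = []
    for _ in range(max_rounds):
        docs = retrieve(query)
        if good_enough(query, docs):
            return docs
        query = reformulate(query)
    return docs  # best effort after the cap

# Toy wiring: the judge only accepts results mentioning "expanded".
corpus = {"q1": ["unrelated doc"],
          "q1 expanded": ["doc about expanded topic"]}
docs = agentic_retrieve(
    "q1",
    retrieve=lambda q: corpus.get(q, []),
    good_enough=lambda q, d: any("expanded" in x for x in d),
    reformulate=lambda q: q + " expanded",
)
print(docs)  # ['doc about expanded topic']
```

For a simple lookup the loop exits on round one and you have paid the tax for nothing, which is the steering-committee point above in executable form.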
Frequently asked
**What is late chunking in RAG?**
Late chunking flips the order of two steps: embed the whole long document first with a long-context model, then split the token sequence into chunks and mean-pool each slice. Jina AI published the technique on September 7, 2024 as arXiv 2409.04701 [1]. Because every chunk embedding saw the full surrounding context, long-distance dependencies survive; losing them is the exact failure mode that sinks classic chunk-then-embed pipelines. It is the first genuinely new idea in retrieval in two years, which is why it was the one people argued about at conferences.

**Which embedding model should I use for enterprise RAG in 2026?**
Three commercial defaults, plus open source for the air-gapped case. Voyage 4-large (January 15, 2026; MoE with a shared embedding space across the family) [2] is the current MTEB leader. Cohere Embed v4 wins if you need multimodal (text plus image) at a 128K context window [4]. OpenAI text-embedding-3-large remains the safe default at $0.13 per million tokens [5]. For air-gapped: BGE-M3 or Jina v3. The ordering is consistent enough across enterprise corpora to use as a starting point, which is the step teams skip when they argue about model choice for six months.

**Do I need a re-ranker?**
Almost certainly yes. It is the highest-ROI upgrade most RAG teams skip. Take the top 50 from your vector search, re-score with a cross-encoder, keep the top 5. Cohere Rerank 3.5 (via Amazon Bedrock's Rerank API) is the default procurement path; the lift is largest on constrained and semi-structured queries [6]. Latency cost is a few hundred milliseconds per query, invisible on a conversational interface. The part RAG teams admit only privately: they spent a quarter arguing about embedding models before trying a re-ranker.

**How should I evaluate a production RAG system?**
Use RAGAS as the baseline framework (arXiv 2309.15217) [3]. Track four metrics at minimum: Faithfulness, Context Precision, Context Recall, and Response Relevancy. A minimum-viable eval set is 100 graded queries; mature looks more like 2,000, covering every observed failure mode. Run nightly and post deltas to a public channel. No change ships without an eval-delta, the one rule that separates real programs from theatre.

**Pinecone, Weaviate, or Qdrant?**
Three answers, one question each. Pinecone for zero-ops managed with SOC 2 Type II, ISO 27001, and HIPAA attestation. Weaviate for hybrid-search-native with strong multitenancy and HIPAA-on-AWS added in 2025. Qdrant for raw performance: Rust-native, 2 to 5x higher QPS than Weaviate at 10M+ vectors on filtered queries per third-party 2025 benchmarks [7]. Milvus and pgvector are the open-source names worth a short-list. If you need HIPAA, the list narrows fast. If you need on-prem isolation, the two open-source names are usually where teams land.

**When does agentic RAG earn its seat?**
Four situations. Users ask follow-ups that depend on previous retrievals. The correct answer requires joining two corpora. Intent is ambiguous and you would rather the system clarify than guess. Retrieval must choose among multiple typed sources (structured DB, unstructured store, API). For simple lookups, agentic RAG is a tax: three extra seconds and a bigger token bill for no accuracy gain. The trap teams walk into: wiring agentic retrieval on top of a single store that never needed it. See the agentic RAG glossary entry.