RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system, and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
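
To make the idea of a knowledge tree over a GPU and host memory hierarchy concrete, the following is a minimal sketch under stated assumptions, not the RAGCache implementation. The abstract does not give the tree layout or the replacement policy, so everything here is hypothetical: the names `Node` and `KnowledgeTree`, the use of ordered document IDs as tree paths (motivated by the fact that a document's intermediate states depend on the tokens that precede it), capacities counted in nodes rather than bytes, and plain least-recently-used eviction standing in for the paper's inference-aware and retrieval-aware policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One retrieved document; stands in for its cached intermediate states."""
    doc_id: str
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    tier: str = "gpu"        # "gpu" or "host" (assumed two-level hierarchy)
    last_used: int = 0       # logical timestamp for LRU (assumed policy)


class KnowledgeTree:
    """Hypothetical prefix tree keyed by the ordered list of document IDs."""

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.root = Node(doc_id="<root>")
        self.gpu_capacity = gpu_capacity    # max nodes kept in the GPU tier
        self.host_capacity = host_capacity  # max nodes kept in the host tier
        self.clock = 0

    def _nodes(self, tier: str) -> List[Node]:
        """Collect all non-root nodes currently in the given tier."""
        found, stack = [], list(self.root.children.values())
        while stack:
            node = stack.pop()
            if node.tier == tier:
                found.append(node)
            stack.extend(node.children.values())
        return found

    def lookup(self, doc_ids: List[str]) -> int:
        """Return how many leading documents of the request are cached."""
        self.clock += 1
        node, hits = self.root, 0
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break
            child.last_used = self.clock
            node, hits = child, hits + 1
        return hits

    def insert(self, doc_ids: List[str]) -> None:
        """Cache a document sequence, placing every node on its path in GPU."""
        self.clock += 1
        node = self.root
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                child = Node(doc_id=doc_id, parent=node)
                node.children[doc_id] = child
            child.tier = "gpu"
            child.last_used = self.clock
            node = child
        self._evict()

    def _evict(self) -> None:
        # Demote the least recently used GPU node that has no GPU children,
        # so a GPU-resident node always has a GPU-resident parent.
        while len(gpu := self._nodes("gpu")) > self.gpu_capacity:
            leaves = [n for n in gpu
                      if not any(c.tier == "gpu" for c in n.children.values())]
            min(leaves, key=lambda n: n.last_used).tier = "host"
        # Drop the least recently used host leaf once host memory is full.
        while len(host := self._nodes("host")) > self.host_capacity:
            leaves = [n for n in host if not n.children]
            victim = min(leaves, key=lambda n: n.last_used)
            del victim.parent.children[victim.doc_id]
```

In this sketch, a request that retrieves `["d1", "d2", "d3"]` after `["d1", "d2"]` has been inserted would reuse the first two cached entries and compute only the third, while a request for `["d2", "d1"]` would miss, because the path is order-sensitive. Evicting leaves first keeps every cached node reachable through a cached prefix.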
