[Paper Reading] ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Paper Source: ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition (arxiv.org)(ACL’2024)

Background

自注意力开销: 自注意力是LLM的核心组件，但在处理长输入序列时，它的资源消耗非常大。这导致在推理过程中，存储每个token的KV Cache时，会产生显著的内存和计算成本。
共享系统提示词: 在多用户系统中，多个基于LLM的应用程序共享系统提示词（即模型给出的通用指令或示例），这会导致在多个请求中存储相同提示的KV Cache时出现冗余。

通过将KV tensor 组织成更小的块(Chunk)，并使用前缀树结构，系统能够在运行时检测并消除冗余的KV tensor。

KV Cache: 通常，KV Cache以密集tensor的形式存储，大小为b×h×n×d。当多个序列共享相同的前缀标记时，它们的KV tensor是相同的，因此可以在内存中共享。

举个🌰：

序列 $S_i = [t_1, \dots,t_{n_s},t_{n_s +1},\dots,t_{n_p}]$

序列 $S_j = [t_1, \dots,t_{n_s},t_{n_s +1}^`,\dots,t_{n_p}^`]$

$t_1, \dots,t_{n_s}$ 的KV Cache只需在显存中保留一份即可

b: batch size

h: number of heads

n: sequence length

d: head dimension size
前缀树结构: 将所有序列的KV Cache组织成一个前缀树。KV tensor沿着序列长度维度被切片，每个节点定义了一个块(Chunk)，这些块存储了token、Key以及对应的Value。

KV Cache in prefix tree
推理中构造前缀树的三种情形：
1. When a new sequence joins: the prefix tree is searched and updated to insert a new path.
2. When a completed sequence leaves: the prefix tree is updated to delete its path.
3. At each decoding iteration: we append new tokens into leaf chunks or grow a new chunk when the leaf chunk is full.

自注意力计算采用了两阶段分区策略来优化计算。在第一阶段(Chunk-first)，具有共享前缀的块一起处理，以改善数据局部性。在第二阶段(Sequence-first)，序列的非共享部分单独处理。

During prefilling: 该阶段生成前缀树，进行前缀搜索以避免KV的重复计算或对已匹配的prompt重复position embedding；未匹配的后缀token则正常计算并在chunk化后插入到前缀树中
During iterative decoding：
- Chunk-first phase: 在这一阶段，主要处理由多个序列共享的块。
  
  对 prefix tree 中的共享块进行遍历计算局部attention后保存到显存中
  
  Algorithm 1: Chunk-first
- Sequence-first phase: 这一阶段则处理特定序列的块。
  
  Algorithm 2:Sequence-first