GuMorming

[Paper Reading] Model Tells You What to Discard: Adaptive KV Cache Compression For LLMs

Posted on 2024-08-23|Paper Reading|LLM•ICLR'2024

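Paper Source: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs (arxiv.org) (ICLR’2024) Observations In a large model, different heads within the same layer, and heads across different layers, learn different attention patterns. When computing attention, some heads focus on special tokens (<s>), some on punctuation, some on locality (recent tokens), some on high-frequency tokens, and some attend globally (all tokens). The paper therefore proposes a different KV Cache compression policy for each of these five head types. Does the attention focus of each head shift during generation? The figure above shows that, for a user-provided prompt, each head's attention structure stays consistent throughout generation. On that basis, the identified attention patterns are assumed to remain unchanged in future generation steps, so only a single round of model profiling, based on the prompt, ...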
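The five head types in the excerpt above can be read as five token-retention rules. The following is a minimal sketch of that idea, not the paper's implementation: the function name `keep_indices`, the `window` and `top_k` parameters, and the per-token `scores` list (a hypothetical stand-in for "high-frequency" importance) are all assumptions made for illustration.

```python
# Hypothetical token sets; a real tokenizer defines its own special tokens.
SPECIAL = {"<s>"}
PUNCT = {".", ",", "?", "!", ";", ":"}


def keep_indices(head_type, tokens, scores, window=2, top_k=2):
    """Return the sorted positions whose KV entries a head of this type keeps."""
    n = len(tokens)
    if head_type == "special":
        keep = {i for i, t in enumerate(tokens) if t in SPECIAL}
    elif head_type == "punctuation":
        keep = {i for i, t in enumerate(tokens) if t in PUNCT}
    elif head_type == "local":
        # Keep only the most recent `window` tokens.
        keep = set(range(max(0, n - window), n))
    elif head_type == "frequent":
        # Keep the `top_k` positions with the highest (assumed) score.
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        keep = set(ranked[:top_k])
    elif head_type == "full":
        keep = set(range(n))
    else:
        raise ValueError(f"unknown head type: {head_type}")
    return sorted(keep)


tokens = ["<s>", "Hello", ",", "world", ".", "How", "are", "you", "?"]
scores = [0.30, 0.05, 0.02, 0.25, 0.03, 0.10, 0.05, 0.15, 0.05]

kept = {
    head: keep_indices(head, tokens, scores)
    for head in ("special", "punctuation", "local", "frequent", "full")
}
```

In this toy setup a "local" head keeps only the last two positions, while a "full" head keeps everything, so the memory saved depends on how many heads fall into the narrower categories.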

[Paper Reading] ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Posted on 2024-08-23|Paper Reading|LLM•ACL'2024

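Paper Source: ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition (arxiv.org) (ACL’2024) Background Self-attention overhead: Self-attention is a core component of LLMs, but it is very resource-intensive on long input sequences. During inference, storing the KV Cache for every token therefore incurs significant memory and compute cost. Shared system prompts: In multi-tenant systems, multiple LLM-based applications share system prompts (the general instructions or examples given to the model), so storing the KV Cache of the same prompt across many requests is redundant. Approach Prefix Aware KV Cache (PAKV) By organizing KV tensors into smaller chunks and using a prefix tree, the system can detect and eliminate redundant KV tensors at runtime. KV Cache: The KV Cache is usually stored as a dense tensor of size b×h×n×d. When multiple sequences share the same prefix tokens, their KV tensors are identical, so ...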
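To make the chunked prefix-tree idea in the excerpt above concrete, here is a minimal sketch that stores only token ids per chunk and counts how many chunks a new sequence can reuse. The class names `ChunkNode` and `PrefixTree`, the `chunk_size` parameter, and the token ids are hypothetical; a real system would attach the K/V tensors to each node, and this is not the paper's data structure.

```python
class ChunkNode:
    def __init__(self, tokens):
        self.tokens = tokens   # tuple of token ids covered by this chunk
        self.children = {}     # maps a chunk's token tuple to its ChunkNode
        # A real cache would also hold the K/V tensors for these tokens here.


class PrefixTree:
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.root = ChunkNode(())
        self.num_chunks = 0    # chunks actually stored (excluding the root)

    def insert(self, token_ids):
        """Insert a sequence and return how many existing chunks it reused."""
        node, reused = self.root, 0
        for start in range(0, len(token_ids), self.chunk_size):
            chunk = tuple(token_ids[start:start + self.chunk_size])
            child = node.children.get(chunk)
            if child is None:
                # No sequence with this prefix has stored this chunk yet.
                child = ChunkNode(chunk)
                node.children[chunk] = child
                self.num_chunks += 1
            else:
                reused += 1
            node = child
        return reused


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]   # shared by both requests
tree = PrefixTree(chunk_size=4)
reused_a = tree.insert(system_prompt + [100, 101])
reused_b = tree.insert(system_prompt + [200, 201, 202])
```

The second request shares the two system-prompt chunks with the first, so only its own suffix chunk is added; without sharing, the two requests would need six chunks instead of four.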

[Paper Reading] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Posted on 2024-08-16|Paper Reading|LLM•Quantization

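Paper Source: KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (arxiv.org) Background In LLMs, as context length grows, the memory needed for KV Cache activations becomes the main bottleneck, especially for long sequences. This is particularly troublesome during inference, where these activations must be stored and processed efficiently. Quantization is a common technique that reduces memory use by representing data with fewer bits. However, existing quantization methods cause significant degradation in model quality when representing the KV Cache at sub-4-bit precision. KVQuant Approach The study shows that Keys tend to have high-magnitude outliers in certain channels (dimensions), while Values show outliers across both channels and tokens (though less pronounced than in Keys). Per-Channel Key Quantization The core idea is to quantize each channel (i.e., each dimension) separately instead of quantizing the whole Key matrix uniformly. This is because the data distribution can differ significantly between channels; in particular, some channels may carry more important information or larger ...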
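The effect of an outlier channel on quantization error can be illustrated with a small sketch. It assumes a plain uniform min-max quantizer, which is a simplification chosen for illustration and not necessarily the quantizer used in the paper; the toy `keys` matrix (rows are tokens, columns are channels, with one high-magnitude channel) and all function names are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Uniform min-max quantization of a list of floats, then dequantization."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def per_token(keys, bits):
    # One quantization range per row (token).
    return [quantize_dequantize(row, bits) for row in keys]


def per_channel(keys, bits):
    # One quantization range per column (channel), then transpose back.
    cols = [quantize_dequantize(list(col), bits) for col in zip(*keys)]
    return [list(row) for row in zip(*cols)]


def mean_abs_error(a, b):
    count = sum(len(row) for row in a)
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / count


# Toy Key matrix: channel 2 holds high-magnitude outliers.
keys = [
    [0.10, -0.20, 8.0, 0.05],
    [0.12, -0.18, 8.5, 0.01],
    [0.08, -0.25, 7.5, 0.07],
    [0.11, -0.22, 9.0, 0.03],
]

err_token = mean_abs_error(keys, per_token(keys, bits=3))
err_channel = mean_abs_error(keys, per_channel(keys, bits=3))
```

With per-token ranges, the outlier channel stretches every row's range and the small channels collapse onto a single level; with per-channel ranges, each channel gets a range that fits its own distribution, so the reconstruction error on this toy matrix is much lower.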

[Paper Reading] FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Posted on 2024-08-15|Paper Reading|LLM•ICML'2023

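Paper Source: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (arxiv.org) Background Research on LLM inference generally considers both the system level and the algorithm level; sparsity and quantization belong to the algorithm level. Token generation has two phases: the prefill phase and the decoding phase. The prefill phase covers the computation performed once the input sequence (the prompt) enters the model, including computing the KV cache and the attention computation. The decoding phase covers the computation performed to output each generated token, including updating the KV cache and the attention computation. The paper aims to design a high-throughput offloading strategy for a single commodity GPU. Offloading Strategy Problem Formulation three-level memory hierarchy: G ...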

[Paper Reading] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Posted on 2024-07-26|Paper Reading|LLM•OSDI'2024

Paper Source: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI’24) Terms SLO Attainment: the proportion of requests that meet the SLOs (Service-Level Objectives). The prefill phase is concerned with TTFT; the decoding phase is concerned with TPOT. TTFT: the time to first token, which is the duration of the prefill phase. TPOT: the time per output token, which represents the average time taken to generate a token for each request (except for the first token). Overall request latency = TTFT + T ...
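The terms above can be computed directly from per-token timestamps. The sketch below is illustrative only: the `Request` record, the timestamps, and the SLO thresholds are made up, and it assumes a request counts as meeting the SLOs only when both its TTFT and its TPOT are within their limits.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float        # time the request arrived (seconds)
    token_times: list     # time at which each output token was emitted


def ttft(req):
    """Time to first token."""
    return req.token_times[0] - req.arrival


def tpot(req):
    """Average time per output token, excluding the first token."""
    n = len(req.token_times)
    if n < 2:
        return 0.0
    return (req.token_times[-1] - req.token_times[0]) / (n - 1)


def slo_attainment(requests, ttft_slo, tpot_slo):
    """Fraction of requests meeting both SLOs (an assumption of this sketch)."""
    met = sum(
        1 for r in requests if ttft(r) <= ttft_slo and tpot(r) <= tpot_slo
    )
    return met / len(requests)


requests = [
    Request(arrival=0.0, token_times=[0.25, 0.5, 0.75]),   # slow decoding
    Request(arrival=0.0, token_times=[1.0, 1.0625, 1.125]),  # slow prefill
    Request(arrival=1.0, token_times=[1.25, 1.3125, 1.375]),
    Request(arrival=2.0, token_times=[2.125, 2.1875, 2.25]),
]

attainment = slo_attainment(requests, ttft_slo=0.5, tpot_slo=0.125)
```

Here the first request violates the TPOT limit and the second violates the TTFT limit, which shows why the two phases are tracked with separate metrics.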