Paper Source: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI'24)

Terms

  • SLO Attainment: the proportion of requests that meet the SLOs (Service-Level Objectives).
    The prefill phase is evaluated against TTFT; the decoding phase against TPOT.

  • TTFT: the time to first token, which is the duration of the prefill phase

    TPOT: the time per output token, which represents the average time taken to generate a token for each request (excluding the first token)

    Overall request latency = TTFT + TPOT * number of generated tokens
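The latency decomposition and SLO attainment above can be made concrete with a small sketch (the SLO values and per-request timings below are hypothetical):

```python
def request_latency(ttft, tpot, num_generated_tokens):
    """Overall request latency = TTFT + TPOT * number of generated tokens."""
    return ttft + tpot * num_generated_tokens

def slo_attainment(requests, ttft_slo, tpot_slo):
    """Fraction of requests whose TTFT and TPOT both meet their SLOs."""
    met = sum(1 for ttft, tpot in requests if ttft <= ttft_slo and tpot <= tpot_slo)
    return met / len(requests)

# Hypothetical measurements: (TTFT seconds, TPOT seconds) per request.
reqs = [(0.3, 0.04), (0.5, 0.05), (1.2, 0.03), (0.4, 0.09)]
print(request_latency(0.3, 0.04, 100))                    # ~4.3 s end-to-end
print(slo_attainment(reqs, ttft_slo=1.0, tpot_slo=0.05))  # 0.5: two of four requests meet both SLOs
```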

  • Intra-op: parallelism within a single operation. A single operator is split into multiple sub-tasks that execute concurrently on multiple compute units, speeding up that one operation. For example, a matrix multiplication can be partitioned into blocks, with each block computed on a different processor at the same time.

    Inter-op: parallelism across operations. The overall computation is decomposed into multiple operators that run concurrently on different compute units, speeding up the whole task. For example, the model's layers can be partitioned across GPUs to form a pipeline.

    Granularity: intra-op parallelizes the inside of a single operator, while inter-op parallelizes across different operators.

    Typical use: intra-op suits individual operators that need heavy compute, such as matrix multiplication or convolution. Inter-op suits tasks composed of many independent operators, such as operator scheduling over a computation graph.

    Implementation: intra-op splits an operator into sub-tasks and executes them in parallel on multiple compute units. Inter-op analyzes the dependencies between operators and assigns independent operators to different compute units to run in parallel.

  • Poisson arrival process: request arrival times are random but the average arrival rate is constant; the inter-arrival times follow an exponential distribution, so the number of arrivals over an interval follows a Poisson distribution.
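The intra-op idea can be illustrated with a minimal sketch in pure Python, with threads standing in for compute units: one matrix multiplication is split into contiguous row blocks that run concurrently and are then stitched back together.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(A, B, rows):
    """Compute the given rows of A @ B (plain nested lists)."""
    k, n = len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(n)] for i in rows]

def intra_op_matmul(A, B, num_workers=2):
    """Intra-op parallelism sketch: one matmul, split into contiguous
    row blocks that execute concurrently on separate workers."""
    m = len(A)
    step = (m + num_workers - 1) // num_workers
    blocks = [range(s, min(s + step, m)) for s in range(0, m, step)]
    with ThreadPoolExecutor(max_workers=num_workers) as ex:
        # ex.map preserves block order, so results stitch back in row order.
        parts = ex.map(lambda rows: matmul_rows(A, B, rows), blocks)
        return [row for part in parts for row in part]
```

Inter-op parallelism, by contrast, would place different operators (e.g. different layers) on different workers and stream data between them.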
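A Poisson arrival process like the one assumed above can be generated by sampling exponential inter-arrival gaps (the rate and duration below are arbitrary):

```python
import random

def poisson_arrivals(rate, duration, seed=0):
    """Arrival timestamps of a Poisson process with the given average
    rate (req/s): inter-arrival gaps are i.i.d. exponential with mean 1/rate."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival gap
        if t >= duration:
            return arrivals
        arrivals.append(t)

arrivals = poisson_arrivals(rate=4.0, duration=1000.0)
print(len(arrivals) / 1000.0)  # empirical rate, close to 4.0
```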

Problems

Existing systems colocate and batch the prefill and decoding phases on the same GPUs to maximize system throughput. However, due to the problems below, it is hard for current approaches to keep service quality high and cost low while meeting the SLOs.

  • Prefill-Decoding Interference
    Adding a prefill job to a batch of decoding requests slows both down: TTFT and TPOT both increase significantly, and the effect worsens when the prefill job is longer, as shown in (b). This is because the two phases are colocated and compete for the same resources.

  • Ineffective scheduling
    Prioritizing tasks in either phase adversely affects the latency of the other, rendering priority scheduling ineffective.

  • Resource and parallelism coupling
    Colocating the prefill and decoding phases on the same GPUs inevitably forces them to share resources and parallelism configurations, yet each phase has its own computational characteristics and latency requirements, which calls for more heterogeneous resource allocation.

    For example, the prefill phase tends to be compute-bound and benefits from more intra-op parallelism to reduce execution time to meet the tight SLO on TTFT.

    By contrast, the optimal parallelism configuration of the decoding phase depends on the running batch size.

    In existing systems, because of this coupling, resource allocation and parallelism are tailored to whichever of TTFT and TPOT is more demanding, which may be far from ideal for the other. This often leads to over-provisioning resources just to meet both SLOs.

Main Idea

DistServe Architecture

Disaggregate the prefill and decoding phases onto different GPUs so that the batching and parallelism strategies of each can be chosen independently, with the goal of maximizing per-GPU goodput.

  1. Prefill instance: upon receiving a request, performs only the prefill computation for this request to generate the first output token
  2. Prefill instance sends the KV caches to a decoding instance
  3. Decoding instance is responsible for subsequent decoding steps.
    (Because decoding computation often has low GPU utilization, we may allocate multiple prefill instances per decoding instance.)
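The three steps can be sketched as a toy pipeline, with a plain Python queue standing in for the KV-cache transfer between GPU instances (all names here are illustrative, not DistServe's actual API):

```python
from queue import Queue

def prefill_instance(request):
    """Step 1: run only the prefill computation, producing the first
    output token and the request's KV cache."""
    prompt = request["prompt"]
    kv_cache = [f"kv({tok})" for tok in prompt]  # stand-in for real KV tensors
    first_token = f"tok0<{prompt[-1]}>"
    return first_token, kv_cache

def decoding_instance(first_token, kv_cache, num_tokens):
    """Step 3: generate the remaining tokens, appending to the KV cache
    received from the prefill instance."""
    out = [first_token]
    for i in range(num_tokens - 1):
        kv_cache.append(f"kv(step{i})")  # cache grows one entry per decode step
        out.append(f"tok{i + 1}")
    return out

# Step 2: the KV cache is "sent" from prefill to decoding through a channel.
kv_channel = Queue()
kv_channel.put(prefill_instance({"prompt": ["a", "b", "c"]}))
first, kv = kv_channel.get()
tokens = decoding_instance(first, kv, num_tokens=4)
print(len(tokens), len(kv))  # 4 output tokens; KV cache holds 3 prompt + 3 decode entries
```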

Tradeoff Analysis

Disaggregation decouples the two phases so that each can be analyzed on its own. The analysis below derives guidelines for the batching and parallelism of each phase.

Analysis for Prefill Instance

Assuming a given arrival rate, we aim to fulfill the service’s latency requirement on TTFT using the least resources.

Batching Strategy

The prefill step is typically compute-intensive. Once the GPU becomes compute-bound, adding more requests to the batch no longer improves GPU efficiency. Instead, it proportionally extends the total processing time for the batch, inadvertently delaying all included requests.

Hence, for prefill instances, it is necessary to identify a critical input length threshold, denoted as L_m, beyond which the prefill phase becomes compute-bound.
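One back-of-the-envelope way to estimate such a threshold (not the paper's method; all hardware figures below are assumptions) is a roofline argument: a prefill step is memory-bound while streaming the weights takes longer than the matmul compute, and becomes compute-bound once the batch's token count pushes compute time past memory time.

```python
def prefill_compute_bound_threshold(peak_flops, mem_bandwidth, bytes_per_param=2):
    """Roofline estimate of the token threshold L_m beyond which prefill is
    compute-bound, assuming ~2 FLOPs per parameter per token and that the
    weights are streamed from HBM once per step. The parameter count P cancels:
      compute time: 2 * P * L / peak_flops
      memory time:  P * bytes_per_param / mem_bandwidth
      compute-bound when L > bytes_per_param * peak_flops / (2 * mem_bandwidth)
    """
    return (bytes_per_param * peak_flops) / (2 * mem_bandwidth)

# Assumed A100-like figures: 312 TFLOPS fp16 peak, 2 TB/s memory bandwidth.
print(prefill_compute_bound_threshold(peak_flops=312e12, mem_bandwidth=2e12))  # 156.0 tokens
```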

Parallelism Plan

Assume uniform request input lengths of 512 tokens and a Poisson arrival process.

Suppose the Poisson arrival rate is R, the prefill execution time of a single request is D, and the utilization condition RD < 1 holds. The average TTFT (Avg_TTFT) can be modeled by the M/D/1 queue [47] in closed form:

\begin{align} Avg\_TTFT = D + \frac{RD^2}{2(1-RD)} \end{align}

With 2-way inter-op parallelism, we assume the request-level latency becomes D_s, and the slowest stage takes D_m to finish. We have D ≈ D_s ≈ 2 × D_m, due to negligible inter-layer activation communication [33, 59]. The average TTFT with 2-way inter-op parallelism is derived as:

\begin{align} Avg\_TTFT_{inter} = D_s + \frac{RD_m^2}{2(1-RD_m)} = D + \frac{RD^2}{4(2-RD)} \end{align}

For intra-op parallelism, we introduce a speedup coefficient K, where 1 < K < 2, reflecting the imperfect speedup caused by the high communication overhead of intra-op parallelism. With the execution time D_s = D/K, the average TTFT for 2-degree intra-op parallelism is:

\begin{align} Avg\_TTFT_{intra} = \frac{D}{K} + \frac{RD^2}{2K(K-RD)} \end{align}

At low request rates, the execution time (the first term) dominates, so intra-op parallelism is more efficient; at high request rates, the queuing delay (the second term) dominates, so inter-op parallelism is more efficient.
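The three closed-form expressions above can be evaluated directly to see this crossover (D and K below are illustrative values):

```python
def avg_ttft(R, D):
    """M/D/1 average TTFT with no parallelism (requires RD < 1)."""
    return D + R * D**2 / (2 * (1 - R * D))

def avg_ttft_inter(R, D):
    """2-way inter-op: D_s ≈ D, slowest stage D_m ≈ D/2 (requires RD < 2)."""
    return D + R * D**2 / (4 * (2 - R * D))

def avg_ttft_intra(R, D, K):
    """2-way intra-op with speedup coefficient 1 < K < 2 (requires RD < K)."""
    return D / K + R * D**2 / (2 * K * (K - R * D))

D, K = 1.0, 1.5
print(avg_ttft_intra(0.1, D, K) < avg_ttft_inter(0.1, D))  # True: low rate favors intra-op
print(avg_ttft_inter(1.4, D) < avg_ttft_intra(1.4, D, K))  # True: high rate favors inter-op
```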

Analysis for Decoding Instance

For decoding instances, our optimization goal is to satisfy the application’s TPOT requirement using minimal computing resources.

Batching strategy

As shown in Fig. 3(b), batching is the key to avoiding low GPU utilization (and hence to achieving high per-GPU goodput).

In existing systems, where prefill and decoding are colocated, increasing the decoding batch size is difficult, especially at high request rates: sharing GPUs creates contention between prefill and decoding work, which leads to a trade-off between TTFT and TPOT.

Once prefill and decoding are disaggregated, multiple prefill instances can be assigned to a single decoding instance, providing the decoding instance with a larger batch size.

Parallelism plan

Because the KV caches of all in-flight requests must be kept in memory, the decoding-phase batch size is limited by GPU memory capacity.
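How memory caps the decoding batch size can be estimated with a simple model (the model dimensions and hardware numbers below are illustrative assumptions, not the paper's):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache footprint of one token: a K and a V vector per layer, fp16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def max_decode_batch(gpu_mem_bytes, weight_bytes, seq_len, per_token_bytes):
    """Largest decoding batch whose KV caches fit next to the model weights."""
    free = gpu_mem_bytes - weight_bytes
    return int(free // (seq_len * per_token_bytes))

# Illustrative 13B-class model (fp16 weights ~26 GB) on an 80 GB GPU, 2k-token sequences.
per_tok = kv_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128)
print(per_tok)                                      # 819200 bytes (~0.8 MB per token)
print(max_decode_batch(80e9, 26e9, 2048, per_tok))  # 32 requests fit
```

GQA shrinks `num_kv_heads` and Paged-Attention cuts fragmentation waste, so both raise this cap.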

Insight: Scaling the decoding instance with model parallelism or leveraging advanced memory management techniques for LLM KV caches, such as Paged-Attention [32] and GQA [10], enable further scaling of the decoding batch size to nearly compute-bound.

Based on this insight, the paper studies how throughput and latency change with different parallelism degrees under large batch sizes.

As shown in Fig. 5, intra-op parallelism reduces latency, but with diminishing returns caused by growing communication overhead; inter-op parallelism increases throughput almost linearly.

Method

Given the model, workload characteristics, latency requirements, and SLO attainment target, DistServe will determine:

  1. the parallelism strategies for the prefill and decoding instances
  2. the number of instances of each type to deploy
  3. how to place them on the physical cluster

Our goal is to find a placement that maximizes per-GPU goodput.

For clusters with high-speed cross-node networks

For environments lacking the above infrastructure

A straightforward solution is to always colocate prefill and decoding instances on the same node, utilizing NVLink, which is commonly available inside a GPU node.
