Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Abstract

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework \emph{Kangaroo}, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the sub-network and the full model's representation ability. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on the Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to $1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7\% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
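
The draft-then-verify loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `draft_fn`, `target_fn`, `draft_tokens`, `verify`, and `generate` are hypothetical names, the models are plain callables over token lists, and the values for `threshold` and `max_steps` are arbitrary. The sketch assumes greedy decoding, discards the low-confidence token when drafting stops (the abstract does not say whether it is kept), and calls the target once per token, whereas a real system would verify all drafted tokens in one parallel forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the Kangaroo code):
# the draft model returns (token, confidence); the target model returns a token.
DraftFn = Callable[[List[int]], Tuple[int, float]]
TargetFn = Callable[[List[int]], int]


def draft_tokens(prefix: List[int], draft_fn: DraftFn,
                 threshold: float, max_steps: int) -> List[int]:
    """Draft up to max_steps tokens, halting once confidence falls below threshold."""
    drafted: List[int] = []
    for _ in range(max_steps):
        token, confidence = draft_fn(prefix + drafted)
        if confidence < threshold:
            break  # early exit from drafting on a low-confidence token
        drafted.append(token)
    return drafted


def verify(prefix: List[int], drafted: List[int], target_fn: TargetFn) -> List[int]:
    """Greedy verification: keep the longest drafted prefix the target agrees
    with, then append one token produced by the target itself."""
    accepted: List[int] = []
    for token in drafted:
        expected = target_fn(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    accepted.append(target_fn(prefix + accepted))  # bonus token after full acceptance
    return accepted


def generate(prompt: List[int], draft_fn: DraftFn, target_fn: TargetFn,
             num_tokens: int, threshold: float = 0.6, max_steps: int = 6) -> List[int]:
    """Alternate drafting and verification until num_tokens new tokens exist."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        drafted = draft_tokens(out, draft_fn, threshold, max_steps)
        out.extend(verify(out, drafted, target_fn))  # always adds at least one token
    return out[: len(prompt) + num_tokens]
```

Because every emitted token is either confirmed or produced by `target_fn`, the output under these assumptions matches plain greedy decoding with the target alone. The confidence threshold only changes how many draft steps are spent per verification.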
