DFlash: Block Diffusion for Flash Speculative Decoding


Block-wise Diffusion Drafting: Instead of generating one token at a time, DFlash produces a block of candidate tokens through a small number of diffusion steps. The target model then verifies the block in parallel, which sharply reduces the number of sequential forward passes required.

High Acceptance Rate on Coding/Math: Benchmarks show particularly strong performance on code generation and mathematical reasoning tasks, where the block-level approach aligns well with the token distributions typical of structured output.

Production-Ready Integration: DFlash is designed to plug into existing vLLM and SGLang deployments with minimal configuration changes, making it accessible to practitioners already running LLM inference infrastructure.
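The draft-then-verify cycle behind block-wise drafting can be illustrated with a toy sketch. This is not DFlash's implementation: `toy_target_next` and `toy_draft_block` are hypothetical stand-ins for the target model and the diffusion drafter, the drafter is rigged to be wrong from its fourth position onward, and verification is assumed to be greedy (exact-match) rather than whatever acceptance rule the project actually uses.

```python
from typing import List, Tuple


def toy_target_next(prefix: List[int]) -> int:
    """Hypothetical target model: greedy next token is (last + 1) mod 10."""
    return (prefix[-1] + 1) % 10


def toy_draft_block(prefix: List[int], block_size: int) -> List[int]:
    """Hypothetical drafter: correct for 3 positions, then deliberately wrong."""
    block: List[int] = []
    last = prefix[-1]
    for i in range(block_size):
        # Positions 0..2 agree with the target; later ones are offset by 5.
        last = (last + 1) % 10 if i < 3 else (last + 6) % 10
        block.append(last)
    return block


def speculative_step(prefix: List[int], block_size: int) -> List[int]:
    """One draft-and-verify cycle under greedy (exact-match) verification."""
    draft = toy_draft_block(prefix, block_size)
    # Stand-in for a single parallel target pass: the target's greedy choice
    # after each draft prefix, plus one extra position past the block.
    target_choices = [
        toy_target_next(prefix + draft[:i]) for i in range(block_size + 1)
    ]
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_choices[i]:
            break  # first mismatch ends acceptance
        accepted.append(tok)
    # Always emit one token from the target: the correction at the first
    # mismatch, or a bonus token if the whole block was accepted.
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(prompt: List[int], num_new: int, block_size: int = 8) -> Tuple[List[int], int]:
    """Generate num_new tokens and count the target passes used."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < num_new:
        tokens.extend(speculative_step(tokens, block_size))
        passes += 1
    return tokens[: len(prompt) + num_new], passes


tokens, passes = generate([0], num_new=12, block_size=8)
print(tokens, passes)
```

In this toy setup every cycle yields three accepted draft tokens plus one token from the target, so 12 new tokens cost 3 target passes instead of 12, and the output is identical to plain greedy decoding with the target alone.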
💬 Issue #1: Dflash training code (27 comments)
@jianc99 (maintainer): "We plan to open-source our full training recipe, including the training code and training data, together with the complete paper. The release is expected by the end of this month."
@Arcmoon-Hu: "Good job! And have you any plan deploy the flash model on vllm or sglang?"
💬 Summary: The author promised to open-source the complete training code and training data by the end of this month and previewed vLLM/SGLang deployment plans, drawing an enthusiastic response from the community.

💬 Issue #52: Does this support using quantized models like AWQ for the main model? (13 comments)
@leonardcser: "Same question here with Qwen/Qwen3.5-27B-FP8. I don't get much of a speed up and the acceptance rate seems suspiciously low."
@jianc99: "Could you run our benchmark to check whether the acceptance lengths are aligned? If they are, then the lower acceptance length you're seeing is likely due to out-of-distribution data, since our training data mainly focuses on coding and math. If they're not aligned, there might be some issues with the inference backend."
💬 Summary: Community members report low acceptance rates with FP8-quantized models. The author points out that this may be caused by out-of-distribution (OOD) data and suggests running the official benchmark first to narrow down the cause.

💬 Issue #50: Llama cpp support (10 comments)
@DavidDohmen: "Not only related to windows/linux. Generally llama.cpp is one of the most powerful inference frameworks used dominantly in the local inference sector. People like me, who are compute hungry and would greatly benefit from these amazing speedups. Great work team!"
💬 Summary: The community is strongly calling for llama.cpp support, which matters a great deal to local-inference users. The authors are evaluating the request.
DFlash represents a promising direction in LLM inference optimization: it replaces the traditional autoregressive draft model with a diffusion-based block generator. With strong engagement from the open-source community, active development, and upcoming features like llama.cpp support and more model integrations (Gemma 4, MoE, Dense variants), it is a project worth watching, especially for anyone running LLM inference at scale or optimizing for throughput on coding-heavy workloads.

📎 @z-lab: github.com/z-lab/dflash

DFlash (Block Diffusion for Flash Speculative Decoding) is an open-source project from z-lab that brings diffusion-based speculative decoding to large language models. A small draft model trained with diffusion techniques proposes blocks of tokens, and the target model verifies each block in parallel, which yields significant inference speedups.

Unlike traditional speculative decoding, which relies on autoregressive draft models, DFlash uses a block-wise diffusion approach that generates multiple candidate tokens simultaneously. This changes the acceptance/rejection dynamics and can achieve higher acceptance rates on certain task distributions, particularly coding and mathematical reasoning.
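A back-of-the-envelope calculation shows why acceptance behavior matters so much for the end-to-end speedup. The sketch below assumes a simplified cost model that is not taken from the DFlash paper or code: each draft-and-verify cycle costs one target forward pass plus a drafting cost expressed as a fraction of that pass (`draft_cost_ratio`, a hypothetical parameter), and it emits `mean_accepted_len` tokens on average, counting the token the target itself contributes. All numbers are made up for illustration.

```python
from typing import Sequence


def mean_acceptance_length(accepted_per_cycle: Sequence[int]) -> float:
    """Average number of tokens emitted per draft-and-verify cycle."""
    if not accepted_per_cycle:
        raise ValueError("need at least one cycle")
    return sum(accepted_per_cycle) / len(accepted_per_cycle)


def estimated_speedup(mean_accepted_len: float, draft_cost_ratio: float) -> float:
    """Rough speedup over plain autoregressive decoding.

    Assumed cost model: one cycle costs (1 + draft_cost_ratio) target passes
    and emits mean_accepted_len tokens, while autoregressive decoding emits
    exactly one token per target pass.
    """
    if mean_accepted_len < 1.0:
        raise ValueError("each cycle emits at least one token")
    if draft_cost_ratio < 0.0:
        raise ValueError("draft cost cannot be negative")
    return mean_accepted_len / (1.0 + draft_cost_ratio)


# Illustrative, made-up acceptance lengths for two kinds of prompts.
in_distribution = [5, 4, 6, 5]       # e.g. coding or math prompts
out_of_distribution = [2, 1, 2, 1]   # e.g. prompts unlike the training data

for name, cycles in [("in-dist", in_distribution), ("OOD", out_of_distribution)]:
    tau = mean_acceptance_length(cycles)
    print(name, tau, round(estimated_speedup(tau, draft_cost_ratio=0.25), 2))
```

Under this model a drop in mean acceptance length from 5 to 1.5 takes the estimated speedup from 4x to 1.2x, which is consistent with the kind of behavior users report when their prompts fall outside the coding and math distributions.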

The project has gained significant traction, with 3,500+ GitHub stars, 671 of them in the last day alone, reflecting the AI community's strong interest in practical LLM inference optimizations. The codebase integrates with popular backends like vLLM and SGLang.