
ds4 (DeepSeek 4 Flash) is a high-performance local inference engine developed by legendary programmer Salvatore "antirez" Sanfilippo, the creator of Redis. Written in C with hand-optimized Metal (Apple Silicon) and CUDA (NVIDIA) kernels, ds4 runs the DeepSeek V4 Flash model directly on your own hardware, with an emphasis on speed and a small memory footprint. The project is only 10 days old but has already accumulated nearly 10,000 stars, a sign of how much developers want a fast, local option for running cutting-edge open-weight LLMs.

  • Metal 4 Optimized: Purpose-built GPU kernels for Apple Silicon (M1-M5), with Metal 4 optimizations for both the prefill and generation phases. The M5 prefill work is reported to give a very significant speedup and is kept on its own branch.
  • CUDA & ROCm Support: Native NVIDIA CUDA kernels with fp16/q8 quantization support, plus an emerging AMD ROCm/HIP backend. Its zero-copy mode loads weights on demand and needs far less RAM than full-copy (reported at 93GB), which brings GPUs with 62GB of VRAM within reach of the full ~81GB model.
  • Minimal Footprint GGUF Loader: Supports stock-recipe (Q8_0/F32) and abliterated GGUFs end-to-end, so consumer-grade GPU setups can run the full DeepSeek V4 Flash model, which previously called for far more expensive infrastructure.
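
To make the Q8_0 label above more concrete, here is a minimal sketch of block-wise 8-bit quantization in the style of ggml's Q8_0 format as I understand it: 32 weights share one fp16 scale, so each block takes 34 bytes, or 8.5 bits per weight. This is not code from ds4. The function names are hypothetical, and Python's `round` (round half to even) stands in for the C rounding a real implementation would use.

```python
import struct

QK8_0 = 32  # number of weights per block (assumed to match ggml's Q8_0)


def quantize_q8_0_block(values):
    """Quantize 32 floats into one fp16 scale plus 32 signed 8-bit integers."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    # The scale is stored as fp16, so round-trip it through half precision.
    scale = struct.unpack("<e", struct.pack("<e", scale))[0]
    if scale == 0.0:
        return 0.0, [0] * QK8_0
    # Clamp because the fp16-rounded scale can be slightly smaller than amax / 127.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Recover approximate floats: each weight is scale * int8 value."""
    return [scale * q for q in quants]


def pack_q8_0_block(scale, quants):
    """Serialize one block as 2 bytes of fp16 scale followed by 32 int8 values."""
    return struct.pack("<e32b", scale, *quants)


if __name__ == "__main__":
    weights = [i / 31.0 - 0.5 for i in range(QK8_0)]
    scale, quants = quantize_q8_0_block(weights)
    restored = dequantize_q8_0_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"bytes per block: {len(pack_q8_0_block(scale, quants))}")
    print(f"worst absolute error: {worst:.6f}")
```

The per-block scale is what keeps the error small: a block of small weights gets a small scale, so outliers elsewhere in the tensor do not cost it precision.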

Issue #15 — Metal 4 M5 Prefill Optimizations (21 comments, open)
"Very significant speedup. I'm in the middle of a large refactoring so can't merge right now and there is the problem that I miss the hardware to later maintain it, but I'll see what I can do :)" — @antirez
"100% with you, we need to sort this out and let you get an M5 😎" — @ivanfioravanti
The community is rallying to help the maintainer get Apple M5 hardware access so he can properly benchmark and maintain the Metal 4 branch.

Issue #16 — AMD GPU ROCm/HIP Backend Support (16 comments, open)
"The current DeepSeek V4 Flash model (~81GB compressed) doesn't fit in consumer AMD GPUs with limited VRAM. Hardware tested: AMD Ryzen AI Max+ 395, GPU: Radeon 8060S (62GB VRAM), System RAM: 128GB." — @darchidev
"I've started work on the HIP kernels for DS4. Currently it works and loads the full model, or you can use the zero-copy that is in Metal as well. Full-copy requires 93GB ram, zero-copy-fast way less as it loads when needed." — @ejpir
The ROCm port is actively developed, with the zero-copy approach making AMD consumer GPUs a realistic target.
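
The difference between the two loading modes can be sketched as follows, assuming that "zero-copy" here means memory-mapping the weights file so pages are loaded when first touched, which is how the quote above reads. The helper names are hypothetical and this is not ds4's loader. It only shows why a full copy costs RAM equal to the file size while a mapping costs roughly what is actually read.

```python
import mmap
import os
import tempfile


def load_full_copy(path):
    """Read the entire file into a private buffer: RAM use equals file size."""
    with open(path, "rb") as f:
        return f.read()


def open_zero_copy(path):
    """Map the file read-only; the OS pages data in on first access."""
    with open(path, "rb") as f:
        # Python's mmap keeps its own duplicate of the descriptor,
        # so the mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_tensor(view, offset, nbytes):
    """Fetch one tensor's bytes; with a mapping, only these pages are touched."""
    return bytes(view[offset:offset + nbytes])


if __name__ == "__main__":
    # Stand-in for a weights file: 4096 bytes of a repeating pattern.
    fd, path = tempfile.mkstemp()
    os.write(fd, bytes(range(256)) * 16)
    os.close(fd)

    full = load_full_copy(path)   # whole file resident in memory
    mapped = open_zero_copy(path)  # nothing read yet

    # Both views return the same bytes for the same tensor range.
    assert read_tensor(mapped, 256, 4) == read_tensor(full, 256, 4)

    mapped.close()
    os.remove(path)
```

With a model file of about 81GB, the first approach needs at least that much memory up front, while the second lets the operating system keep only the pages in use resident.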

Issue #74 — FP4 Inference for 4-bit Quantization (8 comments, closed)
Community members discussed ways to make ds4 accessible on more modest GPU hardware through aggressive quantization, with the team exploring FP4 and other compression strategies.

ds4 brings together veteran open-source craftsmanship and a pressing AI infrastructure need. @antirez, who built Redis into one of the most beloved databases in history, is applying the same obsessively optimized C coding style to local LLM inference, and that alone makes the project compelling. The active discussions around Metal 4, AMD ROCm, and FP4 quantization point to a healthy, growing community. For developers who want privacy, speed, and full control over their AI workloads without cloud dependency, ds4 is a strong candidate and worth watching closely as it matures.

@antirez · github.com/antirez/ds4