DwarfStar 4: Running DeepSeek V4 Flash on Your Mac with 1M Token Context
文章目录
- DeepSeek V4 Flash-Specific Optimization: ds4 implements Metal and CUDA graph executors tailored to the exact tensor layout, quantization mix, and metadata of the official DeepSeek V4 Flash GGUF files. Every kernel is shaped for this model, enabling prefill optimizations like the recently merged Metal 4 M5 acceleration PR. Disk-Backed KV Cache: The inference engine exploits the model's highly compressed KV state combined with fast SSD read/write speeds on modern Macs, making long-context inference (up to 1M tokens) viable on machines with 96-128GB RAM. Agent-Ready Server API: Ships as both a standalone CLI binary (ds4) and an HTTP server (ds4-server) with a clean REST API - ready to be integrated with coding agents and tooling out of the box.
- Issue #15 - "Add Metal 4 M5 prefill optimizations" (15 comments) A contributor proposed significant speedups via Apple Silicon M4/M5 Metal optimizations. The project author (antirez, of Redis fame) responded: "Very significant speedup. I'm in the middle of a large refactoring so can't merge right now and there is the problem that I miss the hardware to later maintain it, but I'll see what I can do." Another community member replied: "100% with you, we need to sort this out and let you get an M5" - highlighting both the excitement around the performance gains and the challenge of maintaining Apple-specific GPU optimizations without dedicated hardware. Issue #60 - "feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal" (9 comments) A user tested a PR loading an abliterated DeepSeek V4 Flash GGUF and initially hit a Metal graph error: "Metal graph indexer q projection expects F16 weights." The author fixed it within hours in commit c2144e50, and the user confirmed success running the model with a 100k token context window. This thread showcases the rapid iteration cycle and the importance of abliterated GGUF variants for local inference. Issue #25 - "Generating GGUF from official DeepSeek releases" (5 comments) Community members discussed the challenges of producing GGUF files that match official DeepSeek V4 Flash logits. The author explained: "I'm using a mix of the llama things and my private code that is not ready to be released." The conversation also surfaced that DeepSeek v4 Flash produces dramatically shorter "thinking" sections than comparable models - roughly 1/5 the length for equivalent complexity - making it practical for real-time inference even on constrained hardware.
- DwarfStar 4 represents a compelling new direction in local AI inference: instead of building a general-purpose Swiss Army knife, it bets everything on one model done right. By targeting DeepSeek V4 Flash specifically - with its 1M context window, compact KV cache, and strong multilingual output - ds4 delivers a genuinely usable frontier-model experience on hardware that fits in a home office. The active community engagement, rapid bug fixes, and planned Metal/CUDA optimizations make this a project worth watching closely. @antirez - https://github.com/antirez/ds4
DwarfStar 4 (ds4) is a lightweight, model-specific native inference engine built for DeepSeek V4 Flash — the powerful 284B mixture-of-experts model. Unlike general-purpose GGUF runners, ds4 is intentionally narrow: it targets only DeepSeek V4 Flash, optimized specifically for Metal (macOS) and CUDA (Linux), with deep integration into the model's KV cache architecture, prompt rendering, and server API. The project emerged in early May 2026 and has already garnered over 7,000 stars, driven by the growing need to run frontier-class models on personal hardware — including MacBooks with 128GB RAM and Mac Studios.
The core innovation behind ds4 is treating the KV cache as a first-class disk citizen. DeepSeek V4 Flash features an exceptionally compressed KV cache, and ds4 capitalizes on this by supporting on-disk KV cache persistence alongside in-RAM inference. This enables the model to maintain a massive 1M token context window on consumer-grade hardware that would otherwise be impossible.
- DeepSeek V4 Flash-Specific Optimization: ds4 implements Metal and CUDA graph executors tailored to the exact tensor layout, quantization mix, and metadata of the official DeepSeek V4 Flash GGUF files. Every kernel is shaped for this model, enabling prefill optimizations like the recently merged Metal 4 M5 acceleration PR.
- Disk-Backed KV Cache: The inference engine exploits the model's highly compressed KV state combined with fast SSD read/write speeds on modern Macs, making long-context inference (up to 1M tokens) viable on machines with 96-128GB RAM.
- Agent-Ready Server API: Ships as both a standalone CLI binary (
ds4) and an HTTP server (ds4-server) with a clean REST API - ready to be integrated with coding agents and tooling out of the box.
ds4) and an HTTP server (ds4-server) with a clean REST API - ready to be integrated with coding agents and tooling out of the box.Issue #15 - "Add Metal 4 M5 prefill optimizations" (15 comments)
A contributor proposed significant speedups via Apple Silicon M4/M5 Metal optimizations. The project author (antirez, of Redis fame) responded: "Very significant speedup. I'm in the middle of a large refactoring so can't merge right now and there is the problem that I miss the hardware to later maintain it, but I'll see what I can do." Another community member replied: "100% with you, we need to sort this out and let you get an M5" - highlighting both the excitement around the performance gains and the challenge of maintaining Apple-specific GPU optimizations without dedicated hardware.
Issue #60 - "feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal" (9 comments)
A user tested a PR loading an abliterated DeepSeek V4 Flash GGUF and initially hit a Metal graph error: "Metal graph indexer q projection expects F16 weights." The author fixed it within hours in commit c2144e50, and the user confirmed success running the model with a 100k token context window. This thread showcases the rapid iteration cycle and the importance of abliterated GGUF variants for local inference.
Issue #25 - "Generating GGUF from official DeepSeek releases" (5 comments)
Community members discussed the challenges of producing GGUF files that match official DeepSeek V4 Flash logits. The author explained: "I'm using a mix of the llama things and my private code that is not ready to be released." The conversation also surfaced that DeepSeek v4 Flash produces dramatically shorter "thinking" sections than comparable models - roughly 1/5 the length for equivalent complexity - making it practical for real-time inference even on constrained hardware.
DwarfStar 4 represents a compelling new direction in local AI inference: instead of building a general-purpose Swiss Army knife, it bets everything on one model done right. By targeting DeepSeek V4 Flash specifically - with its 1M context window, compact KV cache, and strong multilingual output - ds4 delivers a genuinely usable frontier-model experience on hardware that fits in a home office. The active community engagement, rapid bug fixes, and planned Metal/CUDA optimizations make this a project worth watching closely.
@antirez - https://github.com/antirez/ds4