DS4 and DeepSeek V4 Flash: Run 671B Parameters Locally on Your Mac
DS4 is a high-performance local inference engine built specifically for running DeepSeek V4 Flash models on Apple Silicon (Metal) and NVIDIA GPUs (CUDA). It was created by Salvatore Sanfilippo (aka antirez, the creator of Redis) and runs inference entirely on consumer hardware, with no cloud service and no API keys.
Since its launch on May 6, 2026, DS4 has collected more than 10,600 GitHub stars, making it one of the fastest-growing open-source AI projects this month. The engine is aggressively optimized for inference: its Metal prefetch pipelines, memory mapping, and quantization support are all tuned for that workload.
What makes DS4 stand out is its local-first focus: developers can run state-of-the-art models on their own machines, so prompts and outputs stay private and there are no network round trips. Whether you're building on a MacBook Pro with an M-series chip or a workstation with NVIDIA RTX GPUs, DS4 aims to work out of the box.
- Native Metal & CUDA Backends: DS4 is written in C with first-class support for Apple Silicon's Metal API and NVIDIA CUDA. It memory-maps model files to handle models that don't fit in VRAM, supporting configurations like DeepSeek V4 Flash at full 671B parameter scale on 128GB machines.
- Quantization Support: Built-in support for multiple quantization formats (IQ2, IQ3, Q4) with automatic weight loading. The engine reads these compressed formats natively, which cuts the memory footprint substantially at a modest cost in accuracy.
- Tool-Calling & Agentic Workflows: Beyond pure inference, DS4 includes a tool-safe directional steering policy (Issue #148) intended to make agentic behavior more reliable, which matters for developers whose applications depend on predictable tool execution.
- AMD ROCm/HIP Backend In Progress: The community is actively working on AMD GPU support (Issue #16), and initial HIP kernels are already functioning. Once complete, this will extend hardware support beyond Apple and NVIDIA.
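To put rough numbers on the quantization formats listed above, the following C sketch estimates how much storage the weights alone need at a given bit width. It is an illustration rather than DS4 code: the function names are hypothetical, and it assumes nominal widths of 2, 3 and 4 bits per weight for IQ2, IQ3 and Q4, ignoring the per-block scales and metadata that real quantized files carry.

```c
#include <assert.h>
#include <stdio.h>

/* Approximate storage for the weights alone, in bytes.
 * ASSUMPTION: every weight costs exactly bits_per_weight bits. Real
 * quantized formats add per-block scales and metadata, so actual
 * files are somewhat larger than this estimate. */
double weight_bytes(double n_params, double bits_per_weight) {
    assert(n_params >= 0.0 && bits_per_weight > 0.0);
    return n_params * bits_per_weight / 8.0;
}

/* Convert bytes to decimal gigabytes (1 GB = 1e9 bytes). */
double bytes_to_gb(double bytes) {
    return bytes / 1e9;
}

/* Print one estimate per nominal bit width for a model with
 * n_params weights. */
void print_footprints(double n_params) {
    /* ASSUMPTION: nominal widths of 2, 3 and 4 bits for IQ2, IQ3, Q4. */
    const char  *names[] = { "IQ2", "IQ3", "Q4" };
    const double bits[]  = { 2.0, 3.0, 4.0 };
    for (int i = 0; i < 3; i++) {
        printf("%-4s ~%.1f GB\n", names[i],
               bytes_to_gb(weight_bytes(n_params, bits[i])));
    }
}
```

Calling `print_footprints(671e9)` yields roughly 168 GB, 252 GB and 336 GB for the three nominal widths. Under these assumptions the weights are larger than 128GB of memory at every listed width, which is the situation the memory-mapping path is meant to handle.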
DS4 has an unusually active GitHub Issues section for such a young project. Here are the most illuminating threads from the community:
Developer ivanfioravanti submitted a pull request that more than doubles Metal prefill speed (a 100%+ speedup) on M5 hardware:
"Very significant speedup. I'm in the middle of a large refactoring so can't merge right now and there is the problem that I miss the hardware to later maintain it, but I'll see what I can do
Thanks." — antirez
"100% with you, we need to sort this out and let you get an M5 😎" — ivanfioravanti
This PR shows the kind of community-driven optimization that sets DS4 apart: contributors who own hardware the maintainer lacks are actively pushing performance forward.
User darchidev documented their journey getting DeepSeek V4 Flash running on an AMD Ryzen AI Max+ 395 (62GB VRAM):
"The current DeepSeek V4 Flash model (~81GB compressed) doesn't fit in consumer AMD GPUs with limited VRAM... I've started implementing AMD HIP/ROCm kernels." — darchidev
"I've started work on the HIP kernels for DS4. Currently it works and loads the full model, or you can use the zero-copy mode like Metal. Full-copy requires 93GB RAM, zero-copy-fast way less as it loads when needed." — ejpir
User puppinoo asked about legacy support, demonstrating the project's appeal to developers with diverse hardware setups. This thread is a great example of real-world debugging and hardware optimization happening in the open.
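The difference between the full-copy and zero-copy modes mentioned in this thread can be illustrated with plain POSIX calls. The sketch below is an assumption about the general technique, not DS4's actual loader: `weights_t`, `weights_map`, `weights_load` and `weights_free` are hypothetical names, and a real GPU backend would need further steps to make the mapped pages visible to the device.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a loaded weights file. */
typedef struct {
    void  *data;    /* start of the weight bytes */
    size_t size;    /* length in bytes */
    int    mapped;  /* 1 = mmap'd view, 0 = heap copy */
} weights_t;

/* Zero-copy style: map the file read-only. The OS pages data in on
 * first touch, so resident memory grows only with what is used. */
int weights_map(const char *path, weights_t *w) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return -1; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor closes */
    if (p == MAP_FAILED) return -1;
    w->data = p; w->size = (size_t)st.st_size; w->mapped = 1;
    return 0;
}

/* Full-copy style: read the whole file into heap memory up front,
 * which needs as much RAM as the file is large. */
int weights_load(const char *path, weights_t *w) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return -1; }
    size_t size = (size_t)st.st_size;
    unsigned char *buf = malloc(size);
    if (buf == NULL) { close(fd); return -1; }
    size_t done = 0;
    while (done < size) {  /* read() may return fewer bytes than asked */
        ssize_t n = read(fd, buf + done, size - done);
        if (n <= 0) { free(buf); close(fd); return -1; }
        done += (size_t)n;
    }
    close(fd);
    w->data = buf; w->size = size; w->mapped = 0;
    return 0;
}

/* Release either kind of handle. */
void weights_free(weights_t *w) {
    if (w->data == NULL) return;
    if (w->mapped) munmap(w->data, w->size);
    else free(w->data);
    w->data = NULL; w->size = 0;
}
```

Both functions return the same bytes. The trade-off is that `weights_load` pays the full memory cost before inference starts, while `weights_map` defers it to page faults, matching the "loads when needed" behavior described in the quote above.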
Contributor audreyt proposed an advanced policy to prevent tool-calling hallucinations, which antirez called "brilliant" and promised to merge as soon as possible:
"That's brilliant! Thank you so much. Going to merge ASAP." — antirez
The discussion then turned to whether the new policy should become the default, a sign that contributors are thinking carefully about safe defaults for production use.
DS4 fills a genuine gap in the local AI landscape: a fast, portable, open-source inference engine for DeepSeek models that doesn't require cloud infrastructure. With 83 open issues, active community contributions, and a founder with a track record of building world-class open-source software, DS4 is well-positioned to become the standard way developers run DeepSeek models locally. Whether you're building AI agents, prototyping applications, or running offline workloads, DS4 is worth exploring.
⭐ 10,644 Stars | Language: C | License: MIT