ds4.c: A Lean Native Inference Engine for DeepSeek V4 Flash on Apple Silicon
View ds4.c on GitHub: antirez/ds4
ds4.c is a focused, no-frills native inference engine purpose-built for running DeepSeek V4 Flash on Apple Silicon Macs via Metal. It is not a generic GGUF runner, a wrapper around another runtime, or a full framework. It is a tightly optimized Metal graph executor with DS4-specific loading, prompt rendering, KV state management, and server API glue, all contained in a single C file.
- Metal-native performance: A dedicated Metal compute pipeline that squeezes DeepSeek V4 Flash through Apple GPU shaders, achieving impressive prefill and generation throughput on M-series chips.
- 1M token context window: DeepSeek V4 Flash supports a 1-million-token context, and ds4.c persists the KV cache to disk, so you can checkpoint long conversations and resume them later instead of keeping them resident in RAM.
- Proportional thinking output: Many reasoning models produce verbose, all-or-nothing thinking traces. DeepSeek V4 Flash instead scales the length of its thinking section to the complexity of the problem, so the reasoning is shorter and more focused, and ds4.c preserves this behavior faithfully.
- 2-bit quantization on MacBooks: With careful 2-bit quantization, this 284B-parameter model can run on MacBooks with 128 GB of RAM, which was previously out of reach for a model of this scale.
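To see why 2-bit quantization is what makes the 128 GB figure plausible, the back-of-the-envelope arithmetic can be sketched as follows. This is only a rough estimate: the helper name `weight_bytes` is hypothetical, and it assumes every weight is stored at the same bit width. Real quantized files also carry per-block scales and usually keep some tensors at higher precision, so the actual footprint will be larger.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Raw storage for the weights alone, ignoring scales and metadata."""
    return n_params * bits_per_weight / 8


N_PARAMS = 284_000_000_000  # 284B parameters, as stated in the article

for bits in (16, 8, 2):
    gb = weight_bytes(N_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: {gb:.0f} GB")
```

At 2 bits per weight the raw weights come to about 71 GB, which leaves headroom on a 128 GB machine for the KV cache, activations, and the operating system. At 8 bits (284 GB) or 16 bits (568 GB) the weights alone would not fit.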
A community member filed a detailed security report describing two issues: a buffer overflow in the server component, and shell command injection in download_model.sh, where user-supplied arguments such as --token were interpolated directly into curl commands without sanitization:
"The download_model.sh script insecurely handles user-supplied arguments... These variables are injected directly into curl commands without proper sanitization or shell quoting."
winterqt (maintainer) acknowledged it as a low-risk issue for local use, noting the repo is experimental and not intended for production. ahmedayman1997 pushed back, arguing that security fundamentals shouldn't be skipped even in experimental tooling:
"In a production-grade security context, trusting user-supplied input, especially in a script meant to handle authentication tokens, is a fundamental flaw... leaving an opening for arbitrary code execution via a crafted token (e.g., ; touch HACKED;) is poor practice."
This sparked a healthy debate about the security expectations for local-only, experimental inference tools.
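The class of bug under discussion is easy to illustrate. The sketch below is not the project's script or its fix. It is a minimal Python illustration of the general remedy: pass user input as a single argv element, or quote it before it reaches a shell. The function names and the example URL are hypothetical, and the `Authorization: Bearer` header is an assumption about how a token might be sent.

```python
import shlex


def build_curl_argv(url: str, token: str) -> list[str]:
    """Build an argv list for curl; no shell ever parses the token."""
    return [
        "curl", "--fail", "--location",
        "-H", f"Authorization: Bearer {token}",  # token stays inside one argument
        url,
    ]


def build_shell_command(url: str, token: str) -> str:
    """If a shell string is unavoidable, quote every argument first."""
    return shlex.join(build_curl_argv(url, token))


if __name__ == "__main__":
    hostile = "; touch HACKED;"  # the payload quoted in the report
    url = "https://example.invalid/model.gguf"  # hypothetical URL
    print(build_curl_argv(url, hostile))
    print(build_shell_command(url, hostile))
```

With an argv list (for example via `subprocess.run(argv)` without `shell=True`), the semicolons are just characters in a header value. In a shell script, the equivalent habit is to double-quote every variable expansion.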
MirkoCovizzi is working on a port for Strix Halo hardware and raised the question of making the GGUF conversion process more transparent and reproducible:
"I think it would be interesting if it was more transparent how the GGUF for ds4 was generated and being able to do so even locally using some conversion script."
antirez responded that the current conversion scripts are a messy mix of llama.cpp tooling and his own private one-off code, but noted that DeepSeek is expected to release updated V4 Flash versions, which should simplify things. He also confirmed that MTP (Multi-Token Prediction) is already partially implemented in the code, though gains are currently limited.
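For context on why MTP gains depend on how often predictions are accepted, here is a generic sketch of greedy draft-and-verify acceptance, the usual way extra predicted tokens are turned into a speedup. It is an assumption that ds4.c uses MTP this way; the function `accept_draft` and the token values are hypothetical and do not come from the ds4.c code.

```python
def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Return the tokens emitted in one step of greedy draft-and-verify.

    draft:  k tokens proposed cheaply (e.g. by an MTP head).
    target: k + 1 tokens the main model picks at the same positions
            when it checks the draft in a single forward pass.
    """
    emitted = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            emitted.append(verified)  # the main model's token replaces the first miss
            return emitted
        emitted.append(proposed)
    emitted.append(target[len(draft)])  # every draft accepted: one bonus token
    return emitted


if __name__ == "__main__":
    print(accept_draft([5, 7, 9], [5, 7, 2, 4]))  # third draft token rejected
    print(accept_draft([5, 7], [5, 7, 3]))        # both accepted, plus bonus
```

Each step emits at least one token and at most k + 1. When drafts are rarely accepted, the step costs more than plain decoding but yields about the same number of tokens, which is one plausible reason for limited gains.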
A contributor proposed enabling M5-class Metal 4 MPP prefill paths for Q8_0 dense matmuls and fused six-expert MoE sum kernels:
"enable M5-class Metal 4 MPP prefill paths for Q8_0 dense matmuls, attention-output low projection, and staged routed MoE projections"
antirez called it a "very significant speedup" but said a large refactoring was in progress and he lacked M5 hardware to properly maintain it long-term. ivanfioravanti joked: "we need to sort this out and let you get an M5 😎" — a sentiment the community clearly shares.
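For readers unfamiliar with the Q8_0 format named in the proposal: in ggml-style GGUF files, Q8_0 stores weights in blocks of 32 signed 8-bit integers that share one scale. The sketch below shows the idea in plain Python. It is not the Metal kernel from the proposal, and the function names are hypothetical. It also keeps the scale as a Python float, whereas the on-disk format stores it as fp16, and Python's `round()` breaks ties differently from C's `roundf`.

```python
QK8_0 = 32  # values per Q8_0 block


def quantize_q8_0(block: list[float]) -> tuple[float, list[int]]:
    """Quantize one block of 32 floats to (scale, 32 ints in [-127, 127])."""
    assert len(block) == QK8_0
    amax = max(abs(x) for x in block)
    d = amax / 127.0  # one scale shared by the whole block
    if d == 0.0:
        return 0.0, [0] * QK8_0
    return d, [round(x / d) for x in block]


def dequantize_q8_0(d: float, qs: list[int]) -> list[float]:
    """Recover approximate floats from a quantized block."""
    return [d * q for q in qs]


def dot_q8_0(d_a: float, qs_a: list[int], d_b: float, qs_b: list[int]) -> float:
    """Dot product of two blocks: integer multiply-accumulate, then one scale."""
    return d_a * d_b * sum(a * b for a, b in zip(qs_a, qs_b))


if __name__ == "__main__":
    block = [float(i - 16) for i in range(QK8_0)]
    d, qs = quantize_q8_0(block)
    err = max(abs(x - y) for x, y in zip(block, dequantize_q8_0(d, qs)))
    print(f"scale={d:.4f}, max round-trip error={err:.4f}")
```

The dot product is the part that matters for a dense matmul: the inner loop is all integer arithmetic, and the floating-point scales are applied once per block pair, which is what makes the format a good fit for hardware matrix units.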
ds4.c is a showcase of focused engineering: take one excellent model (DeepSeek V4 Flash), write a minimal Metal executor around it, and ship something that actually runs well on consumer Apple Silicon hardware. With 4.9k stars in just four days, the community clearly finds value in this no-bloat approach. Watch this space — CUDA and AMD ROCm support are actively being explored.