Axolotl — LLM Fine-tuning Framework — GitHub Trending Open Source Project | 2026-06-01
Axolotl is an open-source LLM fine-tuning framework written in Python, currently sitting at approximately 12,000 GitHub stars. Developed and actively maintained by the axolotl-ai-cloud team, it provides a unified, configuration-driven approach to training large language models using techniques like LoRA, QLoRA, FSDP, and RL-based methods. The project is rapidly gaining traction among researchers and engineers who need a battle-tested pipeline for customizing open-weight LLMs on custom datasets without rebuilding training infrastructure from scratch.
What makes Axolotl stand out in the crowded LLM fine-tuning space is its philosophy of "configuration over code." Instead of writing hundreds of lines of training scripts, you describe your experiment in a YAML file and let Axolotl handle the heavy lifting: distributed training, gradient checkpointing, mixed precision, and dataset formatting are all driven by your config. This pays off for teams that iterate quickly on model variants, because swapping a base model or adjusting training hyperparameters means changing only a few config fields.
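To make the "few config fields" point concrete, here is a minimal Python sketch that derives an experiment variant from one base config. The helpers `make_variant` and `changed_fields` are hypothetical and not part of Axolotl; the keys mirror the example config used in this article, and the Qwen model ID is only an illustrative placeholder.

```python
from copy import deepcopy

# Base experiment description; keys mirror the example config in this article.
BASE_CONFIG = {
    "base_model": "mistralai/Mistral-7B-v0.3",
    "adapter": "lora",
    "lora_r": 8,
    "lora_alpha": 16,
    "learning_rate": 0.0002,
    "num_epochs": 3,
}


def make_variant(base, **overrides):
    """Return a copy of `base` with the given fields replaced (hypothetical helper)."""
    variant = deepcopy(base)
    variant.update(overrides)
    return variant


def changed_fields(base, variant):
    """List the keys whose values differ between two configs."""
    return sorted(k for k in variant if base.get(k) != variant[k])


# Swap the base model and double the LoRA rank; everything else is inherited.
variant = make_variant(BASE_CONFIG, base_model="Qwen/Qwen2.5-7B", lora_r=16, lora_alpha=32)
print(changed_fields(BASE_CONFIG, variant))  # ['base_model', 'lora_alpha', 'lora_r']
```

Keeping the diff between two runs this small is what makes experiments easy to review and reproduce.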
In my testing, the framework was surprisingly easy to get running on a single A100 GPU. The official examples cover the most popular open models (Mistral, Qwen, Llama, Gemma), and the documentation includes step-by-step Colab notebooks. The multi-GPU FSDP2 support is production-grade — Issue #2760 documents how the team solved loading 32B-parameter Qwen models across two A40 GPUs, a non-trivial engineering challenge. For researchers who want to experiment with novel training methods (like the recently merged QAT support in Issue #2590 or the text diffusion training plugin in Issue #3067), Axolotl provides clean plugin hooks without forcing you to fork the entire codebase.
The project prides itself on being "uv-first" (as of April 2026, see PR #3545), meaning it is designed to install cleanly via uv, the modern Python package manager, without dependency conflicts. Recent highlights include support for Mistral Medium 3.5, Gemma 4, Qwen3.5 MoE, and GLM-4 series models. The MoE expert quantization feature (`quantize_moe_experts: true`) is particularly clever: it reduces VRAM usage when training Mixture-of-Experts models by quantizing the expert weights, which account for most of an MoE model's parameters. That is a significant memory saving for anyone running MoE training on consumer hardware.
Another noteworthy addition is the ScatterMoE LoRA support (PR #3410), which enables fine-tuning directly on MoE expert weights using custom Triton kernels. Combined with the SageAttention integration for faster attention computation, Axolotl is keeping pace with the latest research developments in LLM training efficiency.
- Domain-specific fine-tuning: If you need to adapt a base model (say, Llama 3) to your industry's jargon, legal documents, or medical records, Axolotl's dataset pipeline handles the formatting for SFT, RL, and chat formats out of the box.
- LoRA/QLoRA experimentation: Researchers who want to quickly compare different rank configurations, target modules, or learning rates across multiple model sizes will appreciate the single-config workflow, with no Python scripting required between experiments.
- Multimodal continued pre-training: The recently merged feature in Issue #3629 adds raw image + text continued pre-training for VLMs, streaming data in batches through a hardened collator. This is a relatively rare feature in open-source fine-tuning tools.
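For the LoRA rank comparisons mentioned above, it helps to know how many trainable parameters each rank adds. The sketch below uses the standard LoRA accounting: a rank-r adapter on a linear layer with input size d_in and output size d_out adds r × (d_in + d_out) parameters. The layer shapes are assumptions based on the published Mistral 7B architecture (hidden size 4096, 8 key/value heads of dimension 128, 32 layers); check them against the config of the model you actually train.

```python
# Assumed projection shapes (d_in, d_out) for one Mistral-7B-style decoder layer.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),  # grouped-query attention: 8 KV heads x 128 dims
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(rank, target_modules, shapes=PROJECTION_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: each adapted layer adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers


targets = ["q_proj", "k_proj", "v_proj", "o_proj"]
for rank in (8, 16, 64):
    print(f"lora_r={rank:>2}: {lora_trainable_params(rank, targets):,} trainable parameters")
```

Under these assumptions, `lora_r: 8` on the four attention projections trains roughly 6.8 million parameters, and the count grows linearly with the rank.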
Here is a minimal example to fine-tune Mistral 7B with LoRA on a custom dataset:
Step 1: Install Axolotl
```bash
pip install axolotl
```
Step 2: Prepare your dataset
Create a JSONL file with your training data, one conversation per line:
```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
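If your raw data lives in Python objects, a short script can validate and write the JSONL file. This is a minimal sketch, not an Axolotl utility: the function names, the `train.jsonl` filename, the sample conversation, and the set of allowed roles are all assumptions chosen for illustration.

```python
import json
from pathlib import Path

# Assumed role names; adjust to whatever your chat template expects.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Check that a record has a non-empty `messages` list of role/content dicts."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("each message needs string 'content'")


def write_jsonl(records, path):
    """Validate every record first, then write one JSON object per line."""
    for record in records:
        validate_record(record)
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {"messages": [
        {"role": "user", "content": "What is a force majeure clause?"},
        {"role": "assistant", "content": "A contract term that excuses performance after extraordinary events."},
    ]},
]
write_jsonl(records, "train.jsonl")
```

Validating before writing means a malformed record fails fast on your machine rather than partway through a training run.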
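Step 3: Write a config YAML
```yaml
# configs/llama3/lora/qlora.yml
base_model: mistralai/Mistral-7B-v0.3
model_type: MistralForCausalLM

# Training data from Step 2 (path is a placeholder; point it at your own file)
datasets:
  - path: train.jsonl
    ds_type: json
    type: chat_template
    field_messages: messages
dataset_prepared_path: ""
val_set_size: 0.1

# Effective batch size = micro_batch_size x gradient_accumulation_steps x GPUs
micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002

adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]

output_dir: ./qlora-out
```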
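The batch-related settings interact: with gradient accumulation, the optimizer only steps after several micro-batches. Here is a quick sketch of the usual arithmetic (micro-batch size × accumulation steps × number of GPUs), using a micro-batch size of 1 and 4 accumulation steps as in the config above and assuming a single GPU. The function names and the 900-sample training set (for example, 1,000 samples minus a 10% validation split) are illustrative, not Axolotl APIs or measured values.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer step under data parallelism."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def optimizer_steps(num_samples, num_epochs, batch):
    """Approximate optimizer steps, ignoring any incomplete final batch per epoch."""
    return (num_samples // batch) * num_epochs


batch = effective_batch_size(micro_batch_size=1, gradient_accumulation_steps=4)
print(batch)                           # 4
print(optimizer_steps(900, 3, batch))  # 675
```

Moving to more GPUs multiplies the effective batch size, so the learning rate or accumulation steps may need adjusting to keep runs comparable.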
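Step 4: Launch training
```bash
axolotl train configs/llama3/lora/qlora.yml
```
Step 5: Merge and use the adapter
```bash
# Merge the trained adapter into the base model weights
axolotl merge-lora configs/llama3/lora/qlora.yml --lora-model-dir="./qlora-out"
# Or try the adapter interactively without merging
axolotl inference configs/llama3/lora/qlora.yml --lora-model-dir="./qlora-out"
```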
- Multi-backend training: Axolotl supports DeepSpeed, FSDP1, and FSDP2 backends, which you select in the config to match your hardware and model size. FSDP2 support is production-ready with meta-device loading for models too large to fit in a single GPU's memory.
- Rich model support: Beyond the big open models, Axolotl supports Phi, Gemma, GLM, Qwen, Yi, DeepSeek, and more. Each model has a dedicated example directory with tested configs.
- Streaming data pipelines: The multimodal CPT feature (Issue #3629) uses a streaming-first approach with placeholder-count guardrails, preventing OOM during large-scale continued pre-training on raw image-text pairs.
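A back-of-the-envelope calculation shows why sharded, meta-device loading matters for models that do not fit on one GPU. The sketch below counts weight memory only (no gradients, optimizer state, or activations), assumes bf16 weights at 2 bytes per parameter, and assumes weights are split evenly across GPUs. The function is illustrative and not part of Axolotl.

```python
def weight_memory_gib(num_params, bytes_per_param=2, num_shards=1):
    """Memory for model weights alone, per GPU, assuming an even split across shards."""
    return num_params * bytes_per_param / num_shards / 2**30


full = weight_memory_gib(32e9)                   # whole 32B model on one device
sharded = weight_memory_gib(32e9, num_shards=2)  # evenly sharded over two GPUs
print(f"unsharded: {full:.1f} GiB, per GPU when sharded over 2: {sharded:.1f} GiB")
# unsharded: 59.6 GiB, per GPU when sharded over 2: 29.8 GiB
```

At roughly 60 GiB, the unsharded bf16 weights of a 32B-parameter model exceed a single 40 to 48 GB card, so each rank has to materialize only its own shard instead of loading the full model first.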
Star trend: ⭐ 12.0k | 📈 +50 today (estimated)
Compared to Unsloth, Axolotl offers broader model coverage and more training backends (Unsloth is optimized for single-GPU speed). Compared to LlamaFactory, Axolotl has a more active Issues community (217 open issues, frequent releases) and better integration with RL-based training methods. If you need a web UI alongside training, LlamaFactory's Gradio interface is nicer; if you want programmatic control and experiment reproducibility, Axolotl's YAML-first approach wins.
Issue #3629 (Multimodal continued pre-training, 20 comments): A developer proposed adding raw image + text streaming pre-training support for VLMs. The discussion centered on the tokenization approach (a placeholder-count guardrail to handle variable image sizes) and the hardened collator design. Reviewers praised the streaming-first architecture for its memory efficiency, and the feature shipped as part of the 2026-04 release cycle. This is a good example of Axolotl's community being forward-thinking about multimodal use cases.
Issue #2760 (FSDP1 to FSDP2 migration, 8 comments): A contributor documented the process of migrating FSDP1 training scripts to FSDP2, specifically solving the challenge of loading Qwen 32B onto two A40 GPUs (48 GB each). The key insight was enabling meta-device instantiation and sharded state dict loading. The discussion is technical but valuable for anyone running large model training on multi-GPU setups.
Issue #3095 (Merge LoRA without loading model, 11 comments): This feature request proposed making the LoRA merge script iterate through model bins without loading the full model into memory at once. It was motivated by memory constraints when merging adapters for very large models. The request was tagged for scheduled release, showing the team prioritizes memory efficiency.
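The idea behind shard-by-shard merging in Issue #3095 can be sketched in pure Python. This is a toy illustration under stated assumptions, not Axolotl's merge script: matrices are plain lists of rows, each "shard" is a dict of weight names to matrices, and all function names are hypothetical. A real implementation would stream tensors from checkpoint files. The merge itself uses the standard LoRA formula W' = W + (alpha / r) · B·A.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_weight(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


def merge_shards(shards, adapters, alpha, rank):
    """Yield merged shards one at a time so only one shard is held in memory."""
    for shard in shards:  # `shards` can be a lazy iterator over checkpoint files
        merged = {}
        for name, weight in shard.items():
            if name in adapters:
                lora_a, lora_b = adapters[name]
                weight = merge_weight(weight, lora_a, lora_b, alpha, rank)
            merged[name] = weight
        yield merged


# Toy data: two "shards", one adapted 2x2 weight with a rank-1 adapter.
shards = iter([
    {"layer0.q_proj": [[1.0, 0.0], [0.0, 1.0]]},
    {"layer0.mlp": [[5.0]]},
])
adapters = {"layer0.q_proj": ([[1.0, 2.0]], [[0.5], [0.25]])}  # A is 1x2, B is 2x1

for merged_shard in merge_shards(shards, adapters, alpha=2, rank=1):
    print(merged_shard)
```

Because `merge_shards` is a generator, each merged shard can be written to disk and released before the next one is read, which keeps peak memory near the size of a single shard.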
- YAML indentation matters: Axolotl configs are sensitive to indentation. A missing space or wrong nesting level silently falls back to defaults, causing your model to train with unexpected hyperparameters. Always validate against the example configs in the repository.
- Dataset format is model-dependent: Axolotl supports multiple chat templates (ChatML, Llama 3, Mistral, etc.). Using the wrong `chat_template` in your config produces garbled training data that degrades model quality. Check the `examples/` directory for the correct template for your base model.
- FSDP2 requires PyTorch 2.2+: If you are on an older PyTorch version, FSDP2 features will silently fall back to FSDP1 behavior. Always check the version requirements in the `pyproject.toml` before upgrading your training stack.
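One way to guard against the first pitfall is to lint the parsed config before launching a run. The sketch below is a hypothetical helper, not an Axolotl feature. `EXPECTED_KEYS` is an illustrative subset of commonly used keys rather than Axolotl's full schema, and the config is shown as an already-parsed dictionary so the example needs only the standard library.

```python
import difflib

# Illustrative subset of commonly used keys; NOT Axolotl's full schema.
EXPECTED_KEYS = {
    "base_model", "model_type", "micro_batch_size", "gradient_accumulation_steps",
    "num_epochs", "learning_rate", "adapter", "lora_r", "lora_alpha", "lora_dropout",
    "lora_target_modules", "datasets", "dataset_prepared_path", "val_set_size",
    "output_dir", "chat_template",
}


def lint_config(config, expected=EXPECTED_KEYS):
    """Return warnings for unknown top-level keys and for known keys nested by accident."""
    warnings = []
    for key, value in config.items():
        if key not in expected:
            hint = difflib.get_close_matches(key, expected, n=1)
            suffix = f" (did you mean '{hint[0]}'?)" if hint else ""
            warnings.append(f"unknown key '{key}'{suffix}")
        if isinstance(value, dict):
            for nested in value:
                if nested in expected:
                    warnings.append(f"'{nested}' is nested under '{key}' but expected at top level")
    return warnings


# A config as it might look after parsing: one misspelled key, and a bad
# indent that nested `lora_alpha` under `lora_r`.
parsed = {
    "base_model": "mistralai/Mistral-7B-v0.3",
    "gradient_accumulation": 4,
    "lora_r": {"lora_alpha": 16},
}
for warning in lint_config(parsed):
    print(warning)
```

A check like this only catches structural mistakes; comparing against the example configs in the repository remains the more reliable reference.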
Axolotl is a mature, actively developed LLM fine-tuning framework that strikes an excellent balance between flexibility and ease of use. Its configuration-driven approach makes it accessible to researchers without deep distributed training expertise, while its support for cutting-edge techniques (QAT, MoE quantization, ScatterMoE LoRA, multimodal CPT) keeps it relevant for advanced practitioners. The community is responsive, with active discussions on GitHub Issues and a Discord for real-time help. If you are working with open-weight LLMs and want to stop reinventing the training wheel, Axolotl is worth adding to your stack.
Project Links: axolotl-ai-cloud/axolotl on GitHub, Official Discord