For experiments and research on Applied AI.
Housing a variety of Triton and CUDA kernels for training and inference.
Inference kernels do not support the backward pass.
1 - Triton - MoE (Mixtral) GEMM for accelerating inference. Uses a column-major access pattern to increase data locality.
[Figure: moe_gemm_a100]
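As a rough illustration of what a column-major access pattern can mean for a tiled GEMM, the pure-Python sketch below maps a linear program id to an output tile in two different orders. It is not the kernel from this repo: the function names, the tile counts, and the assumption that consecutive programs benefit from sharing the same weight column tile are all hypothetical and only meant to show the indexing idea.

```python
from typing import Callable, Tuple

TileOrder = Callable[[int, int, int], Tuple[int, int]]


def row_major_tile(pid: int, num_tiles_m: int, num_tiles_n: int) -> Tuple[int, int]:
    """Map a linear program id to (tile_m, tile_n), walking along N first."""
    return pid // num_tiles_n, pid % num_tiles_n


def col_major_tile(pid: int, num_tiles_m: int, num_tiles_n: int) -> Tuple[int, int]:
    """Map a linear program id to (tile_m, tile_n), walking down M first."""
    return pid % num_tiles_m, pid // num_tiles_m


def weight_tiles_touched(order: TileOrder, start: int, window: int,
                         num_tiles_m: int, num_tiles_n: int) -> int:
    """Count the distinct weight column tiles (tile_n) used by
    `window` consecutive program ids starting at `start`."""
    return len({order(pid, num_tiles_m, num_tiles_n)[1]
                for pid in range(start, start + window)})


# Hypothetical grid: 4 tiles along M (tokens), 3 tiles along N (weight columns).
NUM_TILES_M, NUM_TILES_N = 4, 3

col_order = [col_major_tile(p, NUM_TILES_M, NUM_TILES_N) for p in range(6)]
row_order = [row_major_tile(p, NUM_TILES_M, NUM_TILES_N) for p in range(6)]
print("column-major:", col_order)
print("row-major:   ", row_order)
```

In this toy grid, four consecutive programs in column-major order all use the same weight column tile, while the same four programs in row-major order spread across three. Whether that translates into better cache reuse on a given GPU depends on the real kernel and its scheduling.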
[Figure: softmax_fused]
- CUDA Mode - Reading group for learning CUDA programming - (Discord, Lecture Materials, Lecture recordings)
- llama-recipes - Recipes for fine-tuning and inference for Llama model series
- NeurIPS'23 LLM Efficiency Challenge - 1LLM + 1GPU + 1Day competition - (website, code, NeurIPS Workshop recordings)
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation paper
- Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK Work Decomposition paper
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel paper
- Sustainable AI: Environmental Implications, Challenges and Opportunities paper
The applied-ai repo is released under the BSD 3-Clause license.