Small Chips, Big Models: The Art of Squeezing LLMs onto Budget Hardware

By Taufiq

Elevator Pitch

LLMs traditionally require GPUs for fast inference, yet most edge devices, smartphones, and budget machines are CPU-only. This talk explains how the right techniques can bring these models to low-resource hardware.

Description

Large language models traditionally require GPUs for fast, low-latency inference, yet most edge devices, smartphones, and everyday computers are CPU-only. This talk explains how techniques such as quantization, the GGUF format, and efficient inference engines make it possible to deploy these models on everyday hardware. It is aimed at ML beginners, NLP/LLM enthusiasts, and anyone curious about on-device, CPU, and mobile deployment.

Expect to learn:
- The challenges in deploying language models on resource-constrained devices
- Optimization solutions: quantization, model compression, and efficient formats
- Practical implementation using llama.cpp and its Python bindings (a minimal sketch follows this list)
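
As a taste of the practical portion, here is a minimal sketch of CPU-only inference with the llama-cpp-python bindings. The model file name and generation settings are illustrative assumptions, not material from the talk itself; any small chat model in GGUF format works.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load a 4-bit quantized GGUF model from disk; path and settings are
# illustrative assumptions.
llm = Llama(
    model_path="models/qwen2.5-1.5b-instruct-q4_k_m.gguf",
    n_ctx=2048,    # context window size
    n_threads=4,   # roughly match the number of physical CPU cores
)

# Run a chat completion entirely on the CPU.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```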

Outline:
1. Introduction (5 mins): Why on-device deployment?
2. Challenges (5 mins): What are the challenges of on-device deployment?
3. Optimization (10 mins): Quantization, GGUF, and model compression for faster inference (a back-of-envelope memory sketch follows this outline).
4. Demo (5 mins): Running on different devices - Snapdragon 680 (4 GB RAM), Snapdragon 685 (8 GB RAM), and AMD Ryzen 5 (16 GB RAM).
5. Q&A (5 mins): Audience discussion and open-source resources.

Resources:
- GitHub: https://github.com/taufiq-ai/slm-python-buddy
- Blog Article: https://taufiq.hashnode.dev/pybuddy-local-ai-coding-assistant