What you need before starting
This guide was tested on a machine with 8GB of RAM and 512MB of dedicated GPU memory. If your machine meets or exceeds those specs, you can follow along exactly. If you have no GPU at all, you can still run a model — just use the CPU build of llama.cpp instead of the Vulkan build. If you have a dedicated Nvidia GPU (RTX series), the CUDA build will give you the best performance.
| Your Setup | Build to Use |
|---|---|
| No GPU / CPU only | CPU build |
| Any GPU (AMD, Intel, basic Nvidia) | Vulkan build |
| Nvidia RTX (CUDA-capable) | CUDA 12 or CUDA 13 build |
Step 1: Download llama.cpp
llama.cpp is an open-source engine written in C++ that runs AI language models locally on your hardware. It was originally built for CPU-only inference but now supports GPU acceleration via Vulkan, CUDA, and Metal (Mac). It handles all the low-level computation: you just point it at a model file and it runs.
Download the Vulkan prebuilt binary here. This is a prebuilt version, meaning you do not need to install C++, CMake, or any build tools. Download the ZIP, extract it to a folder of your choice, and you are ready for the next step.
Step 2: Start the server
Open the folder where you extracted llama.cpp. Click on the address bar at the top of the File Explorer window, type cmd, and press Enter. This opens a command prompt already pointed at that folder.
In the command prompt, type the following and press Enter:
llama-server.exe
You will see llama.cpp start up and host a local web UI — typically at http://127.0.0.1:8080/. Open that in your browser. The interface is there, but it will not respond to messages yet because no model is loaded. That is next.
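If you would rather confirm from a script that the server came up, llama-server exposes a /health endpoint that returns HTTP 200 once it is ready. Here is a minimal sketch using only the Python standard library; the function name and the default port 8080 are assumptions based on the setup above:

```python
import urllib.request
import urllib.error

def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a llama.cpp server answers at base_url.

    llama-server's /health endpoint replies with HTTP 200 when ready.
    """
    try:
        url = base_url.rstrip("/") + "/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not up yet.
        return False

print(server_is_up("http://127.0.0.1:8080"))
```

This prints True only while the server from this step is running.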
Step 3: Download a model in GGUF format
llama.cpp requires models in GGUF format. This is a packaged, ready-to-run model file, similar to how a prebuilt binary saves you from compiling code yourself. Models distributed in the alternative SafeTensors format need an extra conversion step before llama.cpp can run them, so stick with GGUF.
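If you are ever unsure whether a downloaded file really is GGUF, the format is easy to recognize: every GGUF file begins with the four ASCII magic bytes "GGUF". A quick sanity check (the function name is mine; the path is whatever you downloaded):

```python
def looks_like_gguf(path: str) -> bool:
    """Cheap sanity check: GGUF files start with the 4-byte magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example: looks_like_gguf(r"C:\models\Qwen3-0.6B-Q8_0.gguf")
```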
For a basic machine, small models are the right choice. Qwen3 0.6B is a solid starting point — capable enough to be useful, small enough to run comfortably on 8GB of RAM. Download the Qwen3 0.6B GGUF model from Hugging Face here. Look for the Q8 file for best quality, or Q4_K_M for a smaller, faster option.
The Q number refers to quantization — a compression technique that reduces model file size and memory usage at a small cost to quality. Q8 is high quality and larger. Q4 is more compressed and faster. For a 0.6B model, Q8 is fine on most basic machines.
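The back-of-the-envelope math shows why the Q number matters so much: size scales with bits per weight. The sketch below is a lower-bound estimate, since real GGUF files also carry metadata, per-block scale factors, and some mixed-precision layers; the function name is mine:

```python
def approx_model_size_gb(n_params: int, bits_per_weight: float) -> float:
    """Rough size estimate: parameters x bits per weight, in gigabytes.

    Real GGUF files run somewhat larger (metadata, per-block scales,
    mixed-precision layers), so treat this as a lower bound.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Qwen3 0.6B (~600M parameters), pure 8-bit vs pure 4-bit weights:
print(approx_model_size_gb(600_000_000, 8))  # ~0.6 GB
print(approx_model_size_gb(600_000_000, 4))  # ~0.3 GB
```

Halving the bits roughly halves both the download and the RAM the model occupies, which is exactly the tradeoff described above.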
Step 4: Load the model and run
Stop the server if it is still running (Ctrl+C in the command prompt). Now start it again with the model path:
llama-server.exe -m C:\path\to\your-model.gguf
Replace the path with wherever you saved the downloaded GGUF file. The server will load the model — this takes a few seconds — and then the web UI at http://localhost:8080 will be fully functional. You can now chat with the model directly in your browser, completely offline.
On the test machine (8GB RAM, basic GPU), Qwen3 0.6B runs at roughly 14–15 tokens per second. That is fast enough for comfortable back-and-forth conversation and light coding assistance.
What to try next
Once the model is running, the setup is yours to extend. You can swap in larger models as you get comfortable — Qwen3 1.7B is a meaningful step up if your machine handles it. You can also connect llama.cpp to other tools via its OpenAI-compatible API endpoint, which means any application that supports OpenAI's API can talk to your local model instead.
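As a concrete example of that API endpoint, here is a minimal chat call using only the Python standard library. It assumes the server from Step 4 is running on the default port 8080; the function names are mine, but /v1/chat/completions is llama-server's OpenAI-compatible chat endpoint:

```python
import json
import urllib.request

def build_chat_payload(prompt: str) -> dict:
    """One user turn in OpenAI chat format. llama-server serves whichever
    model it was started with, so "model" can be a placeholder string."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """POST to the OpenAI-compatible chat endpoint and return the reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the server from Step 4 to be running:
# print(chat("Say hello in five words."))
```

Because the request shape matches OpenAI's, pointing an existing OpenAI-based tool at your local base URL is usually the only change needed.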
Local models are also the foundation for more advanced workflows. To understand how AI models can be extended to use tools and take real-world actions, which is where local inference becomes genuinely powerful, see our article From LLM to AI Agent: The Magic of Tool Calling.