What you need before starting

This guide was tested on a machine with 8GB of RAM and 512MB of dedicated GPU memory. If your machine meets or exceeds those specs, you can follow along exactly. If you have no GPU at all, you can still run a model — just use the CPU build of llama.cpp instead of the Vulkan build. If you have a dedicated Nvidia GPU (RTX series), the CUDA build will give you the best performance.

| Your Setup | Build to Use |
| --- | --- |
| No GPU / CPU only | CPU build |
| Any GPU (AMD, Intel, basic Nvidia) | Vulkan build |
| Nvidia RTX (CUDA-capable) | CUDA 12 or CUDA 13 build |

Step 1: Download llama.cpp

Download the Vulkan prebuilt binary here. This is a prebuilt version, meaning you do not need to install C++, CMake, or any build tools. Download the ZIP, extract it to a folder of your choice, and you are ready for the next step.

llama.cpp is an open-source engine written in C++ that runs AI language models locally on your hardware. It was originally built for CPU-only inference but now supports GPU acceleration via Vulkan, CUDA, and Metal (macOS). It handles all the low-level computation — you just point it at a model file and it runs.

Step 2: Start the server

Open the folder where you extracted llama.cpp. Click on the address bar at the top of the File Explorer window, type cmd, and press Enter. This opens a command prompt already pointed at that folder.

In the command prompt, type the following and press Enter:

  • llama-server.exe

You will see llama.cpp start up and host a local web UI — typically at http://127.0.0.1:8080/. Open that in your browser. The interface is there, but it will not respond to messages yet because no model is loaded. That is next.

Step 3: Download a model in GGUF format

llama.cpp requires models in GGUF format. This is a packaged, ready-to-run model file — similar to how a prebuilt binary saves you from compiling code yourself. The alternative format (SafeTensors) requires additional conversion steps, so stick with GGUF.

For a basic machine, small models are the right choice. Qwen3 0.6B is a solid starting point — capable enough to be useful, small enough to run comfortably on 8GB of RAM. Download the Qwen3 0.6B GGUF model from Hugging Face here. Look for the Q8 file for best quality, or a Q4_K_M file for a smaller, faster option.

The Q number refers to quantization — a compression technique that reduces model file size and memory usage at a small cost to quality. Q8 is high quality and larger. Q4 is more compressed and faster. For a 0.6B model, Q8 is fine on most basic machines.
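You can get a rough feel for what quantization saves with a back-of-the-envelope calculation: file size is approximately parameter count times bits per weight, divided by 8. The effective bit widths below are approximations for common GGUF quant types (real files also carry metadata and a few higher-precision tensors), so treat this as a sketch, not an exact figure:

```python
# Rough GGUF size estimate: parameters x bits-per-weight / 8.
# Effective bits per weight are approximate and vary by quant scheme.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.8, "F16": 16.0}

def estimated_size_gb(params_billions: float, quant: str) -> float:
    """Estimate model file size in decimal gigabytes."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

# A 0.6B model at Q8 comes out well under 1 GB, which is why it
# fits comfortably on an 8GB machine even with the OS running.
print(round(estimated_size_gb(0.6, "Q8_0"), 2))    # roughly 0.64
print(round(estimated_size_gb(0.6, "Q4_K_M"), 2))  # roughly 0.36
```

The same arithmetic explains why the 4B model in the diagram below needs a 16GB machine: even quantized, it is several gigabytes before you add the memory the context window consumes.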

[Diagram: model size options for a basic PC — Qwen3 0.6B (best for 8GB RAM, recommended for beginners, Q8), Qwen3 1.7B (needs 8GB+ RAM, better reasoning), Qwen3 4B (needs 16GB RAM, not recommended for basic machines).]

Step 4: Load the model and run

Stop the server if it is still running (Ctrl+C in the command prompt). Now start it again with the model path:

  • llama-server.exe -m C:\path\to\your-model.gguf

Replace the path with wherever you saved the downloaded GGUF file. The server will load the model — this takes a few seconds — and then the web UI at http://localhost:8080 will be fully functional. You can now chat with the model directly in your browser, completely offline.
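Beyond -m, llama-server accepts flags for tuning. A hypothetical invocation combining a few common ones is sketched below — the exact set varies by build, so run llama-server.exe --help to confirm what yours supports:

```shell
:: Sketch of a tuned launch (Windows cmd syntax); adjust the path for your machine.
:: -c sets the context window in tokens, -ngl sets how many layers to
:: offload to the GPU, and --port sets the port for the web UI and API.
llama-server.exe -m C:\path\to\your-model.gguf -c 4096 -ngl 99 --port 8080
```

On a machine with limited GPU memory, lowering -ngl keeps some layers on the CPU instead of overflowing the GPU.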

On the test machine (8GB RAM, basic GPU), Qwen3 0.6B runs at roughly 14–15 tokens per second. That is fast enough for comfortable back-and-forth conversation and light coding assistance.

[Diagram: setup pipeline — download llama.cpp Vulkan ZIP → extract and open CMD in the folder → download GGUF model from Hugging Face → run llama-server.exe with the model path → chat via browser at localhost:8080.]

What to try next

Once the model is running, the setup is yours to extend. You can swap in larger models as you get comfortable — Qwen3 1.7B is a meaningful step up if your machine handles it. You can also connect llama.cpp to other tools via its OpenAI-compatible API endpoint, which means any application that supports OpenAI's API can talk to your local model instead.
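As a minimal sketch of that API in action, the script below sends a chat request to the server from Python using only the standard library. It assumes the server is running on the default port; the build_chat_request helper and the "local" model name are just for illustration (llama-server answers with whatever GGUF it has loaded):

```python
import json
import urllib.request

# Chat endpoint exposed by llama-server's OpenAI-compatible API.
API_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_request(user_message: str) -> bytes:
    """Build the JSON body for a single-turn chat completion."""
    payload = {
        # The server uses the loaded GGUF regardless of this name;
        # the field just satisfies the OpenAI request schema.
        "model": "local",
        "messages": [{"role": "user", "content": user_message}],
    }
    return json.dumps(payload).encode("utf-8")

def chat(user_message: str) -> str:
    """Send one message and return the model's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=build_chat_request(user_message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires llama-server.exe to be running with a model loaded.
    print(chat("Say hello in five words."))
```

Because the request shape matches OpenAI's, any client library that lets you override the base URL can point at http://localhost:8080/v1 the same way.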

Local models are also the foundation for more advanced workflows. To understand how AI models can be extended to use tools and take real-world actions — which is where local inference becomes genuinely powerful — see our article on From LLM to AI Agent: The Magic of Tool Calling.