
Running Local AI on a GMKtec K12 Mini PC


I wanted a dedicated box for running local language models — something that could handle 7B–13B parameter models without hogging resources on my main server. The GMKtec K12 caught my eye because of the AMD Ryzen 7 7840HS with its integrated Radeon 780M GPU. Here’s how it went.

Hardware Specs

Component   Detail
Device      GMKtec K12 Mini PC
CPU         AMD Ryzen 7 7840HS (8C/16T)
iGPU        AMD Radeon 780M (12 CUs, RDNA 3)
RAM         32 GB DDR5-5600
Storage     1 TB NVMe SSD
OS          Ubuntu 24.04 LTS
Cost        ~$450

The 780M’s ROCm support was the selling point. Not every integrated GPU plays nice with AI inference frameworks, but AMD has been steadily improving their Linux compute stack.

The Setup

Base System

Fresh Ubuntu 24.04 install, then the basics:

sudo apt update && sudo apt upgrade -y
sudo apt install -y curl git htop

Ollama

Ollama makes local model serving dead simple. One-line install:

curl -fsSL https://ollama.com/install.sh | sh
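
On Linux the script also registers Ollama as a systemd service, so a quick sanity check before pulling anything (version output will vary):

ollama --version
systemctl status ollama --no-pager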

Pull a model to test with:

ollama pull llama3.1:8b
ollama run llama3.1:8b "Summarize the PMBOK in three sentences."

First response came back in about 8 seconds. Not blazing, but workable for a $450 box.
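
For scripting against the box instead of using the CLI, Ollama also exposes an HTTP API on port 11434. A minimal sketch, assuming the default port and the same model tag (the prompt is just a placeholder):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the PMBOK in three sentences.",
  "stream": false
}'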

Open WebUI

For a ChatGPT-like interface on the local network, Open WebUI runs as a Docker container:

sudo apt install -y docker.io docker-compose
sudo usermod -aG docker $USER
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Navigate to http://<k12-ip>:3000, create an admin account, and you’re in. Thanks to the host.docker.internal mapping above, it auto-detects Ollama running on the host at port 11434.
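
Since docker-compose is installed anyway, the same container can be described declaratively. A sketch of an equivalent docker-compose.yml (my own layout, not from the run command verbatim):

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui:/app/backend/data
    restart: always

volumes:
  open-webui:

Then docker compose up -d (or docker-compose up -d, depending on which variant apt installed) brings it up with the same settings.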

ROCm for GPU Acceleration

This is where it got interesting. The 780M supports ROCm, but you need to set it up explicitly:

sudo apt install -y rocm-libs rocm-dev
sudo systemctl restart ollama  # restart the service so it picks up the GPU

Verify GPU detection:

ollama ps
# with a model loaded, the PROCESSOR column should report GPU rather than CPU
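
If Ollama still falls back to CPU, a commonly cited workaround for the 780M (it reports as gfx1103, which isn't on ROCm's officially supported GPU list) is to override the GFX version for the Ollama service. A sketch, assuming the systemd unit the install script creates:

sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="HSA_OVERRIDE_GFX_VERSION=11.0.0"\n' \
  | sudo tee /etc/systemd/system/ollama.service.d/rocm.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama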

What Worked

  • Ollama + Open WebUI is a remarkably smooth stack. Install to working chat interface in under 20 minutes.
  • Llama 3.1 8B runs comfortably with ~6 GB RAM allocated to the model. Leaves plenty of headroom for the OS and other services.
  • The 780M iGPU does accelerate inference noticeably — about 40% faster token generation compared to CPU-only on this chip.
  • Power draw stays under 35W at idle and peaks around 65W under full inference load. Quiet too — the fan barely spins up for small models.

What Didn’t

  • 13B models are tight. They load, but generation speed drops significantly. The 32 GB RAM ceiling means you’re swapping if you try anything larger.
  • ROCm setup required a few tries. The Ubuntu 24.04 packages weren’t perfectly aligned at launch — I had to pin a specific ROCm version (6.0.2) to get stable inference.
  • Mixtral 8x7B won’t fit. The MoE architecture needs more memory than this box can offer. Stick to dense models at 8B or below for a usable experience.

Performance Notes

Model            Tokens/sec (CPU)   Tokens/sec (GPU)   RAM Used
Llama 3.1 8B     ~12 t/s            ~18 t/s            5.8 GB
Mistral 7B       ~14 t/s            ~20 t/s            5.2 GB
Phi-3 Mini       ~22 t/s            ~30 t/s            3.1 GB
Llama 3.1 13B    ~5 t/s             ~8 t/s             9.4 GB
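
To get comparable figures on other hardware, one easy option is Ollama's --verbose flag, which prints timing stats after each response, including an eval rate in tokens/s:

ollama run llama3.1:8b --verbose "Summarize the PMBOK in three sentences."
# the stats block printed after the reply includes an "eval rate" line in tokens/s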

Final Thoughts

For under $500, the GMKtec K12 is a solid entry point for local AI inference. It won’t replace a dedicated GPU server, but for private conversations with an LLM, local RAG experiments, or just keeping your data off third-party APIs — it does the job. The form factor is small enough to tuck behind a monitor, and the power draw is low enough to leave running 24/7.

Next step: setting up a reverse proxy with Caddy so I can access Open WebUI from anywhere on my Tailscale network.
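
For reference, the Caddy side of that is close to a one-liner. A sketch, with the hostname as a placeholder for whatever the box ends up being called on the tailnet:

# /etc/caddy/Caddyfile
k12.example.ts.net {
    reverse_proxy localhost:3000
}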