Picking up from Part 0, I’ll setup my Desktop to be able to host an inference engine and an LLM model. I’ll perform a sample query to the LLM after to verify.
I already have an existing OS on my Desktop so I’ll be dual-booting a fresh Debian (13.5 to be exact) to keep things separate. As for the inference engine, I’ll be using vLLM alongside my RTX 3070Ti GPU (not much VRAM I know, but we’ll make it work).
So, first things first, let’s install the dependencies.
Docker
I’ll be using Docker to run vLLM to simplify dependency management. As such, I’ve installed Docker from the apt repository as documented here. I’ve also gone through some of the post-install steps, mainly to manage Docker as a non-root user and configuring Docker to start on boot.
NVIDIA Container Toolkit Setup
Toolkit Installation
Next up is installing the NVIDIA Container Toolkit to support GPU access to the containerized vLLM. I followed the installation guide under the apt repository:
- Pre-requisites
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
ca-certificates \
curl \
gnupg2
- Configure repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
- Update package list
sudo apt-get update
- Install the toolkit
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.1-1
sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
After which I added the NVIDIA container runtime in Docker by running:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
NVIDIA Driver Installation
I also needed to install the NVIDIA Drivers for Debian. Otherwise, if I run the container toolkit verification immediately, I get the following error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to create the automatic CDI modifier: failed to generate CDI spec for mode "auto": failed to construct device spec generators: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND
To install the NVIDIA Drivers, I just followed the link above and ran the following:
- Pre-requisites
sudo apt install linux-headers-$(uname -r)
- Install cuda-keyring package and update package list
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
- Install Drivers (Compute-only / Headless)
sudo apt -V install nvidia-driver-cuda nvidia-kernel-dkms
- Reboot
sudo reboot
Once the system is back up, I verified the driver installation by running:
nvidia-smi
Which outputs:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02 KMD Version: 610.43.02 CUDA UMD Version: 13.3 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 Ti On | 00000000:01:00.0 On | N/A |
| 0% 33C P8 12W / 290W | 17MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Running the container toolkit verification:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Gives me the same output as nvidia-smi.
Running vLLM
Let’s now put everything together by running vLLM via Docker:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:v0.24.0 \
--model Qwen/Qwen3-0.6B \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--enforce-eager \
--default-chat-template-kwargs '{"enable_thinking": false}'
The combination of --max-model-len, --gpu-memory-utilization and --enforce-eager are required to be able to fit the model on the VRAM of my GPU. The --default-chat-template-kwargs just forces Qwen to disable “thinking” which is enabled by default.
Opening another shell, I can then verify vLLM by running:
curl http://localhost:8000/v1/models
which gives me:
{
"object": "list",
"data": [
{
"id": "Qwen/Qwen3-0.6B",
"object": "model",
"owned_by": "vllm",
"root": "Qwen/Qwen3-0.6B",
"parent": null,
"max_model_len": 4096,
...snip...
and what we came for:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What sound does a fox make?"}
]
}' | jq
which returns
{
"id": "chatcmpl-8b5da16aaf0ed845",
"object": "chat.completion",
"model": "Qwen/Qwen3-0.6B",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "A fox makes a sound like a \"mew\" when it's looking. It's a common and distinctive sound associated with foxes in many regions.",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"reasoning": null
},
...snip...
I’m calling that a success. We’ll tackle the Shutdown-On-Idle and Wake-On-Lan on the next one. See you then!