🚀 gputool User Guide

gputool is a lightweight, non-sudo utility CLI designed to manage services and networking in restricted or locked-down environments (such as university lab machines or shared server nodes) where you do not have root (sudo) privileges.

Unlike the main sjsujetsontool, which installs system-level packages and systemd services on NVIDIA Jetson dev kits, gputool runs completely inside the user space, saving all of its state, logs, and sockets locally under $HOME/.gputool.


🆚 sjsujetsontool vs gputool

| Feature | sjsujetsontool | gputool |
|---|---|---|
| Privileges Required | sudo (Root) | No Sudo (User space only) |
| Primary Platforms | NVIDIA Jetson | Generic Linux (Jetson, x86 Servers, etc.) |
| Tailscale Mode | System kernel interface (tailscale0 tun card) | User-space networking (Proxy mode) |
| Installation | System-wide binaries & systemd services | Static binaries in user home directory |
| Download Dependencies | Requires system curl / wget | Falls back to Python 3 if curl/wget are missing |

📥 Installation

Choose the command that fits your system's network tools:

Option A: Standard (curl / wget installed)

curl -fsSL https://raw.githubusercontent.com/lkk688/edgeAI/main/jetson/install_gputool.sh | bash

or

wget https://raw.githubusercontent.com/lkk688/edgeAI/main/jetson/install_gputool.sh
chmod +x install_gputool.sh
./install_gputool.sh
source ~/.bashrc
gputool version

Option B: Locked-down Lab Machines (No curl or wget)

If the computer has neither curl nor wget (or they are blocked), use Python 3 to stream the installer:

python3 -c "import urllib.request; print(urllib.request.urlopen('https://raw.githubusercontent.com/lkk688/edgeAI/main/jetson/install_gputool.sh').read().decode())" | bash


📋 Command Reference

Core Commands

  • Show Help Menu:
    gputool help
    
  • Print Version:
    gputool version
    
  • Install/Re-install local script:
    gputool install
    
  • Update script from GitHub:
    gputool update-script
    
  • Install Miniconda:
    gputool install-conda [install_path]
    
  • Run complete system check:
    gputool check [env_name]
    
  • Create customized Conda env (PyTorch + HF):
    gputool setup-env [env_name] [python_ver]
    
  • Create LeRobot / PyTorch / HF Conda env:
    gputool setup-lerobot [env_name]
    

🐍 AI & Machine Learning Setup

gputool provides helper utilities to set up localized machine learning virtual environments directly inside your user directory using Conda.

📦 Install Miniconda

On a brand new/empty machine (where Conda is not present), you can download and install Miniconda to user space:

gputool install-conda [install_path]
(If install_path is omitted, it defaults to $HOME/miniconda3).

What it does:

1. Downloads Installer: Downloads the latest Miniconda installer from Anaconda's repositories.
2. Silent Install: Performs a silent batch installation in user space (no root required).
3. ToS Auto-Acceptance: Automatically accepts the Anaconda Terms of Service to prevent CondaToSNonInteractiveError during package setups.
4. Shell configuration: Runs conda init to automatically configure your .bashrc and/or .zshrc.
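The download and silent-install steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not gputool's actual implementation: the helper names are hypothetical, and the sketch assumes the standard `Miniconda3-latest-Linux-<arch>.sh` installer names on repo.anaconda.com and the installer's `-b` (batch) and `-p` (prefix) options.

```python
import os
import platform

MINICONDA_BASE_URL = "https://repo.anaconda.com/miniconda"


def miniconda_installer_url(machine=None):
    """Return the Linux installer URL for a CPU type (hypothetical helper)."""
    machine = (machine or platform.machine()).lower()
    arch = {
        "x86_64": "x86_64",
        "amd64": "x86_64",
        "aarch64": "aarch64",
        "arm64": "aarch64",
    }.get(machine)
    if arch is None:
        raise ValueError(f"unsupported architecture: {machine}")
    return f"{MINICONDA_BASE_URL}/Miniconda3-latest-Linux-{arch}.sh"


def silent_install_command(installer_path, install_path=None):
    """Build the batch-mode install command; nothing is executed here."""
    if install_path is None:
        # Same default as the guide: $HOME/miniconda3
        install_path = os.path.join(os.path.expanduser("~"), "miniconda3")
    # -b: batch mode without prompts, -p: installation prefix
    return ["bash", installer_path, "-b", "-p", install_path]
```

The command list can be handed to `subprocess.run(..., check=True)` once the installer has been downloaded.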

[!TIP] If you run gputool setup-lerobot directly on a system where Conda is missing, gputool will automatically detect this and trigger the Miniconda auto-installation fallback for you!

🤖 Create Conda Env & Install LeRobot / PyTorch / HF

Creates a new Conda environment with Python 3.10 and installs a CUDA-enabled PyTorch build matched to the host GPU (CUDA 12.8 on Blackwell and other modern GPUs), LeRobot (with extra simulator dependencies), and the Hugging Face packages:

gputool setup-lerobot [env_name]
(If env_name is omitted, it defaults to lerobot).

What it does:

1. Locates Conda: Dynamically scans for and sources the active Conda initialization profile (e.g. miniconda3, anaconda3, or system paths).
2. Creates Environment: Initializes the target Conda virtual environment using Python 3.10.
3. Installs CMake < 4: Automatically configures the environment with cmake<4 from conda-forge. This is a critical fallback that prevents build-isolation compile failures ("Compatibility with CMake < 3.5 has been removed from CMake") when compiling older robotics simulation packages like egl_probe/hf-egl-probe under newer system environments.
4. Installs PyTorch (Auto-Detected CUDA Build): Probes the host with nvidia-smi and nvcc, then automatically selects the matching PyTorch wheel index:
   * Modern GPUs (compute capability ≥ 7.0, including Blackwell sm_120 RTX 5080) → CUDA 12.8 (cu128). This is required on Blackwell, as older wheels raise "no kernel image available" exceptions.
   * Older GPUs (compute capability < 7.0, e.g. Pascal/Maxwell) → CUDA 11.8 (cu118) for compatibility.
   * No GPU / no CUDA toolkit → defaults to CUDA 12.8 (cu128).
   * If the chosen wheel index fails, it falls back to the default PyPI torch build.

   The detected GPU name, compute capability, nvcc version, and selected build are printed before installation.
5. Installs LeRobot & HF: Performs a Pip installation of huggingface_hub and the complete LeRobot suite with simulation extras (lerobot[all]).
6. Runs Verification: Automatically verifies GPU detection, PyTorch CUDA initialization, and imports.
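The wheel-selection rule in step 4 can be sketched as a small decision function. This is a minimal illustration of the rule as described above, not gputool's own code: the function names are hypothetical, and the index URLs assume the usual `https://download.pytorch.org/whl/<build>` layout.

```python
PYTORCH_INDEX = "https://download.pytorch.org/whl"


def parse_compute_capability(text):
    """Turn a string such as '12.0' into a (major, minor) tuple of ints."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def select_cuda_build(compute_capability=None):
    """Pick the wheel tag; None means no GPU or CUDA toolkit was detected."""
    if compute_capability is None:
        return "cu128"  # documented default when nothing is detected
    # Tuples compare element by element, so (7, 0) and above is "modern".
    return "cu128" if tuple(compute_capability) >= (7, 0) else "cu118"


def pip_install_args(build):
    """Build the pip arguments for the selected wheel index (not executed)."""
    return ["pip", "install", "torch", "--index-url", f"{PYTORCH_INDEX}/{build}"]
```

For example, `select_cuda_build(parse_compute_capability("6.1"))` selects `cu118` for a Pascal card, while a Blackwell RTX 5080 at 12.0 selects `cu128`.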

To activate the environment after setup:

conda activate <env_name>

Example Verification Output

Upon successful installation, gputool outputs the active versions and hardware state:

==================================================
🧬 PyTorch Version    : 2.10.0+cu128
🟢 CUDA Available      : True
🖥️  GPU Device Name    : NVIDIA GeForce RTX 5080
⚙️  CUDA Device Arch   : ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120']
🤗 HF Hub Version     : 0.35.3
🤖 LeRobot Version    : 0.4.4
==================================================

🐍 Create Customized Conda Env & Install PyTorch / HF

Creates a new Conda environment with a Python version of your choice (default: name py312, Python 3.12) and installs a CUDA-enabled PyTorch build matched to the host GPU (CUDA 12.8 on Blackwell and other modern GPUs) plus Hugging Face:

gputool setup-env [env_name] [python_ver]
(If parameters are omitted, it defaults to environment name py312 and Python version 3.12).

What it does:

1. Locates Conda: Dynamically scans for and sources the active Conda initialization profile (e.g., miniconda3, anaconda3, or system paths).
2. Creates Environment: Initializes the target Conda virtual environment using the specified Python version (e.g., 3.12).
3. Installs PyTorch (Auto-Detected CUDA Build): Probes the host GPU and CUDA toolkit and selects the matching PyTorch wheel automatically: CUDA 12.8 (cu128) for modern GPUs like the Blackwell sm_120 RTX 5080, CUDA 11.8 (cu118) for older GPUs (compute capability < 7.0), and cu128 as the default when no GPU/CUDA is found. Falls back to the default PyPI torch if the selected index fails.
4. Installs Hugging Face: Performs a Pip installation of huggingface_hub.
5. Runs Verification: Automatically verifies GPU detection, PyTorch CUDA initialization, and imports.

To activate the environment after setup:

conda activate <env_name>

Example Verification Output

Upon successful installation, gputool outputs the active versions and hardware state:

==================================================
🧬 PyTorch Version    : 2.10.0+cu128
🟢 CUDA Available      : True
🖥️  GPU Device Name    : NVIDIA GeForce RTX 5080
⚙️  CUDA Device Arch   : ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120']
🤗 HF Hub Version     : 0.35.3
==================================================

🔍 Run Complete System Diagnostic Check

To verify that the system hardware, Conda environment, PyTorch with CUDA, Hugging Face Hub, LeRobot installation, and Userspace Tailscale VPN/Proxy are all correctly configured, run:

gputool check [env_name]
(If env_name is omitted, it defaults to lerobot).

What it verifies:

1. GPU & Driver: Checks if nvidia-smi is available and reports the active GPU name, NVIDIA driver version, and host system CUDA version.
2. Conda environment: Verifies if Conda is installed and detects if the target environment (e.g. lerobot) exists.
3. PyTorch & CUDA: Performs in-environment checks to verify PyTorch is installed, CUDA is active, and prints the compute capability of the GPU (e.g. Blackwell RTX 5080 capability details).
4. Hugging Face Hub: Checks Hugging Face Hub library setup, authentication status, and tests network connectivity to huggingface.co.
5. LeRobot: Confirms lerobot package version and checks key simulation backends (gymnasium, mujoco, h5py).
6. Tailscale & Proxy: Queries tailscaled daemon status, checks whether proxy port 1055 is active and listening, and displays assigned Tailscale IPs.

Example Output:

══════════════════════════════════════════════════
🖥️  System Hardware Check
══════════════════════════════════════════════════
[✅] NVIDIA Driver found via nvidia-smi.
   • GPU Name       : NVIDIA GeForce RTX 5080
   • Driver Version : 550.54.14
   • CUDA Version   : 12.8

══════════════════════════════════════════════════
🐍 Conda Environment Check
══════════════════════════════════════════════════
[✅] Conda is installed.
   • Conda Path    : /home/student/miniconda3/bin/conda
   • Conda Version : 24.1.2
[✅] Conda environment 'lerobot' exists.
[⚙️] Running Python diagnostic checks in Conda env 'lerobot'...

════ PyTorch & CUDA Diagnostic ════
   • PyTorch Installed      : 2.10.0+cu128                   ✅
   • CUDA Available         : True                           ✅
   • CUDA Backend Ver       : 12.8                           ✅
   • GPU Device Name        : NVIDIA GeForce RTX 5080        ✅
   • Compute Capability     : ('12', '0')                    ✅

════ Hugging Face Hub Diagnostic ════
   • HF Hub Installed       : 0.35.3                         ✅
   • HF Auth Status         : Logged In                      ✅
   • HF Hub Connectivity    : Connected                      ✅

════ LeRobot Diagnostic ════
   • LeRobot Installed      : 0.4.4                          ✅
   • Simulation Packages    : gymnasium(OK), mujoco(OK)...   ✅

══════════════════════════════════════════════════
🌐 Userspace Tailscale VPN & Proxy Check
══════════════════════════════════════════════════
[✅] tailscaled background daemon is running.
   • Daemon PID      : 29841
[✅] Proxy port 1055 is listening.
   • Tailscale IPs   : 100.64.0.15
══════════════════════════════════════════════════
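The "proxy port 1055 is listening" check can be approximated with a plain TCP connection attempt. The sketch below is one possible way to do it from Python and is not necessarily how gputool implements the check; `is_port_listening` is a hypothetical helper name.

```python
import socket


def is_port_listening(host="127.0.0.1", port=1055, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connection only shows that some process is bound to the port; it does not prove that the listener is tailscaled.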

🦙 Llama.cpp & Local LLM Serving

gputool provides fully integrated commands to compile llama.cpp with NVIDIA CUDA support, download optimized GGUF model quantizations from Hugging Face, and manage background llama-server instances for local API serving with maximum GPU offloading.

1. Compile llama.cpp with CUDA Support

This command clones the llama.cpp repository, locates the active CUDA compiler (nvcc), and builds the binaries (llama-cli and llama-server) with GPU acceleration enabled:

gputool setup-llamacpp [env_name]
(If env_name is omitted, it defaults to lerobot).

For example:

(py312) 010796032@coe-cmpe-288-05:~$ gputool setup-llamacpp py312

What it does (robust, self-healing build):

1. Locates nvcc: Searches PATH, then /usr/local/cuda/bin, then any /usr/local/cuda-*/bin, and reports the detected CUDA toolkit version, GPU name, and compute capability.
2. Auto-installs build tools: Detects whether cmake and ninja are present inside the conda env and installs any that are missing from conda-forge (with a pip fallback for cmake). This means envs created by gputool setup-env (which ship PyTorch + HF only) no longer fail with cmake: command not found.
3. Detects GPU architecture: Auto-configures nvcc for your GPU (e.g. Blackwell RTX 5080 → sm_120a). Uses the Ninja generator when available for a faster build, and automatically wipes a stale build directory if the generator changed.
4. Installs binaries + libraries: Copies llama-cli, llama-server, and all shared libraries (libllama, libggml-cuda, …) into ~/.gputool/bin/ so the binaries run standalone.
5. Verifies CUDA: Runs llama-cli --list-devices and confirms the GPU is visible to llama.cpp before finishing.
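The nvcc search order from step 1 (PATH, then /usr/local/cuda/bin, then any /usr/local/cuda-*/bin) can be sketched like this. The function name and its parameters are hypothetical, and the guide does not say which versioned directory wins when several exist, so this sketch simply takes them in sorted order.

```python
import glob
import os
import shutil


def find_nvcc(cuda_root="/usr/local", search_path=None):
    """Return the first nvcc found, or None.

    search_path overrides PATH (useful for testing); cuda_root is the
    directory that holds cuda/ and cuda-*/ toolkit installations.
    """
    on_path = shutil.which("nvcc", path=search_path)
    if on_path:
        return on_path
    candidates = [os.path.join(cuda_root, "cuda", "bin", "nvcc")]
    # Ordering among versioned toolkits is an assumption of this sketch.
    candidates += sorted(glob.glob(os.path.join(cuda_root, "cuda-*", "bin", "nvcc")))
    for candidate in candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```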

[!NOTE] On fully locked-down machines with no network during the build, llama.cpp may print a warning that it could not download the embedded web UI assets. This is harmless, and the server still builds and serves the API normally.

2. Download a GGUF Model

Uses the huggingface_hub Python downloader to fetch a GGUF model directly into ~/.gputool/models/. Downloading straight into that directory avoids symlink-related issues and handles large file streaming smoothly.

gputool download-model [repo_id] [filename] [env_name]
* Default Model: If arguments are omitted, it defaults to the unsloth Q6_K_XL quantization of Qwen3.5 9B:
  * Repo ID: unsloth/Qwen3.5-9B-GGUF
  * Filename: Qwen3.5-9B-UD-Q6_K_XL.gguf
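For reference, an equivalent download can be sketched directly against huggingface_hub. This is an illustration rather than gputool's code: `resolve_download` and `download_model` are hypothetical names, and the sketch assumes huggingface_hub is installed in the active environment and relies on `hf_hub_download` with its `local_dir` parameter.

```python
import os

DEFAULT_REPO_ID = "unsloth/Qwen3.5-9B-GGUF"
DEFAULT_FILENAME = "Qwen3.5-9B-UD-Q6_K_XL.gguf"


def resolve_download(repo_id=None, filename=None, models_dir=None):
    """Fill in the documented defaults for any omitted argument."""
    return {
        "repo_id": repo_id or DEFAULT_REPO_ID,
        "filename": filename or DEFAULT_FILENAME,
        "local_dir": models_dir
        or os.path.join(os.path.expanduser("~"), ".gputool", "models"),
    }


def download_model(repo_id=None, filename=None, models_dir=None):
    """Download one file and return its local path (needs network access)."""
    # Imported lazily so the defaults above work without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(**resolve_download(repo_id, filename, models_dir))
```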

3. Serve the LLM via Llama Server

Start, stop, or check the status of the local OpenAI-compatible background serving daemon:

  • Start the server: Offloads all model layers (-ngl 99) to the GPU.
    gputool serve-llamacpp start [model_filename_or_path] [port] [--background|--foreground]
    
    (Defaults: model = Qwen3.5-9B-UD-Q6_K_XL.gguf, port = 8080, run mode = background)

Performance defaults (tuned for RTX 5080 / 16 GB + Qwen3.5-9B): the server launches with -ngl 99 --ctx-size 32768 --batch-size 4096 --ubatch-size 2048 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0. Flash-Attention plus a q8_0 KV cache roughly halves cache memory, so a 32K context fits comfortably (~10 GB VRAM used, leaving headroom). The 32768 context matches Unsloth's recommended output length (the model supports up to 262K). Tune the context for other GPUs:

gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080 --ctx-size 8192   # smaller GPUs

If you hit instability with the quantized cache, Unsloth suggests --cache-type-k bf16 --cache-type-v bf16 (uses more VRAM).

By default the server binds to 0.0.0.0, so it is reachable from other machines on the same network (e.g. lab peers) at http://<server-ip>:<port>. To restrict it to the local machine only, pass --host 127.0.0.1.

gputool serve-llamacpp start                       # LAN-accessible on 0.0.0.0:8080 (default)
gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080 --host 127.0.0.1   # local-only

[!WARNING] Binding to 0.0.0.0 exposes the model API to every host that can route to this machine. On shared or untrusted networks, either protect it with --api-key (below), use --host 127.0.0.1, or reach it over an SSH tunnel / the Tailscale VPN.

Protect the API with a token (--api-key): Require every request to carry an Authorization: Bearer <token> header. Requests without the correct token are rejected with HTTP 401.

gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080 --api-key sjsugputool
You can also set the token via the GPUTOOL_LLAMA_API_KEY environment variable instead of passing it on the command line (which keeps it out of your shell history):
export GPUTOOL_LLAMA_API_KEY=sjsugputool
gputool serve-llamacpp start
Clients then include the bearer token. With curl:
curl http://10.31.96.155:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sjsugputool" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
Or, on machines without curl, with Python:
import json, urllib.request
req = urllib.request.Request(
    "http://<server-ip>:8080/v1/chat/completions",
    data=json.dumps({"messages": [{"role": "user", "content": "Hello!"}]}).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer sjsugputool"})
print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])

Enable vision / image input (--mmproj): Qwen3.5 is multimodal. To accept images, llama-server needs the model's multimodal projector (mmproj) file in addition to the main GGUF. Download it once, and serve-llamacpp auto-detects any mmproj*.gguf sitting next to the model:

# one-time: fetch the projector into ~/.gputool/models/
gputool download-model unsloth/Qwen3.5-9B-GGUF mmproj-F16.gguf py312

# start: vision is enabled automatically when an mmproj is present
gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080 --api-key sjsugputool
#  [⚙️] Vision (mmproj): enabled - mmproj-F16.gguf
Use --mmproj <file> to point at a specific projector, or --no-mmproj to force text-only. Then send an image using the OpenAI vision format (image_url with a base64 data URI; the build has LLAMA_CURL=OFF, so remote image URLs are not fetched):
python3 jetson/jetson-llm/vision_test.py \
  --url http://localhost:8080/v1 --api-key sjsugputool --image photo.jpg \
  -p "What is in this image?"
# → "A yellow circle on a blue square background."  (for the built-in test image)
See jetson/jetson-llm/vision_test.py for the raw request shape.
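As a rough guide to that request shape, the sketch below builds an OpenAI-style vision message with the image embedded as a base64 data URI. It is a generic illustration of the format and does not reproduce vision_test.py; `build_vision_request` is a hypothetical helper name.

```python
import base64


def build_vision_request(image_bytes, prompt, mime_type="image/jpeg"):
    """Return a chat-completions body carrying one text part and one image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime_type};base64,{encoded}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # Embedded data URI: the server does not fetch remote URLs.
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }
        ]
    }
```

The resulting dictionary can be serialized with `json.dumps` and posted to `/v1/chat/completions` in the same way as the text-only urllib example, including the Authorization header when an API key is set.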

The run mode controls whether the server detaches from your terminal:

* --background / -d (default): Runs as a detached nohup daemon, writes its PID to ~/.gputool/llama-server.pid, and logs to ~/.gputool/llama-server.log. The command returns immediately so you can keep using the shell (or close the SSH session) while the server keeps running.

gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080            # background (default)
gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080 -d         # explicit background

* --foreground / -f: Runs attached to your terminal and streams server logs live. Useful for debugging. Press Ctrl+C to stop it.

gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080 --foreground

  • Check server status:

    gputool serve-llamacpp status
    
    (Prints process PID details and runs a connection health check to /health. If no tracked PID file exists, it still detects stray/foreground llama-server instances via pgrep.)

  • Stop the server:

    gputool serve-llamacpp stop
    
    This stops the tracked daemon and any stray/orphaned llama-server processes: it collects PIDs from the PID file plus a pgrep match on the installed binary, terminates each gracefully (SIGTERM, then SIGKILL after a 10s grace period), and clears the PID file. Running it when nothing is active simply reports that no servers were found.
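The "SIGTERM, then SIGKILL after a grace period" pattern can be sketched for a single PID as follows. This is a generic POSIX illustration under stated assumptions, not gputool's implementation; `stop_process` and its return values are hypothetical.

```python
import os
import signal
import time


def _is_running(pid):
    """Probe a PID with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_process(pid, grace_seconds=10.0, poll_interval=0.2):
    """Return 'not-running', 'terminated' (exited on SIGTERM) or 'killed'."""
    if not _is_running(pid):
        return "not-running"
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not _is_running(pid):
            return "terminated"
        time.sleep(poll_interval)
    try:
        os.kill(pid, signal.SIGKILL)  # grace period expired
    except ProcessLookupError:
        return "terminated"  # it exited just before the kill
    return "killed"
```

Signal 0 also succeeds for a child that has exited but has not yet been reaped, so this probe suits detached daemons better than direct children of the calling process.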

Example Usage & Inference Query:

# 1. Start the server (offloads all layers to the GPU with -ngl 99)
gputool serve-llamacpp start Qwen3.5-9B-UD-Q6_K_XL.gguf 8080

# 2. Query completions via the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What GPU architecture is NVIDIA Blackwell?"}
    ]
  }'

[!TIP] No curl on locked-down machines? Many lab nodes ship without curl/wget. Query the server with Python's built-in urllib instead:

import json, urllib.request
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({"messages": [{"role": "user", "content": "What GPU architecture is NVIDIA Blackwell?"}]}).encode(),
    headers={"Content-Type": "application/json"})
print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])

[!NOTE] Reasoning models (Qwen3.5): Qwen3.5 is a "thinking" model. By default it spends tokens on internal reasoning (returned in the reasoning_content field) before producing the final content, so a small max_tokens may return empty content with finish_reason: "length". Either raise max_tokens, or disable thinking for a direct answer by adding "chat_template_kwargs": {"enable_thinking": false} to the request body.
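A small sketch of how a client might apply this note: it builds a request body with thinking switched off and then separates the answer from the reasoning in the response. The helper names are hypothetical; the field names (chat_template_kwargs, reasoning_content, finish_reason) are the ones mentioned in the note above.

```python
def build_chat_body(user_message, enable_thinking=False, max_tokens=None):
    """Build a chat-completions body with the thinking toggle set explicitly."""
    body = {
        "messages": [{"role": "user", "content": user_message}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    return body


def extract_answer(response):
    """Split a response into final content, reasoning, and a truncation flag."""
    choice = response["choices"][0]
    message = choice["message"]
    return {
        "content": message.get("content") or "",
        "reasoning": message.get("reasoning_content") or "",
        # finish_reason == "length" means max_tokens ran out first
        "truncated": choice.get("finish_reason") == "length",
    }
```

If `truncated` is true and `content` is empty, the token budget was most likely spent on reasoning, which is the case the note describes.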

4. Chat from the Terminal (gputool chat)

gputool ships a built-in terminal chat client that talks to any OpenAI-compatible endpoint (your local llama-server by default) and streams the response token-by-token with per-turn token/throughput stats.

  • Rich UI (Claude-Code style): if the rich library is installed, tokens stream smoothly into a panel (refreshes are throttled and long replies stay scrollable, so there is no flicker or freeze at high token rates), and the completed reply is rendered as Markdown: headings, bold, lists, and syntax-highlighted code blocks.
  • Stdlib fallback: if rich is not available, it automatically falls back to a pure-stdlib ANSI renderer, so it still works on locked-down machines without extra pip packages or curl.

To get the Rich UI, install it once into your environment:

pip install rich      # optional: enables Markdown/syntax-highlighted rendering

gputool chat [message] [--host <ip>] [--port <port>] [--api-key <token>] [--system <text>] [--think] [--no-stream] [--no-color] [--plain]

Just type gputool chat

With no arguments, the client walks you through a short setup and then drops you into an interactive session. Your answers (server and API key) are saved to ~/.gputool/chat_config.json (with 0600 permissions), so on the next run you can just press Enter to reuse them; you only enter the key once.

$ gputool chat
Server IP or URL [127.0.0.1:8080]: 10.31.96.155
API key [blank if none]: ********
╭───────────────── 🦙 gputool chat - local LLM ──────────────────╮
│                                                                │
│  Endpoint  http://10.31.96.155:8080/v1/chat/completions        │
│  Model     Qwen3.5-9B-UD-Q6_K_XL.gguf                          │
│  Auth on   Streaming on   Thinking off                         │
│                                                                │
│  Type a message and press Enter.  /help for commands · /exit … │
╰────────────────────────────────────────────────────────────────╯
/help · /exit  You ▸ What GPU architecture is NVIDIA Blackwell?
╭─ Assistant ▸ ───────────────────────────────────────────────────╮
│ NVIDIA Blackwell is the company's latest GPU architecture,      │
│ succeeding Hopper - it powers the RTX 50-series and GB200.      │
╰─────────────────────────────────────────────────────────────────╯
(prefill 24 tok @ 765 tok/s · gen 38 tok @ 99.2 tok/s · 0.5s)
/help · /exit  You ▸ /save notes.md
💾 Saved conversation to /home/you/notes.md
/help · /exit  You ▸ /exit
Bye!

The stats line after each reply reports prefill (prompt-processing) and generation speeds separately, e.g. prefill 24 tok @ 765 tok/s · gen 38 tok @ 99.2 tok/s, taken from llama.cpp's timings (it falls back to a combined tok/s if the server doesn't report them). A dim /help · /exit hint sits next to the You ▸ input bar.
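A stats line of this shape could be produced roughly as follows. This is a sketch under assumptions: the keys prompt_n, prompt_per_second, predicted_n and predicted_per_second are assumed to be the fields of the server's timings object, and `format_stats`, its rounding, and its fallback format are hypothetical rather than taken from gputool.

```python
def format_stats(timings, elapsed_s, completion_tokens=0):
    """Format prefill/generation speeds, or a combined rate as a fallback."""
    if timings and "prompt_per_second" in timings and "predicted_per_second" in timings:
        return (
            f"(prefill {timings['prompt_n']} tok @ {timings['prompt_per_second']:.0f} tok/s"
            f" · gen {timings['predicted_n']} tok @ {timings['predicted_per_second']:.1f} tok/s"
            f" · {elapsed_s:.1f}s)"
        )
    # Fallback: one combined rate from the wall-clock time of the whole turn.
    rate = completion_tokens / elapsed_s if elapsed_s > 0 else 0.0
    return f"({completion_tokens} tok @ {rate:.1f} tok/s · {elapsed_s:.1f}s)"
```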

Other ways to launch

  • One-shot question (prints the streamed answer and exits, with no prompts):
    gputool chat "What GPU architecture is NVIDIA Blackwell?"
    
  • Skip the prompts with flags (also saved for next time):
    gputool chat --host 10.31.96.155 --port 8080 --api-key sjsugputool
    # or set it once via env: export GPUTOOL_LLAMA_API_KEY=sjsugputool
    

Defaults & options:

* The saved config provides defaults; first-time defaults are 127.0.0.1:8080. Use --url http://10.31.96.155:8080/v1 for a full base URL, or --reset-config to ignore/overwrite saved values.
* --api-key falls back to the GPUTOOL_LLAMA_API_KEY environment variable, then the saved config.
* --model is auto-detected from the server's /v1/models endpoint if omitted.
* Thinking/reasoning output is off by default for snappy answers; enable it with --think (or /think on during a session).
* --plain forces the stdlib renderer even when rich is installed; --no-color disables colors.
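The API-key lookup order (flag, then environment variable, then saved config) and the 0600 config file can be sketched as below. The function names and the "api_key" key inside the JSON file are assumptions made for illustration; only the environment variable name and the file permissions come from this guide.

```python
import json
import os


def load_saved_config(path):
    """Read the saved chat config, or return {} if it is missing or invalid."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_config(path, config):
    """Write the config so that only the owner can read it (mode 0600)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(config, fh)
    os.chmod(path, 0o600)  # also tightens a file that already existed


def resolve_api_key(cli_value, environ, saved_config):
    """Precedence: --api-key flag, then environment variable, then saved config."""
    return (
        cli_value
        or environ.get("GPUTOOL_LLAMA_API_KEY")
        or saved_config.get("api_key")  # key name is an assumption
        or None
    )
```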

Interactive slash commands:

| Command | Action |
|---|---|
| /exit, /quit, /q | Leave the chat |
| /server | Connect to a different server IP / API key (re-runs setup and saves it) |
| /save [file] | Save the conversation to a file: Markdown by default, or JSON if the name ends in .json (defaults to gputool_chat_<timestamp>.md in the current directory) |
| /reset, /clear | Clear the conversation history (keeps the system prompt) |
| /system <text> | Set (or clear, if empty) the system prompt |
| /think on\|off | Toggle the model's reasoning output |
| /temp <v> | Set sampling temperature (e.g. /temp 0.7) |
| /set <k> <v> | Set top_p / top_k / min_p / presence / max_tokens |
| /preset <name> | Apply Qwen3.5 sampling presets: thinking · coding · instruct |
| /config | Show the current sampling settings |
| /help, /? | Show the command help |

Tuning generation (Unsloth's recommended Qwen3.5 settings): the /preset command applies these in one step, or set them individually with /temp and /set:

| Preset | temperature | top_p | top_k | min_p | presence | thinking |
|---|---|---|---|---|---|---|
| thinking (general) | 1.0 | 0.95 | 20 | 0.0 | 1.5 | on |
| coding (precise) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | on |
| instruct (non-thinking) | 0.7 | 0.8 | 20 | 0.0 | 1.5 | off |
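If you call the API directly rather than through the chat client, the same presets can be applied to a request body. The sketch below is illustrative: `PRESETS` and `apply_preset` are hypothetical names, and mapping the "presence" column to the `presence_penalty` request field is an assumption.

```python
PRESETS = {
    "thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                 "min_p": 0.0, "presence_penalty": 1.5, "enable_thinking": True},
    "coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
               "min_p": 0.0, "presence_penalty": 0.0, "enable_thinking": True},
    "instruct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                 "min_p": 0.0, "presence_penalty": 1.5, "enable_thinking": False},
}


def apply_preset(body, name):
    """Return a copy of a chat-completions body with a preset applied."""
    preset = dict(PRESETS[name])
    thinking = preset.pop("enable_thinking")
    updated = dict(body)  # leave the caller's dictionary untouched
    updated.update(preset)
    updated["chat_template_kwargs"] = {"enable_thinking": thinking}
    return updated
```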

You can also start the client with these flags: gputool chat --temperature 0.6 --top-p 0.95 --top-k 20 --think.

5. Share the server by name (Headscale + llm.forgengi.org)

Once a node serves a model and joins the Headscale network (gputool tailscale up), you can make it reachable from any other node by a friendly HTTPS name instead of an IP. The Headscale host (headscale.forgengi.org, an edge gateway) runs nginx and reverse-proxies each LLM node under a path:

https://llm.forgengi.org/<name>/v1   ->   http://<node-tailscale-ip>:8080/v1

For example, the RTX 5080 board coe-cmpe-288-05 is published as node05:

curl https://llm.forgengi.org/node05/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sjsugputool" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'

How the gateway is built (on the Headscale host):

1. DNS: a single record llm.forgengi.org → the gateway's IP (Cloudflare DNS-only; no per-node records needed).
2. TLS: one Let's Encrypt cert (certbot --nginx -d llm.forgengi.org).
3. Routing: a generator script, generate_llm_nginx.py, scans Headscale nodes, auto-detects which ones serve an LLM on :8080, maps them to friendly names (coe-cmpe-288-05 → node05, sjsujetson-01 → jetson01), and writes the nginx location blocks. Adding a node is just:

sudo python3 generate_llm_nginx.py   # on the Headscale host; reloads nginx

This is the endpoint that sjsujetsontool chat → "Our LLM server" connects to, so every Jetson can share one big GPU model over TLS without juggling IPs. Transport is encrypted on both hops (client→gateway over HTTPS, gateway→node over the Headscale WireGuard tunnel), and the node's --api-key still gates completions.


🌐 Userspace Tailscale VPN

Since gputool runs without root privileges, it cannot create a virtual network interface (tailscale0). Instead, it uses user-space networking (SOCKS5/HTTP proxy) to route network traffic through the Headscale VPN.

1. Download Static Binaries

Downloads the pre-compiled, standalone Tailscale package for your CPU architecture (AMD64, ARM64, or ARM) and configures the binaries in ~/.gputool/tailscale:

gputool tailscale setup
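Choosing the right static package comes down to mapping the CPU type to one of the three architectures listed above. The sketch below illustrates that mapping; the helper names are hypothetical, and the URL assumes the usual `pkgs.tailscale.com/stable` naming for static tarballs, which may differ from the source gputool actually downloads from.

```python
import platform


def tailscale_arch(machine=None):
    """Map a platform.machine() value to amd64, arm64, or arm."""
    machine = (machine or platform.machine()).lower()
    if machine in ("x86_64", "amd64"):
        return "amd64"
    if machine in ("aarch64", "arm64"):
        return "arm64"
    if machine.startswith("arm"):  # e.g. armv7l on 32-bit boards
        return "arm"
    raise ValueError(f"unsupported architecture: {machine}")


def tailscale_tarball_url(version, machine=None):
    """Build the static tarball URL for a given version (assumed layout)."""
    return f"https://pkgs.tailscale.com/stable/tailscale_{version}_{tailscale_arch(machine)}.tgz"
```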

2. Connect to the VPN

Starts the background tailscaled daemon in userspace proxy mode and registers this device on the SJSU Headscale network:

gputool tailscale up
Note: If another device is already registered under the same hostname, use gputool tailscale up --force to override.
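For orientation, a userspace-mode start roughly corresponds to the two command lines built below. This is a sketch, not gputool's actual invocation: the function names, the state-file name, and the login-server URL are assumptions, while the flags themselves (--tun=userspace-networking, --socks5-server, --outbound-http-proxy-listen, --socket, --state, --login-server) are standard Tailscale options.

```python
import os


def tailscaled_command(gputool_dir, proxy_addr="localhost:1055"):
    """Command line for a rootless tailscaled with local SOCKS5/HTTP proxies."""
    return [
        os.path.join(gputool_dir, "tailscale", "tailscaled"),
        "--tun=userspace-networking",  # no kernel TUN device needed
        f"--socks5-server={proxy_addr}",
        f"--outbound-http-proxy-listen={proxy_addr}",
        f"--socket={os.path.join(gputool_dir, 'tailscaled.sock')}",
        # State-file name is an assumption of this sketch.
        f"--state={os.path.join(gputool_dir, 'tailscaled.state')}",
    ]


def tailscale_up_command(gputool_dir, login_server):
    """Command line that registers the node with a Headscale control server."""
    return [
        os.path.join(gputool_dir, "tailscale", "tailscale"),
        f"--socket={os.path.join(gputool_dir, 'tailscaled.sock')}",
        "up",
        f"--login-server={login_server}",
    ]
```

Both helpers only build argument lists; nothing is started until the lists are passed to something like `subprocess.Popen`.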

3. Check Connection & Proxy Details

Displays the current connection state, your assigned Tailscale IP address, and proxy variables:

gputool tailscale status
Example Output:
══════════════════════════════════════════════════
🌐 Userspace Tailscale VPN Status
══════════════════════════════════════════════════
📦 Tailscale Version : 1.68.1
🔧 Daemon PID        : 12845
🔌 Socket Path       : /home/student/.gputool/tailscaled.sock

🟢 Connection State  : Running
   Device Hostname   : lab-machine-01
   Tailscale IPs     : 100.64.0.15
   Connected Peers   : 8

🛡️  User-Space Proxy Configuration:
   • SOCKS5 Proxy    : localhost:1055
   • HTTP Proxy      : localhost:1055
══════════════════════════════════════════════════

4. Disconnect and Stop Daemon

Logs out of the Headscale server and stops the background tailscaled process:

gputool tailscale down


🔌 How to Access Peers (Proxy Usage)

Because the userspace client does not have a virtual TUN adapter, outgoing traffic is routed through local proxies running on port 1055.

Web Requests & APIs (curl, python, wget)

To access web interfaces or APIs on other Headscale nodes, prefix or export the HTTP/HTTPS proxy:

  • Single command:
    curl -x http://localhost:1055 http://<peer-tailscale-ip>:<port>
    
  • Export for session:
    export http_proxy=http://localhost:1055
    export https_proxy=http://localhost:1055
    # Now commands run transparently through the proxy
    curl http://<peer-tailscale-ip>:<port>
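On machines without curl, the same proxy can be used from Python's standard library. The sketch below shows one way to do it with urllib; `build_proxy_opener` and `fetch_via_proxy` are hypothetical helper names, and the proxy address is the localhost:1055 default described above.

```python
import urllib.request

PROXY_URL = "http://localhost:1055"


def build_proxy_opener(proxy_url=PROXY_URL):
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, timeout=10.0, proxy_url=PROXY_URL):
    """GET a URL on a Headscale peer through the local userspace proxy."""
    with build_proxy_opener(proxy_url).open(url, timeout=timeout) as resp:
        return resp.read()
```

Because the opener is explicit, this does not depend on the http_proxy / https_proxy environment variables being exported in the session.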
    

Secure Shell (SSH)

To SSH into another machine on the Headscale network, route the connection via the SOCKS5 proxy command:

ssh -o ProxyCommand="nc -X 5 -x localhost:1055 %h %p" username@<peer-tailscale-ip>


⚠️ Key Limitations of Non-Sudo Mode

  • Inbound Access is Transparent: Other machines on the Headscale network can ping or access ports on this computer directly via its Tailscale IP without any proxy setup.
  • Outbound Access Requires Proxies: This machine cannot ping or access other nodes directly unless configured to use the proxy on localhost:1055 (as shown in the examples above).