Run Your Own LLM on a Laptop: The Complete Guide

Run the model on your hardware, keep your data on your disk, and control every parameter of inference.

May 31, 2026

Your data leaves your machine every time you call a cloud LLM API. Every prompt, every document, every private thought flows through someone else’s infrastructure. You pay for the privilege with money and privacy. Bastion believes you should own the entire stack. Run the model on your hardware, keep your data on your disk, and control every parameter of inference. This guide shows you exactly how to do it, from scanning your laptop’s capabilities to serving a local model at production quality.

The Idea (60 Seconds)

Local LLM inference runs large language models on your own hardware instead of renting compute from OpenAI, Anthropic, or Google. Three engines dominate the landscape:

llama.cpp: The workhorse. Runs on CPU, GPU, or both. Optimized for constrained hardware. Use this on laptops with limited VRAM.
vLLM: The throughput king. Built for NVIDIA GPUs with at least 16GB VRAM. Serves multiple concurrent requests with paged attention.
Ollama: The simplicity layer. Wraps llama.cpp with a friendly CLI. Best for developers who want a working model in under five minutes.

The core decision matrix maps your VRAM to the right model and quantization:

VRAM Model Size Quantization Example 6 GB 7B params Q4_K_M Llama-3.1-8B Q4_K_M 12 GB 13B params Q5_K_M Mistral-Nemo Q5_K_M 24 GB 34B params Q8_0 Command-R Q8_0 24 GB 70B params Q4_K_M Llama-3.1-70B Q4_K_M

CPU-only users can still run 7B models at Q4_K_M with 16GB system RAM, expect 3 to 8 tokens per second depending on your processor. The experience remains usable for interactive chat and code assistance.

Why This Matters

Privacy is the obvious reason, and it matters more than most people admit. Every prompt you send to a cloud API is logged, stored, and potentially used for training. Your proprietary code, your client data, your personal reflections all become someone else’s data point the moment they leave your network.

Cost compounds over time. A heavy API user spending $50 monthly on tokens saves the full amount after the first month of local inference. The hardware you already own pays for itself.

Latency disappears. A local model on your laptop responds instantly. Zero network round-trips, zero queuing behind other users, zero rate limits. Your iteration loop tightens from seconds to milliseconds.

Control matters for serious practitioners. You choose the model, the quantization, the context window, the sampling parameters. You decide when to upgrade, which version to pin, and how to batch requests. Cloud APIs make all those decisions for you.

Resilience rounds out the argument. Local inference works during internet outages, in air-gapped environments, and in regions with unreliable connectivity. Your capability stays online even when the cloud goes dark.

Walkthrough

Step 1: Scan Your Hardware

Before downloading any model, understand what your machine can handle. The localmllm scan command detects your available RAM, GPU, VRAM, and CPU cores automatically:

python localmllm.py scan

Output looks like this:

Hardware Scan Results
  System RAM : 32 GB
  GPU        : NVIDIA RTX 3080 Laptop
  VRAM       : 8 GB
  CPU Cores  : 8
  CPU Freq   : 3.2 GHz

Write down the VRAM number. It determines everything that follows.

Step 2: Choose Your Engine

Pick your inference engine based on the scan results:

8 GB VRAM or less: llama.cpp with GPU offloading. Offload as many layers as VRAM allows, run the rest on CPU.
16 GB VRAM or more: vLLM for maximum throughput, or Ollama for simplicity.
CPU only: llama.cpp with all layers on CPU. Acceptable for 7B models at Q4_K_M.

Ollama is the fastest path to a working model. Install it, pull a model, and start chatting:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b

For more control, build llama.cpp from source:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j$(nproc)
./llama-server -m models/llama-3.1-8b-q4_k_m.gguf -ngl 99 --port 8080

The -ngl 99 flag offloads all layers to GPU. Reduce this number if you run out of VRAM.

Step 3: Select Your Model and Quantization

Apply the decision matrix from the Idea section. Higher quantization means better quality and larger files. Lower quantization means smaller files and faster inference. Q4_K_M hits the sweet spot for most use cases: minimal quality loss, significant size reduction.

Download GGUF files from Hugging Face. TheBloke andbartowski namespaces host well-organized quantized models. Verify the file size matches your available disk space before downloading.

Step 4: Benchmark Before Committing

Always benchmark before you commit disk space and trust a model for real work. The localmllm benchmark command runs a standardized inference test:

python localmllm.py benchmark --model models/llama-3.1-8b-q4_k_m.gguf

You get three key metrics:

Tokens per second: Sustained generation speed. Above 10 t/s feels interactive. Below 5 t/s feels sluggish.
Time to first token: Prefill latency. Under 2 seconds keeps conversations feeling responsive.
Peak memory usage: Confirms the model fits your hardware with room for the operating system.

If the benchmark reveals memory pressure, drop to a lower quantization or reduce the context window size. Both adjustments free RAM and VRAM.

Step 5: Serve Your Model

Launch a local inference server and point your applications at it. The localmllm serve command generates the optimal configuration and starts the server:

python localmllm.py serve --model models/llama-3.1-8b-q4_k_m.gguf

Your model now exposes an OpenAI-compatible API at http://localhost:8080/v1. Any tool that supports the OpenAI API format connects directly. Point your code editor, your chat client, or your automation scripts at localhost and keep every request on your machine.

The Prompt Toolkit

1. Local LLM Architect Prompt

Copy this XML prompt into your favorite LLM (even a cloud one, since this is a one-time planning call) and get a tailored local inference plan:

<task>
You are a local LLM infrastructure architect. Given the user's hardware specifications and use case, produce a complete deployment plan.
</task>

<input_format>
<hardware>
  <ram>Amount in GB</ram>
  <gpu>Name of GPU or "none"</gpu>
  <vram>Amount in GB or "0"</vram>
  <cpu_cores>Number of cores</cpu_cores>
</hardware>
<use_case>Describe what you will use the model for: coding, writing, chat, research, analysis</use_case>
</input_format>

<output_format>
<deployment_plan>
  <recommended_model>Name and size</recommended_model>
  <quantization>Specific quant level (e.g. Q4_K_M)</quantization>
  <inference_engine>llama.cpp | vLLM | Ollama</inference_engine>
  <engine_config>
    Key configuration parameters: GPU layers, context window, thread count, batch size
  </engine_config>
  <context_window>Recommended size in tokens</context_window>
  <performance_estimates>
    <tokens_per_second>Estimated range</tokens_per_second>
    <time_to_first_token>Estimated range</time_to_first_token>
    <memory_footprint>Expected RAM + VRAM usage</memory_footprint>
  </performance_estimates>
  <disk_space>Model file size estimate</disk_space>
</deployment_plan>
</output_format>

<reasoning_rules>
1. Always fit the model within available VRAM plus RAM with 4 GB headroom for the OS.
2. Prefer GPU offloading over CPU-only inference whenever VRAM is available.
3. Q4_K_M is the default quantization. Increase only when VRAM allows.
4. Context window size affects memory linearly. Default to 4096 unless the use case demands more.
5. For coding tasks, prefer Code Llama or DeepSeek Coder. For general chat, prefer Llama 3.1 or Mistral.
6. When VRAM is 0, recommend CPU-only llama.cpp with thread count equal to physical cores minus 1.
</reasoning_rules>

Use it once, get your plan, then run everything locally forever.

2. localmllm CLI

The localmllm.py CLI automates the entire workflow from hardware scanning and model recommendation through benchmarking and server launch. Download it from the Bastion CLI downloads section. Five commands cover the full lifecycle:

# Detect your hardware capabilities
python localmllm.py scan

# Get a model recommendation for your use case
python localmllm.py recommend --use-case coding

# Generate inference engine configuration
python localmllm.py config --engine llama.cpp

# Benchmark a model on your hardware
python localmllm.py benchmark --model ./models/llama-3.1-8b-q4_k_m.gguf

# Launch a local inference server
python localmllm.py serve --model ./models/llama-3.1-8b-q4_k_m.gguf

Full source code is available in the Bastion CLI downloads. Every command runs offline with zero cloud dependencies: your hardware, your models, your rules.

Caveats

Local inference requires honest expectations about what your hardware can deliver. A laptop with 8GB VRAM runs a 7B model beautifully but chokes on a 70B model regardless of quantization tricks. Respect the memory limits.

Model quality matters: a Q4 quantized 7B model produces good output for coding assistance and casual chat, but noticeably weaker output than GPT-4 or Claude for complex reasoning, long-form writing, and nuanced analysis, so match the model to the task.

Disk space adds up fast. A single 70B Q4_K_M file consumes 40GB. Keep models you actively use and delete the rest. The localmllm benchmark command tells you whether a model earns its disk space before you commit to keeping it.

Updates require manual effort. Cloud APIs improve automatically. Local models stay frozen until you download a newer version. Subscribe to model release feeds or check Hugging Face weekly for updates.

Philosophy

Bastion stands on one principle: own your compute the way you own your tools. A carpenter keeps their saws in their own shop. A writer keeps their manuscripts in their own desk. A practitioner of AI should keep their models on their own hardware.

Every cloud dependency is a lease where leases expire, terms change, prices increase, and services shut down. Local inference is ownership: you paid for the hardware, downloaded the model, and run the server, controlling access while keeping prompts private at a fixed cost.

The open source ecosystem made this possible. Llama.cpp, vLLM, Ollama, and the thousands of model contributors built the infrastructure of independence. Bastion curates and distills their work into actionable guides. Use them. Build on them. Own your stack.

CTA

Ready to cut the cord on cloud LLMs? Download the localmllm.py CLI from the Bastion CLI downloads section, run scan to profile your hardware, and have a local model serving requests in under thirty minutes. The complete source code lives in the same directory, ready for customization.

Bastion Series articles give you the full blueprint: prompts, tools, and philosophy. Subscribe to unlock every article and every CLI tool in the archive. Your data stays on your machine. Your capability stays under your control.

ArchonHQ

Discussion about this post

Ready for more?