ArchonHQ

ArchonHQ

Running AI Agents on a Laptop GPU - My 6GB VRAM Setup That Actually Works

How I run a personal AI crew on an everyday RTX 3060. Zero enterprise budget. Zero PhD. Zero excuses.

Michal Szalinski's avatar
Michal Szalinski
May 26, 2026
∙ Paid

You’ve seen the posts. “I’m running a 70B parameter model on my home server with 48GB VRAM.” Cool story. Most of us are staring at a laptop with 6GB of VRAM and 32GB of system RAM, wondering if personal AI agents are beyond our reach.

They’re within reach. I’m writing this on my everyday laptop in Melbourne, and my AI crew is running in the background right now. Water reminders, posture nudges, health research, meal planning, coding help, all happening locally, privately, and fast enough to keep pace.

Here’s the setup, the models, the use cases, and the honest performance numbers.

The Idea (60 Seconds)

You can run useful AI agents on a 6GB VRAM laptop today. The trick is picking the right models for the right tasks, and using a hybrid local/cloud approach that keeps costs under $5/month. Local inference handles 80% of daily agent work. Cloud fills the gap for complex reasoning. The framework connecting everything matters more than the hardware.

The Minimum Viable Setup

  • RTX 3060 (6GB VRAM) or equivalent

  • 16GB+ system RAM (32GB preferred)

  • Ollama or llama.cpp (free)

  • A model at Q4_K_M quantization

  • An agent framework that supports local + cloud switching

That’s it. Zero custom builds. Zero ML expertise. Zero $3,000 GPU.

The whole system costs me roughly $5/month in cloud API calls on top of the free local inference. Most days the cloud stays untouched.

ArchonHQ is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Why This Setup, Not the Others

Cloud-only is expensive at scale. Every API call costs money. When your agent runs 50+ tool calls per day, the bills add up. Local inference for routine tasks, cloud for complex ones, keeps costs negligible.

Big VRAM builds are overkill for daily agents. Most agent tasks involve structured data extraction, simple reasoning, and tool orchestration. A 4B model handles these at 25-40 tokens/second. The extra VRAM goes unused 80% of the time.

Hermes Agent is built for this. Most AI frameworks assume you’re running cloud-only or have a server rack. Hermes is designed to work with whatever you’ve got: local models via Ollama, cloud models via OpenRouter, and seamless switching between them with /model. The framework adds 200MB of overhead. The bottleneck lives elsewhere.

Walkthrough: My Model Lineup (And Why Each One Earns Its VRAM)

User's avatar

Continue reading this post for free, courtesy of Michal Szalinski.

Or purchase a paid subscription.
© 2026 Michal Szalinski · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture