ArchonHQ

Clone Needle: Build a 26M Parameter Tool-Calling Model

Distill Gemini's function-calling into a tiny model that runs locally — replacing expensive cloud APIs with free inference

Michal Szalinski
May 15, 2026

Your production app calls OpenAI’s API 847 times per day. At a cent or two per function call, the monthly bill lands around $380. Your CFO asks pointed questions about “AI infrastructure costs” in the quarterly review.

Meanwhile, your tool-calling needs are embarrassingly simple. Parse JSON. Validate schemas. Route function calls. Extract parameters. A 175B parameter model feels like hiring a PhD to sort your mail.

What if you could distill those capabilities into a 26M parameter model that runs locally, costs zero per inference, and handles 90% of your tool-calling workload?

The Idea (60 Seconds)

You’ll build a lightweight tool-calling model by distilling Gemini’s function-calling behavior into a compact transformer. The 26M parameter model runs locally, processes tool calls in 50-100ms, and handles structured JSON output with schema validation. Training takes 4 hours on a single GPU. The result replaces expensive API calls for routine function routing and parameter extraction.
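
The post doesn’t pin down the student architecture, so here is a minimal sketch of one configuration that lands near 26M parameters. Every number below (vocabulary size, width, depth) is an illustrative assumption, not the author’s recipe.

```python
from dataclasses import dataclass

@dataclass
class StudentConfig:
    # Illustrative values only; chosen so the total lands near 26M parameters.
    vocab_size: int = 32_000   # small tokenizer, embedding tied with the LM head
    d_model: int = 384
    n_layers: int = 8
    n_heads: int = 6
    d_ff: int = 1_536          # 4 * d_model

def approx_params(cfg: StudentConfig) -> int:
    """Rough parameter count for a decoder-only transformer with tied embeddings."""
    embed = cfg.vocab_size * cfg.d_model      # token embedding (shared with output head)
    attn = 4 * cfg.d_model * cfg.d_model      # Q, K, V, and output projections
    mlp = 2 * cfg.d_model * cfg.d_ff          # up- and down-projections
    return embed + cfg.n_layers * (attn + mlp)  # norms and biases omitted (negligible)

print(f"{approx_params(StudentConfig()) / 1e6:.1f}M parameters")  # ≈ 26.4M
```

A model this size weighs in around 100 MB in fp32, roughly half that in fp16, which is what makes 50-100ms local inference plausible on ordinary hardware.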


Why Distillation Beats Fine-Tuning

Conventional fine-tuning teaches a small model tool-calling from scratch, using nothing but labeled input-output pairs. Distillation starts with a teacher model that already excels at function calls. You’re copying expertise instead of building it.

Data efficiency matters more than parameter count. Fine-tuning needs 50K+ examples to learn tool-calling patterns. Distillation works with around 5K teacher-student pairs, because every pair comes from a teacher that already emits well-formed calls: the student gets consistent, schema-faithful supervision instead of noisy hand-labeled mappings.

Gemini’s tool-calling is already production-tested. Google spent millions optimizing function call accuracy. Distillation captures that optimization in a model you own completely.
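
For a concrete picture of what those teacher-student pairs look like, here is a hedged sketch of harvesting them from Gemini. It assumes the google-generativeai Python SDK; the get_weather tool, prompts, and file name are made-up placeholders, and the post’s actual pipeline isn’t shown here.

```python
# Sketch: collect (prompt -> structured tool call) pairs from the teacher.
# Assumes the google-generativeai SDK; tool schema and prompts are illustrative.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

weather_tool = {
    "function_declarations": [{
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
            "required": ["city"],
        },
    }]
}

teacher = genai.GenerativeModel("gemini-1.5-flash", tools=[weather_tool])

pairs = []
for prompt in ["What's the weather in Paris?", "Is it raining in Tokyo right now?"]:
    response = teacher.generate_content(prompt)
    call = response.candidates[0].content.parts[0].function_call
    if call.name:  # the teacher chose to emit a tool call
        target = {"name": call.name, "arguments": dict(call.args)}
        # One distillation example: user text in, structured JSON call out.
        pairs.append({"input": prompt, "output": json.dumps(target)})

with open("distill_pairs.jsonl", "w") as f:
    f.writelines(json.dumps(p) + "\n" for p in pairs)
```

Loop a few thousand varied prompts through this and you have the roughly 5K pairs the post describes, each already in the exact JSON shape the student will be trained to emit.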

The math is simple: 5K distillation examples vs 50K fine-tuning examples. 4 hours vs 40 hours. $20 in compute vs $200.
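
Because the teacher’s supervision arrives as plain text (the JSON call itself), the student side reduces to sequence-level distillation: ordinary next-token training on the teacher’s outputs. The sketch below shows one gradient step under that assumption, with a Hugging Face-style causal LM and tokenizer standing in for whatever the post actually uses.

```python
# Minimal sketch of one distillation training step (sequence-level KD:
# cross-entropy on the teacher-generated calls). Model and tokenizer are assumed
# to follow the Hugging Face causal-LM interface; nothing here is the post's code.
import torch.nn.functional as F

def training_step(student, tokenizer, batch, optimizer, device="cuda"):
    """One gradient step on (prompt -> teacher tool call) pairs."""
    texts = [ex["input"] + "\n" + ex["output"] for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

    # Plain next-token prediction over the whole sequence. Masking the prompt
    # tokens so the loss covers only the JSON call is a common refinement,
    # omitted here for brevity.
    logits = student(**enc).logits[:, :-1, :]
    targets = enc["input_ids"][:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=tokenizer.pad_token_id,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At 26M parameters and roughly 5K pairs, a few epochs of this loop is what fits inside the 4-hour, single-GPU budget the post quotes.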
