AI Can't Reliably Audit Compiled Code. The Numbers Prove It.
22 February 2026·2 min read


A new benchmark shows Claude Opus 4.6 detects only 49% of backdoors in compiled binaries with a 22% false positive rate. AI binary auditing is promising but not production-ready.


A new open benchmark found that the best-performing model tested, Claude Opus 4.6, detected only 49% of hidden backdoors in compiled binaries while flagging clean code as malicious 22% of the time.

Researchers from Quesma partnered with a Dragon Sector reverse engineering expert to build BinaryAudit. The test suite used real network daemons: web servers, DNS servers, SSH servers, proxies. Each had an artificially embedded backdoor, and no source code was provided. Even armed with Ghidra, today's top models struggle because compilation strips away everything LLMs are trained on. There are no function names, no comments, no clear file structure. Just raw assembly.

Supply chain attacks make this matter. The XZ Utils backdoor in 2024 came dangerously close to compromising SSH servers across millions of Linux systems. It was embedded in a compression library by a patient contributor who had social-engineered his way into maintainer access, and it was caught only because one engineer noticed slightly slower logins. That kind of attack hides in compiled code, not in a README.

A 49% detection rate with a 22% false positive rate means missing half the real threats while burning analyst time on ghosts. AI-assisted binary auditing is a genuinely exciting frontier, but it is not production-ready.

What this means for your organisation

Don't rely on AI alone to audit firmware, container images, or third-party packages in compiled form. A 49% detection rate sounds like a useful starting point until you consider the 22% false positive rate alongside it. That combination produces alert fatigue that makes the real hits easier to miss, not harder.
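The arithmetic behind that warning is worth making concrete. The 49% detection rate and 22% false positive rate come from the benchmark; the batch size and the 1-in-100 base rate of genuinely backdoored binaries below are purely illustrative assumptions, not figures from the study.

```python
# Back-of-the-envelope triage math: how a 22% false positive rate
# swamps a 49% detection rate once real backdoors are rare.
# Only detection_rate and fp_rate come from the benchmark; everything
# else is an illustrative assumption.

def triage_outcomes(n_binaries, base_rate, detection_rate=0.49, fp_rate=0.22):
    """Expected alert counts and precision for a batch of audited binaries."""
    backdoored = n_binaries * base_rate
    clean = n_binaries - backdoored
    true_alerts = backdoored * detection_rate   # real backdoors caught
    false_alerts = clean * fp_rate              # clean binaries flagged anyway
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision

# Assume 1 in 100 third-party binaries actually carries a backdoor.
tp, fp, prec = triage_outcomes(n_binaries=1000, base_rate=0.01)
print(f"real alerts: {tp:.0f}, false alerts: {fp:.0f}, precision: {prec:.1%}")
# → real alerts: 5, false alerts: 218, precision: 2.2%
```

Under those assumptions, roughly 218 of every 223 alerts are false, so an analyst sees about one real hit per forty-five investigations. That ratio, not the headline 49%, is what drives alert fatigue.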

The practical posture: use AI for initial triage and surface-level anomaly flagging, then route anything flagged to a specialist. That specialist doesn't have to be in-house. Dragon Sector, Quarkslab, and similar firms exist precisely for this. The cost of one targeted binary audit on a critical dependency is orders of magnitude cheaper than the breach it prevents. Any team treating AI as a complete substitute for that expertise is taking on risk they probably haven't quantified.


Source: BinaryAudit — We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them