{"id":22066,"date":"2026-04-30T16:23:19","date_gmt":"2026-04-30T16:23:19","guid":{"rendered":"https:\/\/ideainthebox.com\/index.php\/2026\/04\/30\/this-startups-new-mechanistic-interpretability-tool-lets-you-debug-llms\/"},"modified":"2026-04-30T16:23:19","modified_gmt":"2026-04-30T16:23:19","slug":"this-startups-new-mechanistic-interpretability-tool-lets-you-debug-llms","status":"publish","type":"post","link":"https:\/\/ideainthebox.com\/index.php\/2026\/04\/30\/this-startups-new-mechanistic-interpretability-tool-lets-you-debug-llms\/","title":{"rendered":"This startup\u2019s new mechanistic interpretability tool lets you debug LLMs"},"content":{"rendered":"<div>\n<p>The San Francisco\u2013based startup Goodfire just released a new tool, called Silico, that lets researchers and engineers peer inside an AI model and adjust its parameters\u2014the <a href=\"https:\/\/www.technologyreview.com\/2026\/01\/07\/1130795\/what-even-is-a-parameter\/\">settings that determine a model\u2019s behavior<\/a>\u2014during training. This could give model makers more fine-grained control over how this technology is built than was once thought possible.<\/p>\n<p>Goodfire claims Silico is the first off-the-shelf tool of its kind that can help developers debug all stages of the development process, from building a data set to training a model.<\/p>\n<p>The company says its mission is to make building AI models less like alchemy and more like a science. Sure, LLMs like ChatGPT and Gemini can do amazing things. But nobody knows exactly how or why they work, and that can make it hard to fix their flaws or block unwanted behaviors.\u00a0<\/p>\n<p>\u201cWe saw this widening gap between how well models were understood and just how widely they were being deployed,\u201d Goodfire\u2019s CEO, Eric Ho, tells <em>MIT Technology Review<\/em> in an exclusive chat ahead of Silico\u2019s release. 
\u201cI think the dominant feeling in every single major frontier lab today is that you just need more scale, more compute, more data, and then you get AGI [artificial general intelligence] and nothing else matters. And we\u2019re saying no, there\u2019s a better way.\u201d<\/p>\n<p>Goodfire is one of a small handful of companies, including industry leaders Anthropic, OpenAI, and Google DeepMind, pioneering a technique known as mechanistic interpretability, which aims to <a href=\"https:\/\/www.technologyreview.com\/2026\/01\/12\/1129782\/ai-large-language-models-biology-alien-autopsy\/\">understand what goes on inside an AI model<\/a> when it carries out a task by mapping its neurons and the pathways between them. (<em>MIT Technology Review<\/em> picked <a href=\"https:\/\/www.technologyreview.com\/2026\/01\/12\/1130003\/mechanistic-interpretability-ai-research-models-2026-breakthrough-technologies\/\">mechanistic interpretability<\/a> as one of its 10 Breakthrough Technologies of 2026.)<\/p>\n<p>Goodfire wants to use this approach not only to audit models\u2014that is, to study those that have already been trained\u2014but also to help design them in the first place.<\/p>\n<p>\u201cWe want to remove the trial and error and turn training models into precision engineering,\u201d says Ho. \u201cAnd that means exposing the knobs and dials so that you can actually use them during the training process.\u201d<\/p>\n<p>Goodfire has already used its techniques and tools to tweak the behaviors of LLMs\u2014for example, <a href=\"https:\/\/www.goodfire.ai\/research\/rlfr#\">reducing the number of hallucinations they produce<\/a>. With Silico, the company is now packaging up many of those in-house techniques and shipping them as a product.<\/p>\n<p>The tool uses agents to automate much of the complex work. \u201cAgents are now strong enough to do a lot of the interpretability work that we were doing using humans,\u201d says Ho. 
\u201cThat was kind of the gap that needed to be bridged before this was actually a viable platform that customers could use themselves.\u201d<\/p>\n<p>Leonard Bereska, a researcher at the University of Amsterdam who has worked on mechanistic interpretability, thinks Silico looks like a useful tool. But he pushes back on Goodfire\u2019s loftier aspirations. \u201cIn reality, they are adding precision to the alchemy,\u201d he says. \u201cCalling it engineering makes it sound more principled than it is.\u201d<\/p>\n<h3 class=\"wp-block-heading\">Mapping models<\/h3>\n<p>Silico lets you zoom in on specific parts of a trained model, such as individual neurons or groups of neurons, and run experiments to see what those neurons do. (This assumes you have access to the model\u2019s inner workings: most people won\u2019t be able to use Silico to poke around inside ChatGPT or Gemini, but you can use it to look at the parameters inside many open-source models.) You can then check what inputs make different neurons fire, and trace pathways upstream and downstream of a neuron to see how other neurons affect it and how it affects other neurons in turn.<\/p>\n<p>For example, Goodfire found one neuron inside the open-source model Qwen 3 that was associated with the so-called trolley problem. Activating this neuron changed the model\u2019s responses, making it frame its outputs as explicit moral dilemmas. \u201cWhen this neuron\u2019s active, all sorts of weird things happen,\u201d says Ho.<\/p>\n<p>Pinpointing the source of odd behavior like this is now pretty standard practice. But Goodfire wants to make it easier to adjust that behavior. Using Silico, developers can now adjust the parameters connected to individual neurons to boost or suppress certain behaviors.<\/p>\n<p>In another example, Goodfire researchers asked a model whether a company should disclose that its AI behaves deceptively in 0.3% of cases, affecting 200 million users. 
The model said no, citing the negative business impact of such a disclosure.<\/p>\n<p>By looking inside the model, the researchers found that boosting neurons associated with transparency and disclosure flipped the answer from no to yes nine out of 10 times. \u201cThe model already had the ethical reasoning circuitry, but it was being outweighed by the commercial risk assessment,\u201d says Ho.<\/p>\n<p>Tweaking the values of a model in this way is just one approach. Silico can also help steer the training process by filtering out certain training data to avoid setting unwanted values for certain parameters in the first place.<\/p>\n<p>For example, many models will tell you that <a href=\"https:\/\/community.openai.com\/t\/why-9-11-is-larger-than-9-9-incredible\/869824\">9.11 is greater than 9.9<\/a>. Looking inside a model to see what\u2019s going on might reveal that it is being influenced by neurons associated with the Bible, in which verse 9.9 comes before 9.11, or by code repositories where consecutive updates are numbered 9.9, 9.10, 9.11, and so on. Using this information, developers can retrain the model to avoid relying on its \u201cBible\u201d neurons when doing math.<\/p>\n<p>By releasing Silico, Goodfire wants to put techniques previously available to a few top labs into the hands of smaller firms and research teams that want to build their own model or adapt an open-source one. The tool will be available for a fee determined on a case-by-case basis according to customers\u2019 requirements (Goodfire declined to give specific pricing details).<\/p>\n<p>\u201cIf we can make training models a lot more like building software, there\u2019s no reason why there can\u2019t be many more companies designing models that fit their needs,\u201d says Ho.<\/p>\n<p>Bereska agrees that tools like Silico could help firms build more trustworthy models. 
These techniques could be essential for safety-critical applications in health care and finance, he says.<\/p>\n<p>\u201cFrontier labs already have internal interpretability teams,\u201d he adds. \u201cSilico arms the next tier of companies, where the value is not having to hire interpretability researchers.\u201d<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The San Francisco\u2013based startup Goodfire just released a new tool,  [&#8230;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[226],"tags":[],"class_list":["post-22066","post","type-post","status-publish","format-standard","hentry","category-technology"],"acf":[],"_links":{"self":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/posts\/22066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/comments?post=22066"}],"version-history":[{"count":0,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/posts\/22066\/revisions"}],"wp:attachment":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/media?parent=22066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/categories?post=22066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/tags?post=22066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}