{"id":16011,"date":"2026-01-07T11:37:16","date_gmt":"2026-01-07T11:37:16","guid":{"rendered":"https:\/\/ideainthebox.com\/index.php\/2026\/01\/07\/what-even-is-a-parameter\/"},"modified":"2026-01-07T11:37:16","modified_gmt":"2026-01-07T11:37:16","slug":"what-even-is-a-parameter","status":"publish","type":"post","link":"https:\/\/ideainthebox.com\/index.php\/2026\/01\/07\/what-even-is-a-parameter\/","title":{"rendered":"LLMs contain a LOT of parameters. But what\u2019s a parameter?"},"content":{"rendered":"<div>\n<p>MIT Technology Review<em> Explains: Let our writers untangle the complex, messy world of technology to help you understand what\u2019s coming next. <\/em><a href=\"https:\/\/www.technologyreview.com\/tag\/tech-review-explains\"><em>You can read more from the series here<\/em><\/a><em>.<\/em><\/p>\n<p>I am writing this because one of my editors woke up in the middle of the night and scribbled on a bedside notepad: \u201cWhat is a parameter?\u201d Unlike a lot of thoughts that hit at 4 a.m., it\u2019s a really good question\u2014one that goes right to the heart of how large language models work. And I\u2019m not just saying that because he\u2019s my boss. (Hi, Boss!)<\/p>\n<p>A large language model\u2019s parameters are often said to be the dials and levers that control how it behaves. Think of a planet-size pinball machine that sends its balls pinging from one end to the other via billions of paddles and bumpers set just so. Tweak those settings and the balls will behave in a different way.\u00a0\u00a0<\/p>\n<p>OpenAI\u2019s GPT-3, released in 2020, had 175 billion parameters. Google DeepMind\u2019s latest LLM, Gemini 3, may have at least a trillion\u2014some think it\u2019s probably more like 7 trillion\u2014but the company isn\u2019t saying. 
(With competition now fierce, AI firms no longer share information about how their models are built.)<\/p>\n<p>But the basics of what parameters are and how they make LLMs do the remarkable things that they do are the same across different models. Ever wondered what makes an LLM really tick\u2014what\u2019s behind the colorful pinball-machine metaphors? Let\u2019s dive in.\u00a0\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>What is a parameter?<\/strong><\/h3>\n<p>Think back to middle school algebra and an expression like 2<em>a<\/em> + <em>b<\/em>. Those letters are parameters: Assign them values and you get a result. In math or coding, parameters are used to set limits or determine output. The parameters inside LLMs work in a similar way, just on a mind-boggling scale.\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>How are they assigned their values?<\/strong><\/h3>\n<p>Short answer: an algorithm. When training begins, each parameter is set to a random value. The training process then involves an iterative series of calculations (known as training steps) that update those values. In the early stages of training, a model will make errors. The training algorithm looks at each error and goes back through the model, tweaking the value of each of the model\u2019s many parameters so that next time that error is smaller. This happens over and over again until the model behaves in the way its makers want it to. At that point, training stops and the values of the model\u2019s parameters are fixed.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Sounds straightforward \u2026<\/strong><\/h3>\n<p>In theory! In practice, because LLMs are trained on so much data and contain so many parameters, training them requires a huge number of steps and an eye-watering amount of computation. During training, the 175 billion parameters inside a medium-size LLM like GPT-3 will each get updated tens of thousands of times. 
In total, that adds up to quadrillions (a number with 15 zeros) of individual calculations. That\u2019s why training an LLM takes so much energy. We\u2019re talking about thousands of specialized high-speed computers running nonstop for months.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Oof. What are all these parameters for, exactly?<\/strong><\/h3>\n<p>There are three different types of parameters inside an LLM that get their values assigned through training: embeddings, weights, and biases. Let\u2019s take each of those in turn.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Okay! So, what are embeddings?<\/strong><\/h3>\n<p>An embedding is the mathematical representation of a word (or part of a word, known as a token) in an LLM\u2019s vocabulary. An LLM\u2019s vocabulary, which might contain up to a few hundred thousand unique tokens, is set by its designers before training starts. But there\u2019s no meaning attached to those words. That comes during training.\u00a0\u00a0<\/p>\n<p>When a model is trained, each word in its vocabulary is assigned a numerical value that captures the meaning of that word in relation to all the other words, based on how the word appears in countless examples across the model\u2019s training data.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Each word gets replaced by a kind of code?<\/strong><\/h3>\n<p>Yeah. But there\u2019s a bit more to it. The numerical value\u2014the embedding\u2014that represents each word is in fact a <em>list<\/em> of numbers, with each number in the list representing a different facet of meaning that the model has extracted from its training data. The length of this list of numbers is another thing that LLM designers can specify before an LLM is trained. A common size is 4,096.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Every word inside an LLM is represented by a list of 4,096 numbers?\u00a0\u00a0<\/strong><\/h3>\n<p>Yup, that\u2019s an embedding. And each of those numbers is tweaked during training. 
An LLM with embeddings that are 4,096 numbers long is said to have 4,096 dimensions.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Why 4,096?<\/strong><\/h3>\n<p>It might look like a strange number. But LLMs (like anything that runs on a computer chip) work best with powers of two\u20142, 4, 8, 16, 32, 64, and so on. LLM engineers have found that 4,096 is a power of two that hits a sweet spot between capability and efficiency. Models with fewer dimensions are less capable; models with more dimensions are too expensive or slow to train and run.\u00a0<\/p>\n<p>Using more numbers allows the LLM to capture very fine-grained information about how a word is used in many different contexts, what subtle connotations it might have, how it relates to other words, and so on.<\/p>\n<p>In February 2025, <a href=\"https:\/\/www.technologyreview.com\/2025\/02\/27\/1112619\/openai-just-released-gpt-4-5-and-says-it-is-its-biggest-and-best-chat-model-yet\/\">OpenAI released GPT-4.5<\/a>, the firm\u2019s largest LLM yet (some estimates have put its parameter count at more than 10 trillion). Nick Ryder, a research scientist at OpenAI who worked on the model, told me at the time that bigger models can work with extra information, like emotional cues, such as when a speaker\u2019s words signal hostility: \u201cAll of these subtle patterns that come through a human conversation\u2014those are the bits that these larger and larger models will pick up on.\u201d<\/p>\n<p>The upshot is that all the words inside an LLM get encoded into a high-dimensional space. Picture thousands of words floating in the air around you. Words that are closer together have similar meanings. 
For example, \u201ctable\u201d and \u201cchair\u201d will be closer to each other than they are to \u201castronaut,\u201d which is close to \u201cmoon\u201d and \u201cMusk.\u201d Way off in the distance you can see \u201cprestidigitation.\u201d It\u2019s a little like that, but instead of being related to each other across three dimensions, the words inside an LLM are related across 4,096 dimensions.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Yikes.<\/strong><\/h3>\n<p>It\u2019s dizzying stuff. In effect, an LLM compresses the entire internet into a single monumental mathematical structure that encodes an unfathomable amount of interconnected information. It\u2019s both why LLMs can do astonishing things and why they\u2019re impossible to fully understand.\u00a0\u00a0\u00a0\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>Okay. So that\u2019s embeddings. What about weights?<\/strong><\/h3>\n<p>A weight is a parameter that represents the strength of a connection between different parts of a model\u2014and it is one of the most common types of dial for tuning a model\u2019s behavior. Weights are used when an LLM processes text.<\/p>\n<p>When an LLM reads a sentence (or a book chapter), it first looks up the embeddings for all the words and then passes those embeddings through a series of neural networks, known as transformers, that are designed to process sequences of data (like text) all at once. Every word in the sentence gets processed in relation to every other word.<\/p>\n<p>This is where weights come in. An embedding represents the meaning of a word without context. When a word appears in a specific sentence, transformers use weights to process the meaning of that word in that new context. (In practice, this involves multiplying each embedding by the weights for all other words.)<\/p>\n<h3 class=\"wp-block-heading\"><strong>And biases?<\/strong><\/h3>\n<p>Biases are another type of dial that complements the effects of the weights. 
Weights set the thresholds at which different parts of a model fire (and thus pass data on to the next part). Biases are used to adjust those thresholds so that an embedding can trigger activity even when its value is low. (Biases are values that are added to an embedding rather than multiplied with it.)\u00a0<\/p>\n<p>By shifting the thresholds at which parts of a model fire, biases allow the model to pick up information that might otherwise be missed. Imagine you\u2019re trying to hear what somebody is saying in a noisy room. Weights would amplify the loudest voices the most; biases are like a knob on a listening device that pushes quieter voices up in the mix.\u00a0<\/p>\n<p>Here\u2019s the TL;DR: Weights and biases are two different ways that an LLM extracts as much information as it can out of the text it is given. And both types of parameters are adjusted over and over again during training to make sure they do this.\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>Okay. What about neurons? Are they a type of parameter too?\u00a0<\/strong><\/h3>\n<p>No, neurons are more a way to organize all this math\u2014containers for the weights and biases, strung together by a web of pathways between them. It\u2019s all very loosely inspired by biological neurons inside animal brains, with signals from one neuron triggering new signals from the next and so on.\u00a0<\/p>\n<p>Each neuron in a model holds a single bias and weights for every one of the model\u2019s dimensions. In other words, if a model has 4,096 dimensions\u2014and therefore its embeddings are lists of 4,096 numbers\u2014then each of the neurons in that model will hold one bias and 4,096 weights.\u00a0<\/p>\n<p>Neurons are arranged in layers. In most LLMs, each neuron in one layer is connected to every neuron in the layer above. A 175-billion-parameter model like GPT-3 might have around 100 layers with a few tens of thousands of neurons in each layer. 
And each neuron is running tens of thousands of computations at a time.\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>Dizzy again. That\u2019s a lot of math.<\/strong><\/h3>\n<p>That\u2019s a lot of math.<\/p>\n<h3 class=\"wp-block-heading\"><strong>And how does all of that fit together? How does an LLM take a bunch of words and decide what words to give back?<\/strong><\/h3>\n<p>When an LLM processes a piece of text, the numerical representation of that text\u2014the embedding\u2014gets passed through multiple layers of the model. In each layer, the value of the embedding (that list of 4,096 numbers) gets updated many times by a series of computations involving the model\u2019s weights and biases (attached to the neurons) until it gets to the final layer.<\/p>\n<p>The idea is that all the meaning and nuance and context of that input text is captured by the final value of the embedding after it has gone through a mind-boggling series of computations. That value is then used to calculate the next word that the LLM should spit out.\u00a0<\/p>\n<p>It won\u2019t be a surprise that this is more complicated than it sounds: The model in fact calculates, for every word in its vocabulary, how likely that word is to come next and ranks the results. It then picks the top word. (Kind of. See below \u2026)\u00a0<\/p>\n<p>That word is appended to the previous block of text, and the whole process repeats until the LLM calculates that the most likely next word to spit out is one that signals the end of its output.\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>That\u2019s it?\u00a0\u00a0<\/strong><\/h3>\n<p>Sure. Well \u2026<\/p>\n<h3 class=\"wp-block-heading\"><strong>Go on.<\/strong><\/h3>\n<p>LLM designers can also specify a handful of other parameters, known as hyperparameters. 
The main ones are called temperature, top-p, and top-k.<\/p>\n<h3 class=\"wp-block-heading\"><strong>You\u2019re making this up.<\/strong><\/h3>\n<p>Temperature is a parameter that acts as a kind of creativity dial. It influences the model\u2019s choice of what word comes next. I just said that the model ranks the words in its vocabulary and picks the top one. But the temperature parameter can be used to push the model to choose the most probable next word, making its output more factual and relevant, or a less probable word, making the output more surprising and less robotic.\u00a0<\/p>\n<p>Top-p and top-k are two more dials that control the model\u2019s choice of next words. They are settings that force the model to pick a word at random from a pool of most probable words instead of the top word. These parameters affect how the model comes across\u2014quirky and creative versus trustworthy and dull.\u00a0\u00a0\u00a0<\/p>\n<h3 class=\"wp-block-heading\"><strong>One last question! There has been a lot of buzz about small models that can outperform big models. How does a small model do more with fewer parameters?<\/strong><\/h3>\n<p>That\u2019s <a href=\"https:\/\/www.technologyreview.com\/2025\/01\/03\/1108800\/small-language-models-ai-breakthrough-technologies-2025\/\">one of the hottest questions<\/a> in AI right now. There are a lot of different ways it can happen. Researchers have found that the amount of training data makes a huge difference. First you need to make sure the model sees enough data: An LLM trained on too little text won\u2019t make the most of all its parameters, and a smaller model trained on the same amount of data could outperform it.\u00a0<\/p>\n<p>Another trick researchers have hit on is overtraining. Showing models far more data than previously thought necessary seems to make them perform better. The result is that a small model trained on a lot of data can outperform a larger model trained on less data. Take Meta\u2019s Llama LLMs. 
The 70-billion-parameter Llama 2 was trained on around 2 trillion tokens of text; the 8-billion-parameter Llama 3 was trained on around 15 trillion tokens of text. The far smaller Llama 3 is the better model.\u00a0<\/p>\n<p>A third technique, known as distillation, uses a larger model to train a smaller one. The smaller model is trained not only on the raw training data but also on the outputs of the larger model\u2019s internal computations. The idea is that the hard-won lessons encoded in the parameters of the larger model trickle down into the parameters of the smaller model, giving it a boost.\u00a0<\/p>\n<p>In fact, the days of single monolithic models may be over. Even the largest models on the market, like OpenAI\u2019s GPT-5 and Google DeepMind\u2019s Gemini 3, can be thought of as several small models in a trench coat. Using a technique called \u201cmixture of experts,\u201d large models can turn on just the parts of themselves (the \u201cexperts\u201d) that are required to process a specific piece of text. This combines the abilities of a large model with the speed and lower power consumption of a small one.<\/p>\n<p>But that\u2019s not the end of it. Researchers are still figuring out ways to get the most out of a model\u2019s parameters. As the gains from straight-up scaling tail off, jacking up the number of parameters no longer seems to make the difference it once did. It\u2019s not so much how many you have, but what you do with them.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Can I see one?<\/strong><\/h3>\n<p>You want to <em>see<\/em> a parameter? Knock yourself out: Here\u2019s an embedding. 
<\/p>\n<div class=\"flourish-embed flourish-text-annotator\" data-src=\"visualisation\/26927979?1184216\"><script src=\"https:\/\/public.flourish.studio\/resources\/embed.js\">hello<\/script><\/div>\n<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>MIT Technology Review Explains: Let our writers untangle the complex,  [&#8230;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[226],"tags":[],"class_list":["post-16011","post","type-post","status-publish","format-standard","hentry","category-technology"],"acf":[],"_links":{"self":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/posts\/16011","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/comments?post=16011"}],"version-history":[{"count":0,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/posts\/16011\/revisions"}],"wp:attachment":[{"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/media?parent=16011"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/categories?post=16011"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ideainthebox.com\/index.php\/wp-json\/wp\/v2\/tags?post=16011"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}