Inside ChatGPT: How AI Understands and Generates Language

How ChatGPT Actually Works - no math, no code

Watch the video (especially for all the amazing visuals!):

You might have heard that AI can do all sorts of mind-blowing stuff, from talking to you like a human to generating code and even analyzing images. But how do these models actually work under the hood? If you’re a beginner in Python or you have limited programming experience, don’t worry. We’ll try to keep it simple while still covering the core ideas. By the end of this, you’ll have a clearer sense of how LLMs fit into the bigger picture of AI, how they handle language, why they sometimes get things wrong, and why they’re still incredibly powerful tools for building next-generation applications.

Let’s start with a quick story. Picture yourself in a language class. You want to learn French, but instead of memorizing grammar rules one by one, you start reading a massive amount of French text (novels, news articles, random social media posts) and you try to predict how each sentence will continue from just its first few words. At first, you’re obviously going to make random guesses and plenty of mistakes. But with lots of practice and feedback, someone telling you what the right word actually was, you begin to spot patterns. You notice that certain words cluster together more often than others. By the time you’ve read a million French sentences, you’re actually pretty good at predicting the next words. That’s more or less what LLMs do, just on a massive scale. They read through vast amounts of internet data (web pages, books, code, you name it) and try to predict the next word in a sequence. That’s called next-token prediction, but you don’t need to worry about the term.

When we say AI, we mean a whole field of study and technology aimed at building machines that can perform tasks that normally require human intelligence, such as recognizing speech, identifying images, or making decisions. Within AI, there’s a subfield called machine learning, which is all about teaching computers to learn from data without you programming every single rule. Traditionally, if you wanted a computer to do a task, you would hard-code every rule yourself: “If X, do Y; if A, do B.” Machine learning flips that by letting algorithms digest data and learn the patterns themselves, which scales to far more complex behaviour than a handful of “if, then” statements.

Inside machine learning, there’s the concept of neural networks, which are very loosely inspired by how the human brain works. Imagine a network of lots of small units (we call them neurons) that are connected to each other. In the human brain, neurons pass signals to each other through connections called synapses. In an artificial neural network, these “neurons” pass numbers to each other instead of the chemical and electrical signals in our brain. By tuning those connections across billions of examples, the network learns to perform a task, much like babies learn by observing and playing with things. Deep learning, in turn, just means we’re using really big networks with many layers of neurons, so they can detect more complicated patterns.
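If you’re curious what a single artificial neuron looks like in code, here’s a minimal sketch in plain Python. The weights and inputs are made-up numbers; in a real network they are learned from data:

```python
# A single artificial "neuron": a weighted sum of inputs passed through
# an activation function. The values here are invented for illustration;
# in a real network the weights are learned from billions of examples.

def neuron(inputs, weights, bias):
    # Weighted sum of the incoming numbers
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ReLU activation: pass positive signals through, silence negative ones
    return max(0.0, total)

print(neuron(inputs=[0.5, -1.2, 3.0], weights=[0.8, 0.1, 0.4], bias=0.2))
```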

LLMs belong to this category of deep learning, and they use a specific kind of neural network called the transformer. Think of a transformer like a factory assembly line that can handle a lot of text in parallel instead of one piece at a time, as older networks did. Instead of reading a sentence word by word from left to right, the model can look at everything simultaneously and decide which parts of the text are more important. This is made possible by something called the “attention mechanism,” which tells the model to pay special attention to certain words depending on the context. If the text says, “He sat on the bank and watched the river flow,” the attention mechanism helps the model realize that “bank” refers to the side of a river and not a financial institution. All of this is learned by looking at examples, just like in our French story, or like learning a new movement in a sport: practice makes perfect. The same goes for LLMs, which explains why they need such huge amounts of data, pretty much the entire internet.
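As a rough illustration (not the exact transformer formula), attention boils down to scoring how relevant each context word is to the word being interpreted and turning those scores into weights. A toy sketch with made-up numbers:

```python
import math

# Toy "attention": score how relevant each context word is to a target word,
# then turn the scores into weights that sum to 1 (a softmax).
# The tiny vectors below are invented embeddings, just for illustration.
context = {
    "sat":   [0.9, 0.1],
    "bank":  [0.2, 0.8],
    "river": [0.3, 0.9],
}
target = [0.25, 0.85]  # pretend this represents the word we're interpreting

scores = {w: sum(a * b for a, b in zip(vec, target)) for w, vec in context.items()}
total = sum(math.exp(s) for s in scores.values())
weights = {w: math.exp(s) / total for w, s in scores.items()}

print(weights)  # "river" gets the most weight, nudging "bank" toward the riverside meaning
```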

But before we get into all that fancy stuff, let’s go back to how we turn raw text into something a neural network can understand. The model can’t just read letters like a human can; it needs everything to be in numbers. So we break text down into pieces called tokens. If you have a sentence like “The cat sat on the mat,” you might split it into tokens like “The,” “cat,” “sat,” “on,” “the,” “mat.” Usually these tokens are words, but depending on the tokenizer (the program that splits text into tokens), they can be smaller pieces like subwords or even individual characters. The important part is that the model sees a sequence of tokens, not raw letters; computers don’t understand letters, only numbers.
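If you want to see tokenization in action, the short snippet below uses OpenAI’s open-source tiktoken library as one example tokenizer among many (the exact token IDs you get depend on which encoding you pick):

```python
import tiktoken  # pip install tiktoken

# Load one common tokenizer encoding and split a sentence into tokens.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The cat sat on the mat")

print(tokens)              # a list of integers, one per token
print(enc.decode(tokens))  # decoding the integers gives back the original text
```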

After tokenization, each token is mapped to a list of numbers, typically called an embedding, so the computer can do math with it. For instance, “cat” might map to [0.12, -0.99, 3.11, …] while “dog” might map to [0.10, -1.02, 3.20, …]. How far apart those lists are captures how closely related the words’ meanings are. Words like “cat” and “kitten” will be super close in that embedding space because their meanings overlap a lot, while “cat” and “banana” will be much further apart. You can think of each number as representing some characteristic, like size, colour, or shape. On that view, our cat and dog would be quite similar, especially for smaller dogs, and the banana would be pretty different. This is also why those look-alike memes work, where images of completely different things end up looking super similar, even to us!

These mappings are crucial for everything the model does. They let the model measure similarities, differences, and relationships among words. They’re not directly intuitive to you or me, but they are to the neural network, which can process these lists of numbers super efficiently through large matrix multiplications.
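Here’s a tiny, made-up illustration of that idea: represent each word as a short list of numbers and measure how related two words are with cosine similarity. Real embeddings have hundreds or thousands of numbers; the values below are invented for the example:

```python
import math

# Invented 3-number "embeddings"; real ones are learned and much longer.
embeddings = {
    "cat":    [0.12, -0.99, 3.11],
    "dog":    [0.10, -1.02, 3.20],
    "banana": [2.50,  1.75, -0.60],
}

def cosine_similarity(a, b):
    # Close to 1.0 means the vectors point the same way (very related meanings).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))     # high: similar meanings
print(cosine_similarity(embeddings["cat"], embeddings["banana"]))  # much lower: unrelated
```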

Now, the magic really happens when we talk about how these models are trained. During training, the LLM tries to guess the next token in a massive dataset. Imagine reading a trillion words from the internet and, each time, covering up the next word and trying to predict it. The difference between the model’s guess and the actual next word is called the error. The training process adjusts the model’s internal parameters, those billions of values that connect neurons, to reduce that error bit by bit. Over many passes through this huge dataset, the model refines its parameters to become really good at next-token prediction. In that sense, an LLM is kind of like a giant compression system trying to store the essence of everything it has read. But because it’s so compressed, it doesn’t memorize every single line perfectly. It picks up the patterns and relationships so it can produce new text that lines up with what it learned.
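To make the “cover up the next word and guess it” idea concrete, here’s a toy next-word predictor that just counts which word follows which in a tiny text. It’s nothing like a real transformer (no neural network, no gradual weight adjustment), but the goal is the same: learn from examples to guess what comes next:

```python
from collections import Counter, defaultdict

# A toy next-word predictor: it simply counts which word follows which in the
# training text. Real LLMs learn far richer patterns, but the task is the same:
# given the words so far, guess the next one.
corpus = "the cat sat on the mat the cat slept on the sofa".split()

follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word):
    # Pick the word that most often followed `word` in the training data.
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("the"))   # 'cat', because it followed "the" most often
print(predict_next("sat"))   # 'on'
```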

One of the big reasons LLMs are so powerful is that they can then generate coherent sentences and even entire paragraphs in a variety of styles. They can mimic formal writing, casual conversation, or even the style of specific authors, all by using the patterns they extracted from their training data. That’s why you can tell an LLM to “Write me a poem about a lonely rocket ship” and it can produce something that actually reads like a poem.

At this point, you might wonder, “If these models learned everything from the internet, why do they sometimes mess up?” That’s because the training approach is essentially just next-word prediction. The model doesn’t “know” facts in the way we do. It only knows patterns and probabilities. So if it sees a question it’s not 100% sure about, it might produce a probable sequence of words that sounds right but is factually incorrect. That’s known as hallucination. It might convincingly tell you that the capital of Argentina is Bogotá, which is definitely wrong, but it can sound so confident because it’s simply generating text. Think of it like someone who’s very good at speaking fluently but hasn’t fully studied geography. They might deliver a wrong fact with total conviction just because they’re trying to “sound correct.”

To make LLMs more useful and less prone to spitting out weird or dangerous answers, there’s a second phase after training called “post-training.” In this phase, we first have what we call “instruction tuning”, where we give the LLM examples of user instructions and correct responses. The idea is to teach the model to follow instructions more carefully rather than just continuing with the next-word pattern it learned from the internet. Again, simply learning from examples. After that, many labs apply something called reinforcement learning from human feedback, or RLHF. Here, humans rate the quality of the model’s responses, and the model gets rewarded (or penalized) based on those ratings. Over many rounds (and examples), the model better aligns with what people generally consider a “good” answer. It’s a bit like training a dog with treats when it does a trick right. That’s how ChatGPT and similar systems have gotten better at responding to questions in ways that humans prefer.
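To make that concrete, here’s roughly what the data for these two phases can look like. The examples below are invented, and the real formats vary from lab to lab:

```python
# Instruction tuning: pairs of an instruction and a good response for the model to imitate.
instruction_examples = [
    {"instruction": "Translate 'good morning' into French.", "response": "Bonjour."},
    {"instruction": "Summarize this paragraph in one sentence.", "response": "A one-sentence summary."},
]

# RLHF-style preference data: for one prompt, humans mark which answer they prefer.
preference_example = {
    "prompt": "Explain what a transformer is to a 10-year-old.",
    "chosen": "It's like a reading machine that pays attention to the important words in a sentence.",
    "rejected": "A transformer is a neural architecture employing multi-head self-attention layers.",
}
```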

It’s not perfect, though. Even with RLHF and instruction tuning, the model can still produce inaccuracies or sometimes refuse to answer perfectly valid questions. It’s always a balancing act between safety, alignment with human values, and giving as much helpful information as possible. That’s why it can take months before a newly trained model is actually released.

Some advanced models also use techniques like Mixture of Experts, which splits the network into specialized sections that handle different parts of the text or different tasks. I’ve covered it on my channel already if you are interested in learning more about this approach. This can make the model more efficient by activating only certain parts (experts) for a given input instead of the entire network. It’s like having a group of people who each specialize in a different area, so you don’t need everyone working on the same question at once. That can reduce the cost and speed up inference, which is the process of actually running the model when you interact with it.
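Here’s a deliberately oversimplified sketch of that routing idea. In real Mixture-of-Experts models, the router is a small learned network that scores experts per token; here we fake the scoring with keywords just to show that only one expert runs for a given input:

```python
# A toy Mixture-of-Experts router: score each "expert" for the incoming input
# and only run the best match, instead of running the whole network every time.
experts = {
    "code_expert": lambda text: f"[code expert handles: {text}]",
    "math_expert": lambda text: f"[math expert handles: {text}]",
    "chat_expert": lambda text: f"[chat expert handles: {text}]",
}

def route(text):
    # Fake scoring: pick an expert with simple keyword checks instead of learned weights.
    if "def " in text or "import" in text:
        name = "code_expert"
    elif any(ch.isdigit() for ch in text):
        name = "math_expert"
    else:
        name = "chat_expert"
    return experts[name](text)   # only one expert does the work for this input

print(route("import numpy as np"))
print(route("what is 12 * 7?"))
print(route("tell me a joke"))
```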

When we talk about inference, we’re referring to the phase where you, as the user, send an instruction (or a question) to the model and get a response. Training might happen once (or a few times), but inference is happening constantly whenever a user interacts with the model. Efficient inference matters a lot because you don’t want to wait forever for a response or pay a fortune for server costs. That’s why labs spend a lot of time tweaking these models to run as fast and as cheaply as possible without losing too much quality.
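In practice, inference for most developers simply means calling a hosted model through an API. Here’s a minimal sketch using the OpenAI Python client, assuming you’ve installed the openai package and set an API key; the model name is only an example, swap in whichever model you have access to:

```python
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in your environment

client = OpenAI()

# One round of inference: send a prompt, get the model's predicted continuation back.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Explain tokenization in one sentence."}],
)

print(response.choices[0].message.content)
```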

Let’s paint a picture of the whole pipeline in a simple way. You start with a giant pile of internet text (along with other data like images, code, or even audio for some models). You tokenize it, splitting it into manageable pieces and turning each piece into its associated number. You map those tokens to embeddings (our lists of numbers). Then you feed those embeddings through layer after layer of a transformer network that tries to guess the next token, and each time it guesses incorrectly, it adjusts its parameters. Eventually, after seeing tons of examples, the model is good enough to produce text that looks like something a human might write. Then you do a smaller second training phase where you show it specific instructions and human-rated answers, so it gets better at following commands and giving answers we like. Finally, you deploy it so that people can ask it questions via an API or a chat interface. When you talk to ChatGPT, the text you type is converted into tokens, which are turned into embeddings, which the model processes to figure out which tokens are most likely to come next. It sends that predicted text back to you one token at a time, making it look like it’s typing.
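That last “one token at a time” step can be sketched as a simple loop: predict, append, repeat. The predict_next_token function below is a hypothetical stand-in for everything the transformer does internally:

```python
# A sketch of autoregressive generation: the model keeps predicting the next token
# and feeding it back in as context until it decides to stop.
def generate(predict_next_token, prompt_tokens, max_new_tokens=20, stop_token="<end>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)   # the entire transformer hides in here
        if next_token == stop_token:
            break
        tokens.append(next_token)                 # the new token becomes part of the context
    return tokens

# Tiny fake "model" so the sketch actually runs: it just replays a canned reply.
canned = iter(["Hello", "there", "!", "<end>"])
print(generate(lambda toks: next(canned), ["Hi"]))
```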

You might also hear a lot about RAG (Retrieval-Augmented Generation) or prompt engineering or other techniques. That’s because these models, as amazing as they are, still have limitations. They can’t possibly store every single fact in their parameters, and they might struggle with questions about really new topics or specialized knowledge that wasn’t in their training data. Techniques like RAG let us search external sources (like a private database or the internet in real-time) so the model can generate answers using up-to-date information. But that’s a topic for an upcoming lesson of the course.
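Even without the details, the core RAG recipe is easy to sketch: fetch relevant text first, then paste it into the prompt. The search_documents and ask_llm functions below are hypothetical placeholders for your own search index and model call:

```python
# A minimal sketch of Retrieval-Augmented Generation (RAG).
# `search_documents` and `ask_llm` are hypothetical placeholders: plug in your own
# vector database / search index and your own LLM call.
def answer_with_rag(question, search_documents, ask_llm, top_k=3):
    # 1) Retrieve: find the passages most relevant to the question.
    passages = search_documents(question, limit=top_k)

    # 2) Augment: put those passages into the prompt as context.
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3) Generate: let the model answer with the fresh information in front of it.
    return ask_llm(prompt)

# Example with toy stand-ins so the sketch runs:
docs = ["The Eiffel Tower is 330 metres tall.", "Paris is the capital of France."]
fake_search = lambda q, limit: docs[:limit]
fake_llm = lambda prompt: f"(model answers using {len(prompt)} characters of prompt)"
print(answer_with_rag("How tall is the Eiffel Tower?", fake_search, fake_llm))
```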

You should also know that LLMs are not just about language anymore. Some can handle images, audio, or even video. As long as we can transform these types of data into numbers, we are good. This further blurs the lines between different areas of AI, like natural language processing, computer vision, or speech recognition. Nowadays, a single large model can handle multiple tasks that used to require separate, specialized models. This is part of what has people so excited about generative AI and the rapid progress we’re seeing. It’s not just about text. It’s about a general system that can deal with many forms of input and produce many forms of output.

Of course, LLMs can generate misleading content, exhibit biases, or produce harmful text if not handled properly. That’s why labs do safety testing and implement things like content filters, where the model is told to refuse requests that might lead to harmful or unethical outcomes. This doesn’t always work perfectly, and it’s a constant cat-and-mouse game between developers trying to make models safer and users trying to push their limits. If you’re building your own application on top of an LLM, it’s a good idea to have your own filtering in place and to be mindful of the ways users might try to misuse the technology.
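If you do build your own filtering, even a crude pre-check like the sketch below (a tiny hypothetical blocklist, nowhere near production-grade moderation) is better than passing user input straight to the model:

```python
# A deliberately simple input filter: block prompts containing flagged phrases
# before they ever reach the model. Real applications should add proper
# moderation tooling on top of something like this.
BLOCKED_PHRASES = {"how to make a weapon", "steal a credit card"}  # hypothetical examples

def is_allowed(user_prompt: str) -> bool:
    lowered = user_prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

prompt = "Summarize this article for me, please."
if is_allowed(prompt):
    print("Send to the model.")
else:
    print("Politely refuse and log the attempt.")
```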

At the end of the day, the core idea is that LLMs are neural networks that have learned a compressed representation of massive amounts of data. They predict the next token, which might not sound like much, but it’s enough to generate text that can explain, summarize, invent stories, and even write code. Because they rely so heavily on patterns they observed, they can slip up on facts they haven’t seen enough examples of or on questions that require a logical reasoning chain they haven’t quite mastered. That’s why you should always verify important answers, especially for things that require reliability, like medical advice or legal opinions. The technology is evolving fast, though. Every few months, you’ll hear about bigger models, better training methods, or more advanced techniques like instruction tuning that push the boundaries of what these models can do.

We hope this article helped clarify what’s going on behind the scenes and gave you a sense of why people are so excited about building applications with LLMs. If you’re new to Python, don’t be scared off by the idea of neural networks and large language models. You don’t need a PhD in math to start playing around with these systems. You just need some curiosity, a bit of time to explore the documentation and ask LLMs to explain it to you, and the willingness to handle trial-and-error as you figure out what works best for your use case. The biggest leaps happen when people of all backgrounds get creative with these tools and discover new ways to use them. So keep learning, keep tinkering, and have fun exploring the world of LLMs. You might just come up with something that surprises everyone, including you. 

If you found this piece useful, check out our full introduction to Python for Generative AI course here!