Secret ChatGPT Ingredient: How Linear Algebra Powers AI Chatbots

Tereza Tizkova
12 min read · Jan 30, 2023

Hi friends, it's me with another article. This one is a bit lengthy, but fortunately, you can skip the parts you are familiar with (or skip it all, just look at the images and then click "clap", which works for me too. 🙂)

From Wannabe Therapists to Rappers

The famous Turing Test from the 1950s aimed to determine whether a computer program could engage in conversation with humans without them realizing that they were communicating with a machine. In the early days of chatbot technology, machines were nowhere near passing it. For example, ELIZA, a "therapist" chatbot, was developed to illustrate the superficiality of human-machine communication.

ELIZA attempted to simulate a therapist by parroting back to patients what they had typed. Source: https://web.njit.edu/~ronkowit/eliza.html

In November 2022, OpenAI introduced ChatGPT. It broke the internet with its ability to engage in conversations, write rap songs with authentic style and flow, and even pass medical exams almost as well as real doctors.

Articles like "Top 20 use cases of ChatGPT for business" or "How ChatGPT will replace human jobs" have been appearing everywhere, and the hype at this point is almost annoying.

At the same time, I feel that the many interesting articles on ChatGPT's mechanics, principles and underlying math are often overlooked. It can't hurt to add one more, focusing on the role of linear algebra in converting sentences and words into numbers so that ChatGPT (and similar chatbots) can process them.

Prerequisites for this article

This article does not intend to explain all the mechanisms behind ChatGPT to you, because I don't have 100 hours to write.

What this article does explain is how computers can "eat" our language and spit out such amazing responses to our commands. You are probably aware that computers chew data in the form of numbers, not words. We will show how numbers are created from words and sentences, and what mathematics lies behind this process.

Skip this section if you are at least slightly familiar with the principles of training chatbots and have an understanding of what a neural network is.

Indeed, there are tons of videos and articles explaining how ChatGPT works at a more or less sophisticated level.


Word Embeddings

Overall, the mathematics behind GPT involves a combination of linear algebra, optimization, and neural network techniques.

Our focus here is on linear algebra, which forms the foundation of word embeddings, the aspect of chatbot development that I find particularly intriguing. Moreover, the basics of linear algebra are easy to grasp, and you will find them useful not only in natural language processing but also across all areas of mathematics, as well as in engineering, physics, computer science and economics. It allows us to model many natural phenomena and to compute efficiently with such models.

So what are word embeddings?

Word embedding is one of the most important techniques in natural language processing, doing precisely this:

WORDS -> REAL NUMBERS

while being capable of capturing the meaning of a word in a document, its semantic and syntactic similarity to other words, and its relations with them.

As people are often characterized by their closest circle of friends, words can, in some word embedding models, be characterized by the words that appear closest to them. As John Rupert Firth put it,

You shall know a word by the company it keeps.

What are word embeddings exactly? Simply put, they are vector representations of particular words: a technique in Natural Language Processing where words are turned into numbers. The embedding of each word captures its meaning, and similar words should end up with similar embedding values. But how do we assign numbers to words?

Word2vec

We will explain one type of word embedding called Word2vec. With a grain of salt, we can for now say that other types work on similar principles.

Word2vec is one of the most popular techniques for learning word embeddings, using a two-layer neural network. Its input is a text corpus and its output is a set of vectors.
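To get a feel for it before we dig into the math, here is roughly what training Word2vec looks like with the gensim library. This is only a sketch: the toy corpus, the vector size and the other parameters are made up for illustration.

```python
# A minimal Word2vec sketch with gensim (toy corpus, illustrative parameters only)
from gensim.models import Word2Vec

# Tiny, made-up corpus: a list of already tokenized sentences
corpus = [
    ["harry", "casts", "a", "spell"],
    ["hermione", "reads", "a", "book"],
    ["harry", "and", "hermione", "cast", "a", "spell"],
]

# Train a small model: each word becomes a 50-dimensional vector
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["harry"])                          # the 50-dimensional vector for "harry"
print(model.wv.similarity("harry", "hermione"))   # cosine similarity of two word vectors
```

The rest of this article unpacks what is hiding inside those two printed results: how a word becomes a vector, and how similarity between vectors is measured.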

One-hot encoding

First of all, you know you can't feed a word to a neural network as a plain text string, so we need a way to represent words to the network. To do this, we first build a vocabulary of words from our training documents; let's say we have a vocabulary of 10,000 unique words.

We’re going to represent an input word like “ants” as a one-hot vector. This vector will have 10,000 components (one for every word in our vocabulary) and we’ll place a “1” in the position corresponding to the word “ants”, and 0s in all of the other positions.

The output of the network is a single vector (also with 10,000 components) containing, for every word in our vocabulary, the probability that a randomly selected nearby word is that vocabulary word.

One-hot encoding is a method of converting categorical variables (here, simply words) into several binary columns, where a 1 indicates that the row belongs to that category.

The rows corresponding to each character are the vectors that represent them.

This is, pretty obviously, not a great choice for encoding categorical variables from a machine learning perspective.
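Sparse or not, here is what one-hot encoding looks like in plain Python with NumPy. The mini vocabulary below is made up; a real one would have, say, 10,000 entries.

```python
import numpy as np

# A made-up mini vocabulary; in practice this would have thousands of entries
vocab = ["ants", "bees", "cats", "dogs"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's position."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("ants"))  # [1. 0. 0. 0.]
print(one_hot("cats"))  # [0. 0. 1. 0.]
```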

Disclaimer: The Harry Potter characters' names, as you noticed, consist of two words (name + surname). When encoding words into vectors, we first have to tokenize them, i.e. create "tokens" from them. There are various approaches to tokenization (chopping text up into pieces called tokens). In some of them, terms like "Los Angeles", "Harry Potter" or similar words that clearly belong together are treated as a single token. You can read more about tokenization, for example, here. I will often use "word" and "token" interchangeably in this text.
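For intuition, a naive tokenizer can be as simple as a whitespace split; real tokenizers used by GPT-style models split text into subword pieces instead. The sentence below is just an example.

```python
# Naive whitespace tokenization; real tokenizers (e.g. BPE in GPT models) use subword pieces
text = "Harry Potter casts a spell"
tokens = text.lower().split()
print(tokens)  # ['harry', 'potter', 'casts', 'a', 'spell']
```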

This text is NOT about neural networks, but to see the bigger picture, this is what can happen inside a neural network after you create vectors from your words (or tokens, to be precise). You can read more, for example, here.

Beginnings of Word2vec

In 2013, a paper by Tomas Mikolov et al. at Google introducing Word2vec was published. It showed how to create high-quality word vectors from massive data sets containing even billions of words. Unlike many Natural Language Processing systems and techniques at the time, which treated words as atomic units with no notion of similarity, Word2vec aims to capture the relationships between words.

Disclaimer: Word2vec is not used in ChatGPT, but it is often presented as a simple example of a word embedding model. It was first introduced in 2013, so I may say it is by now a classic among word embeddings.

There are many more word embedding models and related techniques, e.g. GloVe, Bag of Words, TF-IDF, ELMo, or transformer-based ones like BERT. ChatGPT (or, to be precise, GPT-3) uses transformer-based word embeddings.

Let's take a step back now.

If you think about it: if we represented every existing word as a vector with tons of zeros and a single 1 in one position, how long would the vectors have to be? Very long.

We have to think of a different approach to expressing words as numbers. I will now show some really easy tricks from a mathematical discipline called linear algebra that help reduce the length of word vectors, so that we can actually use them in deep learning models while still capturing the meaning of words.

Intuition behind linear algebra

Feel free to skip this if you know what linear algebra is and understand its principles.

Linear Algebra is a mathematical discipline that helps define relationships between data points in a vector space.

Why Linear and why Algebra?

Algebra = "Algebra" means, roughly, "relationships". Grade-school algebra explores the relationships between unknown numbers. We can say x + y = 4 without saying exactly what x and y are. Easy.

Linear = In math terms, a function F is linear if scaling inputs scales the output, and adding inputs adds the outputs. Wanna see a really easy example?

If you take a hike up a mountain and the slope is linear (i.e. it is a straight line), then each step up increases your height by the same amount. You can say that your height is a function of the number of steps you have made. More steps = greater height.

On the contrary, if the mountain had a non-linear slope, then in some parts of your hike you could make one step and suddenly be much higher (because the slope is steeper), while in other parts, like valleys, you could make one step and your height would barely change. That is a non-linear relationship between the number of your steps and your height.
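If you prefer numbers to hiking, here is the same idea as a tiny Python check. The two functions are toy examples of my own, not anything used later in the article.

```python
def linear(x):      # a toy linear function: doubling the input doubles the output
    return 2 * x

def non_linear(x):  # a toy non-linear function
    return x ** 2

print(linear(3 + 4) == linear(3) + linear(4))              # True: adding inputs adds outputs
print(non_linear(3 + 4) == non_linear(3) + non_linear(4))  # False: 49 != 9 + 16
```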

Linear Algebra in Word Embeddings

These concepts from linear algebra come in handy in word embeddings. You probably know some of this from high school already, so no worries, understanding the basics is not rocket science.

Vectors

You can usually spot vectors "in nature" as arrows pointing somewhere in the x-y plane. A vector represents a quantity or phenomenon that has two independent properties: magnitude and direction. Examples of vectors in nature are velocity, momentum, force, electromagnetic fields and weight.

Mathematically: an array of numbers.

Geometrically: each of these numbers is a coordinate along one axis, telling you where the vector points. A vector space is something like our physical space, except that we can always tell where its beginning (the origin) is.

Vector space

A vector space is a space of vectors. It is just a collection of vectors: objects which can be added together or scaled by a constant. (That is the sign of linearity we explained earlier.) When you come across a graph with at least two axes and a bunch of arrows pointing in various directions, you are probably looking at a vector space.
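Here is a quick sketch of the two operations that make a vector space tick, adding vectors and scaling them by a constant. The numbers are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0])   # a vector in 2-dimensional space
b = np.array([3.0, -1.0])  # another vector in the same space

print(a + b)    # vector addition:       [4. 1.]
print(2.5 * a)  # scaling by a constant: [2.5 5. ]
```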

Cosine similarity

You probably still remember trigonometry from high school.


Cosine similarity measures how similar two sequences of numbers (e.g. two vectors) are.

Mathematically, cosine similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space. The smaller the angle between the two vectors, the more similar they are to each other.

Cosine similarity formula:

cosine_similarity(A, B) = cos(θ) = (A · B) / (‖A‖ × ‖B‖)

Source: https://www.machinelearningplus.com/nlp/cosine-similarity/
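In code, the same formula is a couple of NumPy lines. The two example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product divided by the product of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, just scaled
c = np.array([-1.0, 0.0, 1.0])  # points in a different direction

print(cosine_similarity(a, b))  # 1.0 -> identical direction, maximally similar
print(cosine_similarity(a, c))  # roughly 0.38 -> much less similar
```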

Matrices

Matrices are like mini-spreadsheets for your math equations. [1] Geometrically, you can visualize an m x n matrix as a table listing m vectors, each with n coordinates (so they "live" in an n-dimensional vector space).

A single number is sometimes called a scalar. A vector is an ordered array of numbers, and a matrix is an array of several vectors. Source

Matrix operations

There are two especially important operations you can do with two matrices. The first is addition, which is done exactly how one would guess: element by element.


The second is multiplication, which is used in word embeddings to reduce the dimension of vectors. The rule is less intuitive: the entry in row i and column j of the product is the dot product of row i of the first matrix with column j of the second.


We will soon show how matrix multiplication is useful in reducing dimensionality of word vectors.
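Here is what both operations look like in NumPy. The matrices are arbitrary examples.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A + B)   # element-wise addition:
# [[ 6  8]
#  [10 12]]

print(A @ B)   # matrix multiplication: rows of A combined with columns of B
# [[19 22]
#  [43 50]]
```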

“Features” of words

Just as a person can be 64% introverted and 36% extroverted, or 100% man and 0% woman, there are tons of "features" that people, objects and words in general can have.

Imagine we set the interval from 0 to 1, think of a list of many, many features (like several hundred) that words can have, and for each word and each feature we rate how much of that feature the word has.

Vectors of different characters show their rating (0 minimum, 1 maximum) in 5 dimensions. Images source

The length of these vectors (i.e. the number of features) is usually called dimensionality in deep learning.

If you think about it, we certainly want a dimensionality big enough to capture a decent amount of concepts. However, we do not want a vector that is too huge either, because it would then become the bottleneck in training the models where these embeddings are used. [3]

Imagine that Harry Potter characters, or anyone else, hypothetically had only two features: introversion and extraversion. If we rated our words only on those, we could draw their vectors in a two-dimensional vector space (two features means two dimensions).

You can easily draw such a vector by putting trait #1 on one axis and trait #2 on the other, and then drawing lines orthogonal to these axes. Their intersection is the endpoint of the vector.

However, two dimensions aren't enough to capture how different people really are. Intuitively, if two people have a similar level of introversion/extraversion, their vectors will have a similar magnitude ("size") and direction.

The problem with five dimensions is that we lose the ability to draw neat little arrows in two dimensions. And we humans are incapable of drawing spaces of dimension higher than 3. But then, how can we tell that two vectors look similar (and hence correspond to similar words)?

Now we can use cosine similarity. If we have two hypothetical people with different scores on 5 different personality traits and we want to compare them to a person named Jay, we can apply the formula I mentioned before.


Cosine similarity works for any number of dimensions. This is another graphical example, with vectors A and B corresponding to two Harry Potter characters.
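A small sketch of this comparison in code, with made-up 5-dimensional "personality" vectors; the names and the numbers are invented purely for illustration.

```python
import numpy as np

# Made-up 5-dimensional trait vectors (values between 0 and 1)
jay     = np.array([0.8, 0.1, 0.6, 0.4, 0.9])
person1 = np.array([0.7, 0.2, 0.5, 0.5, 0.8])   # fairly similar to Jay
person2 = np.array([0.1, 0.9, 0.2, 0.8, 0.1])   # quite different from Jay

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(jay, person1))  # close to 1
print(cosine_similarity(jay, person2))  # noticeably smaller, around 0.4
```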

The "features" we keep talking about are often called weights in the context of neural networks. (Explaining how exactly the weights are computed is out of the scope of this text, but it is a huge subject on its own.)

Conclusion

We have seen two things happening in word embeddings (like Word2vec).

  1. Words are encoded into one-hot vectors (vectors with a 1 at the position of the word and 0s everywhere else).
  2. Words get assigned rating for various features (dimensions).

The crucial step of going from 1 to 2 happens via matrix multiplication.

Matrix multiplication is defined the way it is because it then coincides with the composition of linear transformations. (Recall the example of walking up a linear slope and gaining height at a steady pace with each step.)

Recall the matrix multiplication rule.


Now, if we take the one-hot vector of a word, even if it is very, very long, we can multiply it by a "feature matrix" and get a reasonable result.


A matrix is like a table of vectors.

If you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the “1”. Here’s a small example to give you a visual. [4]

This means that the hidden layer of this model is really just operating as a lookup table. The output of the hidden layer is just the “word vector” for the input word.
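Here is a minimal sketch of that lookup effect, with a much smaller (made-up) vocabulary and embedding matrix so it fits on screen.

```python
import numpy as np

# Made-up embedding matrix: 4 words in the vocabulary, 3 "features" per word
embedding_matrix = np.array([
    [0.1, 0.9, 0.3],   # row 0: vector for "ants"
    [0.8, 0.2, 0.5],   # row 1: vector for "bees"
    [0.4, 0.4, 0.7],   # row 2: vector for "cats"
    [0.6, 0.1, 0.2],   # row 3: vector for "dogs"
])

one_hot_ants = np.array([1, 0, 0, 0])  # one-hot vector for "ants"

# Multiplying the one-hot vector by the matrix just selects row 0
print(one_hot_ants @ embedding_matrix)                                        # [0.1 0.9 0.3]
print(np.array_equal(one_hot_ants @ embedding_matrix, embedding_matrix[0]))   # True
```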

What's next?

Back to Word2vec

I wish the mechanisms explained here were all you need to understand neural networks. In reality, even the Word2vec model involves more, in particular how the network is actually trained. Maybe I will get to that next time.


It is not a prerequisite to understand how everything in generative transformer models works, but it is useful to have an intuition for how neural networks work.

One of my previous articles summarizes the principles of neural networks in layman's terms, including useful resources for learning at the end.

However, there are tons of better resources to learn from, and I have provided some of them at the end of this article.

Natural Language Processing is not an easy topic, so as always, I will appreciate your comments.

Until next time friends!

References
