How do LLMs work?

In this blog, we will look at the inner workings of LLMs.

Data Collection and Pre-processing

LLMs are trained on vast amounts of text from sources such as Wikipedia, news articles, books, and web content. The first step is data collection, and many datasets are publicly available, such as Common Crawl and The Pile; the GPT-3 model, for example, was trained on Common Crawl, WebText, BookCorpus, and Wikipedia. The data then undergoes rigorous processing, including filtering for toxic content, hateful language, and personally identifiable information (PII), along with cleaning and formatting for optimal training.
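
To make this concrete, here is a minimal Python sketch of what a cleaning pass might look like. The regular expression, blocklist term, and sample documents are purely illustrative assumptions; production pipelines rely on trained classifiers and far more extensive rules.

```python
import re

# Hypothetical, highly simplified cleaning pass: real pipelines use trained
# classifiers and much broader rules for toxicity and PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKLIST = {"some_toxic_term"}  # placeholder list of disallowed terms

def clean_document(text):
    """Return a cleaned document, or None if it should be dropped entirely."""
    lowered = text.lower()
    # Drop documents containing blocklisted terms (stand-in for toxicity filtering).
    if any(term in lowered for term in BLOCKLIST):
        return None
    # Redact email addresses as a stand-in for PII removal.
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Normalize whitespace so the text is consistently formatted for training.
    return " ".join(text.split())

docs = ["Contact me at jane.doe@example.com for details.",
        "This document mentions some_toxic_term and is dropped."]
cleaned = [c for c in (clean_document(d) for d in docs) if c is not None]
print(cleaned)  # ['Contact me at [EMAIL] for details.']
```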

Tokenization

The text dataset generated in the previous step is then converted into tokens. Tokenization splits the text into fundamental units such as words, subwords, and punctuation marks. This creates a pool of potential vocabulary items for the model. Common tokenization methods include Byte Pair Encoding (BPE) and WordPiece.
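
To give a feel for how BPE builds its subword vocabulary, here is a toy Python sketch of its core loop: count adjacent symbol pairs across the corpus and repeatedly merge the most frequent pair. The tiny corpus and the number of merges are made up for illustration; real tokenizers learn tens of thousands of merges over huge corpora.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny toy corpus: each word split into characters, with a made-up frequency.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(5):  # learn 5 merges
    pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(vocab, pair)
    print("merged", pair)  # e.g. ('e', 'r') is merged first since it is most frequent
```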

Embeddings

These tokenized units are then converted into numerical representations called embeddings. Embeddings capture each token's semantic meaning and its relationships with other words. Several types of embeddings are possible, including position embeddings, which encode the position of a token in a sentence, token embeddings, which encode the semantic meaning of a token, and segment embeddings, which provide information on which segment (sentence) the token belongs to.
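
Here is a minimal sketch of how token and position embeddings combine into the model's input, using made-up sizes and random vectors; in a real model these tables are learned during training.

```python
import numpy as np

# Illustrative sizes only; real models use vocabularies of tens of thousands
# of tokens and embedding dimensions in the hundreds or thousands.
vocab_size, max_len, d_model = 1000, 16, 8
rng = np.random.default_rng(0)

token_embeddings = rng.normal(size=(vocab_size, d_model))   # one vector per token id
position_embeddings = rng.normal(size=(max_len, d_model))   # one vector per position

token_ids = np.array([12, 7, 431, 5])  # a hypothetical tokenized sentence

# The input to the model is the token embedding plus the embedding of the
# position that token occupies in the sequence.
x = token_embeddings[token_ids] + position_embeddings[: len(token_ids)]
print(x.shape)  # (4, 8): one d_model-dimensional vector per token
```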

Pre-training

Pre-training is the step where the model is trained on these pre-processed datasets, with the primary task of predicting the next word in a sentence. The model is usually trained with language modeling tasks such as masked language modeling and next sentence prediction.

Masked Language Modeling (MLM): Some of the input tokens are randomly masked, and the model tries to predict the original masked words based on context. This helps the model learn semantic and syntactic relationships.

Next Sentence Prediction (NSP): The model is given two sentences and predicts whether the second sentence follows the first in the original text. This helps the model understand continuity and relationships between sentences.
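
As a rough illustration of the masking step used in MLM, here is a simplified Python sketch. Real implementations (e.g., BERT's) use additional rules, such as sometimes replacing the selected token with a random one instead of the mask symbol.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace a fraction of tokens with [MASK]; the originals become labels."""
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels[i] = tok  # the word the model must recover from context
        else:
            masked.append(tok)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens)
print(masked)   # output varies per run; roughly 15% of tokens become [MASK]
print(labels)   # maps each masked position back to the original word
```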

It is during the pre-training phase that the model tries to understand the language and the relationships between words. During this phase, the model learns grammar rules, linguistic patterns, factual information, and reasoning abilities. The pre-training process is usually very expensive and takes a long time to complete.
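
To connect the pieces, here is a deliberately tiny next-word-prediction training step in PyTorch. The "model" is just an embedding table and a linear layer standing in for a full transformer, and the token ids are made up; it only illustrates the shape of the objective, not a real training setup.

```python
import torch
import torch.nn.functional as F

# A deliberately tiny stand-in for an LLM: embedding -> linear "language model head".
# Real models put a deep stack of transformer layers between these two pieces.
vocab_size, d_model = 50, 16
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)
optimizer = torch.optim.Adam(list(embed.parameters()) + list(lm_head.parameters()), lr=1e-3)

# A hypothetical tokenized sentence: the inputs are all tokens but the last,
# and the training targets are the same tokens shifted one position left.
token_ids = torch.tensor([3, 17, 42, 9, 8])
inputs, targets = token_ids[:-1], token_ids[1:]

logits = lm_head(embed(inputs))          # (seq_len, vocab_size) scores over the vocabulary
loss = F.cross_entropy(logits, targets)  # penalize wrong next-token predictions
loss.backward()
optimizer.step()
print(float(loss))
```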

Transfer Learning and Fine-tuning

Pre-training enables us to apply transfer learning. The idea behind transfer learning is to take the knowledge the model learned in the pre-training phase and fine-tune it further for a particular task. One example is instruction-tuned models, which are fine-tuned specifically to follow complex instructions.

Fine-tuning requires much less data than pre-training, as it focuses on adapting the model to a specific task or domain.
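
As a rough sketch, fine-tuning can be pictured as continuing the same training loop on a small, task-specific dataset, usually with a much lower learning rate. The toy model and the "instruction" token ids below are hypothetical stand-ins; real instruction tuning loads the pre-trained weights and trains on curated instruction/response pairs.

```python
import torch
import torch.nn.functional as F

# Continuing the toy model from the pre-training sketch above. In practice,
# fine-tuning starts from the pre-trained weights rather than a fresh model.
vocab_size, d_model = 50, 16
embed = torch.nn.Embedding(vocab_size, d_model)   # stand-in for loaded pre-trained weights
lm_head = torch.nn.Linear(d_model, vocab_size)

# Fine-tuning typically uses a much smaller learning rate than pre-training
# so the model adapts to the task without losing what it already learned.
optimizer = torch.optim.Adam(list(embed.parameters()) + list(lm_head.parameters()), lr=1e-5)

# A few hypothetical instruction/response examples, already tokenized to ids.
instruction_batches = [
    torch.tensor([5, 11, 2, 30, 7]),
    torch.tensor([5, 19, 2, 41, 8]),
]

for token_ids in instruction_batches:
    inputs, targets = token_ids[:-1], token_ids[1:]
    loss = F.cross_entropy(lm_head(embed(inputs)), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```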