print("Total number of characters:", len(text_data))
print(text_data[:99])
```
%% Output
Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no
%% Cell type:markdown id:7b1a1383 tags:
# Pretraining foundation models
This code is slightly adapted from https://github.com/rasbt/LLM-workshop-2024, Section 4. If you want more details on how the entire LM is implemented, you can find almost everything in *utils_gpt2.py* (written by Sebastian Raschka). Alternatively, we could also train foundation models by loading the architecture from e.g. *huggingface*.
This notebook implements the training loop to pretrain a small GPT-2 model using the next-word prediction task.
*Notes*
- We use only a small amount of text data to keep the runtime feasible (the same book as in *text_preprocessing*).
- Focus on following along with the outputs of the code - you *do not have to understand every detail*. If you want more details on how the entire LM is implemented, check *utils_gpt2.py* (model written by Sebastian Raschka).
- We focus on text-generation models because we can evaluate the training process by checking whether the generated texts are coherent, and because it is probably faster.
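To make the next-word prediction objective concrete, here is a minimal, hypothetical sketch (not taken from *utils_gpt2.py*): the model's logits at each position are scored against the sequence shifted by one token, using cross-entropy.

``` python
import torch
import torch.nn.functional as F

# Toy tensors standing in for a model's output (hypothetical values).
vocab_size = 8
logits = torch.randn(2, 4, vocab_size)          # (batch, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (2, 4))  # the next token at each position

# cross_entropy expects (N, vocab_size) logits and (N,) targets,
# so we flatten the batch and sequence dimensions together.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss.item())  # average negative log-likelihood per predicted token
```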
%% Cell type:markdown id:016b074f tags:
## Talking GPU
GPUs are very fast for tensor operations, which speeds up both the gradient computations during training and the forward passes during inference.
Let's define the model architecture for a small GPT-2 model first using a *config* - this object typically stores fixed model parameters required e.g. for initializing the model.
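A minimal way to check for and use a GPU in PyTorch (a standard pattern, not specific to this notebook):

``` python
import torch

# Use a CUDA GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Tensors and models must be moved to the same device before use, e.g.:
x = torch.randn(2, 3).to(device)
```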
%% Cell type:code id:d1df06d1 tags:
``` python
GPT_CONFIG_124M={
    "vocab_size":50257, # Vocabulary size (needs to fit the tokenizer you intend to use)
We define two convenience functions, `text_to_token_ids` and `token_ids_to_text`, for converting between token and text representations. These are similar to our own functions in the *text_preprocessing* section.
- Next, we divide the dataset into a training and a validation set and use **data loaders** (see note below) to prepare the batches for LLM training.
- For visualization purposes, the figure below assumes a `max_length=6`, but for the training loader, we set the `max_length` equal to the context length that the LLM supports.
- The figure below only shows the input tokens for simplicity.
- Since we train the LLM to predict the next word in the text, the targets look the same as the inputs, except that the targets are shifted by one position.

**Data loaders** are functions that provide the data to the model for training and evaluation. They handle loading, shuffling, batching, and transforming data on the fly, and are useful for parallelization and large datasets. Here, the data loader does the tokenization and splits the text into overlapping sequences of the correct length.
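A self-contained toy example (plain Python with made-up settings, not the actual loader) of how such overlapping input/target pairs look:

``` python
token_ids = list(range(10))  # stand-ins for real token IDs
max_length, stride = 4, 2    # hypothetical loader settings

inputs, targets = [], []
for i in range(0, len(token_ids) - max_length, stride):
    inputs.append(token_ids[i:i + max_length])
    targets.append(token_ids[i + 1:i + max_length + 1])  # shifted by one position

print(inputs[0], "->", targets[0])  # [0, 1, 2, 3] -> [1, 2, 3, 4]
print(inputs[1], "->", targets[1])  # [2, 3, 4, 5] -> [3, 4, 5, 6]
```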
- Looking at the results above, we can see that the model starts out generating incomprehensible strings of words, whereas towards the end it is able to produce grammatically more or less correct sentences.
- However, based on the training and validation set losses, we can see that the model starts overfitting.
- If we were to check a few passages it writes towards the end, we would find that they are contained in the training set verbatim -- it simply memorizes the training data.
- Note that the **overfitting** here occurs because we have a very, very small training set, and we iterate over it so many times.
- There are decoding strategies (not covered in this workshop) that can mitigate this memorization to a certain degree.
- The LLM training here primarily serves educational purposes; we mainly want to see that the model can learn to produce coherent text
- Instead of spending weeks or months on training this model on vast amounts of expensive hardware, we load pretrained weights later
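For illustration, one such decoding strategy is top-k sampling with temperature - a hypothetical sketch, not the decoding used in this notebook:

``` python
import torch

def sample_next_token(logits, k=5, temperature=1.0):
    # Keep only the k most likely tokens, rescale by temperature, and sample
    # from the resulting distribution instead of always taking the argmax.
    top_logits, top_idx = torch.topk(logits, k)
    probs = torch.softmax(top_logits / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice].item()

torch.manual_seed(123)
logits = torch.tensor([1.0, 3.0, 0.5, 2.5, -1.0, 0.0])  # toy vocabulary of 6 tokens
print(sample_next_token(logits))  # one of the top-5 token indices, not always argmax
```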