Commit 3cdea70c authored by Erik Senn's avatar Erik Senn

Replace 4_foundation_models_and_training.ipynb

parent 0748abb6
%% Cell type:markdown id:fd1adaef tags:

# Setup and data

GPU required? Yes

%% Cell type:code id:4b271769 tags:

``` python
# Imports (note that you also need imports from the .py function files)
import numpy as np
import torch  # PyTorch / ML tool
import tiktoken  # tokenizer for GPT2
import matplotlib

# from utils import *  # ensure to have this file in the same directory
from utils_gpt2 import (
    plot_losses,
    calc_loss_batch,
    evaluate_model,
    generate_and_print_sample,
    create_dataloader_v1,
    GPTModel,
    calc_loss_loader,
    # generate_text_simple, # this function we also define in this notebook
)
```

%% Cell type:code id:2296a5a9 tags:

``` python
# Data
datapath = "../data/"

with open(datapath + "the-verdict.txt", "r", encoding="utf-8") as f:
    text_data = f.read()

print("Total number of characters:", len(text_data))
print(text_data[:99])
```

%% Output

    Total number of characters: 20479
    I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no

%% Cell type:markdown id:7b1a1383 tags:

# Foundation models and pretraining foundation models
This code is slightly adapted from https://github.com/rasbt/LLM-workshop-2024, Section 4. If you want more details on how the entire LM is implemented, you can find almost everything in *utils_gpt2.py* (written by Sebastian Raschka). Alternatively, we could also train foundation models by loading the architecture from e.g. *huggingface*.

This notebook implements the training loop to pretrain a small GPT2-model using the next-word prediction task.

*Notes*
- We use only a small amount of text data to keep the runtime feasible (the same book as in *text_preprocessing*).
- Focus on following along with the outputs of the code - you *do not have to understand every detail*. If you want more details on how the entire LM is implemented, check *utils_gpt2.py* (model written by Sebastian Raschka).
- We focus on text-generation models because we can evaluate the training process by checking whether the generated texts are coherent, and because training them is comparatively fast.

%% Cell type:markdown id:016b074f tags:

## Talking GPU

GPUs are very fast at tensor operations, which dominate both the forward pass and the gradient computations during training.

How much GPU memory do you need?
- The model needs to fit (parameters, gradients, and optimizer state)
- Plus the activations of the current training batch
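
%% Cell type:markdown id:gpu-mem-note tags:

As a back-of-the-envelope sketch of the first point: for float32 training with AdamW, each parameter needs memory for its weight, its gradient, and two optimizer moment estimates. (The parameter count used below is the one printed for our model later in this notebook; activations come on top and depend on batch size and context length.)

%% Cell type:code id:gpu-mem-estimate tags:

``` python
# Rough GPU memory needed just for the training state, in float32
# (activations are ignored here; they depend on batch size and context length)
n_params = 162_419_712               # trainable parameters of our model
bytes_per_param = 4                  # float32

weights = n_params * bytes_per_param
gradients = n_params * bytes_per_param          # one gradient per parameter
adamw_state = 2 * n_params * bytes_per_param    # AdamW keeps two moment estimates

total_gb = (weights + gradients + adamw_state) / 1024**3
print(f"Training state (excl. activations): {total_gb:.2f} GB")
```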

%% Cell type:markdown id:66dd524e-864c-4012-b0a2-ccfc56e80024 tags:

## Model architecture

%% Cell type:markdown id:d20acff1 tags:

Let's define the model architecture for a small GPT2 model first using a *config* - this object typically stores fixed model parameters required e.g. for initializing the model.

%% Cell type:code id:d1df06d1 tags:

``` python
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size (needs to fit the tokenizer you intend to use)
    "context_length": 256,  # Shortened context length (orig: 1024)
    "emb_dim": 768,  # Embedding dimension
    "n_heads": 12,  # Number of attention heads
    "n_layers": 12,  # Number of layers
    "drop_rate": 0.1,  # Dropout rate
    "qkv_bias": False,  # Query-key-value bias (intercept terms for the self-attention projections)
}
```
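
%% Cell type:markdown id:config-derived-note tags:

Two quantities implied by this config are worth keeping in mind (a small self-contained sketch; the relevant config values are repeated inline):

%% Cell type:code id:config-derived tags:

``` python
# Quantities implied by the config (dict repeated inline so the snippet is self-contained)
cfg = {"vocab_size": 50257, "context_length": 256, "emb_dim": 768, "n_heads": 12}

head_dim = cfg["emb_dim"] // cfg["n_heads"]            # dimensions per attention head
token_emb_params = cfg["vocab_size"] * cfg["emb_dim"]  # size of the token embedding matrix

print("Per-head dimension:", head_dim)
print("Token embedding parameters:", token_emb_params)
```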

%% Cell type:code id:4f986a0d tags:

``` python
# Manual parameter count (the layer layout assumed here follows utils_gpt2.py)
cfg = GPT_CONFIG_124M
emb = cfg["emb_dim"]

token_embedding = cfg["vocab_size"] * emb         # 38,597,376
position_embedding = cfg["context_length"] * emb  # 196,608

# One transformer block
self_attention = 3 * emb * emb        # Q, K, V projections (qkv_bias=False, so no bias terms)
attention_out = emb * emb + emb       # output projection (with bias)
feed_forward = (emb * 4 * emb + 4 * emb) + (4 * emb * emb + emb)  # two linear layers with biases
layer_norms = 2 * 2 * emb             # two LayerNorms (scale and shift each)
block = self_attention + attention_out + feed_forward + layer_norms

final_norm = 2 * emb
output_head = emb * cfg["vocab_size"]  # no bias; weights not tied to the token embedding here

total = token_embedding + position_embedding + cfg["n_layers"] * block + final_norm + output_head
print("Manual parameter count:", "{:,}".format(total))
```

%% Output

    Manual parameter count: 162,419,712

%% Cell type:code id:930f215b tags:

``` python
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()  # sets model to evaluation mode; disables dropout during inference

model_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
model_count = "{:,}".format(model_count)
print("Trainable parameters: ", model_count)
# with qkv_bias=True the count would be 162,447,360 (difference: 27,648 = 12 layers * 3 * 768 bias terms)
```

%% Output

    Trainable parameters:  162,419,712

%% Cell type:markdown id:f51c5c59 tags:

### Task*
As always:
- Explain the trainable parameter count of the model (see notebook 5 for an example of how to count the parameters of a GPT-2 model).
- What are its input and output dimensions?
- Do you understand each component in the model?

%% Cell type:markdown id:dbfee5e1 tags:

## Text processing

We define two convenience functions, `text_to_token_ids` and `token_ids_to_text`, for converting between token and text representations. These are similar to our own functions in the *text_preprocessing* section.

%% Cell type:code id:5e062b82-3540-48ce-8eb4-009686d0d16c tags:

``` python
# text processing convenience functions
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor


def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())
```
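
%% Cell type:markdown id:roundtrip-note tags:

A quick round-trip check of these two helpers. To keep the snippet self-contained, the helpers are repeated and a toy character-level tokenizer stands in for tiktoken's GPT-2 encoding (the class and its methods mimic the `encode`/`decode` interface):

%% Cell type:code id:roundtrip-check tags:

``` python
import torch


class ToyTokenizer:
    """Character-level stand-in for tiktoken's GPT-2 encoding (same interface)."""

    def encode(self, text, allowed_special=None):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)  # add batch dimension


def token_ids_to_text(token_ids, tokenizer):
    return tokenizer.decode(token_ids.squeeze(0).tolist())


tok = ToyTokenizer()
ids = text_to_token_ids("Hello!", tok)
print(ids.shape)                    # torch.Size([1, 6]) -- batch dimension added
print(token_ids_to_text(ids, tok))  # Hello!
```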

%% Cell type:markdown id:1773068a tags:

The next function, *generate_text_simple*, will be used to **generate text** given:
- a *model* that outputs (batch, n_tokens, vocab_size) logits (which can be transformed into a probability for each token in the vocabulary)
- a start sequence of token ids *idx*
- the number of tokens to generate
- a maximum context size = the maximum length of a sequence

### Task*
- Think (or try) yourself: How would you implement next-word prediction given the inputs above?
- Go through the function below and compare to your own idea.

%% Cell type:markdown id:741881f3-cee0-49ad-b11d-b9df3b3ac234 tags:

<img src="figures/02.png" width=1200px>

%% Cell type:code id:ee7c2a1e tags:

``` python
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):  # iterate for each new token to generate

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions for the input sequence
        with torch.no_grad():
            logits = model(
                idx_cond
            )  # most models do not include the softmax at the end, but output the "logits" = inputs to final softmax.

        # Focus only on the last time step:
        # The contextualized embedding of last token contains the most information from previous tokens (causal self-attention)
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        # Alternatively, one could also sample from probas instead of taking the highest-probability token (sampling is in fact common in practice).
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence.
        # This way, the newly generated token can be used as context in the next iteration.
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx
```
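
%% Cell type:markdown id:greedy-mini-note tags:

The core of this greedy loop can be checked in miniature. The stand-in "model" below (a hypothetical toy, not a real LM) always assigns the highest logit to token id 7, so every generation step should append a 7:

%% Cell type:code id:greedy-mini tags:

``` python
import torch


# Stand-in "model": logits of shape (batch, n_tokens, vocab_size) with token 7 always winning argmax
def toy_model(idx):
    batch, n_tokens = idx.shape
    logits = torch.zeros(batch, n_tokens, 10)  # toy vocabulary of 10 tokens
    logits[:, :, 7] = 1.0
    return logits


idx = torch.tensor([[1, 2, 3]])
for _ in range(4):                                                 # max_new_tokens = 4
    logits = toy_model(idx[:, -5:])                                # context_size = 5
    next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True) # greedy pick at last position
    idx = torch.cat((idx, next_id), dim=1)                         # append to running sequence

print(idx.tolist())  # [[1, 2, 3, 7, 7, 7, 7]]
```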

%% Cell type:markdown id:f70a9fb5 tags:

## Generate text (untrained model)

%% Cell type:markdown id:85cbfe04 tags:

Now, we can use the (not trained) LM to generate text!

%% Cell type:code id:6516f757-849c-468f-88f7-28ac9debf6be tags:

``` python
start_context = "Once upon a time"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

%% Output

    Output text:
     Once upon a time Wisdom StoresLtم refres RexAngel Cleveland expression diligent

%% Cell type:markdown id:e4d3249b-b2a0-44c4-b589-ae4b403b8305 tags:

- As we can see above, the model does not produce good text because it has not been trained yet
- How do we measure or capture what "good text" is, in a numeric form, to track it during training?
- The next subsection introduces a loss metric for the generated outputs that we can use to measure the training progress
- The next chapters on finetuning LLMs will also introduce additional ways to measure model quality

%% Cell type:markdown id:955f9e1a-7bf7-40d8-b1fa-eacabdee8d8e tags:

<br>

%% Cell type:markdown id:2ec6c217-e429-40c7-ad71-5d0a9da8e487 tags:

## Preprocessing the text dataset

%% Cell type:markdown id:379330f1-80f4-4e34-8724-41d892b04cee tags:

Let's quickly check our dataset again:

%% Cell type:code id:6kgJbe4ehI4q tags:

``` python
# First 100 characters
print(text_data[:99])

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print("Characters:", total_characters)
print("Tokens:", total_tokens)
```

%% Output

    I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no
    Characters: 20479
    Tokens: 5145

%% Cell type:markdown id:a8830cb9-90f6-4e7c-8620-beeabc2d39f7 tags:

With 5,145 tokens, the text is very short for training an LLM, but again, it's for educational purposes.

%% Cell type:markdown id:bedcad87-a0e8-4b9d-ac43-4e927ccbb50f tags:

- Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training
- For visualization purposes, the figure below assumes a `max_length=6`, but for the training loader, we set the `max_length` equal to the context length that the LLM supports
- The figure below only shows the input tokens for simplicity
    - Since we train the LLM to predict the next word in the text, the targets look the same as these inputs, except that the targets are shifted by one position

%% Cell type:markdown id:46bdaa07-ba96-4ac1-9d71-b3cc153910d9 tags:

*Note*:

**Data loaders** are functions that provide the data to the model for training and evaluation. They do loading, shuffling, batching and transforming of data on the fly, and are useful for parallelization and large datasets.
Here, the data loader does the tokenization and splits the text into overlapping sequences of the correct length.

<img src="figures/03.png" width=1500px>
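
%% Cell type:markdown id:stride-sketch-note tags:

What the data loader does can be sketched on a toy list of token ids: slide a window of `max_length` over the sequence with step `stride`, and shift the targets by one position:

%% Cell type:code id:stride-sketch tags:

``` python
# Sliding-window sketch of the data loader (toy token ids instead of real text)
token_ids = list(range(10))   # pretend these came from the tokenizer
max_length, stride = 4, 4     # stride == max_length -> no overlap between inputs

inputs, targets = [], []
for i in range(0, len(token_ids) - max_length, stride):
    inputs.append(token_ids[i : i + max_length])
    targets.append(token_ids[i + 1 : i + max_length + 1])  # shifted by one

print(inputs)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(targets)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```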

%% Cell type:code id:0959c855-f860-4358-8b98-bc654f047578 tags:

``` python
# Train/validation ratio
train_ratio = 0.80
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

# Data loader
# data loaders are functions that provide the data to the model for training and evaluation
# they do loading, shuffling, batching and transforming data on the fly
# and are useful for parallelization and large datasets
# Here, the data loader does the tokenization and splits the text into overlapping sequences of the correct length.
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,  # small batch size to reduce computing demand
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0,
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0,
)

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)
```

%% Output

    Train loader:
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    
    Validation loader:
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])

%% Cell type:markdown id:5c3085e8-665e-48eb-bb41-cdde61537e06 tags:

- Next, let's calculate the initial loss before we start training

%% Cell type:markdown id:f0691332-84d0-48b3-b462-a885ddeb4fca tags:

- If you have a machine with a CUDA-supported GPU, the LLM will train on the GPU without making any changes to the code
- Via the `device` setting, we ensure that the data is loaded onto the same device as the LLM model

%% Cell type:code id:56f5b0c9-1065-4d67-98b9-010e42fc1e2a tags:

``` python

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(
    device
)  # no assignment model = model.to(device) necessary for nn.Module classes


torch.manual_seed(123)  # For reproducibility due to the shuffling in the data loader

with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

%% Output

    Training loss: 10.98758347829183
    Validation loss: 10.98110580444336
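
%% Cell type:markdown id:init-loss-note tags:

These initial losses are close to what we would expect from a randomly initialized model: it predicts roughly uniformly over the 50,257-token vocabulary, so the expected cross-entropy is about ln(vocab_size):

%% Cell type:code id:init-loss-check tags:

``` python
import math

# Expected cross-entropy of a uniform prediction over the vocabulary
expected_initial_loss = math.log(50257)
print(round(expected_initial_loss, 2))  # 10.82 -- close to the ~10.99 measured above
```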

%% Cell type:markdown id:b9339f8d-00cb-4206-af67-58c32bd72055 tags:

## Training on next word prediction

%% Cell type:markdown id:268d8aaa tags:

- In this section, we finally implement the code for training the LLM

<img src="figures/04.png" width=700px>
We set up the next word prediction task and training data using the autoregressive structure from the lecture.

%% Cell type:code id:Mtp4gY0ZO-qq tags:

``` python
def train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs,
    eval_freq,
    eval_iter,
    start_context,
    tokenizer,
):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(
                    f"Ep {epoch+1} (Step {global_step:06d}): "
                    f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}"
                )

        # Print a sample text after each epoch
        generate_and_print_sample(model, tokenizer, device, start_context)

    return train_losses, val_losses, track_tokens_seen
```

%% Cell type:markdown id:a301b333-b9d4-4eeb-a212-3a9874e3ac47 tags:

- Now, let's train the LLM using the training function defined above:

%% Cell type:code id:3422000b-7aa2-485b-92df-99372cd22311 tags:

``` python
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs=num_epochs,
    eval_freq=5,
    eval_iter=5,
    start_context="Every effort moves you",
    tokenizer=tokenizer,
)
```

%% Output

    Ep 1 (Step 000000): Train loss 9.783, Val loss 9.927
    Ep 1 (Step 000005): Train loss 7.985, Val loss 8.335
    Every effort moves you,,,,,,,,,,,,.

%% Cell type:markdown id:a5b8b19a tags:

You can save your model weights (and load them later if you want).

%% Cell type:code id:139885c4-40ed-4765-b307-511d5a967fcd tags:

``` python
torch.save(model.state_dict(), "model.pth")
```
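
%% Cell type:markdown id:state-dict-note tags:

What gets saved here is the model's *state_dict*: a mapping from parameter names to tensors. A minimal illustration on a tiny stand-in module (the GPT model's state_dict works the same way, just with many more entries):

%% Cell type:code id:state-dict-peek tags:

``` python
import torch

# A state_dict maps parameter names to tensors
tiny = torch.nn.Linear(3, 2)
sd = tiny.state_dict()
print(list(sd.keys()))      # ['weight', 'bias']
print(sd["weight"].shape)   # torch.Size([2, 3])
```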

%% Cell type:code id:0WSRu2i0iHJE tags:

``` python
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```

%% Cell type:markdown id:8bc83ded-5f80-4e1c-bf4d-ccb59999d995 tags:

- Looking at the results above, we can see that the model starts out generating incomprehensible strings of words, whereas towards the end, it is able to produce grammatically more or less correct sentences
- However, based on the training and validation set losses, we can see that the model starts overfitting
- If we were to check a few passages it writes towards the end, we would find that they are contained in the training set verbatim -- it simply memorizes the training data

- Also note that the **overfitting** here occurs because we have a very, very small training set, and we iterate over it so many times
- There are decoding strategies (not covered in this workshop) that can mitigate this memorization to a certain degree
  - The LLM training here primarily serves educational purposes; we mainly want to see that the model can learn to produce coherent text
  - Instead of spending weeks or months training this model on vast amounts of expensive hardware, we load pretrained weights later
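
One such decoding strategy is temperature sampling: dividing the logits by a temperature before the softmax and then sampling, instead of always taking the argmax as `generate_text_simple` does. A minimal sketch on toy logits (the values and temperature are illustrative, not from our model):

%% Cell type:code id:temperature-sketch tags:

``` python
import torch

torch.manual_seed(123)
logits = torch.tensor([2.0, 1.0, 0.1])              # toy logits for a 3-token vocabulary
probas = torch.softmax(logits / 1.5, dim=-1)        # temperature = 1.5 flattens the distribution
next_id = torch.multinomial(probas, num_samples=1)  # sample instead of argmax

print(probas.tolist())
print(next_id.item())
```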

%% Cell type:markdown id:c58ebc3a-34d1-4efe-94a0-ef5bec732162 tags:

<br>
<br>
<br>
<br>



# Exercise 1: Generate text from the pretrained LLM

%% Cell type:markdown id:b25558c3-a4f4-48de-a18e-ed63ff9ee02a tags:

- Use the model to generate new text (HINT: scroll up to see how we generated text before)

%% Cell type:markdown id:1d62ff8c-78ea-47fa-b02d-9313531cb4df tags:

<br>
<br>
<br>
<br>



# Exercise 2: Load the pretrained model in a new session

%% Cell type:markdown id:3a62addc-41ed-4853-8aec-365ef4611f79 tags:

- Open a new Python session or Jupyter notebook and load the model there

%% Cell type:markdown id:7f4b25e3-d1aa-4559-897c-36588bba2057 tags:

<br>
<br>
<br>
<br>

# Exercise 3 (Optional): Train the LLM on your own favorite texts

%% Cell type:markdown id:11f349d5-35e4-4502-8b86-ab57b5ca2f0c tags:

<br>
<br>
<br>
<br>

%% Cell type:markdown id:c4f1f2c8-4524-4323-a9c0-9fd15b01a5d1 tags:

# Solution to Exercise 1

%% Cell type:code id:f564c82a-49f7-46da-ad78-b9cb846eb5e3 tags:

``` python
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer).to(device),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

%% Cell type:markdown id:1d62ff8c-78ea-47fa-b02d-9313531cb4df tags:

## Task

- Load the pretrained model **in a new session** (or delete the model).

*Note*: This is how you would do it e.g. for pretrained weights.

%% Cell type:markdown id:b64b3b1f-c8d3-4755-a926-dc86eeae0ba0 tags:

<br>
<br>
<br>
<br>

%% Cell type:markdown id:06640a19-514c-47d1-8744-bdaeadd5c083 tags:

# Solution to Exercise 2
## Example Solution

%% Cell type:code id:e2a85852-7f7f-449a-8993-230f8d82abf5 tags:

``` python
import torch

from utils_gpt2 import GPTModel  # import from the local file

del model  # if you did not reset the session
```

%% Cell type:code id:a998656c-3615-4673-a9f9-c8eefb6b6611 tags:

``` python
model = GPTModel(GPT_CONFIG_124M)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()  # disable dropout
model.to(device)

# test for prediction
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer).to(device),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

%% Cell type:markdown id:7f4b25e3-d1aa-4559-897c-36588bba2057 tags:

## Task*

Train the LLM on your own favorite texts!