Commit 3cdea70c authored by Erik Senn's avatar Erik Senn

Replace 4_foundation_models_and_training.ipynb

parent 0748abb6
%% Cell type:markdown id:fd1adaef tags:

# Setup and data

GPU required? Yes

%% Cell type:code id:4b271769 tags:

``` python
# Imports (note that you also need imports from the .py function files)
import numpy as np
import torch  # PyTorch / ML tool
import tiktoken  # tokenizer for GPT2
import matplotlib

# from utils import *  # ensure to have this file in the same directory
from utils_gpt2 import (
    plot_losses,
    calc_loss_batch,
    evaluate_model,
    generate_and_print_sample,
    create_dataloader_v1,
    GPTModel,
    calc_loss_loader,
    # generate_text_simple, # this function we also define in this notebook
)
```

%% Cell type:code id:2296a5a9 tags:

``` python
# Data
datapath = "../data/"

with open(datapath + "the-verdict.txt", "r", encoding="utf-8") as f:
    text_data = f.read()

print("Total number of characters:", len(text_data))
print(text_data[:99])
```

%% Output

    Total number of characters: 20479
    I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no

%% Cell type:markdown id:7b1a1383 tags:

# Foundation models and pretraining foundation models
This code is slightly adapted from https://github.com/rasbt/LLM-workshop-2024, Section 4. If you want more details on how the entire LM is implemented, you can find almost everything in *utils_gpt2.py* (written by Sebastian Raschka). Alternatively, we could also train foundation models by loading the architecture from e.g. *huggingface*.

This notebook implements the training loop to pretrain a small GPT2-model using the next-word prediction task.

*Notes*
- We use only a small amount of text data to keep the runtime feasible (the same book as in *text_preprocessing*).
- Focus on following along with the outputs of the code - you *do not have to understand every detail*. If you want more details on how the entire LM is implemented, check *utils_gpt2.py* (model written by Sebastian Raschka).
- We focus on text-generation models because we can evaluate the training process by checking whether the generated texts are coherent, and because training them is comparatively fast.

%% Cell type:markdown id:016b074f tags:

## Talking GPU

GPUs are very fast at tensor operations, which dominate both the forward pass and the gradient computations during training.

How much GPU memory do you need?
- The model needs to fit (parameters, gradients, and optimizer state)
- Plus the activations of the current training batch
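
%% Cell type:markdown id:gpu-mem-note tags:

As a back-of-the-envelope sketch of the first point: for float32 training with AdamW, each parameter needs memory for its weight, its gradient, and two optimizer moment estimates. (The parameter count used below is the one printed for our model later in this notebook; activations come on top and depend on batch size and context length.)

%% Cell type:code id:gpu-mem-estimate tags:

``` python
# Rough GPU memory needed just for the training state, in float32
# (activations are ignored here; they depend on batch size and context length)
n_params = 162_419_712               # trainable parameters of our model
bytes_per_param = 4                  # float32

weights = n_params * bytes_per_param
gradients = n_params * bytes_per_param          # one gradient per parameter
adamw_state = 2 * n_params * bytes_per_param    # AdamW keeps two moment estimates

total_gb = (weights + gradients + adamw_state) / 1024**3
print(f"Training state (excl. activations): {total_gb:.2f} GB")
```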

%% Cell type:markdown id:66dd524e-864c-4012-b0a2-ccfc56e80024 tags:

## Model architecture

%% Cell type:markdown id:d20acff1 tags:

Let's define the model architecture for a small GPT2 model first using a *config* - this object typically stores fixed model parameters required e.g. for initializing the model.

%% Cell type:code id:d1df06d1 tags:

``` python
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size (needs to fit the tokenizer you intend to use)
    "context_length": 256,  # Shortened context length (orig: 1024)
    "emb_dim": 768,  # Embedding dimension
    "n_heads": 12,  # Number of attention heads
    "n_layers": 12,  # Number of layers
    "drop_rate": 0.1,  # Dropout rate
    "qkv_bias": False,  # Query-key-value bias (intercept terms for the self-attention projections)
}
```
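
%% Cell type:markdown id:config-derived-note tags:

Two quantities implied by this config are worth keeping in mind (a small self-contained sketch; the relevant config values are repeated inline):

%% Cell type:code id:config-derived tags:

``` python
# Quantities implied by the config (dict repeated inline so the snippet is self-contained)
cfg = {"vocab_size": 50257, "context_length": 256, "emb_dim": 768, "n_heads": 12}

head_dim = cfg["emb_dim"] // cfg["n_heads"]            # dimensions per attention head
token_emb_params = cfg["vocab_size"] * cfg["emb_dim"]  # size of the token embedding matrix

print("Per-head dimension:", head_dim)
print("Token embedding parameters:", token_emb_params)
```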

%% Cell type:code id:4f986a0d tags:

``` python
# Manual parameter count (the layer layout assumed here follows utils_gpt2.py)
cfg = GPT_CONFIG_124M
emb = cfg["emb_dim"]

token_embedding = cfg["vocab_size"] * emb         # 38,597,376
position_embedding = cfg["context_length"] * emb  # 196,608

# One transformer block
self_attention = 3 * emb * emb        # Q, K, V projections (qkv_bias=False, so no bias terms)
attention_out = emb * emb + emb       # output projection (with bias)
feed_forward = (emb * 4 * emb + 4 * emb) + (4 * emb * emb + emb)  # two linear layers with biases
layer_norms = 2 * 2 * emb             # two LayerNorms (scale and shift each)
block = self_attention + attention_out + feed_forward + layer_norms

final_norm = 2 * emb
output_head = emb * cfg["vocab_size"]  # no bias; weights not tied to the token embedding here

total = token_embedding + position_embedding + cfg["n_layers"] * block + final_norm + output_head
print("Manual parameter count:", "{:,}".format(total))
```

%% Output

    Manual parameter count: 162,419,712

%% Cell type:code id:930f215b tags:

``` python
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()  # sets model to evaluation mode; disables dropout during inference

model_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
model_count = "{:,}".format(model_count)
print("Trainable parameters: ", model_count)
# with qkv_bias=True the count would be 162,447,360 (difference: 27,648 = 12 layers * 3 * 768 bias terms)
```

%% Output

    Trainable parameters:  162,419,712

%% Cell type:markdown id:f51c5c59 tags:

### Task*
As always:
- Explain the trainable parameter count of the model (see notebook 5 for an example of how to count the parameters of a GPT-2 model).
- What are its input and output dimensions?
- Do you understand each component in the model?

%% Cell type:markdown id:dbfee5e1 tags:

## Text processing

We define two convenience functions, `text_to_token_ids` and `token_ids_to_text`, for converting between token and text representations. These are similar to our own functions in the *text_preprocessing* section.

%% Cell type:code id:5e062b82-3540-48ce-8eb4-009686d0d16c tags:

``` python
# text processing convenience functions
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor


def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())
```
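
%% Cell type:markdown id:roundtrip-note tags:

A quick round-trip check of these two helpers. To keep the snippet self-contained, the helpers are repeated and a toy character-level tokenizer stands in for tiktoken's GPT-2 encoding (the class and its methods mimic the `encode`/`decode` interface):

%% Cell type:code id:roundtrip-check tags:

``` python
import torch


class ToyTokenizer:
    """Character-level stand-in for tiktoken's GPT-2 encoding (same interface)."""

    def encode(self, text, allowed_special=None):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)  # add batch dimension


def token_ids_to_text(token_ids, tokenizer):
    return tokenizer.decode(token_ids.squeeze(0).tolist())


tok = ToyTokenizer()
ids = text_to_token_ids("Hello!", tok)
print(ids.shape)                    # torch.Size([1, 6]) -- batch dimension added
print(token_ids_to_text(ids, tok))  # Hello!
```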

%% Cell type:markdown id:1773068a tags:

The next function, *generate_text_simple*, will be used to **generate text** given:
- a *model* that outputs (batch, n_tokens, vocab_size) logits (which can be transformed into a probability for each token in the vocabulary)
- a start sequence of token ids *idx*
- the number of tokens to generate
- a maximum context size = the maximum length of a sequence

### Task*
- Think (or try) yourself: How would you implement next-word prediction given the inputs above?
- Go through the function below and compare to your own idea.

%% Cell type:markdown id:741881f3-cee0-49ad-b11d-b9df3b3ac234 tags:

<img src="figures/02.png" width=1200px>

%% Cell type:code id:ee7c2a1e tags:

``` python
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):  # iterate for each new token to generate

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions for the input sequence
        with torch.no_grad():
            logits = model(
                idx_cond
            )  # most models do not include the softmax at the end, but output the "logits" = inputs to final softmax.

        # Focus only on the last time step:
        # The contextualized embedding of last token contains the most information from previous tokens (causal self-attention)
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        # Alternatively, one could also sample from probas instead of taking the highest-probability token (sampling is in fact common in practice).
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence.
        # This way, the newly generated token can be used as context in the next iteration.
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx
```
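
%% Cell type:markdown id:greedy-mini-note tags:

The core of this greedy loop can be checked in miniature. The stand-in "model" below (a hypothetical toy, not a real LM) always assigns the highest logit to token id 7, so every generation step should append a 7:

%% Cell type:code id:greedy-mini tags:

``` python
import torch


# Stand-in "model": logits of shape (batch, n_tokens, vocab_size) with token 7 always winning argmax
def toy_model(idx):
    batch, n_tokens = idx.shape
    logits = torch.zeros(batch, n_tokens, 10)  # toy vocabulary of 10 tokens
    logits[:, :, 7] = 1.0
    return logits


idx = torch.tensor([[1, 2, 3]])
for _ in range(4):                                                 # max_new_tokens = 4
    logits = toy_model(idx[:, -5:])                                # context_size = 5
    next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True) # greedy pick at last position
    idx = torch.cat((idx, next_id), dim=1)                         # append to running sequence

print(idx.tolist())  # [[1, 2, 3, 7, 7, 7, 7]]
```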

%% Cell type:markdown id:f70a9fb5 tags:

## Generate text (untrained model)

%% Cell type:markdown id:85cbfe04 tags:

Now, we can use the (not trained) LM to generate text!

%% Cell type:code id:6516f757-849c-468f-88f7-28ac9debf6be tags:

``` python
start_context = "Once upon a time"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

%% Output

    Output text:
     Once upon a time Wisdom StoresLtم refres RexAngel Cleveland expression diligent

%% Cell type:markdown id:e4d3249b-b2a0-44c4-b589-ae4b403b8305 tags:

- As we can see above, the model does not produce good text because it has not been trained yet
- How do we measure or capture what "good text" is, in a numeric form, to track it during training?
- The next subsection introduces a loss metric for the generated outputs that we can use to measure the training progress
- The next chapters on finetuning LLMs will also introduce additional ways to measure model quality

%% Cell type:markdown id:955f9e1a-7bf7-40d8-b1fa-eacabdee8d8e tags:

<br>

%% Cell type:markdown id:2ec6c217-e429-40c7-ad71-5d0a9da8e487 tags:

## Preprocessing the text dataset

%% Cell type:markdown id:379330f1-80f4-4e34-8724-41d892b04cee tags:

Let's quickly check our dataset again:

%% Cell type:code id:6kgJbe4ehI4q tags:

``` python
# First 100 characters
print(text_data[:99])

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print("Characters:", total_characters)
print("Tokens:", total_tokens)
```

%% Output

    I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no
    Characters: 20479
    Tokens: 5145

%% Cell type:markdown id:a8830cb9-90f6-4e7c-8620-beeabc2d39f7 tags:

With 5,145 tokens, the text is very short for training an LLM, but again, it's for educational purposes.

%% Cell type:markdown id:bedcad87-a0e8-4b9d-ac43-4e927ccbb50f tags:

- Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training
- For visualization purposes, the figure below assumes a `max_length=6`, but for the training loader, we set the `max_length` equal to the context length that the LLM supports
- The figure below only shows the input tokens for simplicity
    - Since we train the LLM to predict the next word in the text, the targets look the same as these inputs, except that the targets are shifted by one position

%% Cell type:markdown id:46bdaa07-ba96-4ac1-9d71-b3cc153910d9 tags:

*Note*:

**Data loaders** are functions that provide the data to the model for training and evaluation. They do loading, shuffling, batching and transforming of data on the fly, and are useful for parallelization and large datasets.
Here, the data loader does the tokenization and splits the text into overlapping sequences of the correct length.

<img src="figures/03.png" width=1500px>
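
%% Cell type:markdown id:stride-sketch-note tags:

What the data loader does can be sketched on a toy list of token ids: slide a window of `max_length` over the sequence with step `stride`, and shift the targets by one position:

%% Cell type:code id:stride-sketch tags:

``` python
# Sliding-window sketch of the data loader (toy token ids instead of real text)
token_ids = list(range(10))   # pretend these came from the tokenizer
max_length, stride = 4, 4     # stride == max_length -> no overlap between inputs

inputs, targets = [], []
for i in range(0, len(token_ids) - max_length, stride):
    inputs.append(token_ids[i : i + max_length])
    targets.append(token_ids[i + 1 : i + max_length + 1])  # shifted by one

print(inputs)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(targets)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```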

%% Cell type:code id:0959c855-f860-4358-8b98-bc654f047578 tags:

``` python
# Train/validation ratio
train_ratio = 0.80
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

# Data loader
# data loaders are functions that provide the data to the model for training and evaluation
# they do loading, shuffling, batching and transforming data on the fly
# and are useful for parallelization and large datasets
# Here, the data loader does the tokenization and splits the text into overlapping sequences of the correct length.
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,  # small batch size to reduce computing demand
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0,
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0,
)

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)
```

%% Output

    Train loader:
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])
    
    Validation loader:
    torch.Size([2, 256]) torch.Size([2, 256])
    torch.Size([2, 256]) torch.Size([2, 256])

%% Cell type:markdown id:5c3085e8-665e-48eb-bb41-cdde61537e06 tags:

- Next, let's calculate the initial loss before we start training

%% Cell type:markdown id:f0691332-84d0-48b3-b462-a885ddeb4fca tags:

- If you have a machine with a CUDA-supported GPU, the LLM will train on the GPU without making any changes to the code
- Via the `device` setting, we ensure that the data is loaded onto the same device as the LLM model

%% Cell type:code id:56f5b0c9-1065-4d67-98b9-010e42fc1e2a tags:

``` python

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(
    device
)  # no assignment model = model.to(device) necessary for nn.Module classes


torch.manual_seed(123)  # For reproducibility due to the shuffling in the data loader

with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

%% Output

    Training loss: 10.98758347829183
    Validation loss: 10.98110580444336
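
%% Cell type:markdown id:init-loss-note tags:

These initial losses are close to what we would expect from a randomly initialized model: it predicts roughly uniformly over the 50,257-token vocabulary, so the expected cross-entropy is about ln(vocab_size):

%% Cell type:code id:init-loss-check tags:

``` python
import math

# Expected cross-entropy of a uniform prediction over the vocabulary
expected_initial_loss = math.log(50257)
print(round(expected_initial_loss, 2))  # 10.82 -- close to the ~10.99 measured above
```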

%% Cell type:markdown id:b9339f8d-00cb-4206-af67-58c32bd72055 tags:

## Training on next word prediction

%% Cell type:markdown id:268d8aaa tags:

- In this section, we finally implement the code for training the LLM

<img src="figures/04.png" width=700px>
We set up the next word prediction task and training data using the autoregressive structure from the lecture.

%% Cell type:code id:Mtp4gY0ZO-qq tags:

``` python
def train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs,
    eval_freq,
    eval_iter,
    start_context,
    tokenizer,
):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(
                    f"Ep {epoch+1} (Step {global_step:06d}): "
                    f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}"
                )

        # Print a sample text after each epoch
        generate_and_print_sample(model, tokenizer, device, start_context)

    return train_losses, val_losses, track_tokens_seen
```

%% Cell type:markdown id:a301b333-b9d4-4eeb-a212-3a9874e3ac47 tags:

- Now, let's train the LLM using the training function defined above:

%% Cell type:code id:3422000b-7aa2-485b-92df-99372cd22311 tags:

``` python
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs=num_epochs,
    eval_freq=5,
    eval_iter=5,
    start_context="Every effort moves you",
    tokenizer=tokenizer,
)
```

%% Output

    Ep 1 (Step 000000): Train loss 9.783, Val loss 9.927
    Ep 1 (Step 000005): Train loss 7.985, Val loss 8.335
    Every effort moves you,,,,,,,,,,,,.

%% Cell type:markdown id:a5b8b19a tags:

You can save your model weights (and load them later if you want).

%% Cell type:code id:139885c4-40ed-4765-b307-511d5a967fcd tags:

``` python
torch.save(model.state_dict(), "model.pth")
```
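
%% Cell type:markdown id:state-dict-note tags:

What gets saved here is the model's *state_dict*: a mapping from parameter names to tensors. A minimal illustration on a tiny stand-in module (the GPT model's state_dict works the same way, just with many more entries):

%% Cell type:code id:state-dict-peek tags:

``` python
import torch

# A state_dict maps parameter names to tensors
tiny = torch.nn.Linear(3, 2)
sd = tiny.state_dict()
print(list(sd.keys()))      # ['weight', 'bias']
print(sd["weight"].shape)   # torch.Size([2, 3])
```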

%% Cell type:code id:0WSRu2i0iHJE tags:

``` python
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```

%% Cell type:markdown id:8bc83ded-5f80-4e1c-bf4d-ccb59999d995 tags:

- Looking at the results above, we can see that the model starts out generating incomprehensible strings of words, whereas towards the end, it is able to produce grammatically more or less correct sentences
- However, based on the training and validation set losses, we can see that the model starts overfitting
- If we were to check a few passages it writes towards the end, we would find that they are contained in the training set verbatim -- it simply memorizes the training data

- Also note that the **overfitting** here occurs because we have a very, very small training set, and we iterate over it so many times
- There are decoding strategies (not covered in this workshop) that can mitigate this memorization to a certain degree
  - The LLM training here primarily serves educational purposes; we mainly want to see that the model can learn to produce coherent text
  - Instead of spending weeks or months training this model on vast amounts of expensive hardware, we load pretrained weights later
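
One such decoding strategy is temperature sampling: dividing the logits by a temperature before the softmax and then sampling, instead of always taking the argmax as `generate_text_simple` does. A minimal sketch on toy logits (the values and temperature are illustrative, not from our model):

%% Cell type:code id:temperature-sketch tags:

``` python
import torch

torch.manual_seed(123)
logits = torch.tensor([2.0, 1.0, 0.1])              # toy logits for a 3-token vocabulary
probas = torch.softmax(logits / 1.5, dim=-1)        # temperature = 1.5 flattens the distribution
next_id = torch.multinomial(probas, num_samples=1)  # sample instead of argmax

print(probas.tolist())
print(next_id.item())
```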

%% Cell type:markdown id:c58ebc3a-34d1-4efe-94a0-ef5bec732162 tags:

<br>
<br>
<br>
<br>



# Exercise 1: Generate text from the pretrained LLM

%% Cell type:markdown id:b25558c3-a4f4-48de-a18e-ed63ff9ee02a tags:

- Use the model to generate new text (HINT: scroll up to see how we generated text before)

%% Cell type:markdown id:1d62ff8c-78ea-47fa-b02d-9313531cb4df tags:

<br>
<br>
<br>
<br>



# Exercise 2: Load the pretrained model in a new session

%% Cell type:markdown id:3a62addc-41ed-4853-8aec-365ef4611f79 tags:

- Open a new Python session or Jupyter notebook and load the model there

%% Cell type:markdown id:7f4b25e3-d1aa-4559-897c-36588bba2057 tags:

<br>
<br>
<br>
<br>

# Exercise 3 (Optional): Train the LLM on your own favorite texts

%% Cell type:markdown id:11f349d5-35e4-4502-8b86-ab57b5ca2f0c tags:

<br>
<br>
<br>
<br>

%% Cell type:markdown id:c4f1f2c8-4524-4323-a9c0-9fd15b01a5d1 tags:

# Solution to Exercise 1

%% Cell type:code id:f564c82a-49f7-46da-ad78-b9cb846eb5e3 tags:

``` python
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer).to(device),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

%% Cell type:markdown id:1d62ff8c-78ea-47fa-b02d-9313531cb4df tags:

## Task

- Load the pretrained model **in a new session** (or delete the model).

*Note*: This is how you would do it e.g. for pretrained weights.

%% Cell type:markdown id:b64b3b1f-c8d3-4755-a926-dc86eeae0ba0 tags:

<br>
<br>
<br>
<br>

%% Cell type:markdown id:06640a19-514c-47d1-8744-bdaeadd5c083 tags:

# Solution to Exercise 2
## Example Solution

%% Cell type:code id:e2a85852-7f7f-449a-8993-230f8d82abf5 tags:

``` python
import torch

from utils_gpt2 import GPTModel  # import from the local file

del model  # if you did not reset the session
```

%% Cell type:code id:a998656c-3615-4673-a9f9-c8eefb6b6611 tags:

``` python
model = GPTModel(GPT_CONFIG_124M)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()  # disable dropout
model.to(device)

# test for prediction
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer).to(device),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

%% Cell type:markdown id:7f4b25e3-d1aa-4559-897c-36588bba2057 tags:

## Task*

Train the LLM on your own favorite texts!