Commit 8619ac61 authored by Erik Senn
%% Cell type:markdown id:9d925756-845a-412e-846c-16a4e0124372 tags:

# Setup and data

Need GPU? Yes

%% Cell type:code id:f7cfd67a-d6b5-4e90-a2fe-7398f1e3293d tags:

``` python
from transformers import GPT2Tokenizer, GPT2Config, GPT2LMHeadModel, GPT2Model
import torch
import torch.nn as nn  # used for the custom classifier below
from IPython.display import Markdown, display

device = torch.device(
    "cuda:0" if torch.cuda.is_available() else "cpu"
)  # device for torch
```

%% Cell type:code id:f5c12df2-4d26-4793-a9cc-aac5b35473da tags:

``` python
# for you later: data imports
```

%% Cell type:markdown id:71de2739-28a9-4ede-b594-58d5ac5efd53 tags:

# Sentiment classification

%% Cell type:markdown id:51fb99aa-3cdd-40ee-8619-91d377f7c3d7 tags:

We want to classify sentiment (positive, negative) using pretrained LMs (alternatively, one could classify emotions, ...).
We will work with a version of GPT-2.
- Other (larger) GPT-2 versions are documented at https://huggingface.co/transformers/v2.2.0/pretrained_models.html. (The documentation says 117M parameters, but it is actually 124.4 million.)
- Note that sentiment classification does not require a causal LM, so we could also use e.g. BERT, or turn off the masking. (We use GPT-2 because we can also test prompting here.)

%% Cell type:markdown id:abcaae5e-eb7d-4269-8d58-f9129211995e tags:

## GPT2 Model architecture

%% Cell type:markdown id:327a954a-73cd-444b-a18c-a74b7aab4dfd tags:

First, let us load the basic model architecture (no weights) and investigate:

%% Cell type:code id:3af6d22c-d67c-4679-b289-f2fee2f5b1e6 tags:

``` python
# Init GPT-2 model from the configuration
config = GPT2Config()
# config.n_layer = 24 # how to change config of model
print(config)
model = GPT2Model(config)
print(model)
num_parameters = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)  # assuming all components with grad are trained
print("Parameters that require grad: ", num_parameters)
```

%% Cell type:markdown id:0063c8ca-c044-4b57-9830-6d30f65bd71a tags:

### Task

- What is the meaning of the config parameters of the LM, and how do they shape the printed architecture below? Explain each one.
- Try to manually replicate the number of parameters of the model (see print). Not easy, right?
    - Note: `n_inner` defaults to `None`, which means an inner MLP dimension of 4 * `n_embd`.
    - For experimenting, you can change the config, e.g. `config.n_layer = x`, to assign different values (and verify that your computation is correct for other inputs). (Not possible for all parameters.)

*Note*: The `c_attn` module is an implementation of the standard self-attention mechanism (a fused projection producing query, key, and value in one matrix). The softmax is applied inside the forward pass, so it does not appear in the printed architecture. `c_proj` is a linear layer with its own parameters that aggregates the attention heads.

%% Cell type:markdown id:044824c4-781a-4519-9c19-0de6f6209910 tags:

### Example solution

*Note:* The attention heads do not add parameters on top of `c_attn` and `c_proj`: the heads partition the embedding dimension, so the projection matrices are shared across heads rather than duplicated per head. This is standard multi-head attention.

%% Cell type:code id:a25d2ab4-bb2b-4cb2-bd5a-9674b383f6d3 tags:

``` python
l = config.n_positions
d = config.n_embd
v = config.vocab_size

input_embedding = v * d + l * d  # token embeddings + absolute position embeddings
attention = 3 * (d * d + d)  # c_attn: fused Q, K, V projection (weights and biases)
att_proj = d * d + d  # c_proj: linear layer aggregating the attention heads (trainable as well)
layer_norm = d + d  # weight and bias
mlp = d * (4 * d) + 4 * d + (4 * d) * d + d  # weight l1, bias l1, weight l2, bias l2
n_parms_block = (
    layer_norm + attention + att_proj + layer_norm + mlp
)  # heads partition d, so there are no extra per-head parameters
output_model = 0  # the LM head (d x v) is not part of GPT2Model (and would be weight-tied anyway)
final_layer_norm = d + d  # ln_f after the last block

final_count = input_embedding + config.n_layer * n_parms_block + final_layer_norm
print(final_count)
```
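
%% Cell type:markdown id:b7e1a2c3 tags:

As a sanity check, the closed-form count can be written out with the default config values plugged in (pure arithmetic, no model needed); it should reproduce the 124.4 million figure mentioned above.

%% Cell type:code id:c8f2b3d4 tags:

``` python
# Default GPT-2 config values: n_embd, n_positions, vocab_size, n_layer
d, l, v, n_layer = 768, 1024, 50257, 12

embeddings = v * d + l * d        # wte + wpe
attn = 3 * (d * d + d)            # c_attn: fused Q, K, V projection
attn_proj = d * d + d             # c_proj
ln = 2 * d                        # one layer norm (weight + bias)
mlp = d * 4 * d + 4 * d + 4 * d * d + d
block = 2 * ln + attn + attn_proj + mlp  # two layer norms per block

total = embeddings + n_layer * block + ln  # plus the final ln_f
print(total)  # 124439808
```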

%% Cell type:markdown id:a6c19989-7d9e-4b25-b6c1-34797ca73156 tags:

## Classification using Prompting

%% Cell type:markdown id:4242657e-ff02-4a35-90b0-08ba8bb88478 tags:

Now, let's load the trained model including its weights. We could also keep the initialized model architecture and load the weights in manually, but *from_pretrained* does the job for us.

*Note*: GPT2LMHeadModel adds the *head*, a linear layer mapping each hidden state (size $d$) to vocabulary logits, so the output has shape $l \times v$. Its weights are tied to the token embedding matrix, which is why it does not increase the parameter count.
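
%% Cell type:markdown id:d9e3c4f5 tags:

The weight tying can be illustrated with a toy embedding and head (arbitrary small sizes): after the assignment, both modules share a single weight tensor. GPT2LMHeadModel ties *lm_head.weight* to *transformer.wte.weight* in the same way.

%% Cell type:code id:e0f4d5a6 tags:

``` python
import torch.nn as nn

# Toy illustration of weight tying (arbitrary small sizes):
vocab, hidden = 10, 4
emb = nn.Embedding(vocab, hidden)            # weight shape (vocab, hidden)
head = nn.Linear(hidden, vocab, bias=False)  # weight shape (vocab, hidden) too
head.weight = emb.weight  # tie: both modules now point to one tensor

unique_params = {id(p) for m in (emb, head) for p in m.parameters()}
print(len(unique_params))  # 1 -> the head adds no new parameters
```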

%% Cell type:code id:3d909862-a200-48f8-85ba-ed3729d1af58 tags:

``` python
# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained(
    "gpt2"
)  # includes the model architecture AND the trained parameters
print(gpt2)
print("Parameter count", sum(p.numel() for p in gpt2.parameters() if p.requires_grad))

# Example input text
input_text = "This LLM class is way too easy and "
# input_text = "This movie was fantastic!"
# input_text = "Awesome awesome awesome "

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")

# Feature extraction: compute logits from input
with torch.no_grad():
    outputs = gpt2(**inputs)  # dimension is batch_size x l x v
    logits = outputs.logits  # get output values

# [what does this do, and why?]
next_token_logits = logits[:, -1, :]  # (batch_size, vocab_size)

predicted_token_id = torch.argmax(next_token_logits, dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)  # transform back to token

# Output predicted token (can be interpreted for sentiment)
print(
    f"Predicted next token id and token: \n {predicted_token_id[0]} : ({predicted_token})"
)
```

%% Cell type:markdown id:7e82fd11-b1f9-4cd8-b10c-0e5d1e9fdf00 tags:

### Task

- Go through the code step by step, and trace the dimensionality of the data. Ask if you have questions.
- What does logits[:, -1, :] do? Why is this okay?
- What does torch.argmax(next_token_logits, dim=-1) do? Why can we take the argmax of the logits without transforming them to probabilities first?
- Our output is a predicted token. Is that what we want? How would you evaluate the prediction?

%% Cell type:markdown id:c6b93e3e-9b4d-4c3e-84da-c122cb4a877c tags:

### Task*

As we see, GPT-2 does not do a great job when prompted.
- What could this be related to? Do you have ideas how to improve the prompting approach?
- Maybe generating longer sequences (autoregressively) is helpful:
    - Iteratively append the predicted token to the input sequence and generate an output of 10 tokens (e.g. look at the pretraining code if you are stuck).
- Try a different generative (open-source) model, e.g. a larger GPT.
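
%% Cell type:markdown id:f1a5b6c7 tags:

A sketch of the autoregressive loop from the second bullet. The toy step function below is a stand-in so the mechanics are visible; with the model from above, *step_fn* would be something like `lambda ids: gpt2(ids).logits`, wrapped in `torch.no_grad()`.

%% Cell type:code id:a2b6c7d8 tags:

``` python
import torch

def greedy_generate(step_fn, ids, n_new):
    """Append n_new greedily chosen tokens to ids (shape (1, seq))."""
    for _ in range(n_new):
        logits = step_fn(ids)  # (1, seq, vocab)
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)  # feed the prediction back in
    return ids

# Toy stand-in for the model: always scores token (last_id + 1) % vocab highest.
vocab = 5
def toy_step(ids):
    logits = torch.zeros(1, ids.shape[1], vocab)
    logits[0, -1, (ids[0, -1].item() + 1) % vocab] = 1.0
    return logits

out = greedy_generate(toy_step, torch.tensor([[0]]), 4)
print(out.tolist())  # [[0, 1, 2, 3, 4]]
```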

%% Cell type:markdown id:0f7cc8f4-dc75-428a-a7a2-29dc0e111129 tags:

## Classification from hidden states using feature extraction

%% Cell type:markdown id:7de4dd2d-c2ad-4c20-93fd-dc56b208a819 tags:

Now, we load *GPT2Model*, i.e. the model without the final language-modeling head.

%% Cell type:code id:eabc096f-3b14-41b6-a486-9f5f75235f8f tags:

``` python
# Load the model
gpt2 = GPT2Model.from_pretrained("gpt2")
print("Parameter count", sum(p.numel() for p in gpt2.parameters() if p.requires_grad))

# Load pre-trained tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Example input
input_text = "This LLM class is boring and way too easy."
# input_text = "This movie was fantastic!"
# input_text = "Awesome awesome awesome "

inputs = tokenizer(input_text, return_tensors="pt")

# Get the hidden states from GPT-2 (without the final language modeling head)
with torch.no_grad():
    outputs = gpt2(**inputs)
    hidden_states = (
        outputs.last_hidden_state
    )  # (batch_size, sequence_length, hidden_size)
print("hidden_states shape:", hidden_states.shape)

# Typically, we first pool the sequence: from (l, hidden_size) down to a single
# hidden_size vector per text (not required, but usual as dim reduction for the classifier).
# As before, we can use the last token (or take a mean over all tokens).
hidden_states = hidden_states[:, -1, :]
print("hidden_states last token: ", hidden_states.shape)


# Custom classifier (untrained)
class SentimentClassifier(nn.Module):
    def __init__(self, hidden_size, num_classes):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        return self.fc(x)


# Instantiate the classifier
classifier = SentimentClassifier(
    hidden_size=hidden_states.size(-1), num_classes=2
)  # Binary sentiment (positive/negative)

# Pass the pooled features into the classifier
logits = classifier(
    hidden_states
)  # outputs logits only; taking the argmax is sufficient because softmax is monotone

print(logits)
```
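
%% Cell type:markdown id:b3c7d8e9 tags:

Since the classifier is a single linear layer, its parameter count follows directly from *hidden_size* and *num_classes*; a self-contained check with the GPT-2 hidden size:

%% Cell type:code id:c4d8e9f0 tags:

``` python
import torch.nn as nn

# One linear layer: hidden_size * num_classes weights plus num_classes biases
fc = nn.Linear(768, 2)
n_params = sum(p.numel() for p in fc.parameters())
print(n_params)  # 768 * 2 + 2 = 1538
```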

%% Cell type:markdown id:74242f1d-3e2c-466c-9766-e5300e04ac04 tags:

### Task

- What are the hidden states? What happens to their dimension in the code?
- What does our classifier look like? How many parameters does it have, and where do they come from? Interpret the output.

%% Cell type:markdown id:693c25d2-288e-4916-9e05-4bd7e7cc8553 tags:

### Task: A sentiment classifier (an outline of a potential project)

- Load a labelled dataset of your choice (e.g. spam emails, IMDB movie reviews, tweets, ...).
- Set the correct number of output classes.
- Use feature extraction and train your classifier.
    - Tokenize your text and apply the LM to extract hidden states for each text sequence.
    - Train your classifier, treating the hidden states as fixed features X and your labels as y.
        - Choose log-loss if your outcome is binary, categorical cross-entropy if it is multi-class, or MSE if it is continuous.
    - Evaluate your model performance on a test set. Do you beat a naive benchmark model?

Optional components*:
- Use a different classification model e.g. larger neural network, boosting, forests; and optimize hyperparameters.
- Use a different LM.
- Use the same LM, but only load the config. What happens to the predictions?
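
%% Cell type:markdown id:d5e9f0a1 tags:

A minimal sketch of the training step, with synthetic tensors standing in for real extracted features and labels (replace *X* and *y* with your hidden states and dataset labels):

%% Cell type:code id:e6f0a1b2 tags:

``` python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 768)          # e.g. last-token hidden states (fixed features)
y = (X[:, 0] > 0).long()          # synthetic binary labels

clf = nn.Linear(768, 2)
opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()   # log-loss for classification

initial = loss_fn(clf(X), y).item()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(clf(X), y)
    loss.backward()
    opt.step()
print(initial, loss.item())       # the training loss should have dropped
```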

%% Cell type:markdown id:5cb2f99a tags:

## Classification and fine-tuning

%% Cell type:markdown id:4747564f tags:

### Task*

Implement the previous feature-extraction model using an LLM of your choice.
Now we add fine-tuning.
- Do not simply extract the hidden states; instead, load the model and use e.g. a Hugging Face pipeline to add a sentiment classifier.
- Freeze all layers of the LM (using something like *param.requires_grad = False*). This means that at the start you cannot change the parameters of the model. Verify that it acts as a feature extractor and that no parameter updates happen.
- Train your classifier first, using feature extraction (if not done before). Save your test loss.
- Now, start to *unfreeze* certain layers of the LM using something like *param.requires_grad = True* to adapt the hidden states to your application. (There is no rule for how many; it is trial and error.)
- *Fine-tune* the model (classifier and unfrozen layers).
- Evaluate whether fine-tuning improved performance by looking at your test loss.

Optional: Explore other fine-tuning approaches, e.g. LoRA https://huggingface.co/docs/diffusers/main/en/training/lora.
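
%% Cell type:markdown id:f7a1b2c3 tags:

The freeze/unfreeze mechanics can be sketched with a toy backbone and head (hypothetical sizes); the same *requires_grad* pattern applies to a real LM's parameters.

%% Cell type:code id:a8b2c3d4 tags:

``` python
import torch.nn as nn

# Toy stand-in for LM backbone + classifier head (hypothetical sizes):
backbone = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))
head = nn.Linear(8, 2)

def n_trainable(*modules):
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

for p in backbone.parameters():
    p.requires_grad = False  # feature extraction: the backbone stays fixed

frozen_count = n_trainable(backbone, head)
print(frozen_count)  # only the head: 8 * 2 + 2 = 18

for p in backbone[-1].parameters():
    p.requires_grad = True  # unfreeze the last backbone layer for fine-tuning

tuned_count = n_trainable(backbone, head)
print(tuned_count)  # head + last layer: 18 + (8 * 8 + 8) = 90
```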

%% Cell type:code id:dcddd148 tags:

``` python
```