We want to classify sentiment (positive, negative) using pretrained LMs (alternatively, one could classify emotions, topics, ...).
We will work with a version of GPT-2.
- Other (larger) GPT-2 versions are documented at https://huggingface.co/transformers/v2.2.0/pretrained_models.html. (The documentation says 117M parameters, but it is actually 124.4 million.)
- Note that sentiment classification does not require a causal LM, so we could also use e.g. BERT, or turn off the causal masking. (We use GPT-2 because we can also test prompting here.)
- What do the config parameters of the LM mean, and how do they shape the printed architecture below? Explain each one.
- Try to manually replicate the model's parameter count (see the print). Not easy, right?
- Note: the default is `"n_inner": null`, which means the feed-forward inner dimension is set to `4 * n_embd`.
- For experimenting, you can change the config, e.g. `config.n_layer = x`, to assign different values (and verify that your computation is correct for other inputs). (This is not possible for all parameters.)
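To check your manual count, here is a sketch in plain arithmetic that replicates GPT-2 small's parameter total from the default config values (n_embd = 768, n_layer = 12, n_positions = 1024, vocab_size = 50257):

```python
# Replicating GPT-2 small's parameter count from its config values alone.
n_embd, n_layer, n_positions, vocab_size = 768, 12, 1024, 50257
n_inner = 4 * n_embd  # default feed-forward width ("n_inner": null -> 4 * n_embd)

embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe

per_block = (
    2 * n_embd                          # ln_1 (weight + bias)
    + n_embd * 3 * n_embd + 3 * n_embd  # c_attn: joint Q, K, V projection
    + n_embd * n_embd + n_embd          # c_proj: attention output projection
    + 2 * n_embd                        # ln_2
    + n_embd * n_inner + n_inner        # mlp.c_fc
    + n_inner * n_embd + n_embd         # mlp.c_proj
)

total = embeddings + n_layer * per_block + 2 * n_embd  # + final ln_f
print(f"{total:,}")  # 124,439,808 -> the 124.4 million mentioned above
```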
*Note*: The c_attn layer is an implementation of the standard self-attention projections (a concatenation of the query, key, and value projections). The softmax has no parameters and is applied inside the attention forward pass, not in this layer. c_proj aggregates the attention heads and is a linear layer with its own parameters.
*Note:* At first glance it looks as if the parameters are shared across the attention heads. They are not: the joint projection matrix in c_attn is split along the embedding dimension, so each head operates on its own slice of size `n_embd / n_head`. This is the standard multi-head formulation, and it is why the parameter count equals that of a single full-size projection.
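A quick sketch to verify this: c_attn holds one joint weight matrix of shape `(n_embd, 3 * n_embd)`, which the forward pass later splits into `n_head` slices. The model below is randomly initialised from the default config, so nothing is downloaded:

```python
from transformers import GPT2Config, GPT2Model

# Inspect the joint Q/K/V projection of the first transformer block
# (random weights from the default config, no download needed).
config = GPT2Config()
lm = GPT2Model(config)
attn = lm.h[0].attn

print(tuple(attn.c_attn.weight.shape))  # (768, 2304) = (n_embd, 3 * n_embd)
print(config.n_embd // config.n_head)   # 64 dimensions per head
```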
Now, let's load the trained model including weights. We could also keep the initialized model architecture and load the weights in differently, but *from_pretrained* does the job for us.
*Note*: GPT2LMHeadModel includes the *head*, a linear layer mapping the hidden states back to $l \times v$ (sequence length times vocabulary size). Its weight matrix is tied to the input embedding *wte*, so it adds no additional parameters to the count.
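A minimal sketch of the weight-tying claim, using a randomly initialised model so nothing is downloaded (with *from_pretrained("gpt2")* you would get the trained weights instead; the tying is applied either way):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Random-init model from the default config (no download); weight tying is
# applied at initialisation just as it is for the pretrained weights.
model = GPT2LMHeadModel(GPT2Config())

# The head's weight matrix is the same tensor as the input embedding wte,
# so the head contributes no additional parameters.
tied = model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr()
print(tied)
```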
### Task: A sentiment classifier (an outline of a potential project)
- Load a labelled dataset of your choice (e.g. spam emails, IMDb movie reviews, tweets, ...).
- Set the correct number of output classes.
- Use feature extraction and train your classifier.
- Tokenize your text and apply the LM to extract hidden states for each text sequence.
- Train your classifier, treating the hidden states as fixed X and your labels as y.
- Choose log-loss as your loss function if the outcome is binary, categorical cross-entropy if it is multi-class, or MSE if it is continuous.
- Evaluate your model performance on a test set. Do you beat a naive benchmark model?
*Optional components*:
- Use a different classification model, e.g. a larger neural network, boosting, or random forests; and optimize hyperparameters.
- Use a different LM.
- Use the same LM, but only load the config. What happens to the predictions?
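The outline above can be sketched end-to-end. To keep it self-contained, this uses a tiny randomly initialised GPT-2 config and made-up token ids instead of a real dataset and tokenizer (both are stand-ins; in practice you would use *from_pretrained("gpt2")*, a GPT2Tokenizer, and your labelled texts):

```python
import torch
from transformers import GPT2Config, GPT2Model

torch.manual_seed(0)

# Tiny stand-in LM (random weights); swap in GPT2Model.from_pretrained("gpt2").
config = GPT2Config(n_layer=2, n_head=2, n_embd=64)
lm = GPT2Model(config).eval()

# Made-up token ids for two "texts"; a tokenizer would produce these.
input_ids = torch.randint(0, config.vocab_size, (2, 10))
with torch.no_grad():
    hidden = lm(input_ids).last_hidden_state   # (batch, seq_len, n_embd)

# One fixed feature vector per text: the last token's hidden state.
X = hidden[:, -1, :]                           # fixed features, shape (2, 64)
y = torch.tensor([0.0, 1.0])                   # fake binary labels

# Logistic-regression head trained on the frozen features with log-loss.
head = torch.nn.Linear(config.n_embd, 1)
opt = torch.optim.Adam(head.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        head(X).squeeze(-1), y)
    loss.backward()
    opt.step()

print(round(loss.item(), 4))  # log-loss shrinks on this separable toy data
```

On a real dataset you would batch the tokenized texts, extract features once, and compare the test loss against a naive benchmark (e.g. always predicting the majority class).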
%% Cell type:markdown id:5cb2f99a tags:
## Classification and fine-tuning
%% Cell type:markdown id:4747564f tags:
### Task*
Implement the previous feature-extraction model using an LLM of your choice.
Now we add fine-tuning.
- Do not simply extract the hidden states; instead, load the model and use e.g. a Hugging Face pipeline to add a sentiment classifier on top.
- Freeze all layers of the LM (using something like *param.requires_grad = False*). This means that at the start you cannot change the LM's parameters. Verify that it acts as a feature extractor and that no parameter updates happen.
- Train your classifier first, using feature extraction (if not done before). Save your test loss.
- Now, start to *unfreeze* certain layers of the LM using something like *param.requires_grad = True* to adapt the hidden states to your application. (There is no rule for how many; it's trial and error.)
- *Fine-tune* the model (classifier and unfrozen layers).
- Evaluate whether fine-tuning improved performance by comparing your test loss to the feature-extraction baseline.
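The freeze/unfreeze mechanics above can be sketched as follows (tiny random-init config so the example runs without a download; in practice load *from_pretrained("gpt2")* and decide by trial and error which blocks to unfreeze):

```python
import torch
from transformers import GPT2Config, GPT2Model

lm = GPT2Model(GPT2Config(n_layer=4, n_head=2, n_embd=64))  # stand-in LM

# 1) Freeze everything: the LM now acts as a pure feature extractor.
for param in lm.parameters():
    param.requires_grad = False

# 2) Unfreeze only the last transformer block (how many is trial and error).
for param in lm.h[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in lm.parameters() if p.requires_grad)
total = sum(p.numel() for p in lm.parameters())
print(f"trainable: {trainable:,} of {total:,}")

# 3) Hand only the still-trainable parameters (plus your classifier head's)
#    to the optimizer for fine-tuning.
opt = torch.optim.Adam(
    (p for p in lm.parameters() if p.requires_grad), lr=1e-4)
```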
Optional: Explore other fine-tuning approaches, e.g. LoRA: https://huggingface.co/docs/diffusers/main/en/training/lora.