- The column *label_num* already contains our outcome vector $\mathbf{y}$, with each element either 0 = no spam (ham) or 1 = spam.
- We do not have any numerical features $\mathbf{X}$ to predict spam.
So let's **create** some potentially relevant **features** from the text that might predict spam!
- Length of the email (number of words)
- Usage of certain words (word_freq_{word}), e.g. the words "call" or "money" might be more common in spam emails.
*Note*: In later stages of the class we can also use a language model's numerical representation of texts here! For now, we only work with standard numerical variables.
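As a sketch of how such features could be created, assuming the emails live in a pandas DataFrame `df` with a `text` column (the sample rows below are made up for illustration):

``` python
import pandas as pd

# Hypothetical sample data; in the notebook, df holds the real emails
df = pd.DataFrame({
    "text": ["call now to win money", "meeting notes attached"],
    "label_num": [1, 0],
})

# Feature 1: length of the email in words
df["n_words"] = df["text"].str.split().str.len()

# Feature 2: relative frequency of specific words
for word in ["call", "money"]:
    df[f"word_freq_{word}"] = (
        df["text"].str.lower().str.count(rf"\b{word}\b") / df["n_words"]
    )
```

The word list and column names here are only an illustration; any set of candidate spam words works the same way.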
)  # optimizer: we use an advanced gradient-based method
```
%% Cell type:markdown id: tags:
## Train the neural network
%% Cell type:markdown id: tags:
Now, let's implement the training loop.
*Note*: In the training loop, in addition to our loss function, we also compute the **accuracy** (0/1 loss) as a measure of classification performance, because this is often what we are actually interested in:
# Compute gradients of loss w.r.t. our learnable params (= backpropagation)
loss.backward()
# Update params based on the gradients and the optimizer
optimizer.step()
```
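A minimal, self-contained sketch of such a training loop, including the accuracy computation, on synthetic stand-in data (the real notebook would use `X_train` / `y_train` built from the email features):

``` python
import torch

torch.manual_seed(0)
# Synthetic stand-in data (64 observations, 3 features)
X_train = torch.randn(64, 3)
y_train = (X_train.sum(dim=1, keepdim=True) > 0).float()

model = torch.nn.Sequential(
    torch.nn.Linear(3, 8), torch.nn.ReLU(),
    torch.nn.Linear(8, 1), torch.nn.Sigmoid(),
)
loss_function = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for t in range(200):
    y_hat = model(X_train)
    loss = loss_function(y_hat, y_train)
    # Accuracy (0/1 loss): threshold predicted probabilities at 0.5
    acc = ((y_hat > 0.5).float() == y_train).float().mean()
    optimizer.zero_grad()  # reset gradients from the previous step
    loss.backward()        # gradients of loss w.r.t. learnable params
    optimizer.step()       # update params based on gradients and optimizer
```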
%% Cell type:markdown id: tags:
## Task
- Play around with the architecture of the network (number of layers, activations, number of neurons). How does it impact model performance?
- Increase the number of epochs and parameters (by a lot). What happens to the loss after training compared to the previous settings?
- Currently, we only look at loss on the *training* set. This does not take into account overfitting issues!
1) Compute and print the loss and accuracy on the *test* set as well within the training loop. Does your model overfit?
2) **Early stopping** against overfitting: Implement a procedure in the training loop that terminates the loop (break) when the loss on the *test* set has not improved in a while (e.g. does not beat the best test loss for 500 epochs).
- **Evaluation**: How good do you think your model is in terms of accuracy? Is it better than a naive benchmark model that classifies every email as *no spam*?
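The naive benchmark can be sketched as follows, with hypothetical test labels standing in for the notebook's `y_test`:

``` python
import torch

# Hypothetical test labels; in the notebook these come from y_test
y_test = torch.tensor([0., 0., 0., 1., 0.]).reshape(-1, 1)

# Naive benchmark: predict "no spam" (0) for every email
y_benchmark = torch.zeros_like(y_test)
acc_benchmark_test = (y_benchmark == y_test).float().mean()
```

With imbalanced classes (spam is usually the minority), this baseline can be surprisingly high, which is why accuracy alone can be misleading.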
%% Cell type:markdown id: tags:
## Example Solution
%% Cell type:code id: tags:
``` python
# Init params for early stopping
early_stopping_epochs = 500  # max number of epochs that loss_test can exceed lowest_test_loss
early_stopping_counter = 0
lowest_test_loss = np.inf  # large initial value for first iteration

# Re-init model (otherwise we would continue training)
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)
# Move model to device
model.to(device)

# Training params
n_epochs = 2500  # number of passes through the complete training dataset
learning_rate = 1e-3  # update step size of the optimizer
loss_function = torch.nn.BCELoss()  # loss function for training

print("Benchmark accuracy on test set:", acc_benchmark_test.item())
print("Trained model accuracy on test set:", acc_test.item())
if acc_test.item() > acc_benchmark_test.item():
    print("Model outperforms benchmark.")
else:
    print("Model does not outperform benchmark.")
```
%% Cell type:markdown id: tags:
## Task
Currently, we use gradient descent to train the model.
Now, we want to implement *minibatch gradient descent*, a computationally more efficient version of stochastic gradient descent: you compute the gradients on a *batch* of $b$ observations (e.g. *batch_size=32*) and use the mean of the gradients to update your parameters. For each epoch, shuffle the dataset randomly and then process all batches.
- Start with stochastic gradient descent (which is equivalent to minibatch gradient descent for *batch_size=1*).
- Now adapt your code to work for *batch_size>1* (if that is not the case yet). You could use `torch.split` to create chunks.
- *Note*: Your loss should still be monitored per epoch, not per batch.
- *Check*: How do model performance and training time change?
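A minimal sketch of the shuffle-split-update pattern with `torch.split`, on synthetic stand-in data (the notebook would use its own `X_train` / `y_train` and model):

``` python
import torch

torch.manual_seed(0)
X_train = torch.randn(100, 3)
y_train = (X_train.sum(dim=1, keepdim=True) > 0).float()

model = torch.nn.Sequential(torch.nn.Linear(3, 1), torch.nn.Sigmoid())
loss_function = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
batch_size = 32

for epoch in range(5):
    # Shuffle once per epoch, then split into chunks of batch_size
    perm = torch.randperm(X_train.shape[0])
    X_batches = torch.split(X_train[perm], batch_size)
    y_batches = torch.split(y_train[perm], batch_size)
    for X_b, y_b in zip(X_batches, y_batches):
        loss = loss_function(model(X_b), y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Monitor loss per epoch on the full training set, not per batch
    with torch.no_grad():
        epoch_loss = loss_function(model(X_train), y_train)
```

Note that `torch.split` leaves the last chunk smaller when the dataset size is not a multiple of the batch size (here 100 = 3×32 + 4).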
%% Cell type:markdown id: tags:
## Example Solution
*Note*:
- Typically, one would move only the single batches to the GPU instead of the entire dataset.
- This implementation is pretty slow.
- Training time: Here, training requires fewer epochs, but is not faster in wall-clock time than standard gradient descent.
- Performance: Here, the optimal solution does not outperform gradient descent.
The choice between gradient methods depends on the exact setting. If you expect many local minima and saddle points, or work on very large datasets, SGD / minibatch GD is probably preferable to standard GD.
%% Cell type:code id: tags:
``` python
# Init params for early stopping
early_stopping_epochs = 500  # max number of epochs that loss_test can exceed lowest_test_loss
early_stopping_counter = 0
lowest_test_loss = np.inf  # large initial value for first iteration

# Re-init model (otherwise we would continue training)
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)
# Move model to device
model.to(device)

# Training params
n_epochs = 2500  # number of passes through the complete training dataset
learning_rate = 1e-3  # update step size of the optimizer
loss_function = torch.nn.BCELoss()  # loss function for training

# Increase counter if test loss increases, otherwise reset
if loss_test > lowest_test_loss:
    early_stopping_counter += 1
else:
    early_stopping_counter = 0
    lowest_test_loss = loss_test
# Break if counter >= early_stopping_epochs
if early_stopping_counter >= early_stopping_epochs:
    print("Early stopping at epoch", t)
    break
```
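Since the early-stopping logic above is shown as a fragment, here is a self-contained sketch of how the pieces fit into one training loop, on synthetic stand-in data rather than the notebook's email features (model size, patience, and epoch count are chosen just for illustration):

``` python
import numpy as np
import torch

torch.manual_seed(0)
# Synthetic stand-in train/test data
X_train = torch.randn(64, 3)
y_train = (X_train.sum(dim=1, keepdim=True) > 0).float()
X_test = torch.randn(32, 3)
y_test = (X_test.sum(dim=1, keepdim=True) > 0).float()

model = torch.nn.Sequential(
    torch.nn.Linear(3, 8), torch.nn.ReLU(),
    torch.nn.Linear(8, 1), torch.nn.Sigmoid(),
)
loss_function = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

early_stopping_epochs = 50  # patience (illustrative value)
early_stopping_counter = 0
lowest_test_loss = np.inf

for t in range(2500):
    # Standard gradient descent step on the training set
    loss = loss_function(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Evaluate on the test set without tracking gradients
    with torch.no_grad():
        loss_test = loss_function(model(X_test), y_test)
    # Increase counter if test loss got worse, otherwise reset
    if loss_test > lowest_test_loss:
        early_stopping_counter += 1
    else:
        early_stopping_counter = 0
        lowest_test_loss = loss_test
    # Stop once the test loss has not improved for the patience window
    if early_stopping_counter >= early_stopping_epochs:
        print("Early stopping at epoch", t)
        break
```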
%% Cell type:markdown id: tags:
## Task*
How well can you train your model? Go crazy!
- Experiment with input normalization, architecture, regularization, additional features in the data, different optimization procedures.
- Check the PyTorch documentation to find cool new functions: https://pytorch.org/docs/stable/index.html https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html
*Note*: If you try many combinations of hyperparameters, there is a risk of overfitting to the test set through your hyperparameter choices.
Use an additional hold-out sample for the final evaluation (train, validation for hyperparameter selection, test).
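The three-way split can be sketched with `torch.randperm`; the dataset size `n` and the 60/20/20 proportions below are only illustrative:

``` python
import torch

torch.manual_seed(0)
n = 1000  # hypothetical dataset size
idx = torch.randperm(n)  # random shuffle of row indices

# 60% train, 20% validation (hyperparameter selection), 20% test (final evaluation)
n_train, n_val = int(0.6 * n), int(0.2 * n)
idx_train = idx[:n_train]
idx_val = idx[n_train:n_train + n_val]
idx_test = idx[n_train + n_val:]
```

These index tensors can then be used to slice the feature matrix and outcome vector, e.g. `X[idx_train]` and `y[idx_train]`.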
%% Cell type:markdown id: tags:
# Additional References
**How do neural networks learn?**:
See this cool visualization https://playground.tensorflow.org/
**For eager programmers**:
If you want to explore gradient flow and how to build your own neural network more manually, check some of these references:
- Torch but more manual https://github.com/karpathy/nn-zero-to-hero/tree/master (e.g. lecture 2 and 3)
- Without torch but very simple network structure https://github.com/JLDC/Data-Science-Fundamentals/blob/master/notebooks/205_my-own-neural-network-1.ipynb
- Torch for perceptron https://github.com/rasbt/deeplearning-models/blob/master/pytorch_ipynb/basic-ml/perceptron.ipynb