Commit 195070fd authored by Erik Senn

Upload New File
%% Cell type:markdown id: tags:

# Setup and data

GPU required? No (but speeds things up)

%% Cell type:code id: tags:

``` python
# Imports (note that you also need imports from the .py function files)
import numpy as np
import pandas as pd
import torch  # PyTorch / ML tool
from IPython.display import Markdown, display

# from utils import *  # ensure to have this file in the same directory

device = torch.device(
    "cuda:0" if torch.cuda.is_available() else "cpu"
)  # device for torch
```

%% Cell type:code id: tags:

``` python
# Data
datapath = "../data/"

# read csv
data = pd.read_csv(datapath + "email_spam.csv")

# drop the leftover index column if present
data = data.drop(columns=["Unnamed: 0"], errors="ignore")
```

%% Cell type:markdown id: tags:

# Neural Network for Spam Classification
*Note*: NN code is partly adapted from https://github.com/Atcold/NYU-DLSP20/blob/master/04-spiral_classification.ipynb

We want to use a feed-forward neural network to build a **spam filter for emails**, using past data of spam messages and emails.

%% Cell type:markdown id: tags:

## Data Preparation

%% Cell type:markdown id: tags:

First, let's look at the data:
- *Label* means outcome / target variable.

%% Cell type:code id: tags:

``` python
print(data.head())


display(Markdown("**No Spam Example**:"))
print(data[data["label_num"] == 0].iloc[2]["text"][:1000] + "\n")

display(Markdown("**Spam Example**:"))
print(data[data["label_num"] == 1].iloc[5]["text"][:1000])
```

%% Cell type:markdown id: tags:

- The column *label_num* already contains our outcome vector $\mathbf{y}$, with each element either 0 = no spam (ham) or 1 = spam.
- We do not yet have any numerical features $\mathbf{X}$ to predict spam.
So let's **create** some potentially relevant **features** from the text that might predict spam:
  - Length of the email (number of words)
  - Usage of certain words (word_freq_{word}), e.g. the words "call" or "money" might be more common in spam emails.

*Note*: In later stages of the class we can also use a language model's numerical representation of the texts here! For now, we only work with standard numerical variables.

%% Cell type:code id: tags:

``` python
# Create potentially relevant numerical features

# length of email
data["n_words"] = data["text"].apply(lambda x: len(x.split()))

# list of words that might distinguish spam from ok emails (ham)
potentially_relevant_words = [
    "call",
    "free",
    "urgent",
    "money",
    "mobile",
    "text",
    "reply",
    "win",
    "prize",
    "cash",
    "txt",
    "go",
    "get",
    "u",
    "come",
    "ok",
]

for word in potentially_relevant_words:
    data["word_freq_" + word] = data["text"].apply(
        lambda x: x.lower().split().count(word)  # lowercase the text so counts are case-insensitive
    )

# keep only numerical columns
data = data.select_dtypes(include=[np.number])

# descriptive statistics: are means different for spam and ham?
# if yes, that is an indicator that this variable might be useful
data.groupby("label_num").mean()
```

%% Cell type:markdown id: tags:

Split into a training and test set.

%% Cell type:code id: tags:

``` python
# Train-test split
share_data_train = 0.8
train_idx = data.sample(
    frac=share_data_train, random_state=42
).index  # index labels for sample splitting

data_train = data.loc[train_idx]  # select by label, consistent with .drop below
data_test = data.drop(train_idx)

# Split into X and y, convert to torch tensors
label_column = "label_num"
X_train, y_train = data_train.drop(columns=[label_column]), data_train[label_column]
X_test, y_test = data_test.drop(columns=[label_column]), data_test[label_column]
```

%% Cell type:markdown id: tags:

Format the data inputs to our model for training using torch:
- torch.Tensors are basically np.arrays that can additionally store and propagate gradients (see the optional notebook "optional_tensor_into" for details)
- .to(device) moves the data to the correct computing device: GPU or CPU

%% Cell type:code id: tags:

``` python
# format as torch.tensor and correct input formats
# y has to be N x 1, not N

# optional: standardize data
# X_train = (X_train - X_train.mean()) / X_train.std()
# X_test = (X_test - X_train.mean()) / X_train.std() # using mean and std from train.


X_train = torch.tensor(X_train.values).float().to(device)
y_train = torch.tensor(y_train.values).float().unsqueeze(1).to(device)
X_test = torch.tensor(X_test.values).float().to(device)
y_test = torch.tensor(y_test.values).float().unsqueeze(1).to(device)
```

%% Cell type:markdown id: tags:

## Build the neural network

%% Cell type:markdown id: tags:

We define the neural network architecture in the torch framework, using torch.nn.Sequential.

This creates a computational graph for the architecture we provide.
It also allows for gradient computation and parameter updates (see later).


**Think:**
- What should be in the input and output dimensions?
- How many layers does this neural network have?
- How many trainable parameters should the model have?
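
To check the last answer by hand: each torch.nn.Linear(in, out) layer has in·out weights plus out biases. A quick sanity check, assuming K = 17 features (1 length feature + 16 word counts) and N_Z = 25 hidden units as used below:

%% Cell type:code id: tags:

``` python
# Worked parameter count for a 17 -> 25 -> 25 -> 1 network
# (assumed sizes; K depends on the features actually created above)
K, N_Z = 17, 25
n_params = (K * N_Z + N_Z) + (N_Z * N_Z + N_Z) + (N_Z * 1 + 1)
print(n_params)  # 1126
```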

%% Cell type:code id: tags:

``` python
K = X_train.shape[1]  # input
N_Z = 25  # num_hidden_units in hidden layer

# torch nn package to create our model
# each module has a weight and bias
# we add the activation function after the output linear layer
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)

# move model to the computing device: GPU or CPU
model.to(device)

# show model architecture
print(model)

# how many trainable parameters?
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Number of trainable parameters:", n_params)
```

%% Cell type:markdown id: tags:

Now, we specify the training-related parameters: number of training iterations, the loss function and the optimizer.

%% Cell type:code id: tags:

``` python
n_epochs = 2500  # number of times to iterate through the complete training dataset
learning_rate = 1e-3  # update step of optimizer
loss_function = torch.nn.BCELoss()  # Loss function for training
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5
)  # optimizer. We use an advanced gradient based method
```

%% Cell type:markdown id: tags:

## Train the neural network

%% Cell type:markdown id: tags:

Now, let's implement the training loop.

*Note*: In the training loop, in addition to our loss function, we also compute **accuracy** (one minus the average 0/1 loss) as a measure of classification performance, because often this is what we are interested in:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\sum_{i=1}^{N} \mathbb{1}_{\hat y_i = y_i}}{N}$$
- TP = True positives, TN = True negatives, FP = False positives, FN = False negatives.
- In words, accuracy is the share of correctly classified labels (after applying a threshold on the predicted probabilities, e.g. .5).

We cannot directly use accuracy as the loss function in the optimization because accuracy is non-differentiable :(
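
A quick sketch of why, using only standard torch behavior: torch.round is piecewise constant, so its gradient with respect to the predicted probabilities is zero everywhere and provides no learning signal.

%% Cell type:code id: tags:

``` python
import torch  # already imported above; repeated so this cell is self-contained

# Thresholding kills the gradient: round() is flat almost everywhere
p = torch.tensor([0.2, 0.6, 0.9], requires_grad=True)
hard = torch.round(p)  # 0/1 predictions, piecewise constant in p
hard.sum().backward()
print(p.grad)  # tensor([0., 0., 0.]) -- no signal for gradient descent
```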


%% Cell type:code id: tags:

``` python
# Training
for t in range(n_epochs):

    # Feed forward to get the predicted probabilities
    y_pred_proba = model(X_train)

    # Compute the loss on the training
    loss = loss_function(y_pred_proba, y_train)

    # Compute additional loss called accuracy: How many predictions did we get right?
    y_pred = torch.round(y_pred_proba)  # convert to 0/1 using .5 as threshold
    acc = (y_train == y_pred).sum().float() / len(y_train)

    # Print training progress
    print("[EPOCH]: %i, [LOSS]: %.6f, [ACCURACY]: %.3f" % (t, loss.item(), acc))

    # zero the gradients before running
    # the backward pass.
    optimizer.zero_grad()

    # Backward pass to compute the gradient
    # of loss w.r.t our learnable params (=Backpropagation)
    loss.backward()

    # Update params based on the gradients and the optimizer
    optimizer.step()
```

%% Cell type:markdown id: tags:

## Task
- Play around with the architecture of the network (number of layers, activations, number of neurons). How does it impact model performance?
- Increase the number of epochs and parameters (by a lot). What happens to the loss after training compared to the previous settings?
- Currently, we only look at loss on the *training* set. This does not take into account overfitting issues!
  1) Compute and print the loss and accuracy also on the *test* set within the training loop. Does your model overfit?
  2) **Early stopping** against overfitting: Implement a procedure in the training loop that terminates the loop (break) when the loss on the *test* set has not improved for a while (e.g. has not beaten the best test loss for 500 epochs).
- **Evaluation**: How good do you think your model is in terms of accuracy? Is it better than a naive benchmark model that classifies every email as *no spam*?

%% Cell type:markdown id: tags:

## Example Solution

%% Cell type:code id: tags:

``` python
# Init params for early stopping
early_stopping_epochs = (
    500  # max number of epochs that loss_test can be larger than lowest_test_loss
)
early_stopping_counter = 0
lowest_test_loss = np.inf  # large initial value for first iteration

# reinit model (otherwise we would continue training)
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)

# move model to device
model.to(device)

# training params
n_epochs = 2500  # number of times to iterate through the complete training dataset
learning_rate = 1e-3  # update step of optimizer
loss_function = torch.nn.BCELoss()  # Loss function for training
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5
)  # optimizer. We use an advanced gradient based method

# Training
for t in range(n_epochs):

    # Feed forward to get the predicted probabilities
    y_pred_proba = model(X_train)

    # Compute the loss on the training
    loss = loss_function(y_pred_proba, y_train)

    # Compute additional loss called accuracy: How many predictions did we get right?
    y_pred = torch.round(y_pred_proba)  # convert to 0/1 using .5 as threshold
    acc = (y_train == y_pred).sum().float() / len(y_train)

    # Compute loss on the test set (no gradients needed for evaluation)
    with torch.no_grad():
        y_pred_proba_test = model(X_test)
        loss_test = loss_function(y_pred_proba_test, y_test)
        y_pred_test = torch.round(y_pred_proba_test)  # convert to 0/1 using .5 as threshold
        acc_test = (y_test == y_pred_test).sum().float() / len(y_test)

    # Print training progress
    print(
        "[EPOCH]: %i, [LOSS]: %.6f, [LOSS TEST]: %.6f, [ACCURACY]: %.3f, [ACCURACY TEST]: %.3f"
        % (t, loss.item(), loss_test.item(), acc, acc_test)
    )

    # Early stopping
    # Increase counter if test loss increases, otherwise reset
    if loss_test > lowest_test_loss:
        early_stopping_counter += 1
    else:
        early_stopping_counter = 0
        lowest_test_loss = loss_test
    # Break if counter >= early_stopping_epochs
    if early_stopping_counter >= early_stopping_epochs:
        print("Early stopping at epoch", t)
        break

    # zero the gradients before running
    # the backward pass.
    optimizer.zero_grad()

    # Backward pass to compute the gradient
    # of loss w.r.t our learnable params (=Backpropagation)
    loss.backward()

    # Update params based on the gradients and the optimizer
    optimizer.step()
```

%% Cell type:code id: tags:

``` python
# Naive benchmark on test set: always predict the majority class
majority_class = (
    y_train.mean().round()
)  # only works for binary classification with 0/1 labels, otherwise compute the mode
pred_benchmark_test = majority_class.repeat(len(y_test)).unsqueeze(1)
acc_benchmark_test = (y_test == pred_benchmark_test).sum().float() / len(y_test)

print("Benchmark accuracy on test set:", acc_benchmark_test.item())
print("Trained model accuracy on test set:", acc_test.item())
if acc_test.item() > acc_benchmark_test.item():
    print("Model outperforms benchmark.")
else:
    print("Model does not outperform benchmark.")
```

%% Cell type:markdown id: tags:

## Task

Currently, we use (full-batch) gradient descent to train the model.
Now, we want to implement *minibatch gradient descent*, a computationally more efficient version of stochastic gradient descent: compute the gradients on a *batch* of *b* observations (e.g. *batch_size=32*) and use the mean of those gradients to update the parameters. For each epoch, randomly shuffle the data and then process the entire batched dataset.
- Start with stochastic gradient descent (which is equivalent to minibatch gradient descent with *batch_size=1*).
- Then adapt your code to work for *batch_size>1* (if it does not already). You could use torch.split to create the chunks.
- *Note*: Your loss should still be monitored per epoch, not per batch.
- *Check*: How do model performance and training time change?
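
A minimal sketch of the chunking step with torch.split, on a toy tensor (names and sizes are illustrative only):

%% Cell type:code id: tags:

``` python
import torch

X_toy = torch.arange(10).float().unsqueeze(1)  # 10 toy observations, 1 feature
idx = torch.randperm(len(X_toy))               # shuffle once per epoch
batches = torch.split(X_toy[idx], 4)           # chunks of size 4; last chunk may be smaller
print([len(b) for b in batches])               # [4, 4, 2]
```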


%% Cell type:markdown id: tags:

## Example Solution

*Note*:
- Typically, one would only move the individual batches to the GPU instead of the entire dataset.
- This implementation is pretty slow.
- Training time: Here, training requires fewer epochs, but is not faster in wall-clock time than standard gradient descent.
- Performance: Here, the optimal solution does not outperform gradient descent.

The choice between gradient methods depends on the exact setting. If you expect many local minima and saddle points, or work on very large datasets, SGD / minibatch GD is probably preferable to standard GD.

%% Cell type:code id: tags:

``` python
# Init params for early stopping
early_stopping_epochs = (
    500  # max number of epochs that loss_test can be larger than lowest_test_loss
)
early_stopping_counter = 0
lowest_test_loss = np.inf  # large initial value for first iteration

# reinit model (otherwise we would continue training)
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)

# move model to device
model.to(device)

# training params
n_epochs = 2500  # number of times to iterate through the complete training dataset
learning_rate = 1e-3  # update step of optimizer
loss_function = torch.nn.BCELoss()  # Loss function for training
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5
)  # optimizer. We use an advanced gradient based method

# batch size and number
batch_size = 32
n_batches = int(np.ceil(len(X_train) / batch_size))

# Training
for t in range(n_epochs):

    # shuffle training data
    idx = torch.randperm(X_train.size(0))
    X_train = X_train[idx]
    y_train = y_train[idx]

    # loss and accuracy for this epoch
    loss_batch = torch.zeros(n_batches)
    acc_batch = torch.zeros(n_batches)
    loss_test_batch = torch.zeros(n_batches)
    acc_test_batch = torch.zeros(n_batches)

    for batch in range(n_batches):
        # select batch after shuffling (slicing clamps at the array end automatically)
        X_train_batch = X_train[batch * batch_size : (batch + 1) * batch_size]
        y_train_batch = y_train[batch * batch_size : (batch + 1) * batch_size]

        # Feed forward to get the predicted probabilities
        y_pred_proba = model(X_train_batch)

        # Compute the loss on the training
        loss = loss_function(y_pred_proba, y_train_batch)

        # Compute additional loss called accuracy: How many predictions did we get right?
        y_pred = torch.round(y_pred_proba)  # convert to 0/1 using .5 as threshold
        acc = (y_train_batch == y_pred).sum().float() / len(y_train_batch)

        # Compute loss on the test set (no gradients needed for evaluation)
        with torch.no_grad():
            y_pred_proba_test = model(X_test)
            loss_test = loss_function(y_pred_proba_test, y_test)
            y_pred_test = torch.round(
                y_pred_proba_test
            )  # convert to 0/1 using .5 as threshold
            acc_test = (y_test == y_pred_test).sum().float() / len(y_test)

        # add loss and accuracy to epoch values
        loss_batch[batch] = loss
        acc_batch[batch] = acc
        loss_test_batch[batch] = loss_test
        acc_test_batch[batch] = acc_test

        # zero the gradients before running
        # the backward pass.
        optimizer.zero_grad()

        # Backward pass to compute the gradient
        # of loss w.r.t our learnable params (=Backpropagation)
        loss.backward()

        # Update params based on the gradients and the optimizer
        optimizer.step()

    # Print training progress
    loss = loss_batch.mean()
    acc = acc_batch.mean()
    loss_test = loss_test_batch.mean()
    acc_test = acc_test_batch.mean()
    print(
        "[EPOCH]: %i, [LOSS]: %.6f, [LOSS TEST]: %.6f, [ACCURACY]: %.3f, [ACCURACY TEST]: %.3f"
        % (t, loss, loss_test, acc, acc_test)
    )

    # Early stopping
    # Increase counter if test loss increases, otherwise reset
    if loss_test > lowest_test_loss:
        early_stopping_counter += 1
    else:
        early_stopping_counter = 0
        lowest_test_loss = loss_test
    # Break if counter >= early_stopping_epochs
    if early_stopping_counter >= early_stopping_epochs:
        print("Early stopping at epoch", t)
        break
```

%% Cell type:markdown id: tags:

## Task*

How well can you train your model? Go crazy!
- Experiment with input normalization, architecture, regularization, additional features in the data, different optimization procedures.
- Check the pytorch documentation to find new cool functions: https://pytorch.org/docs/stable/index.html and https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html

*Note*: If you try many combinations of hyperparameters, there is a risk of overfitting to the test set by hyperparameter choice.
Use an additional hold-out sample for the final evaluation (train, validation for hyperparameter selection, test).
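
A minimal sketch of such a three-way split with pandas (variable names and the 60/20/20 ratios are just one possible choice):

%% Cell type:code id: tags:

``` python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100)})          # toy data
shuffled = df.sample(frac=1.0, random_state=42)   # shuffle all rows once
n = len(shuffled)
train = shuffled.iloc[: int(0.6 * n)]             # 60% for training
val = shuffled.iloc[int(0.6 * n) : int(0.8 * n)]  # 20% for hyperparameter selection
test = shuffled.iloc[int(0.8 * n) :]              # 20% held out for final evaluation
print(len(train), len(val), len(test))            # 60 20 20
```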

%% Cell type:markdown id: tags:

# Additional References

**How do neural networks learn?**:
See this cool visualization https://playground.tensorflow.org/

**For eager programmers**:
If you want to explore gradient flow and how to build your own neural network more manually, check some of these references:
- Torch but more manual https://github.com/karpathy/nn-zero-to-hero/tree/master (e.g. lecture 2 and 3)
- Without torch but very simple network structure https://github.com/JLDC/Data-Science-Fundamentals/blob/master/notebooks/205_my-own-neural-network-1.ipynb
- Torch for perceptron https://github.com/rasbt/deeplearning-models/blob/master/pytorch_ipynb/basic-ml/perceptron.ipynb