Commit 195070fd authored by Erik Senn

Upload New File
%% Cell type:markdown id: tags:

# Setup and data

GPU required? No (but speeds things up)

%% Cell type:code id: tags:

``` python
# Imports (note that you also need imports from the .py function files)
import numpy as np
import pandas as pd
import torch  # PyTorch / ML tool
from IPython.display import Markdown, display

# from utils import *  # ensure to have this file in the same directory

device = torch.device(
    "cuda:0" if torch.cuda.is_available() else "cpu"
)  # device for torch
```

%% Cell type:code id: tags:

``` python
# Data
datapath = "../data/"

# read csv
data = pd.read_csv(datapath + "email_spam.csv")

# drop the leftover index column if present
data = data.drop(columns=["Unnamed: 0"], errors="ignore")
```

%% Cell type:markdown id: tags:

# Neural Network for Spam Classification
*Note*: NN code is partly adapted from https://github.com/Atcold/NYU-DLSP20/blob/master/04-spiral_classification.ipynb

We want to use a feed-forward neural network to build a **spam filter for emails**, using past data of spam messages and emails.

%% Cell type:markdown id: tags:

## Data Preparation

%% Cell type:markdown id: tags:

First, let's look at the data:
- *Label* means outcome / target variable.

%% Cell type:code id: tags:

``` python
print(data.head())


display(Markdown("**No Spam Example**:"))
print(data[data["label_num"] == 0].iloc[2]["text"][:1000] + "\n")

display(Markdown("**Spam Example**:"))
print(data[data["label_num"] == 1].iloc[5]["text"][:1000])
```

%% Cell type:markdown id: tags:

- The column *label_num* already contains our outcome vector $\mathbf{y}$, with each element either 0 = no spam (ham) or 1 = spam.
- We do not yet have any numerical features $\mathbf{X}$ to predict spam.
So let's **create** some potentially relevant **features** from the text that might predict spam:
  - Length of the email (number of words)
  - Usage of certain words (word_freq_{word}), e.g. the words "call" or "money" might be more common in spam emails.

*Note*: In later stages of the class we can also use a language model's numerical representation of the texts here! For now, we only work with standard numerical variables.

%% Cell type:code id: tags:

``` python
# Create potentially relevant numerical features

# length of email
data["n_words"] = data["text"].apply(lambda x: len(x.split()))

# list of words that might distinguish spam from ok emails (ham)
potentially_relevant_words = [
    "call",
    "free",
    "urgent",
    "money",
    "mobile",
    "text",
    "reply",
    "win",
    "prize",
    "cash",
    "txt",
    "go",
    "get",
    "u",
    "come",
    "ok",
]

for word in potentially_relevant_words:
    data["word_freq_" + word] = data["text"].apply(
        lambda x: x.lower().split().count(word)  # lowercase the text so counts are case-insensitive
    )

# keep only numerical columns
data = data.select_dtypes(include=[np.number])

# descriptive statistics: are means different for spam and ham?
# if yes, that is an indicator that this variable might be useful
data.groupby("label_num").mean()
```

%% Cell type:markdown id: tags:

Split into a training and test set.

%% Cell type:code id: tags:

``` python
# Train-test split
share_data_train = 0.8
train_idx = data.sample(
    frac=share_data_train, random_state=42
).index  # index labels for sample splitting

data_train = data.loc[train_idx]  # select by label, consistent with .drop below
data_test = data.drop(train_idx)

# Split into X and y, convert to torch tensors
label_column = "label_num"
X_train, y_train = data_train.drop(columns=[label_column]), data_train[label_column]
X_test, y_test = data_test.drop(columns=[label_column]), data_test[label_column]
```

%% Cell type:markdown id: tags:

Format the data inputs to our model for training using torch:
- torch.Tensors are basically np.arrays that can additionally store and propagate gradients (see the optional notebook "optional_tensor_into" for details)
- .to(device) moves the data to the correct computing device: GPU or CPU

%% Cell type:code id: tags:

``` python
# format as torch.tensor and correct input formats
# y has to be N x 1, not N

# optional: standardize data
# X_train = (X_train - X_train.mean()) / X_train.std()
# X_test = (X_test - X_train.mean()) / X_train.std() # using mean and std from train.


X_train = torch.tensor(X_train.values).float().to(device)
y_train = torch.tensor(y_train.values).float().unsqueeze(1).to(device)
X_test = torch.tensor(X_test.values).float().to(device)
y_test = torch.tensor(y_test.values).float().unsqueeze(1).to(device)
```

%% Cell type:markdown id: tags:

## Build the neural network

%% Cell type:markdown id: tags:

We define the neural network architecture in the torch framework, using torch.nn.Sequential.

This creates a computational graph for the architecture we provide.
It also allows for gradient computation and parameter updates (see later).


**Think:**
- What should be in the input and output dimensions?
- How many layers does this neural network have?
- How many trainable parameters should the model have?
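
To check the last answer by hand: each torch.nn.Linear(in, out) layer has in·out weights plus out biases. A quick sanity check, assuming K = 17 features (1 length feature + 16 word counts) and N_Z = 25 hidden units as used below:

%% Cell type:code id: tags:

``` python
# Worked parameter count for a 17 -> 25 -> 25 -> 1 network
# (assumed sizes; K depends on the features actually created above)
K, N_Z = 17, 25
n_params = (K * N_Z + N_Z) + (N_Z * N_Z + N_Z) + (N_Z * 1 + 1)
print(n_params)  # 1126
```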

%% Cell type:code id: tags:

``` python
K = X_train.shape[1]  # input
N_Z = 25  # num_hidden_units in hidden layer

# torch nn package to create our model
# each module has a weight and bias
# we add the activation function after the output linear layer
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)

# move model to the computing device: GPU or CPU
model.to(device)

# show model architecture
print(model)

# how many trainable parameters?
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Number of trainable parameters:", n_params)
```

%% Cell type:markdown id: tags:

Now, we specify the training-related parameters: number of training iterations, the loss function and the optimizer.

%% Cell type:code id: tags:

``` python
n_epochs = 2500  # number of times to iterate through the complete training dataset
learning_rate = 1e-3  # update step of optimizer
loss_function = torch.nn.BCELoss()  # Loss function for training
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5
)  # optimizer. We use an advanced gradient based method
```

%% Cell type:markdown id: tags:

## Train the neural network

%% Cell type:markdown id: tags:

Now, let's implement the training loop.

*Note*: In the training loop, in addition to our loss function, we also compute **accuracy** (one minus the average 0/1 loss) as a measure of classification performance, because often this is what we are interested in:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\sum_{i=1}^{N} \mathbb{1}_{\hat y_i = y_i}}{N}$$
- TP = True positives, TN = True negatives, FP = False positives, FN = False negatives.
- In words, accuracy is the share of correctly classified labels (after applying a threshold on the predicted probabilities, e.g. .5).

We cannot directly use accuracy as the loss function in the optimization because accuracy is non-differentiable :(
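
A quick sketch of why, using only standard torch behavior: torch.round is piecewise constant, so its gradient with respect to the predicted probabilities is zero everywhere and provides no learning signal.

%% Cell type:code id: tags:

``` python
import torch  # already imported above; repeated so this cell is self-contained

# Thresholding kills the gradient: round() is flat almost everywhere
p = torch.tensor([0.2, 0.6, 0.9], requires_grad=True)
hard = torch.round(p)  # 0/1 predictions, piecewise constant in p
hard.sum().backward()
print(p.grad)  # tensor([0., 0., 0.]) -- no signal for gradient descent
```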


%% Cell type:code id: tags:

``` python
# Training
for t in range(n_epochs):

    # Feed forward to get the predicted probabilities
    y_pred_proba = model(X_train)

    # Compute the loss on the training
    loss = loss_function(y_pred_proba, y_train)

    # Compute additional loss called accuracy: How many predictions did we get right?
    y_pred = torch.round(y_pred_proba)  # convert to 0/1 using .5 as threshold
    acc = (y_train == y_pred).sum().float() / len(y_train)

    # Print training progress
    print("[EPOCH]: %i, [LOSS]: %.6f, [ACCURACY]: %.3f" % (t, loss.item(), acc))

    # zero the gradients before running
    # the backward pass.
    optimizer.zero_grad()

    # Backward pass to compute the gradient
    # of loss w.r.t our learnable params (=Backpropagation)
    loss.backward()

    # Update params based on the gradients and the optimizer
    optimizer.step()
```

%% Cell type:markdown id: tags:

## Task
- Play around with the architecture of the network (number of layers, activations, number of neurons). How does it impact model performance?
- Increase the number of epochs and parameters (by a lot). What happens to the loss after training compared to the previous settings?
- Currently, we only look at loss on the *training* set. This does not take into account overfitting issues!
  1) Compute and print the loss and accuracy also on the *test* set within the training loop. Does your model overfit?
  2) **Early stopping** against overfitting: Implement a procedure in the training loop that terminates the loop (break) when the loss on the *test* set has not improved for a while (e.g. has not beaten the best test loss for 500 epochs).
- **Evaluation**: How good do you think your model is in terms of accuracy? Is it better than a naive benchmark model that classifies every email as *no spam*?

%% Cell type:markdown id: tags:

## Example Solution

%% Cell type:code id: tags:

``` python
# Init params for early stopping
early_stopping_epochs = (
    500  # max number of epochs that loss_test can be larger than lowest_test_loss
)
early_stopping_counter = 0
lowest_test_loss = np.inf  # large initial value for first iteration

# reinit model (otherwise we would continue training)
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)

# move model to device
model.to(device)

# training params
n_epochs = 2500  # number of times to iterate through the complete training dataset
learning_rate = 1e-3  # update step of optimizer
loss_function = torch.nn.BCELoss()  # Loss function for training
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5
)  # optimizer. We use an advanced gradient based method

# Training
for t in range(n_epochs):

    # Feed forward to get the predicted probabilities
    y_pred_proba = model(X_train)

    # Compute the loss on the training
    loss = loss_function(y_pred_proba, y_train)

    # Compute additional loss called accuracy: How many predictions did we get right?
    y_pred = torch.round(y_pred_proba)  # convert to 0/1 using .5 as threshold
    acc = (y_train == y_pred).sum().float() / len(y_train)

    # Compute loss on the test set (no gradients needed for evaluation)
    with torch.no_grad():
        y_pred_proba_test = model(X_test)
        loss_test = loss_function(y_pred_proba_test, y_test)
        y_pred_test = torch.round(y_pred_proba_test)  # convert to 0/1 using .5 as threshold
        acc_test = (y_test == y_pred_test).sum().float() / len(y_test)

    # Print training progress
    print(
        "[EPOCH]: %i, [LOSS]: %.6f, [LOSS TEST]: %.6f, [ACCURACY]: %.3f, [ACCURACY TEST]: %.3f"
        % (t, loss.item(), loss_test.item(), acc, acc_test)
    )

    # Early stopping
    # Increase counter if test loss increases, otherwise reset
    if loss_test > lowest_test_loss:
        early_stopping_counter += 1
    else:
        early_stopping_counter = 0
        lowest_test_loss = loss_test
    # Break if counter >= early_stopping_epochs
    if early_stopping_counter >= early_stopping_epochs:
        print("Early stopping at epoch", t)
        break

    # zero the gradients before running
    # the backward pass.
    optimizer.zero_grad()

    # Backward pass to compute the gradient
    # of loss w.r.t our learnable params (=Backpropagation)
    loss.backward()

    # Update params based on the gradients and the optimizer
    optimizer.step()
```

%% Cell type:code id: tags:

``` python
# Naive benchmark on test set: always predict the majority class
majority_class = (
    y_train.mean().round()
)  # only works for binary classification with 0/1 labels, otherwise compute the mode
pred_benchmark_test = majority_class.repeat(len(y_test)).unsqueeze(1)
acc_benchmark_test = (y_test == pred_benchmark_test).sum().float() / len(y_test)

print("Benchmark accuracy on test set:", acc_benchmark_test.item())
print("Trained model accuracy on test set:", acc_test.item())
if acc_test.item() > acc_benchmark_test.item():
    print("Model outperforms benchmark.")
else:
    print("Model does not outperform benchmark.")
```

%% Cell type:markdown id: tags:

## Task

Currently, we use (full-batch) gradient descent to train the model.
Now, we want to implement *minibatch gradient descent*, a computationally more efficient version of stochastic gradient descent: compute the gradients on a *batch* of *b* observations (e.g. *batch_size=32*) and use the mean of those gradients to update the parameters. For each epoch, randomly shuffle the data and then process the entire batched dataset.
- Start with stochastic gradient descent (which is equivalent to minibatch gradient descent with *batch_size=1*).
- Then adapt your code to work for *batch_size>1* (if it does not already). You could use torch.split to create the chunks.
- *Note*: Your loss should still be monitored per epoch, not per batch.
- *Check*: How do model performance and training time change?
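
A minimal sketch of the chunking step with torch.split, on a toy tensor (names and sizes are illustrative only):

%% Cell type:code id: tags:

``` python
import torch

X_toy = torch.arange(10).float().unsqueeze(1)  # 10 toy observations, 1 feature
idx = torch.randperm(len(X_toy))               # shuffle once per epoch
batches = torch.split(X_toy[idx], 4)           # chunks of size 4; last chunk may be smaller
print([len(b) for b in batches])               # [4, 4, 2]
```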


%% Cell type:markdown id: tags:

## Example Solution

*Note*:
- Typically, one would only move the individual batches to the GPU instead of the entire dataset.
- This implementation is pretty slow.
- Training time: Here, training requires fewer epochs, but is not faster in wall-clock time than standard gradient descent.
- Performance: Here, the optimal solution does not outperform gradient descent.

The choice between gradient methods depends on the exact setting. If you expect many local minima and saddle points, or work on very large datasets, SGD / minibatch GD is probably preferable to standard GD.

%% Cell type:code id: tags:

``` python
# Init params for early stopping
early_stopping_epochs = (
    500  # max number of epochs that loss_test can be larger than lowest_test_loss
)
early_stopping_counter = 0
lowest_test_loss = np.inf  # large initial value for first iteration

# reinit model (otherwise we would continue training)
model = torch.nn.Sequential(
    torch.nn.Linear(K, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, N_Z),
    torch.nn.ReLU(),
    torch.nn.Linear(N_Z, 1),
    torch.nn.Sigmoid(),
)

# move model to device
model.to(device)

# training params
n_epochs = 2500  # number of times to iterate through the complete training dataset
learning_rate = 1e-3  # update step of optimizer
loss_function = torch.nn.BCELoss()  # Loss function for training
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5
)  # optimizer. We use an advanced gradient based method

# batch size and number
batch_size = 32
n_batches = int(np.ceil(len(X_train) / batch_size))

# Training
for t in range(n_epochs):

    # shuffle training data
    idx = torch.randperm(X_train.size(0))
    X_train = X_train[idx]
    y_train = y_train[idx]

    # loss and accuracy for this epoch
    loss_batch = torch.zeros(n_batches)
    acc_batch = torch.zeros(n_batches)
    loss_test_batch = torch.zeros(n_batches)
    acc_test_batch = torch.zeros(n_batches)

    for batch in range(n_batches):
        # select batch after shuffling (slicing clamps at the array end automatically)
        X_train_batch = X_train[batch * batch_size : (batch + 1) * batch_size]
        y_train_batch = y_train[batch * batch_size : (batch + 1) * batch_size]

        # Feed forward to get the predicted probabilities
        y_pred_proba = model(X_train_batch)

        # Compute the loss on the training
        loss = loss_function(y_pred_proba, y_train_batch)

        # Compute additional loss called accuracy: How many predictions did we get right?
        y_pred = torch.round(y_pred_proba)  # convert to 0/1 using .5 as threshold
        acc = (y_train_batch == y_pred).sum().float() / len(y_train_batch)

        # Compute loss on the test set (no gradients needed for evaluation)
        with torch.no_grad():
            y_pred_proba_test = model(X_test)
            loss_test = loss_function(y_pred_proba_test, y_test)
            y_pred_test = torch.round(
                y_pred_proba_test
            )  # convert to 0/1 using .5 as threshold
            acc_test = (y_test == y_pred_test).sum().float() / len(y_test)

        # add loss and accuracy to epoch values
        loss_batch[batch] = loss
        acc_batch[batch] = acc
        loss_test_batch[batch] = loss_test
        acc_test_batch[batch] = acc_test

        # zero the gradients before running
        # the backward pass.
        optimizer.zero_grad()

        # Backward pass to compute the gradient
        # of loss w.r.t our learnable params (=Backpropagation)
        loss.backward()

        # Update params based on the gradients and the optimizer
        optimizer.step()

    # Print training progress
    loss = loss_batch.mean()
    acc = acc_batch.mean()
    loss_test = loss_test_batch.mean()
    acc_test = acc_test_batch.mean()
    print(
        "[EPOCH]: %i, [LOSS]: %.6f, [LOSS TEST]: %.6f, [ACCURACY]: %.3f, [ACCURACY TEST]: %.3f"
        % (t, loss, loss_test, acc, acc_test)
    )

    # Early stopping
    # Increase counter if test loss increases, otherwise reset
    if loss_test > lowest_test_loss:
        early_stopping_counter += 1
    else:
        early_stopping_counter = 0
        lowest_test_loss = loss_test
    # Break if counter >= early_stopping_epochs
    if early_stopping_counter >= early_stopping_epochs:
        print("Early stopping at epoch", t)
        break
```

%% Cell type:markdown id: tags:

## Task*

How well can you train your model? Go crazy!
- Experiment with input normalization, architecture, regularization, additional features in the data, different optimization procedures.
- Check the pytorch documentation to find new cool functions: https://pytorch.org/docs/stable/index.html and https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html

*Note*: If you try many combinations of hyperparameters, there is a risk of overfitting to the test set by hyperparameter choice.
Use an additional hold-out sample for the final evaluation (train, validation for hyperparameter selection, test).
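
A minimal sketch of such a three-way split with pandas (variable names and the 60/20/20 ratios are just one possible choice):

%% Cell type:code id: tags:

``` python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100)})          # toy data
shuffled = df.sample(frac=1.0, random_state=42)   # shuffle all rows once
n = len(shuffled)
train = shuffled.iloc[: int(0.6 * n)]             # 60% for training
val = shuffled.iloc[int(0.6 * n) : int(0.8 * n)]  # 20% for hyperparameter selection
test = shuffled.iloc[int(0.8 * n) :]              # 20% held out for final evaluation
print(len(train), len(val), len(test))            # 60 20 20
```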

%% Cell type:markdown id: tags:

# Additional References

**How do neural networks learn?**:
See this cool visualization https://playground.tensorflow.org/

**For eager programmers**:
If you want to explore gradient flow and how to build your own neural network more manually, check some of these references:
- Torch but more manual https://github.com/karpathy/nn-zero-to-hero/tree/master (e.g. lecture 2 and 3)
- Without torch but very simple network structure https://github.com/JLDC/Data-Science-Fundamentals/blob/master/notebooks/205_my-own-neural-network-1.ipynb
- Torch for perceptron https://github.com/rasbt/deeplearning-models/blob/master/pytorch_ipynb/basic-ml/perceptron.ipynb