Commit 12688760 authored by Erik Senn
%% Cell type:markdown id: tags:

# Setup and data

GPU required? No

%% Cell type:code id: tags:

``` python
# Imports (note that you also need imports from the .py function files)
import numpy as np
import re  # regex
import torch  # PyTorch / ML tool
from transformers import BertTokenizer  # load tokenizer
from transformers import BertModel
from collections import OrderedDict
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

from utils import *  # make sure this file is in the same directory
```

%% Cell type:code id: tags:

``` python
# Data
datapath = "../data/"

with open(datapath + "the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])
```

%% Cell type:markdown id: tags:

# Text Preprocessing

*Source:* The code is partly adapted from https://github.com/rasbt/LLM-workshop-2024/blob/main/02_data/02.ipynb

%% Cell type:markdown id: tags:

## Tokenizer

%% Cell type:markdown id: tags:

### From scratch
**Let's build a simple tokenizer** ourselves, where every word and punctuation mark in the text corresponds to a token.

%% Cell type:code id: tags:

``` python
# Split text into example tokens: separate words and punctuation
tokens = [item for item in re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) if item]
print(raw_text[:57])
print(tokens[:20])
```

%% Cell type:code id: tags:

``` python
# How many tokens are there?
print("Number of tokens in text:", len(tokens))

# How many unique tokens are there?
unique_tokens = sorted(set(tokens))
print("Number of unique tokens:", len(unique_tokens))
```

%% Cell type:markdown id: tags:

Investigate the vocabulary

%% Cell type:code id: tags:

``` python
# Assign IDs to tokens (=create a vocabulary)
vocab = {token: integer for integer, token in enumerate(unique_tokens)}

# Show first entries
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break
```

%% Cell type:markdown id: tags:

Build our tokenizer class

%% Cell type:code id: tags:

``` python
# Build a tokenizer class
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

    # optional function to showcase the tokens before transforming to IDs.
    def tokenize(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        return preprocessed
```

%% Cell type:markdown id: tags:

Test it

%% Cell type:code id: tags:

``` python
# Encode and decode an example text
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
print("Original text:", text)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
ids = tokenizer.encode(text)
print("Token IDs:", ids)
reconstructed_text = tokenizer.decode(ids)
print("Reconstructed text:", reconstructed_text)
```

%% Cell type:markdown id: tags:

**Out of vocabulary issue**

What happens if we use new text that contains previously unseen words?

%% Cell type:code id: tags:

``` python
# Out-of-vocabulary issue
try:
    text = """LLMs are fast and efficient models for NLP tasks."""
    print("Original text:", text)
    tokens = tokenizer.tokenize(text)
    print("Tokens:", tokens)
    ids = tokenizer.encode(text)
    print("Token IDs:", ids)
    reconstructed_text = tokenizer.decode(ids)
    print("Reconstructed text:", reconstructed_text)
except Exception as e:
    print("Error:", e)
```

%% Cell type:markdown id: tags:

#### Task

Handle the out of vocabulary issue by using a specific token:

Modify your vocabulary and/or tokenizer and/or input text to decode/encode **unknown words** using a new specific token \<unk\>.

%% Cell type:code id: tags:

``` python
# cell(s) for your solution
```

%% Cell type:markdown id: tags:

#### Example Solution

%% Cell type:code id: tags:

``` python
unknown_token = {"<unk>": len(vocab)}  # add token for unknown to vocab: last index
vocab.update(unknown_token)
```

%% Cell type:code id: tags:

``` python
class SimpleTokenizerV2:
    def __init__(self, vocab, unknown_token):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}
        self.unknown_token = unknown_token

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            s if s in self.str_to_int else list(self.unknown_token.keys())[0]
            for s in preprocessed
        ]  # check if words are known. If not, replace by the unknown token
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

    # optional function to showcase the tokens before transforming to IDs.
    def tokenize(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            s if s in self.str_to_int else list(self.unknown_token.keys())[0]
            for s in preprocessed
        ]  # check if words are known. If not, replace by the unknown token
        return preprocessed
```

%% Cell type:code id: tags:

``` python
tokenizer = SimpleTokenizerV2(vocab, unknown_token)

# Out of vocab example
text = """I love LLMs because they know english better than me."""
print("Original text:", text)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
ids = tokenizer.encode(text)
print("Token IDs:", ids)
reconstructed_text = tokenizer.decode(ids)
print("Reconstructed text:", reconstructed_text)
```

%% Cell type:markdown id: tags:

### Pretrained
Now, we load the pretrained tokenizer of a real LLM.

%% Cell type:markdown id: tags:

Let's load the tokenizer and have a look at some elements of the vocabulary:

%% Cell type:code id: tags:

``` python
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=True
)
print("Size of Vocabulary:", len(tokenizer.vocab))
OrderedDict(list(tokenizer.vocab.items())[::1000])
```

%% Cell type:markdown id: tags:

**Encode and decode** using the loaded tokenizer.

%% Cell type:code id: tags:

``` python
text = "Does Donald J. Trump have a better golf handicap than Biden?"
print("Original text:", text)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
ids = tokenizer.encode(text)  # Automatically adds special tokens at start and end
print("Token IDs:", ids)
reconstructed_text = tokenizer.decode(ids)
print("Reconstructed text:", reconstructed_text)
```

%% Cell type:markdown id: tags:

## Token Embeddings

%% Cell type:markdown id: tags:

### From scratch

%% Cell type:markdown id: tags:

Initialize the token embedding layer randomly using a torch embedding layer.


*Note*: This builds a standard matrix with random entries that can be used as a deep learning component later (e.g., gradients can be computed for it). If you are curious, check out the optional notebook "optional_tensor_intro".

%% Cell type:code id: tags:

``` python
vocab_size = 100  # example
output_dim = 10  # example

torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(token_embedding_layer.weight)
```

%% Cell type:markdown id: tags:

Create the token embeddings of an example document

%% Cell type:code id: tags:

``` python
ids = torch.tensor([2, 3, 5, 1])
token_embedding = token_embedding_layer(ids)
print(token_embedding.shape)
print(token_embedding)
```

%% Cell type:markdown id: tags:

Visualize the embedding vectors of the tokens.

The embedding of each token can be interpreted as a vector that points in a direction.

%% Cell type:code id: tags:

``` python
token_embedding_vectors = [
    token_embedding[i].detach().numpy() for i in range(token_embedding.shape[0])
]
plot_embedding_pca(
    token_embedding_vectors,
    labels=[f"Token{i}" for i in range(len(token_embedding_vectors))],
)
```

%% Cell type:markdown id: tags:

These embeddings are meaningless, as the token embeddings are **randomly initialized and not trained**.

In an LLM, token embeddings are trained as part of the training process.
Therefore, let's look at some pretrained token embeddings.

%% Cell type:markdown id: tags:

### Pretrained

%% Cell type:code id: tags:

``` python
model = BertModel.from_pretrained("bert-base-uncased")

# Token embeddings
token_embedding = model.embeddings.word_embeddings.weight
print("Token Embeddings:", token_embedding.shape)
print("Token Embeddings:", token_embedding)
```

%% Cell type:markdown id: tags:

Visualize the embedding vectors of selected words.

The embedding of each token can be interpreted as a vector that points in a direction.

%% Cell type:code id: tags:

``` python
# select some words you are interested in (need to be in the vocab)
words = [
    "king",
    "queen",
    "man",
    "woman",
    "dog",
    "cat",
    "apple",
    "orange",
    "car",
    "bike",
    "criminal",
    "president",
]
#  '[UNK]', '[CLS]'

# select embeddings for the selected words
vocab_selected = {
    item[0]: item[1] for item in tokenizer.vocab.items() if item[0] in words
}

# Note what tokens were not available
missing_tokens = set(words) - set(vocab_selected.keys())
if len(missing_tokens) > 0:
    print("Non-existing Tokens:", missing_tokens)

token_embedding_vectors = [
    token_embedding[list(vocab_selected.values())][i].detach().numpy()
    for i in range(len(vocab_selected))
]

plot_embedding_pca(token_embedding_vectors, labels=list(vocab_selected.keys()))
```

%% Cell type:markdown id: tags:

#### Task
Experiment with the visualization of token embeddings of some words.
- Do similar words have more similar token embeddings?
- Look at the difference between the word pairs "king"/"queen" and "man"/"woman". What do you notice?
- Test some tokens you are interested in. (Make sure they are part of the vocab)
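
%% Cell type:markdown id: tags:

To make "more similar" concrete, you can compare embeddings with cosine similarity. The helper below is a minimal sketch (the function `cosine_similarity` is our own, not from `utils`); the toy vectors only illustrate the idea, with BERT you would pass rows of the pretrained `token_embedding` matrix from above, e.g. `token_embedding[vocab_selected["king"]].detach().numpy()`.

%% Cell type:code id: tags:

``` python
import numpy as np


def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors: 1 = same direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# toy 2-d vectors for illustration only
v_king = np.array([1.0, 0.9])
v_queen = np.array([0.9, 1.0])
v_apple = np.array([-1.0, 0.2])

print(cosine_similarity(v_king, v_queen))  # close to 1: similar direction
print(cosine_similarity(v_king, v_apple))  # much smaller
```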

%% Cell type:markdown id: tags:

#### Task*

Manually compute the input embeddings for a text document in two ways (without positional embeddings):
   1) **By an index lookup**: Implement manually what the torch embedding layer does - for each token-id, look up the corresponding row in the token embedding matrix to construct the token embeddings.
   2) **By matrix multiplication**: Transform each token-id into a one-hot-encoded vector of length $v$ (the vocabulary size), with 0s everywhere and a 1 only at the position of the corresponding token. Multiply the one-hot-encoded representation of the document (it's a matrix!) with the embedding matrix.

   Are the results of 1. and 2. equivalent? Which approach do you prefer?
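
%% Cell type:markdown id: tags:

#### Example Solution

One possible solution sketch, using numpy and a small random matrix in place of a trained embedding matrix (with the torch layer from earlier you would use `token_embedding_layer.weight` instead):

%% Cell type:code id: tags:

``` python
import numpy as np

rng = np.random.default_rng(123)
vocab_size, output_dim = 8, 4
E = rng.normal(size=(vocab_size, output_dim))  # embedding matrix (vocab_size x output_dim)
ids = [2, 3, 5, 1]  # token ids of an example document

# 1) Index lookup: pick the row of E for each token id
emb_lookup = E[ids]

# 2) One-hot encoding of the document (len(ids) x vocab_size), then matrix multiplication
one_hot = np.zeros((len(ids), vocab_size))
one_hot[np.arange(len(ids)), ids] = 1.0
emb_matmul = one_hot @ E

print(np.allclose(emb_lookup, emb_matmul))  # True: both approaches are equivalent
```

The results are identical; the index lookup is preferable in practice, since it avoids building the large, mostly-zero one-hot matrix.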

%% Cell type:markdown id: tags:

## Positional embeddings

%% Cell type:markdown id: tags:

### From scratch

%% Cell type:markdown id: tags:

Randomly initialize a positional embedding layer:

%% Cell type:code id: tags:

``` python
max_length = 4  # max length of a text document (small for illustration purposes)
output_dim = 10  # (small for illustration purposes)

pos_embedding_layer = torch.nn.Embedding(max_length, output_dim)
print(pos_embedding_layer)
print(pos_embedding_layer.weight)
```

%% Cell type:markdown id: tags:

Create the positional embeddings for an input sequence

%% Cell type:code id: tags:

``` python
pos_embedding = pos_embedding_layer(torch.arange(max_length))
print(pos_embedding.shape)
print(pos_embedding)
```

%% Cell type:code id: tags:

``` python
position = range(max_length)
pos_embedding_vectors = [pos_embedding[i].detach().numpy() for i in position]
plot_embedding_pca(
    pos_embedding_vectors,
    labels=[f"Position {i}" for i in range(len(pos_embedding_vectors))],
)
```

%% Cell type:markdown id: tags:

### Pretrained

%% Cell type:code id: tags:

``` python
model = BertModel.from_pretrained("bert-base-uncased")

# Positional embeddings
pos_embedding = (
    model.embeddings.position_embeddings.weight
)  # access positional embeddings in the model object
print("Positional Embeddings Size:", pos_embedding.shape)
print("Positional Embeddings:", pos_embedding)
```

%% Cell type:code id: tags:

``` python
position = [1, 2, 3, 4, 5, 10, 25, 50, 100, 200, 300, 511]  # note: 0 is always CLS
pos_embedding_vectors = [pos_embedding[i].detach().numpy() for i in position]
plot_embedding_pca(pos_embedding_vectors, labels=[f"Position {i}" for i in position])
```

%% Cell type:markdown id: tags:

## Combining it all: construct input embeddings

%% Cell type:code id: tags:

``` python
# Example text
text = "My father gave me a small loan of a million dollars."

# Load tokenizer and embedding layers/matrices
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=True
)
model = BertModel.from_pretrained("bert-base-uncased")
token_embedding_layer = model.embeddings.word_embeddings
pos_embedding_layer = model.embeddings.position_embeddings

# Create embeddings for the example text
ids = tokenizer.encode(text)
token_embedding = token_embedding_layer(torch.tensor(ids))
pos_embedding = pos_embedding_layer(torch.arange(len(token_embedding)))

input_embedding = (
    token_embedding + pos_embedding
)  # absolute additive positional embedding
print(input_embedding.shape)
print(input_embedding)
```

%% Cell type:code id: tags:

``` python
# Visualize the input embedding
# remove CLS and SEP token for simplicity
input_embedding = input_embedding[1:-1]
tokens = tokenizer.convert_ids_to_tokens(ids)[1:-1]
input_embedding_vectors = [
    input_embedding[i].detach().numpy() for i in range(len(input_embedding))
]

plot_embedding_pca(input_embedding_vectors, tokens)
```

%% Cell type:markdown id: tags:

**Influence of positional embeddings**

We can now further investigate the influence of positional embeddings on the final input embedding of a token by plotting the input representation of a token for different positions:

For the words that switch position between two sentences, the final embedding differs slightly due to the positional embeddings.
This is how positional embeddings preserve information about word order.

%% Cell type:code id: tags:

``` python
# Example text
text1 = "The cat chased the dog."
text2 = "The dog chased the cat."
text3 = "She only likes him."
text4 = "Only she likes him."
texts = [text1, text2, text3, text4]

# Load tokenizer and embedding layers/matrices (once, outside the loop)
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=True
)
model = BertModel.from_pretrained("bert-base-uncased")
token_embedding_layer = model.embeddings.word_embeddings
pos_embedding_layer = model.embeddings.position_embeddings

for text in texts:

    # Create embeddings for the example text
    ids = tokenizer.encode(text)
    token_embedding = token_embedding_layer(torch.tensor(ids))
    pos_embedding = pos_embedding_layer(torch.arange(len(token_embedding)))

    input_embedding = (
        token_embedding + pos_embedding
    )  # absolute additive positional embedding

    # Visualize the input embedding
    # remove CLS and SEP token for simplicity
    input_embedding = input_embedding[1:-1]
    tokens = tokenizer.convert_ids_to_tokens(ids)[1:-1]
    input_embedding_vectors = [
        input_embedding[i].detach().numpy() for i in range(len(input_embedding))
    ]
    print(text)
    plot_embedding_pca(input_embedding_vectors, tokens)
```

%% Cell type:markdown id: tags:

### Task*
Write a neat function / class to directly transform text into its input embeddings.
- If you want, you can allow for token embeddings and positional embeddings to be flexible inputs.
- You can fix the tokenizer for simplicity.
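
%% Cell type:markdown id: tags:

### Example Solution

One possible solution sketch. The tokenizer's encode function and the two embedding layers are passed in as arguments (the function and argument names are our own choice), so the same function works with the BERT components loaded earlier or with any other torch embedding layers:

%% Cell type:code id: tags:

``` python
import torch


def text_to_input_embeddings(text, encode_fn, token_emb_layer, pos_emb_layer):
    """Transform text into input embeddings: token embeddings plus
    absolute additive positional embeddings."""
    ids = torch.tensor(encode_fn(text))
    token_emb = token_emb_layer(ids)
    pos_emb = pos_emb_layer(torch.arange(len(ids)))
    return token_emb + pos_emb


# Usage with BERT (as loaded earlier in the notebook):
# emb = text_to_input_embeddings(text, tokenizer.encode,
#                                model.embeddings.word_embeddings,
#                                model.embeddings.position_embeddings)

# Self-contained check with random layers and a hypothetical whitespace "tokenizer":
torch.manual_seed(123)
tok = torch.nn.Embedding(100, 10)
pos = torch.nn.Embedding(20, 10)
toy_encode = lambda s: [hash(w) % 100 for w in s.split()]
emb = text_to_input_embeddings("a small example", toy_encode, tok, pos)
print(emb.shape)  # torch.Size([3, 10])
```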