Commit 12688760 authored by Erik Senn
%% Cell type:markdown id: tags:

# Setup and data

GPU required? No

%% Cell type:code id: tags:

``` python
# Imports (note that you also need imports from the .py function files)
import numpy as np
import re  # regex
import torch  # PyTorch / ML tool
from transformers import BertTokenizer  # load tokenizer
from transformers import BertModel
from collections import OrderedDict
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

from utils import *  # make sure this file is in the same directory
```

%% Cell type:code id: tags:

``` python
# Data
datapath = "../data/"

with open(datapath + "the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])
```

%% Cell type:markdown id: tags:

# Text Preprocessing

*Source:* The code is partly adapted from https://github.com/rasbt/LLM-workshop-2024/blob/main/02_data/02.ipynb

%% Cell type:markdown id: tags:

## Tokenizer

%% Cell type:markdown id: tags:

### From scratch
**Let's build a simple tokenizer** ourselves, where every word and punctuation mark in the text corresponds to a token.

%% Cell type:code id: tags:

``` python
# Split text into example tokens: separate words and punctuation
tokens = [item for item in re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) if item]
print(raw_text[:57])
print(tokens[:20])
```

%% Cell type:code id: tags:

``` python
# How many tokens are there?
print("Number of tokens in text:", len(tokens))

# How many unique tokens are there?
unique_tokens = sorted(set(tokens))
print("Number of unique tokens:", len(unique_tokens))
```

%% Cell type:markdown id: tags:

Investigate the vocabulary

%% Cell type:code id: tags:

``` python
# Assign IDs to tokens (=create a vocabulary)
vocab = {token: integer for integer, token in enumerate(unique_tokens)}

# Show first entries
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break
```

%% Cell type:markdown id: tags:

Build our tokenizer class

%% Cell type:code id: tags:

``` python
# Build a tokenizer class
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

    # optional function to showcase the tokens before transforming to IDs.
    def tokenize(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        return preprocessed
```

%% Cell type:markdown id: tags:

Test it

%% Cell type:code id: tags:

``` python
# Encode and decode an example text
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
print("Original text:", text)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
ids = tokenizer.encode(text)
print("Token IDs:", ids)
reconstructed_text = tokenizer.decode(ids)
print("Reconstructed text:", reconstructed_text)
```

%% Cell type:markdown id: tags:

**Out of vocabulary issue**

What happens if we use new text that contains previously unseen words?

%% Cell type:code id: tags:

``` python
# Out-of-vocabulary issue
try:
    text = """LLMs are fast and efficient models for NLP tasks."""
    print("Original text:", text)
    tokens = tokenizer.tokenize(text)
    print("Tokens:", tokens)
    ids = tokenizer.encode(text)
    print("Token IDs:", ids)
    reconstructed_text = tokenizer.decode(ids)
    print("Reconstructed text:", reconstructed_text)
except Exception as e:
    print("Error:", e)
```

%% Cell type:markdown id: tags:

#### Task

Handle the out of vocabulary issue by using a specific token:

Modify your vocabulary and/or tokenizer and/or input text to decode/encode **unknown words** using a new specific token \<unk\>.

%% Cell type:code id: tags:

``` python
# cell(s) for your solution
```

%% Cell type:markdown id: tags:

#### Example Solution

%% Cell type:code id: tags:

``` python
unknown_token = {"<unk>": len(vocab)}  # add token for unknown to vocab: last index
vocab.update(unknown_token)
```

%% Cell type:code id: tags:

``` python
class SimpleTokenizerV2:
    def __init__(self, vocab, unknown_token):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}
        self.unknown_token = unknown_token

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            s if s in self.str_to_int else list(self.unknown_token.keys())[0]
            for s in preprocessed
        ]  # check if words are known. If not, replace by the unknown token
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

    # optional function to showcase the tokens before transforming to IDs.
    def tokenize(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            s if s in self.str_to_int else list(self.unknown_token.keys())[0]
            for s in preprocessed
        ]  # check if words are known. If not, replace by the unknown token
        return preprocessed
```

%% Cell type:code id: tags:

``` python
tokenizer = SimpleTokenizerV2(vocab, unknown_token)

# Out of vocab example
text = """I love LLMs because they know english better than me."""
print("Original text:", text)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
ids = tokenizer.encode(text)
print("Token IDs:", ids)
reconstructed_text = tokenizer.decode(ids)
print("Reconstructed text:", reconstructed_text)
```

%% Cell type:markdown id: tags:

### Pretrained
Now, we load the pretrained tokenizer of a real LLM.

%% Cell type:markdown id: tags:

Let's load the tokenizer and have a look at some elements of the vocabulary:

%% Cell type:code id: tags:

``` python
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=True
)
print("Size of Vocabulary:", len(tokenizer.vocab))
OrderedDict(list(tokenizer.vocab.items())[::1000])
```

%% Cell type:markdown id: tags:

**Encode and decode** using the loaded tokenizer.

%% Cell type:code id: tags:

``` python
text = "Does Donald J. Trump have a better golf handicap than Biden?"
print("Original text:", text)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
ids = tokenizer.encode(text)  # Automatically adds special tokens at start and end
print("Token IDs:", ids)
reconstructed_text = tokenizer.decode(ids)
print("Reconstructed text:", reconstructed_text)
```

%% Cell type:markdown id: tags:

## Token Embeddings

%% Cell type:markdown id: tags:

### From scratch

%% Cell type:markdown id: tags:

Initialize the token embedding layer randomly using a torch embedding layer.


*Note*: This builds a standard matrix with random entries that can be used as a deep learning component later (e.g., gradients can be computed for it). If you are curious, check out the optional notebook "optional_tensor_intro".

%% Cell type:code id: tags:

``` python
vocab_size = 100  # example
output_dim = 10  # example

torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(token_embedding_layer.weight)
```

%% Cell type:markdown id: tags:

Create the token embeddings of an example document

%% Cell type:code id: tags:

``` python
ids = torch.tensor([2, 3, 5, 1])
token_embedding = token_embedding_layer(ids)
print(token_embedding.shape)
print(token_embedding)
```

%% Cell type:markdown id: tags:

Visualize the embedding vectors of the tokens.

The embedding of each token can be interpreted as a vector that points in a direction.

%% Cell type:code id: tags:

``` python
token_embedding_vectors = [
    token_embedding[i].detach().numpy() for i in range(token_embedding.shape[0])
]
plot_embedding_pca(
    token_embedding_vectors,
    labels=[f"Token{i}" for i in range(len(token_embedding_vectors))],
)
```

%% Cell type:markdown id: tags:

These embeddings are meaningless, as the token embeddings are **randomly initialized and not trained**.

In an LLM, token embeddings are trained as part of the training process.
Therefore, let's look at some pretrained token embeddings.

%% Cell type:markdown id: tags:

### Pretrained

%% Cell type:code id: tags:

``` python
model = BertModel.from_pretrained("bert-base-uncased")

# Token embeddings
token_embedding = model.embeddings.word_embeddings.weight
print("Token Embeddings:", token_embedding.shape)
print("Token Embeddings:", token_embedding)
```

%% Cell type:markdown id: tags:

Visualize the embedding vectors of selected words.

The embedding of each token can be interpreted as a vector that points in a direction.

%% Cell type:code id: tags:

``` python
# select some words you are interested in (need to be in the vocab)
words = [
    "king",
    "queen",
    "man",
    "woman",
    "dog",
    "cat",
    "apple",
    "orange",
    "car",
    "bike",
    "criminal",
    "president",
]
#  '[UNK]', '[CLS]'

# select embeddings for the selected words
vocab_selected = {
    item[0]: item[1] for item in tokenizer.vocab.items() if item[0] in words
}

# Note what tokens were not available
missing_tokens = set(words) - set(vocab_selected.keys())
if len(missing_tokens) > 0:
    print("Non-existing Tokens:", missing_tokens)

token_embedding_vectors = [
    token_embedding[list(vocab_selected.values())][i].detach().numpy()
    for i in range(len(vocab_selected))
]

plot_embedding_pca(token_embedding_vectors, labels=list(vocab_selected.keys()))
```

%% Cell type:markdown id: tags:

#### Task
Experiment with the visualization of token embeddings of some words.
- Do similar words have more similar token embeddings?
- Look at the difference between the word pairs "king"/"queen" and "man"/"woman". What do you notice?
- Test some tokens you are interested in. (Make sure they are part of the vocab)
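
%% Cell type:markdown id: tags:

To make "more similar" concrete, you can compare embeddings with cosine similarity. The helper below is a minimal sketch (the function `cosine_similarity` is our own, not from `utils`); the toy vectors only illustrate the idea, with BERT you would pass rows of the pretrained `token_embedding` matrix from above, e.g. `token_embedding[vocab_selected["king"]].detach().numpy()`.

%% Cell type:code id: tags:

``` python
import numpy as np


def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors: 1 = same direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# toy 2-d vectors for illustration only
v_king = np.array([1.0, 0.9])
v_queen = np.array([0.9, 1.0])
v_apple = np.array([-1.0, 0.2])

print(cosine_similarity(v_king, v_queen))  # close to 1: similar direction
print(cosine_similarity(v_king, v_apple))  # much smaller
```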

%% Cell type:markdown id: tags:

#### Task*

Manually compute the input embeddings for a text document in two ways (without positional embeddings):
   1) **By an index lookup**: Implement manually what the torch embedding layer does - for each token-id, look up the corresponding row in the token embedding matrix to construct the token embeddings.
   2) **By matrix multiplication**: Transform each token-id into a one-hot-encoded vector of length $v$ (the vocabulary size), with 0s everywhere and a 1 only at the position of the corresponding token. Multiply the one-hot-encoded representation of the document (it's a matrix!) with the embedding matrix.

   Are the results of 1. and 2. equivalent? Which approach do you prefer?
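
%% Cell type:markdown id: tags:

#### Example Solution

One possible solution sketch, using numpy and a small random matrix in place of a trained embedding matrix (with the torch layer from earlier you would use `token_embedding_layer.weight` instead):

%% Cell type:code id: tags:

``` python
import numpy as np

rng = np.random.default_rng(123)
vocab_size, output_dim = 8, 4
E = rng.normal(size=(vocab_size, output_dim))  # embedding matrix (vocab_size x output_dim)
ids = [2, 3, 5, 1]  # token ids of an example document

# 1) Index lookup: pick the row of E for each token id
emb_lookup = E[ids]

# 2) One-hot encoding of the document (len(ids) x vocab_size), then matrix multiplication
one_hot = np.zeros((len(ids), vocab_size))
one_hot[np.arange(len(ids)), ids] = 1.0
emb_matmul = one_hot @ E

print(np.allclose(emb_lookup, emb_matmul))  # True: both approaches are equivalent
```

The results are identical; the index lookup is preferable in practice, since it avoids building the large, mostly-zero one-hot matrix.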

%% Cell type:markdown id: tags:

## Positional embeddings

%% Cell type:markdown id: tags:

### From scratch

%% Cell type:markdown id: tags:

Randomly initialize a positional embedding layer:

%% Cell type:code id: tags:

``` python
max_length = 4  # max length of a text document (small for illustration purposes)
output_dim = 10  # (small for illustration purposes)

pos_embedding_layer = torch.nn.Embedding(max_length, output_dim)
print(pos_embedding_layer)
print(pos_embedding_layer.weight)
```

%% Cell type:markdown id: tags:

Create the positional embeddings for an input sequence

%% Cell type:code id: tags:

``` python
pos_embedding = pos_embedding_layer(torch.arange(max_length))
print(pos_embedding.shape)
print(pos_embedding)
```

%% Cell type:code id: tags:

``` python
position = range(max_length)
pos_embedding_vectors = [pos_embedding[i].detach().numpy() for i in position]
plot_embedding_pca(
    pos_embedding_vectors,
    labels=[f"Position {i}" for i in range(len(pos_embedding_vectors))],
)
```

%% Cell type:markdown id: tags:

### Pretrained

%% Cell type:code id: tags:

``` python
model = BertModel.from_pretrained("bert-base-uncased")

# Positional embeddings
pos_embedding = (
    model.embeddings.position_embeddings.weight
)  # access positional embeddings in the model object
print("Positional Embeddings Size:", pos_embedding.shape)
print("Positional Embeddings:", pos_embedding)
```

%% Cell type:code id: tags:

``` python
position = [1, 2, 3, 4, 5, 10, 25, 50, 100, 200, 300, 511]  # note: 0 is always CLS
pos_embedding_vectors = [pos_embedding[i].detach().numpy() for i in position]
plot_embedding_pca(pos_embedding_vectors, labels=[f"Position {i}" for i in position])
```

%% Cell type:markdown id: tags:

## Combining it all: construct input embeddings

%% Cell type:code id: tags:

``` python
# Example text
text = "My father gave me a small loan of a million dollars."

# Load tokenizer and embedding layers/matrices
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=True
)
model = BertModel.from_pretrained("bert-base-uncased")
token_embedding_layer = model.embeddings.word_embeddings
pos_embedding_layer = model.embeddings.position_embeddings

# Create embeddings for the example text
ids = tokenizer.encode(text)
token_embedding = token_embedding_layer(torch.tensor(ids))
pos_embedding = pos_embedding_layer(torch.arange(len(token_embedding)))

input_embedding = (
    token_embedding + pos_embedding
)  # absolute additive positional embedding
print(input_embedding.shape)
print(input_embedding)
```

%% Cell type:code id: tags:

``` python
# Visualize the input embedding
# remove CLS and SEP token for simplicity
input_embedding = input_embedding[1:-1]
tokens = tokenizer.convert_ids_to_tokens(ids)[1:-1]
input_embedding_vectors = [
    input_embedding[i].detach().numpy() for i in range(len(input_embedding))
]

plot_embedding_pca(input_embedding_vectors, tokens)
```

%% Cell type:markdown id: tags:

**Influence of positional embeddings**

We can now further investigate the influence of positional embeddings on the final input embedding of a token by plotting the input representation of a token for different positions:

For the words that switch position between two sentences, the final embedding differs slightly due to the positional embeddings.
This is how positional embeddings preserve information about word order.

%% Cell type:code id: tags:

``` python
# Example text
text1 = "The cat chased the dog."
text2 = "The dog chased the cat."
text3 = "She only likes him."
text4 = "Only she likes him."
texts = [text1, text2, text3, text4]

# Load tokenizer and embedding layers/matrices (once, outside the loop)
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=True
)
model = BertModel.from_pretrained("bert-base-uncased")
token_embedding_layer = model.embeddings.word_embeddings
pos_embedding_layer = model.embeddings.position_embeddings

for text in texts:

    # Create embeddings for the example text
    ids = tokenizer.encode(text)
    token_embedding = token_embedding_layer(torch.tensor(ids))
    pos_embedding = pos_embedding_layer(torch.arange(len(token_embedding)))

    input_embedding = (
        token_embedding + pos_embedding
    )  # absolute additive positional embedding

    # Visualize the input embedding
    # remove CLS and SEP token for simplicity
    input_embedding = input_embedding[1:-1]
    tokens = tokenizer.convert_ids_to_tokens(ids)[1:-1]
    input_embedding_vectors = [
        input_embedding[i].detach().numpy() for i in range(len(input_embedding))
    ]
    print(text)
    plot_embedding_pca(input_embedding_vectors, tokens)
```

%% Cell type:markdown id: tags:

### Task*
Write a neat function / class to directly transform text into its input embeddings.
- If you want, you can allow for token embeddings and positional embeddings to be flexible inputs.
- You can fix the tokenizer for simplicity.
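
%% Cell type:markdown id: tags:

### Example Solution

One possible solution sketch. The tokenizer's encode function and the two embedding layers are passed in as arguments (the function and argument names are our own choice), so the same function works with the BERT components loaded earlier or with any other torch embedding layers:

%% Cell type:code id: tags:

``` python
import torch


def text_to_input_embeddings(text, encode_fn, token_emb_layer, pos_emb_layer):
    """Transform text into input embeddings: token embeddings plus
    absolute additive positional embeddings."""
    ids = torch.tensor(encode_fn(text))
    token_emb = token_emb_layer(ids)
    pos_emb = pos_emb_layer(torch.arange(len(ids)))
    return token_emb + pos_emb


# Usage with BERT (as loaded earlier in the notebook):
# emb = text_to_input_embeddings(text, tokenizer.encode,
#                                model.embeddings.word_embeddings,
#                                model.embeddings.position_embeddings)

# Self-contained check with random layers and a hypothetical whitespace "tokenizer":
torch.manual_seed(123)
tok = torch.nn.Embedding(100, 10)
pos = torch.nn.Embedding(20, 10)
toy_encode = lambda s: [hash(w) % 100 for w in s.split()]
emb = text_to_input_embeddings("a small example", toy_encode, tok, pos)
print(emb.shape)  # torch.Size([3, 10])
```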