Commit da694ab5 authored by jbleher

Initial commit

%% Cell type:markdown id:0a23ac74-2509-4920-a88a-5975d05074a3 tags:

# Simple Retrieval Augmented Generation

In this first notebook, we will build a simple retrieval augmented generation system.
For this purpose, we will get a Wikipedia page as context and feed it to an LLM.
The LLM will be Mistral, running on the AIDAHO server aidaho-edu.uni-hohenheim.de/ollama

%% Cell type:markdown id:0f877f1d-38a0-4135-b798-57032665d0fa tags:

## The modules

%% Cell type:code id:23bae9fd-cc5c-46cc-b503-9d025f42b158 tags:

``` python
from langchain_community.llms import Ollama               # This module is for talking to LLMs
from langchain_community.vectorstores import Chroma       # This module delivers the vector database
from langchain_core.output_parsers import StrOutputParser # This module allows parsing the output from the LLM (string conversion, error handling, ...)
from langchain_core.prompts import PromptTemplate         # This module is used to build prompt templates (placeholder management).
from langchain_core.runnables import RunnablePassthrough  # Handy tool to just pass text through
import requests                                           # A module to make HTTP requests
from bs4 import BeautifulSoup                             # A module to parse HTML code.
```

%% Cell type:markdown id:1e9450a5-b36d-43c7-b682-1bc5b1792ecc tags:

## Get web content

1. Use the get function from the requests package to retrieve the URL "https://de.wikipedia.org/wiki/Karlsruher_Institut_f%C3%BCr_Technologie" and store the result in an object called response.
2. Use the function BeautifulSoup to parse the content object from the response. Use the "html.parser". Call the result soup.
3. Invoke the function find on soup. Find the "div" with {"class": "mw-body-content"}.
4. Extract the text. Use a blank space as the separator.
5. Wrap everything in a function called *fetch_web_content(url)*.

%% Cell type:code id:866611e7-d6ef-4952-97a4-59a3cfa51ab8 tags:

``` python
```
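
A possible solution sketch for the five steps above (no error handling; the function simply assumes the request succeeds and the div exists):

``` python
import requests
from bs4 import BeautifulSoup

def fetch_web_content(url):
    # 1. Retrieve the page
    response = requests.get(url)
    # 2. Parse the HTML content of the response
    soup = BeautifulSoup(response.content, "html.parser")
    # 3. Find the div that holds the article body
    body = soup.find("div", {"class": "mw-body-content"})
    # 4. Extract the text, with a blank space as the separator
    return body.get_text(" ")
```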

%% Cell type:markdown id:51bb7d3e-f619-43c1-a5a3-338df35a407d tags:

## Preprocess web content
1. In the result from *fetch_web_content*, replace line breaks "\n" with a blank space.
2. Use the function strip to remove leading and trailing whitespace.
3. Wrap both steps in a function called *preprocess_content(text)*.

%% Cell type:code id:73ff5f8f-6754-4d60-9547-0288e592a232 tags:

``` python
```
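
One possible implementation, wrapped in a function (the name *preprocess_content* matches the wrapper exercise below):

``` python
def preprocess_content(text):
    # 1. Replace line breaks with a blank space
    text = text.replace("\n", " ")
    # 2. Remove leading and trailing whitespace
    return text.strip()
```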

%% Cell type:markdown id:f032e270-344f-4116-8e8e-cf9c8f7b3893 tags:

## Wrap Wikipedia Content
Write a function that takes no arguments and simply uses the URL "https://de.wikipedia.org/wiki/Karlsruher_Institut_f%C3%BCr_Technologie" to call the *fetch_web_content* function from above and then invokes *preprocess_content* on the result.
The function should return the resulting object.

%% Cell type:code id:468b2b91-5426-46f3-9736-0a899bed4133 tags:

``` python
```
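
A sketch of the wrapper. The helper functions from the previous cells are repeated here so the snippet is self-contained, and the name *get_wikipedia_content* is only a suggestion, since the exercise does not fix one:

``` python
import requests
from bs4 import BeautifulSoup

def fetch_web_content(url):
    # Fetch the page and extract the article body text (see above)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.find("div", {"class": "mw-body-content"}).get_text(" ")

def preprocess_content(text):
    # Replace line breaks with blanks and strip surrounding whitespace (see above)
    return text.replace("\n", " ").strip()

def get_wikipedia_content():
    url = "https://de.wikipedia.org/wiki/Karlsruher_Institut_f%C3%BCr_Technologie"
    return preprocess_content(fetch_web_content(url))
```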

%% Cell type:markdown id:0342f171-212d-4cc7-8924-4cf3e701ab3b tags:

## Baseprompt
Write a base prompt that uses the placeholders {context} and {question}.

%% Cell type:code id:03e284f9-b222-4afb-847c-102ba9a1f099 tags:

``` python
# An example base prompt; adjust the wording as needed.
prompt = PromptTemplate.from_template(
    """Answer the question based only on the following context.
    If the context does not contain the answer, say that you don't know.

    Context:
    {context}

    Question:
    {question}

    Answer:
    """)
```

%% Cell type:markdown id:93c7c6af-361f-4d9e-a48b-987b14d6d57e tags:

## Setup the LLM
We will use a mistral instance hosted on the AIDAHO servers https://aidaho-edu.uni-hohenheim.de/ollama

%% Cell type:code id:eede178b-6147-4510-b6e3-c5c96779f115 tags:

``` python
llm = Ollama(
    base_url="https://aidaho-edu.uni-hohenheim.de/ollama",
    model="mistral"
    )
```

%% Cell type:markdown id:2fc5d826-75a7-439e-a34a-947051b572e5 tags:

## Make the chain

A chain in Langchain can be defined by concatenating the elements with a "|".

1. A dictionary with context and question will be needed. This is fed into the prompt, which in turn is fed into the llm. The result is parsed with the function StrOutputParser(). This entire chain can be stored in a chain object.
2. Define your query.
3. Invoke the query on the chain.

%% Cell type:code id:bc251b61-a5f2-4f58-b88b-62baab9deebc tags:

``` python
```
%% Cell type:markdown id:5003901a-54f7-446d-a0b3-c0534305d7a3 tags:

# Retrieval Augmented Generation

In this second notebook, we will build a retrieval augmented generation system (still very simplified).
For this purpose, we will again get a Wikipedia page as context and feed it to an LLM.
This time, we will not use the entire Wikipedia page but only parts of it.
Again, the LLM will be Mistral, running on the AIDAHO server aidaho-edu.uni-hohenheim.de/ollama

%% Cell type:markdown id:c671fc8e-5090-4441-8eef-183e0f4bb0ba tags:

## The modules

%% Cell type:code id:73e868ef-4380-4da2-b675-a9a09086c9d2 tags:

``` python
import bs4                                                           # Beautiful Soup: Used for parsing and extracting data from HTML and XML documents.
from langchain import hub                                            # Provides access to shared models, chains, and prompts from the LangChain community.
from langchain_chroma import Chroma                                  # VectorDB: A vector database used for storing and retrieving embeddings, often used in semantic search.
from langchain_community.document_loaders import WebBaseLoader       # Loads web pages and extracts content for processing.
from langchain_core.output_parsers import StrOutputParser            # Parses and formats the string output from a language model.
from langchain_core.runnables import RunnablePassthrough             # Passes data through without modification, useful for testing or chaining.
from langchain_core.prompts import PromptTemplate                    # Creates structured prompts with placeholders for dynamic input.
from langchain_community.llms import Ollama                          # An interface to Ollama, a language model used for generating responses.
from langchain_text_splitters import RecursiveCharacterTextSplitter  # Splits long text into smaller chunks based on character limits, preserving context.
from langchain_community.embeddings import OllamaEmbeddings          # Provides embeddings generation using Ollama, useful for transforming text into
                                                                     # numerical vectors for tasks like semantic search or similarity comparisons.
import torch
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm  # Import tqdm for progress bars
```

%% Cell type:markdown id:861fb786-c4d5-4345-9965-312885a7857a tags:

## Load relevant documents
This time we use the **`WebBaseLoader`** from the `document_loaders` submodule of `langchain_community`.

%% Cell type:code id:739e0e65-5559-40fd-8421-950750c808fd tags:

``` python
loader = WebBaseLoader(
    web_paths=("https://de.wikipedia.org/wiki/Karlsruher_Institut_f%C3%BCr_Technologie",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("mw-body-content", )
        )
    ),
)
docs = loader.load()
```

%% Cell type:markdown id:dec45e6b-18b0-4a7d-ad01-cea1a19f49a8 tags:

## Create Short Contexts
1. Use the function **`RecursiveCharacterTextSplitter`** to create a splitter that produces overlapping chunks of 5000 characters with an overlap of 100. Store the result in the variable **`text_splitter`**.
2. Invoke the function **`split_documents`** on the variable **`text_splitter`**, passing in the **`docs`**. Call the result **`splits`**.

%% Cell type:code id:ef58cee5-dbc5-4f07-a912-1742f1955ca5 tags:

``` python
```

%% Cell type:markdown id:11eb0065-7ef8-468b-a48f-885af50651ac tags:

## Embedding Generation
For each split, we now want to generate the corresponding embedding. The resulting embedding vector will be stored in the vector database together with the corresponding split, for fast retrieval. For this purpose, we can either use our Mistral model (remote) or use the GPU locally.

%% Cell type:markdown id:91274a8f-b68b-415c-82d4-e3407c52544e tags:

### Remote Embedding generation
In order to use a remote embedding generation, use the function OllamaEmbeddings with the base_url https://aidaho-edu.uni-hohenheim.de/ollama and the model "mistral".

%% Cell type:code id:0cb33c4c-c2a7-4098-b817-4e9fc3aacda6 tags:

``` python
```

%% Cell type:markdown id:888ab126-993b-4864-a0ab-fa74c677702c tags:

### Local Embedding Generation

%% Cell type:code id:c9df7b9d-ae1d-4f06-9ee4-ae122bb9a1ac tags:

``` python
# Load the tokenizer and model for e5-large
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large")
model = AutoModel.from_pretrained("intfloat/e5-large")

# Set device (use GPU if available, otherwise CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # Move model to device
```

%% Cell type:code id:08d4f4ac-6cce-4a8d-a663-46526193a374 tags:

``` python
# Define the embedding function
def embed(texts):
    embeddings = []  # Initialize an empty list to store the embeddings
    for text in tqdm(texts, desc="Generating Embeddings", unit="text"):
        # Tokenize the input text
        inputs = tokenizer(
            text,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=512
        ).to(device)  # Move inputs to the appropriate device

        # Forward pass through the model to get hidden states
        with torch.no_grad():
            outputs = model(**inputs)

        # Use the [CLS] token representation (first token) as the embedding
        # (note: the e5 authors recommend mean pooling over all tokens; CLS is a simplification here)
        embedding = outputs.last_hidden_state[:, 0, :]

        # Convert the embedding to a list and add it to the embeddings list
        embeddings.append(embedding.cpu().numpy().tolist()[0])

    return embeddings


# Create a class that wraps the embed function to fit the expected interface
class CustomEmbedding:
    def embed_documents(self, texts):
        return embed(texts)
    def embed_query(self, text):
        return embed([text])[0]  # Process the single query text and return its embedding

# Create an instance of the embedding class
embedding = CustomEmbedding()
```

%% Cell type:markdown id:a3fa6e38-af79-4631-8cfd-1562dbe9d31b tags:

## Setup the vectordatabase
1. Use the function from_documents in the Chroma module. It takes to arguments **`documents`** which are the splits and **`embedding`** which will be the function defined above (either the one for the remote or local. Call the resulting object **`vectorstore`**.
2. Call the function **`as_retriever()`** on the vectorstore. Call the resulting object **`retriever`**

%% Cell type:code id:c2893a58-27eb-43b4-bbb6-29232c91f05f tags:

``` python
```

%% Cell type:markdown id:ff10da12-3c42-497f-9144-710be348cb22 tags:

## The base prompt
Use the following base prompt.

%% Cell type:code id:38000740-0cf1-4823-8a83-e2b277867cc5 tags:

``` python
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
    Use three sentences maximum and keep the answer concise. If you don't know the answer, just say that you don't know.
    You are an AI assistant who answers the questions with the provided context.

    Context:
    {context}

    Question:
    {question}

    Answer:
    """)
```

%% Cell type:markdown id:1720f643-3dcd-4d5a-8e3c-c644e46a7039 tags:

## The LLM
Setup the LLM.

%% Cell type:code id:44973a2f-945b-46fe-9650-25a709e38586 tags:

``` python
```

%% Cell type:markdown id:ee87d65e-a6ac-4bb5-b2c3-224e50fd3dde tags:

## The LangChain

1. **Build the Chain**:
   - Create a dictionary with two keys: `context` and `question`.
     - **`context`**: This key combines a retriever with a formatting function:
       - **`retriever`**: A component that retrieves relevant documents based on the query.
       - **`format_docs(docs)` Function**: A function that formats the retrieved documents into a string. It joins the content of each document into a single formatted string separated by double newlines.
     - **`question`**: Use `RunnablePassthrough()`, which passes the user’s question as-is, without modifications.
   - Chain these elements together with a prompt, an LLM, and an output parser to create the full processing pipeline:
     - **`prompt`**: A template that structures the input context and question for the LLM.
     - **`llm`**: A language model that generates a response based on the formatted prompt.
     - **`StrOutputParser()`**: A parser that formats the LLM’s output into a readable string.

2. **Define Your Query**:
   - Prepare a query that you want the chain to answer using the context provided by the retriever. "Wieviel Drittmittel hat das KIT eingenommen?"

3. **Invoke the Query on the Chain**:
   - Use the `invoke` method on the `rag_chain` object, passing in the query to see the final response.

%% Cell type:code id:caa463aa-1d4c-442c-8409-15741123ee44 tags:

``` python
```

%% Cell type:code id:d358c0f5-0a7d-46fe-aa55-83bf4c290c1e tags:

``` python
```
pip install langchain langchain-core langchain-community langchain-chroma transformers torch tqdm requests beautifulsoup4