Commit 5b6bcaeb authored by Jonas Boysen's avatar Jonas Boysen

after recording

parent 9d5ba24b
%% Cell type:markdown id:36fcdd98-89e8-464a-802d-fefeb1a4894e tags:

# Hands-on 7.1 Unsupervised Learning

%% Cell type:markdown id:cc965bb9-4eca-4a80-af94-bc04af5941fd tags:

## 7.1.2 Clustering

%% Cell type:markdown id:e6a306f0-6352-4ea2-89b2-4a4ac4d8c09f tags:

### 7.1.2 - Task 1 - K-Means

%% Cell type:markdown id:c97d98c3-ac82-41a1-ab99-e7f469fd65d9 tags:

Load and visualize the Iris data set with and without class labels.
Use two of the four features for visualization.
The petal width should be displayed on the y-axis and the petal length should be displayed on the x-axis.

%% Cell type:code id:005169de-05ce-4909-a99f-a6fb2915f089 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
print(x_labels)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)

fig.suptitle("Task 1")

ax1.scatter(X[:, 2], X[:, 3])
ax1.set(xlabel=x_labels[2], ylabel=x_labels[3])

scatter = ax2.scatter(X[:, 2], X[:, 3], c=y)
ax2.legend(scatter.legend_elements()[0], y_labels)
plt.show()
```

%% Cell type:markdown id:dea0d9e6-b2a7-4a37-aa1d-b5f6c670af1f tags:

### 7.1.2 - Task 2 - K-Means

%% Cell type:markdown id:10416888-bfef-4134-8a24-414ba2402623 tags:

Visualize the results of K-means with 5 clusters and 10 initializations next to the target result.
Also display the cluster centers on top of the visualization.

%% Cell type:code id:1ac6d9b2-aca4-4f8d-b649-b46431d1609c tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names

kmeans = KMeans(n_clusters=5, n_init=10)
kmeans.fit(X[:, 2:4])
print(kmeans.labels_)
print(kmeans.cluster_centers_)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 2")

ax1.scatter(X[:, 2], X[:, 3], c=kmeans.labels_)
ax1.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c="black", marker="x")

scatter = ax2.scatter(X[:, 2], X[:, 3], c=y)
ax2.legend(scatter.legend_elements()[0], y_labels)

plt.show()
```

%% Cell type:markdown id:eaf49470 tags:

### 7.1.2 - Task 3 - K-Means

%% Cell type:markdown id:27ec3761 tags:

Find a good value for the number of clusters k by visualizing the inertia for k = 1 to 10.

%% Cell type:code id:9104fb49 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names

inertias = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10)
    kmeans.fit(X[:, 2:4])
    print(i, kmeans.inertia_)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias)
plt.title("Task 3")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()
```
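
Besides the inertia elbow, the silhouette score is a common second heuristic for choosing k. The sketch below (not part of the task; the k range is a choice) applies it to the same two petal features:

``` python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data[:, 2:4]  # petal length and petal width

# The silhouette score needs at least 2 clusters, so start at k=2.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("silhouette scores:", {k: round(s, 3) for k, s in scores.items()})
print("best k:", best_k)
```

Unlike inertia, the silhouette score does not decrease monotonically with k, so the maximum can be read off directly instead of eyeballing an elbow.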

%% Cell type:markdown id:36955304-1116-4f9c-bf20-e13f3d0dcaf7 tags:

### 7.1.2 - Task 4 - DBSCAN

%% Cell type:markdown id:1b4a3e62 tags:

Load the Iris data set and visualize the outcome of the DBSCAN algorithm with epsilon = 0.3 and min_samples = 3 next to the labelled data set. Draw a red cross on every data instance considered as an anomaly by the DBSCAN algorithm.

%% Cell type:code id:4853842e tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names

dbscan = DBSCAN(eps=0.3, min_samples=3)
dbscan.fit(X[:, 2:4])
print(dbscan.labels_)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 4")

ax1.scatter(X[:, 2], X[:, 3], c=dbscan.labels_)

scatter = ax2.scatter(X[:, 2], X[:, 3], c=y)
ax2.legend(scatter.legend_elements()[0], y_labels)

anomaly_idcs = np.where(dbscan.labels_ == -1)[0]
ax1.scatter(X[anomaly_idcs, 2], X[anomaly_idcs, 3], c="red", marker="x")

print(anomaly_idcs)
plt.show()
```

%% Cell type:markdown id:40c20803 tags:

### 7.1.2 - Task 5 - DBSCAN

%% Cell type:markdown id:df4a7fe4 tags:

Create a moon data set with the sklearn.datasets module including 200 samples, with Gaussian noise comprising a std deviation of 0.05. Fix the randomly generated data set with a random state to always receive the same dataset. Find good parameters for the DBSCAN algorithm by trial and error. The algorithm should split the two moons into two clusters.

%% Cell type:code id:41b77f5f tags:

``` python
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(200, noise=0.05, random_state=20)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 5")

dbscan = DBSCAN(eps=0.3)
dbscan.fit(X)

print(X.shape)
ax1.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
ax2.scatter(X[:, 0], X[:, 1], c=y)

plt.show()
```
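
A common alternative to pure trial and error is the k-distance heuristic: sort every point's distance to its k-th nearest neighbour and read eps off the knee of that curve. The sketch below approximates the knee with an upper percentile instead of reading it off a plot (the percentile choice is an arbitrary assumption):

``` python
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, y = make_moons(200, noise=0.05, random_state=20)

# Distance from every point to its 4th-nearest neighbour (column 0 is the
# point itself, so ask for 5 neighbours and take the last column).
min_samples = 4
dists, _ = NearestNeighbors(n_neighbors=min_samples + 1).fit(X).kneighbors(X)
kth = np.sort(dists[:, -1])

# Stand-in for reading the knee off a plot: an upper percentile of the curve.
eps_candidate = float(np.percentile(kth, 90))
print("eps candidate:", eps_candidate)

labels = DBSCAN(eps=eps_candidate, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```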

%% Cell type:markdown id:47e562b7 tags:

## 7.1.3 Anomaly Detection

%% Cell type:markdown id:d58dd347 tags:

### 7.1.3 - Task 1 - GMM

%% Cell type:markdown id:115414a3 tags:

a)
Create a dataset with the make_blob function of the datasets module. The dataset should contain five blobs with the following number of data instances (50, 20, 30, 70, 10). Each data instance should be 2-dimensional. Pass a random seed to the function to make the data set reproducible across multiple function calls. Save the centers of the blobs in a variable and set the std deviation of the first two blobs to 1.0 and the other blobs to 0.8. Visualize the created dataset with labels.

b)
Use the GaussianMixture class from the mixture module to create a Gaussian mixture model that fits the data set. Try different parameters to find a good representation.

%% Cell type:code id:e3b5ae00 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, y, centers = make_blobs(
    n_samples=[50, 20, 30, 70, 10],
    random_state=1,
    return_centers=True,
    cluster_std=[1.,1.,.8,.8,.8])

gmm = GaussianMixture(n_components=5, covariance_type="spherical")
y_pred = gmm.fit_predict(X=X)

print(y_pred)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 1")

ax1.scatter(X[:, 0], X[:, 1], c=y, s=5)
ax2.scatter(X[:, 0], X[:, 1], c=y_pred, s=5)
```

%% Cell type:markdown id:35020d39 tags:

### 7.1.3 - Task 2 - GMM

%% Cell type:markdown id:95682b55 tags:

Perform an anomaly detection using a GMM with 2 components. Create a blob dataset with 3 blobs with 120, 70 and 10 data instances and with std deviations of 0.8, 0.8 and 5.0 respectively. Mark the anomalies with a red cross.

%% Cell type:code id:ac18d0d0 tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=[120,70,10], cluster_std=[0.8,0.8,5.], random_state=1)

gmm = GaussianMixture(n_components=2)
gmm.fit(X=X)

log_probs = gmm.score_samples(X=X)
# print(log_probs)

threshold = np.percentile(log_probs, 5)
print(threshold)
anomalies = X[log_probs < threshold]
print(anomalies)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 2")

ax1.scatter(X[:, 0], X[:, 1], c=y, s=5)
ax2.scatter(X[:, 0], X[:, 1], s=5)
ax2.scatter(anomalies[:, 0], anomalies[:, 1], c="red", marker="x")
```

%% Cell type:markdown id:3d0b9206 tags:

## 7.1.4 Dimensionality Reduction

%% Cell type:markdown id:9a08ae29 tags:

### 7.1.4 - Task 1 - PCA & t-SNE

%% Cell type:markdown id:f8ac9a60 tags:

Get familiar with the digits dataset. Visualize a single data instance as an image and print the corresponding label.

%% Cell type:code id:5973382c tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# print(X)
# print(X.shape)

img = np.reshape(X[500], (8, 8))
print(img)
print(img.shape)
print("label:", y[500])

plt.imshow(img, cmap="gray")
```

%% Cell type:markdown id:d1eb9eea tags:

### 7.1.4 - Task 2 - PCA & t-SNE

%% Cell type:markdown id:f5f9e744 tags:

Use the PCA class of the decomposition module and the TSNE class of the manifold module to visualize the complete digits dataset in 2D plots instead of showing a single image.

%% Cell type:code id:be65239d tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

colors = ListedColormap(["red", "blue", "green", "yellow", "orange", "purple", "gray", "black", "brown", "teal"])

X, y = load_digits(return_X_y=True)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 2")

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

tsne = TSNE()
X_tsne = tsne.fit_transform(X)

scatter = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=colors, s=2)
ax1.legend(scatter.legend_elements()[0], range(10))

scatter = ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=colors, s=2)
ax2.legend(scatter.legend_elements()[0], range(10))
```
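
Two principal components retain only a fraction of the dataset's variance, which helps explain why the PCA plot separates the digits less cleanly than t-SNE. A quick check via `explained_variance_ratio_`:

``` python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_  # fraction of total variance per component
print("explained variance per component:", ratios)
print("total:", ratios.sum())
```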

%% Cell type:markdown id:22ced03a tags:

# Hands-on 7.2 Supervised Learning

%% Cell type:markdown id:5c377e71 tags:

## 7.2.3 Classification

%% Cell type:markdown id:914661a7 tags:

### 7.2.3 - Task 1 - SVM

%% Cell type:markdown id:d979fe1c tags:

Load the iris dataset and use an SVM classifier with a linear kernel and C=1.0 to train on the features petal length and petal width. Visualize the predictions next to the data with original labels. Print the mean accuracy of the classifier on this data set. Draw the three decision boundaries between the three classes in the plot.

%% Cell type:code id:2075a53e tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 1")

svm = SVC(C=1.0, kernel="linear")
svm.fit(X=X[:,2:4], y=y)

print(svm.score(X=X[:,2:4], y=y))
scatter = ax1.scatter(X[:, 2], X[:, 3], c=y, cmap="viridis")
ax1.legend(scatter.legend_elements()[0], y_labels)

y_pred = svm.predict(X=X[:, 2:4])
scatter = ax2.scatter(X[:, 2], X[:, 3], c=y_pred, cmap="viridis")
ax2.legend(scatter.legend_elements()[0], y_labels)

b = svm.intercept_
w = svm.coef_

print("b", b)
print("w", w)

# 0 = w2*x2 + w1*x1 + b
# x2 = -(w1*x1/w2) - b/w2

x = np.array([ax1.get_xlim()[0], ax1.get_xlim()[1]])
print(x)

ax2.set_ylim(ax2.get_ylim())  # freeze the y-limits so the boundary lines do not rescale the plot

for j in range(3):
    ax2.plot(x, -x*w[j, 0] / w[j, 1] - b[j] / w[j,1])
```

%% Cell type:markdown id:f6253c67 tags:

### 7.2.3 - Task 2 - SVM

%% Cell type:markdown id:83cda835 tags:

Load the iris dataset and use an SVM classifier with an rbf kernel and try out different parameters to train on the features petal length and petal width. Print the mean accuracy of the classifier on this data set. Draw the decision boundaries between the three classes in the plot.

%% Cell type:code id:d4cd74d2 tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names

fig, ax1 = plt.subplots(1, 1)
fig.suptitle("Task 2")

scatter = ax1.scatter(X[:, 2], X[:, 3], c=y, cmap="viridis")
ax1.legend(scatter.legend_elements()[0], y_labels)

svm = SVC()
svm.fit(X[:, 2:4], y)

print(svm.score(X[:, 2:4], y))

x_coords = np.arange(0, 7.5, 0.01)
y_coords = np.arange(0, 2.8, 0.01)

print(x_coords)
xx, yy = np.meshgrid(x_coords, y_coords)

z = svm.predict(np.column_stack((xx.ravel(), yy.ravel())))

print(z)
z = z.reshape(xx.shape)

ax1.contourf(xx, yy, z, alpha=0.3, cmap="viridis")
```

%% Cell type:markdown id:1e7c8f45 tags:

### 7.2.3 - Task 3 - KNN

%% Cell type:markdown id:61f6cde9 tags:

Create a 2D blob dataset with 500 data instances and 5 blobs. Use a KNN-classifier and choose a value for K. Visualize the predictions next to the true labels in two plots. Use the contourf function to draw the decision boundaries of the classifier.

%% Cell type:code id:2cc1246d tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=5, random_state=1)
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
fig.suptitle("Task 3")

ax1.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=5)

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X, y)
y_pred = knn.predict(X=X)
print("score:", knn.score(X, y))
ax2.scatter(X[:, 0], X[:, 1], c=y_pred, cmap="viridis", s=5)

x_lim = ax1.get_xlim()
y_lim = ax1.get_ylim()
print(x_lim)

x_coords = np.arange(x_lim[0], x_lim[1], 0.1)
y_coords = np.arange(y_lim[0], y_lim[1], 0.1)

xx, yy = np.meshgrid(x_coords, y_coords)

z = knn.predict(np.column_stack((xx.ravel(), yy.ravel())))

z = z.reshape(xx.shape)

ax1.contourf(xx, yy, z, alpha=0.3, cmap="viridis")
```
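
The meshgrid/predict/contourf pattern above recurs in the following tasks as well, so it can be factored into a small helper. A sketch (the function name and step size are choices, not part of the task):

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

def plot_decision_boundary(ax, clf, step=0.05):
    """Shade clf's predicted regions over the axes' current limits."""
    xx, yy = np.meshgrid(np.arange(*ax.get_xlim(), step),
                         np.arange(*ax.get_ylim(), step))
    z = clf.predict(np.column_stack((xx.ravel(), yy.ravel()))).reshape(xx.shape)
    ax.contourf(xx, yy, z, alpha=0.3, cmap="viridis")
    return z

X, y = make_blobs(n_samples=500, n_features=2, centers=5, random_state=1)
knn = KNeighborsClassifier(n_neighbors=10).fit(X, y)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=5)
z = plot_decision_boundary(ax, knn)
print(z.shape)
```

Calling the helper after the scatter plot matters: the grid is built from the axis limits, which the scatter call has already set.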

%% Cell type:markdown id:413c44dc tags:

### 7.2.3 - Task 4 - Decision Tree

%% Cell type:markdown id:964cf8c4 tags:

Create a moons dataset with default values and noise with a std deviation of 0.2. Use a decision tree to classify the data. Visualize the decision boundaries for a decision tree without restrictions and for one with min_samples_leaf = 4.

%% Cell type:code id:0b26104e tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.2)

tree1 = DecisionTreeClassifier()
tree2 = DecisionTreeClassifier(min_samples_leaf=4)
tree1.fit(X,y)
tree2.fit(X,y)
score1 = tree1.score(X,y)
score2 = tree2.score(X,y)

print("score1:", score1, "score2:", score2)

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
fig.suptitle("Task 4")

ax1.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=5)
ax2.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=5)

ax1.set_title("no restrictions")
ax2.set_title("min_samples_leaf=4")

x_lim = ax1.get_xlim()
y_lim = ax1.get_ylim()

x_coords = np.arange(x_lim[0], x_lim[1], 0.01)  # 0.01 keeps the grid small enough to predict quickly
y_coords = np.arange(y_lim[0], y_lim[1], 0.01)

xx, yy = np.meshgrid(x_coords, y_coords)

z_1 = tree1.predict(np.column_stack((xx.ravel(), yy.ravel())))
z_2 = tree2.predict(np.column_stack((xx.ravel(), yy.ravel())))

z_1 = z_1.reshape(xx.shape)
z_2 = z_2.reshape(xx.shape)

ax1.contourf(xx, yy, z_1, alpha=0.3, cmap="viridis")
ax2.contourf(xx, yy, z_2, alpha=0.3, cmap="viridis")
```

%% Cell type:markdown id:1f26c787 tags:

### 7.2.3 - Task 5 - Random Forest

%% Cell type:markdown id:a55c43e1 tags:

Create a moons dataset with 1000 samples and a std. deviation of 0.2. Use a random forest classifier with 150 trees and draw the decision boundaries.

%% Cell type:code id:47b947ea tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=1000, noise=.2, random_state=22)

forest = RandomForestClassifier(n_estimators=150)
forest.fit(X,y)

fig, ax1 = plt.subplots(1, 1)
fig.suptitle("Task 5")

ax1.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=5)

x_lim = ax1.get_xlim()
y_lim = ax1.get_ylim()

x_coords = np.arange(x_lim[0], x_lim[1], 0.01)  # 0.01 keeps the grid small enough for 150 trees to predict quickly
y_coords = np.arange(y_lim[0], y_lim[1], 0.01)

xx, yy = np.meshgrid(x_coords, y_coords)

z = forest.predict(np.column_stack((xx.ravel(), yy.ravel())))

z = z.reshape(xx.shape)

ax1.contourf(xx, yy, z, alpha=0.3, cmap="viridis")
```

%% Cell type:markdown id:1daaa736 tags:

## 7.2.4 Regression

%% Cell type:markdown id:8d75e79c tags:

### 7.2.4 - Task 1 - Regression with scikit-learn

%% Cell type:markdown id:75335348 tags:

Load the California housing dataset. Plot the house locations (longitude on the x-axis, latitude on the y-axis) and color each point by the median house value in $100,000. Use linear regression, KNN regression and a random forest to predict the value and visualize the predictions in the corresponding plots.

%% Cell type:code id:10059aa2 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X, y = fetch_california_housing(return_X_y=True)

coord = X[:, 6:]  # latitude, longitude
print(coord.shape)
print(coord)

linear = LinearRegression()
linear.fit(X,y)
print("linear:", linear.score(X, y))
y_lin = linear.predict(X)

knn = KNeighborsRegressor()
knn.fit(X,y)
print("knn:", knn.score(X, y))
y_knn = knn.predict(X)

forest = RandomForestRegressor()
forest.fit(X,y)
print("forest:", forest.score(X, y))
y_forest = forest.predict(X)

fig, (ax1,ax2,ax3,ax4) = plt.subplots(1, 4, sharey=True)
ax1.set(xlabel="longitude", ylabel="latitude", title="data")
ax2.set(xlabel="longitude", title="Linear")
ax3.set(xlabel="longitude", title="KNN")
ax4.set(xlabel="longitude", title="Random-Forest")

ax1.scatter(coord[:, 1], coord[:, 0], c=y, cmap="viridis", s=2)
ax2.scatter(coord[:, 1], coord[:, 0], c=y_lin, cmap="viridis", s=2)
ax3.scatter(coord[:, 1], coord[:, 0], c=y_knn, cmap="viridis", s=2)
ax4.scatter(coord[:, 1], coord[:, 0], c=y_forest, cmap="viridis", s=2)
```

%% Cell type:markdown id:44a04d6a tags:

# Hands-on 7.3 Model Evaluation

%% Cell type:markdown id:ba02f77d tags:

## 7.3.4 - Case Study

%% Cell type:markdown id:798d1b87 tags:

### 7.3.4 - Task 1 - Case Study

%% Cell type:markdown id:2054df58 tags:

Load the data stored in the "grain_data.csv" with pandas. In this file different grain types and their contents are listed. The class labels from 0 to 6 are "Barley","Oat","Corn","Rice","Rye","Wheat","Spelt" and the content labels are given in the first row.

%% Cell type:code id:60748145 tags:

``` python
import pandas as pd

filename = "grain_data.csv"
df = pd.read_csv(filename)
print(df)
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]
df["Label"] = df["Label"].apply(lambda x: label_strings[x])
print(df)
print(X)
print(y)
```

%% Cell type:markdown id:f6ee549f tags:

### 7.3.4 - Task 2 - Case Study

%% Cell type:markdown id:d39812c3 tags:

Get a feeling for the imported data by visualizing it in 2D using the t-SNE algorithm

%% Cell type:code id:9e2bb6ad tags:

``` python
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.manifold import TSNE

filename = "grain_data.csv"
df = pd.read_csv(filename)
print(df.columns)
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley", "Oat", "Corn", "Rice", "Rye", "Wheat", "Spelt"]
colors = ListedColormap(["red", "blue", "green", "black", "teal", "brown", "yellow"])

tsne = TSNE()
X_2D = tsne.fit_transform(X)

fig, ax = plt.subplots(1, 1)

scatter = ax.scatter(X_2D[:, 0], X_2D[:, 1], c=y, s=3, cmap=colors)
ax.legend(scatter.legend_elements()[0], label_strings)

plt.show()
```

%% Cell type:markdown id:30bf1be5 tags:

### 7.3.4 - Task 3 - Case Study

%% Cell type:markdown id:9382d0ae tags:

Split the data into a train and test set. Use a classifier of your choice and train it on the train split. Print the confusion matrix, the accuracy, the precision of each class and the recall of each class of the classifier on the test split.

%% Cell type:code id:94a22985 tags:

``` python
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)

# train a random forest with the train split
forest = RandomForestClassifier()
forest.fit(X_train, y_train)

# test and print the accuracy of the forest on the test set
print("Score:", forest.score(X_test, y_test))

y_pred = forest.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)

print(f"accuracy: {accuracy:.3f}")
for i, p in enumerate(precision):
    print(f"precision {label_strings[i]}: {p:.3f}")
for i, r in enumerate(recall):
    print(f"recall {label_strings[i]}: {r:.3f}")

cm = confusion_matrix(y_test, y_pred)
print("confusion matrix:\n", cm)
```

%% Cell type:markdown id:6e8f5ab9 tags:

### 7.3.4 - Task 4 - Case Study

%% Cell type:markdown id:bb3b51cd tags:

Perform a k-fold cross-validation with k=10 on the data set with a classifier of your choice and print the mean accuracy.

%% Cell type:code id:c05fa0f7 tags:

``` python
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]

k = 10
k_fold = KFold(n_splits=k, shuffle=True)

accuracies = []
for train, test in k_fold.split(X):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]

    forest = RandomForestClassifier()
    forest.fit(X_train, y_train)

    y_pred = forest.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("accuracy:", accuracy)
    accuracies.append(accuracy)

mean_accuracy = np.mean(accuracies)
print(f"Mean accuracy: {mean_accuracy:.3f}")
```
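
For reference, scikit-learn's `cross_val_score` wraps the split/fit/score loop above into a single call. A sketch using a synthetic blob dataset as a stand-in, since grain_data.csv may not be available here:

``` python
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the grain data: 7 classes, 5 numeric features.
X, y = make_blobs(n_samples=700, n_features=5, centers=7, random_state=0)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", scores.mean())
```

With an integer `cv` and a classifier, `cross_val_score` uses stratified folds by default, which the manual `KFold` loop does not.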

%% Cell type:markdown id:68f28bc2 tags:

### 7.3.4 Task 5 - Case Study

%% Cell type:markdown id:131c6311 tags:

Graphically highlight learning curves using differently sized training/test splits during the training of a linear and an rbf SVM to analyze the performance of the used classifiers.

%% Cell type:code id:304aba08 tags:

``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

filename = "grain_data.csv"
df = pd.read_csv(filename)
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values

dataset_size = len(df)
print(dataset_size)

svm_rbf = SVC(C=100, kernel="rbf", gamma=1)
svm_lin = SVC(C=100, kernel="linear")

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
fig.suptitle("Task 5")

train_sizes_abs, train_scores, test_scores = learning_curve(
    estimator=svm_rbf,
    X=X,
    y=y,
    train_sizes=np.linspace(0.2, 0.9, 10),
    shuffle=True,
)

mean_train_scores = np.mean(train_scores, axis=-1)
mean_test_scores = np.mean(test_scores, axis=-1)
rel_train_size = train_sizes_abs / dataset_size

ax1.plot(rel_train_size, mean_train_scores, c="tab:blue")
ax1.plot(rel_train_size, mean_test_scores, c="tab:red")
ax1.set(title="rbf", xlabel="Rel. # train split samples", ylabel="Score")
ax1.legend(["Train", "Test"], loc="lower right")

train_sizes_abs, train_scores, test_scores = learning_curve(
    estimator=svm_lin,
    X=X,
    y=y,
    train_sizes=np.linspace(0.2, 0.9, 10),
    shuffle=True,
)

mean_train_scores = np.mean(train_scores, axis=-1)
mean_test_scores = np.mean(test_scores, axis=-1)
rel_train_size = train_sizes_abs / dataset_size

ax2.plot(rel_train_size, mean_train_scores, c="tab:blue")
ax2.plot(rel_train_size, mean_test_scores, c="tab:red")
ax2.set(title="linear", xlabel="Rel. # train split samples")
ax2.legend(["Train", "Test"], loc="lower right")

plt.show()
```

%% Cell type:code id:ee0cf36d tags:

``` python
```
%% Cell type:markdown id:36fcdd98-89e8-464a-802d-fefeb1a4894e tags:

# Hands-on 7.1 Unsupervised Learning

%% Cell type:markdown id:cc965bb9-4eca-4a80-af94-bc04af5941fd tags:

## 7.1.2 Clustering

%% Cell type:markdown id:e6a306f0-6352-4ea2-89b2-4a4ac4d8c09f tags:

### 7.1.2 - Task 1 - K-Means

%% Cell type:markdown id:c97d98c3-ac82-41a1-ab99-e7f469fd65d9 tags:

Load and visualize the Iris data set with and without class labels.
Use two of the four features for visualization.
The petal width should be displayed on the y-axis and the petal length should be displayed on the x-axis.

%% Cell type:code id:005169de-05ce-4909-a99f-a6fb2915f089 tags:

``` python
```

%% Cell type:markdown id:dea0d9e6-b2a7-4a37-aa1d-b5f6c670af1f tags:

### 7.1.2 - Task 2 - K-Means

%% Cell type:markdown id:10416888-bfef-4134-8a24-414ba2402623 tags:

Visualize the results of K-means with 5 clusters and 10 initializations next to the target result.
Also display the cluster centers on top of the visualization.

%% Cell type:code id:1ac6d9b2-aca4-4f8d-b649-b46431d1609c tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
```

%% Cell type:markdown id:eaf49470 tags:

### 7.1.2 - Task 3 - K-Means

%% Cell type:markdown id:27ec3761 tags:

Find a good value for the number of clusters k by visualizing the inertia for k = 1 to 10.

%% Cell type:code id:9104fb49 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
```

%% Cell type:markdown id:36955304-1116-4f9c-bf20-e13f3d0dcaf7 tags:

### 7.1.2 - Task 4 - DBSCAN

%% Cell type:markdown id:1b4a3e62 tags:

Load the Iris data set and visualize the outcome of the DBSCAN algorithm with epsilon = 0.3 and min_samples = 3 next to the labelled data set. Draw a red cross on every data instance considered as an anomaly by the DBSCAN algorithm.

%% Cell type:code id:4853842e tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
```

%% Cell type:markdown id:40c20803 tags:

### 7.1.2 - Task 5 - DBSCAN

%% Cell type:markdown id:df4a7fe4 tags:

Create a moon data set with the sklearn.datasets module including 200 samples, with Gaussian noise comprising a std deviation of 0.05. Fix the randomly generated data set with a random state to always receive the same dataset. Find good parameters for the DBSCAN algorithm by trial and error. The algorithm should split the two moons into two clusters.

%% Cell type:code id:41b77f5f tags:

``` python
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
```

%% Cell type:markdown id:47e562b7 tags:

## 7.1.3 Anomaly Detection

%% Cell type:markdown id:d58dd347 tags:

### 7.1.3 - Task 1 - GMM

%% Cell type:markdown id:115414a3 tags:

a)
Create a dataset with the make_blob function of the datasets module. The dataset should contain five blobs with the following number of data instances (50, 20, 30, 70, 10). Each data instance should be 2-dimensional. Pass a random seed to the function to make the data set reproducible across multiple function calls. Save the centers of the blobs in a variable and set the std deviation of the first two blobs to 1.0 and the other blobs to 0.8. Visualize the created dataset with labels.

b)
Use the GaussianMixture class from the mixture module to create a Gaussian mixture model that fits the data set. Try different parameters to find a good representation.

%% Cell type:code id:e3b5ae00 tags:

``` python
import matplotlib.pyplot as plt
```

%% Cell type:markdown id:35020d39 tags:

### 7.1.3 - Task 2 - GMM

%% Cell type:markdown id:95682b55 tags:

Perform an anomaly detection using a GMM with 2 components. Create a blob dataset with 3 blobs with 120, 70 and 10 data instances and with std deviations of 0.8, 0.8 and 5.0 respectively. Mark the anomalies with a red cross.

%% Cell type:code id:ac18d0d0 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
```

%% Cell type:markdown id:3d0b9206 tags:

## 7.1.4 Dimensionality Reduction

%% Cell type:markdown id:9a08ae29 tags:

### 7.1.4 - Task 1 - PCA & t-SNE

%% Cell type:markdown id:f8ac9a60 tags:

Familiarize yourself with the digits dataset. Visualize a single data instance as an image and print the corresponding label.

%% Cell type:code id:5973382c tags:

``` python
import matplotlib.pyplot as plt
```
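
%% Cell type:markdown id:ed10a004 tags:

One possible solution: each instance is an 8x8 grayscale image, available both flattened (`digits.data`) and as a 2D array (`digits.images`). Showing the first instance is an arbitrary choice.

%% Cell type:code id:ed10b004 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)  # 1797 samples, 64 features (8x8 pixels)

# Show the first instance as an image together with its label
plt.imshow(digits.images[0], cmap="gray")
plt.title(f"Label: {digits.target[0]}")
plt.show()
```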

%% Cell type:markdown id:d1eb9eea tags:

### 7.1.4 - Task 2 - PCA & t-SNE

%% Cell type:markdown id:f5f9e744 tags:

Use the PCA class of the decomposition module and the TSNE class of the manifold module to visualize the complete digits dataset in 2D plots instead of showing a single image.

%% Cell type:code id:be65239d tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap

colors = ListedColormap(["red", "blue", "green", "yellow", "orange", "purple", "gray", "black", "brown", "teal"])
```
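
%% Cell type:markdown id:ed10a005 tags:

One possible solution, projecting the 64-dimensional images down to 2D with both methods. `n_components=2` follows from the task; the t-SNE `random_state=0` is an arbitrary reproducibility choice (t-SNE may take a few seconds on the full dataset).

%% Cell type:code id:ed10b005 tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

colors = ListedColormap(["red", "blue", "green", "yellow", "orange",
                         "purple", "gray", "black", "brown", "teal"])

digits = load_digits()
X, y = digits.data, digits.target

# Project the 64-dimensional images down to two dimensions
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=colors, s=5)
ax1.set_title("PCA")
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=colors, s=5)
ax2.set_title("t-SNE")
plt.show()
```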

%% Cell type:markdown id:22ced03a tags:

# Hands-on 7.2 Supervised Learning

%% Cell type:markdown id:5c377e71 tags:

## 7.2.3 Classification

%% Cell type:markdown id:914661a7 tags:

### 7.2.3 - Task 1 - SVM

%% Cell type:markdown id:d979fe1c tags:

Load the iris dataset and use an SVM classifier with a linear kernel and C=1.0 to train on the features petal length and petal width. Visualize the predictions next to the data with the original labels. Print the mean accuracy of the classifier on this dataset. Draw the three decision boundaries between the three classes in the plot.

%% Cell type:code id:2075a53e tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 1")
```

%% Cell type:markdown id:f6253c67 tags:

### 7.2.3 - Task 2 - SVM

%% Cell type:markdown id:83cda835 tags:

Load the iris dataset and use an SVM classifier with an RBF kernel, trying out different parameters, to train on the features petal length and petal width. Print the mean accuracy of the classifier on this dataset. Draw the decision boundaries between the three classes in the plot.

%% Cell type:code id:d4cd74d2 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
```

%% Cell type:markdown id:1e7c8f45 tags:

### 7.2.3 - Task 3 - KNN

%% Cell type:markdown id:61f6cde9 tags:

Create a 2D blob dataset with 500 data instances and 5 blobs. Use a KNN classifier and choose a value for K. Visualize the predictions next to the true labels in two plots. Use the contourf function to draw the decision boundaries of the classifier.

%% Cell type:code id:2cc1246d tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, n_features=2, centers=5)
```
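
%% Cell type:markdown id:ed10a008 tags:

One possible solution. `n_neighbors=5` is an arbitrary first choice for K, and `random_state=0` just makes the blobs reproducible; the grid resolution of 200 is likewise a presentation choice.

%% Cell type:code id:ed10b008 tags:

``` python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # K=5 is an arbitrary first choice
knn.fit(X, y)

# Evaluate the classifier on a grid and draw the class regions with contourf
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
for ax, c, title in [(ax1, knn.predict(X), "predictions"), (ax2, y, "true labels")]:
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=c, s=10)
    ax.set_title(title)
plt.show()
```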

%% Cell type:markdown id:413c44dc tags:

### 7.2.3 - Task 4 - Decision Tree

%% Cell type:markdown id:964cf8c4 tags:

Create a moons dataset with default values and a noise standard deviation of 0.2. Use a decision tree to classify the data. Visualize the decision boundaries for a decision tree without restrictions and for one with a minimum of 4 samples per leaf.

%% Cell type:code id:0b26104e tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_moons

X, y = make_moons(noise=0.2)
```

%% Cell type:markdown id:1f26c787 tags:

### 7.2.3 - Task 5 - Random Forest

%% Cell type:markdown id:a55c43e1 tags:

Create a moons dataset with 1000 samples and a std. deviation of 0.2. Use a random forest classifier with 150 trees and draw the decision boundaries.

%% Cell type:code id:47b947ea tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=.2, random_state=22)
```

%% Cell type:markdown id:1daaa736 tags:

## 7.2.4 Regression

%% Cell type:markdown id:8d75e79c tags:

### 7.2.4 - Task 1 - Regression with scikit-learn

%% Cell type:markdown id:75335348 tags:

Load the California housing dataset. Plot the coordinates of the houses with a colormap of the house value in units of $100,000. Use linear regression, KNN regression and random forest regression to predict the value from all available features. Visualize the predictions by plotting the predicted values on the coordinates.

%% Cell type:code id:10059aa2 tags:

``` python
import matplotlib.pyplot as plt
```
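
%% Cell type:markdown id:ed10a011 tags:

One possible solution. Note that `fetch_california_housing` downloads the data on first use; the small `n_estimators=20` for the forest is an arbitrary choice to keep the runtime down, and the models are fit and evaluated on the full data purely for visualization.

%% Cell type:code id:ed10b011 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing  # downloads on first use
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

housing = fetch_california_housing()
X, y = housing.data, housing.target  # target: median house value in $100,000
lon, lat = X[:, 7], X[:, 6]  # Longitude and Latitude feature columns

models = [
    LinearRegression(),
    KNeighborsRegressor(),
    RandomForestRegressor(n_estimators=20, random_state=0),  # small forest for speed
]

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
axes[0].scatter(lon, lat, c=y, s=2)
axes[0].set_title("true values")
for ax, model in zip(axes[1:], models):
    pred = model.fit(X, y).predict(X)
    ax.scatter(lon, lat, c=pred, s=2)
    ax.set_title(type(model).__name__)
plt.show()
```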

%% Cell type:markdown id:44a04d6a tags:

# Hands-on 7.3 Model Evaluation

%% Cell type:markdown id:ba02f77d tags:

## 7.3.4 - Case Study

%% Cell type:markdown id:798d1b87 tags:

### 7.3.4 - Task 1 - Case Study

%% Cell type:markdown id:2054df58 tags:

Load the data stored in "grain_data.csv" with pandas. This file lists different grain types and their contents. The class labels from 0 to 6 are "Barley", "Oat", "Corn", "Rice", "Rye", "Wheat" and "Spelt", and the content labels are given in the first row.

%% Cell type:code id:60748145 tags:

``` python
```
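
%% Cell type:markdown id:ed10a012 tags:

One possible solution sketch. Since grain_data.csv is not shipped with this text, the two inline example rows below are made-up illustrative values with the same layout (content labels in the first row, a "Label" column with classes 0 to 6); with the real file, replace the inline text with `pd.read_csv("grain_data.csv")`.

%% Cell type:code id:ed10b012 tags:

``` python
import io

import pandas as pd

# Tiny inline stand-in for grain_data.csv: content labels in the first row,
# class labels 0..6 in the "Label" column (values here are illustrative only)
csv_text = """Protein,Fat,Carbohydrates,Fiber,Minerals,Label
10.6,2.1,63.3,14.8,2.3,0
13.5,7.0,58.2,9.7,2.9,1
"""
df = pd.read_csv(io.StringIO(csv_text))

label_strings = ["Barley", "Oat", "Corn", "Rice", "Rye", "Wheat", "Spelt"]
print(df.head())
print(df["Label"].map(dict(enumerate(label_strings))))
```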

%% Cell type:markdown id:f6ee549f tags:

### 7.3.4 - Task 2 - Case Study

%% Cell type:markdown id:d39812c3 tags:

Get a feeling for the imported data by visualizing it in 2D using the t-SNE algorithm.

%% Cell type:code id:9e2bb6ad tags:

``` python
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]

colors = ListedColormap(["red", "blue", "green", "black", "teal", "brown", "yellow"])
```
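
%% Cell type:markdown id:ed10a013 tags:

A sketch of the t-SNE step. Because grain_data.csv is not available here, sklearn's wine dataset stands in for the `X` and `y` loaded above; `random_state=0` is an arbitrary reproducibility choice.

%% Cell type:code id:ed10b013 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE
from sklearn.datasets import load_wine

# Stand-in for the grain features/labels loaded from the CSV above
X, y = load_wine(return_X_y=True)

# Embed the feature vectors into 2D for visual inspection
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.title("t-SNE embedding")
plt.show()
```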

%% Cell type:markdown id:30bf1be5 tags:

### 7.3.4 - Task 3 - Case Study

%% Cell type:markdown id:9382d0ae tags:

Split the data into a train and a test set. Use a classifier of your choice and train it on the train split. Print the confusion matrix, the accuracy, and the per-class precision and recall of the classifier on the test split.

%% Cell type:code id:94a22985 tags:

``` python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]
```
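
%% Cell type:markdown id:ed10a014 tags:

One possible solution sketch. With grain_data.csv unavailable here, sklearn's wine dataset stands in for the `X` and `y` loaded above; the random forest, the 30% test size and `random_state=0` are arbitrary choices. `average=None` yields one precision/recall value per class.

%% Cell type:code id:ed10b014 tags:

``` python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Stand-in for the grain features/labels loaded from the CSV above
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
# average=None returns one score per class instead of a single average
print("precision:", precision_score(y_test, y_pred, average=None))
print("recall:", recall_score(y_test, y_pred, average=None))
```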

%% Cell type:markdown id:6e8f5ab9 tags:

### 7.3.4 - Task 4 - Case Study

%% Cell type:markdown id:bb3b51cd tags:

Perform a k-fold cross-validation with k=10 on the data set with a classifier of your choice and print the mean accuracy.

%% Cell type:code id:c05fa0f7 tags:

``` python
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]
```
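
%% Cell type:markdown id:ed10a015 tags:

One possible solution sketch using `cross_val_score`, which handles the fold splitting internally. With grain_data.csv unavailable here, sklearn's wine dataset stands in for the `X` and `y` loaded above, and the random forest is an arbitrary classifier choice.

%% Cell type:code id:ed10b015 tags:

``` python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the grain features/labels loaded from the CSV above
X, y = load_wine(return_X_y=True)

clf = RandomForestClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=10)  # k-fold cross-validation with k=10
print("mean accuracy:", scores.mean())
```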

%% Cell type:markdown id:68f28bc2 tags:

### 7.3.4 Task 5 - Case Study

%% Cell type:markdown id:131c6311 tags:

Graphically highlight learning curves using differently sized training/test splits during the training of a linear and an RBF SVM to analyze the performance of the classifiers and detect issues such as overfitting and underfitting.

%% Cell type:code id:304aba08 tags:

``` python
import pandas as pd
import matplotlib.pyplot as plt


filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]

fig, (ax1,ax2) = plt.subplots(1, 2)
```
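
%% Cell type:markdown id:ed10a016 tags:

One possible solution sketch using sklearn's `learning_curve` helper. With grain_data.csv unavailable here, the wine dataset stands in for the `X` and `y` loaded above; the five training sizes, `cv=5`, `shuffle=True` and `random_state=0` are arbitrary choices, and feature scaling is omitted for brevity.

%% Cell type:code id:ed10b016 tags:

``` python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Stand-in for the grain features/labels loaded from the CSV above
X, y = load_wine(return_X_y=True)

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
for ax, kernel in [(ax1, "linear"), (ax2, "rbf")]:
    # Train on increasing fractions of the data; shuffle so each subset
    # contains all classes
    sizes, train_scores, test_scores = learning_curve(
        SVC(kernel=kernel), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=5, shuffle=True, random_state=0,
    )
    ax.plot(sizes, train_scores.mean(axis=1), label="train")
    ax.plot(sizes, test_scores.mean(axis=1), label="validation")
    ax.set_title(f"{kernel} SVM")
    ax.legend()
plt.show()
```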