Load the Iris data set and visualize the outcome of the DBSCAN algorithm with epsilon = 0.3 and min_samples = 3 next to the labelled data set. Draw a red cross on every data instance considered as an anomaly by the DBSCAN algorithm.
%% Cell type:code id:4853842e tags:
``` python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
```
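%% Cell type:markdown id:1a2f9c30 tags:
One possible solution sketch: DBSCAN runs on all four iris features, while the plots show petal length against petal width (this feature choice for the visualization is an assumption). DBSCAN assigns the label -1 to noise points, which we treat as anomalies.
%% Cell type:code id:1a2f9c31 tags:
``` python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# DBSCAN assigns the label -1 to points it considers noise/anomalies
db = DBSCAN(eps=0.3, min_samples=3).fit(X)
anomalies = db.labels_ == -1

# True labels on the left, DBSCAN clustering with anomalies on the right
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 2], X[:, 3], c=iris.target)
ax1.set_title("True labels")
ax2.scatter(X[:, 2], X[:, 3], c=db.labels_)
ax2.scatter(X[anomalies, 2], X[anomalies, 3], c="red", marker="x")
ax2.set_title("DBSCAN (red x = anomaly)")
plt.show()
```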
%% Cell type:markdown id:40c20803 tags:
### 7.1.2 - Task 5 - DBSCAN
%% Cell type:markdown id:df4a7fe4 tags:
Create a moons data set with the sklearn.datasets module containing 200 samples, with Gaussian noise with a standard deviation of 0.05. Fix the randomly generated data set with a random state so that you always receive the same data set. Find good parameters for the DBSCAN algorithm by trial and error; the algorithm should split the two moons into two clusters.
%% Cell type:code id:41b77f5f tags:
``` python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
```
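%% Cell type:markdown id:2b3e8d40 tags:
A sketch under the assumption that random_state=42 and the trial-and-error values eps=0.2, min_samples=5 are acceptable; other combinations work as well with this little noise.
%% Cell type:code id:2b3e8d41 tags:
``` python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# 200 samples with Gaussian noise (std 0.05); random_state fixes the dataset
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps and min_samples found by trial and error
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_clusters = len(set(db.labels_) - {-1})

plt.scatter(X[:, 0], X[:, 1], c=db.labels_)
plt.title(f"DBSCAN: {n_clusters} clusters")
plt.show()
```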
%% Cell type:markdown id:47e562b7 tags:
## 7.1.3 Anomaly Detection
%% Cell type:markdown id:d58dd347 tags:
### 7.1.3 - Task 1 - GMM
%% Cell type:markdown id:115414a3 tags:
a)
Create a dataset with the make_blobs function of the datasets module. The dataset should contain five blobs with the following numbers of data instances: (50, 20, 30, 70, 10). Each data instance should be 2-dimensional. Pass a random seed to the function to make the data set reproducible across multiple function calls. Save the centers of the blobs in a variable, and set the standard deviation of the first two blobs to 1.0 and of the other blobs to 0.8. Visualize the created dataset with labels.
b)
Use the GaussianMixture class from the mixture module to create a Gaussian mixture model that fits the data set. Try different parameters to find a good representation.
%% Cell type:code id:e3b5ae00 tags:
``` python
import matplotlib.pyplot as plt
```
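%% Cell type:markdown id:3c4f7e50 tags:
A combined sketch for a) and b). The seed value, the number of GMM components, and the use of return_centers (available in scikit-learn ≥ 0.23) are assumptions.
%% Cell type:code id:3c4f7e51 tags:
``` python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# a) Five blobs of different sizes; the first two are wider than the rest
X, y, centers = make_blobs(n_samples=[50, 20, 30, 70, 10],
                           cluster_std=[1.0, 1.0, 0.8, 0.8, 0.8],
                           n_features=2, random_state=7,
                           return_centers=True)

# b) Fit a Gaussian mixture model; 5 components match the five blobs
gmm = GaussianMixture(n_components=5, covariance_type="full",
                      random_state=7).fit(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y)
ax1.scatter(centers[:, 0], centers[:, 1], c="black", marker="+", s=100)
ax1.set_title("True labels (+ = blob centers)")
ax2.scatter(X[:, 0], X[:, 1], c=gmm.predict(X))
ax2.set_title("GMM components")
plt.show()
```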
%% Cell type:markdown id:35020d39 tags:
### 7.1.3 - Task 2 - GMM
%% Cell type:markdown id:95682b55 tags:
Perform anomaly detection using a GMM with 2 components. Create a blob dataset with 3 blobs of 120, 70 and 10 data instances and with standard deviations of 0.8, 0.8 and 5.0, respectively. Mark the anomalies with a red cross.
%% Cell type:code id:ac18d0d0 tags:
``` python
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
```
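%% Cell type:markdown id:4d5a6f60 tags:
A sketch of the density-based approach: score every sample under the fitted mixture and flag the lowest-density points. The seed and the 5% threshold are free choices, not part of the task.
%% Cell type:code id:4d5a6f61 tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two tight blobs and one widely scattered blob that supplies the outliers
X, y = make_blobs(n_samples=[120, 70, 10], cluster_std=[0.8, 0.8, 5.0],
                  random_state=3)

gmm = GaussianMixture(n_components=2, random_state=3).fit(X)

# Treat the 5% of samples with the lowest log-likelihood as anomalies
densities = gmm.score_samples(X)
threshold = np.percentile(densities, 5)
anomalies = X[densities < threshold]

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(anomalies[:, 0], anomalies[:, 1], c="red", marker="x")
plt.show()
```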
%% Cell type:markdown id:3d0b9206 tags:
## 7.1.4 Dimensionality Reduction
%% Cell type:markdown id:9a08ae29 tags:
### 7.1.4 - Task 1 - PCA & t-SNE
%% Cell type:markdown id:f8ac9a60 tags:
Get used to the digits dataset. Visualize a single data instance as an image and print the corresponding label.
%% Cell type:code id:5973382c tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

# Load the digits dataset: 8x8 grayscale images flattened into 64 features
X, y = load_digits(return_X_y=True)

# Reshape one flattened instance back into its 8x8 image form
img = np.reshape(X[500], (8, 8))
print(img.shape)

# Print the corresponding label and show the instance as an image
print(y[500])
plt.imshow(img, cmap="gray")
plt.show()
```
%% Cell type:markdown id:d1eb9eea tags:
### 7.1.4 - Task 2 - PCA & t-SNE
%% Cell type:markdown id:f5f9e744 tags:
Use the PCA class of the decomposition module and the TSNE class of the manifold module to visualize the complete digits dataset in 2D plots instead of showing a single image.
Load the iris dataset and use an SVM classifier with a linear kernel and C=1.0, trained on the features petal length and petal width. Visualize the predictions next to the data with the original labels. Print the mean accuracy of the classifier on this data set. Draw the decision boundaries between the three classes in the plot.
Load the iris dataset and use an SVM classifier with an rbf kernel; try out different parameters while training on the features petal length and petal width. Print the mean accuracy of the classifier on this data set. Draw the decision boundaries between the three classes in the plot.
%% Cell type:code id:d4cd74d2 tags:
``` python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC
iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
```
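%% Cell type:markdown id:5e6b7a70 tags:
A sketch for the linear-kernel task; swapping in SVC(kernel="rbf", C=..., gamma=...) covers the rbf task with the same plotting code. The decision regions are drawn by evaluating the classifier on a dense grid, a common substitute for drawing the boundary lines themselves.
%% Cell type:code id:5e6b7a71 tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length, petal width
y = iris.target

# Linear SVM; for the rbf task use SVC(kernel="rbf", C=..., gamma=...) instead
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("mean accuracy:", clf.score(X, y))

# Evaluate the classifier on a grid to draw the decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y)
ax1.set_title("Original labels")
ax2.contourf(xx, yy, Z, alpha=0.3)
ax2.scatter(X[:, 0], X[:, 1], c=clf.predict(X))
ax2.set_title("SVC predictions")
plt.show()
```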
%% Cell type:markdown id:1e7c8f45 tags:
### 7.2.3 - Task 3 - KNN
%% Cell type:markdown id:61f6cde9 tags:
Create a 2D blob dataset with 500 data instances and 5 blobs. Use a KNN classifier and choose a value for K. Visualize the predictions next to the true labels in two plots. Use the contourf function to draw the decision boundaries of the classifier.
%% Cell type:code id:2cc1246d tags:
``` python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500, n_features=2, centers=5)
```
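%% Cell type:markdown id:6f7c8b80 tags:
A sketch with K = 5; the random_state value is an assumption added so the blobs are reproducible. The decision regions come from predicting on a grid and passing the result to contourf.
%% Cell type:code id:6f7c8b81 tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=5, random_state=0)

# K = 5 chosen as a reasonable default for 500 samples
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Grid over the feature space for the contourf decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y)
ax1.set_title("True labels")
ax2.contourf(xx, yy, Z, alpha=0.3)
ax2.scatter(X[:, 0], X[:, 1], c=knn.predict(X))
ax2.set_title("KNN predictions")
plt.show()
```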
%% Cell type:markdown id:413c44dc tags:
### 7.2.3 - Task 4 - Decision Tree
%% Cell type:markdown id:964cf8c4 tags:
Create a moons dataset with default values and a standard deviation of 0.2. Use a decision tree to classify the data. Visualize the decision boundaries for a decision tree without restrictions and for one with a minimum of 4 samples per leaf.
%% Cell type:code id:0b26104e tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
X, y = make_moons(noise=0.2)
```
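%% Cell type:markdown id:7a8d9c90 tags:
A sketch comparing the two trees side by side; the random_state values are assumptions added for reproducibility. The unrestricted tree memorizes the training data, while min_samples_leaf=4 smooths the boundary.
%% Cell type:code id:7a8d9c91 tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.2, random_state=0)

# One tree without restrictions, one with at least 4 samples per leaf
tree_full = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_leaf = DecisionTreeClassifier(min_samples_leaf=4, random_state=0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, tree, title in zip(axes, (tree_full, tree_leaf),
                           ("no restrictions", "min_samples_leaf=4")):
    Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y)
    ax.set_title(title)
plt.show()
```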
%% Cell type:markdown id:1f26c787 tags:
### 7.2.3 - Task 5 - Random Forest
%% Cell type:markdown id:a55c43e1 tags:
Create a moons dataset with 1000 samples and a std. deviation of 0.2. Use a random forest classifier with 150 trees and draw the decision boundaries.
%% Cell type:code id:47b947ea tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=.2, random_state=22)
```
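%% Cell type:markdown id:8b9e0da0 tags:
A sketch continuing from the data generated above (reproduced here so the cell is self-contained), with 150 trees as the task requires; the forest's random_state is an assumption.
%% Cell type:code id:8b9e0da1 tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=22)

# Random forest with 150 trees
forest = RandomForestClassifier(n_estimators=150, random_state=22).fit(X, y)

# Predict on a grid to draw the decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = forest.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("Random forest decision regions")
plt.show()
```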
%% Cell type:markdown id:1daaa736 tags:
## 7.2.4 Regression
%% Cell type:markdown id:8d75e79c tags:
### 7.2.4 - Task 1 - Regression with scikit-learn
%% Cell type:markdown id:75335348 tags:
Load the california housing dataset. Plot the coordinates of the houses with a colormap of the value in $100000. Use linear regression, KNN regression and random forest regression to predict the value with all available features. Visualize the predictions by plotting the predicted values on the coordinates.
%% Cell type:code id:10059aa2 tags:
``` python
import matplotlib.pyplot as plt
```
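%% Cell type:markdown id:9c0f1eb0 tags:
A sketch of one possible approach. fetch_california_housing downloads and caches the dataset on first use; the model hyperparameters (K=10, 30 trees) are assumptions, and the models are fitted and evaluated on the full data since the task only asks for a visual comparison of predictions.
%% Cell type:code id:9c0f1eb1 tags:
``` python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Target is the median house value in $100,000
housing = fetch_california_housing()
X, y = housing.data, housing.target
lat, lon = X[:, 6], X[:, 7]  # Latitude, Longitude

models = {"Linear": LinearRegression(),
          "KNN": KNeighborsRegressor(n_neighbors=10),
          "Random forest": RandomForestRegressor(n_estimators=30,
                                                 random_state=0)}

# House coordinates colored by true value, then by each model's predictions
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
axes[0].scatter(lon, lat, c=y, s=2)
axes[0].set_title("True values")
for ax, (name, model) in zip(axes[1:], models.items()):
    pred = model.fit(X, y).predict(X)
    ax.scatter(lon, lat, c=pred, s=2)
    ax.set_title(name)
plt.show()
```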
%% Cell type:markdown id:44a04d6a tags:
# Hands-on 7.3 Model Evaluation
%% Cell type:markdown id:ba02f77d tags:
## 7.3.4 - Case Study
%% Cell type:markdown id:798d1b87 tags:
### 7.3.4 - Task 1 - Case Study
%% Cell type:markdown id:2054df58 tags:
Load the data stored in "grain_data.csv" with pandas. In this file, different grain types and their nutrient contents are listed. The class labels from 0 to 6 are "Barley", "Oat", "Corn", "Rice", "Rye", "Wheat", "Spelt", and the content labels are given in the first row.
%% Cell type:code id:60748145 tags:
``` python
```
%% Cell type:markdown id:f6ee549f tags:
### 7.3.4 - Task 2 - Case Study
%% Cell type:markdown id:d39812c3 tags:
Get a feeling for the imported data by visualizing it in 2D using the t-SNE algorithm.
%% Cell type:code id:9e2bb6ad tags:
``` python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
```
%% Cell type:markdown id:a1d2e3f4 tags:
### 7.3.4 - Task 3 - Case Study
%% Cell type:markdown id:b2e3f4a5 tags:
Split the data into a train and test set. Use a classifier of your choice and train it on the train split. Print the confusion matrix, the accuracy, the precision of each class, and the recall of each class of the classifier on the test split.
%% Cell type:code id:94a22985 tags:
``` python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
```
%% Cell type:markdown id:c3f4a5b6 tags:
### 7.3.4 - Task 4 - Case Study
%% Cell type:markdown id:d4a5b6c7 tags:
Graphically highlight learning curves, using differently sized training/test splits during the training of a linear and an rbf SVM, to analyze the performance of the classifiers and to detect issues such as overfitting and underfitting.
%% Cell type:code id:304aba08 tags:
``` python
import pandas as pd
import matplotlib.pyplot as plt
filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
y = df["Label"].values
X = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']].values
```