Load the Iris data set and visualize the outcome of the DBSCAN algorithm with epsilon = 0.3 and min_samples = 3 next to the labelled data set. Draw a red cross on every data instance that the DBSCAN algorithm considers an anomaly.
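One possible solution sketch (plotting only the first two Iris features is an assumption made here so the result fits a 2-D plot; DBSCAN marks noise points with the label -1):

``` python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Load Iris; keep only the first two features for a 2-D visualization.
X, y = load_iris(return_X_y=True)
X2 = X[:, :2]

# DBSCAN with the given parameters; noise points receive the label -1.
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X2)
anomalies = X2[labels == -1]

# DBSCAN result next to the labelled data, anomalies as red crosses.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X2[:, 0], X2[:, 1], c=labels)
ax1.scatter(anomalies[:, 0], anomalies[:, 1], color="red", marker="x")
ax1.set_title("DBSCAN (eps=0.3, min_samples=3)")
ax2.scatter(X2[:, 0], X2[:, 1], c=y)
ax2.set_title("true labels")
```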
Create a moon data set with the sklearn.datasets module containing 200 samples with Gaussian noise (standard deviation 0.05). Pass a random state so that the same data set is generated on every run. Find good parameters for the DBSCAN algorithm by trial and error; the algorithm should split the two moons into two clusters.
%% Cell type:code id:41b77f5f tags:
``` python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Imports needed for the DBSCAN tasks above
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris, make_moons
```
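A sketch of one parameter combination found by trial and error; eps=0.2 with min_samples=5 separated the two moons at this noise level in a quick experiment, but other values may work as well:

``` python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Reproducible moons data: 200 samples, Gaussian noise with std 0.05.
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# Parameters found by trial and error; noise points are labelled -1.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title(f"DBSCAN found {n_clusters} clusters")
```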
%% Cell type:markdown id:47e562b7 tags:
## 7.1.3 Anomaly Detection
%% Cell type:markdown id:d58dd347 tags:
### 7.1.3 - Task 1 - GMM
%% Cell type:markdown id:115414a3 tags:
a)
Create a data set with the make_blobs function of the datasets module. The data set should contain five blobs with the following numbers of data instances: (50, 20, 30, 70, 10). Each data instance should be 2-dimensional. Pass a random seed to the function to make the data set reproducible across multiple function calls. Save the centers of the blobs in a variable and set the standard deviation of the first two blobs to 1.0 and of the other blobs to 0.8. Visualize the created data set once with labels and once without labels.
b)
Use the GaussianMixture class from the mixture module to create a Gaussian mixture model that fits the data set. Try different parameters to find a good representation.
%% Cell type:code id:e3b5ae00 tags:
``` python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Imports needed for this task
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
```
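A sketch for both parts; it assumes `make_blobs` with per-cluster sample counts and `return_centers=True` (available in recent scikit-learn versions), and random_state=42 plus 5 mixture components are arbitrary choices:

``` python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# a) Five blobs with the given sizes and per-blob standard deviations.
sizes = [50, 20, 30, 70, 10]
stds = [1.0, 1.0, 0.8, 0.8, 0.8]
X, y, centers = make_blobs(n_samples=sizes, n_features=2, cluster_std=stds,
                           random_state=42, return_centers=True)

# Side by side: once with labels, once without.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y)
ax1.set_title("with labels")
ax2.scatter(X[:, 0], X[:, 1])
ax2.set_title("without labels")

# b) One mixture component per blob is a natural starting point.
gmm = GaussianMixture(n_components=5, n_init=10, random_state=42).fit(X)
```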
%% Cell type:markdown id:35020d39 tags:
### 7.1.3 - Task 2 - GMM
%% Cell type:markdown id:95682b55 tags:
Perform anomaly detection using a GMM with 2 components. Create a blob data set with 3 blobs with 120, 70 and 10 data instances and with standard deviations of 0.8, 0.8 and 5.0, respectively. Mark the anomalies with a red cross.
%% Cell type:code id:ac18d0d0 tags:
``` python
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
```
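A sketch: fit the 2-component GMM, then flag the lowest-density points as anomalies. The 5% threshold and random_state=7 are arbitrary choices made here:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three blobs: the third one is small and very spread out (std 5.0).
X, y = make_blobs(n_samples=[120, 70, 10], cluster_std=[0.8, 0.8, 5.0],
                  random_state=7)

gmm = GaussianMixture(n_components=2, n_init=10, random_state=7).fit(X)

# Low log-density under the model = anomaly; cut off the lowest 5%.
densities = gmm.score_samples(X)
threshold = np.percentile(densities, 5)
anomalies = X[densities < threshold]

plt.scatter(X[:, 0], X[:, 1], c=gmm.predict(X))
plt.scatter(anomalies[:, 0], anomalies[:, 1], color="red", marker="x")
```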
%% Cell type:markdown id:3d0b9206 tags:
## 7.1.4 Dimensionality Reduction
%% Cell type:markdown id:9a08ae29 tags:
### 7.1.4 - Task 1 - PCA & t-SNE
%% Cell type:markdown id:f8ac9a60 tags:
Familiarize yourself with the digits data set. Visualize a single data instance as an image and print the corresponding label.
%% Cell type:code id:5973382c tags:
``` python
import matplotlib.pyplot as plt
# Import needed for this task
from sklearn.datasets import load_digits
```
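A minimal sketch using `load_digits`; showing the first instance is an arbitrary choice, any index works:

``` python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)   # (1797, 64): 1797 flattened 8x8 grayscale images

# Visualize one instance as an image and print its label.
plt.imshow(digits.images[0], cmap="gray")
plt.title(f"label: {digits.target[0]}")
print("label:", digits.target[0])
```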
%% Cell type:markdown id:d1eb9eea tags:
### 7.1.4 - Task 2 - PCA & t-SNE
%% Cell type:markdown id:f5f9e744 tags:
Use the PCA class of the decomposition module and the TSNE class of the manifold module to visualize the complete digits data set in 2D plots instead of showing a single image.
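A sketch on the digits data, with colors encoding the true label; random_state=42 is an assumption, and note that t-SNE takes noticeably longer than PCA:

``` python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Project the 64-dimensional images down to 2 dimensions.
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="tab10", s=5)
ax1.set_title("PCA")
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=5)
ax2.set_title("t-SNE")
```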
Load the iris data set and use an SVM classifier with a linear kernel and C=1.0, trained on the features petal length and petal width. Visualize the predictions next to the data with the original labels. Print the mean accuracy of the classifier on this data set. Draw the three decision boundaries between the three classes in the plot.
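A sketch for the linear-kernel task; filling the predicted-class regions with `contourf` implicitly draws the decision boundaries between the classes (the grid resolution is arbitrary):

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]   # petal length, petal width
y = iris.target

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.score(X, y))  # mean accuracy on this data set

# Evaluate the classifier on a grid to draw the decision regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.contourf(xx, yy, Z, alpha=0.3)
ax1.scatter(X[:, 0], X[:, 1], c=svm.predict(X))
ax1.set_title("predictions")
ax2.contourf(xx, yy, Z, alpha=0.3)
ax2.scatter(X[:, 0], X[:, 1], c=y)
ax2.set_title("true labels")
```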
Load the iris data set and use an SVM classifier with an RBF kernel; try out different parameters to train on the features petal length and petal width. Print the mean accuracy of the classifier on this data set. Draw the decision boundaries between the three classes in the plot.
Create a 2D blob data set with 500 data instances and 5 blobs. Use a KNN classifier and choose a value for K. Visualize the predictions next to the true labels in two plots. Use the contourf function to draw the decision boundaries of the classifier.
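A sketch with K=5 (an arbitrary but common choice) and a fixed random state; a `contourf` over a prediction grid draws the decision regions:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=5, random_state=3)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Evaluate the classifier on a grid for the decision regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.contourf(xx, yy, Z, alpha=0.3)
ax1.scatter(X[:, 0], X[:, 1], c=knn.predict(X), s=10)
ax1.set_title("KNN predictions (K=5)")
ax2.scatter(X[:, 0], X[:, 1], c=y, s=10)
ax2.set_title("true labels")
```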
%% Cell type:code id:2cc1246d tags:
``` python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500, n_features=2, centers=5)
```
Create a moons data set with default values and a noise standard deviation of 0.2. Use a decision tree to classify the data. Visualize the decision boundaries for a decision tree without restrictions and for one with a minimum of 4 samples per leaf.
%% Cell type:code id:0b26104e tags:
``` python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
X, y = make_moons(noise=0.2)
colors = ListedColormap(["red", "blue"])
```
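A sketch comparing an unrestricted tree with `min_samples_leaf=4`; random_state=42 is an assumption, and `make_moons` defaults to 100 samples:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.2, random_state=42)

# Shared grid for both decision-boundary plots.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
trees = {}
for ax, leaf in zip(axes, (1, 4)):   # leaf=1 is the unrestricted default
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=42).fit(X, y)
    trees[leaf] = tree
    Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=15)
    ax.set_title(f"min_samples_leaf={leaf}")
```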
%% Cell type:markdown id:1f26c787 tags:
### 7.2.3 - Task 5 - Random Forest
%% Cell type:markdown id:a55c43e1 tags:
Create a moons dataset with 1000 samples and a std. deviation of 0.2. Use a random forest classifier with 150 trees and draw the decision boundaries.
%% Cell type:code id:47b947ea tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=.2, random_state=22)
colors = ListedColormap(["red", "blue"])
```
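A sketch continuing from the starter cell above (self-contained here; the 200x200 grid resolution is arbitrary):

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=22)
rf = RandomForestClassifier(n_estimators=150, random_state=22).fit(X, y)

# Evaluate the forest on a grid to draw the decision regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))
Z = rf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.title("random forest, 150 trees")
```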
%% Cell type:markdown id:1daaa736 tags:
## 7.2.4 Regression
%% Cell type:markdown id:8d75e79c tags:
### 7.2.4 - Task 1 - Regression with scikit-learn
%% Cell type:markdown id:75335348 tags:
Load the boston house-prices data set. Plot the sixth feature of the data set (average number of rooms per dwelling) on the x-axis and the value in $1000 on the y-axis. Use linear regression and KNN regression to predict the value from the average number of rooms per dwelling and visualize the predictions in the corresponding plots.
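Note that `load_boston` was removed from scikit-learn in version 1.2, so the sketch below runs the same workflow on a synthetic rooms-vs-price stand-in instead; all numbers here are made up for illustration, and K=10 is an arbitrary choice:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for "average rooms" vs. "value in $1000".
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, 300)
price = 9 * rooms - 25 + rng.normal(0, 5, 300)
X = rooms.reshape(-1, 1)

lin = LinearRegression().fit(X, price)
knn = KNeighborsRegressor(n_neighbors=10).fit(X, price)

# Plot the data with each model's prediction curve on top.
xs = np.linspace(4, 9, 100).reshape(-1, 1)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, model, name in ((axes[0], lin, "linear regression"),
                        (axes[1], knn, "KNN regression")):
    ax.scatter(rooms, price, s=8)
    ax.plot(xs, model.predict(xs), color="red")
    ax.set_xlabel("average number of rooms")
    ax.set_title(name)
axes[0].set_ylabel("value in $1000")
```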
Load the california housing data set. Plot the coordinates of the houses with a colormap of the median house value. Use linear regression, KNN regression and random forest regression to predict the value from all available features. Visualize the predictions by plotting the predicted values on the coordinates.
Load the data stored in "grain_data.csv". This file lists different grain types and their contents. The class labels 0 to 6 are "Barley", "Oat", "Corn", "Rice", "Rye", "Wheat" and "Spelt", and the content labels are given in the first row.
Split the data into a train and test set. Use a classifier of your choice and train it on the train split. Print the confusion matrix, the accuracy, the precision of each class and the recall of each class of the classifier on the test split.
%% Cell type:code id:94a22985 tags:
``` python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
# Split, train and evaluate on the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, average=None))  # one value per class
print(recall_score(y_test, y_pred, average=None))     # one value per class