Load the Iris dataset and visualize the outcome of the DBSCAN algorithm with epsilon = 0.3 and min_samples = 3 next to the labelled dataset. Draw a red cross on every data instance the DBSCAN algorithm considers an anomaly.
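A possible sketch for this task. Using petal length and petal width as the two plotted features is an assumption, as is mapping DBSCAN's noise label `-1` to "anomaly":

``` python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X2 = X[:, 2:4]  # petal length and petal width (assumption: 2-D view)

db = DBSCAN(eps=0.3, min_samples=3).fit(X2)
labels = db.labels_              # -1 marks noise points (anomalies)
anomalies = X2[labels == -1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X2[:, 0], X2[:, 1], c=labels)
ax1.scatter(anomalies[:, 0], anomalies[:, 1], c="red", marker="x")
ax1.set_title("DBSCAN (eps=0.3, min_samples=3)")
ax2.scatter(X2[:, 0], X2[:, 1], c=y)
ax2.set_title("True labels")
plt.show()
```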
Create a moons dataset with the sklearn.datasets module, including 200 samples with Gaussian noise of standard deviation 0.05. Fix the random state so repeated calls always return the same dataset. Find good parameters for the DBSCAN algorithm by trial and error; the algorithm should split the two moons into two clusters.
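A possible sketch. The parameters `eps=0.2`, `min_samples=5` and `random_state=42` are assumptions found by trial and error, not the official solution:

``` python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps=0.2, min_samples=5 separate the two moons (found by trial and error)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=db.labels_)
plt.show()
```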
Create a dataset with the make_blobs function of the datasets module. The dataset should contain five blobs with 50, 20, 30, 70 and 10 data instances, respectively. Each data instance should be 2-dimensional. Pass a random seed to the function to make the dataset reproducible across multiple calls. Save the centers of the blobs in a variable, and set the standard deviation of the first two blobs to 1.0 and of the other blobs to 0.8. Visualize the created dataset with labels.
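A possible sketch; `random_state=42` is an arbitrary choice, and `return_centers=True` (available since scikit-learn 0.23) is used to save the blob centers:

``` python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

sizes = [50, 20, 30, 70, 10]
stds = [1.0, 1.0, 0.8, 0.8, 0.8]  # first two blobs wider than the rest
X, y, centers = make_blobs(n_samples=sizes, n_features=2,
                           cluster_std=stds, random_state=42,
                           return_centers=True)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(centers[:, 0], centers[:, 1], c="black", marker="+")
plt.show()
```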
b)
Use the GaussianMixture class from the mixture module to create a Gaussian mixture model that fits the data set. Try different parameters to find a good representation.
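One way this could look, assuming the blob dataset from part a); `n_components=5`, full covariances and `n_init=10` are trial-and-error choices, not prescribed values:

``` python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=[50, 20, 30, 70, 10], n_features=2,
                  cluster_std=[1.0, 1.0, 0.8, 0.8, 0.8], random_state=42)

# one Gaussian per blob; n_init restarts reduce sensitivity to initialization
gmm = GaussianMixture(n_components=5, covariance_type="full",
                      n_init=10, random_state=42).fit(X)
print(gmm.converged_)
print(gmm.means_)
```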
Perform anomaly detection using a GMM with 2 components. Create a blob dataset with 3 blobs of 120, 70 and 10 data instances and standard deviations of 0.8, 0.8 and 5.0, respectively. Mark the anomalies with a red cross.
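A possible sketch; flagging the lowest 4% of density scores as anomalies is an assumed threshold, and `random_state=42` is arbitrary:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=[120, 70, 10], n_features=2,
                  cluster_std=[0.8, 0.8, 5.0], random_state=42)

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
densities = gmm.score_samples(X)
# points in the lowest 4% of log-density are flagged (threshold is an assumption)
threshold = np.percentile(densities, 4)
anomalies = X[densities < threshold]

plt.scatter(X[:, 0], X[:, 1], c=gmm.predict(X))
plt.scatter(anomalies[:, 0], anomalies[:, 1], c="red", marker="x")
plt.show()
```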
Familiarize yourself with the digits dataset. Visualize a single data instance as an image and print the corresponding label.
%% Cell type:code id:5973382c tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
# print(X)
# print(X.shape)

# reshape the flat 64-pixel vector into an 8x8 image
img = np.reshape(X[500], (8, 8))
print(img)
print(img.shape)
print("label:", y[500])
plt.imshow(img, cmap="gray")
```
%% Cell type:markdown id:d1eb9eea tags:
### 7.1.4 - Task 2 - PCA & t-SNE
%% Cell type:markdown id:f5f9e744 tags:
Use the PCA class of the decomposition module and the TSNE class of the manifold module to visualize the complete digits dataset in 2D plots instead of showing a single image.
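A possible sketch; `random_state=42` for t-SNE is an arbitrary choice (t-SNE is stochastic, so fixing it makes the plot reproducible):

``` python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="tab10", s=5)
ax1.set_title("PCA")
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=5)
ax2.set_title("t-SNE")
plt.show()
```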
Load the iris dataset and use an SVM classifier with a linear kernel and C=1.0 to train on the features petal length and petal width. Visualize the predictions next to the data with the original labels. Print the mean accuracy of the classifier on this dataset. Draw the decision boundaries between the three classes in the plot.
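A possible sketch; the meshgrid resolution and plot margins are arbitrary choices, and the decision boundaries are drawn by classifying a dense grid of points:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, 2:4]  # petal length, petal width

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("mean accuracy:", clf.score(X, y))

# classify a dense grid to obtain the decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.contourf(xx, yy, Z, alpha=0.3)
ax1.scatter(X[:, 0], X[:, 1], c=clf.predict(X))
ax1.set_title("Predictions")
ax2.contourf(xx, yy, Z, alpha=0.3)
ax2.scatter(X[:, 0], X[:, 1], c=y)
ax2.set_title("True labels")
plt.show()
```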
Load the iris dataset and use an SVM classifier with an RBF kernel, trying out different parameters, to train on the features petal length and petal width. Print the mean accuracy of the classifier on this dataset. Draw the decision boundaries between the three classes in the plot.
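One possible parameter choice; `C=10` and `gamma="scale"` are trial-and-error assumptions, and the decision boundaries can be drawn by classifying a dense meshgrid and calling contourf, as in the linear-kernel task:

``` python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, 2:4]  # petal length, petal width

# C=10, gamma="scale" found by trial and error (assumption)
clf = SVC(kernel="rbf", C=10, gamma="scale").fit(X, y)
print("mean accuracy:", clf.score(X, y))
```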
Create a 2D blob dataset with 500 data instances and 5 blobs. Use a KNN classifier and choose a value for K. Visualize the predictions next to the true labels in two plots. Use the contourf function to draw the decision boundaries of the classifier.
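A possible sketch; `K=5` and `random_state=42` are arbitrary choices:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, centers=5, n_features=2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # K=5 is an assumption

# classify a dense grid to obtain the decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, c, title in [(ax1, knn.predict(X), "Predictions"),
                     (ax2, y, "True labels")]:
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=c, s=10)
    ax.set_title(title)
plt.show()
```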
Create a moons dataset with default values and a noise standard deviation of 0.2. Use a decision tree to classify the data. Visualize the decision boundaries for a decision tree without restrictions and for one with a minimum of 4 samples per leaf.
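A possible sketch; `random_state=42` is arbitrary, and both trees are compared by classifying the same dense grid:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.2, random_state=42)  # default n_samples=100

tree_full = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_reg = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, clf, title in [(axes[0], tree_full, "No restrictions"),
                       (axes[1], tree_reg, "min_samples_leaf=4")]:
    ax.contourf(xx, yy, clf.predict(grid).reshape(xx.shape), alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(title)
plt.show()
```

Note how the unrestricted tree fits the training data perfectly and produces a much more jagged boundary than the regularized one.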
Load the California housing dataset. Plot the average number of rooms per dwelling on the x-axis and the median house value in $100,000 on the y-axis. Use linear regression and KNN regression to predict the value from the average number of rooms per dwelling and visualize the predictions in the corresponding plots.
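A sketch of the regression part using a small synthetic stand-in (the slope 0.5 and noise level are invented for illustration), since `fetch_california_housing` downloads the data on first use; for the exercise, replace the stand-in arrays with the room-count column and target of the real dataset:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Stand-in data (assumption); for the exercise use
#   X, y = sklearn.datasets.fetch_california_housing(return_X_y=True)
# and select the rooms-per-dwelling feature column.
rng = np.random.default_rng(0)
rooms = rng.uniform(2, 10, size=(300, 1))
value = 0.5 * rooms[:, 0] + rng.normal(0, 0.5, size=300)

lin = LinearRegression().fit(rooms, value)
knn = KNeighborsRegressor(n_neighbors=10).fit(rooms, value)

grid = np.linspace(2, 10, 200).reshape(-1, 1)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, model, title in [(ax1, lin, "Linear regression"),
                         (ax2, knn, "KNN regression")]:
    ax.scatter(rooms, value, s=5)
    ax.plot(grid, model.predict(grid), c="red")
    ax.set_xlabel("rooms per dwelling")
    ax.set_title(title)
ax1.set_ylabel("value in $100,000")
plt.show()
```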
Load the data stored in "grain_data.csv" with pandas. This file lists different grain types and their contents. The class labels 0 to 6 are "Barley", "Oat", "Corn", "Rice", "Rye", "Wheat" and "Spelt", and the content labels are given in the first row.
Split the data into a train and a test set. Use a classifier of your choice and train it on the train split. Print the confusion matrix, the accuracy, and the per-class precision and recall of the classifier on the test split.
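A sketch of the evaluation workflow, using the iris dataset as a stand-in for the grain data (assumption: the same steps apply once the CSV is loaded); the random forest is one arbitrary classifier choice:

``` python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# iris as a stand-in dataset (assumption)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision per class:", precision_score(y_test, y_pred, average=None))
print("recall per class:", recall_score(y_test, y_pred, average=None))
```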
Plot learning curves using differently sized training/test splits during the training of a linear and an RBF SVM to analyze the performance of the two classifiers.
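One way to sketch this with scikit-learn's `learning_curve` helper; the iris dataset, 5-fold CV and the chosen training-size grid are assumptions:

``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, kernel in zip(axes, ["linear", "rbf"]):
    sizes, train_scores, test_scores = learning_curve(
        SVC(kernel=kernel), X, y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 8),
        shuffle=True, random_state=42)
    ax.plot(sizes, train_scores.mean(axis=1), label="train")
    ax.plot(sizes, test_scores.mean(axis=1), label="validation")
    ax.set_title(f"SVC ({kernel})")
    ax.set_xlabel("training set size")
    ax.legend()
axes[0].set_ylabel("accuracy")
plt.show()
```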