Commit 1511c64f authored by Jonas Boysen's avatar Jonas Boysen

chapter 7_1 and 7_2 final

parent e176ddb0
%% Cell type:markdown id:36fcdd98-89e8-464a-802d-fefeb1a4894e tags:

# Hands-on 7.1 Unsupervised Learning

%% Cell type:markdown id:cc965bb9-4eca-4a80-af94-bc04af5941fd tags:

## 7.1.2 Clustering

%% Cell type:markdown id:e6a306f0-6352-4ea2-89b2-4a4ac4d8c09f tags:

### 7.1.2 - Task 1 - K-Means

%% Cell type:markdown id:c97d98c3-ac82-41a1-ab99-e7f469fd65d9 tags:

Load and visualize the Iris data set with and without class labels.
Use two of the four features for visualization:
display the petal width on the y-axis and the petal length on the x-axis.

%% Cell type:code id:005169de-05ce-4909-a99f-a6fb2915f089 tags:

``` python
```
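
%% Cell type:markdown tags:

A minimal sketch of one possible solution (the column indices 2 and 3 correspond to petal length and petal width in the iris feature matrix):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # columns: sepal length, sepal width, petal length, petal width
y = iris.target

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
# left: data without class labels
ax1.scatter(X[:, 2], X[:, 3], s=10)
ax1.set(title="Unlabelled", xlabel="petal length", ylabel="petal width")
# right: same data coloured by class label
ax2.scatter(X[:, 2], X[:, 3], s=10, c=y)
ax2.set(title="Labelled", xlabel="petal length")
plt.show()
```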

%% Cell type:markdown id:dea0d9e6-b2a7-4a37-aa1d-b5f6c670af1f tags:

### 7.1.2 - Task 2 - K-Means

%% Cell type:markdown id:10416888-bfef-4134-8a24-414ba2402623 tags:

Visualize the result of K-means with 5 clusters and 10 initializations next to the target result.
Also display the cluster centers on top of the visualization.

%% Cell type:code id:1ac6d9b2-aca4-4f8d-b649-b46431d1609c tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
y_labels = iris.target_names
# short axis labels instead of the full iris.feature_names strings
x_labels = ("sepal length", "sepal width", "petal length", "petal width")
colours = ListedColormap(["blue", "green", "red", "yellow"])
```
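
%% Cell type:markdown tags:

One possible sketch (the `random_state` is arbitrary; the cluster centers live in the 4-D feature space, so only their petal coordinates are drawn):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
ax1.scatter(X[:, 2], X[:, 3], s=10, c=clusters)
# draw the cluster centers on top of the clustering result
ax1.scatter(kmeans.cluster_centers_[:, 2], kmeans.cluster_centers_[:, 3],
            marker="X", s=100, c="black")
ax1.set(title="K-means (k=5)", xlabel="petal length", ylabel="petal width")
ax2.scatter(X[:, 2], X[:, 3], s=10, c=y)
ax2.set(title="Target labels", xlabel="petal length")
plt.show()
```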

%% Cell type:markdown id:eaf49470 tags:

### 7.1.2 - Task 3 - K-Means

%% Cell type:markdown id:27ec3761 tags:

Find a good value for the number of clusters k by visualizing the inertia for different values of k (1 to 10).

%% Cell type:code id:9104fb49 tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data
y = iris.target
x_labels = iris.feature_names
y_labels = iris.target_names
```
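
%% Cell type:markdown tags:

A sketch of the elbow-method plot; the "good" k is where the inertia curve bends:

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# fit one K-means model per candidate k and collect the inertias
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```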

%% Cell type:markdown id:36955304-1116-4f9c-bf20-e13f3d0dcaf7 tags:

### 7.1.2 - Task 4 - DBSCAN

%% Cell type:markdown id:1b4a3e62 tags:

Load the Iris data set and visualize the outcome of the DBSCAN algorithm with epsilon = 0.3 and min_samples = 3 next to the labelled data set. Draw a red cross on every data instance considered an anomaly by the DBSCAN algorithm.

%% Cell type:code id:4853842e tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
y_labels = iris.target_names
# short axis labels instead of the full iris.feature_names strings
x_labels = ("sepal_length", "sepal_width", "petal_length", "petal_width")
colours = ListedColormap(["blue", "green", "red", "yellow"])
```
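
%% Cell type:markdown tags:

One possible sketch; DBSCAN labels noise points with `-1`, which is used here to draw the red crosses:

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

clusters = DBSCAN(eps=0.3, min_samples=3).fit_predict(X)
anomalies = clusters == -1  # DBSCAN marks noise points with label -1

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
ax1.scatter(X[:, 2], X[:, 3], s=10, c=clusters)
ax1.scatter(X[anomalies, 2], X[anomalies, 3], marker="x", c="red")
ax1.set(title="DBSCAN", xlabel="petal length", ylabel="petal width")
ax2.scatter(X[:, 2], X[:, 3], s=10, c=y)
ax2.set(title="Labelled", xlabel="petal length")
plt.show()
```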

%% Cell type:markdown id:40c20803 tags:

### 7.1.2 - Task 5 - DBSCAN

%% Cell type:markdown id:df4a7fe4 tags:

Create a moons data set with the sklearn.datasets module, including 200 samples and Gaussian noise with a standard deviation of 0.05. Fix the randomly generated data set with a random state so you always receive the same data set. Find good parameters for the DBSCAN algorithm by trial and error; the algorithm should split the two moons into two clusters.

%% Cell type:code id:41b77f5f tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.cluster import DBSCAN
```
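
%% Cell type:markdown tags:

A sketch with parameters found by trial and error (`eps=0.2`, `min_samples=5`; other values can work as well, and the `random_state` is arbitrary):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# random_state fixes the dataset across runs
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps / min_samples chosen by trial and error until the moons separate
clusters = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], s=10, c=clusters)
plt.title("DBSCAN on the moons dataset")
plt.show()
```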

%% Cell type:markdown id:47e562b7 tags:

## 7.1.3 Anomaly Detection

%% Cell type:markdown id:d58dd347 tags:

### 7.1.3 - Task 1 - GMM

%% Cell type:markdown id:115414a3 tags:

a)
Create a dataset with the make_blobs function of the datasets module. The dataset should contain five blobs with the following numbers of data instances: (50, 20, 30, 70, 10). Each data instance should be 2-dimensional. Pass a random seed to the function to make the data set reproducible across multiple function calls. Save the centers of the blobs in a variable and set the standard deviation of the first two blobs to 1.0 and that of the other blobs to 0.8. Visualize the created dataset once with labels and once without labels.

b)
Use the GaussianMixture class from the mixture module to create a Gaussian mixture model that fits the data set. Try different parameters to find a good representation.

%% Cell type:code id:e3b5ae00 tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
```
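
%% Cell type:markdown tags:

A sketch for both parts; passing `n_samples` as a list creates one blob per entry, and `return_centers=True` hands back the generated centers (the `random_state` and the choice of 5 components are assumptions):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# a) five blobs with the requested sizes and standard deviations
X, y, centers = make_blobs(
    n_samples=[50, 20, 30, 70, 10],
    n_features=2,
    cluster_std=[1.0, 1.0, 0.8, 0.8, 0.8],
    random_state=0,
    return_centers=True,
)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
ax1.scatter(X[:, 0], X[:, 1], s=10)
ax1.set(title="Without labels")
ax2.scatter(X[:, 0], X[:, 1], s=10, c=y)
ax2.set(title="With labels")
plt.show()

# b) fit a Gaussian mixture; 5 components match the number of blobs
gmm = GaussianMixture(n_components=5, n_init=10, random_state=0).fit(X)
print("converged:", gmm.converged_)
```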

%% Cell type:markdown id:35020d39 tags:

### 7.1.3 - Task 2 - GMM

%% Cell type:markdown id:95682b55 tags:

Perform an anomaly detection using a GMM with 2 components. Create a blob dataset with 3 blobs with 120, 70 and 10 data instances and with standard deviations of 0.8, 0.8 and 5.0 respectively. Mark the anomalies with a red cross.

%% Cell type:code id:ac18d0d0 tags:

``` python
import matplotlib.pyplot as plt

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
```
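
%% Cell type:markdown tags:

One possible sketch: fit the 2-component mixture, then flag the lowest-density points as anomalies (the 5% threshold and the `random_state` are assumptions, not part of the task):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, y = make_blobs(n_samples=[120, 70, 10], n_features=2,
                  cluster_std=[0.8, 0.8, 5.0], random_state=0)

gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)

# score_samples returns the log-density; flag the lowest 5% as anomalies
log_dens = gmm.score_samples(X)
threshold = np.percentile(log_dens, 5)
anomalies = log_dens < threshold

plt.scatter(X[:, 0], X[:, 1], s=10, c=gmm.predict(X))
plt.scatter(X[anomalies, 0], X[anomalies, 1], marker="x", c="red")
plt.show()
```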

%% Cell type:markdown id:3d0b9206 tags:

## 7.1.4 Dimensionality Reduction

%% Cell type:markdown id:9a08ae29 tags:

### 7.1.4 - Task 1 - PCA & t-SNE

%% Cell type:markdown id:f8ac9a60 tags:

Get used to the digits dataset. Visualize a single data instance as an image and print the corresponding label.

%% Cell type:code id:5973382c tags:

``` python
import matplotlib.pyplot as plt
```
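
%% Cell type:markdown tags:

A minimal sketch: each instance in the digits dataset is an 8x8 greyscale image with a label from 0 to 9.

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits

digits = load_digits()
# print the label of the first instance and show it as an image
print("label:", digits.target[0])

plt.imshow(digits.images[0], cmap="gray")
plt.show()
```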

%% Cell type:markdown id:d1eb9eea tags:

### 7.1.4 - Task 2 - PCA & t-SNE

%% Cell type:markdown id:f5f9e744 tags:

Use the PCA class of the decomposition module and the TSNE class of the manifold module to visualize the complete digits dataset in 2D plots instead of showing a single image.

%% Cell type:code id:be65239d tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap

colors = ListedColormap(["red", "blue", "green", "yellow", "orange", "purple", "gray", "black", "brown", "teal"])
```
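
%% Cell type:markdown tags:

One possible sketch, projecting the full digits dataset to 2D with both methods (the t-SNE `random_state` is arbitrary, and the fit can take a moment):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
X, y = digits.data, digits.target

# reduce the 64-dimensional images to 2 dimensions
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
s1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], s=3, c=y, cmap="tab10")
ax1.set(title="PCA")
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], s=3, c=y, cmap="tab10")
ax2.set(title="t-SNE")
fig.colorbar(s1, ax=ax2)
plt.show()
```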

%% Cell type:markdown id:22ced03a tags:

# Hands-on 7.2 Supervised Learning

%% Cell type:markdown id:5c377e71 tags:

## 7.2.3 Classification

%% Cell type:markdown id:914661a7 tags:

### 7.2.3 - Task 1 - SVM

%% Cell type:markdown id:d979fe1c tags:

Load the iris dataset and use an SVM classifier with a linear kernel and C=1.0 to train on the features petal length and petal width. Visualize the predictions next to the data with original labels. Print the mean accuracy of the classifier on this data set. Draw the three decision boundaries between the three classes in the plot.

%% Cell type:code id:2075a53e tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
y_labels = iris.target_names
# short axis labels instead of the full iris.feature_names strings
x_labels = ("sepal_length", "sepal_width", "petal_length", "petal_width")
colours = ListedColormap(["red", "blue"])

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
fig.suptitle("Task 1")
```
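
%% Cell type:markdown tags:

One possible sketch: evaluate the trained SVM on a dense grid and draw contour lines between the predicted classes (the grid resolution and padding are arbitrary choices):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width
y = iris.target

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("mean accuracy:", svm.score(X, y))

# evaluate the classifier on a grid to locate the decision boundaries
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
ax1.contour(xx, yy, Z, levels=[0.5, 1.5], colors="black")
ax1.scatter(X[:, 0], X[:, 1], s=10, c=svm.predict(X))
ax1.set(title="Predictions", xlabel="petal length", ylabel="petal width")
ax2.scatter(X[:, 0], X[:, 1], s=10, c=y)
ax2.set(title="Original labels", xlabel="petal length")
plt.show()
```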

%% Cell type:markdown id:f6253c67 tags:

### 7.2.3 - Task 2 - SVM

%% Cell type:markdown id:83cda835 tags:

Load the iris dataset and use an SVM classifier with an rbf kernel, and try out different parameters to train on the features petal length and petal width. Print the mean accuracy of the classifier on this data set. Draw the decision boundaries between the three classes in the plot.

%% Cell type:code id:d4cd74d2 tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
y = iris.target
y_labels = iris.target_names
# short axis labels instead of the full iris.feature_names strings
x_labels = ("sepal_length", "sepal_width", "petal_length", "petal_width")
colours = ListedColormap(["red", "blue"])
```
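
%% Cell type:markdown tags:

A sketch with one parameter combination that works reasonably well (`C=10`, `gamma=1` are trial-and-error values, not the only good choice):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width
y = iris.target

# C and gamma found by trial and error; other values are worth exploring
svm = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
print("mean accuracy:", svm.score(X, y))

xx, yy = np.meshgrid(np.linspace(0.5, 7.5, 300), np.linspace(-0.5, 3.0, 300))
Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], s=10, c=y)
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.show()
```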

%% Cell type:markdown id:1e7c8f45 tags:

### 7.2.3 - Task 3 - KNN

%% Cell type:markdown id:61f6cde9 tags:

Create a 2D blob dataset with 500 data instances and 5 blobs. Use a KNN classifier and choose a value for K. Visualize the predictions next to the true labels in two plots. Use the contourf function to draw the decision boundaries of the classifier.

%% Cell type:code id:2cc1246d tags:

``` python
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, n_features=2, centers=5)
colours = ListedColormap(["red", "blue", "green", "yellow", "orange"])
```
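
%% Cell type:markdown tags:

One possible sketch with K=5 (the value of K and the `random_state` are arbitrary choices):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # K=5 as an example

# evaluate the classifier on a grid for the filled decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
ax1.contourf(xx, yy, Z, alpha=0.3)
ax1.scatter(X[:, 0], X[:, 1], s=10, c=knn.predict(X))
ax1.set(title="Predictions")
ax2.scatter(X[:, 0], X[:, 1], s=10, c=y)
ax2.set(title="True labels")
plt.show()
```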

%% Cell type:markdown id:413c44dc tags:

### 7.2.3 - Task 4 - Decision Tree

%% Cell type:markdown id:964cf8c4 tags:

Create a moons dataset with default values and a standard deviation of 0.2. Use a decision tree to classify the data. Visualize the decision boundaries for a decision tree without restrictions and for one with min_samples_leaf=4.

%% Cell type:code id:0b26104e tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons

X, y = make_moons(noise=0.2)
colors = ListedColormap(["red", "blue"])
```
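
%% Cell type:markdown tags:

A sketch comparing the two trees side by side (the `random_state` values and grid limits are assumptions):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.2, random_state=0)

tree_full = DecisionTreeClassifier(random_state=0).fit(X, y)  # no restrictions
tree_reg = DecisionTreeClassifier(min_samples_leaf=4, random_state=0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
for ax, tree, title in [(ax1, tree_full, "Unrestricted"),
                        (ax2, tree_reg, "min_samples_leaf=4")]:
    ax.contourf(xx, yy, tree.predict(grid).reshape(xx.shape), alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], s=10, c=y)
    ax.set(title=title)
plt.show()
```

The unrestricted tree memorizes the training set perfectly, while the `min_samples_leaf=4` variant produces visibly smoother boundaries.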

%% Cell type:markdown id:1f26c787 tags:

### 7.2.3 - Task 5 - Random Forest

%% Cell type:markdown id:a55c43e1 tags:

Create a moons dataset with 1000 samples and a standard deviation of 0.2. Use a random forest classifier with 150 trees and draw the decision boundaries.

%% Cell type:code id:47b947ea tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=.2, random_state=22)
colors = ListedColormap(["red", "blue"])
```
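
%% Cell type:markdown tags:

One possible sketch continuing from the starter code above (the classifier's `random_state` and grid limits are arbitrary):

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=22)

forest = RandomForestClassifier(n_estimators=150, random_state=0).fit(X, y)

# evaluate the forest on a grid to draw its decision regions
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))
Z = forest.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], s=10, c=y)
plt.show()
```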

%% Cell type:markdown id:1daaa736 tags:

## 7.2.4 Regression

%% Cell type:markdown id:8d75e79c tags:

### 7.2.4 - Task 1 - Regression with scikit-learn

%% Cell type:markdown id:75335348 tags:

Load the California housing dataset. Plot the coordinates of the houses with a colormap of the house value (given in units of $100,000). Use linear regression, KNN regression and random forest regression to predict the value from all available features. Visualize the predictions by plotting the predicted values at the coordinates.

%% Cell type:code id:10059aa2 tags:

``` python
import matplotlib.pyplot as plt
```
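
%% Cell type:markdown tags:

Since `fetch_california_housing` downloads the data on first use, here is a sketch of the model-comparison part on a synthetic stand-in built with `make_regression`; the same fit/predict workflow applies unchanged to the housing features (where `Latitude`/`Longitude` supply the plot coordinates):

%% Cell type:code tags:

``` python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# stand-in data; with network access, fetch_california_housing() provides
# the real features instead
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# fit and score the three requested regressor types
for model in (LinearRegression(),
              KNeighborsRegressor(),
              RandomForestRegressor(random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, "R^2:", round(model.score(X, y), 3))
```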

%% Cell type:markdown id:44a04d6a tags:

# Hands-on 7.3 Model Evaluation

%% Cell type:markdown id:ba02f77d tags:

## 7.3.4 - Case Study

%% Cell type:markdown id:798d1b87 tags:

### 7.3.4 - Task 1 - Case Study

%% Cell type:markdown id:2054df58 tags:

Load the data stored in "grain_data.csv". This file lists different grain types and their contents. The class labels 0 to 6 correspond to "Barley", "Oat", "Corn", "Rice", "Rye", "Wheat" and "Spelt", and the content labels are given in the first row.

%% Cell type:code id:60748145 tags:

``` python
import pandas as pd

filename = "grain_data.csv"
df = pd.read_csv(filename)
print(df)
label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]
df["Label"] = df["Label"].apply(lambda x: label_strings[x])
print(df)
```

%% Cell type:markdown id:f6ee549f tags:

### 7.3.4 - Task 2 - Case Study

%% Cell type:markdown id:d39812c3 tags:

Get a feeling for the imported data by visualizing it in 2D using the t-SNE algorithm.

%% Cell type:code id:9e2bb6ad tags:

``` python
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn.manifold import TSNE

filename = "grain_data_new.csv"
df = pd.read_csv(filename)
print(df.columns)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]
print(data)

tsne = TSNE()
# t-SNE is unsupervised; the labels are only used for colouring the plot below
X_2D = tsne.fit_transform(data)

fig, ax = plt.subplots(1, 1)

label_strings = ["Barley","Oat","Corn","Rice","Rye","Wheat","Spelt"]
colours = ListedColormap(["red", "blue", "green", "black", "teal", "brown", "yellow"])
scatter = ax.scatter(X_2D[:, 0], X_2D[:, 1], s=3, c=labels, cmap=colours)
ax.legend(scatter.legend_elements()[0], label_strings)

plt.show()
```

%% Output

    Index(['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals', 'Label'], dtype='object')
         Protein   Fat  Carbohydrates  Fiber  Minerals
    0      11.76  1.66          66.98   9.41      3.54
    1      12.43  2.06          59.55   8.90      2.57
    2       9.09  2.21          69.83  10.34      2.61
    3      10.97  2.15          64.19   9.31      2.08
    4      12.67  2.20          64.48   8.43      1.91
    ..       ...   ...            ...    ...       ...
    695     9.79  1.46          54.95   7.87      2.45
    696    19.22  1.84          67.39   6.07      2.25
    697    22.86  1.73          20.99   8.85      1.82
    698    25.16  1.64          71.32  11.78      1.74
    699    20.00  1.49          70.00  10.60      1.02
    
    [700 rows x 5 columns]


%% Cell type:markdown id:30bf1be5 tags:

### 7.3.4 - Task 3 - Case Study

%% Cell type:markdown id:9382d0ae tags:

Split the data into a train and test set. Use a classifier of your choice and train it on the train split. Print the confusion matrix, the accuracy, the precision of each class and the recall of each class of the classifier on the test split.

%% Cell type:code id:94a22985 tags:

``` python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]

# split the training data
X_train, X_test, y_train, y_test = train_test_split(data, labels)

# train a random forest with the train split
forest = RandomForestClassifier()
forest.fit(X_train, y_train)

# test and print the accuracy of the forest on the test set
print("Score:", forest.score(X_test, y_test))
y_test_pred = forest.predict(X_test)
print("Accuracy:", accuracy_score(y_test,y_test_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test,y_test_pred))
print("Precision:\n", precision_score(y_test,y_test_pred,average=None))
print("Recall:\n", recall_score(y_test,y_test_pred,average=None))
```

%% Output

    Score: 0.9828571428571429
    Accuracy: 0.9828571428571429
    Confusion Matrix:
     [[24  0  0  0  0  0  0]
     [ 0 29  0  0  0  0  0]
     [ 0  0 28  0  0  0  0]
     [ 0  0  0 29  0  0  0]
     [ 1  0  0  0 20  1  0]
     [ 0  0  0  0  1 22  0]
     [ 0  0  0  0  0  0 20]]
    Precision:
     [0.96       1.         1.         1.         0.95238095 0.95652174
     1.        ]
    Recall:
     [1.         1.         1.         1.         0.90909091 0.95652174
     1.        ]

%% Cell type:markdown id:6e8f5ab9 tags:

### 7.3.4 - Task 4 - Case Study

%% Cell type:markdown id:bb3b51cd tags:

Perform a k-fold cross-validation with k=10 on the data set with a classifier of your choice and print the mean accuracy.

%% Cell type:code id:c05fa0f7 tags:

``` python
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]

k = 10
k_fold = KFold(n_splits=k, shuffle=True)

# convert to numpy arrays so the generated fold indices can be applied directly
X = data.to_numpy()
y = labels.to_numpy()

accuracy_list = []
for train, test in k_fold.split(X):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    forest = RandomForestClassifier()
    forest.fit(X_train, y_train)
    y_pred = forest.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
    accuracy_list.append(accuracy)

print("Mean Accuracy:", np.mean(accuracy_list))
```

%% Output

    Accuracy: 0.95
    Accuracy: 0.95
    Accuracy: 0.98
    Accuracy: 0.97
    Accuracy: 0.98
    Accuracy: 0.98
    Accuracy: 0.98
    Accuracy: 0.97
    Accuracy: 0.95
    Accuracy: 0.97
    Mean Accuracy: 0.968

%% Cell type:markdown id:68f28bc2 tags:

### 7.3.4 Task 5 - Case Study

%% Cell type:markdown id:131c6311 tags:

Graphically highlight learning curves to detect issues such as overfitting and underfitting.

%% Cell type:code id:304aba08 tags:

``` python
# TODO: new dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

filename = "grain_data.csv"
df = pd.read_csv(filename)
labels = df["Label"]
data = df[['Protein', 'Fat', 'Carbohydrates', 'Fiber', 'Minerals']]

dataset_size = len(df)
print(dataset_size)

svm1 = SVC(C=100, kernel="rbf", gamma=1)
svm2 = SVC(C=100, kernel="linear")

train_size_abs, train_scores, test_scores, fit_times, _ = learning_curve(svm1,data,labels,train_sizes=np.linspace(0.2,1.,10), return_times=True, shuffle=True)
print(train_size_abs,train_scores,test_scores,fit_times)

rel_train_size = train_size_abs / dataset_size

fig, (ax1,ax2) = plt.subplots(1, 2)

mean_train_scores = np.mean(train_scores,axis=-1)
mean_test_scores = np.mean(test_scores,axis=-1)

ax1.set_ylim((0,1.1))
ax2.set_ylim((0,1.1))

ax1.plot(rel_train_size, mean_train_scores, c="tab:blue")
ax1.plot(rel_train_size, mean_test_scores, c="tab:red")
ax1.set(title="Complex")
ax1.set_xlabel("Fraction of the dataset used for training")

train_size_abs, train_scores, test_scores, fit_times, _ = learning_curve(svm2,data,labels,train_sizes=np.linspace(0.2,1.,10), return_times=True, shuffle=True)
rel_train_size = train_size_abs / dataset_size
mean_train_scores = np.mean(train_scores,axis=-1)
mean_test_scores = np.mean(test_scores,axis=-1)

# plot against the relative training size so both panels share the same x-scale
ax2.plot(rel_train_size, mean_train_scores, c="tab:blue")
ax2.plot(rel_train_size, mean_test_scores, c="tab:red")
ax2.set(title="Simple")
ax2.set_xlabel("Fraction of the dataset used for training")

plt.show()
```

%% Output

    700
    [112 161 211 261 311 360 410 460 510 560] [[1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]
     [1. 1. 1. 1. 1.]] [[0.37857143 0.4        0.4        0.49285714 0.37857143]
     [0.45714286 0.61428571 0.52857143 0.6        0.36428571]
     [0.55714286 0.65       0.52857143 0.58571429 0.42142857]
     [0.57142857 0.7        0.65       0.62142857 0.47142857]
     [0.67857143 0.7        0.71428571 0.67857143 0.5       ]
     [0.68571429 0.72857143 0.73571429 0.7        0.55      ]
     [0.75       0.76428571 0.73571429 0.67857143 0.54285714]
     [0.78571429 0.78571429 0.75       0.69285714 0.54285714]
     [0.8        0.79285714 0.72857143 0.71428571 0.55714286]
     [0.79285714 0.81428571 0.74285714 0.73571429 0.55714286]] [[0.00346184 0.00205278 0.00205326 0.00210142 0.00201678]
     [0.00410295 0.00274968 0.00261092 0.00260949 0.00260377]
     [0.00457597 0.00349402 0.00366998 0.00395942 0.00392556]
     [0.00487757 0.00582075 0.00443983 0.00449061 0.00456357]
     [0.00606894 0.00583315 0.00598145 0.00579023 0.0063591 ]
     [0.00747991 0.00734544 0.00756383 0.00799131 0.00786328]
     [0.00952983 0.00936627 0.00879741 0.00927734 0.00919986]
     [0.01388073 0.0110209  0.01094484 0.01123977 0.01188397]
     [0.01383805 0.01597929 0.01278591 0.01245928 0.01403737]
     [0.01499009 0.01495266 0.01512051 0.01488566 0.01492906]]