Commit 78ca516c authored by Jonas Boysen's avatar Jonas Boysen
Browse files

fix X_new

parent 259a42b6
Loading
Loading
Loading
Loading
+1 −1
Original line number Diff line number Diff line
%% Cell type:markdown id:6a5d57f4 tags:

# Example Solutions: Self-Study Programming

%% Cell type:markdown id:b3368675 tags:

## 7.1 Unsupervised Learning

%% Cell type:markdown id:aebb39da tags:

### Exercise 1

%% Cell type:markdown id:9e2f94ba tags:

The read_csv function will load the data from the csv file. The data is two-dimensional. Use the “KMeans” class from the “cluster” module of sklearn to cluster the data. Find a suitable value for the parameter “n_clusters” (“k”). Do this by plotting the inertia of the different “k” with the “matplotlib.pyplot” module. Save the “k” that you have chosen in the variable “k_chosen” which is defaulted to 0.

%% Cell type:code id:4c8b3baa tags:

``` python
import csv
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans


def read_csv():
    """Load the two self-study data sets from their csv files.

    Returns:
        X_1, y_1: 2-D feature array and integer labels for task 1.
        X_2, y_2: 4-D feature array and integer labels for task 2.
    """

    def _load(path, n_features):
        # Each row holds ``n_features`` float columns followed by one int label.
        features, targets = [], []
        with open(path, newline="") as handle:
            for record in csv.reader(handle, delimiter=","):
                features.append([float(record[i]) for i in range(n_features)])
                targets.append(int(record[n_features]))
        return np.array(features), np.array(targets)

    X_1, y_1 = _load("self_study_programming_7_1_task1_data.csv", 2)
    X_2, y_2 = _load("self_study_programming_7_1_task2_data.csv", 4)
    return X_1, y_1, X_2, y_2

X_1, y_1, X_2, y_2 = read_csv()
k_chosen = 0
# Fit k-means for k = 1..9 and record the inertia of each fit.  The original
# seeded the list with a 99999 placeholder, which shifted the curve by one on
# the x-axis, and clamped the y-axis to that magic number, which could clip
# the real inertia values.  Plotting inertia against the actual k values
# makes the elbow readable without either hack.
k_values = range(1, 10)
inertia = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(X_1)
    inertia.append(kmeans.inertia_)
k_chosen = "3 (2)"  # chosen from the elbow of the curve
plt.title("Exercise 1: inertia vs. k")  # was mislabelled "task3"
plt.xlabel("k")
plt.ylabel("inertia")
plt.plot(k_values, inertia)
plt.show()
print("k chosen:", k_chosen)
```

%% Cell type:markdown id:c4cd8f5c tags:

### Exercise 2

%% Cell type:markdown id:9d7b824c tags:

The read_csv function will load the data from the csv file. The data is four-dimensional and is labelled in three different classes. Use the “PCA” class from the “decomposition” module of sklearn to reduce the dimensionality of the data to two dimensions. Plot the transformed two-dimensional data with the “matplotlib.pyplot” module. Make sure that the data samples from the three classes are colored differently.

%% Cell type:code id:8a81b444 tags:

``` python
import csv
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.decomposition import PCA


def read_csv():
    """Load the two self-study data sets from their csv files.

    Returns:
        X_1, y_1: 2-D feature array and integer labels for task 1.
        X_2, y_2: 4-D feature array and integer labels for task 2.
    """

    def _load(path, n_features):
        # Each row holds ``n_features`` float columns followed by one int label.
        features, targets = [], []
        with open(path, newline="") as handle:
            for record in csv.reader(handle, delimiter=","):
                features.append([float(record[i]) for i in range(n_features)])
                targets.append(int(record[n_features]))
        return np.array(features), np.array(targets)

    X_1, y_1 = _load("self_study_programming_7_1_task1_data.csv", 2)
    X_2, y_2 = _load("self_study_programming_7_1_task2_data.csv", 4)
    return X_1, y_1, X_2, y_2

X_1, y_1, X_2, y_2 = read_csv()
# Project the four-dimensional task-2 data onto its two principal components.
pca = PCA(n_components=2)
X_new = pca.fit_transform(X_2)
fig, ax = plt.subplots(1, 1)
# The task states the data is labelled in three classes, so three colours
# suffice; the original listed six and a matching six-entry legend.
colours = ListedColormap(["red", "blue", "green"])
# Plot only the PCA-transformed data.  The original also scattered the raw
# first two columns of X_2 underneath, a leftover that cluttered the figure
# with untransformed points.
scatter = ax.scatter(X_new[:, 0], X_new[:, 1], s=3, c=y_2, cmap=colours)
ax.legend(scatter.legend_elements()[0], ["0", "1", "2"])
plt.show()
```

%% Cell type:markdown id:4d7b05ab tags:

## 7.2 Supervised Learning

%% Cell type:markdown id:4b218c45 tags:

### Exercise 1

%% Cell type:markdown id:555a9372 tags:

Load the iris data from the “dataset” module of sklearn. The data of the iris data set has four dimensions. The task is to find out the importance of each of the four features, which are “sepal length”, “sepal width”, “petal length” and “petal width”. Use the “RandomForestClassifier” class from the “ensemble” module of sklearn to create ten instances of a random forest classifier. Use suitable parameters of your choice to train the ten classifiers and find out the mean feature importance of each of the four features among the ten classifiers. Save the mean values in the variable “mean_feature_importances” as a one-dimensional list with four items, each with the mean feature importance of the respective feature.
Tip: Start to fit a single random forest and access the feature importance of the first classifier before iterating over ten classifiers. Use the documentation of sklearn to find out how to access the feature importances of a random forest classifier. The “numpy.mean“ function can be used to calculate mean values of arrays or lists. Make sure to select the right axis with the axis parameter along which the mean should be calculated.

%% Cell type:code id:d2168504 tags:

``` python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


X, y = load_iris(return_X_y=True)
# Collect the feature importances of ten independently fitted random forests.
feature_importance = []
for _ in range(10):
    forest = RandomForestClassifier()
    forest.fit(X, y)
    feature_importance.append(forest.feature_importances_)
# The task requires the per-feature means to be stored in
# "mean_feature_importances"; the original left it as an empty list and
# only printed the mean.  axis=0 averages each feature across the forests.
mean_feature_importances = list(np.mean(feature_importance, axis=0))
print(mean_feature_importances)
```

%% Cell type:markdown id:24d537c8 tags:

### Exercise 2

%% Cell type:markdown id:2bb9c605 tags:

Load the iris data from the “dataset” module of sklearn. The data of the iris data set has four dimensions. The task is to use the two most important features found out in exercise 1 to fit a random forest classifier and compare the mean accuracy of the classifier against a random forest classifier that is fitted with all four features. Save the two mean accuracy values in the variable “mean_accuracy_list” as a one-dimensional list with two items. The first item shall be the mean accuracy of the classifier fitted with only two features.
Tip: If you did not solve exercise 1, you can choose two features on your own. If you need to know how to slice the data in order to be able to access only two dimensions, take a look in the hands-on videos. You can find in the documentation how to access the mean accuracy of a random forest classifier.

%% Cell type:code id:5727cb6c tags:

``` python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


X, y = load_iris(return_X_y=True)
####Add your code for task2 here####
# Forest trained on the two most important features (petal length/width,
# columns 2 and 3) versus a forest trained on all four features.
two_feature_forest = RandomForestClassifier()
two_feature_forest.fit(X[:, 2:4], y)
full_forest = RandomForestClassifier()
full_forest.fit(X, y)
# First entry: mean accuracy with two features; second: with all four.
mean_accuracy_list = [
    two_feature_forest.score(X[:, 2:4], y),
    full_forest.score(X, y),
]
####End of your code####
print(mean_accuracy_list)
```

%% Cell type:markdown id:2ca24640 tags:

## 7.3 Model Evaluation

%% Cell type:markdown id:00172dc1 tags:

### Exercise 1

%% Cell type:markdown id:822a0e8e tags:

The protein analyzer of the grain mill is defect. Since you have data available, you want to try and predict the protein values of the grain by using the values of the other ingredients as features. The load data function will load the grain data from the csv file “grain_data.csv”. The other ingredients are provided as “ingredients” and the respective protein values are available as “protein_values”. The list of ingredients contains the fat, the carbohydrates, the fiber and the minerals in gram. Use the data and the labels to fit a linear regression, a k-nearest-neighbor regressor and a random forest regressor. Use parameters of your choice. Split the data and the labels with the “train_test_split” function of the “model_selection” module of sklearn. Save the coefficient of determination of each classifier and the mean squared error of each classifier in the respective lists “r_2” and “MSE”. The error and the coefficient of determination shall be determined on the test set. The order in which the values shall be stored in the respective lists is: linear regression, knn regressor, random forest regressor.
Tip: The needed regressors can be found in the “linear model”, the ”neighbors” and the “ensemble” modules of sklearn. The required metrics can be calculated with functions of the “metrics” module of sklearn.

%% Cell type:code id:4168caeb tags:

``` python
import csv
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor


def load_data():
    """Read the grain measurements from "grain_data.csv".

    Returns:
        data: numpy array with the five float columns of every data row.
        labels: list with the integer value of the sixth column per row.
    """
    samples = []
    labels = []
    with open("grain_data.csv", newline="") as handle:
        rows = csv.reader(handle, delimiter=",")
        next(rows, None)  # skip the header line
        for record in rows:
            samples.append([float(record[i]) for i in range(5)])
            labels.append(int(record[5]))
    return np.array(samples), labels

data, labels = load_data()
ingredients = data[:, 1:5]   # fat, carbohydrates, fiber, minerals in gram
protein_values = data[:, 0]  # regression target
r_2 = []
MSE = []
X_train, X_test, y_train, y_test = train_test_split(ingredients, protein_values)

# Fit and evaluate every regressor on the same held-out split.  The original
# repeated the fit/predict/score sequence verbatim three times; a loop keeps
# the required result order (linear regression, knn, random forest) while
# removing the duplication.
for model in (LinearRegression(), KNeighborsRegressor(), RandomForestRegressor()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r_2.append(r2_score(y_test, y_pred))
    MSE.append(mean_squared_error(y_test, y_pred))

print("r_2:", r_2)
print("MSE:", MSE)
```

%% Cell type:markdown id:ef741f26 tags:

### Exercise 2

%% Cell type:markdown id:0c34c552 tags:

You are unsure if the results of exercise 1 are reliable, since the outcome of the evaluation is dependent on the split of the data. To check the reliability, you are going to perform a ten-fold cross-validation. The data is provided in the same way as for exercise 1. Save the mean values of the coefficient of determination of the ten-fold cross-validation in the r_2_mean list for the linear regression, the knn regressor and the random forest regressor in the same order. Do the same for the mean values of the ten-fold cross-validation of the mean squared errors.

%% Cell type:code id:ac0bb99e tags:

``` python
import csv
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def load_data():
    """Read the grain measurements from "grain_data.csv".

    Returns:
        data: numpy array with the five float columns of every data row.
        labels: list with the integer value of the sixth column per row.
    """
    samples = []
    labels = []
    with open("grain_data.csv", newline="") as handle:
        rows = csv.reader(handle, delimiter=",")
        next(rows, None)  # skip the header line
        for record in rows:
            samples.append([float(record[i]) for i in range(5)])
            labels.append(int(record[5]))
    return np.array(samples), labels

data, labels = load_data()
ingredients = data[:, 1:5]   # fat, carbohydrates, fiber, minerals in gram
protein_values = data[:, 0]  # regression target
k_fold = KFold(n_splits=10, shuffle=True)

# One score list per regressor; index order is fixed by the task:
# 0 = linear regression, 1 = knn regressor, 2 = random forest regressor.
models = (LinearRegression, KNeighborsRegressor, RandomForestRegressor)
r_2_list = [[] for _ in models]
MSE_list = [[] for _ in models]

for train, test in k_fold.split(ingredients):
    # numpy fancy indexing replaces the original per-index list
    # comprehensions and keeps the arrays as ndarrays.
    X_train, X_test = ingredients[train], ingredients[test]
    y_train, y_test = protein_values[train], protein_values[test]
    # The original repeated the fit/predict/score sequence verbatim for all
    # three regressors inside every fold; one inner loop removes that.
    for idx, model_cls in enumerate(models):
        model = model_cls()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        r_2_list[idx].append(r2_score(y_test, y_pred))
        MSE_list[idx].append(mean_squared_error(y_test, y_pred))

# Average the ten per-fold scores of each regressor.
r_2_mean = np.mean(r_2_list, axis=1)
MSE_mean = np.mean(MSE_list, axis=1)

print("r_2:", r_2_mean)
print("MSE:", MSE_mean)
```