The read_csv function will load the data from the csv file. The data is two-dimensional. Use the “KMeans” class from the “cluster” module of sklearn to cluster the data. Find a suitable value for the parameter “n_clusters” (“k”) by plotting the inertia for different values of “k” with the “matplotlib.pyplot” module. Save the “k” that you have chosen in the variable “k_chosen”, which defaults to 0.
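A minimal sketch of this elbow approach, using synthetic blob data from “make_blobs” as a stand-in for the csv data (the actual file is not available here); the chosen k of 3 reflects the synthetic data, not the exercise data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the csv data: 300 two-dimensional samples in 3 blobs.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

k_values = range(1, 11)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # sum of squared distances to cluster centers

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.savefig("elbow.png")

k_chosen = 3  # read off at the "elbow" of the plotted curve
```

The inertia always decreases as k grows; the point where the decrease flattens out (the elbow) is the usual choice for k.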
The read_csv function will load the data from the csv file. The data is four-dimensional and is labelled with three different classes. Use the “PCA” class from the “decomposition” module of sklearn to reduce the dimensionality of the data to two dimensions. Plot the transformed two-dimensional data with the “matplotlib.pyplot” module. Make sure that the data samples from the three classes are colored differently.
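A possible sketch, using the iris data set as a stand-in for the csv data (it is also four-dimensional with three classes); the color choices are arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Stand-in data: four features, three labelled classes.
data = load_iris()
X, y = data.data, data.target

# Project the four-dimensional samples onto the two main principal components.
X2 = PCA(n_components=2).fit_transform(X)

for label, color in zip((0, 1, 2), ("red", "green", "blue")):
    mask = y == label
    plt.scatter(X2[mask, 0], X2[mask, 1], c=color, label=data.target_names[label])
plt.legend()
plt.savefig("pca.png")
```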
Load the iris data from the “datasets” module of sklearn. The iris data set has four dimensions. The task is to find out the importance of each of the four features, which are “sepal length”, “sepal width”, “petal length” and “petal width”. Use the “RandomForestClassifier” class from the “ensemble” module of sklearn to create ten instances of a random forest classifier. Use suitable parameters of your choice to train the ten classifiers and compute the mean feature importance of each of the four features across the ten classifiers. Save the mean values in the variable “mean_feature_importances” as a one-dimensional list with four items, one for the mean feature importance of each respective feature.
Tip: Start by fitting a single random forest and accessing its feature importances before iterating over ten classifiers. Use the documentation of sklearn to find out how to access the feature importances of a random forest classifier. The “numpy.mean” function can be used to calculate mean values of arrays or lists. Make sure to select the right axis with the axis parameter along which the mean should be calculated.
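One way this could look, varying the random seed to get ten distinct classifiers (the chosen parameters are just one reasonable option):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

importances = []
for seed in range(10):
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    importances.append(clf.feature_importances_)  # one value per feature

# axis=0 averages over the ten classifiers, leaving one mean value per feature
mean_feature_importances = list(np.mean(importances, axis=0))
```

Each classifier's feature importances are normalized to sum to one, so the four mean values sum to one as well.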
Load the iris data from the “datasets” module of sklearn. The iris data set has four dimensions. The task is to use the two most important features identified in exercise 1 to fit a random forest classifier and compare the mean accuracy of the classifier against a random forest classifier that is fitted with all four features. Save the two mean accuracy values in the variable “mean_accuracy_list” as a one-dimensional list with two items. The first item shall be the mean accuracy of the classifier fitted with only two features.
Tip: If you did not solve exercise 1, you can choose two features on your own. If you need to know how to slice the data in order to access only two dimensions, take a look at the hands-on videos. The documentation explains how to access the mean accuracy of a random forest classifier.
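A sketch of the comparison, assuming petal length and petal width (columns 2 and 3) turned out to be the two most important features in exercise 1:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumption: petal length and petal width (columns 2 and 3) were the two
# most important features in exercise 1.
clf_two = RandomForestClassifier(random_state=0).fit(X_train[:, 2:4], y_train)
clf_all = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# .score returns the mean accuracy on the given test data and labels.
mean_accuracy_list = [
    clf_two.score(X_test[:, 2:4], y_test),
    clf_all.score(X_test, y_test),
]
```

Note that the same column slice must be applied to both the training and the test data.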
The protein analyzer of the grain mill is defective. Since you have data available, you want to try to predict the protein values of the grain by using the values of the other ingredients as features. The load_data function will load the grain data from the csv file “grain_data.csv”. The other ingredients are provided as “ingredients” and the respective protein values are available as “protein_values”. The list of ingredients contains the fat, the carbohydrates, the fiber and the minerals in gram. Use the data and the labels to fit a linear regression, a k-nearest-neighbor regressor and a random forest regressor. Use parameters of your choice. Split the data and the labels with the “train_test_split” function of the “model_selection” module of sklearn. Save the coefficient of determination of each regressor and the mean squared error of each regressor in the respective lists “r_2” and “MSE”. The error and the coefficient of determination shall be determined on the test set. The order in which the values shall be stored in the respective lists is: linear regression, knn regressor, random forest regressor.
Tip: The needed regressors can be found in the “linear_model”, the “neighbors” and the “ensemble” modules of sklearn. The required metrics can be calculated with functions of the “metrics” module of sklearn.
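A minimal sketch of the workflow, using synthetic data from “make_regression” as a stand-in for the grain data (“grain_data.csv” is not available here) and arbitrary regressor parameters:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the grain data: four ingredient features, one target.
ingredients, protein_values = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    ingredients, protein_values, random_state=0)

# Same order as required for the result lists.
regressors = [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=5),
    RandomForestRegressor(n_estimators=100, random_state=0),
]

r_2, MSE = [], []
for reg in regressors:
    reg.fit(X_train, y_train)
    pred = reg.predict(X_test)           # evaluate on the held-out test set
    r_2.append(r2_score(y_test, pred))
    MSE.append(mean_squared_error(y_test, pred))
```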
You are unsure if the results of exercise 1 are reliable, since the outcome of the evaluation depends on the split of the data. To check the reliability, you are going to perform a ten-fold cross-validation. The data is provided in the same way as for exercise 1. Save the mean values of the coefficient of determination from the ten-fold cross-validation in the r_2_mean list for the linear regression, the knn regressor and the random forest regressor, in that order. Do the same for the mean values of the mean squared errors from the ten-fold cross-validation.
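A sketch using “cross_validate” from sklearn's “model_selection” module, again with synthetic stand-in data; the name MSE_mean for the second result list is an assumption, as is the choice of regressor parameters:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the grain data: four ingredient features, one target.
ingredients, protein_values = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=0)

regressors = [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=5),
    RandomForestRegressor(n_estimators=100, random_state=0),
]

r_2_mean, MSE_mean = [], []  # MSE_mean is an assumed name, not from the exercise
for reg in regressors:
    scores = cross_validate(
        reg, ingredients, protein_values, cv=10,
        scoring=("r2", "neg_mean_squared_error"))
    r_2_mean.append(np.mean(scores["test_r2"]))
    # sklearn returns the negated MSE (higher is better), so flip the sign back
    MSE_mean.append(-np.mean(scores["test_neg_mean_squared_error"]))
```

Averaging over ten folds smooths out the dependence on any single train/test split.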