{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forests (Part 2; python models)\n", "\n", "```\n", "date: \"Block 06\"\n", "author: \"Daniel Lawson\"\n", "email: dan.lawson@bristol.ac.uk\n", "output: html_document\n", "version: 1.0.1\n", "```\n", "\n", "Here we get a random forest classifier running on the kddcup data. We start by importing data from R, after the standard boilerplate." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Activity 1: Read in the data that we saved from R. \n", "\n", "This requires telling Python that the first column is the \"index column\" (like row names in R). We use the function pd.read_csv." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "r_train=pd.read_csv('https://raw.githubusercontent.com/dsbristol/dst/master/data/conndataC_train.csv',index_col=0) \n", "r_test=pd.read_csv('https://raw.githubusercontent.com/dsbristol/dst/master/data/conndataC_test.csv',index_col=0) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need the output of the Random Forest that was run in R (**block06-TreesAndForests_Part1.Rmd**).\n", "\n", "You should really save this locally, but for convenience I've added it to the GitHub repo." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "r_rf_roc=pd.read_csv('https://raw.githubusercontent.com/dsbristol/dst/master/data/conndataC_RFroc.csv',index_col=0) # EDIT" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration orig_bytes resp_bytes orig_ip_bytes resp_ip_bytes http\n", "64203 0.173953 5.723585 6.198479 6.352629 6.486161 0\n", "208055 0.029559 5.894403 8.194229 6.599870 8.278936 0\n", "72988 0.058269 6.232448 9.003808 6.734592 9.053219 0\n", "222960 0.779325 6.848005 7.561122 7.239933 7.803843 1\n", "71198 0.019803 6.202536 9.005896 6.716595 9.050524 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the data into the format expected by random forest in python." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "r_train_features= np.array(r_train)[:,0:4]\n", "r_train_labels= np.array(r_train)[:,5].ravel() # this becomes a 'horizontal' array\n", "\n", "r_test_features= np.array(r_test)[:,0:4]\n", "r_test_labels= np.array(r_test)[:,5].ravel() " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., ..., 0., 1., 0.])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_train_labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q2 Run a Random Forest Classifier. \n", "\n", "How do you access the prediction probabilities?\n", "\n", "Look up how Python handles Random Forests. How does it differ to R?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "rdata_rf = RandomForestClassifier(n_estimators=100, max_features=3)\n", "rdata_rf.fit(r_train_features,r_train_labels);" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "?RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "rdata_rf_predictions=rdata_rf.predict_proba(r_test_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below extracts the predicted probability of class 1 (the second column of the `predict_proba` output); the same column indexing generalises to multi-class datasets." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0. , 1. , 1. , ..., 0.88, 1. , 0.96])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdata_rf_predictions[:,1] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Activity 3: Make an ROC curve dataset using the function roc_curve.\n", "\n", "We'll extract an ROC curve and take a look at how Python represents it." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_curve\n", "rdata_rf_fpr, rdata_rf_tpr, _ = roc_curve(r_test_labels,rdata_rf_predictions[:,1]) " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([0. 
, 0.69397993, 0.73578595, 0.76254181, 0.79264214,\n", " 0.80434783, 0.81772575, 0.8277592 , 0.8361204 , 0.84280936,\n", " 0.84949833, 0.85284281, 0.86956522, 0.87792642, 0.88294314,\n", " 0.88461538, 0.89464883, 0.89799331, 0.90301003, 0.90301003,\n", " 0.90635452, 0.91137124, 0.91137124, 0.91304348, 0.91471572,\n", " 0.92140468, 0.92474916, 0.92809365, 0.93143813, 0.93979933,\n", " 0.94314381, 0.94481605, 0.94816054, 0.94983278, 0.95317726,\n", " 0.9548495 , 0.95819398, 0.95986622, 0.95986622, 0.96153846,\n", " 0.96488294, 0.96488294, 0.96822742, 0.96989967, 0.96989967,\n", " 0.97324415, 0.97491639, 0.97491639, 0.97491639, 0.97491639,\n", " 0.97491639, 0.97658863, 0.97658863, 0.97826087, 0.98160535,\n", " 0.98160535, 0.98494983, 0.98494983, 0.98662207, 0.98829431,\n", " 0.98829431, 0.98829431, 0.98829431, 0.98829431, 0.98829431,\n", " 0.98829431, 0.98829431, 0.98996656, 0.99331104, 0.99498328,\n", " 0.99498328, 0.99498328, 0.99498328, 0.99498328, 0.99498328,\n", " 0.99498328, 1. ])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdata_rf_tpr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will plot the actual ROC curve, showing the R implementation in red and the python implementation in blue. Which is better? Why?" 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Figure: ROC curve, TPR against FPR on a log-scaled x-axis; R in red, Python in blue]
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#plt.clf()\n", "plt.semilogx(np.array(r_rf_roc)[:,0], np.array(r_rf_roc)[:,1],'r')\n", "plt.semilogx(rdata_rf_fpr,rdata_rf_tpr,'b')\n", "plt.xlim([0.00001,1])\n", "plt.ylim([0.6,1])\n", "plt.xlabel('FPR')\n", "plt.ylabel('TPR')\n", "plt.title('ROC curve')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The R Random Forest defaults are producing a true positive rate about 5% higher at very low false positive rates! \n", "\n", "## Mastery question: why is that? \n", "\n", "Can you demonstrate it?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Natively read and process the input data\n", "\n", "Here we go back to the KDD cup [10%](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz) data, with its [column names](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names). If you kept your code in `/code` and data in `/data` then this will work for you; otherwise, download the data again or change the locations below.\n", "\n", "(We first met this data in `block01-EDA.Rmd`)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading in the header. It's a bit clunky to make a list out of the strange format that the data are provided in. I've provided code for this for the future." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "header=pd.read_csv('../data/kddcup.names',sep=\"\\t\", header=None,skiprows=1).iloc[:,0].tolist()\n", "colnames=[str(x).split(':')[0] for x in header]+['normal']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/kddcup.data_10_percent.gz', sep=\",\", header=None, names=colnames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The usual data checking" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration protocol_type service flag src_bytes dst_bytes land \\\n", "0 0 tcp http SF 181 5450 0 \n", "1 0 tcp http SF 239 486 0 \n", "2 0 tcp http SF 235 1337 0 \n", "3 0 tcp http SF 219 1337 0 \n", "4 0 tcp http SF 217 2032 0 \n", "\n", " wrong_fragment urgent hot ... dst_host_srv_count \\\n", "0 0 0 0 ... 9 \n", "1 0 0 0 ... 19 \n", "2 0 0 0 ... 29 \n", "3 0 0 0 ... 39 \n", "4 0 0 0 ... 49 \n", "\n", " dst_host_same_srv_rate dst_host_diff_srv_rate \\\n", "0 1.0 0.0 \n", "1 1.0 0.0 \n", "2 1.0 0.0 \n", "3 1.0 0.0 \n", "4 1.0 0.0 \n", "\n", " dst_host_same_src_port_rate dst_host_srv_diff_host_rate \\\n", "0 0.11 0.0 \n", "1 0.05 0.0 \n", "2 0.03 0.0 \n", "3 0.03 0.0 \n", "4 0.02 0.0 \n", "\n", " dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " dst_host_srv_rerror_rate normal \n", "0 0.0 normal. \n", "1 0.0 normal. \n", "2 0.0 normal. \n", "3 0.0 normal. \n", "4 0.0 normal. \n", "\n", "[5 rows x 42 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Our data is a bit big for interactive exploration. The methods work fine but you have to wait too long. 
For this session, let's downsample.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(100000, 42)\n", "(100000, 42)\n" ] } ], "source": [ "print(df.shape)\n", "df=df.sample(100000)\n", "print(df.shape)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "duration int64\n", "protocol_type object\n", "service object\n", "flag object\n", "src_bytes int64\n", "dst_bytes int64\n", "land int64\n", "wrong_fragment int64\n", "urgent int64\n", "hot int64\n", "num_failed_logins int64\n", "logged_in int64\n", "num_compromised int64\n", "root_shell int64\n", "su_attempted int64\n", "num_root int64\n", "num_file_creations int64\n", "num_shells int64\n", "num_access_files int64\n", "num_outbound_cmds int64\n", "is_host_login int64\n", "is_guest_login int64\n", "count int64\n", "srv_count int64\n", "serror_rate float64\n", "srv_serror_rate float64\n", "rerror_rate float64\n", "srv_rerror_rate float64\n", "same_srv_rate float64\n", "diff_srv_rate float64\n", "srv_diff_host_rate float64\n", "dst_host_count int64\n", "dst_host_srv_count int64\n", "dst_host_same_srv_rate float64\n", "dst_host_diff_srv_rate float64\n", "dst_host_same_src_port_rate float64\n", "dst_host_srv_diff_host_rate float64\n", "dst_host_serror_rate float64\n", "dst_host_srv_serror_rate float64\n", "dst_host_rerror_rate float64\n", "dst_host_srv_rerror_rate float64\n", "normal object\n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes # Usual checking of data " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that several fields that we are interested in have dtype \"object\". The classifiers don't like this, so we are going to convert them to factors (analogous to factors in R). 
These are represented as integer numbers.\n", "\n", "Only the class (\"normal\") will stay in the string (object) format.\n", "\n", "**Note:** the way Python handles factors has changed many times, and may change again in the future!" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "#df['protocol_type_cat'] = df['protocol_type'].astype('category') # Alternative: pandas categorical dtype. But only in later versions of pandas.\n", "df['protocol_type'], protocols= pd.factorize(df['protocol_type'])\n", "df['service'], services = pd.factorize(df['service'])\n", "df['flag'], flags = pd.factorize(df['flag'])\n", "# We have the keys to convert back in protocols, services, flags" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['icmp', 'tcp', 'udp'], dtype='object')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "protocols" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "duration int64\n", "protocol_type int64\n", "service int64\n", "flag int64\n", "src_bytes int64\n", "dst_bytes int64\n", "land int64\n", "wrong_fragment int64\n", "urgent int64\n", "hot int64\n", "num_failed_logins int64\n", "logged_in int64\n", "num_compromised int64\n", "root_shell int64\n", "su_attempted int64\n", "num_root int64\n", "num_file_creations int64\n", "num_shells int64\n", "num_access_files int64\n", "num_outbound_cmds int64\n", "is_host_login int64\n", "is_guest_login int64\n", "count int64\n", "srv_count int64\n", "serror_rate float64\n", "srv_serror_rate float64\n", "rerror_rate float64\n", "srv_rerror_rate float64\n", "same_srv_rate float64\n", "diff_srv_rate float64\n", "srv_diff_host_rate float64\n", "dst_host_count int64\n", "dst_host_srv_count int64\n", "dst_host_same_srv_rate float64\n", "dst_host_diff_srv_rate float64\n", "dst_host_same_src_port_rate float64\n", "dst_host_srv_diff_host_rate 
float64\n", "dst_host_serror_rate float64\n", "dst_host_srv_serror_rate float64\n", "dst_host_rerror_rate float64\n", "dst_host_srv_rerror_rate float64\n", "normal object\n", "dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we extract the features and labels, in both pandas and numpy formats." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "features_pd= df.iloc[:,:df.shape[1]-1]\n", "labels_pd= df.iloc[:,df.shape[1]-1:]\n", "labels= labels_pd.values.ravel() # ravel() flattens the single label column into a 1-D array\n", "\n", "## Not needed here:\n", "## features= np.array(features_pd)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['smurf.', 'smurf.', 'neptune.', ..., 'smurf.', 'smurf.',\n", " 'ipsweep.'], dtype=object)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#labels_pd\n", "labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving the data\n", "\n", "There are many ways to save data. This is a simple though lazy one: through a pickle file.\n", "\n", "In Python this makes the test/train data reproducible: the pickle fixes the random downsample, and the split below uses a fixed random seed. For R use, we'd have to save the X_train/y_train/X_test/y_test objects separately as CSVs or similar.\n", "\n", "Pickles can store very many types of object, but those that use memory pointers don't work (we encounter these when working with large scale data)." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Save Python objects into pickle files; 'with' closes each file once written\n", "import pickle\n", "with open(\"06-features_pd.pickle\", \"wb\") as f:\n", "    pickle.dump(features_pd, f)\n", "with open(\"06-labels.pickle\", \"wb\") as f:\n", "    pickle.dump(labels, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test/train split\n", "\n", "Now separate the data into a training set and a test set.\n", "\n", "features= pd.DataFrame(features)\n", "\n", "Create training and testing variables.\n", "\n", "Note: if train_size + test_size < 1.0 we are subsampling. This is useful for making test code run faster.\n", "\n", "Use small numbers for slow classifiers, such as KNN, Radius, SVC, ..." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train, y_train: (50000, 41) (50000,)\n", "X_test, y_test: (50000, 41) (50000,)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " features_pd, labels, train_size=0.5, test_size=0.5,random_state=1)\n", "print (\"X_train, y_train:\", X_train.shape, y_train.shape)\n", "print (\"X_test, y_test:\", X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have everything we need to start to run the classifiers. Notice the wide range of tuning parameters that are available..." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "clf= RandomForestClassifier(n_jobs=-1, random_state=3, n_estimators=100)\n", "#, max_features=0.8, min_samples_leaf=3, n_estimators=500, min_samples_split=3, random_state=10, verbose=1)\n", "\n", "trained_model= clf.fit(X_train, y_train)\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Score: 1.0\n" ] } ], "source": [ "print( \"Score: \", trained_model.score(X_train, y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The score (accuracy on the training set) is 1! Is this a rounding error, or severe overfitting?" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['smurf.', 'smurf.', 'smurf.', ..., 'neptune.', 'smurf.', 'normal.'],\n", " dtype=object)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Predicting\n", "y_pred = clf.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is how we make a confusion matrix in sklearn, and also how we evaluate the loss function for categorical labels. We use a \"zero/one\" loss: score 0 for the wrong class, 1 for the right one. It's not the best choice of loss for all applications!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q4: Using confusion_matrix, make and display (as text and as an image) the confusion matrix." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix:\n", " [[ 233 0 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 0]\n", " [ 0 4 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 3 0 0 0 0 0 2 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 107 0 0 0 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 2 0 0 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 10812 0 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 21 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 1 0 0 0 9923 0 0 0\n", " 0 0 3 0]\n", " [ 0 0 0 0 0 0 0 0 0 25 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 1 0 0 0 120 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 2 0 0 151\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 0\n", " 28385 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 98 0 0]\n", " [ 0 0 0 0 0 0 0 0 11 0 0 0\n", " 0 0 90 0]\n", " [ 0 0 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 2]]\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "results = confusion_matrix(y_test, y_pred) # EDIT to make a confusion matrix from TEST (rows) and PRED ( columns)\n", "print (\"Confusion matrix:\\n\", results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this isn't trivial to read, we'll also make a heatmap using seaborn. Make it with log10 of the results." 
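, "\n", "\n", "One possible way to do this, sketched on a small made-up matrix (`cm_demo` and `log_cm` are illustrative names, not objects from this notebook): plot `log10(1 + count)` so that empty cells stay finite while the rare misclassifications remain visible next to the large diagonal counts. Adding 1 before taking the log is an assumption of this sketch, not something the exercise prescribes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# Hypothetical 3x3 confusion matrix (made-up counts, not the results above)\n", "cm_demo = np.array([[233, 0, 1], [0, 10812, 0], [11, 0, 90]])\n", "\n", "# log10(1 + count) keeps zero cells finite, since log10(0) is -inf\n", "log_cm = np.log10(1 + cm_demo)\n", "\n", "# Colour by the log counts, but annotate each cell with the raw count\n", "ax = sns.heatmap(log_cm, linewidths=0.5, annot=cm_demo, fmt='d')\n", "ax.set_xlabel('predicted class')\n", "ax.set_ylabel('true class')\n", "plt.show()"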
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "sresults=[x/(1+x.sum()) for x in results]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAV0AAAD4CAYAAABPLjVeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAc2UlEQVR4nO3de5xdZX3v8c+XkATCJXIRhCQaUKTg5SDmRKsWsUAN1JLa6ikXXwVEp31x4qVe8eBRkGpLK3J8WY7OqGC9gYgWolJBFNDjBRKRS0JAY+QywYAIQi2UZGZ+54+1Rjfj7L2ue++1t983r/Watddaz28/e2XzzJpnPc9vKSIwM7Pe2K7fFTAz+33iRtfMrIfc6JqZ9ZAbXTOzHnKja2bWQ9v34D08PMLM8lLVANse2JS7zZm75/6V36+oXjS6bLvvjtJl5+59IADbz1tUOsbE1s2VytcRY2LrZqD656gjRlPOxdwKMbYN2bloQoymnIth15NG18ysZ6Ym+12DjtzomtlwmZzodw06cqNrZkMlYqrfVegos9GV9AfASmC6s2YzsDoiNnSzYmZmpUw1u9HtOGRM0juBi0nuKN6QLgIuknR696tnZlZQTOVf+iDrSvdU4FkRsa11o6QPAeuBf5ytkKQRYARgdHSUU1a+tIaqmpnlMOA30qaAfYG7ZmzfJ903q4gYA8amX1YZMmZmVsiA9+m+GfimpJ8A96Tbngo8A1jVzYqZmZURgzx6ISK+LumZwHKeeCNtTUQ0+xrezH4/NfxGWubohUjGX/ygB3UxM6tuwLsXzMwGS8NvpKkHj+txwhszy6tyAprHN1yTu82Zf9DLhjPhjZlZzwzyjbTa3qSGzEMPn3Jk6RgLL7y6MRmUmhCjKefi8MXl/02vHb8aGJ5z0YQYTTkXlQ36jTQzs0HS9IFVbnTNbLh49IKZWQ+5e8HMrIcafqVb+sGUkk6psyJmZrWY3JZ/6YMqTwM+q90OSSOS1kpaOzY21u4wM7P6TU3lX/qgY/eCpFva7QL2blduZpax01a1bZ/NzOrV8O6FrD7dvYGXAw/N2C7ge12pkZlZFQN+I+2rwM4RcdPMHZKu7UqNzMyqGORGNyJO7bDvhPqrY2ZWTfTpBlleHjJmZsOl4X26zjJmZk1SOevXY98cy93m7HjEyHBmGWtCMo+1i/+8dHmAZeOXNeJz1BGjKYlN3v+0E0vHOOOuzwHDcy6aEKMp56Kyhl/punvBzIbLIN9IMzMbOL7SNTProQknMTcz652GX+lm5l6Q9AeSjpC084ztK7pXLTOzkhqee6FjoyvpjcDlwBuAdZJWtuz+QIdyTnhjZv0RU/mXPsjqXng98PyI+LWkpcClkpZGxIfpMJ7OCW/MrG8GfPTCdhHxa4CIuFPS4SQN79OoYRCzmVntBrxP9z5Jh0y/SBvgVwB7As/pZsXMzEqZmMi/9EFWo/vXwJbWDRExERF/DRzWtVqZmZUVkX/pg6wsY+Md9n23/uqYmVXU8D5dJ7wxsyapnvDmc/87f8KbE8/u+b2pKs9IMzNrnhqHjElaIekOSRslnT7L/qdKukbSjyTdIumYrJi/N1nGmpJBqQkxmnIuqmR+WzZ+GTA856IJMZpyLiqbnKwljKQ5wPnAUcA4sEbS6oi4reWwdwOXRMRHJR0MXAEs7RTX04DNbLjU16e7HNgYEZsAJF0MrARaG90Adk3XFwL3ZgV1o2tmw6VAoytpBBhp
2TSWTu4CWATc07JvHHjBjBBnAldJegOwE3Bk1nu60TWz4VJgcsSM2bNlHA98KiLOlfSHwGckPTuifSUyG11Jy5O6xZq0z2IFcHtEXFGhomZmXRFTtQ2Y2gwsaXm9ON3W6lSSNpGI+L6kHUgmj93fLmjHRlfSe4Gjge0lfYPk0voa4HRJz4uI9xf9FGZmXVVfn+4a4ABJ+5E0tscBM5+CfjdwBPApSQcBOwC/6BQ060r3VcAhwHySmWmLI+IRSR8ErgdmbXRb+0lGR0cz3sLMrEY1jV6IiAlJq4ArgTnABRGxXtL7gLURsRp4K/BxSX9HclPt5MiY/JDV6E5ExCTwqKSfRsQjaWUek9T214mzjJlZ39Q4Iy3tRr1ixrb3tKzfBry4SMysRnerpAUR8Sjw/OmNkhYCzZ5rZ2a/nxo+DTir0T0sIh4HmHE3bi5wUtdqZWZWVp8S2eSVlfDm8TbbHwAe6EqNzMyqGPArXTOzwVLfkLGucJYxM2uSylm/Hj3nlNxtzoJ3XtjzLGNOeNOjGNOfY9t9d5SOMXfvA4HhORdNiOFz8dsYTTkXVYW7F8zMeqjh3QtudM1suDT8wZRudM1suPhK18yshybqmQbcLYUf1yPp092oiJlZLWp8XE83ZGUZWz1zE/AySU8CiIhju1UxM7NSBrx7YTHJoyk+QTLeVsAy4NxOhZxlzMz6pelDxrK6F5YBPwTOAB6OiGuBxyLiuoi4rl2hiBiLiGURsWxkZKTdYWZm9ZuK/EsfZOVemALOk/TF9Od9WWXMzPpqwLsXAIiIceDVkv4UeKS7VTIzq6CmJObdUuiqNSK+BnytS3UxM6usxmekdYW7CsxsuDS80XWWMTNrkspZv/5j1TG525xd/uUKZxmbadgyKNUR4+FTjiwdY+GFVw/VufD3YvjORWUNv9J194KZDRc3umZmvROTzZ4c4UbXzIbLMF3pSnoJsBxYFxFXdadKZmblNX3IWMdpwJJuaFl/PfAvwC7AeyWd3uW6mZkV1/BpwFm5F+a2rI8AR0XEWcCfACe2KyRpRNJaSWvHxsZqqKaZWU5TBZY+yOpe2E7SbiSNsyLiFwAR8Z+SJtoViogxYLq1jdNWnVVLZc3MssTEYN9IW0iSZUxASNonIn4uaWdqGMRsZla7Zre5mVnGlrbZNQW8svbamJlV1PQbaaWGjEXEo8DPaq6LmVl1g3yla2Y2aJp+peuEN2bWJJXvFT248qW525zdL79uOBPemJn1SrQdV9UMzjLWoxhNyia17YFNpcsDzN1z/0Z8jjpi+Hvx2xhNORdV9enJ6rllTY4wMxssNU6OkLRC0h2SNrabhSvpf0i6TdJ6SZ/PiunuBTMbKnVd6UqaA5wPHAWMA2skrY6I21qOOQB4F/DiiHhI0l5ZcX2la2ZDJabyLxmWAxsjYlNEbAUuBlbOOOb1wPkR8RBARNyfFTQr4c0LJO2aru8o6SxJX5F0jqSFmVU2M+uxmFTupTVPTLqMtIRaBNzT8no83dbqmcAzJX1X0g8krciqX1b3wgXAf0vXPww8CpwDHAFcCPxF1huYmfVSke6FGXliytgeOAA4HFgMfFvScyLiV50KdLJdxG8GYCyLiEPT9f8n6aZ2hdLfFiMAo6OjOetuZlZdTNU29HYzsKTl9eJ0W6tx4PqI2Ab8TNKPSRrhNe2CZvXprpN0Srp+s6RlAJKeCWxrVygixiJiWUQsGxkZaXeYmVntauzTXQMcIGk/SfOA44DVM465jOQqF0l7knQ3dByTmdXovg54qaSfAgcD35e0Cfh4us/MrFEilHvpHCcmgFXAlcAG4JKIWC/pfZKOTQ+7EvilpNuAa4C3R8QvO8XNyjL2MHByejNtv/T48Yi4L8dnNzPruTonR0TEFcAVM7a9p2U9gLekSy65xulGxCPAzXmDmpn1y9Rks1N9e3KEmQ2VGm+kdYWzjJlZk1RuMe885Kjcbc7Sm74xnFnGnMxj+BKbPP6T75UuP/+A
FwHDcy6a8DnqiNGUc1FV968jq3H3gpkNlaZ3L7jRNbOhkjUUrN/c6JrZUJn06AUzs95p+pVuVpaxN0pa0ukYM7MmiSnlXvohaxrw2cD1kr4j6TRJT84TtDVd2thYlQQ+ZmbFRORf+iGr0d1EklnnbOD5wG2Svi7pJEm7tCvkhDdm1i9Nv9LN6tONiJgCrgKukjQXOBo4HvggkOvK18ysVyanmv1AnKxG9wm/CtKckauB1ZIWdK1WZmYlDfrkiL9qtyMiHq25LmZmlU01fPRCVmrHH/eqImZmdWj6kDEnvDGzJqncYt64ZGXuNufQey4fzoQ3Zma9MtDdC7W9iTMoOZvUjPIA2x7o+CipjubuuT8wPOeiCTGaci6qGvTRC2ZmA6Xp/ZludM1sqLh7wcysh5o+eqFjo9vyrPd7I+JqSScALyJ5HPFYOlnCzKwxanwYcFdkXelemB6zQNJJwM7Al4EjgOXASd2tnplZMVF91FlXZTW6z4mI50raHtgM7BsRk5I+S4dHsksaAUYARkdHa6usmVmWiYZ3L2SNrdgu7WLYBVgALEy3zwfmtivkLGNm1i+Bci/9kHWl+0ngdmAOcAbwRUmbgBcCF3e5bmZmhQ10n25EnCfpC+n6vZI+DRwJfDwibuhFBc3Mihj0Pl0i4t6W9V8Bl3a1RmZmFQz0la6Z2aCZbPiVrrOMmVmTVG4xv/KU43O3OX+25aLhzDLmZB5ObDKzPNRzLrZuKn9rYd7+y4fqXAzL96KqqYZf6bp7wcyGStP/tHaja2ZDxTfSzMx6aEruXjAz65nJflcgQ2aKdUn7S3qbpA9L+pCkv5W0ay8qZ2ZW1JTyL1kkrZB0h6SNkk7vcNxfSgpJy7Jidmx0Jb0R+BiwA/DfSXIuLAF+IOnwDuVGJK2VtHZsbCyrDmZmtZlCuZdOJM0BzgeOBg4Gjpd08CzH7QK8Cbg+T/2yrnRfDxwdEX9PMv33WRFxBrACOK9dISe8MbN+iQJLhuXAxojYFBFbSfLNrJzluLOBc4D/ylO/PE9wm+73nU+ST5eIuJsOWcbMzPqlSPdC61/l6dJ6lbgIuKfl9Xi67TckHQosiYiv5a1f1o20TwBrJF0P/BFJa46kJwMP5n0TM7NeKTJkLCLGgFJ9oJK2Az4EnFykXFaWsQ9Luho4CDg3Im5Pt/8COKxMRc3MummyvhFjm0nuYU1bnG6btgvwbOBaJcPUngKslnRsRKxtFzRPlrH1wPoyNTYz67UaJ0esAQ6QtB9JY3sccML0zoh4GNhz+rWka4G3dWpwIV+frpnZwJgqsHQSERPAKuBKkofxXhIR6yW9T9KxZevnLGNm1iSVOwc+tuQ1uducv73ns84yNtOwZVBqQox58xeXLg+w9fHxRnyOOmJsvbdaz9m8fZ/ViM9RR4ym/D9SlXMvmJn1UNOnAbvRNbOhkmd6bz+50TWzoeLuBTOzHnKja2bWQ00fLtWVRjedvzwCMDo62o23MDObVdP7dLNSOy6U9I+Sbpf0oKRfStqQbntSu3LOMmZm/TJZYOmHrBlplwAPAYdHxO4RsQfwsnTbJd2unJlZUVNE7qUfshrdpRFxTkRsmd4QEVsi4hzgad2tmplZcXVNA+6WrEb3LknvkLT39AZJe0t6J0/MM2lm1gg1JjHviqxG96+APYDr0j7dB4Frgd2BV3e5bmZmhTX9Srd0whtJp0TEhTkObfoIDjNrjspjD9699ITcbc7f3/n5no91qJLa8azaamFmVpOmdy90HKcr6ZZ2u4C92+z73TdxBiVnk5pRHmD+Dksyjmzv8f9Kbik04VxUyVQ2b99nAf5eTJevw6DPSNsbeDnJELFWAr7XlRqZmVXQr6FgeWU1ul8Fdo6Im2buSB9NYWbWKM1ucrMfTHlqh30ntNtnZtYvg969YGY2UCYbfq3rRtfMhkrTr3RLDxmT9O8d9o1IWitp7djYWNm3MDMrLAr81w9ZQ8YObbcLOKRduYgYA6Zb2zht
lYf0mllvNP1KN6t7YQ1wHbPPEmmb2tHMrF8GfcjYBuBvIuInM3dIcsIbM2ucZje52Y3umbTv931DvVUxM6tuouHNbtY43Us77N6t5rqYmVXWrxtkeVXJMnZ3RDw1x6HNPgNm1iSVs369dumrcrc5F9x5ac+zjDnhTY9iOOHNE8uDz8V0eYBtD2wqHWPunvsDw3Muqmr6la4T3pjZUBn0IWNOeGNmA2WyZJdprzjhjZkNlUEfp2tmNlAGvU/XzGygDHqfbimSRoARgNHR0W68hZnZrJrevdAxy5ikXSX9g6TPSDphxr7/265cRIxFxLKIWDYyMlJXXc3MMtWZZUzSCkl3SNoo6fRZ9r9F0m2SbpH0TUlPy4qZldrxQpLhYV8CjpP0JUnz030vzKyxmVmPTUbkXjqRNAc4HzgaOBg4XtLBMw77EbAsIp4LXAr8U1b9shrdp0fE6RFxWUQcC9wIfEvSHlmBzcz6YYrIvWRYDmyMiE0RsRW4GFjZekBEXBMRj6YvfwAszgqa1ac7X9J2ETGVvsH7JW0Gvg3snBXczKzXitxIa73/lBpL84EDLAJasymOAy/oEO5UoO3DHaZlNbpfAf4YuHp6Q0R8StIW4CNZwc3Meq3IkLEZD1woTdJrgGXAS7OOzZoc8Y42278u6QN5K1THnOqqMZpQh6bEaEIdmhKjCXWA3+ZP6Gc9mnIuqqpx9MJmYEnL68XptieQdCRwBvDSiHg8K2jpZ6QBfgaPmTVOROReMqwBDpC0n6R5wHHA6tYDJD0PGAWOjYj789TPWcZ6FMOZtZ5YHnwupstDPedi2313lI4xd+8DG3MuqqrrEewRMSFpFXAlMAe4ICLWS3ofsDYiVgP/THJ/64uSAO5OBx205SxjZjZU6pwcERFXAFfM2PaelvUji8Z0ljEzGyplH8zQK84yZmZDpenTgJ3wxsyGirOMmZn1UNOTmGclvHmKpI9KOl/SHpLOlHSrpEsk7dOh3IiktZLWjo1VHndsZpZbjdOAuyJrnO6ngNtIpsJdAzwGHAN8B/hYu0LOMmZm/dL0RjdzyFhEfARA0mkRcU66/SOS2t5kMzPrl4EevcATr4Q/PWPfnJrrYmZW2aCPXrhc0s4R8euIePf0RknPAMpPfzEz65KBHr3QOvNixvaNkr7WnSqZmZU3Gc1+SprK9n9Iujsinprj0Gb/2jGzJlHVAM97yotztzk/2vLdyu9XlBPe9CiGk7w8sTzAQXstLx1jw/03AMNzLpoQ47FrPlG6PMCOL3tdIxLeDHqfrhPemNlAGeg+XZzwxswGzNQgDxlzwhszGzSDfqVrZjZQmj56wY2umQ2VpncvFH5GmqS9chzjhDdm1hdR4L9+yBoytvvMTcAN6cPYFBEPzlZuxmON47RVfoalmfVG0690s7oXHgDumrFtEXAjyaSH6s+NNjOr0aDfSHs7cBTw9oi4FUDSzyJiv67XzMyshMmY7HcVOsoaMnaupC8A50m6B3gvntZrZg026KkdiYhx4NWSjgW+ASzoeq3MzEpq+jTgQglvJO0IPD0i1kk6JSIuzFGs2WfAzJqkcgKaRbs9K3ebs/mh9T1PeFNoyFhEPBYR69KXHpJgZo0zFZF76QdnGetRjCZlk2rKuXjt0leVjnHBnZcCw3MumhCjjnOxdvGfly6/bPyySu8/bdBHLzjLmJkNlEGfBuwsY2Y2UAZ69IKzjJnZoBn0GWlmZgNloK90zcwGTdPH6Xal0ZU0AowAjI6OduMtzMxm1fQr3Y7jdCWtaFlfKOmTkm6R9HlJbYeMRcRYRCyLiGUjIyN11tfMrKPJmMq99EPW5IgPtKyfC/wc+DNgDeBLWDNrnIGeHDHDsog4JF0/T9JJ3aiQmVkVA929AOwl6S2S3grsKql1nnLhp06YmXVbnU+OkLRC0h2SNko6fZb98yV9Id1/vaSlWTGzGs6PA7sAOwP/CuyZvtFTgN+ZMGFm1m8RkXvpRNIc4HzgaOBg4HhJ
B8847FTgoYh4BnAecE5W/QplGZtRIWcZM7O6Vc76tf28RbnbnImtm9u+n6Q/BM6MiJenr98FEBH/0HLMlekx35e0PbAFeHJ0aFirdBHkzTKmrEXS3+Q5rlvlhylGE+rgz+FzUSFGZRNbNyvv0voQ3XRpHW61CLin5fV4uo3ZjomICeBhYI9O9csaMnZLm+VWCmQZy6HquLI6xqUNS4wm1KGOGE2oQ1NiNKEOTYpRm9bhrenS9ceXO8uYmdnsNgNLWl4vTrfNdsx42r2wEPhlp6DOMmZmNrs1wAGS9iNpXI8DZib6Wg2cBHwfeBXwrU79udCcLGNVL+nr+JNgWGI0oQ51xGhCHZoSowl1aFKMnoiICUmrgCuBOcAFEbFe0vuAtRGxGvgk8BlJG4EHSRrmjkqPXjAzs+I8wcHMrIfc6JqZ9VBfG92sKXY5yl8g6X5J67KPbhtjiaRrJN0mab2kNxUsv4OkGyTdnJY/q0Jd5kj6kaSvlix/p6RbJd0kaW3JGE+SdKmk2yVtSAeI5y17YPre08sjkt5cog5/l57LdZIukrRDiRhvSsuvz1uH2b5PknaX9A1JP0l/7law/KvTOkxJWlayDv+c/nvcIunfJD2pRIyz0/I3SbpK0r5FY7Tse6ukkLRnwTqcKWlzy/fjmE51GFpFpszVuZB0TP8U2B+YB9wMHFwwxmHAocC6CvXYBzg0Xd8F+HGRepAMn9s5XZ8LXA+8sGRd3gJ8HvhqyfJ3AntW/Hf5V+B16fo84EkV/n23AE8rWG4R8DNgx/T1JcDJBWM8G1gHLCC5WXw18Iwy3yfgn4DT0/XTgXMKlj8IOBC4liRpVJk6/Amwfbp+Tqc6dIixa8v6G4GPFY2Rbl9CcmPprk7ftTZ1OBN4W5Xv5zAs/bzSXQ5sjIhNEbEVuBhYWSRARHyb5I5haRHx84i4MV3/D2ADvzvrpFP5iIhfpy/npkvhu5OSFgN/CnyiaNm6SFpI8j/LJwEiYmtE/KpkuCOAn0bEXSXKbg/smI57XADcW7D8QcD1EfFoJLOErgP+IqtQm+/TSpJfRKQ/2z5jfLbyEbEhIu7IW/E2Ma5KPwfAD0jGixaN8UjLy53I+I52+H/rPOAdFcr/3utno5tnil1PSVoKPI/karVIuTmSbgLuB74REYXKp/4PyZe5SmblAK6S9MMZ0xnz2g/4BXBh2s3xCUk7lazLccBFRQtFxGbgg8DdJPmbH46IqwqGWQf8kaQ9JC0AjuGJg9yL2Dsifp6ub6HemZhlvBb49zIFJb1f0j3AicB7SpRfCWyOiJvLvH9qVdrNcUGnrpph5htpKUk7A18C3jzjqiBTRExGkmt4MbBc0rMLvvcrgPsj4odFys3iJRFxKElWpP8p6bCC5bcn+ZPwoxHxPOA/Sf6kLkTSPOBY4Islyu5GcnW5H7AvsJOk1xSJEREbSP4Mvwr4OklGvMmidZklbtDHBE6SzgAmgM+VKR8RZ0TEkrT8qoLvvQD4X5RorFt8FHg6cAjJL9RzK8QaWP1sdPNMsesJSXNJGtzPRcSXy8ZJ/xS/BliRdewMLwaOlXQnSTfLH0v6bIn335z+vB/4N5IunCLGgfGWK/VLSRrhoo4GboyI+0qUPRL4WUT8IiK2AV8GXlQ0SER8MiKeHxGHkUxj/3GJugDcJ2kfgPTn/SXjVCLpZOAVwIlp41/F54C/LFjm6SS/CG9Ov6eLgRuVpHnNJSLuSy9QpkjSxhb9fg6Ffja6v5lil14ZHUcypa6nJImkD3NDRHyoRPknT99NlrQjcBRwe5EYEfGuiFgcEUtJzsO3IqLQ1Z2knSTtMr1OcvOl0KiOiNgC3CPpwHTTEcBtRWKkjqdE10LqbuCFkhak/zZHkPSzFyJpr/TnU0n6cz9fsj7T0zxJf15eMk5pSp5V+A7g2Ih4tGSMA1perqT4d/TWiNgrIpam39NxkhvQWwrUYZ+Wl6+k4PdzaPTzLh5JX9uPSUYxnFGi/EUkf6Zs
I/kSnFoixktI/mS8heTP0JuAYwqUfy7wo7T8OuA9Fc/J4ZQYvUAyCuTmdFlf5nymcQ4B1qaf5zJgt4LldyJJ+LGwwjk4i6RRWAd8BphfIsZ3SH5h3AwcUfb7RJKm75vAT0hGQexesPwr0/XHgfuAK0vUYSPJ/Y/p72fWyIPZYnwpPZ+3AF8BFhWNMWP/nXQevTBbHT4D3JrWYTWwT5X/VwZ18TRgM7Me8o00M7MecqNrZtZDbnTNzHrIja6ZWQ+50TUz6yE3umZmPeRG18ysh/4/R5XmbAaRBagAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "\n", "ax = sns.heatmap((sresults), linewidth=0.5) ## EDIT to get it to plot the log10 of the results\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q5 The score was 1 above! Is the error really zero? Check using the zero_one_loss function." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error: 0.00048000000000003595\n" ] } ], "source": [ "from sklearn.metrics import zero_one_loss\n", "error = zero_one_loss(y_test, y_pred) ## EDIT to get it to output the error.\n", "print (\"Error: \", error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll evaluate two different models, the decision tree and logistic regression." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0010200000000000209\n" ] } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "cld = DecisionTreeClassifier(criterion='gini', splitter='best', \n", " max_depth=None, min_samples_split=2, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features=None, \n", " random_state=None, max_leaf_nodes=None, class_weight=None)\n", "# other parameters: min_impurity_decrease=0.0, \n", "\n", "trained_model_d= cld.fit(X_train, y_train)\n", "y_pred_d = cld.predict(X_test)\n", "error_d = zero_one_loss(y_test,y_pred_d) \n", "print(error_d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now Logistic Regression. 
This is quite a bit slower, and with the default settings the lbfgs solver reaches its iteration limit before converging (see the warning in the output)." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.019240000000000035\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "cll = LogisticRegression()\n", "\n", "trained_model_l = cll.fit(X_train, y_train)\n", "y_pred_l = cll.predict(X_test)\n", "error_l = zero_one_loss(y_test, y_pred_l)\n", "print(error_l)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparison:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error Random Forest 0.00048000000000003595 Error Decision Tree 0.0010200000000000209 Error Logistic Regression 0.019240000000000035\n" ] } ], "source": [ "print(\"Error Random Forest\",error,\"Error Decision Tree\",error_d,\"Error Logistic Regression\",error_l)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Activity 6: Use the function cross_val_score to get a prediction error using only the training dataset. \n", "\n", "Why is this useful?" 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "?cross_val_score" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=10.\n", " % (min_groups, self.n_splits)), UserWarning)\n" ] } ], "source": [ "cvf = cross_val_score(clf, X_train, y_train, cv=10)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.9992, 0.9996, 0.9994, 0.9998, 0.9996, 0.9994, 0.9992, 0.9994,\n", " 0.999 , 0.9998])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cvf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Activity 7: relative performance\n", "\n", "Are the cross-validation errors higher or lower than the error in the test dataset? Check by iterating over cvf. Why?" 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.00031999999999998696,\n", " -8.000000000008001e-05,\n", " 0.00012000000000000899,\n", " -0.000280000000000058,\n", " -8.000000000008001e-05,\n", " 0.00012000000000000899,\n", " 0.00031999999999998696,\n", " 0.00012000000000000899,\n", " 0.0005199999999999649,\n", " -0.000280000000000058]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Difference between each CV fold's error (1 - score) and the test-set error;\n", "# a positive value means that fold did worse than the test set\n", "[(1-x)-error for x in cvf]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 3: Performance as a function of the number of trees\n", "\n", "Here we fit random forests with different numbers of trees and gather the scores (on both the training and the test dataset).\n", "\n", "## Activity 8: How does the number of trees affect performance?\n", "\n", "Train a RandomForestClassifier with 1-10 trees. Store the scores on the left-out and training data." 
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "scores=[]\n", "trainscores=[]\n", "for ntrees in range(1,11):\n", " tmpcf=RandomForestClassifier(n_estimators=ntrees)\n", " tmp_trained_model= tmpcf.fit(X_train, y_train)\n", " tmp_y_pred = tmp_trained_model.predict(X_test)\n", " tmp_y_pred_train = tmp_trained_model.predict(X_train)\n", " tmp_test_error = zero_one_loss(y_test,tmp_y_pred)\n", " tmp_train_error = zero_one_loss(y_train,tmp_y_pred_train)\n", " scores.append(1-tmp_test_error)\n", " trainscores.append(1-tmp_train_error)\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.99842,\n", " 0.99802,\n", " 0.99896,\n", " 0.99888,\n", " 0.99914,\n", " 0.99916,\n", " 0.9994,\n", " 0.99932,\n", " 0.99938,\n", " 0.99918]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q8 prediction error as a function of the number of trees\n", "\n", "Here we plot the prediction error as a function of the number of trees. What conclusions do you draw from these results?" 
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.0, 1.0, 'score vs number of trees')" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY0AAAEICAYAAACj2qi6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3de5gdVZ3u8e9rAq1RQEw3TEJjYBQdW08I0Abv4SKYAMeEyBmCwoBiwDlkMjITR8hFnU4ngZERhyODQhsHjgOoOCaZeYIkw8WgXJtb5CIxIpBOonQTECWeYJLf+aPWTiqdvlQnu/dOut/P89Sza69aVWut3f3s3161qmopIjAzMyviddWugJmZ7T0cNMzMrDAHDTMzK8xBw8zMCnPQMDOzwhw0zMysMAcN65Wk8yT9tIrl/7Wk30r6g6Th1aqHmTlo2B5O0j7A14CTI+JNEfFip+2HSQpJQ6tTQ7PBxUFjNynjz7GgXfhyPxh4PfBEBcs0s24Mii87SV+UtFbS7yU9LenElD5E0kxJv0rbHpJ0aNr2AUkPSvpdev1A7nh3SZon6WfARuDPJf2FpOWSNqQy/rKbupwpqbVT2sWSlqT1UyQ9meqzVtKMbo5znqSfSrpC0kuSfi1pQm77s5I+mnv/FUnfTeulX+eflrQm7f85Se+VtFLSy5K+sXOR+kb6PH5R+gzThgMkfVvS+lTnZklDcvX8maQrJb0IfKWLttRI+rqkdWn5ekp7B/B0yvaypDu6+ChW5Lb/QdL7uyozHe8KSc+nU13flPSGXB1Ok/Roavs9kkbntnX5/2M2KEXEgF6AdwJrgJHp/WHA29L6F4CfpzwCjgSGA28BXgLOAYYCZ6X3w9N+dwHPA+9O2w9IZXw6vT8K6AAauqjPMOD3wBG5tAeBKWl9PfDhtH4gcHQ37ToP+BMwFRgC/DWwDlDa/izw0Vz+rwDfzX0GAXyT7Ff8ycD/AxYBBwGHAC8A43JlbQYuBvYBzgR+B7wlbf8R8C3gjWn/B4ALO+37N+mzeUMXbWkC7kv71gH3AHM71XVoN5/DTtu7KhO4EliS/rb7Af8JLEj5j0rtPTZ9luemz6+GHv5/vHgZjEvVK9DvDYS3py+EjwL7dNr2NDCxi33OAR7olHYvcF5avwtoym07E7i7U/5vAV/upk7fBb6U1o8gCyLD0vvngQuB/Xtp13nA6tz7YenL88/S+2fpPWgcktv+InBm7v0Pgc/nytoWkFLaA+lzOhjYlA8GZEH2zty+z/fSll8Bp+Tefwx4tlNd+xo0ns+9F/Bq/sseeD/w67R+DSlIdfrfGNfT/48XL4NxGfCnpyJiNfB5si/NFyTdLGlk2nwo2RdWZyOB5zqlPUf2C7xkTW59FHBsOrXxsqSXgU8Bf9ZNtW4k+2IF+CSwKCI2pvefAE4BnpP0E0nv76F5vymt5PZ/Uw/5O/ttbv2PXbzPH2ttROSfbvkc2ec0iqz3sT7X9m+R9RpK8p9VVzp/3qVj7458mXVkQfWhXB1/nNIha8Pfd/r7HUrWu+jp/8ds0BnwQQMgIm6MiA+RfTkEcHnatAZ4Wxe7rEt5894KrM0fNre+BvhJRLw5t7wpIv66myotB+okjSELHjfm6vpgREwk+9JdBHy/UCN39irZF2VJdwGsqEMkKff+rWSf0xqynkZtru37R8S7c3l7e5Ry58+7dOwiujt2Pr2DLAi+O1fHAyKiFB
TXAPM6/f2GRcRN0OP/j9mgM+CDhqR3SjpBUg3Zefs/AlvT5hZgrqQjlBmt7D6ApcA7JH1S0lBJZwINwH91U8x/pfznSNonLe+V9K6uMkfEn4AfAF8lO8e+PNV1X0mfknRAyvNKrq599SgwJdWlEThjF49TchAwPR3vfwHvApZGxHpgGfDPkvaX9DpJb5M0rg/HvgmYLalOUi3wJbJTeEW0k31Gf95dhojYClwHXCnpIABJh0j6WMpyHfA5Scem/4M3SjpV0n69/P+YDToDPmiQDWZeRvZr8zdkX36Xpm1fI/slv4zsC/rbZOfmXwROA/6e7Fz/PwCnRURHVwVExO/JBpOnkP1C/g3Zr9GaHup1I9l58h9ExOZc+jnAs5JeAT5HdpprV8wh60W9BPwjud7MLrqfbPylA5gHnBHb75n4K2Bf4MlU3i3AiD4cuxloBVaSXZjwcErrVTotNw/4WTq19L5usn4RWA3clz7b/yYb5CYiWskuKPhGqv9qsnER6Pn/x2zQKV1pY2Zm1qvB0NMwM7MycdAwM7PCHDTMzKwwBw0zMytswD/Irba2Ng477LBqV8PMbK/x0EMPdUREXVfbBnzQOOyww2htbe09o5mZASCp8xMxtvHpKTMzK8xBw8zMCnPQMDOzwgb8mIaZVVdHRweLFi1i/fr1jBgxgkmTJlFbW1vtavWrAd3maj+bvb+XY445Jsys8rZu3RrNzc1RU1MTZE8HDiBqamqiubk5tm7dWu0qlt1AaTPQGt18pxb64gUWkk1E83g32wVcRfagt5XkZpsjmwXtl2k5N5d+DNnD6VanfUvPwSo99fWX6fXA3sroaXHQMMu0t7fHddddF01NTXHddddFe3t7v5bX3Ny8wxdn56W5ublfy6+GgdLmcgSNjwBH9xA0TgFuTV/s7wPuj+0B4Jn0emBaLwWBB1JepX0npPR/Ai5J65cAl/dURm+Lg4YNdtX49dve3r5TeZ2Xmpqafg9clTSQ2txT0Cg0EB4RK4ANPWSZCNyQyrsPeLOkEWTTdi6PiA0R8RJZz2F82rZ/RNyXKngDMCl3rOvT+vWd0rsqw8x6MH/+fGbPns2mTZt2SN+0aROzZ89m/vz5ZS9z0aJFO5XX2aZNm1i8eHHZy87r6OigpaWFuXPn0tLSQkdHl7MblMWe0ub+Vq6rpw5hx+k121JaT+ltXaQDHBzZxD6QzV9wcC9l7ETSBZJaJbW2t7f3vTVmA0RHRwdz587tMc/cuXPL/mW6fv363jMB69YVnaCxbyKCefPmUV9fz9SpU/nSl77E1KlTqa+vZ968eaUzJGVV7TZXyh59yW3qhfT5rxsR10ZEY0Q01tV1eSe82aBQrV+/I0YUOwkwcmT/TLdejd5VtdtcKeUKGmuBQ3Pv61NaT+n1XaQD/LZ02im9vtBLGWbWjWr9+p00aRI1NT1NXAk1NTVMnDixrOVC9XpX1WxzJZUraCwB/irNr/w+4HfpFNNtwMmSDpR0INmUqLelba9Iep8kkU0Xujh3rHPT+rmd0rsqw8y6Ua1fv7W1tcyZM6fHPHPmzOmXexeq1buqZpsrqrsR8vwC3ASsB/5ENpZwPtn81Z+L7ZfDXg38iuwy2sbcvp8hu0x2NfDpXHoj8Hja5xtsv+R2OHA72SW3/w28pbcyelp89ZQNZtW8oqda9yw0NTX12N7S0tTUVPayB8N9GgN+jvDGxsbwU26ts2resVvpsufNm8fs2bO73d7c3MysWbP6rfyOjg4WL17MunXrGDlyJBMnTuzX9ra0tDB16tRC+c4///x+qUOl21xukh6KiMYuN3YXTQbK4p6G5VXzl2C1yh4ov36LGkj3S1QLu3tz3968OGhYXjXv2K323cLt7e3R0tISTU1N0dLSMqC/NKv9We/tegoaPj1lg0ZHRwf19fU9DpLW1NTQ1tZW9lMJ1Sx7MIoI5s+fz9y5c3f4zGtqapgzZw4zZ84kuwbHutLT6ak9+j4Ns3Kq5h27g+
Vu4T2FJGbNmkVbWxstLS00NTXR0tJCW1sbs2bNcsDYDX40ug0a1bxjd7DcLbynqa2t7bfB7sHKPQ0bNKp5x+5guVvYBj4HDRs0qnnH7mC5W9gGPgcNGzSqecfuoLlb2AY8Bw0bVGbOnElzc/NOv/prampobm5m5syZA7Jss3LxJbdWVdW6M7uad+zu7XcL28DX0yW3DhpWFb6O3mzP1VPQ8CW3VhWl+Q46K813APTr85DMbNe4p2EV57ujzfZs7mlYjyo9rtCXu6N9Y5bZnsVBYxDrblxh2rRp/Tqu4LujzfZeDhqDWLXGFXx3tNneq9CYhqTxwL8AQ4CWiLis0/ZRwEKgDtgAnB0RbWnb5cCpKevciPheSj8BuALYF3gIOD8iNkv6AvCplH8o8C6gLiI2SHoW+D2wBdjc3Tm3PI9pdM1PfDWz7uzWU24lDSGbZnUC0ACcJamhU7YrgBsiYjTQBCxI+54KHA2MAY4FZkjaX9LrgOuBKRHxHuA50rzgEfHViBgTEWOAS4GfRMSGXFnHp+29BgzrXjWfuuq7o832XkXuCB8LrI6IZyLiNeBmoPMDchqAO9L6nbntDcCKiNgcEa8CK4HxZPOAvxYRq1K+5cAnuij7LLL5ya3Mqj2u4LujzfZORcY0DgHW5N63kfUa8h4DJpOdwjod2E/S8JT+ZUn/DAwDjgeeBDqAoZIaI6IVOAM4NH9AScPIAsy0XHIAyyQF8K2IuLarCku6ALgA4K1vfWuBJg4+1R5XKM13cOGFF/ruaLO9SLkGwmcA35B0HrACWAtsiYhlkt4L3AO0A/em9JA0BbhSUg2wjGycIu9/Aj/rdGrqQxGxVtJBwHJJv4iIFZ0rk4LJtZCNaZSpjQPKpEmTmDZtWq/jCv391FXPd2C2dylyemotO/YC6lPaNhGxLiImR8RRwKyU9nJ6nZfGIE4CBKxK6fdGxIcjYixZoFnFjqbQ6dRURKxNry8APyI7dWa7wOMKZrYrigSNB4EjJB0uaV+yL/Ml+QySatPgNmSD1wtT+pB0mgpJo4HRZL0KUm+B1NP4IvDN3PEOAMYBi3Npb5S0X2kdOBl4vK8Ntu08rmBmfdXr6al0Gew04DayS24XRsQTkpqA1ohYAhwHLEhjDSuAi9Lu+wB3pxvEXiG7FHdz2vYFSaeRBa5rIuIOtjsdWJYGz0sOBn6UjjUUuDEifrwrjbaMxxXMrK/87CkzM9vBbt2nYWZmVuKgYWZmhTlomJlZYQ4aZmZWmIOGmZkV5qBhZmaFOWiYmVlhDhpmZlaYg4aZmRXmoGFmZoU5aJiZWWEOGmZmVli5JmEyM7Mq6+joYNGiRaxfv54RI0YwadKksj+x2kHDzGwvFxHMnz+fuXPn7jAb57Rp05gzZw4zZ84kTSux2xw0zMz2cvPnz2f27Nk7pW/atGlb+qxZs8pSlufTMDPbi3V0dFBfX79DD6Ozmpoa2traCp+q2u35NCSNl/S0pNWSLuli+yhJt0taKekuSfW5bZdLejwtZ+bST5D0cEq/XtLQlH6cpN9JejQtXypaDzOzwWbRokU9BgzIehyLFy/uMU9RvQYNSUOAq4EJQANwlqSGTtmuAG6IiNFAE7Ag7XsqcDQwBjgWmCFp/zSf+PXAlIh4D/AccG7ueHdHxJi0NPWhHmZmg8r69esL5Vu3bl1ZyivS0xgLrI6IZyLiNeBmYGKnPA1AaY7vO3PbG4AVEbE5zfe9EhgPDAdei4hVKd9y4BNlqIeZ2aAyYsSIQvlGjhxZlvKKBI1DgDW5920pLe8xYHJaPx3YT9LwlD5e0jBJtcDxwKFABzBUUumc2RkpveT9kh6TdKukd/ehHgBIukBSq6TW9vb2Ak00M9s7TZo0iZqamh7z1NTUMHFieX5jl+vmvhnAOEmPAOOAtcCWiFgGLAXuAW4C7k3pAUwBrpT0APB7YEs61sPAqIg4Evg/wKK+ViYiro2IxohorKur282mmZntuWpra5kzZ06Pee
bMmVO2+zWKBI217NgLqE9p20TEuoiYHBFHAbNS2svpdV4amzgJELAqpd8bER+OiLHAilz6KxHxh7S+FNgn9VJ6rYeZ2WA0c+ZMmpubd+px1NTU0NzczMyZM8tWVpH7NB4EjpB0ONmX9BTgk/kM6Ut9Q0RsBS4FFqb0IcCbI+JFSaOB0cCytO2giHhBUg3wRWBeSv8z4LcREZLGkgW2F4GXe6tHuVTirso9qVwz27tJYtasWVx44YUsXryYdevWMXLkSCZOnFj+75CI6HUBTiHrCfwKmJXSmoCPp/UzgF+mPC1ATUp/PfBkWu4DxuSO+VXgKeBp4PO59GnAE2TjIfcBH+ipHr0txxxzTBS1devWaG5ujpqamgC2LTU1NdHc3Bxbt24tfKy+qFa5ZmZdAVqju3jQ3YaBsvQlaDQ3N+/wpd15aW5uLnysvqhWuWZmXekpaPiO8KQ/7qosolrlmpl1Z7fvCB8MKn1XZbXLNTPbFQ4aSaXvqqx2uWZmu8JPuU0qfVdltcu1geGqq67i1ltv3Sl948aN9PXUsySGDRu2U/qECROYPn36LtfRBhb3NJJK31VZ7XLNzHaFexpJ6a7Krp5JX1LOuyqrXa4NDNOnTx90vQD3rqrLPY2cSt5VuSeUa2bWV77ktgsdHR39f1flHlSumVleT5fcOmiYmdkOfJ+GmZmVhYOGmZkV5qunzKzPfAXT4OWehpmZFeaBcDMz24EHws3MrCwcNMzMrLBCQUPSeElPS1ot6ZIuto+SdLuklZLuklSf23a5pMfTcmYu/QRJD6f06yUNTemfSsf5uaR7JB2Z2+fZlP6oJJ9zMjOrsF6DRprn+2pgAtAAnCWpoVO2K4AbImI02TSwC9K+pwJHA2OAY4EZkvaX9DrgemBKRLwHeA44Nx3r18C4iPgfwFzg2k5lHR8RY7o732ZmZv2nSE9jLLA6Ip6JiNeAm4HOj1xtAO5I63fmtjcAKyJic0S8CqwExgPDgdciYlXKtxz4BEBE3BMRL6X0+4BtvRYzM6uuIkHjEGBN7n1bSst7DJic1k8H9pM0PKWPlzRMUi1wPHAo0AEMlVTqLZyR0js7H8hfDB7AMkkPSbqguwpLukBSq6TW9vb2Ak00M7MiynVz3wzgG5LOA1YAa4EtEbFM0nuBe4B24N6UHpKmAFdKqgGWAVvyB5R0PFnQ+FAu+UMRsVbSQcBySb+IiBWdKxMR15JOazU2Ng7sa4rNzCqoSE9jLTv2AupT2jYRsS4iJkfEUcCslPZyep2XxiBOAgSsSun3RsSHI2IsWaApnapC0migBZgYES/mylmbXl8AfkR26szMzCqkSNB4EDhC0uGS9gWmAEvyGSTVpsFtgEuBhSl9SDpNVQoEo8l6FaTeAqmn8UXgm+n9W4H/AM7JjXkg6Y2S9iutAycDj+9Ko83MbNf0enoqIjZLmgbcBgwBFkbEE5KagNaIWAIcByyQFGS9hovS7vsAd0sCeAU4OyI2p21fkHQaWeC6JiJKA+lfIhso/9e03+Z0pdTBwI9S2lDgxoj48W613szM+sSPETEzsx309BgRP+XWzKwAP9k348eImJlZYT49ZbabuvsFCoPvV6gNDH7KrZmZlYV7GjZg+JyzWXm4p2FmZmXhnoaZme3APQ0zMysLBw0zMyvMQcPMzApz0DAzs8IcNMzMrDAHDTMzK8xBw8zMCnPQMDOzwhw0zMyssEJBQ9J4SU9LWi3pki62j5J0u6SVku6SVJ/bdrmkx9NyZi79BEkPp/TrJQ1N6ZJ0VSprpaSjc/ucK+mXaTl395puZmZ91WvQkDQEuBqYADQAZ0lq6JTtCuCGiBgNNAEL0r6nAkcDY4BjgRmS9k/ziV8PTImI9wDPAaUgMAE4Ii0XANekY70F+HI6zljgy5IO3MV2m5nZLijS0xgLrI6IZyLiNeBmYGKnPA1AaY7vO3PbG4AVEbE5Il4FVgLjyeYAfy0iVqV8y4FPpPWJZAEoIuI+4M2SRg
AfA5ZHxIaIeCntM76P7TUzs91QJGgcAqzJvW9LaXmPAZPT+unAfpKGp/TxkoZJqgWOBw4FOoChkkoPxDojpfdUXpF6ACDpAkmtklrb29sLNNHMzIoo10D4DGCcpEeAccBaYEtELAOWAvcANwH3pvQApgBXSnoA+D2wpUx1ISKujYjGiGisq6sr12HNzAa9IkFjLdt7AQD1KW2biFgXEZMj4ihgVkp7Ob3Oi4gxEXESIGBVSr83Ij4cEWOBFaX0HsrrtR5mZta/hhbI8yBwhKTDyb6kpwCfzGdIp542RMRW4FJgYUofArw5Il6UNBoYDSxL2w6KiBck1QBfBOalwy0Bpkm6mWzQ+3cRsV7SbcD83OD3yaks60I5Z7EDz2RnZpleg0ZEbJY0DbgNGAIsjIgnJDUBrRGxBDgOWCApyHoNF6Xd9wHulgTwCnB2RGxO274g6TSy3s41EVEaSF8KnAKsBjYCn0712CBpLlkQA2iKiA273nQzM+srz9xnZmY78Mx9ZmZWFg4aZmZWmIOGmZkV5qBhZmaFOWiYmVlhRe7TMOuTct4j4vtDzPYs7mmYmVlhvk/DzMx24Ps0zMysLBw0zMysMAcNMzMrzEHDzMwKc9AwM7PCHDTMzKwwBw0zMyvMQcPMzAorFDQkjZf0tKTVki7pYvsoSbdLWinpLkn1uW2XS3o8LWfm0k+U9LCkRyX9VNLbU/qVKe1RSaskvZzbZ0tu25Lda7qZmfVVr8+eSvN8Xw2cBLQBD0paEhFP5rJdAdwQEddLOgFYAJwj6VTgaGAMUAPcJenWiHgFuAaYGBFPSfrfwGzgvIi4OFf23wBH5cr5Y0SM2Z0Gm5nZrivS0xgLrI6IZyLiNeBmYGKnPA1AaY7vO3PbG4AVEbE5Il4FVgLj07YA9k/rBwDruij7LOCmIg0xM7P+VyRoHAKsyb1vS2l5jwGT0/rpwH6Shqf08ZKGSaoFjgcOTfk+CyyV1AacA1yWP6CkUcDhbA9GAK+X1CrpPkmTuquwpAtSvtb29vYCTTQzsyLKNRA+Axgn6RFgHLAW2BIRy4ClwD1kPYZ7gS1pn4uBUyKiHvgO8LVOx5wC3BIRW3Jpo9JDtD4JfF3S27qqTERcGxGNEdFYV1dXnhaamVmhoLGW7b0DgPqUtk1ErIuIyRFxFDArpb2cXudFxJiIOAkQsEpSHXBkRNyfDvE94AOdyp1Cp1NTEbE2vT4D3MWO4x1mZtbPigSNB4EjJB0uaV+yL/MdrlySVCupdKxLgYUpfUg6TYWk0cBoYBnwEnCApHekfU4Cnsod7y+AA8l6JqW0AyXVlMoDPgjkB+PNzKyf9Xr1VERsljQNuA0YAiyMiCckNQGtEbEEOA5YICmAFcBFafd9gLslAbwCnB0RmwEkTQV+KGkrWRD5TK7YKcDNseNkH+8CvpXyvw64rNMVXGZm1s88CZOZme3AkzCZmVlZOGiYmVlhDhpmZlaYg4aZmRXmoGFmZoU5aJiZWWEOGmZmVpiDhpmZFeagYWZmhTlomJlZYQ4aZmZWmIOGmZkV5qBhZmaFOWiYmVlhDhpmZlaYg4aZmRVWKGhIGi/paUmrJV3SxfZRkm6XtFLSXZLqc9sul/R4Ws7MpZ8o6WFJj0r6qaS3p/TzJLWn9EclfTa3z7mSfpmWc3ev6WZm1le9Bg1JQ4CrgQlAA3CWpIZO2a4AboiI0UATsCDteypwNDAGOBaYIWn/tM81wKciYgxwIzA7d7zvRcSYtLSkY70F+HI6zljgy5IO3IU2m5nZLirS0xgLrI6IZyLiNeBmYGKnPA3AHWn9ztz2BmBFRGyOiFeBlcD4tC2AUgA5AFjXSz0+BiyPiA0R8RKwPHcsMzOrgCJB4xBgTe59W0rLewyYnNZPB/aTNDylj5c0TFItcDxwaMr3WWCppDbgHOCy3PE+kU513SKplL9IPQCQdIGkVkmt7e3tBZpoZmZFlGsgfAYwTtIjwDhgLbAlIpYBS4F7gJuAe4
EtaZ+LgVMioh74DvC1lP6fwGHpVNdy4Pq+ViYiro2IxohorKur241mmZlZXpGgsZbtvQOA+pS2TUSsi4jJEXEUMCulvZxe56WxiZMAAask1QFHRsT96RDfAz6Q8r8YEZtSegtwTNF6mJlZ/yoSNB4EjpB0uKR9gSnAknwGSbWSSse6FFiY0oek01RIGg2MBpYBLwEHSHpH2uck4KmUb0Tu0B8vpQO3ASdLOjANgJ+c0szMrEKG9pYhIjZLmkb2BT0EWBgRT0hqAlojYglwHLBAUgArgIvS7vsAd0sCeAU4OyI2A0iaCvxQ0layIPKZtM90SR8HNgMbgPNSPTZImksWxACaImLD7jTezMz6RhFR7Tr0q8bGxmhtba12NczM9hqSHoqIxq62+Y5wMzMrrNfTUwPFVVddxa233rpT+saNG+lrb0sSw4YN2yl9woQJTJ8+fZfraGa2p3NPw8zMCvOYhpmZ7cBjGmZmVhYOGmZmVpiDhpmZFeagYWZmhTlomJlZYQ4aZmZWmIOGmZkV5qBhZmaFOWiYmVlhDhpmZlaYg4aZmRXmoGFmZoUVChqSxkt6WtJqSZd0sX2UpNslrZR0l6T63LbLJT2eljNz6SdKeljSo5J+KuntKf3vJD2ZjnW7pFG5fbak/I9KWoKZmVVUr0FD0hDgamAC0ACcJamhU7YrgBsiYjTQBCxI+54KHA2MAY4FZkjaP+1zDfCpiBgD3AjMTumPAI3pWLcA/5Qr548RMSYtH+9za83MbLcU6WmMBVZHxDMR8RpwMzCxU54G4I60fmduewOwIiI2R8SrwEpgfNoWQCmAHACsA4iIOyNiY0q/D9jWazEzs+oqEjQOAdbk3reltLzHgMlp/XRgP0nDU/p4ScMk1QLHA4emfJ8FlkpqA84BLuui7POB/HR7r5fUKuk+SZO6q7CkC1K+1vb29gJNNDOzIso1ED4DGCfpEWAcsBbYEhHLgKXAPcBNwL3AlrTPxcApEVEPfAf4Wv6Aks4GGoGv5pJHpYlBPgl8XdLbuqpMRFwbEY0R0VhXV1emJpqZWZGgsZbtvQPIThetzWeIiHURMTkijgJmpbSX0+u8NAZxEiBglaQ64MiIuD8d4nvAB0rHk/TRdJyPR8SmXDlr0+szwF3AUX1oq5mZ7aYiQeNB4AhJh0vaF5gC7HDlkqRaSaVjXQosTOlD0mkqJI0GRgPLgJeAAyS9I+1zEvBUyncU8C2ygPFCrowDJdWUygM+CDzZ9yabmdmuGtpbhojYLGkacBswBFgYEU9IagJaI2IJcBywQFIAK4CL0u77AHdLAngFODsiNgNImgr8UNJWsiDymbTPV4E3AT9I+z2frpR6F/CtlP91wGUR4aBhZlZBik4LmB0AAAedSURBVIhq16FfNTY2Rmtra7WrYWa215D0UBo/3onvCDczs8IcNMzMrDAHDTMzK8xBw8zMCnPQMDOzwhw0zMysMAcNMzMrzEHDzMwKc9AwM7PCHDTMzKwwBw0zMyvMQcPMzApz0DAzs8IcNMzMrDAHDTMzK8xBw8zMCut15j4ASeOBfyGbua8lIi7rtH0U2RSvdcAGshn62tK2y4FTU9a5EfG9lH4i2Sx9rwP+AJwXEavTlK43AMcALwJnRsSzaZ9LgfOBLcD0iLhtF9tdMVdddRW33nrrTukbN26krxNgSWLYsGE7pU+YMIHp06fvch3NzIrqtachaQhwNTABaADOktTQKdsVwA0RMRpoAhakfU8FjgbGAMcCMyTtn/a5BvhURIwBbgRmp/TzgZci4u3AlcDl6VgNZPOTvxsYD/xrqpuZmVVIkZ7GWGB1RDwDIOlmYCKQn5+7Afi7tH4nsCiXviLNC75Z0kqyL/zvAwGUAsgBwLq0PhH4Slq/BfiGssnCJwI3R8Qm4NeSVqe63Vu4tVUwffp09wLMbMAoMqZxCLAm974tpeU9BkxO66cD+0kantLHSxomqRY4Hjg05fsssFRSG3AOUDrlta28FG
x+BwwvWA8AJF0gqVVSa3t7e4EmmplZEeUaCJ8BjJP0CDAOWAtsiYhlwFLgHuAmsl7BlrTPxcApEVEPfAf4WpnqQkRcGxGNEdFYV1dXrsOamQ16RYLGWrb3DgDqU9o2EbEuIiZHxFHArJT2cnqdFxFjIuIkQMAqSXXAkRFxfzrE94APdC5P0lCyU1cvFqmHmZn1ryJB40HgCEmHS9qXbDB6ST6DpFpJpWNdSnYlFZKGpNNUSBoNjAaWAS8BB0h6R9rnJOCptL4EODetnwHcEdllRkuAKZJqJB0OHAE80NcGm5nZrut1IDwiNkuaBtxGdsntwoh4QlIT0BoRS4DjgAWSAlgBXJR23we4OxvH5hWyS3E3A0iaCvxQ0layIPKZtM+3gf+bBro3kAUpUpnfJxuA3wxcFBGlU11mZlYB6uu9AnubxsbGaG1trXY1zMz2GpIeiojGrrb5jnAzMyvMQcPMzAob8KenJLUDz+3i7rVARxmr43L3vLLd5oFfbjXLrmabd8eoiOjyfoUBHzR2h6TW7s7rudyBUbbbPPDLrWbZ1Wxzf/HpKTMzK8xBw8zMCnPQ6Nm1LnfAl+02D/xyq1l2NdvcLzymYWZmhbmnYWZmhTlomJlZYQ4aXZC0UNILkh6vcLmHSrpT0pOSnpD0txUq9/WSHpD0WCr3HytRbq78IZIekfRfFS73WUk/l/SopIo9a0bSmyXdIukXkp6S9P4KlfvO1NbS8oqkz1eo7IvT/9bjkm6S9PoKlfu3qcwn+rutXX1vSHqLpOWSfpleD+zPOlSCg0bX/o1shsFK2wz8fUQ0AO8DLupiat3+sAk4ISKOJJuad7yk91Wg3JK/ZftTjivt+PTo/kpeS/8vwI8j4i+AI6lQ2yPi6dTWMcAxwEbgR/1drqRDgOlAY0S8h+zBp1MqUO57gKlkM3weCZwm6e39WOS/sfP3xiXA7RFxBHB7er9Xc9DoQkSsIHvCbqXLXR8RD6f135N9mXQ5O2GZy42I+EN6u09aKnKFhKR64FSgpRLlVZukA4CPkD3NmYh4rTT3TIWdCPwqInb1aQl9NRR4Q5ojZxjbp3fuT+8C7o+Ijenp2j9h+wyjZdfN98ZE4Pq0fj0wqb/KrxQHjT2UpMOAo4D7e85ZtvKGSHoUeAFYnpsgq799HfgHYGuFyssLYJmkhyRdUKEyDwfage+kU3Itkt5YobLzppDNptnvImItcAXwPLAe+F2a1bO/PQ58WNJwScOAU9hxIrdKODgi1qf13wAHV7j8snPQ2ANJehPwQ+DzEfFKJcqMiC3ptEU9MDZ17fuVpNOAFyLiof4uqxsfioijgQlkpwI/UoEyhwJHA9ekmS5fpcKnLNJkah8HflCh8g4k+8V9ODASeKOks/u73Ih4CricbOK3HwOPsn266YpLk8nt9fc4OGjsYSTtQxYw/j0i/qPS5adTJXdSmTGdDwIfl/QscDNwgqTvVqBcYNsvYCLiBbJz+2MrUGwb0Jbryd1CFkQqaQLwcET8tkLlfRT4dUS0R8SfgP9g+/TO/Soivh0Rx0TER8gme1tViXJzfitpBEB6faHC5Zedg8YeRNkUh98GnoqIr1Ww3DpJb07rbyCbfvcX/V1uRFwaEfURcRjZ6ZI7IqLff4ECSHqjpP1K68DJZKcz+lVE/AZYI+mdKelEstkoK+ksKnRqKnkeeJ+kYel//EQqNPgv6aD0+lay8YwbK1FuTn766nOBxRUuv+x6ne51MJJ0E9kUtrWS2oAvR8S3K1D0B4FzgJ+n8QWAmRGxtJ/LHQFcL2kI2Q+J70dERS9/rYKDgR+lqYiHAjdGxI8rVPbfAP+eThM9A3y6QuWWAuRJwIWVKjMi7pd0C/Aw2RWCj1C5x2v8UNJw4E9kU0T320UHXX1vAJcB35d0PtkUDX/ZX+VXih8jYmZmhfn0lJmZFeagYWZmhTlomJlZYQ4aZmZWmIOGmZkV5qBhZmaFOWiYmVlh/x/9pmriG8xjlAAAAA
BJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "scoresb=scores[0:10]\n", "scoresb.append(cvf) # append the big run we did at the beginning\n", "scoresb=pd.DataFrame(scoresb).transpose() # Convert to data frame for naming\n", "scoresb.columns=[str(x) for x in list(range(1,11))+[102]] # The number of trees that we used\n", "ax=sns.boxplot(data=scoresb) ## coloured boxplots: the test dataset performance\n", "ax=plt.scatter(list(range(0,11)),trainscores+[1-error],c='black',linewidths=4) \n", "## Black points: the training performance\n", "## Black lines: the test performance\n", "plt.title(\"score vs number of trees\", loc=\"left\")" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.99842,\n", " 0.99802,\n", " 0.99896,\n", " 0.99888,\n", " 0.99914,\n", " 0.99916,\n", " 0.9994,\n", " 0.99932,\n", " 0.99938,\n", " 0.99918]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notes on features importance\n", "\n", "### Activity 9 Feature importance\n", "\n", "Examine the following code and output. What can you conclude about the importance of features as calculated by Random Forests?" 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scores for X0, X1, X2: ['0.131', '0.108', '0.761']\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "size = 100\n", "np.random.seed(seed=15)\n", "# Three noisy copies of one shared signal; X2 has much larger noise (sd 5 vs 1)\n", "X_seed = np.random.normal(0, 1, size)\n", "X0 = X_seed + np.random.normal(0, 1, size)\n", "X1 = X_seed + np.random.normal(0, 1, size)\n", "X2 = X_seed + np.random.normal(0, 5, size)\n", "X = np.array([X0, X1, X2]).T\n", "# The response weights all three features equally\n", "Y = X0 + X1 + X2\n", "\n", "rf = RandomForestRegressor(n_estimators=100, max_features=2)\n", "rf.fit(X, Y);\n", "print(\"Scores for X0, X1, X2:\", [str(round(x, 3)) for x in rf.feature_importances_])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we plot the 10 largest feature importances. How would you interpret them?" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAABBgAAAG4CAYAAADvxVQgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzdebxddX3v/9c7CRAcQMBUkYCAIhp/jCZRcYbKUBRaCZO0FxxK1aL2569q7KAW7U9trYiUqiiTiAUcy1WugCAVh0rCbEQkYCpB7xURGUUI+dw/1ko4HA/JwXXOWTvZr+fjcR7Z67vW2ud9spKz1/7s75CqQpIkSZIkqYtpfQeQJEmSJEnrPgsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSps3EVGJLsm+T6JEuTLBxj/4uTXJFkRZIFo/Ztk+SCJNcl+WGSbScmuiRJkiRJGhRrLTAkmQ6cCOwHzAEOTzJn1GE/BY4CPjfGU3wG+OeqehYwH/hFl8CSJEmSJGnwzBjHMfOBpVV1E0CSs4ADgR+uOqCqlrX7Vo48sS1EzKiqC9vj7p6Y2JIkSZIkaZCMp8CwFXDziO3lwHPH+fzPAH6d5EvAdsA3gIVV9eAjnfDEJz6xtt1223E+vSRJkiRJmiqXX375L6tq1lj7xlNg6GIG8CJgN5phFGfTDKU4eeRBSY4GjgbYZpttWLx48STHkiRJkiRJj1aS/36kfeOZ5PEWYOsR27PbtvFYDlxVVTdV1QrgK8Duow+qqpOqam5VzZ01a8xCiCRJkiRJGmDjKTAsAnZIsl2SDYHDgHPH+fyLgCckWVU12JMRczdIkiRJkqT1w1oLDG3Pg2OA84HrgHOqakmSY5McAJBkXpLlwMHAJ5Msac99EPhr4KIk1wIBPjU5P4okSZIkSepLqqrvDA8zd+7ccg4GSZIkSdJIDzzwAMuXL+e+++7rO8pQmDlzJrNnz2aDDTZ4WHuSy6tq7ljnTPYkj5IkSZIkdbZ8+XIe//jHs+2225Kk7zjrtaritttuY/ny5Wy33XbjPm88czBIkiRJktSr++67jy222MLiwhRIwhZbbPGoe4tYYJAkSZIkrRMsLkyd3+fv2gKDJEmSJEnjsMcee0zp91u2bBmf+9znpvR7duEcDJIkSZKkdc62C782oc+37IP7r/WY7373uxP6PddkxYoVqwsMr371q6fs+3ZhDwZJkiRJksbhcY97HACXXHIJL3nJSzjwwAPZfvvtWbhwIWeeeSbz589np5124sYbbwTgqKOO4g1veANz587lGc94Bl/96leBZj6J17zmNey0007stttufPOb3wTgtNNO44ADDmDPPfdkr732YuHChVx66aXsuuuuHHfccSxbtowXvehF7L777uy+++6rCx6XXHIJL33pS1mwYAHPfOYzOeKII1i1YuSiRYvYY4892GWXXZg/fz533XUXDz74IG9/+9uZN28eO++8M5/85Ccn5O/HHgySJEmSJD1KV199Nddddx2bb74522+/Pa9//eu57LLLOP744znhhBP46Ec/CjTDHC677DJuvPFGXvayl7F06VJOPPFEknDttdfyox/9iL333psf//jHAFxxxRVcc801bL755lxyySV8+MMfXl2YuPfee7nwwguZOXMmN9xwA4cffjiLFy8G4Morr2TJkiU85SlP4QUveAHf+c53mD9/Poceeihnn3028+bN484772TjjTfm5JNPZtNNN2XRokX89re/5QUveAF77733o1oxYixDW2CY6O40fRlPNx5JkiRJ0sSaN28eW265JQBPe9rT2HvvvQHYaaedVvdIADjkkEOYNm0aO+ywA9tvvz0/+tGP+Pa3v82b3/xmAJ75zGfy1Kc+dXWB4eUvfzmbb775mN/zgQce4JhjjuGqq65i+vTpq88BmD9/PrNnzwZg1113ZdmyZWy66aZsueWWzJs
3D4BNNtkEgAsuuIBrrrmGL3zhCwDccccd3HDDDRYYJEmSJEmaahtttNHqx9OmTVu9PW3aNFasWLF63+jVGNa2OsNjH/vYR9x33HHH8aQnPYmrr76alStXMnPmzDHzTJ8+/WEZRqsqTjjhBPbZZ581Znm0nINBkiRJkqRJ8vnPf56VK1dy4403ctNNN7Hjjjvyohe9iDPPPBOAH//4x/z0pz9lxx13/J1zH//4x3PXXXet3r7jjjvYcsstmTZtGmeccQYPPvjgGr/3jjvuyM9//nMWLVoEwF133cWKFSvYZ599+PjHP84DDzywOsM999zT+We1B4MkSZIkSZNkm222Yf78+dx555184hOfYObMmbzpTW/ijW98IzvttBMzZszgtNNOe1gPhFV23nlnpk+fzi677MJRRx3Fm970Jg466CA+85nPsO+++66xtwPAhhtuyNlnn82b3/xmfvOb37DxxhvzjW98g9e//vUsW7aM3Xffnapi1qxZfOUrX+n8s2bVzJKDYu7cubVqkorJ5BwMkiRJkrTuuO6663jWs57Vd4xH5aijjuIVr3gFCxYs6DvK72Wsv/Mkl1fV3LGOd4iEJEmSJEnqzCESkiRJkiRNgtNOO63vCFPKHgySJEmSJKkzCwySJEmSpHXCoM0huD77ff6uLTBIkiRJkgbezJkzue222ywyTIGq4rbbbmPmzJmP6jznYJAkSZIkDbzZs2ezfPlybr311r6jDIWZM2cye/bsR3WOBQZJkiRJ0sDbYIMN2G677fqOoTVwiIQkSZIkSerMAoMkSZIkSerMAoMkSZIkSerMAoMkSZIkSerMAoMkSZIkSerMAoMkSZIkSepsXAWGJPsmuT7J0iQLx9j/4iRXJFmRZMEY+zdJsjzJv05EaEmSJEmSNFjWWmBIMh04EdgPmAMcnmTOqMN+ChwFfO4RnuZ9wLd+/5iSJEmSJGmQjacHw3xgaVXdVFX3A2cBB448oKqWVdU1wMrRJyd5DvAk4IIJyCtJkiRJkgbQeAoMWwE3j9he3ratVZJpwL8Af/3oo0mSJEmSpHXFZE/y+CbgvKpavqaDkhydZHGSxbfeeuskR5IkSZIkSRNtxjiOuQXYesT27LZtPJ4PvCjJm4DHARsmubuqHjZRZFWdBJwEMHfu3Brnc0uSJEmSpAExngLDImCHJNvRFBYOA149nievqiNWPU5yFDB3dHFBkiRJkiSt+9Y6RKKqVgDHAOcD1wHnVNWSJMcmOQAgybwky4GDgU8mWTKZoSVJkiRJ0mAZTw8Gquo84LxRbe8e8XgRzdCJNT3HacBpjzqhJEmSJEkaeJM9yaMkSZIkSRoCFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJn4yowJNk3yfVJliZZOMb+Fye5IsmKJAtGtO+a5HtJliS5JsmhExlekiRJkiQNhrUWGJJMB04E9gPmAIcnmTPqsJ8CRwGfG9V+L/A/qurZwL7AR5M8oWtoSZIkSZI0WGaM45j5wNKqugkgyVnAgcAPVx1QVcvafStHnlhVPx7x+GdJfgHMAn7dObkkSZIkSRoY4xkisRVw84jt5W3bo5JkPrAhcOOjPVeSJEmSJA22KZnkMcmWwBnAa6pq5Rj7j06yOMniW2+9dSoiSZIkSZKkCTSeAsMtwNYjtme3beOSZBPga8DfVtV/jXVMVZ1UVXOrau6sWbPG+9SSJEmSJGlAjKfAsAjYIcl2STYEDgPOHc+Tt8d/Gfh
MVX3h948pSZIkSZIG2VoLDFW1AjgGOB+4DjinqpYkOTbJAQBJ5iVZDhwMfDLJkvb0Q4AXA0cluar92nVSfhJJkiRJktSb8awiQVWdB5w3qu3dIx4vohk6Mfq8zwKf7ZhRkiRJkiQNuHEVGKTJtu3Cr/UdYUIs++D+fUeQJEmSpF5MySoSkiRJkiRp/WaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdeYylZJ+x/qybCi4dKgkSZI0VezBIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOrPAIEmSJEmSOhtXgSHJvkmuT7I0ycIx9r84yRVJViRZMGrfkUluaL+OnKjgkiRJkiRpcKy1wJBkOnAisB8wBzg8yZxRh/0UOAr43KhzNwfeAzwXmA+8J8lm3WNLkiRJkqRBMp4eDPOBpVV1U1XdD5wFHDjygKpaVlXXACtHnbsPcGFV/aqqbgcuBPadgNySJEmSJGmAjKfAsBVw84jt5W3beIzr3CRHJ1mcZPGtt946zqeWJEmSJEmDYiAmeayqk6pqblXNnTVrVt9xJEmSJEnSozSeAsMtwNYjtme3bePR5VxJkiRJkrSOGE+BYRGwQ5LtkmwIHAacO87nPx/YO8lm7eSOe7dtkiRJkiRpPbLWAkNVrQCOoSkMXAecU1VLkhyb5ACAJPOSLAcOBj6ZZEl77q+A99EUKRYBx7ZtkiRJkiRpPTJjPAdV1XnAeaPa3j3i8SKa4Q9jnXsKcEqHjJIkSZIkacANxCSPkiRJkiRp3WaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdWaBQZIkSZIkdTZjPAcl2Rc4HpgOfLqqPjhq/0bAZ4DnALcBh1bVsiQbAJ8Gdm+/12eq6gMTmF+ShsK2C7/Wd4QJseyD+/cdQZIkSZNkrT0YkkwHTgT2A+YAhyeZM+qw1wG3V9XTgeOAD7XtBwMbVdVONMWHv0iy7cRElyRJkiRJg2I8QyTmA0ur6qaquh84Czhw1DEHAqe3j78A7JUkQAGPTTID2Bi4H7hzQpJLkiRJkqSBMZ4Cw1bAzSO2l7dtYx5TVSuAO4AtaIoN9wA/B34KfLiqfjX6GyQ5OsniJItvvfXWR/1DSJIkSZKkfk32JI/zgQeBpwDbAf9fku1HH1RVJ1XV3KqaO2vWrEmOJEmSJEmSJtp4Cgy3AFuP2J7dto15TDscYlOayR5fDXy9qh6oql8A3wHmdg0tSZIkSZIGy3gKDIuAHZJsl2RD4DDg3FHHnAsc2T5eAFxcVUUzLGJPgCSPBZ4H/GgigkuSJEmSpMGx1mUqq2pFkmOA82mWqTylqpYkORZYXFXnAicDZyRZCvyKpggBzeoTpyZZAgQ4taqumYwfRJKkqbK+LBsKLh0qSZImzloLDABVdR5w3qi2d494fB/NkpSjz7t7rHZJkiRJkrR+mexJHiVJkiRJ0hCwwCBJkiRJkjob1xAJSZKkQba+zIvhnBiSpHWZPRgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZI
kSVJnFhgkSZIkSVJnFhgkSZIkSVJnFhgkSZIkSVJnM/oOIEmSpPXLtgu/1neECbHsg/v3HUGS1in2YJAkSZIkSZ1ZYJAkSZIkSZ1ZYJAkSZIkSZ1ZYJAkSZIkSZ1ZYJAkSZIkSZ25ioQkSZK0nnNlD0lTwR4MkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSps3EVGJLsm+T6JEuTLBxj/0ZJzm73fz/JtiP27Zzke0mWJLk2ycyJiy9JkiRJkgbBWgsMSaYDJwL7AXOAw5PMGXXY64Dbq+rpwHHAh9pzZwCfBd5QVc8GXgo8MGHpJUmSJEnSQBhPD4b5wNKquqmq7gfOAg4cdcyBwOnt4y8AeyUJsDdwTVVdDVBVt1XVgxMTXZIkSZIkDYrxFBi2Am4esb28bRvzmKpaAdwBbAE8A6gk5ye5Isk7ukeWJEmSJEmDZsYUPP8LgXnAvcBFSS6vqotGHpTkaOBogG222WaSI0mSJEmSpIk2nh4MtwBbj9ie3baNeUw778KmwG00vR2+VVW/rKp7gfOA3Ud/g6o6qarmVtXcWbNmPfqfQpIkSZIk9Wo8BYZFwA5JtkuyIXAYcO6oY84FjmwfLwAurqoCzgd2SvKYtvDwEuCHExNdkiRJkiQNirUOkaiqFUmOoSkWTAdOqaolSY4FFlfVucDJwBlJlgK/oilCUFW3J/kITZGigPOq6muT9LNIkiRJkqSejGsOhqo6j2Z4w8i2d494fB9w8COc+1mapSolSZIkSdJ6ajxDJCRJkiRJktZosleRkCRJkiSNsO3C9WPU+LIP7t93BA0YezBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOLDBIkiRJkqTOZvQdQJIkSZKkPm278Gt9R5gQyz64f6/f3x4MkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSps3EVGJLsm+T6JEuTLBxj/0ZJzm73fz/JtqP2b5Pk7iR/PTGxJUmSJEnSIFlrgSHJdOBEYD9gDnB4kjmjDnsdcHtVPR04DvjQqP0fAf5X97iSJEmSJGkQjacHw3xgaVXdVFX3A2cBB4465kDg9PbxF4C9kgQgyR8DPwGWTExkSZIkSZI0aMZTYNgKuHnE9vK2bcxjqmoFcAewRZLHAe8E/qF7VEmSJEmSNKgme5LH9wLHVdXdazooydFJFidZfOutt05yJEmSJEmSNNFmjOOYW4CtR2zPbtvGOmZ5khnApsBtwHOBBUn+CXgCsDLJfVX1ryNPrqqTgJMA5s6dW7/PDyJJkiRJkvozngLDImCHJNvRFBIOA1496phzgSOB7wELgIurqoAXrTogyXuBu0cXFyRJkiRJ0rpvrQWGqlqR5BjgfGA6cEpVLUlyLLC4qs4FTgbOSLIU+BVNEUKSJEmSJA2J8fRgoKrOA84b1fbuEY/vAw5ey3O89/fIJ0mSJEmS1gGTPcmjJEmSJEkaAhYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZxYYJEmSJElSZ+MqMCTZN8n1SZYmWTjG/o2SnN3u/36Sbdv2lye5PMm17Z97Tmx8SZIkSZI0CNZaYEgyHTgR2A+
YAxyeZM6ow14H3F5VTweOAz7Utv8SeGVV7QQcCZwxUcElSZIkSdLgGE8PhvnA0qq6qaruB84CDhx1zIHA6e3jLwB7JUlVXVlVP2vblwAbJ9loIoJLkiRJkqTBMZ4Cw1bAzSO2l7dtYx5TVSuAO4AtRh1zEHBFVf3294sqSZIkSZIG1Yyp+CZJnk0zbGLvR9h/NHA0wDbbbDMVkSRJkiRJ0gQaTw+GW4CtR2zPbtvGPCbJDGBT4LZ2ezbwZeB/VNWNY32DqjqpquZW1dxZs2Y9up9AkiRJkiT1bjwFhkXADkm2S7IhcBhw7qhjzqWZxBFgAXBxVVWSJwBfAxZW1XcmKrQkSZIkSRosay0wtHMqHAOcD1wHnFNVS5Icm+SA9rCTgS2SLAXeBqxayvIY4OnAu5Nc1X79wYT/FJIkSZIkqVfjmoOhqs4DzhvV9u4Rj+8DDh7jvPcD7++YUZIkSZIkDbjxDJGQJEmSJElaIwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSpMwsMkiRJkiSps3EVGJLsm+T6JEuTLBxj/0ZJzm73fz/JtiP2vattvz7JPhMXXZIkSZIkDYq1FhiSTAdOBPYD5gCHJ5kz6rDXAbdX1dOB44APtefOAQ4Dng3sC/xb+3ySJEmSJGk9Mp4eDPOBpVV1U1XdD5wFHDjqmAOB09vHXwD2SpK2/ayq+m1V/QRY2j6fJEmSJElaj4ynwLAVcPOI7eVt25jHVNUK4A5gi3GeK0mSJEmS1nEz+g4AkORo4Oh28+4k1/eZZwI9EfjlZH6DfGgyn3294/UYPF6TweL1GCyTfj3Aa/Io+X9ksHg9Bo/XZLB4PQbP+nJNnvpIO8ZTYLgF2HrE9uy2baxjlieZAWwK3DbOc6mqk4CTxpFlnZJkcVXN7TuHGl6PweM1GSxej8Hi9Rg8XpPB4vUYPF6TweL1GDzDcE3GM0RiEbBDku2SbEgzaeO5o445FziyfbwAuLiqqm0/rF1lYjtgB+CyiYkuSZIkSZIGxVp7MFTViiTHAOcD04FTqmpJkmOBxVV1LnAycEaSpcCvaIoQtMedA/wQWAH8ZVU9OEk/iyRJkiRJ6sm45mCoqvOA80a1vXvE4/uAgx/h3H8E/rFDxnXZejfsYx3n9Rg8XpPB4vUYLF6PweM1GSxej8HjNRksXo/Bs95fkzQjGSRJkiRJkn5/45mDQZIkSZIkaY0sMEiSJEmSpM4sMEiaEu1KMmttkyRJ0u/H+y31zQLDBEpy0XjaNLWSPDXJH7aPN07y+L4zDakvjtH2hSlPoYdJ8rQkG7WPX5rkLUme0HeuYZXkMUn+Psmn2u0dkryi71zDLskLk7ymfTzLm/V+JXlM3xn0kCTPSHJRkh+02zsn+bu+cw0x77cGSJLNx/jaoO9ck8kCwwRIMjPJ5sATk2w24h/PtsBW/aYbbkn+nOaX6ifbptnAV/pLNHySPDPJQcCmSV414usoYGbP8dTciDyY5Ok0MxtvDXyu30hD7VTgt8Dz2+1bgPf3F0dJ3gO8E3hX27QB8Nn+Eg2vJHsk+SHwo3Z7lyT/1nMswado/n88AFBV19AuWa+p4/3WwLoCuBX4MXBD+3hZkiuSPKfXZJNkXMtUaq3+Avgr4CnA5UDa9juBf+0rlAD4S2A+8H2AqrohyR/0G2no7Ai8AngC8MoR7XcBf95LIo20sqpWJPkT4ISqOiHJlX2HGmJPq6pDkxwOUFX3JsnaTtKk+hNgN5qbRKrqZ/aE681
xwD7AuQBVdXWSF/cbScBjquqyUb+qVvQVZoh5vzWYLgS+UFXnAyTZGziI5gOFfwOe22O2SWGBYQJU1fHA8UneXFUn9J1HD/Pbqrp/1YtekhmAa7NOoar6D+A/kjy/qr7Xdx79jgfaN7NH8tANyXrddW/A3Z9kY9rfU0meRtOjQf25v6oqyapr8ti+Aw2zqrp51BvZB/vKotV+2f6uWvV/ZAHw834jDR/vtwbW86pqdYGnqi5I8uGq+otVQ1TXNxYYJlD7yd8ewLaM+Lutqs/0Fkr/meRvgI2TvBx4E/A/e840rJa212JbHv7/47W9JRLAa4A3AP9YVT9px5af0XOmYfZe4OvA1knOBF5Ac43Un3OSfBJ4Qjvs7rXAp3vONKxubu+zqh3D/Fbgup4zqektehLwzCS3AD8Bjug30lDzfmuw/DzJO4Gz2u1Dgf+TZDqwsr9YkydVfpg7UZKcATwNuIqHKupVVW/pL9VwSzINeB2wN83QlfOBT5f/8Kdcku8Cl9IMI1r9iVNVjTUZkaZQ+4n5NlV1fd9ZBEm2AJ5H8zvrv6rqlz1HGnptgXr160hVXdhzpKGU5InA8cAf0lyLC4C3VNWveg0mYHXvnmlVdVffWYaZ91uDpf299R7ghW3Td4B/AO6gufda2le2yWKBYQIluQ6Y45vXwdG+2N1XVQ+229OBjarq3n6TDZ8kV1XVrn3n0MMleSXwYWDDqtouya7AsVV1QM/RhlKSi6pqr7W1aeok+VBVvXNtbZp8SV5QVd9ZW5umVlsUXfUGqoBv07yO3NZrsCHl/Zb65ioSE+sHwJP7DqGHuQjYeMT2xsA3esoy7L6a5I/6DqHf8V6aiVB/DVBVVwHb9xloGLka0UB7+Rht+015CgGMNc+Vc1/17yyamfEPAha0j8/uNdFw835rgLRLG/9zkvOSXLzqq+9ck8k5GCbWE4EfJrmMEZNy+Ulgr2ZW1d2rNqrqbtfP7s1bgb9J8luapaxCM4Rok35jDb0HquqOUZOmrZdjAgecqxENmCRvpJm3Z/sk14zY9XiaLq6aIkmeD+wBzErythG7NgGm95NKI2xZVe8bsf3+JIf2lkbebw2WM2kKbq+gmfPqSJoi3DM/6eoAAB2nSURBVHrLAsPEem/fAfQ77kmye1VdAdCuN/ubnjMNpapyWbfBtCTJq4HpSXYA3gJ8t+dMQ8fViAbS54D/BXwAWDii/S7H/E+5DYHH0dy3jnwtuZPmE3P164IkhwHntNsLaOa8Ug+83xo4W1TVyUneWlX/STMB/aK+Q00m52DQei3JPJquez+jqeA+GTi0qi7vNdgQeqS1yqvqW1OdRQ9pe/T8Lc0EdtDcFL6vqlwasSdJ/h9gDjBzVZurEfUvyR/w8Gvy0x7jDKUkT62q/+47hx4uyV3AY3loQsHpwD3tYz85n2Lebw2WJP9VVc9Lcj7wMZr3JF+oqqf1HG3SWGCYQO0v2FV/oRvSrCV/j79Y+9UuZbVju3l9VT3QZ55hlWTk8qAzacb9X15Ve/YUSUCSg6vq82tr09RI8h7gpTQFhvNoxvp/u6r8lLYn7USoH6EZvvIL4KnAdVX17F6DDaEks4B3AM/m4cUeX0eklvdbgyXJK2hW9diaZs6YTYB/qKpzew02iSwwTJI0A5oPBJ5XVQvXdrwmVpI9q+riJK8aa39VfWmqM+nhkmwNfLSqDuo7yzBLckVV7b62Nk2NJNcCuwBXVtUuSZ4EfLaqxppoUFMgydXAnsA3qmq3JC8D/rSqXtdztKGT5AKascx/zYixzK7o0a8kXwROBr5eVc7hM2C839JUcw6GSdIuVfmV9tMoCwxT7yXAxcArx9hXgAWG/i0HntV3iGGVZD/gj4CtknxsxK5NgBX9pBLwm6pamWRFkk1oPjHfuu9QQ+6BqrotybQk06rqm0k+2neoITV0Y5nXER8HXgOckOTzwKlVdX3PmfQQ77d6MOre6ndU1VumKstUs8AwgUZ9Wj4NmAv
c11OcoVZV72kfvr6qHlzjwZoSSU7goSFE04BdgSv6SzT0fgYsBg6gWbVglbuA/7eXRAJYnOQJwKdorsvdwPf6jTT0fp3kccC3gDOT/IKHxpdraq0a4vjzJPvT/B7bvMc8AqrqG8A3kmwKHN4+vpnm99hnHZo6tbzfGhivopnjajPg9p6zTCmHSEygJKeO2FwBLAM+VVW/6CeRkvwU+DpNl8qLy3/wvUly5IjNFcCyqnKpt54lOQD4qt1a+9cOrZtdVTe329sCm1TVNWs6T5MryWNpVh+aBhwBbAqcWVW39RpsCA3jWOZ1RZItgD8F/oym8HMm8EJgp6p6aY/Rho73W4MhyQ+BP6RZjeilPLT8NADr82pEFhi0XmtnyH8FcBiwO/BV4Kyq+navwYZUkg2BZ7SbTrg5AJJ8Fng+8EXglKr6Uc+RhlqSa6tqp75zqJFkOs3cCy/rO8uwa6/FW6rquL6z6OGSfJlmMu0zgNOq6ucj9i2uqrm9hRtS3m/1L8lbgDcC2wO3jNxFM5p++16CTQELDBMoyWyaivoL2qZLgbdW1fL+UmmVJJsBxwNHVNX0vvMMmyQvBU6n6dkTmk+gjnTZpP61Y/0PpxlDW8CpwL9X1V29BhtCSU4H/rWqHFc+IJJcBLyqqu7oO8uwS3JZVc3vO4ceLskfVdV5o9o2crnjfni/NViSfLyq3th3jqlkgWECJbkQ+BxNBRearmJHOPt3v5K8BDgU2JdmzPnZVfXFflMNnySXA69eNfFTkmfQvIl9Tr/JBKu7t/4Z8FfAdcDTgY9V1Qm9BhsySX5E83f/3zTj/Fd90rFzr8GGWJL/AHYDLmTE3Avr8wRdgyrJcTRLgJ/Nw6+F48t75GpEg8X7LfXNSR4n1qyqGjkPw2lJ/qq3NCLJMuBK4Bzg7VXlxFz92WDkrNJV9eMkG/QZSKvnYHgNzZvazwDzq+oX7fCiH9L0ytLU2WdNO5NsVlVDNVnUAPgSrjw0KHZt/zx2RFvRLCOqKZbkycBWwMZJduOhMeabAI/pLZi831KvLDBMrNuS/Cnw7+324YCTQPWkHa95SlUdu9aDNRUWJ/k08Nl2+wiaHiXq10HAcaO7TlbVvUle11OmoVVV/72WQy6imU9GU6SqTl/T/iRfdH35qbG2uTCSHLm266UJtQ9wFDAb+BceKjDcCfxNT5nk/ZZ65hCJCZTkqTSf9j2fpqL+XeDNq2YE19RzvObgSLIR8Jc0s0pDM0fJvzlGs3/tp1DzaX5vLaqq/91zJD2CJFdW1W5959BDvCaDw275/Uhy0JqGnlr4mVreb6lvFhgmUDs511+t6r6aZHPgw1X12n6TDS/Haw6Odqm3+6rqwXZ7OrBRVd3bb7Lh1vZSeA9wMc2nTy8Bjq2qU3oNpjH5BmrweE0Gh8WeweT/kanl/Zb65hCJibXzyLGxVfWrdkya+uN4zcFxEc16wHe32xsDFwB79JZIAO8Adquq22D1ZI/fBSwwSFrX+KnZYMraD9EE8n5LvbLAMLGmjZyAq+3B4N9xj1y7fKDMrKpVL3ZU1d3tRILq123AyOUo78K5YwaZN+qDx2syOLwWg8nCz9Tyfku9mtZ3gPXMvwDfS/K+JO+j+RTwn3rONNSSPCnJyUn+V7s9x4nrenNPktVdJJM8B/hNj3mGWpK3JXkbsBT4fpL3JnkP8F/Aj/tNN7ySnLGWtr2mMI6AJK9qxzQ/kndOWZgh13b1XpPvTEkQPVoWfqaW91vqlXMwTLAkc3io+/3FVfXDPvMMu7awcCrwt1W1S5IZwJVVtVPP0YZOknnAWcDPaG42ngwcWlWX9xpsSLXFhEdUVf8wVVn0kNFjlds3VNdW1ZweYw21JKfSvK5/i2Y+n69X1Yp+Uw2nJD8Fvk5zHS4ub2LXCUn+taqO6TvHsPB+S32zwKD1WpJFVTVv5MRPSa6qql3Xdq4mXrsO847t5vVV9cCIfS+vqgv7SaZHkuSEqnpz3znWd0neRbOs28b
Aqom4AtwPnFRV7+orm1b/7toPOJRmZvYLq+r1/aYaPm0371cAh9Es1/pV4Kyq+navwYZU2wvuEVXVR6Yqix7O+y31yfkBtL67p520rgCSPA+4o99Iw6t9gfvBI+z+EOAL3uB5Qd8BhkFVfSDJh4BPu/LQ4KmqB9oecUVTBPpjwALDFGtnwT8HOCfJZsDxwH8Caxs6ocnx+L4DaGzeb6lPFhi0vnsbcC7wtCTfAWYBC/qNpEfgGE0Ntapa2XZt1QBJsqrnwkuBS4BPA4f0GGmoJXkJzfXYF1iM16I3DqVbZ3m/pUllgUHru6fRdGvdGjgIeC7+ux9UjteS4Iok86pqUd9BtNqf0Xxq/hdV9du+wwyzJMuAK2mux9ur6p5+EwkgyWzgBB7q8XYp8NaqWt5fKq2B91uaVK4iofXd31fVncBmwMuAfwM+3m8kaZ3iJx1T67k0qxHdmOSaJNcmuabvUMOqnWRzy6r6isWFfrXX4pSq+pOq+neLCwPlVJreok9pv/5n2yZpCFlg0PruwfbP/YFPVdXXgA17zKNHtqzvABrT8X0HGDL70PS82hN4Jc2Edq/sNdEQq6oHgZVJNu07y7Brr8Ur+s6hMc2qqlOrakX7dRrNkFQNpmV9B9D6zVUktF5L8lXgFuDlNDNO/wa4rKp26TXYEGo/hT0LOLuqbuw7jxpJLgQOrqpft9ub0czKvk+/yYZXkl2AF7Wbl1bV1X3mGXZJ/gPYjWZStNWfmlfVW3oLNaSSHAdsQLNM5chrcUVvoUSSi2h6LPx723Q48Jqq2qu/VMPL+y31zQKD1mvtklb70qwjf0OSLYGdquqCnqMNnSRPpZmY61BgJc0N4jlV9dNegw25kUu4rqlNUyPJW4E/B77UNv0JzTKVJ/SXarglOXKs9qo6faqzDLsk3xyjuapqzykPo9Xa1/cTgOfTjO//LvAWX9/74f2W+maBQdKUS7ID8PfAEVXl8mI9SnI58CerbjzaG5MvV9Xu/SYbTu0nT89fNb48yWOB71XVzv0mE6zu4bN1VTkvhqSB5/2W+uAcDJKmTJKnJnkHTde9ZwLv6DmS4G+Abyc5I8lngW8B7+o50zALD80dQ/vYiTZ7lOSSJJsk2Ry4AvhUko/0nWsYJXlrey2S5NNJrkiyd9+5hl2S05M8YcT2ZklO6TPTsPN+S31yuT5JUyLJ92nGzp5DM+b/pp4jDb0k04BNaeYneV7b/FdV9cv+Ug29U4HvJ/kyTWHhQODkfiMNvU2r6s4krwc+U1XvcWWP3ry2qo5Psg+wBc0SomcADnvs186r5vEBqKrbkzjMrifeb6lvFhgkTbr2jeyXqupDfWfRQ6pqZZJ3VNU5wFf7ziOoqo8kuQR4Ic1Y5tdU1ZX9php6M9r5ew4B/rbvMENuVW+eP6Ip9ixJYg+f/k1LsllV3Q7Q9vbxPUYPvN/SIHCIhKRJV1UrgYP7zqExfSPJXyfZOsnmq776DqXVb6R889S/Y4HzgaVVtSjJ9sANPWcaVpcnuYCmwHB+ksfTTGKnfv0L8L0k70vyfppJHv+p50xDyfstDQIneZQ0JZJ8EPglv7u82K96CyWS/GTE5uoXhKravoc4Qy/Ju2luDr9IU1z4Y+DzVfX+XoPpESV5V1V9oO8cw6D9dHZX4Kaq+nWSLYCtVk26meTZVbWk15BDKskcYE+a15FvVtUPe440tLzfUt8sMEiaEr6RHUxJDgG+3o4x/3ua+Rje57ry/UhyPbBLVd3Xbm8MXFVVO/abTI8kyRWuujIYvBb9SbIL8GKa1/dLq+rqniMNrVH3W6uU91uaKg6RkDRV3knzxmk7monsrgYW9BtJwN+1xYUX0nz69Gng4z1nGmY/A2aO2N4IuKWnLBofh7EMDq9FD5K8FTgTeCLwB8Bnk7y531TDq6q2G+PL4oKmjAUGSVPFN7KDadWSiPsDn6qqrwEb9phn2N0BLElyWpJTgR8Av07ysSQf6zmbxmZX0MHhtejH64D
nVtV7qurdNKsS/XnPmYZWkoPb+UlI8ndJvuSqHppKzvAqaar8zhvZdjIo9euWJJ8EXg58KMlGWHzu05fbr1Uu6SmHxs9PzTXswkOv8bSP/X/Rn7+vqs+3H+j8IfDPwCeA5/YbS8PCAoOkqeIb2cF0CLAv8OF20rQtgbf3nGloVdXpa9qf5ItVddBU5RlmST5UVe9McnBVfX4Nh65pn6bW/X0HGFKnAt9Psqo4+sfAKT3mGXYjP9A5yQ90NNWc5FHSlEjyGJo3stdW1Q3tG9mdquqCnqNJ64wkV1aVXV2nQJJrgZ2By504cDAkCXAEsH1VHZtkG+DJVXVZz9GGXpLdgRe2m5dW1ZV95hlmSb5KM3fPy2kmbv4NcFlV7dJrMA0NCwySJK0jnCV/6iT5Z5px5I8D7h25i2ZG9k16CTbEknwcWAnsWVXPSrIZcEFVzes52lBLckZV/dna2jQ11vaBTpLNqur2XkNqvWb3ZEmSpN/1d1X1BOBrVbXJiK/HW1zozXOr6i+B+wDaN0lOStu/Z4/cSDIdeE5PWYZeVd1bVV+qqhva7Z+P6i16UU/RNCQsMEiStO5w4rSp8732zzt7TaGRHmjfvBZAklk0PRrUgyTvSnIXsHOSO5Pc1W7/AviPnuPpkfk6oknlJI+SJA2IJI8FflNVK9vtacDMqlrVRf+dvYUbPhsmeTWwR5JXjd5ZVV/qIdOw+xjNKit/kOQfgQXA3/UbaXhV1QeADyT5QFW9q+88GjfHx2tSOQeDJEkDIsl/AX9YVXe324+jGWO+R7/Jhk+7xNsRNCutnDtqd1XVa6c+lZI8E9iL5lPYi6rqup4jCUhyAPDidvOSqvpqn3n0yJzLR5PNHgySJA2OmauKCwBVdXc7YZemWFV9G/h2ksVVdXLfebTa/wEupbmH3TjJ7lV1Rc+ZhlqSDwDzgTPbprcm2aOq/qbHWHpkDpHQpLLAIEnS4Lhn5BumJM+hWWJMUyzJnlV1MXC7QyQGQ5L3AUcBN/JQN+8C9uwrkwDYH9h1xNCu04ErAQsMPRjHqh579RBLQ8QCgyRJg+OtwOeT/IzmU6YnA4f2G2lovRi4GHglzZvYjPrTAsPUOwR4WlXd33cQ/Y4nAL9qH2/aZxCteVWPqvrV75whTSALDJIkDYD2JvBFwDOBHdvm66vqgf5SDbW7krwN+AEPFRbACdL69AOaN7K/6DuIHub/B65M8k2a/ycvBhb2G2n4JHkXTa+RjZOsWv0mwP3ASb0F09BxkkdJkgZEksuqan7fOQRJ3tM+3BGYR7PsXmh6NFxWVX/aV7ZhlWQuzXX4AfDbVe1VdUBvoYZcu9LNApp5Mea1zZdV1f/uL9Vwc1UP9c0CgyRJAyLJccAGwNnAPavancSuP0m+BexfVXe1248HvlZVL17zmZpoSZYAnwSuBVauaq+q/+wtlGgnQp3bdw41krwAuKqq7knyp8DuwPFV9d89R9OQsMAgSdKAaLsYj1ZV5SR2PUlyPbBzVf223d4IuKaqdlzzmZpoSRZV1by1H6mplOSDwC/53cKoY/17kOQaYBdgZ+A04NPAIVX1kj5zaXhYYJAkSXoESf6WZnLBL7dNfwycXVUf6C/VcEryEZqhEefy8CES9vDpUZKfMMbcJFW1fQ9xhl6SK6pq9yTvBm6pqpNXtfWdTcPBAoMkSQMiyVuBU4G7gE/RdG1dWFUX9BpsyCXZnWYCToBvVdWVfeYZVvbwGUxJNgbeBLyQptBwKfCJqnKJ3R4k+U/g68BraX5v/QK4uqp26jWYhoYFBkmSBkSSq6tqlyT7AG8A/g44w0+eJA2qJOcAdwJntk2vBjatqkP6SzW8kjyZ5hosqqpLk2wDvLSqPtNzNA0JCwySJA2IJNdU1c5JjgcuqaovJ7myqnbrO5s0CJLsDzwbmLmqraqO7S+RkvywquasrU1TJ8mTePiqHi7tqikzre8AkiRptcuTXAD8EXB+u2LByrWcIw2FJJ8ADgXeTLN
k6MHAU3sNJYArkjxv1UaS5wKLe8wz1JIcAlxG8//jEOD7SRb0m0rDxB4MkiQNiHZN+V2Bm6rq10m2ALaqqmva/c+uqv/b3t2FWlqWfQD/X6NjH45iB0Ma0VAd2AeNOSVRk29QGIFMialzYEF01EknhdBARF8whBUURfYWVAZ1EGWWHyVFMPIeNC+NmZWFBRkEFUzhHphS06uDtXZtRd1rmmbfz5r1+8HwrPt+ZuB/tphrXdd9/2JoSBhkQ4fP+nNHktu7+9JN/zGnTFXdm+TCJL+fbz0vya+T/COzMzJ2j8q2iqrq7iSXrXctVNXOJD/o7ovGJmNVnDk6AAAw092PJjmyYX00ydENf+WrmR38CKvo7/Pn8ap6TpK/JLlgYB5m3jQ6AI+x7XEjEUeja50tpMAAAMujRgeAgb5bVecluT6zQlxndtsKA3X3/aMz8Bjfq6rvJ/n6fL0/yW0D87BiFBgAYHmYa2SV/SrJI939zap6SWbdPN8enAkmpbuvq6q3Jtk73/rf7r5pZCZWizMYAGBJVNURV1ayqjacvfDaJB9J8vEkH+juVw2OBsCceRwAWB4PjQ4AAz0yf16e5AvdfWuSswbmgcmpqiur6r6qeqCq1qrqWFWtjc7F6tDBAAATUVWV5NokL+juD1fV85Kc392HB0eD4arqliR/SHJZZuMRf0ty2On48G9V9Zsk+7r73tFZWE0KDAAwEVX1uSSPJnl9d7+4qp6V5I7uvmRwNBiuqp6Z2Y0F93T3fVV1QZKXdfcdg6PBZFTV/3X33s3/JpwaCgwAMBHrZyxU1V3dffF8726/0ALwVKrqyvnH1yU5P7MDUB9cf9/d3xqRi9XjFgkAmI6Hq+qMzG+LqKqdmXU0AMBT2bfh8/Ekb9yw7iQKDGwJHQwAMBFVdW1md5bvSfKVJFcleX93f2NoMABOC1V1oLsPjs7B6UuBAQAmpKpelOQNSSrJDx3UBcB/i+uOOdWMSADAtPwpyZ2ZfUc/o6r2dPeRwZkAOD3U6ACc3hQYAGAiquojSd6R5LeZn8Mwf75+VCYATiva1zmlFBgAYDquSfLC7n5odBAATks6GDilto0OAAD8y8+TnDc6BADLqar2brLn0GBOKYc8AsBEVNUrk9ycWaFh4/3lbx4WCoCl8USHODrYka1kRAIApuMrST6W5J4kjw7OAsCSqKpXJ3lNkp1V9Z4Nr85NcsaYVKwiBQYAmI7j3f3p0SEAWDpnJdmR2f/vztmwv5bkqiGJWElGJABgIqrqk5mNRnwnjx2RcE0lAJuqql3dff/887YkO7p7bXAsVogCAwBMRFX96Am2u7tdUwnApqrqa0neleSRJP+f2YjEp7r7+qHBWBkKDAAAAKeBqvppd7+8qq5NsifJ+5L8pLt3D47GinAGAwBMSFVdnuSlSZ6+vtfdHx6XCIAlsr2qtie5IslnuvvhqhqdiRWybXQAAGCmqm5Isj/Ju5NUkquT7BoaCoBlckOS3yU5O8mhqtqV5IGhiVgpRiQAYCKq6mfdvXvDc0eS27v70tHZAJiux11Nud6y0Jn9oNzd/YmtT8UqMiIBANPxt/nzeFU9J8nRJBcMzAPAcli/mvLCJJckuTmzQsO+JIdHhWL1KDAAwHTcUlXnJbk+yZHMfn364thIAExdd38oSarqUJI93X1svv5gklsHRmPFGJEAgAmqqqcleXp3m50FYCFV9esku7v7wfn6aUl+1t0Xjk3GqnDIIwBMRFVdXVXrba7XJflSVV08MhMAS+XGJIer6oPz7oUfJ/ny0ESsFB0MADARGw53fG2Sj2Y2KvGB7n7V4GgALImq2pNk/XDgQ91918g8rBYFBgCYiKq6q7svrqqDSe7p7q+t743OBgCwGSMSADAdf6iqzyfZn+S2+eys72oAYCnoYACAiaiqZyZ5U2bdC/dV1QVJXtbdd8zfP6u7/zo0JADAk1BgAIAlUVVHunvP6BwAAE9E2yUALI8aHQAA4MkoMADA8tB
2CABMlgIDAAAAcNIUGABgeRiRAAAmS4EBACaiqr66yd4btjAOAMAJUWAAgOl46cZFVZ2R5BXr6+7+y5YnAgBYkAIDAAxWVQeq6liS3VW1Nv9zLMmfk9w8OB4AwEKq24HUADAFVXWwuw+MzgEA8J/QwQAA03FLVZ2dJFX1tqr6ZFXtGh0KAGARCgwAMB2fS3K8qi5K8t4kv01y49hIAACLUWAAgOn4R89mF9+S5DPd/dkk5wzOBACwkDNHBwAA/uVYVR1I8vYkl1bVtiTbB2cCAFiIDgYAmI79SR5M8s7u/mOS5ya5fmwkAIDFuEUCACakqp6d5JL58nB3/3lkHgCARelgAICJqKprkhxOcnWSa5L8uKquGpsKAGAxOhgAYCKq6u4kl613LVTVziQ/6O6LxiYDANicDgYAmI5tjxuJOBrf1QDAknCLBABMx/eq6vtJvj5f709y28A8AAALMyIBABNSVW9Nsne+vLO7bxqZBwBgUQoMAAAAwEkz1wkAE1FVV1bVfVX1QFWtVdWxqlobnQsAYBE6GABgIqrqN0n2dfe9o7MAAJwoHQwAMB1/UlwAAJaVDgYAGKyqrpx/fF2S85N8O8mD6++7+1sjcgEAnAgFBgAYrKq+9BSvu7vfuWVhAAD+QwoMALAkqupAdx8cnQMA4Ik4gwEAlsfVowMAADwZBQYAWB41OgAAwJNRYACA5WGuEQCYLAUGAFgeOhgAgMlSYACAiaiqvZvsfWML4wAAnBC3SADARFTVke7es9keAMAUnTk6AACsuqp6dZLXJNlZVe/Z8OrcJGeMSQUAcGIUGABgvLOS7Mjse/mcDftrSa4akggA4AQZkQCAiaiqXd19//zztiQ7unttcCwAgIU45BEApuNgVZ1bVWcn+XmSX1bVdaNDAQAsQoEBAKbjJfOOhSuS3J7k+UnePjYSAMBiFBgAYDq2V9X2zAoM3+nuh0cHAgBYlAIDAEzHDUl+l+TsJIeqaleSB4YmAgBYkEMeAWCwx11NWfNnZ/ZDQHf3J7Y+FQDAiXFNJQCMt3415YVJLklyc2aFhn1JDo8KBQBwInQwAMBEVNWhJJd397H5+pwkt3b3/4xNBgCwOWcwAMB0PDvJQxvWD833AAAmz4gEAEzHjUkOV9VN8/UVSb48Lg4AwOKMSADAhFTVniSXzpeHuvuukXkAABalwAAAAACcNGcwAAAAACdNgQEAAAA4aQoMAAAAwElTYAAAAABOmgIDAAAAcNL+CadTX3fNGg3tAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "feature_importances = pd.DataFrame(clf.feature_importances_,\n", " index = X_train.columns,\n", " columns=['importance']).sort_values('importance',\n", " ascending=False)\n", "\n", "feature_importances.nlargest(10,columns=['importance']).plot(kind='bar',figsize=(18, 5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 1 }