{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forests (Part 2; python models)\n", "\n", "```\n", "date: \"Block 06\"\n", "author: \"Daniel Lawson\"\n", "email: dan.lawson@bristol.ac.uk\n", "output: html_document\n", "version: 1.0.1\n", "```\n", "\n", "Here we get a random forest classifier running on the kddcup data. We start by importing data from R, after the standard boilerplate." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Activity 1: Read in the data that we saved from R. \n", "\n", "This requires telling python that the first column is the \"index column\" (like row names in R). We use the function pd.read_csv." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "r_train=pd.read_csv('https://raw.githubusercontent.com/dsbristol/dst/master/data/conndataC_train.csv',index_col=0) \n", "r_test=pd.read_csv('https://raw.githubusercontent.com/dsbristol/dst/master/data/conndataC_test.csv',index_col=0) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need the output of the Random Forest that was run in R (**block06-TreesAndForests_Part1.Rmd**).\n", "\n", "You should really save this locally, but for convenience I've added it to the github repo." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "r_rf_roc=pd.read_csv('https://raw.githubusercontent.com/dsbristol/dst/master/data/conndataC_RFroc.csv',index_col=0) # EDIT" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration orig_bytes resp_bytes orig_ip_bytes resp_ip_bytes http\n", "64203 0.173953 5.723585 6.198479 6.352629 6.486161 0\n", "208055 0.029559 5.894403 8.194229 6.599870 8.278936 0\n", "72988 0.058269 6.232448 9.003808 6.734592 9.053219 0\n", "222960 0.779325 6.848005 7.561122 7.239933 7.803843 1\n", "71198 0.019803 6.202536 9.005896 6.716595 9.050524 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the data into the format expected by random forest in python." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "r_train_features= np.array(r_train)[:,0:4]\n", "r_train_labels= np.array(r_train)[:,5].ravel() # this becomes a 'horizontal' array\n", "\n", "r_test_features= np.array(r_test)[:,0:4]\n", "r_test_labels= np.array(r_test)[:,5].ravel() " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., ..., 0., 1., 0.])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_train_labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q2 Run a Random Forest Classifier. \n", "\n", "How do you access the prediction probabilities?\n", "\n", "Look up how Python handles Random Forests. How does it differ to R?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "rdata_rf = RandomForestClassifier(n_estimators=100, max_features=3)\n", "rdata_rf.fit(r_train_features,r_train_labels);" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "?RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "rdata_rf_predictions=rdata_rf.predict_proba(r_test_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The below accesses the probability of class 1, but generalises to multi-class datasets." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0. , 1. , 1. , ..., 0.88, 1. , 0.96])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdata_rf_predictions[:,1] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Activity 3 make an ROC curve dataset using the function roc_curve.\n", "\n", "We'll extract an ROC curve and take a look at how Python represents it." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_curve\n", "rdata_rf_fpr, rdata_rf_tpr, _ = roc_curve(r_test_labels,rdata_rf_predictions[:,1]) " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([0. 
, 0.69397993, 0.73578595, 0.76254181, 0.79264214,\n", " 0.80434783, 0.81772575, 0.8277592 , 0.8361204 , 0.84280936,\n", " 0.84949833, 0.85284281, 0.86956522, 0.87792642, 0.88294314,\n", " 0.88461538, 0.89464883, 0.89799331, 0.90301003, 0.90301003,\n", " 0.90635452, 0.91137124, 0.91137124, 0.91304348, 0.91471572,\n", " 0.92140468, 0.92474916, 0.92809365, 0.93143813, 0.93979933,\n", " 0.94314381, 0.94481605, 0.94816054, 0.94983278, 0.95317726,\n", " 0.9548495 , 0.95819398, 0.95986622, 0.95986622, 0.96153846,\n", " 0.96488294, 0.96488294, 0.96822742, 0.96989967, 0.96989967,\n", " 0.97324415, 0.97491639, 0.97491639, 0.97491639, 0.97491639,\n", " 0.97491639, 0.97658863, 0.97658863, 0.97826087, 0.98160535,\n", " 0.98160535, 0.98494983, 0.98494983, 0.98662207, 0.98829431,\n", " 0.98829431, 0.98829431, 0.98829431, 0.98829431, 0.98829431,\n", " 0.98829431, 0.98829431, 0.98996656, 0.99331104, 0.99498328,\n", " 0.99498328, 0.99498328, 0.99498328, 0.99498328, 0.99498328,\n", " 0.99498328, 1. ])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdata_rf_tpr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will plot the actual ROC curve, showing the R implementation in red and the python implementation in blue. Which is better? Why?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#plt.clf()\n", "plt.semilogx(np.array(r_rf_roc)[:,0], np.array(r_rf_roc)[:,1],'r')\n", "plt.semilogx(rdata_rf_fpr,rdata_rf_tpr,'b')\n", "plt.xlim([0.00001,1])\n", "plt.ylim([0.6,1])\n", "plt.xlabel('FPR')\n", "plt.ylabel('TPR')\n", "plt.title('ROC curve')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The R Random Forest defaults are prodicing a true positive rate about 5% higher at very low false positive rate! \n", "\n", "## Mastery question: why is that? \n", "\n", "Can you demonstrate it?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Natively read and process the input data\n", "\n", "Here we go back to the KDD cup [10%](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz) data, with its [column names](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names). If you kept your code in `/code` and data in `/data` then this will work for you, otherwise download again or change the locations below.\n", "\n", "(We first met this data in `block01-EDA.Rmd`)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading in the header. Its a big clunky to make a list out of the strange format that the data are provided in. I've provided code for this for the future." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "header=pd.read_csv('../data/kddcup.names',sep=\"\\t\", header=None,skiprows=1).iloc[:,0].tolist()\n", "colnames=[str(x).split(':')[0] for x in header]+['normal']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/kddcup.data_10_percent.gz', sep=\",\", header=None, names=colnames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The usual data checking" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration protocol_type service flag src_bytes dst_bytes land \\\n", "0 0 tcp http SF 181 5450 0 \n", "1 0 tcp http SF 239 486 0 \n", "2 0 tcp http SF 235 1337 0 \n", "3 0 tcp http SF 219 1337 0 \n", "4 0 tcp http SF 217 2032 0 \n", "\n", " wrong_fragment urgent hot ... dst_host_srv_count \\\n", "0 0 0 0 ... 9 \n", "1 0 0 0 ... 19 \n", "2 0 0 0 ... 29 \n", "3 0 0 0 ... 39 \n", "4 0 0 0 ... 49 \n", "\n", " dst_host_same_srv_rate dst_host_diff_srv_rate \\\n", "0 1.0 0.0 \n", "1 1.0 0.0 \n", "2 1.0 0.0 \n", "3 1.0 0.0 \n", "4 1.0 0.0 \n", "\n", " dst_host_same_src_port_rate dst_host_srv_diff_host_rate \\\n", "0 0.11 0.0 \n", "1 0.05 0.0 \n", "2 0.03 0.0 \n", "3 0.03 0.0 \n", "4 0.02 0.0 \n", "\n", " dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " dst_host_srv_rerror_rate normal \n", "0 0.0 normal. \n", "1 0.0 normal. \n", "2 0.0 normal. \n", "3 0.0 normal. \n", "4 0.0 normal. \n", "\n", "[5 rows x 42 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Our data is a bit big for interactive exploration. The methods work fine but you have to wait too long. 
For this session, let's downsample.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(100000, 42)\n", "(100000, 42)\n" ] } ], "source": [ "print(df.shape)\n", "df=df.sample(100000)\n", "print(df.shape)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "duration int64\n", "protocol_type object\n", "service object\n", "flag object\n", "src_bytes int64\n", "dst_bytes int64\n", "land int64\n", "wrong_fragment int64\n", "urgent int64\n", "hot int64\n", "num_failed_logins int64\n", "logged_in int64\n", "num_compromised int64\n", "root_shell int64\n", "su_attempted int64\n", "num_root int64\n", "num_file_creations int64\n", "num_shells int64\n", "num_access_files int64\n", "num_outbound_cmds int64\n", "is_host_login int64\n", "is_guest_login int64\n", "count int64\n", "srv_count int64\n", "serror_rate float64\n", "srv_serror_rate float64\n", "rerror_rate float64\n", "srv_rerror_rate float64\n", "same_srv_rate float64\n", "diff_srv_rate float64\n", "srv_diff_host_rate float64\n", "dst_host_count int64\n", "dst_host_srv_count int64\n", "dst_host_same_srv_rate float64\n", "dst_host_diff_srv_rate float64\n", "dst_host_same_src_port_rate float64\n", "dst_host_srv_diff_host_rate float64\n", "dst_host_serror_rate float64\n", "dst_host_srv_serror_rate float64\n", "dst_host_rerror_rate float64\n", "dst_host_srv_rerror_rate float64\n", "normal object\n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes # Usual checking of data " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that several fields that we are interested in have dtype \"object\". The classifiers don't like this, so we are going to convert them to factors (analogous to factors in R). 
These are represented as integer numbers.\n", "\n", "Only the class (\"normal\") will stay in the string (object) format.\n", "\n", "**Note:** the way pandas handles factors has changed many times, and may change again in the future!" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "#df['protocol_type_cat'] = df['protocol_type'].astype('category') # Alternative: a pandas categorical dtype (not one-hot encoding). Only in later versions of pandas.\n", "df['protocol_type'], protocols= pd.factorize(df['protocol_type'])\n", "df['service'], services = pd.factorize(df['service'])\n", "df['flag'], flags = pd.factorize(df['flag'])\n", "# The keys to convert back are in protocols, services, flags" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['icmp', 'tcp', 'udp'], dtype='object')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "protocols" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "duration int64\n", "protocol_type int64\n", "service int64\n", "flag int64\n", "src_bytes int64\n", "dst_bytes int64\n", "land int64\n", "wrong_fragment int64\n", "urgent int64\n", "hot int64\n", "num_failed_logins int64\n", "logged_in int64\n", "num_compromised int64\n", "root_shell int64\n", "su_attempted int64\n", "num_root int64\n", "num_file_creations int64\n", "num_shells int64\n", "num_access_files int64\n", "num_outbound_cmds int64\n", "is_host_login int64\n", "is_guest_login int64\n", "count int64\n", "srv_count int64\n", "serror_rate float64\n", "srv_serror_rate float64\n", "rerror_rate float64\n", "srv_rerror_rate float64\n", "same_srv_rate float64\n", "diff_srv_rate float64\n", "srv_diff_host_rate float64\n", "dst_host_count int64\n", "dst_host_srv_count int64\n", "dst_host_same_srv_rate float64\n", "dst_host_diff_srv_rate float64\n", "dst_host_same_src_port_rate float64\n", "dst_host_srv_diff_host_rate 
float64\n", "dst_host_serror_rate float64\n", "dst_host_srv_serror_rate float64\n", "dst_host_rerror_rate float64\n", "dst_host_srv_rerror_rate float64\n", "normal object\n", "dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we extract the features and labels, in both pandas and numpy formats." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "features_pd= df.iloc[:,:df.shape[1]-1]\n", "labels_pd= df.iloc[:,df.shape[1]-1:]\n", "labels= labels_pd.values.ravel() # flatten the single label column into a 1-D array\n", "\n", "## Not needed here:\n", "## features= np.array(features_pd)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['smurf.', 'smurf.', 'neptune.', ..., 'smurf.', 'smurf.',\n", " 'ipsweep.'], dtype=object)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#labels_pd\n", "labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving the data\n", "\n", "There are many ways to save data. This is a simple though lazy one: a pickle file.\n", "\n", "In python this gives reproducible test/train data: the downsampling above was not seeded, so saving the sampled data fixes it, and the test/train split uses a fixed `random_state`. For R use, we'd have to save the X_train/y_train/X_test/y_test objects separately as csvs or similar.\n", "\n", "Pickles can store very many types of object, but those that use memory pointers don't work (we encounter these when working with large scale data)." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Save python objects into pickle files\n", "import pickle\n", "with open(\"06-features_pd.pickle\", \"wb\") as f:\n", "    pickle.dump(features_pd, f)\n", "with open(\"06-labels.pickle\", \"wb\") as f:\n", "    pickle.dump(labels, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test/train split\n", "\n", "Now separate the data into a training set and a test set, creating the training and testing variables.\n", "\n", "Note: if train_size + test_size < 1.0 we are subsampling. This is useful for making test code run faster.\n", "\n", "Use small numbers for slow classifiers, such as KNN, Radius, SVC, ..." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train, y_train: (50000, 41) (50000,)\n", "X_test, y_test: (50000, 41) (50000,)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " features_pd, labels, train_size=0.5, test_size=0.5,random_state=1)\n", "print (\"X_train, y_train:\", X_train.shape, y_train.shape)\n", "print (\"X_test, y_test:\", X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have everything we need to start to run the classifiers. Notice the wide range of tuning parameters that are available..." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "clf= RandomForestClassifier(n_jobs=-1, random_state=3, n_estimators=100)\n", "#, max_features=0.8, min_samples_leaf=3, n_estimators=500, min_samples_split=3, random_state=10, verbose=1)\n", "\n", "trained_model= clf.fit(X_train, y_train)\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Score: 1.0\n" ] } ], "source": [ "print( \"Score: \", trained_model.score(X_train, y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The score is 1! Is this a rounding error, or severe overfitting?" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['smurf.', 'smurf.', 'smurf.', ..., 'neptune.', 'smurf.', 'normal.'],\n", " dtype=object)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Predicting\n", "y_pred = clf.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we make a confusion matrix in sklearn and evaluate the loss function for categorical labels. We use a \"zero/one\" loss: a loss of 1 for the wrong class, 0 for the right one (so the score is one minus the mean loss). It's not the best choice of loss for all applications!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q4 Using confusion_matrix, make and display (as text and as an image) the confusion matrix." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix:\n", " [[ 233 0 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 0]\n", " [ 0 4 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 3 0 0 0 0 0 2 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 107 0 0 0 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 2 0 0 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 10812 0 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 21 0 0 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 1 0 0 0 9923 0 0 0\n", " 0 0 3 0]\n", " [ 0 0 0 0 0 0 0 0 0 25 0 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 1 0 0 0 120 0\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 2 0 0 151\n", " 0 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 0\n", " 28385 0 0 0]\n", " [ 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 98 0 0]\n", " [ 0 0 0 0 0 0 0 0 11 0 0 0\n", " 0 0 90 0]\n", " [ 0 0 0 0 0 0 0 0 1 0 0 0\n", " 0 0 0 2]]\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "results = confusion_matrix(y_test, y_pred) # EDIT to make a confusion matrix from TEST (rows) and PRED ( columns)\n", "print (\"Confusion matrix:\\n\", results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this isn't trivial to read, we'll also make a heatmap using seaborn. Make it with log10 of the results." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "sresults=[x/(1+x.sum()) for x in results]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "\n", "ax = sns.heatmap((sresults), linewidth=0.5) ## EDIT to get it to plot the log10 of the results\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q5 The score was 1 above! Is the error really zero? Check using the zero_one_loss function." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error: 0.00048000000000003595\n" ] } ], "source": [ "from sklearn.metrics import zero_one_loss\n", "error = zero_one_loss(y_test, y_pred) ## EDIT to get it to output the error.\n", "print (\"Error: \", error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll evaluate two different models, the decision tree and logistic regression." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0010200000000000209\n" ] } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "cld = DecisionTreeClassifier(criterion='gini', splitter='best', \n", " max_depth=None, min_samples_split=2, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features=None, \n", " random_state=None, max_leaf_nodes=None, class_weight=None)\n", "# other parameters: min_impurity_decrease=0.0, \n", "\n", "trained_model_d= cld.fit(X_train, y_train)\n", "y_pred_d = cld.predict(X_test)\n", "error_d = zero_one_loss(y_test,y_pred_d) \n", "print(error_d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now Logistic Regression. 
This is quite a bit slower, and with the default settings the solver stops before converging (see the warning in the output)." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.019240000000000035\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "cll = LogisticRegression()\n", "\n", "trained_model_l= cll.fit(X_train, y_train)\n", "y_pred_l = cll.predict(X_test)\n", "error_l = zero_one_loss(y_test,y_pred_l) \n", "print(error_l)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparison:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error Random Forest 0.00048000000000003595 Error Decision Tree 0.0010200000000000209 Error Logistic Regression 0.019240000000000035\n" ] } ], "source": [ "print(\"Error Random Forest\",error,\"Error Decision Tree\",error_d,\"Error Logistic Regression\",error_l)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Activity 6: Use the function cross_val_score to get a prediction error using only the training dataset. \n", "\n", "Why is this useful?" 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "?cross_val_score" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=10.\n", " % (min_groups, self.n_splits)), UserWarning)\n" ] } ], "source": [ "cvf= cross_val_score(clf,X_train,y_train,cv=10) " ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.9992, 0.9996, 0.9994, 0.9998, 0.9996, 0.9994, 0.9992, 0.9994,\n", " 0.999 , 0.9998])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cvf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Activity 7: relative performance\n", "\n", "Are the cross-validation errors higher or lower than the error in the test dataset? Check by iterating over cvf. Why?" 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.00031999999999998696,\n", " -8.000000000008001e-05,\n", " 0.00012000000000000899,\n", " -0.000280000000000058,\n", " -8.000000000008001e-05,\n", " 0.00012000000000000899,\n", " 0.00031999999999998696,\n", " 0.00012000000000000899,\n", " 0.0005199999999999649,\n", " -0.000280000000000058]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compare the error in the test dataset to that learned using CV, vs that learned using the trained data\n", "[(1-x)-error for x in cvf]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 3: Performance as a function of the number of trees\n", "\n", "Here we fit a random forest with a different number of trees, and gather the scores (in both the training and the test dataset)\n", "\n", "## Activity 8: How does the number of trees affect performance?\n", "\n", "Train a RandomForestClassifier with 1-10 trees. Store the scores on left out and training data." 
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "scores=[]\n", "trainscores=[]\n", "for ntrees in range(1,11):\n", " tmpcf=RandomForestClassifier(n_estimators=ntrees)\n", " tmp_trained_model= tmpcf.fit(X_train, y_train)\n", " tmp_y_pred = tmp_trained_model.predict(X_test)\n", " tmp_y_pred_train = tmp_trained_model.predict(X_train)\n", " tmp_test_error = zero_one_loss(y_test,tmp_y_pred)\n", " tmp_train_error = zero_one_loss(y_train,tmp_y_pred_train)\n", " scores.append(1-tmp_test_error)\n", " trainscores.append(1-tmp_train_error)\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.99842,\n", " 0.99802,\n", " 0.99896,\n", " 0.99888,\n", " 0.99914,\n", " 0.99916,\n", " 0.9994,\n", " 0.99932,\n", " 0.99938,\n", " 0.99918]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q8 prediction error as a function of the number of trees\n", "\n", "Here we plot the prediction error as a function of the number of trees. What conclusions do you draw from these results?" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.0, 1.0, 'score vs number of trees')" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "scoresb=scores[0:10]\n", "scoresb.append(cvf) # append the big run we did at the beginning\n", "scoresb=pd.DataFrame(scoresb).transpose() # Convert to data frame for naming\n", "scoresb.columns=[str(x) for x in list(range(1,11))+[102]] # The number of trees that we used\n", "ax=sns.boxplot(data=scoresb) ## coloured boxplots: the test dataset performance\n", "ax=plt.scatter(list(range(0,11)),trainscores+[1-error],c='black',linewidths=4) \n", "## Black points: the training performance\n", "## Black lines: the test performance\n", "plt.title(\"score vs number of trees\", loc=\"left\")" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.99842,\n", " 0.99802,\n", " 0.99896,\n", " 0.99888,\n", " 0.99914,\n", " 0.99916,\n", " 0.9994,\n", " 0.99932,\n", " 0.99938,\n", " 0.99918]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notes on features importance\n", "\n", "### Activity 9 Feature importance\n", "\n", "Examine the following code and output. What can you conclude about the importance of features as calculated by Random Forests?" 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scores for X0, X1, X2: ['0.131', '0.108', '0.761']\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "size = 100\n", "np.random.seed(seed=15)\n", "X_seed = np.random.normal(0, 1, size)\n", "X0 = X_seed + np.random.normal(0, 1, size)\n", "X1 = X_seed + np.random.normal(0, 1, size)\n", "X2 = X_seed + np.random.normal(0, 5, size)\n", "X = np.array([X0, X1, X2]).T\n", "Y = X0 + X1 + X2\n", "\n", "rf = RandomForestRegressor(n_estimators=100, max_features=2)\n", "rf.fit(X, Y);\n", "print (\"Scores for X0, X1, X2:\", list(map(lambda x:str(round (x,3)),\n", " rf.feature_importances_)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we plot the largest 10 feature importances. How would you interpret them?" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "feature_importances = pd.DataFrame(clf.feature_importances_,\n", " index = X_train.columns,\n", " columns=['importance']).sort_values('importance',\n", " ascending=False)\n", "\n", "feature_importances.nlargest(10,columns=['importance']).plot(kind='bar',figsize=(18, 5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 1 }