You should have already run the introductory 01.3-WSL-1-EDA.Rmd worksheet and become familiar with working with an RStudio RMarkdown document.
It's good practice to get the requirements right at the top. The following solution checks for each requirement and installs it if it is not present.
if(!require("gplots")) install.packages("gplots")
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
if(!require("readr")) install.packages("readr")
## Loading required package: readr
library("gplots")
library("readr")
Here we’re going to go a little further and explore a new dataset, the KDD99 dataset. Read about the competition task specification.
We use the 10% subset and the column-names file, which you can download directly. We use the opportunity of exploring external data to make some suggestions for your group assessment project structure.
Q: What information would be useful here? What would you like to see in an exemplar web resource?
Cyber security data is often very weird. These data were generated in a competition setting in which teams were hacking one another. The connection activity was then recorded on the internet connection, classified by what generated that traffic, and turned into “features”, i.e. a data frame. Because the computers are “doing something” all the time, there is “normal” traffic in here, but there is also a very large amount of “cyber attack” related traffic, which is unrepresentative of real data. The ability to classify everything your computer is doing, to obtain “true labels”, is also unusual.
The details are very involved but all we really care about is that there are some labels, and our task as data scientists is to see whether the classification task of identifying attacks is feasible.
DATA ORGANISATION: It is very helpful to keep the project structure that I use, so your code “just runs”. Therefore work in a directory (“workshops”) and keep your data in a directory called “data” in the PARENT DIRECTORY, e.g. both inside “dst”, so that your file structure looks like this:
dst
├── data
└── workshops
This maps better to how we will structure the Assessments, which are fussier again because of their group project nature.
OBTAINING DATA: It is essential that it is clear how to obtain the data used in an analysis. It is OK to have manual steps if they are clearly described, but automation is best.
To automate this, DIFFERENTLY AND BETTER THAN THE RECORDING, we will use internal R functions which work regardless of file system. We would like a cross-platform (Windows/Linux/Mac) solution, and this is provided by the file.path function in R, rather than specifying folder locations completely. For example, Windows would use ..\\data\\file, whereas Mac/Linux would be ../data/file. Run ?file.path to learn more.
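As a quick check, file.path simply joins its arguments with the file separator R expects, so the same call works on every platform:
file.path("..", "data", "kddcup.names")
## [1] "../data/kddcup.names"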
Applied to our data download, this could look like the following:
## Create ../data if needed, then fetch each file only if it is not already present
if(!dir.exists(file.path("..","data"))) dir.create(file.path("..","data"))
if(!file.exists(file.path("..","data","kddcup.data_10_percent.zip"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.data_10_percent.zip", destfile=file.path("..","data","kddcup.data_10_percent.zip"))
if(!file.exists(file.path("..","data","kddcup.names"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.names",destfile=file.path("..","data","kddcup.names"))
Aside: An even better implementation would define, or find, a function like safedircreate, along the lines of:
safedircreate<-function(...)
  if(!dir.exists(file.path(...))) dir.create(file.path(...))
This could be used as part of a safedownloadfile function, etc., so that we only ever specify the location once. The above function is used as safedircreate("..","data") in this context.
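A minimal sketch of such a companion helper, assuming we simply wrap download.file (the name safedownloadfile and its argument layout are illustrative, not the course's official helper):
safedownloadfile <- function(url, dir, filename) {
  safedircreate(dir)                # ensure the destination directory exists
  dest <- file.path(dir, filename)  # build the cross-platform destination path
  if (!file.exists(dest)) download.file(url, destfile = dest)
  invisible(dest)                   # return the path so callers can reuse it
}
# e.g. safedownloadfile("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.names",
#                       file.path("..","data"), "kddcup.names")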
Q: When should you implement this? Can you do it so that it is reusable for your assessments?
Code checking question: is this guaranteed to always work? What would make this completely robust? Does R provide that tool already?
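One direction for an answer (a sketch, assuming url and dest are set as in the download calls above): wrap the download in tryCatch so that a network failure is reported rather than aborting the knit.
ok <- tryCatch({
  download.file(url, destfile = dest)  # may fail if offline or the URL moves
  TRUE
}, error = function(e) {
  message("Download failed: ", conditionMessage(e))
  FALSE
})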
Think about how to generate data.
The data are read in as follows:
kddata<-as.data.frame(read_csv(file.path("..","data","kddcup.data_10_percent.zip"),col_names=FALSE)) ## Ignore the warnings - there is a bug with the header
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 494021 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): X2, X3, X4, X42
## dbl (38): X1, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X1...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kddnames=read.table(file.path("..","data","kddcup.names"),sep=":",skip=1,as.is=T)
colnames(kddata)=c(kddnames[,1],"normal") # here we fix the bug with the header
goodcat=names(which(table(kddata[,"normal"])>1)) # keep only labels that appear more than once
kddata=kddata[kddata[,"normal"]%in%goodcat,] # drops the single mis-parsed row flagged in the warning
Q: How important are these problems? What should we do about them? If we wanted to stop this warning, or all warnings, how would we do it? And when should we?
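One option, once we are satisfied the parsing issue is understood and harmless (a sketch using base R's suppressWarnings and readr's show_col_types argument):
kddata <- suppressWarnings(as.data.frame(
  read_csv(file.path("..","data","kddcup.data_10_percent.zip"),
           col_names=FALSE, show_col_types=FALSE)))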
Let's take a look:
head(kddata)
## duration protocol_type service flag src_bytes dst_bytes land wrong_fragment
## 1 0 tcp http SF 181 5450 0 0
## 2 0 tcp http SF 239 486 0 0
## 3 0 tcp http SF 235 1337 0 0
## 4 0 tcp http SF 219 1337 0 0
## 5 0 tcp http SF 217 2032 0 0
## 6 0 tcp http SF 217 2032 0 0
## urgent hot num_failed_logins logged_in num_compromised root_shell
## 1 0 0 0 1 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 1 0 0
## 6 0 0 0 1 0 0
## su_attempted num_root num_file_creations num_shells num_access_files
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## num_outbound_cmds is_host_login is_guest_login count srv_count serror_rate
## 1 0 0 0 8 8 0
## 2 0 0 0 8 8 0
## 3 0 0 0 8 8 0
## 4 0 0 0 6 6 0
## 5 0 0 0 6 6 0
## 6 0 0 0 6 6 0
## srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 0 0 0 1 0
## 4 0 0 0 1 0
## 5 0 0 0 1 0
## 6 0 0 0 1 0
## srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_rate
## 1 0 9 9 1
## 2 0 19 19 1
## 3 0 29 29 1
## 4 0 39 39 1
## 5 0 49 49 1
## 6 0 59 59 1
## dst_host_diff_srv_rate dst_host_same_src_port_rate
## 1 0 0.11
## 2 0 0.05
## 3 0 0.03
## 4 0 0.03
## 5 0 0.02
## 6 0 0.02
## dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## dst_host_rerror_rate dst_host_srv_rerror_rate normal
## 1 0 0 normal.
## 2 0 0 normal.
## 3 0 0 normal.
## 4 0 0 normal.
## 5 0 0 normal.
## 6 0 0 normal.
And get a summary of the data:
summary(kddata)
## duration protocol_type service flag
## Min. : 0.00 Length:494020 Length:494020 Length:494020
## 1st Qu.: 0.00 Class :character Class :character Class :character
## Median : 0.00 Mode :character Mode :character Mode :character
## Mean : 47.98
## 3rd Qu.: 0.00
## Max. :58329.00
## src_bytes dst_bytes land wrong_fragment
## Min. : 0 Min. : 0 Min. :0.00e+00 Min. :0.000000
## 1st Qu.: 45 1st Qu.: 0 1st Qu.:0.00e+00 1st Qu.:0.000000
## Median : 520 Median : 0 Median :0.00e+00 Median :0.000000
## Mean : 3026 Mean : 869 Mean :4.45e-05 Mean :0.006433
## 3rd Qu.: 1032 3rd Qu.: 0 3rd Qu.:0.00e+00 3rd Qu.:0.000000
## Max. :693375640 Max. :5155468 Max. :1.00e+00 Max. :3.000000
## urgent hot num_failed_logins logged_in
## Min. :0.00e+00 Min. : 0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.00e+00 1st Qu.: 0.00000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.00e+00 Median : 0.00000 Median :0.000000 Median :0.0000
## Mean :1.42e-05 Mean : 0.03452 Mean :0.000152 Mean :0.1482
## 3rd Qu.:0.00e+00 3rd Qu.: 0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :3.00e+00 Max. :30.00000 Max. :5.000000 Max. :1.0000
## num_compromised root_shell su_attempted num_root
## Min. : 0.0000 Min. :0.0000000 Min. :0.00e+00 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.:0.0000000 1st Qu.:0.00e+00 1st Qu.: 0.0000
## Median : 0.0000 Median :0.0000000 Median :0.00e+00 Median : 0.0000
## Mean : 0.0102 Mean :0.0001113 Mean :3.64e-05 Mean : 0.0114
## 3rd Qu.: 0.0000 3rd Qu.:0.0000000 3rd Qu.:0.00e+00 3rd Qu.: 0.0000
## Max. :884.0000 Max. :1.0000000 Max. :2.00e+00 Max. :993.0000
## num_file_creations num_shells num_access_files num_outbound_cmds
## Min. : 0.000000 Min. :0.0000000 Min. :0.000000 Min. :0
## 1st Qu.: 0.000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0
## Median : 0.000000 Median :0.0000000 Median :0.000000 Median :0
## Mean : 0.001083 Mean :0.0001093 Mean :0.001008 Mean :0
## 3rd Qu.: 0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0
## Max. :28.000000 Max. :2.0000000 Max. :8.000000 Max. :0
## is_host_login is_guest_login count srv_count
## Min. :0 Min. :0.000000 Min. : 0.0 Min. : 0.0
## 1st Qu.:0 1st Qu.:0.000000 1st Qu.:117.0 1st Qu.: 10.0
## Median :0 Median :0.000000 Median :510.0 Median :510.0
## Mean :0 Mean :0.001387 Mean :332.3 Mean :292.9
## 3rd Qu.:0 3rd Qu.:0.000000 3rd Qu.:511.0 3rd Qu.:511.0
## Max. :0 Max. :1.000000 Max. :511.0 Max. :511.0
## serror_rate srv_serror_rate rerror_rate srv_rerror_rate
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1767 Mean :0.1766 Mean :0.05743 Mean :0.05772
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. : 0.0
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:255.0
## Median :1.0000 Median :0.00000 Median :0.000 Median :255.0
## Mean :0.7915 Mean :0.02098 Mean :0.029 Mean :232.5
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:255.0
## Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :255.0
## dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate
## Min. : 0.0 Min. :0.0000 Min. :0.00000
## 1st Qu.: 46.0 1st Qu.:0.4100 1st Qu.:0.00000
## Median :255.0 Median :1.0000 Median :0.00000
## Mean :188.7 Mean :0.7538 Mean :0.03091
## 3rd Qu.:255.0 3rd Qu.:1.0000 3rd Qu.:0.04000
## Max. :255.0 Max. :1.0000 Max. :1.00000
## dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate
## Min. :0.0000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :1.0000 Median :0.000000 Median :0.0000
## Mean :0.6019 Mean :0.006684 Mean :0.1768
## 3rd Qu.:1.0000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000000 Max. :1.0000
## dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1764 Mean :0.05812 Mean :0.05741
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## normal
## Length:494020
## Class :character
## Mode :character
##
##
##
Q: What can you “understand” about the data from this? What is worrying about it?
We need to pay attention to variables that:
* are categorical (protocol_type, service, flag) and so need different treatment from the numeric features;
* never vary at all (num_outbound_cmds and is_host_login are identically 0 here);
* are extremely skewed (e.g. src_bytes has mean 3026 but maximum 693375640).
The \(Y\) value is a multi-category label called “normal” in the data. This is the attack type, with the value “normal.” (note the trailing full stop) meaning “not an attack” and everything else being an attack.
Q: How did I know this? How would you find it out from the data source?
par(las=2) # Ask R to plot perpendicular to the axes
barplot((sort(table(kddata[,"normal"]))),log="y") # Log axis
Now we’ll examine the labels separately. This is a way to make a list for each class:
labs=unique(as.character(kddata[,"normal"]))
names(labs)=labs
kddlist=lapply(labs,function(x){
kddata[kddata[,"normal"]==x,1:41]
})
Tidyverse data objects have tidier idioms for this sort of group-wise subsetting; a sketch follows after the next questions.
Q: What does lapply do here? What sort of object is kddlist? Is this what we want?
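For comparison, a base-R equivalent and a tidyverse sketch (the dplyr package is assumed here and is not among the worksheet's stated requirements):
kddlist2 <- split(kddata[, 1:41], kddata[, "normal"]) # essentially the same named list by class (name ordering may differ)
# library(dplyr)
# kddata %>% group_by(normal) %>% summarise(across(where(is.numeric), mean))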
kddmean=t(sapply(kddlist,function(x)colMeans(x[,c(1,5:41)]))) # class-by-feature matrix of means; columns 2:4 are categorical and are skipped
library("gplots")
heatmap.2(log(kddmean+1),margins =c(9,15),trace="none",cexCol = 0.5,main="Heatmap 1")
mycols=c("dst_bytes","src_bytes","duration","dst_host_srv_count","dst_host_count","srv_count","count")
Q: What sort of objects are kddmean and kddlist? Why?
Q: Why has trace="none" been passed into heatmap.2? How do we find out what it does?
Q: What is the “+1” doing in the log?
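A quick illustration of the “+1” (log1p is the numerically careful variant of the same idea):
log(0)   # -Inf: many entries of kddmean are exactly zero
log(0+1) # 0: the offset keeps zeros finite so heatmap.2 can colour them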
We can obtain the top 7 features by their variance across the classes in this plot using:
head(sort(apply(log(kddmean+1),2,var),decreasing=TRUE),7)
## dst_bytes src_bytes duration count
## 23.008924 11.836058 6.249263 3.177224
## dst_host_count dst_host_srv_count srv_count
## 2.749490 2.453298 2.177616
Q: What does high variance for a feature mean? How might it be important for downstream work?
Now we want to standardize the features and repeat the heatmap.
kddfreq=apply(kddmean,2,function(x)x/(sum(x)+1)) ## Divide each column by its sum (+1 guards against all-zero columns)
kddfreq[!is.finite(kddfreq)]=0 # clean up any non-finite entries
kddfreq[is.nan(kddfreq)]=0 # (redundant, as NaN is already non-finite, but harmless)
heatmap.2(kddfreq,margins =c(9,15),trace="none",cexCol = 0.5,main="Heatmap 2")
Q: What is this standardization? What is the “+1” doing in the frequency calculation? (We cover this later in the course)
The corresponding variance in the features for this heatmap is:
head(sort(apply(kddfreq,2,var),decreasing=TRUE),7)
## dst_bytes src_bytes srv_count num_root wrong_fragment
## 0.03716260 0.02113590 0.02079709 0.01753651 0.01692405
## num_compromised duration
## 0.01373042 0.01309468
Think about what these results mean.
We will now make a table of the interaction between the class label and the categorical variables.
mycategorical=colnames(kddata)[2:4] # protocol_type, service and flag
classlist=lapply(mycategorical,function(mycat){
  table(kddata[,c(mycat,"normal")]) # cross-tabulate each categorical variable against the class label
})
for(i in 1:3) heatmap.2(log(classlist[[i]]+1),margins =c(9,15),trace="none",main=mycategorical[i])
Q: Some things to reflect on: what do these class-by-category heatmaps show about each categorical variable?
We now compute the within-class standard deviation of each numeric feature, and plot it relative to the mean:
kddsd=t(sapply(kddlist,function(x){
apply(x[,c(1,5:41)],2,sd)
}))
heatmap.2(log(kddsd/(kddmean+0.01)+1),margins =c(9,15),trace="none",main="Heatmap 3")
Q: Some reflection:
* Again, what interpretations can you make about the data?
* What would happen if we do not scale the s.d. to the individual mean entry? (a sketch for exploring this follows below)
* What if we use a linear instead of a log scale?
* Is high variability good, or bad, for inference?
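To explore the second and third questions concretely, one might compare variants such as the following (a sketch, left commented out so the document still knits quickly):
# heatmap.2(log(kddsd+1),margins=c(9,15),trace="none",main="Unscaled s.d. (log)")
# heatmap.2(kddsd/(kddmean+0.01),margins=c(9,15),trace="none",main="Scaled s.d. (linear)")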
Think about what these results mean.
Q: Some final thoughts:
* The dataset contains attacks that are not listed here.
* Consider what the above Exploratory Data Analysis might mean for the hopes of detecting different properties of attack.
* How might you go about making a model that will perform well out-of-sample, comparing “normal” to other classes of “attack”?
* You might choose to explore this further in Assessment 0.