You should have already run the introductory 01.3-WSL-1-EDA.Rmd worksheet and become familiar with working with an RStudio RMarkdown document.
It's good practice to get the requirements right at the top. The following solution checks for each requirement and installs it if it is not present.
if(!require("gplots")) install.packages("gplots")
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
if(!require("readr")) install.packages("readr")
## Loading required package: readr
library("gplots")
library("readr")
Here we’re going to go a little further and explore a new dataset, the KDD99 dataset. Read about the competition task specification.
We use the 10% subset and the column-names file, which you can download directly. We use the opportunity of exploring external data to make some suggestions for your group assessment project structure.
Q: What information would be useful here? What would you like to see in an exemplar web resource?
Cyber security data is often very weird. These data were generated in a competition setting in which teams were hacking one another. The connection activity was then recorded on the internet connection, classified by what generated that traffic, and turned into “features”, i.e. a data frame. Because the computers are “doing something” all the time, there is “normal” traffic in here, but there is also a very large amount of “cyber attack” related traffic, which is unrepresentative of real data. The ability to classify everything your computer is doing, to obtain “true labels”, is also unusual.
The details are very involved but all we really care about is that there are some labels, and our task as data scientists is to see whether the classification task of identifying attacks is feasible.
DATA ORGANISATION: It is very helpful to keep the project structure that I use, so your code “just runs”. Therefore work in a directory (“workshops”) and keep your data in a directory called “data” in the PARENT DIRECTORY, e.g. both inside “dst”, so that your file structure looks like this:
dst
├── data
└── workshops
This maps better to how we will structure the Assessments, which are fussier again because of their group project nature.
OBTAINING DATA: It is essential that it is clear how to obtain the data used in an analysis. It is OK to have manual steps if they are clearly described, but automation is best.
To automate this, DIFFERENTLY AND BETTER THAN THE RECORDING, we will use internal R functions which work regardless of file system. We would like a cross-platform (Windows/Linux/Mac) solution, and this is provided by the file.path function in R, rather than specifying folder locations completely. For example, Windows would use ..\\data\\file, whereas Mac/Linux would be ../data/file. Run ?file.path to learn more.
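As a quick check, file.path simply joins its arguments with the file separator R expects, so the same call works on every platform:
file.path("..", "data", "kddcup.names")
## [1] "../data/kddcup.names"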
Applied to our data download, this could look like the following:
## Create ../data if needed, then fetch each file only if it is not already present
if(!dir.exists(file.path("..","data"))) dir.create(file.path("..","data"))
if(!file.exists(file.path("..","data","kddcup.data_10_percent.zip"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.data_10_percent.zip", destfile=file.path("..","data","kddcup.data_10_percent.zip"))
if(!file.exists(file.path("..","data","kddcup.names"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.names",destfile=file.path("..","data","kddcup.names"))
Aside: An even better implementation would define, or find, a function like safedircreate, along the lines of:
safedircreate<-function(...)
  if(!dir.exists(file.path(...))) dir.create(file.path(...))
This could be used as part of a safedownloadfile function, etc., so that we only ever specify the location once. The above function is used as safedircreate("..","data") in this context.
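A minimal sketch of such a companion helper, assuming we simply wrap download.file (the name safedownloadfile and its argument layout are illustrative, not the course's official helper):
safedownloadfile <- function(url, dir, filename) {
  safedircreate(dir)                # ensure the destination directory exists
  dest <- file.path(dir, filename)  # build the cross-platform destination path
  if (!file.exists(dest)) download.file(url, destfile = dest)
  invisible(dest)                   # return the path so callers can reuse it
}
# e.g. safedownloadfile("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.names",
#                       file.path("..","data"), "kddcup.names")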
Q: When should you implement this? Can you do it so that it is reusable for your assessments?
Code checking question: is this guaranteed to always work? What would make this completely robust? Does R provide that tool already?
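One direction for an answer (a sketch, assuming url and dest are set as in the download calls above): wrap the download in tryCatch so that a network failure is reported rather than aborting the knit.
ok <- tryCatch({
  download.file(url, destfile = dest)  # may fail if offline or the URL moves
  TRUE
}, error = function(e) {
  message("Download failed: ", conditionMessage(e))
  FALSE
})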
Think about how to generate data.
The data are read in as follows:
kddata<-as.data.frame(read_csv(file.path("..","data","kddcup.data_10_percent.zip"),col_names=FALSE)) ## Ignore the warnings - there is a bug with the header
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 494021 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): X2, X3, X4, X42
## dbl (38): X1, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X1...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kddnames=read.table(file.path("..","data","kddcup.names"),sep=":",skip=1,as.is=T)
colnames(kddata)=c(kddnames[,1],"normal") # here we fix the bug with the header
goodcat=names(which(table(kddata[,"normal"])>1)) # keep only labels that appear more than once
kddata=kddata[kddata[,"normal"]%in%goodcat,] # drops the single mis-parsed row flagged in the warning
Q: How important are these problems? What should we do about them? If we wanted to stop this warning, or all warnings, how would we do it? And when should we?
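One option, once we are satisfied the parsing issue is understood and harmless (a sketch using base R's suppressWarnings and readr's show_col_types argument):
kddata <- suppressWarnings(as.data.frame(
  read_csv(file.path("..","data","kddcup.data_10_percent.zip"),
           col_names=FALSE, show_col_types=FALSE)))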
Let's take a look:
head(kddata)
## duration protocol_type service flag src_bytes dst_bytes land wrong_fragment
## 1 0 tcp http SF 181 5450 0 0
## 2 0 tcp http SF 239 486 0 0
## 3 0 tcp http SF 235 1337 0 0
## 4 0 tcp http SF 219 1337 0 0
## 5 0 tcp http SF 217 2032 0 0
## 6 0 tcp http SF 217 2032 0 0
## urgent hot num_failed_logins logged_in num_compromised root_shell
## 1 0 0 0 1 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 1 0 0
## 6 0 0 0 1 0 0
## su_attempted num_root num_file_creations num_shells num_access_files
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## num_outbound_cmds is_host_login is_guest_login count srv_count serror_rate
## 1 0 0 0 8 8 0
## 2 0 0 0 8 8 0
## 3 0 0 0 8 8 0
## 4 0 0 0 6 6 0
## 5 0 0 0 6 6 0
## 6 0 0 0 6 6 0
## srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 0 0 0 1 0
## 4 0 0 0 1 0
## 5 0 0 0 1 0
## 6 0 0 0 1 0
## srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_rate
## 1 0 9 9 1
## 2 0 19 19 1
## 3 0 29 29 1
## 4 0 39 39 1
## 5 0 49 49 1
## 6 0 59 59 1
## dst_host_diff_srv_rate dst_host_same_src_port_rate
## 1 0 0.11
## 2 0 0.05
## 3 0 0.03
## 4 0 0.03
## 5 0 0.02
## 6 0 0.02
## dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## dst_host_rerror_rate dst_host_srv_rerror_rate normal
## 1 0 0 normal.
## 2 0 0 normal.
## 3 0 0 normal.
## 4 0 0 normal.
## 5 0 0 normal.
## 6 0 0 normal.
And get a summary of the data:
summary(kddata)
## duration protocol_type service flag
## Min. : 0.00 Length:494020 Length:494020 Length:494020
## 1st Qu.: 0.00 Class :character Class :character Class :character
## Median : 0.00 Mode :character Mode :character Mode :character
## Mean : 47.98
## 3rd Qu.: 0.00
## Max. :58329.00
## src_bytes dst_bytes land wrong_fragment
## Min. : 0 Min. : 0 Min. :0.00e+00 Min. :0.000000
## 1st Qu.: 45 1st Qu.: 0 1st Qu.:0.00e+00 1st Qu.:0.000000
## Median : 520 Median : 0 Median :0.00e+00 Median :0.000000
## Mean : 3026 Mean : 869 Mean :4.45e-05 Mean :0.006433
## 3rd Qu.: 1032 3rd Qu.: 0 3rd Qu.:0.00e+00 3rd Qu.:0.000000
## Max. :693375640 Max. :5155468 Max. :1.00e+00 Max. :3.000000
## urgent hot num_failed_logins logged_in
## Min. :0.00e+00 Min. : 0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.00e+00 1st Qu.: 0.00000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.00e+00 Median : 0.00000 Median :0.000000 Median :0.0000
## Mean :1.42e-05 Mean : 0.03452 Mean :0.000152 Mean :0.1482
## 3rd Qu.:0.00e+00 3rd Qu.: 0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :3.00e+00 Max. :30.00000 Max. :5.000000 Max. :1.0000
## num_compromised root_shell su_attempted num_root
## Min. : 0.0000 Min. :0.0000000 Min. :0.00e+00 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.:0.0000000 1st Qu.:0.00e+00 1st Qu.: 0.0000
## Median : 0.0000 Median :0.0000000 Median :0.00e+00 Median : 0.0000
## Mean : 0.0102 Mean :0.0001113 Mean :3.64e-05 Mean : 0.0114
## 3rd Qu.: 0.0000 3rd Qu.:0.0000000 3rd Qu.:0.00e+00 3rd Qu.: 0.0000
## Max. :884.0000 Max. :1.0000000 Max. :2.00e+00 Max. :993.0000
## num_file_creations num_shells num_access_files num_outbound_cmds
## Min. : 0.000000 Min. :0.0000000 Min. :0.000000 Min. :0
## 1st Qu.: 0.000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0
## Median : 0.000000 Median :0.0000000 Median :0.000000 Median :0
## Mean : 0.001083 Mean :0.0001093 Mean :0.001008 Mean :0
## 3rd Qu.: 0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0
## Max. :28.000000 Max. :2.0000000 Max. :8.000000 Max. :0
## is_host_login is_guest_login count srv_count
## Min. :0 Min. :0.000000 Min. : 0.0 Min. : 0.0
## 1st Qu.:0 1st Qu.:0.000000 1st Qu.:117.0 1st Qu.: 10.0
## Median :0 Median :0.000000 Median :510.0 Median :510.0
## Mean :0 Mean :0.001387 Mean :332.3 Mean :292.9
## 3rd Qu.:0 3rd Qu.:0.000000 3rd Qu.:511.0 3rd Qu.:511.0
## Max. :0 Max. :1.000000 Max. :511.0 Max. :511.0
## serror_rate srv_serror_rate rerror_rate srv_rerror_rate
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1767 Mean :0.1766 Mean :0.05743 Mean :0.05772
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. : 0.0
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:255.0
## Median :1.0000 Median :0.00000 Median :0.000 Median :255.0
## Mean :0.7915 Mean :0.02098 Mean :0.029 Mean :232.5
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:255.0
## Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :255.0
## dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate
## Min. : 0.0 Min. :0.0000 Min. :0.00000
## 1st Qu.: 46.0 1st Qu.:0.4100 1st Qu.:0.00000
## Median :255.0 Median :1.0000 Median :0.00000
## Mean :188.7 Mean :0.7538 Mean :0.03091
## 3rd Qu.:255.0 3rd Qu.:1.0000 3rd Qu.:0.04000
## Max. :255.0 Max. :1.0000 Max. :1.00000
## dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate
## Min. :0.0000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :1.0000 Median :0.000000 Median :0.0000
## Mean :0.6019 Mean :0.006684 Mean :0.1768
## 3rd Qu.:1.0000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000000 Max. :1.0000
## dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1764 Mean :0.05812 Mean :0.05741
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## normal
## Length:494020
## Class :character
## Mode :character
##
##
##
Q: What can you “understand” about the data from this? What is worrying about it?
We need to pay attention to variables that:
* are categorical (protocol_type, service, flag) and so need different treatment from the numeric features;
* never vary at all (num_outbound_cmds and is_host_login are identically 0 here);
* are extremely skewed (e.g. src_bytes has mean 3026 but maximum 693375640).
The \(Y\) value is a multi-category label called “normal” in the data. This is the attack type, with the value “normal.” (note the trailing full stop) meaning “not an attack” and everything else being an attack.
Q: How did I know this? How would you find it out from the data source?
par(las=2) # Ask R to plot perpendicular to the axes
barplot((sort(table(kddata[,"normal"]))),log="y") # Log axis
Now we’ll examine the labels separately. This is a way to make a list for each class:
labs=unique(as.character(kddata[,"normal"]))
names(labs)=labs
kddlist=lapply(labs,function(x){
kddata[kddata[,"normal"]==x,1:41]
})
Tidyverse data objects have tidier idioms for this sort of group-wise subsetting; a sketch follows after the next questions.
Q: What does lapply do here? What sort of object is kddlist? Is this what we want?
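For comparison, a base-R equivalent and a tidyverse sketch (the dplyr package is assumed here and is not among the worksheet's stated requirements):
kddlist2 <- split(kddata[, 1:41], kddata[, "normal"]) # essentially the same named list by class (name ordering may differ)
# library(dplyr)
# kddata %>% group_by(normal) %>% summarise(across(where(is.numeric), mean))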
kddmean=t(sapply(kddlist,function(x)colMeans(x[,c(1,5:41)]))) # class-by-feature matrix of means; columns 2:4 are categorical and are skipped
library("gplots")
heatmap.2(log(kddmean+1),margins =c(9,15),trace="none",cexCol = 0.5,main="Heatmap 1")
mycols=c("dst_bytes","src_bytes","duration","dst_host_srv_count","dst_host_count","srv_count","count")
Q: What sort of objects are kddmean and kddlist? Why?
Q: Why has trace="none" been passed into heatmap.2? How do we find out what it does?
Q: What is the “+1” doing in the log?
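A quick illustration of the “+1” (log1p is the numerically careful variant of the same idea):
log(0)   # -Inf: many entries of kddmean are exactly zero
log(0+1) # 0: the offset keeps zeros finite so heatmap.2 can colour them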
We can obtain the top 7 features by their variance across the classes in this plot using:
head(sort(apply(log(kddmean+1),2,var),decreasing=TRUE),7)
## dst_bytes src_bytes duration count
## 23.008924 11.836058 6.249263 3.177224
## dst_host_count dst_host_srv_count srv_count
## 2.749490 2.453298 2.177616
Q: What does high variance for a feature mean? How might it be important for downstream work?
Now we want to standardize the features and repeat the heatmap.
kddfreq=apply(kddmean,2,function(x)x/(sum(x)+1)) ## Divide each column by its sum (+1 guards against all-zero columns)
kddfreq[!is.finite(kddfreq)]=0 # clean up any non-finite entries
kddfreq[is.nan(kddfreq)]=0 # (redundant, as NaN is already non-finite, but harmless)
heatmap.2(kddfreq,margins =c(9,15),trace="none",cexCol = 0.5,main="Heatmap 2")
Q: What is this standardization? What is the “+1” doing in the frequency calculation? (We cover this later in the course)
The corresponding variance in the features for this heatmap is:
head(sort(apply(kddfreq,2,var),decreasing=TRUE),7)
## dst_bytes src_bytes srv_count num_root wrong_fragment
## 0.03716260 0.02113590 0.02079709 0.01753651 0.01692405
## num_compromised duration
## 0.01373042 0.01309468
Think about what these results mean.
We will now make a table of the interaction between the class label and the categorical variables.
mycategorical=colnames(kddata)[2:4] # protocol_type, service and flag
classlist=lapply(mycategorical,function(mycat){
  table(kddata[,c(mycat,"normal")]) # cross-tabulate each categorical variable against the class label
})
for(i in 1:3) heatmap.2(log(classlist[[i]]+1),margins =c(9,15),trace="none",main=mycategorical[i])
Q: Some things to reflect on: what do these class-by-category heatmaps show about each categorical variable?
We now compute the within-class standard deviation of each numeric feature, and plot it relative to the mean:
kddsd=t(sapply(kddlist,function(x){
apply(x[,c(1,5:41)],2,sd)
}))
heatmap.2(log(kddsd/(kddmean+0.01)+1),margins =c(9,15),trace="none",main="Heatmap 3")
Q: Some reflection:
* Again, what interpretations can you make about the data?
* What would happen if we do not scale the s.d. to the individual mean entry? (a sketch for exploring this follows below)
* What if we use a linear instead of a log scale?
* Is high variability good, or bad, for inference?
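To explore the second and third questions concretely, one might compare variants such as the following (a sketch, left commented out so the document still knits quickly):
# heatmap.2(log(kddsd+1),margins=c(9,15),trace="none",main="Unscaled s.d. (log)")
# heatmap.2(kddsd/(kddmean+0.01),margins=c(9,15),trace="none",main="Scaled s.d. (linear)")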
Think about what these results mean.
Q: Some final thoughts:
* The dataset contains attacks that are not listed here.
* Consider what the above Exploratory Data Analysis might mean for the hopes of detecting different properties of attack.
* How might you go about making a model that will perform well out-of-sample, comparing “normal” to other classes of “attack”?
* You might choose to explore this further in Assessment 0.