This workshop explores spectral clustering of cyber data, applying DBSCAN to a spectral embedding.

Remember to look at the lecture code examples, which apply other methods to analogous data.

0. Requirements

if(!require("purrr")) install.packages("purrr") # allows the nice %>% notation; part of the "tidyverse"
## Loading required package: purrr
if(!require("dbscan")) install.packages("dbscan") # density based clustering
## Loading required package: dbscan
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
if(!require("readr")) install.packages("readr") # for read_csv which is improved over read.csv
## Loading required package: readr
library("purrr")
library("readr")

1. Data

We will continue with the KDD11 dataset from the first workshop. Recall that we downloaded it with:

if(!dir.exists(file.path("..","data"))) dir.create(file.path("..","data"))
if(!file.exists(file.path("..","data","kddcup.data_10_percent.zip"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.data_10_percent.zip", destfile=file.path("..","data","kddcup.data_10_percent.zip"))
if(!file.exists(file.path("..","data","kddcup.names"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.names",destfile=file.path("..","data","kddcup.names"))

We’ll read it in as before.

kddata<-as.data.frame(read_csv(file.path("..","data","kddcup.data_10_percent.zip"),col_names=FALSE)) ## Ignore the warnings - there is a bug with the header
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 494021 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): X2, X3, X4, X42
## dbl (38): X1, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X1...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kddnames=read.table(file.path("..","data","kddcup.names"),sep=":",skip=1,as.is=T)
colnames(kddata)=c(kddnames[,1],"normal") # here we fix the bug with the header
goodcat=names(which(table(kddata[,"normal"])>1))
kddata=kddata[kddata[,"normal"]%in%goodcat,]

1.1 Data transformations

Now we make the data suitable for spectral embedding. There are several motivations here:

  1. You can take an SVD of anything. However, any reasonable interpretation of an SVD assumes additive noise. When the data are inherently multiplicative, we need to take a logarithm for this assumption to hold. Here the data are heavy-tailed, an indication of some sort of multiplicative noise.
  2. Much of the data is categorical. Whilst we could use categorical data in an SVD, it would need to be one-hot encoded and scaled appropriately (see the sketch after this list). You should do that for assignments, but it's overkill here.
  3. We did EDA on the data and found several columns to contain hard-to-interpret structure. The columns we retain here do too, but we will do the work of exploring them.
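
For reference, here is a minimal sketch of one-hot encoding in base R. The column names are those assigned from kddcup.names, and appropriate scaling of the indicator blocks is left for the assignments:

catcols=c("protocol_type","service","flag") # categorical columns (names from kddcup.names)
# model.matrix(~ 0 + f) expands a single factor into one indicator column per
# level; we apply it per column and bind the results together.
onehot=do.call(cbind,lapply(kddata[,catcols],function(x)
  model.matrix(~ 0 + factor(x))))
dim(onehot) # one indicator column per category level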

We will consider a few transformations here. You should rerun the analysis with different choices to get an understanding of how they affect the inference.

testdata=kddata[,c("src_bytes","dst_bytes",
                   "count","srv_count",
                     "dst_host_count","dst_host_srv_count")]
#transformation=function(x)x # Identity
#transformation=function(x)(x-mean(x))/sd(x) # standardize and center
#transformation=function(x)x/sd(x) # standardize only
transformation=function(x){xx=log(1+x);(xx)/sd(xx)} # log transform and then standardize
# transformation=function(x)log(1+x) # log transform
testdatatrans=apply(testdata,2,transformation)

1.2 Spectral Embedding

Make a suitable “spectral embedding” (SVD or PCA) from the transformed test data, and plot the singular values (loosely, the “eigenvalues”) to choose the number of PCs.

testdatatrans.svd <- svd(testdatatrans)

Plotting the singular values (a “scree plot”):

plot(testdatatrans.svd$d,xlab="Singular value index",ylab="Singular value",log="")
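
To put numbers on the scree plot, we can compute the proportion of the total sum of squares captured by each component (loosely, “variance explained”; we did not centre the data, so this is not variance in the strict sense):

varexp=testdatatrans.svd$d^2/sum(testdatatrans.svd$d^2)
round(cumsum(varexp),3) # cumulative proportion captured by the first k components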

See Q1 in Block 3 Portfolio about exploring the different transformations that are possible and how they impact inference.

With this transformation choice it looks like we want the first 3 PCs, which we will record in a variable:

npcs=3

Extensions: Try rerunning the following analysis using more or fewer PCs. How does the choice of representation affect our analysis - what if we scale differently? What if we normalise features?

See Q2 in Block 3 Portfolio about the correct number of PCs to retain.
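
As a starting point for these extensions, here is a sketch comparing scree plots across some candidate transformations (the labels are our own):

tfs=list(identity =function(x)x,
         scaled   =function(x)x/sd(x),
         logscaled=function(x){xx=log(1+x);xx/sd(xx)})
par(mfrow=c(1,length(tfs))) # one scree plot per transformation
for(nm in names(tfs)){
  d=svd(apply(testdata,2,tfs[[nm]]))$d
  plot(d,main=nm,xlab="Component",ylab="Singular value")
}
par(mfrow=c(1,1))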

2.1 Visualising Eigenvectors

We make a plot of the data, annotated by the “normal” category assigned to each point. Because there are very many examples of some classes, we retain at most 5000 of each class.

First we will obtain the indices of each class of data, and order the classes by their size:

normaltab=table(kddata[,"normal"])
normallist=lapply(names(normaltab),function(x)which(kddata[,"normal"]==x))
normalsamples=lapply(normallist,function(x){
  if(length(x)<5000) return(x)
  return(sample(x,size = 5000))
})
names(normalsamples)=names(normaltab)
normalsizes=sapply(normalsamples,length)
normalsamples=normalsamples[order(normalsizes,decreasing=TRUE)]
normalsizes=normalsizes[order(normalsizes,decreasing=TRUE)]
normalsizes
##         neptune.          normal.           smurf.            back. 
##             5000             5000             5000             2203 
##           satan.         ipsweep.       portsweep.     warezclient. 
##             1589             1247             1040             1020 
##        teardrop.             pod.            nmap.    guess_passwd. 
##              979              264              231               53 
## buffer_overflow.            land.     warezmaster.            imap. 
##               30               21               20               12 
##         rootkit.      loadmodule.       ftp_write.        multihop. 
##               10                9                8                7 
##             phf.            perl.             spy. 
##                4                3                2

From this we can see that we have downsampled neptune, normal, and smurf classes. We have very few cases of several attacks.

Plot the PCA:

i=2;j=3
plot(testdatatrans.svd$u[,i],
     testdatatrans.svd$u[,j],type="n",xlab=paste0("PC",i),ylab=paste0("PC",j),
     col="#33333311",pch=19,cex=0.3) ## Set up the plot region with the points omitted
legend("topleft",legend=names(normalsamples),text.col=1:length(normalsamples)+1,col=1:length(normalsamples)+1,pch=c(19,1,2)[ceiling((1:length(normalsamples))/10)],cex=0.5)
for(index in 1:length(normalsamples)){ ## Question: why would we plot the points in decreasing size order?
  points(testdatatrans.svd$u[normalsamples[[index]],c(i,j)],
         col=index+1,pch=c(19,1,2)[ceiling(index/10)],cex=0.3) # colour and symbol per class, matching the legend
}

This is reassuring: there is structure to find for classification. The “line” structures are potentially interesting, and contrast with the “blob” structures in the larger clusters.

See Q3 in Block 3 Portfolio about the interpretation of these results.

3.1 Clustering with DBSCAN

Here we perform parameter tuning for clustering using “dbscan”.

First, we use the function “kNNdist”, which computes the distance to each of the k nearest neighbours. We will use this to explore what sort of thresholds for the parameter “eps” might be useful.

library("dbscan")
test=kNNdist(testdatatrans.svd$u[,1:npcs], k = 5,all=TRUE)
testmin=apply(test,1,min)
range(testmin)
## [1] 0.0000000000 0.0007231371

Now we’ll plot all of the minimum values. This gives the distance to the closest other point, for every point.

plot(sort(testmin[testmin>1e-8]),log="y",type="l")
thresholds=c(0.01,0.001,0.0001,0.00001,0.000001)
abline(h=thresholds)
abline(h=0.0001,col="red") # we chose this number.

## abline(h=0.001) # would give bigger clusters
## abline(h=0.00001) # would give smaller clusters

From this analysis, we can see that nearly all points are within 1e-04 of another point.
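
We can check this claim directly, by computing the fraction of points whose nearest neighbour lies within each candidate threshold:

# Fraction of points with a neighbour within each candidate eps:
sapply(c(1e-2,1e-3,1e-4,1e-5,1e-6),function(eps)mean(testmin<=eps))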

Extension: Experiment with other thresholds and see how the clustering behaves.

See Q4 in Block 3 Portfolio about running DBSCAN.

3.1.1 Downsampling

Unfortunately, dbscan runs out of memory when applied to the full dataset. Here we have

dim(testdatatrans.svd$u)[1]
## [1] 494020

so a small amount of downsampling should solve the problem. Make a list of indices that can be used to downsample the main dataset, keeping all examples of small classes and up to 60000 of each larger class. You should get 188752 entries. Then use these indices to run dbscan.

set.seed(1)
normalsamples2=lapply(normallist,function(x){
  if(length(x)<60000) return(x)
  return(sample(x,size = 60000))
})
myindices=unlist(normalsamples2) %>% sort
length(myindices)
## [1] 188752

3.1.2 DBSCAN clustering

Now do the clustering:

dbscanres=dbscan(testdatatrans.svd$u[myindices,1:npcs],eps=0.0001)

We can examine the results in a simple way using “table”, to ask: How many clusters are recovered? How do they associate with the labels?

table(dbscanres$cluster)
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##   1665 125547     25      8      7      8     25      8     16  59991     11 
##     11     12     13     14     15     16     17     18     19     20     21 
##      7      8     23      7      9      5     92     15      6      6     20 
##     22     23     24     25     26     27     28     29     30     31     32 
##     23    810     52      6      6      6      5      6      6     38      8 
##     33     34     35     36     37     38     39     40     41     42     43 
##      7     13      5     17      5      4      6      8      9      5      5 
##     44     45     46     47     48     49     50     51     52     53     54 
##      8      7     12     11      7      8      5     11     11     27      5 
##     55     56     57     58     59     60     61     62     63     64     65 
##      9      3     15     15      5      5      5      4      5      6      5 
##     66 
##      4

Now we will display the clustering on the pairwise PC plots:

##pdf("test_kddspectralembedding_dbscan.pdf",height=10,width=10)
K=max(dbscanres$cluster) # number of clusters found (66 here)
plot(testdatatrans.svd$u[myindices,1],
     testdatatrans.svd$u[myindices,2],
     xlab="PC1",ylab="PC2",main=paste("K =",K,"+ noise"),
     col=c("#66666666",rainbow(K))[dbscanres$cluster+1],pch=19,cex=0.5)

plot(testdatatrans.svd$u[myindices,1],
     testdatatrans.svd$u[myindices,3],xlab="PC1",ylab="PC3",main=paste("K =",K,"+ noise"),
     col=c("#66666666",rainbow(K))[dbscanres$cluster+1],pch=19,cex=0.5)

plot(testdatatrans.svd$u[myindices,2],
     testdatatrans.svd$u[myindices,3],xlab="PC2",ylab="PC3",main=paste("K =",K,"+ noise"),
     col=c("#66666666",rainbow(K))[dbscanres$cluster+1],pch=19,cex=0.5)
##dev.off()

Examining these plots, it is clear that there are some important distinctions between the inferred clusters. We can make a table of the labels and clustering:

mytable=data.frame(cluster=dbscanres$cluster,label=kddata[myindices,"normal"])
mytab=table(mytable)
myfreq=mytab/rowSums(mytab)
heatmap(t(myfreq),scale="none",main="Frequency of label given cluster",cexRow=0.7)

myfreq2=t(t(mytab)/colSums(mytab))
heatmap(t(myfreq2),scale="none",main="Frequency of cluster given label",cexRow = 0.7)
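
Before interpreting the clusters, it is worth checking how sensitive they are to the choice of “eps”, as suggested in the extension above. A sketch (this reruns dbscan several times, so it may be slow):

epsgrid=c(1e-3,1e-4,1e-5)
sapply(epsgrid,function(eps){
  cl=dbscan(testdatatrans.svd$u[myindices,1:npcs],eps)$cluster
  c(nclusters=max(cl),unclustered=sum(cl==0)) # clusters found; noise points
})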

3.1.3 Interpretation as a classifier

How well does the clustering describe each label?

bestmatch=cbind(bestfreq=apply(myfreq,1,max),count=apply(mytab,1,sum))
print(bestmatch)
##     bestfreq  count
## 0  0.7885886   1665
## 1  0.4771679 125547
## 2  0.4800000     25
## 3  1.0000000      8
## 4  1.0000000      7
## 5  1.0000000      8
## 6  1.0000000     25
## 7  1.0000000      8
## 8  1.0000000     16
## 9  0.9995166  59991
## 10 1.0000000     11
## 11 1.0000000      7
## 12 1.0000000      8
## 13 0.6086957     23
## 14 1.0000000      7
## 15 0.8888889      9
## 16 0.6000000      5
## 17 0.9782609     92
## 18 1.0000000     15
## 19 0.8333333      6
## 20 1.0000000      6
## 21 1.0000000     20
## 22 0.6956522     23
## 23 0.8370370    810
## 24 1.0000000     52
## 25 1.0000000      6
## 26 1.0000000      6
## 27 1.0000000      6
## 28 1.0000000      5
## 29 1.0000000      6
## 30 1.0000000      6
## 31 0.9473684     38
## 32 0.8750000      8
## 33 0.5714286      7
## 34 1.0000000     13
## 35 1.0000000      5
## 36 0.7647059     17
## 37 1.0000000      5
## 38 1.0000000      4
## 39 0.8333333      6
## 40 1.0000000      8
## 41 1.0000000      9
## 42 1.0000000      5
## 43 1.0000000      5
## 44 1.0000000      8
## 45 1.0000000      7
## 46 1.0000000     12
## 47 1.0000000     11
## 48 1.0000000      7
## 49 1.0000000      8
## 50 0.6000000      5
## 51 1.0000000     11
## 52 0.9090909     11
## 53 1.0000000     27
## 54 0.6000000      5
## 55 1.0000000      9
## 56 1.0000000      3
## 57 1.0000000     15
## 58 1.0000000     15
## 59 0.8000000      5
## 60 1.0000000      5
## 61 1.0000000      5
## 62 1.0000000      4
## 63 0.6000000      5
## 64 1.0000000      6
## 65 1.0000000      5
## 66 1.0000000      4
accuracy=sum(bestmatch[,1]*bestmatch[,2])/sum(bestmatch[,2])
print(paste("Accuracy in training data:",accuracy))
## [1] "Accuracy in training data: 0.649238153767907"

Overall:

  • The mean accuracy is low, but this is driven almost entirely by clusters 0 and 1; the accuracy for every other sizeable cluster is high.
    • Cluster 1 is very large, and mixed.
    • Very many clusters are close to perfectly associated with a single label.
    • Even the “unclustered” data (cluster 0) is a very small fraction of the whole.
  • Therefore, for points assigned to almost any other cluster, we can be confident about the label.
    • “Cluster” is a good feature.
  • If we were given new data, we might find which cluster it most resembles (see the sketch after this list); this could be concerning, as the new data might not look like any of the currently observed data.
  • Also, a clustering offers little in the way of uncertainty assessment, and probably overfits to the small strands of data that we have available.
    • For example, all those tiny clusters are unlikely to be “real”.
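
As a sketch of the new-data idea (the helper names embed_new and nearest_cluster are our own, and we ignore the noise cluster): a new observation can be folded into the embedding using the SVD relation U = X V D^{-1}, then assigned to the nearest cluster centroid.

# Scales used in training (the same log-then-standardise transformation):
sds=apply(log(1+testdata),2,sd)
# Fold a new raw observation (same 6 columns as testdata) into the embedding:
embed_new=function(newx){
  z=log(1+newx)/sds
  z%*%testdatatrans.svd$v[,1:npcs]%*%diag(1/testdatatrans.svd$d[1:npcs])
}
# Centroid of each cluster in the embedding (ignoring noise, cluster 0):
centroids=sapply(1:max(dbscanres$cluster),function(k)
  colMeans(testdatatrans.svd$u[myindices,1:npcs][dbscanres$cluster==k,,drop=FALSE]))
# Assign a new observation to the cluster with the nearest centroid:
nearest_cluster=function(newx){
  u=as.numeric(embed_new(newx))
  which.min(colSums((centroids-u)^2))
}
nearest_cluster(as.numeric(testdata[1,])) # e.g. the first training row

A deployed version would also need a distance threshold to flag observations that resemble no existing cluster.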

See Q5 in Block 3 Portfolio about interpreting the results of DBSCAN.

Extensions:

  • Comment on what we can and cannot do with this representation of the data.
  • Would we expect clustering to match labels? How would we do better?

Mastery:

This worksheet applied dbscan and investigated its clustering. What about the other clustering methods discussed? Which is best? Which even run at this scale? What is the right metric of performance? Can we use the labels to choose the correct density threshold?
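
As one concrete comparison, here is a sketch running k-means on the same embedding (the choice of 20 centres is arbitrary):

set.seed(1)
km=kmeans(testdatatrans.svd$u[myindices,1:npcs],centers=20,nstart=5)
# Cross-tabulate against the labels, for comparison with the DBSCAN tables above:
table(km$cluster,kddata[myindices,"normal"])[,1:5]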

How would you go about making a real classifier from a clustering model, one that could be deployed on new data?