This workshop explores spectral clustering on cyber data using DBSCAN. The peculiarities of this data, as visualised with spectral decomposition in Lecture 3.1 and clustered in Lecture 3.2, generate unusual structure that can only be handled with carefully chosen methods.
Whilst not all data are as extreme as this example, the methods discussed here are robust and apply in many other contexts.
Remember to look at the lecture code examples, which apply other methods to analogous data.
if(!require("purrr")) install.packages("purrr") # allows the nice %>% notation; part of the "tidyverse"## Loading required package: purrrif(!require("dbscan")) install.packages("dbscan") # density based clustering## Loading required package: dbscan## 
## Attaching package: 'dbscan'## The following object is masked from 'package:stats':
## 
##     as.dendrogramif(!require("readr")) install.packages("readr") # for read_csv which is improved over read.csv## Loading required package: readrlibrary("purrr")
library("readr")We will continue with the KDD11 dataset from the first workshop. Recall that we downloaded it with:
if(!dir.exists(file.path("..","data"))) dir.create(file.path("..","data"))
if(!file.exists(file.path("..","data","kddcup.data_10_percent.zip"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.data_10_percent.zip", destfile=file.path("..","data","kddcup.data_10_percent.zip"))
if(!file.exists(file.path("..","data","kddcup.names"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.names",destfile=file.path("..","data","kddcup.names"))
We'll read it in as before.
kddata<-as.data.frame(read_csv(file.path("..","data","kddcup.data_10_percent.zip"),col_names=FALSE)) ## Ignore the parsing warnings - there is a bug with the header, which we fix below
## Rows: 494021 Columns: 42
kddnames=read.table(file.path("..","data","kddcup.names"),sep=":",skip=1,as.is=T)
colnames(kddata)=c(kddnames[,1],"normal") # here we fix the bug with the header
goodcat=names(which(table(kddata[,"normal"])>1))
kddata=kddata[kddata[,"normal"]%in%goodcat,]
Now we make the data suitable for spectral embedding.
We will consider a few transformations here. You should rerun the analysis with different choices to get an understanding of how they affect the inference.
testdata=kddata[,c("src_bytes","dst_bytes",
                   "count","srv_count",
                     "dst_host_count","dst_host_srv_count")]
#transformation=function(x)x # Identity
#transformation=function(x)(x-mean(x))/sd(x) # standardize and center
#transformation=function(x)x/sd(x) # standardize only
transformation=function(x){xx=log(1+x);(xx)/sd(xx)} # log transform and then standardize
# transformation=function(x)log(1+x) # log transform
testdatatrans=apply(testdata,2,transformation)
Make a suitable "spectral embedding" (SVD or PCA) from the transformed test data, and plot the eigenvalues to choose the number of PCs.
testdatatrans.svd <- svd(testdatatrans)
Plotting the eigenvalues (a "scree plot"):
plot(testdatatrans.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="")
Think about exploring the different transformations that are possible and how they impact inference.
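To complement the scree plot, it can help to look at the proportion of variance each component captures. A minimal sketch, computed from the singular values returned by svd (the name pve is just illustrative):
pve <- testdatatrans.svd$d^2 / sum(testdatatrans.svd$d^2) # proportion of variance explained by each component
round(cbind(PC = 1:length(pve), pve = pve, cumulative = cumsum(pve)), 3) # cumulative share captured by the first k PCs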
With this transformation choice it looks like we want the first 3 PCs, which we will record in a variable:
npcs=3
Extensions: Try rerunning the following analysis using more or fewer PCs. How does the choice of representation affect our analysis - what if we scale differently? What if we normalise features? Think about the correct number of PCs to retain.
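If you want to try these extensions without editing the code above, one option is to wrap the transform-then-embed steps in a small helper function. This is an illustrative sketch only; the name embed_with is ours, not part of the worksheet:
embed_with <- function(transformation, data = testdata) {
  svd(apply(data, 2, transformation)) # transform each column, then take the SVD
}
# Example: compare scree plots for the log transform and plain standardisation
s_log <- embed_with(function(x){xx=log(1+x);(xx)/sd(xx)})
s_std <- embed_with(function(x)(x-mean(x))/sd(x))
plot(s_log$d, xlab="Eigenvalue index", ylab="Eigenvalue")
points(s_std$d, col=2)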
We make a plot of the data, annotated by the “normal” category that it has been assigned. Because there are very many examples of some classes of data, we retain only up to 5000 of each class.
First we will obtain the indices of each class of data, and order the classes by the number of observations in each:
normaltab=table(kddata[,"normal"])
normallist=lapply(names(normaltab),function(x)which(kddata[,"normal"]==x))
normalsamples=lapply(normallist,function(x){
  if(length(x)<5000) return(x)
  return(sample(x,size = 5000))
})
names(normalsamples)=names(normaltab)
normalsizes=sapply(normalsamples,length)
normalsamples=normalsamples[order(normalsizes,decreasing=TRUE)]
normalsizes=normalsizes[order(normalsizes,decreasing=TRUE)]
normalsizes
##         neptune.          normal.           smurf.            back. 
##             5000             5000             5000             2203 
##           satan.         ipsweep.       portsweep.     warezclient. 
##             1589             1247             1040             1020 
##        teardrop.             pod.            nmap.    guess_passwd. 
##              979              264              231               53 
## buffer_overflow.            land.     warezmaster.            imap. 
##               30               21               20               12 
##         rootkit.      loadmodule.       ftp_write.        multihop. 
##               10                9                8                7 
##             phf.            perl.             spy. 
##                4                3                2
From this we can see that we have downsampled the neptune, normal, and smurf classes. We have very few cases of several attacks.
Plot the PCA:
i=2;j=3
plot(testdatatrans.svd$u[,i],
     testdatatrans.svd$u[,j],type="n",xlab=paste0("PC",i),ylab=paste0("PC",j),
     col="#33333311",pch=19,cex=0.3) ## Plot where the points are omitted
legend("topleft",legend=names(normalsamples),text.col=1:length(normalsamples),col=1:length(normalsamples),pch=c(19,1,2)[ceiling((1:30)/10)],cex=0.5)
for(index in 1:length(normalsamples)){ ## Question: why would we plot the points in decreasing size order?
  points(testdatatrans.svd$u[normalsamples[[index]],c(i,j)],
         col=index+1,pch=c(19,1,2)[ceiling((1:30)/10)],cex=0.3)
}This looks reassuring that there is structure to find for classification. The “line” structures are potentially interesting, and contrast to the “blob” structures in the larger clusters.
Think about the interpretation of these results.
Here we perform parameter tuning for clustering using “dbscan”.
First, we use the function "kNNdist", which computes the distances to each point's k nearest neighbours. We will explore these distances to see what values of the parameter "eps" might be useful.
library("dbscan")
test=kNNdist(testdatatrans.svd$u[,1:npcs], k = 5,all=TRUE)
testmin=apply(test,1,min)
range(testmin)
## [1] 0.0000000000 0.0007231371
Now we'll plot all of the minimum values. This gives the distance to the closest other point, for every point.
plot(sort(testmin[testmin>1e-8]),log="y",type="l")
thresholds=c(0.01,0.001,0.0001,0.00001,0.000001)
abline(h=thresholds)
abline(h=0.0001,col="red") # we chose this number
## abline(h=0.001) # would give bigger clusters
## abline(h=0.00001) # would give smaller clusters
From this analysis, we can see that nearly all points are within 1e-04 of another point.
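As an aside, the dbscan package also provides kNNdistplot, which draws the sorted k-nearest-neighbour distances directly; looking for a "knee" in that curve is a common way to choose eps. A small sketch on the same embedding (the choice of k = 5 simply mirrors the kNNdist call above):
kNNdistplot(testdatatrans.svd$u[,1:npcs], k = 5) # sorted distance to each point's 5th nearest neighbour
abline(h = 0.0001, col = "red", lty = 2) # our chosen eps, for comparison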
Extension: Experiment with other thresholds and see how the clustering behaves.
Think about running DBSCAN.
Unfortunately, dbscan runs out of memory when applied to the full dataset. Here we have
dim(testdatatrans.svd$u)[1]
## [1] 494020
so a small amount of downsampling should solve the problem. Make a list of indices that can be used to downsample the main dataset, keeping all examples of small classes and up to 60000 of the larger classes. You should get 188752 entries. Then use this downsampled dataset to run dbscan.
set.seed(1)
normalsamples2=lapply(normallist,function(x){
  if(length(x)<60000) return(x)
  return(sample(x,size = 60000))
})
myindices=unlist(normalsamples2) %>% sort
length(myindices)
## [1] 188752
Now do the clustering:
dbscanres=dbscan(testdatatrans.svd$u[myindices,1:3],0.0001)
We can examine the results in a simple way using "table", to ask: How many clusters are recovered? How do they associate with the labels?
table(dbscanres$cluster)
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##   1665 125547     25      8      7      8     25      8     16  59991     11 
##     11     12     13     14     15     16     17     18     19     20     21 
##      7      8     23      7      9      5     92     15      6      6     20 
##     22     23     24     25     26     27     28     29     30     31     32 
##     23    810     52      6      6      6      5      6      6     38      8 
##     33     34     35     36     37     38     39     40     41     42     43 
##      7     13      5     17      5      4      6      8      9      5      5 
##     44     45     46     47     48     49     50     51     52     53     54 
##      8      7     12     11      7      8      5     11     11     27      5 
##     55     56     57     58     59     60     61     62     63     64     65 
##      9      3     15     15      5      5      5      4      5      6      5 
##     66 
##      4
Now we will display the clustering on the first few PCs:
##pdf("test_kddspectralembedding_dbscan.pdf",height=10,width=10)
    plot(testdatatrans.svd$u[myindices,1],
         testdatatrans.svd$u[myindices,2],
         xlab="PC1",ylab="PC2",main=paste("K=41 + noise"),
         col=c("#66666666",rainbow(41))[dbscanres$cluster+1],pch=19,cex=0.5)    plot(testdatatrans.svd$u[myindices,1],
         testdatatrans.svd$u[myindices,3],xlab="PC1",ylab="PC3",main=paste("K=41 + noise"),
         col=c("#66666666",rainbow(41))[dbscanres$cluster+1],pch=19,cex=0.5)    plot(testdatatrans.svd$u[myindices,2],
         testdatatrans.svd$u[myindices,3],xlab="PC2",ylab="PC3",,main=paste("K=41 + noise"),
         col=c("#66666666",rainbow(41))[dbscanres$cluster+1],pch=19,cex=0.5)##dev.off()Examining these plots, it is clear that there are some important distinctions between the inferred clusters. We can make a table of the labels and clustering:
mytable=data.frame(cluster=dbscanres$cluster,label=kddata[myindices,"normal"])
mytab=table(mytable)
myfreq=mytab/rowSums(mytab)
heatmap(t(myfreq),scale="none",main="Frequency of label given cluster",cexRow=0.7)
myfreq2=t(t(mytab)/colSums(mytab))
heatmap(t(myfreq2),scale="none",main="Frequency of cluster given label",cexRow=0.7)
How well does the clustering describe each label?
bestmatch=cbind(bestfreq=apply(myfreq,1,max),count=apply(mytab,1,sum))
print(bestmatch)
##     bestfreq  count
## 0  0.7885886   1665
## 1  0.4771679 125547
## 2  0.4800000     25
## 3  1.0000000      8
## 4  1.0000000      7
## 5  1.0000000      8
## 6  1.0000000     25
## 7  1.0000000      8
## 8  1.0000000     16
## 9  0.9995166  59991
## 10 1.0000000     11
## 11 1.0000000      7
## 12 1.0000000      8
## 13 0.6086957     23
## 14 1.0000000      7
## 15 0.8888889      9
## 16 0.6000000      5
## 17 0.9782609     92
## 18 1.0000000     15
## 19 0.8333333      6
## 20 1.0000000      6
## 21 1.0000000     20
## 22 0.6956522     23
## 23 0.8370370    810
## 24 1.0000000     52
## 25 1.0000000      6
## 26 1.0000000      6
## 27 1.0000000      6
## 28 1.0000000      5
## 29 1.0000000      6
## 30 1.0000000      6
## 31 0.9473684     38
## 32 0.8750000      8
## 33 0.5714286      7
## 34 1.0000000     13
## 35 1.0000000      5
## 36 0.7647059     17
## 37 1.0000000      5
## 38 1.0000000      4
## 39 0.8333333      6
## 40 1.0000000      8
## 41 1.0000000      9
## 42 1.0000000      5
## 43 1.0000000      5
## 44 1.0000000      8
## 45 1.0000000      7
## 46 1.0000000     12
## 47 1.0000000     11
## 48 1.0000000      7
## 49 1.0000000      8
## 50 0.6000000      5
## 51 1.0000000     11
## 52 0.9090909     11
## 53 1.0000000     27
## 54 0.6000000      5
## 55 1.0000000      9
## 56 1.0000000      3
## 57 1.0000000     15
## 58 1.0000000     15
## 59 0.8000000      5
## 60 1.0000000      5
## 61 1.0000000      5
## 62 1.0000000      4
## 63 0.6000000      5
## 64 1.0000000      6
## 65 1.0000000      5
## 66 1.0000000      4
accuracy=sum(bestmatch[,1]*bestmatch[,2])/sum(bestmatch[,2])
print(paste("Accuracy in training data:",accuracy))## [1] "Accuracy in training data: 0.649238153767907"Overall:
about interpreting the results DBSCAN.
Extensions:
This worksheet applied dbscan and investigated its clustering. What about the other clustering methods we have discussed? Which performs best? Which ones even run on data of this size? What is the right metric of performance? Can we use the labels to choose the correct density threshold?
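On the question of a performance metric: when labels are available, one common external measure is the adjusted Rand index, which could also be used to compare density thresholds. A hedged sketch, assuming the mclust package (which provides adjustedRandIndex) is available; the eps grid is purely illustrative and rerunning dbscan several times may be slow:
if(!require("mclust")) install.packages("mclust") # for adjustedRandIndex
labels <- kddata[myindices,"normal"]
for(eps in c(0.001, 0.0001, 0.00001)){
  cl <- dbscan(testdatatrans.svd$u[myindices,1:npcs], eps)$cluster # recluster at this density threshold
  cat("eps =", eps, " ARI =", adjustedRandIndex(cl, labels), "\n")  # agreement between clusters and labels
}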
How would you go about making a real classifier from a clustering model, one that could be deployed on a test dataset?
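One rough sketch of how you might start, rather than a finished recipe: map each DBSCAN cluster to its most common label (as captured in myfreq above), then assign held-out points to clusters using the predict method that the dbscan package supplies for dbscan objects (points not within eps of a core point come back as noise, cluster 0). The names cluster_to_label, heldout and newpoints are just illustrative:
# Majority label for each cluster (rows of myfreq are clusters, columns are labels)
cluster_to_label <- colnames(myfreq)[apply(myfreq, 1, which.max)]
names(cluster_to_label) <- rownames(myfreq)

# Take some rows that were not used in the clustering and assign them to clusters
heldout <- setdiff(seq_len(nrow(testdatatrans.svd$u)), myindices)[1:1000]
newpoints <- testdatatrans.svd$u[heldout, 1:3]
newclusters <- predict(dbscanres, newdata = newpoints, data = testdatatrans.svd$u[myindices, 1:3])

# Translate cluster assignments into predicted labels; treat noise as "unknown"
predicted_labels <- cluster_to_label[as.character(newclusters)]
predicted_labels[newclusters == 0] <- NA
table(predicted_labels, useNA = "ifany")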