This workshop explores spectral clustering on cyber data using DBSCAN.
Remember to look at the code examples from the lectures for analogous data for other methods.
if(!require("purrr")) install.packages("purrr") # allows the nice %>% notation; part of the "tidyverse"
## Loading required package: purrr
if(!require("dbscan")) install.packages("dbscan") # density based clustering
## Loading required package: dbscan
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
if(!require("readr")) install.packages("readr") # for read_csv which is improved over read.csv
## Loading required package: readr
library("purrr")
library("readr")
We will continue with the KDD11 dataset from the first workshop. Recall that we downloaded it with:
if(!dir.exists(file.path("..","data"))) dir.create(file.path("..","data"))
if(!file.exists(file.path("..","data","kddcup.data_10_percent.zip"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.data_10_percent.zip", destfile=file.path("..","data","kddcup.data_10_percent.zip"))
if(!file.exists(file.path("..","data","kddcup.names"))) download.file("http://kdd.org/cupfiles/KDDCupData/1999/kddcup.names",destfile=file.path("..","data","kddcup.names"))
We’ll read it in as before.
kddata<-as.data.frame(read_csv(file.path("..","data","kddcup.data_10_percent.zip"),col_names=FALSE)) ## Ignore the warnings - there is a bug with the header
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 494021 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): X2, X3, X4, X42
## dbl (38): X1, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X1...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kddnames=read.table(file.path("..","data","kddcup.names"),sep=":",skip=1,as.is=T)
colnames(kddata)=c(kddnames[,1],"normal") # here we fix the bug with the header
goodcat=names(which(table(kddata[,"normal"])>1))
kddata=kddata[kddata[,"normal"]%in%goodcat,]
Now we make the data suitable for spectral embedding. There are several motivations here:
We will consider a few transformations here. You should rerun the analysis with different choices to get an understanding of how they effect the inference.
testdata=kddata[,c("src_bytes","dst_bytes",
"count","srv_count",
"dst_host_count","dst_host_srv_count")]
#transformation=function(x)x # Identity
#transformation=function(x)(x-mean(x))/sd(x) # standardize and center
#transformation=function(x)x/sd(x) # standardize only
transformation=function(x){xx=log(1+x);(xx)/sd(xx)} # log transform and then standardize
# transformation=function(x)log(1+x) # log transform
testdatatrans=apply(testdata,2,transformation)
Make a suitable “spectral embedding” (SVD or PCA) from the transformed test data, and plot the Eigenvalues to choose the number of PCs.
testdatatrans.svd <- svd(testdatatrans)
Plotting the eigenvalues (a “scree plot”):
plot(testdatatrans.svd$d,xlab="Eigenvalue index",ylab="Eigenvalue",log="")
about exploring the different transformations that are possible and how they impact inference.
With this transformation choice it looks like we want the first 3 PCs, which we will record in a variable:
npcs=3
Extensions: Try rerunning the following analysis using more or fewer PCs. How does the choice of representation effect our analysis - what if we scale differently? What if we normalise features?
about the correct number of PCs to retain.
We make a plot of the data, annotated by the “normal” category that it has been assigned. Because there are very many examples of some classes of data, we retain only up to 5000 of each class.
First we will obtain class labels the indices of different classes of data, and order them by the number of each class:
normaltab=table(kddata[,"normal"])
normallist=lapply(names(normaltab),function(x)which(kddata[,"normal"]==x))
normalsamples=lapply(normallist,function(x){
if(length(x)<5000) return(x)
return(sample(x,size = 5000))
})
names(normalsamples)=names(normaltab)
normalsizes=sapply(normalsamples,length)
normalsamples=normalsamples[order(normalsizes,decreasing=TRUE)]
normalsizes=normalsizes[order(normalsizes,decreasing=TRUE)]
normalsizes
## neptune. normal. smurf. back.
## 5000 5000 5000 2203
## satan. ipsweep. portsweep. warezclient.
## 1589 1247 1040 1020
## teardrop. pod. nmap. guess_passwd.
## 979 264 231 53
## buffer_overflow. land. warezmaster. imap.
## 30 21 20 12
## rootkit. loadmodule. ftp_write. multihop.
## 10 9 8 7
## phf. perl. spy.
## 4 3 2
From this we can see that we have downsampled neptune
,
normal
, and smurf
classes. We have very few
cases of several attacks.
Plot the PCA:
i=2;j=3
plot(testdatatrans.svd$u[,i],
testdatatrans.svd$u[,j],type="n",xlab=paste0("PC",i),ylab=paste0("PC",j),
col="#33333311",pch=19,cex=0.3) ## Plot where the points are omitted
legend("topleft",legend=names(normalsamples),text.col=1:length(normalsamples),col=1:length(normalsamples),pch=c(19,1,2)[ceiling((1:30)/10)],cex=0.5)
for(index in 1:length(normalsamples)){ ## Question: why would we plot the points in decreasing size order?
points(testdatatrans.svd$u[normalsamples[[index]],c(i,j)],
col=index+1,pch=c(19,1,2)[ceiling((1:30)/10)],cex=0.3)
}
This looks reassuring that there is structure to find for classification. The “line” structures are potentially interesting, and contrast to the “blob” structures in the larger clusters.
about the interpretation of these results.
Here we perform parameter tuning for clustering using “dbscan”.
First, we use the function “kNNdist” which obtains the k nearest neighbours. We will explore this to test what sort of thresholds for the parameter “eps” might be useful.
library("dbscan")
test=kNNdist(testdatatrans.svd$u[,1:npcs], k = 5,all=TRUE)
testmin=apply(test,1,min)
range(testmin)
## [1] 0.0000000000 0.0007231371
Now we’ll plot all of the minimum values. This gives the distance to the closest other point, for every point.
plot(sort(testmin[testmin>1e-8]),log="y",type="l")
threshholds= c(0.01,0.001,0.0001,0.00001,0.000001)
abline(h=c(0.01,0.001,0.0001,0.00001,0.000001))
abline(h=0.0001,col="red") # we chose this number.
## abline(h=0.001) # would give bigger clusters
## abline(h=0.00001) # would give smaller clusters
From this analysis, we can see that nearly all points are within 1e-04 of another point.
Extension: Experiment with other thresholds and see how the clustering behaves.
about running DBSCAN.
Unfortunately, dbscan runs out of memory when applied to the full dataset. Here we have
dim(testdatatrans.svd$u)[1]
## [1] 494020
so a small amount of downsampling should solve the problem. Make a list of indices that can be used to downsample the main dataset, keeping all examples of small classes and up to 60000 of the larger classes. You should get 188752 entries. Then use this clustering to run dbscan.
set.seed(1)
normalsamples2=lapply(normallist,function(x){
if(length(x)<60000) return(x)
return(sample(x,size = 60000))
})
myindices=unlist(normalsamples2) %>% sort
length(myindices)
## [1] 188752
Now do the clustering:
dbscanres=dbscan(testdatatrans.svd$u[myindices,1:3],0.0001)
We can examine the results in a simple way using “table”, to ask: How many classes are recovered? How do they associate with the labels?
table(dbscanres$cluster)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1665 125547 25 8 7 8 25 8 16 59991 11
## 11 12 13 14 15 16 17 18 19 20 21
## 7 8 23 7 9 5 92 15 6 6 20
## 22 23 24 25 26 27 28 29 30 31 32
## 23 810 52 6 6 6 5 6 6 38 8
## 33 34 35 36 37 38 39 40 41 42 43
## 7 13 5 17 5 4 6 8 9 5 5
## 44 45 46 47 48 49 50 51 52 53 54
## 8 7 12 11 7 8 5 11 11 27 5
## 55 56 57 58 59 60 61 62 63 64 65
## 9 3 15 15 5 5 5 4 5 6 5
## 66
## 4
Now we will display this with a heatmap:
##pdf("test_kddspectralembedding_dbscan.pdf",height=10,width=10)
plot(testdatatrans.svd$u[myindices,1],
testdatatrans.svd$u[myindices,2],
xlab="PC1",ylab="PC2",main=paste("K=41 + noise"),
col=c("#66666666",rainbow(41))[dbscanres$cluster+1],pch=19,cex=0.5)
plot(testdatatrans.svd$u[myindices,1],
testdatatrans.svd$u[myindices,3],xlab="PC1",ylab="PC3",main=paste("K=41 + noise"),
col=c("#66666666",rainbow(41))[dbscanres$cluster+1],pch=19,cex=0.5)
plot(testdatatrans.svd$u[myindices,2],
testdatatrans.svd$u[myindices,3],xlab="PC2",ylab="PC3",,main=paste("K=41 + noise"),
col=c("#66666666",rainbow(41))[dbscanres$cluster+1],pch=19,cex=0.5)
##dev.off()
Examining these plots, it is clear that there are some important distinctions between the inferred clusters. We can make a table of the labels and clustering:
mytable=data.frame(cluster=dbscanres$cluster,label=kddata[myindices,"normal"])
mytab=table(mytable)
myfreq=mytab/rowSums(mytab)
heatmap(t(myfreq),scale="none",main="Frequency of label given cluster",cexRow=0.7)
myfreq2=t(t(mytab)/colSums(mytab))
heatmap(t(myfreq2),scale="none",main="Frequency of cluster given label",cexRow = 0.7)
How well does the clustering describes each label?
bestmatch=cbind(bestfreq=apply(myfreq,1,max),count=apply(mytab,1,sum))
print(bestmatch)
## bestfreq count
## 0 0.7885886 1665
## 1 0.4771679 125547
## 2 0.4800000 25
## 3 1.0000000 8
## 4 1.0000000 7
## 5 1.0000000 8
## 6 1.0000000 25
## 7 1.0000000 8
## 8 1.0000000 16
## 9 0.9995166 59991
## 10 1.0000000 11
## 11 1.0000000 7
## 12 1.0000000 8
## 13 0.6086957 23
## 14 1.0000000 7
## 15 0.8888889 9
## 16 0.6000000 5
## 17 0.9782609 92
## 18 1.0000000 15
## 19 0.8333333 6
## 20 1.0000000 6
## 21 1.0000000 20
## 22 0.6956522 23
## 23 0.8370370 810
## 24 1.0000000 52
## 25 1.0000000 6
## 26 1.0000000 6
## 27 1.0000000 6
## 28 1.0000000 5
## 29 1.0000000 6
## 30 1.0000000 6
## 31 0.9473684 38
## 32 0.8750000 8
## 33 0.5714286 7
## 34 1.0000000 13
## 35 1.0000000 5
## 36 0.7647059 17
## 37 1.0000000 5
## 38 1.0000000 4
## 39 0.8333333 6
## 40 1.0000000 8
## 41 1.0000000 9
## 42 1.0000000 5
## 43 1.0000000 5
## 44 1.0000000 8
## 45 1.0000000 7
## 46 1.0000000 12
## 47 1.0000000 11
## 48 1.0000000 7
## 49 1.0000000 8
## 50 0.6000000 5
## 51 1.0000000 11
## 52 0.9090909 11
## 53 1.0000000 27
## 54 0.6000000 5
## 55 1.0000000 9
## 56 1.0000000 3
## 57 1.0000000 15
## 58 1.0000000 15
## 59 0.8000000 5
## 60 1.0000000 5
## 61 1.0000000 5
## 62 1.0000000 4
## 63 0.6000000 5
## 64 1.0000000 6
## 65 1.0000000 5
## 66 1.0000000 4
accuracy=sum(bestmatch[,1]*bestmatch[,2])/sum(bestmatch[,2])
print(paste("Accuracy in training data:",accuracy))
## [1] "Accuracy in training data: 0.649238153767907"
Overall:
about interpreting the results DBSCAN.
Extensions:
This worksheet applied dbscan and investigated its’ clustering. What about the other clustering methods discussed? Which is best? Which even run? What is the right metric of performance? Can we use labels to choose the correct density threshold?
How would you go about making a real classifier from a clustering model, that can be deployed on a training dataset?