# Workshop on Topic Modelling (Part 2, Main Content)

```
date: "Block 07"
author: "Daniel Lawson"
email: dan.lawson@bristol.ac.uk
output: html_document
version: 1.0.1
```

## Introduction

Topic models can be applied to cyber security data. For example:

* For a general discussion of threat detection using NLP, see e.g. https://www.endgame.com/blog/technical-blog/nlp-security-malicious-language-processing
* For a concrete example, see https://github.com/python-security/pyt

However, finding suitable data for these applications is more difficult. Traditional natural language processing (NLP) also has applications in cyber security. The examples

* Profiling Underground Economy Sellers
* Understanding Hacker Source Code

were given in "[Topic Modeling and Latent Dirichlet Allocation: An Overview](https://ai.arizona.edu/sites/ai/files/MIS611D/lda.pptx)" (Weifeng Li, Sagar Samtani and Hsinchun Chen; acknowledgements: David Blei, Princeton University, and The Stanford Natural Language Processing Group).

Further, Bobby Filar's [NLP For Security: Malicious Language Processing](https://www.endgame.com/blog/technical-blog/nlp-security-malicious-language-processing) describes the following areas:

* [Domain Generation Algorithm classification](http://conferences.sigcomm.org/imc/2010/papers/p48.pdf) – Using NLP to identify malicious domains (e.g., blbwpvcyztrepfue.ru) from benign domains (e.g., cnn.com)
* [Source Code Vulnerability Analysis](https://www.usenix.org/legacy/events/woot11/tech/slides/yamaguchi.pdf) – Determining function patterns associated with known vulnerabilities, then using NLP to identify other potentially vulnerable code segments.
* [Phishing Identification](http://nlp.uned.es/~lurdes/araujo/eswa13_malicious_tweets.pdf) – A bag-of-words model determines the probability that an email message contains a phishing attempt.
* [Malware Family Analysis](https://www.endgame.com/blog/examining-malware-python) – Topic modeling techniques assign samples of malware to families.

However, none of these come with data, so in this workshop we will work through a traditional text-based NLP example.

Additional references: a [Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) overview, and a description of [Coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/).

In [None]:
import pickle
import pandas as pd

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)

import nltk

## Data

First, load the data. This idea comes from [Susan Li on Towards Data Science](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) and the data is direct from [Kaggle million headlines](https://www.kaggle.com/therohk/million-headlines/data).

A reminder: We downloaded this in Part 1, from the [DSBristol github](https://github.com/dsbristol/dst/tree/master/data).

In [None]:
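# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('../data/abcnews-date-text.csv.gz', compression='gzip', on_bad_lines='skip')
data_text = data[['headline_text']].copy()  # copy, so that adding a column does not modify a view of `data`
data_text['index'] = data_text.index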
documents = data_text[0:100000]
print(len(documents))


Summaries of the data:

In [None]:
print(len(documents))
print(documents[:5])

The code below defines THREE choices of lemmatizer:
* "normalize_text" uses manual WordNet lemmatisation. It uses part-of-speech tags to work out whether each word is an adjective, verb, noun or adverb.
* "preprocess" doesn't bother with that; it applies the standard lemmatizer with its default part of speech.
* "prepare_text_for_lda" also lemmatizes, but it additionally removes stop words and tokens of four characters or fewer.
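
To make the filtering in "prepare_text_for_lda" concrete without needing any NLTK downloads, here is a minimal sketch of just its two filtering steps (length, then stop words). The tiny stop-word set and the whitespace tokenizer are assumptions made for illustration; the workshop code uses NLTK's English stop-word list and `nltk.word_tokenize`, and also lemmatizes.

```python
# Illustrative only: a hypothetical, tiny stop-word set (NLTK's list is much longer).
TOY_STOP_WORDS = {"the", "a", "an", "after", "about", "their", "there", "which"}

def filter_tokens(text, stop_words=TOY_STOP_WORDS, min_len=5):
    """Keep tokens with at least `min_len` characters that are not stop words."""
    tokens = text.split()  # whitespace split stands in for nltk.word_tokenize
    tokens = [t for t in tokens if len(t) >= min_len]  # equivalent to len(token) > 4
    return [t for t in tokens if t not in stop_words]

kept = filter_tokens("woman fined after aboriginal tent embassy raid")
print(kept)
```

Note how the length filter alone already removes short content words such as "tent" and "raid", which is worth bearing in mind when comparing the models later.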

In [None]:
## Needed for stop words (only)
en_stop = set(nltk.corpus.stopwords.words('english'))

In [None]:
import nltk
from nltk.corpus import wordnet

lmtzr = nltk.WordNetLemmatizer().lemmatize

## We look up whether a word is an adjective, verb, noun or adverb here.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

    
## This version uses word type. Needs the bigger NLTK download ("popular")
def normalize_text(text):
    ## Runs on a single document (a string); returns a list of lower-cased lemmas
    word_pos = nltk.pos_tag(nltk.word_tokenize(text))
    lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]

    return [x.lower() for x in lemm_words]

## This version doesn't require the "popular" download
def preprocess(text):
    ## Runs on a single document (a string); splits on whitespace
    lemmatizer = nltk.WordNetLemmatizer()
    return([lemmatizer.lemmatize(i) for i in text.split()])

################
## wordnet version
from nltk.corpus import wordnet as wn
def get_lemma(word):
    ## morphy does a lemma lookup and word standardization
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

## lemmatize
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

## This version is for comparison
def prepare_text_for_lda(text):
    ## Runs on a single document (a string); drops short tokens and stop words
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

Applying these to example text:

In [None]:
documents[documents['index'] == 16]

In [None]:
from gensim import parsing
doc_sample = documents[documents['index'] == 16].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(normalize_text(doc_sample))
print('\n\n simpler tokenized and lemmatized document: ')
print(preprocess(doc_sample))
print('\n\n method removing stop words: ')
print(prepare_text_for_lda(doc_sample))

Apply this to the dataset as a whole. (warning: takes a little time)

In [None]:
processed_docs = documents['headline_text'].map(preprocess) # preprocess is faster than normalize_text.
processed_docs[:10]

Now we'll make a dictionary and report some of the items in it.

In [None]:
dictionary = gensim.corpora.Dictionary(processed_docs)

count = 0
for k,v  in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

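It is important to get rid of extremes: very rare tokens carry too little information to define a topic, and very common tokens appear in almost every topic. This is one way to do it. Here `filter_extremes` drops tokens that appear in fewer than 15 documents or in more than 50% of documents, and then keeps at most the 100000 most frequent remaining tokens.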

In [None]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
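
As a rough illustration of what this filtering does, the sketch below applies the same three rules to a toy corpus using document frequencies. It is an approximation written for this workshop, not gensim's implementation (rounding and tie-breaking may differ), and the function name and toy documents are hypothetical.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=2, no_above=0.5, keep_n=None):
    """Return the set of tokens kept after document-frequency filtering."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(token for doc in docs for token in set(doc))
    kept = [t for t, n in df.items() if n >= no_below and n / n_docs <= no_above]
    # Keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda t: (-df[t], t))
    return set(kept[:keep_n])

toy_docs = [
    ["police", "probe", "fire"],
    ["police", "fire", "crews"],
    ["police", "council", "plan"],
    ["police", "council", "fire"],
    ["police", "rain"],
    ["police", "drought"],
]
kept_tokens = filter_extremes_sketch(toy_docs)
print(sorted(kept_tokens))
```

In the toy corpus "police" occurs in every document and is removed as too common, while words seen only once are removed as too rare.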

## Creating a corpus
Now we will map the documents into the bag-of-words model. As you can see, a corpus is simply a list of documents, each of which is a list of (token id, count) pairs.
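
The mapping can be sketched in a few lines of plain Python. This is a simplified stand-in for `Dictionary` and `doc2bow`, under the assumption that ids are assigned in order of first appearance (gensim may assign different ids); the helper names are hypothetical.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc2bow_sketch(doc, vocab):
    """Map a token list to sorted (token_id, count) pairs, dropping unknown tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [["rain", "helps", "farmers"], ["farmers", "want", "rain", "rain"]]
vocab = build_vocab(toy_docs)
bow_known = doc2bow_sketch(toy_docs[1], vocab)
bow_unseen = doc2bow_sketch(["badger", "want", "rain"], vocab)
print(bow_known)
print(bow_unseen)
```

Tokens that are not in the vocabulary (here "badger") are silently dropped, which matters when we later score modified or unseen headlines.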

However, creating them is slow enough that you might want to download them preprocessed from [dst-block7-lda.zip](https://github.com/dsbristol/dst/blob/master/data/dst-block7-lda.zip?raw=true) (downloaded in Part 1).

In [None]:
try:
    print("Reading corpus from pickle")
    bow_corpus=pickle.load(open('../data/bow_corpus.pkl', 'rb'))
except FileNotFoundError:
    print("Creating corpus and saving to pickle")
    bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
    pickle.dump(bow_corpus, open('../data/bow_corpus.pkl', 'wb'))
    pickle.dump(dictionary, open('../data/dictionary.pkl', 'wb'))

bow_corpus[16]

In [None]:
bow_doc_16 = bow_corpus[16]  # the same document as displayed in the previous cell

for i in range(len(bow_doc_16)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_16[i][0], 
                                               dictionary[bow_doc_16[i][0]], 
                                                bow_doc_16[i][1]))

The following is a version with stop words removed:

In [None]:
processed_docs2 = documents['headline_text'].map(prepare_text_for_lda) 
processed_docs2[:10]

In [None]:
dictionary2 = gensim.corpora.Dictionary(processed_docs2)

count2 = 0
for k, v in dictionary2.iteritems():
    print(k, v)
    count2 += 1
    if count2 > 10:
        break

In [None]:
dictionary2.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [None]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

## Making the corpus

The code below **remakes** the corpus, simply by looping over the words in the document that are in the reduced dictionary, and mapping them to their sparse vector notation.

However, this is quite slow, so we instead check whether it was pre-created and saved to file. If so, we read it from file instead.
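
The load-or-create pattern used in these cells can also be written once as a small helper. The sketch below is one possible way to do it; `load_or_create` and the toy builder are hypothetical names, not part of the workshop code. Using `with` ensures the files are closed.

```python
import os
import pickle
import tempfile

def load_or_create(path, builder):
    """Load a pickled object from `path`, or build it, save it and return it."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        obj = builder()
        with open(path, "wb") as fh:
            pickle.dump(obj, fh)
        return obj

# Usage with a throwaway directory and a stand-in for the slow computation.
calls = []
def slow_builder():
    calls.append(1)
    return [[(0, 1), (3, 2)]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "toy_corpus.pkl")
    first = load_or_create(cache, slow_builder)   # builds and saves
    second = load_or_create(cache, slow_builder)  # reads from file
print(first == second, len(calls))
```

One caveat of any such cache: if the preprocessing changes, the saved file must be deleted, otherwise the stale version is silently reused.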

In [None]:
try:
    print("Reading corpus from pickle...")
    bow_corpus2=pickle.load(open('../data/bow_corpus2.pkl', 'rb'))
except FileNotFoundError:
    print("Reading corpus failed.")
    print("Creating corpus and saving to pickle")
    bow_corpus2 = [dictionary2.doc2bow(doc) for doc in processed_docs2]
    pickle.dump(bow_corpus2, open('../data/bow_corpus2.pkl', 'wb'))
    pickle.dump(dictionary2, open('../data/dictionary2.pkl', 'wb'))

bow_corpus2[16]

## Making an LDA model

This is the key component of an LDA model: defining the model with a specified corpus and dictionary.

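Note that we also have to specify how many topics we will generate as well as the number of passes through the data. Because the inference algorithm is randomly initialised and depends on the order in which documents are processed (not on word order within a document, which the bag-of-words model discards), we can get different answers when rerunning.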

In [None]:
try:
    lda_model=pickle.load(open('../data/lda_model.pkl', 'rb'))
    print("Reading lda_model from pickle")
except FileNotFoundError:
    print("Creating lda_model and saving to pickle")
    lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
    pickle.dump(lda_model,open('../data/lda_model.pkl','wb'))

In [None]:
try:
    lda_model2=pickle.load(open('../data/lda_model2.pkl', 'rb'))
    print("Reading lda_model2 from pickle")
except FileNotFoundError:
    print("Creating lda_model2 and saving to pickle")
    lda_model2 = gensim.models.LdaMulticore(bow_corpus2, num_topics=10, id2word=dictionary2, passes=2, workers=2)
    pickle.dump(lda_model2,open('../data/lda_model2.pkl','wb'))

Now we'll explore the model a little.

First we compare a document to its topic representation:

In [None]:
documents['headline_text'][89]

In [None]:
lda_model[bow_corpus[89]]

Modify the text a little to see how the topics change:

In [None]:
text='woman fined after aboriginal tent embassy raid'
pptext=preprocess(text)
lda_model[dictionary.doc2bow(pptext)]

In [None]:
text='badger fined after aboriginal tent embassy raid'
pptext=preprocess(text)
lda_model[dictionary.doc2bow(pptext)]


In [None]:
text='man fined after badger tent embassy raid'
pptext=preprocess(text)
lda_model[dictionary.doc2bow(pptext)]


In [None]:
lda_model.show_topics(20,7)

## tf-idf model

Now we rerun with tf-idf.

We keep the exact same dictionaries, but use the tf-idf weights.
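
For intuition about what the tf-idf weights do to a bag-of-words corpus, here is a minimal sketch. It assumes the weighting count × log2(N / document frequency) followed by scaling each document to unit length, which to my understanding is close to gensim's defaults; treat the details as an assumption rather than a description of `TfidfModel`.

```python
import math
from collections import Counter

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalise."""
    n_docs = len(bow_corpus)
    df = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in doc]
        vec = [(tid, w) for tid, w in vec if w > 0.0]  # tokens in every document get zero weight
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

toy_bow = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1), (1, 1), (3, 1)]]
toy_tfidf = tfidf_sketch(toy_bow)
for doc in toy_tfidf:
    print([(tid, round(w, 3)) for tid, w in doc])
```

Token 0 appears in every toy document and so disappears entirely, while rarer tokens dominate each document's vector. The resulting weights are no longer integer counts, which is relevant when interpreting likelihood-based scores later.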

In [None]:
try:
    lda_model_tfidf=pickle.load(open('../data/lda_model_tfidf.pkl', 'rb'))
    print("Reading lda_model_tfidf from pickle")
except FileNotFoundError:
    print("Creating lda_model_tfidf and saving to pickle")
    lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
    pickle.dump(lda_model_tfidf,open('../data/lda_model_tfidf.pkl','wb'))  # save the tf-idf model, not lda_model

In [None]:
lda_model_tfidf.show_topics(10,7)

In [None]:
tfidf2 = models.TfidfModel(bow_corpus2)
corpus_tfidf2 = tfidf2[bow_corpus2]  # use tfidf2, which was fitted on bow_corpus2 and dictionary2's ids

In [None]:
try:
    lda_model_tfidf2=pickle.load(open('../data/lda_model_tfidf2.pkl', 'rb'))
    print("Reading lda_model_tfidf2 from pickle")
except FileNotFoundError:
    print("Creating lda_model_tfidf2 and saving to pickle")
    lda_model_tfidf2 = gensim.models.LdaMulticore(corpus_tfidf2, num_topics=10, id2word=dictionary2, passes=2, workers=4)
    pickle.dump(lda_model_tfidf2,open('../data/lda_model_tfidf2.pkl','wb'))  # save the tf-idf model, not lda_model2

In [None]:
## Testing on out-of-sample data
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
#unseen_document='american coppers found eating donuts once again'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
bow_vector2 = dictionary2.doc2bow(prepare_text_for_lda(unseen_document))
print(lda_model2[bow_vector2])
print(lda_model_tfidf2[bow_vector2])

## Visualisation

The pyLDAvis package has a nice interactive visualisation designed for gensim.

We have to prepare the data, which is again a bit slow so I provide the pkl versions of these objects.

In [None]:
# NB newer pyLDAvis releases (3.x) renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [None]:
try:
    lda_display=pickle.load(open('../data/lda_display.pkl', 'rb'))
    print("Reading lda_display from pickle")
except FileNotFoundError:
    print("Creating lda_display and saving to pickle")
    lda_display = pyLDAvis.gensim.prepare(lda_model, bow_corpus, 
                                          dictionary, mds='mmds')
    pickle.dump(lda_display,open('../data/lda_display.pkl','wb'))

In [None]:
try:
    lda_display2=pickle.load(open('../data/lda_display2.pkl', 'rb'))
    print("Reading lda_display2 from pickle")
except FileNotFoundError:
    print("Creating lda_display2 and saving to pickle")
    lda_display2 = pyLDAvis.gensim.prepare(lda_model2, bow_corpus2, 
                                          dictionary2, mds='mmds')
    pickle.dump(lda_display2,open('../data/lda_display2.pkl','wb'))


In [None]:
try:
    lda_display_tfidf2=pickle.load(open('../data/lda_display_tfidf2.pkl', 'rb'))
    print("Reading lda_display_tfidf2 from pickle")
except FileNotFoundError:
    print("Creating lda_display_tfidf2 and saving to pickle")
    lda_display_tfidf2 = pyLDAvis.gensim.prepare(lda_model_tfidf2, 
                                                 corpus_tfidf2, dictionary2, sort_topics=False)
    pickle.dump(lda_display_tfidf2,open('../data/lda_display_tfidf2.pkl','wb'))


Now we will display the visualisation. Worked examples of it can be found in many places, including:

https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21


In [None]:
# NB under some circumstances you need show, under others you need display. It appears to be a known bug.
pyLDAvis.display(lda_display, template_type='notebook')

In [None]:
pyLDAvis.display(lda_display2, template_type='notebook') # NB under some circumstances you need show, under others you need display. It appears to be a known bug.

Now visualise the tfidf2 model, which "should" be our best model.

What can we learn about the topics here that is different to the above?

In [None]:
pyLDAvis.display(lda_display_tfidf2, template_type='notebook') # NB under some circumstances you need show, under others you need display. It appears to be a known bug.

## Return to topics


In [None]:
for index, score in sorted(lda_model2[bow_vector2], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model2.print_topic(index, 10)))

In [None]:
for index, score in sorted(lda_model2[bow_vector2], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model2.print_topic(index, 5)))

In [None]:
for index, score in sorted(lda_model_tfidf2[bow_vector2], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf2.print_topic(index, 5)))

## Perplexity and Coherence

Question: What does a "good" model prediction look like?

The below examines the scores that we have.

In [None]:
## Sadly very slow, so we only look at the first few documents.
log_perplexities = {"lda_model": lda_model.log_perplexity(bow_corpus[0:1000]), 
     "lda_model2" : lda_model2.log_perplexity(bow_corpus2[0:1000]),
     "lda_model_tfidf" : lda_model_tfidf.log_perplexity(corpus_tfidf[0:1000]),
     "lda_model_tfidf2" : lda_model_tfidf2.log_perplexity(corpus_tfidf2[0:1000])
    };
# gensim returns a per-word likelihood bound: perplexity = 2**(-bound), so lower perplexity (a higher bound) is better.
log_perplexities
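
When reading these numbers, note that gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; the two are related by perplexity = 2^(-bound). The sketch below just performs that conversion. The bounds used are hypothetical values chosen for illustration, not results from the workshop models.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity: 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)

# Hypothetical bounds, for illustration only.
toy_bounds = {"model_a": -8.0, "model_b": -10.0}
toy_perplexities = {name: perplexity_from_bound(b) for name, b in toy_bounds.items()}
print(toy_perplexities)
```

A bound closer to zero therefore corresponds to a smaller perplexity.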

Question: Why is the tf-idf model so much better in this measure? Does the performance measure capture your intuition about what a good topic model is?

Now we compute the intrinsic coherence to check the quality of the fit.
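
As a rough guide to what `u_mass` measures, the sketch below scores a single topic from document co-occurrence counts, using the commonly cited UMass form: the sum over word pairs of log((D(w_later, w_earlier) + 1) / D(w_earlier)), where D counts documents containing the given words. This is a simplified illustration on toy data; gensim's implementation differs in details such as smoothing and averaging, and the function name is hypothetical.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic from document co-occurrence counts."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for later in range(1, len(top_words)):
        for earlier in range(later):
            joint = doc_count(top_words[later], top_words[earlier])
            score += math.log((joint + 1) / doc_count(top_words[earlier]))
    return score

toy_docs = [
    ["rain", "flood", "river"],
    ["rain", "flood"],
    ["court", "judge"],
    ["court", "judge", "rain"],
]
coherent = umass_coherence(["rain", "flood"], toy_docs)
mixed = umass_coherence(["rain", "judge"], toy_docs)
print(coherent, mixed)
```

Topics whose top words tend to occur in the same documents score closer to zero; topics that mix unrelated words score more negatively. The sketch assumes every top word occurs in at least one document.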

In [None]:
from gensim.models.coherencemodel import CoherenceModel
def getCoherence(m,c,d):
    coherence_model_lda = CoherenceModel(model=m,corpus=c, dictionary=d, coherence='u_mass')
    coherence_lda = coherence_model_lda.get_coherence()
    return(coherence_lda)

In [None]:
### Compute Coherence Score
coherences={
    "lda_model": getCoherence(lda_model,bow_corpus[0:1000],dictionary),
    "lda_model2": getCoherence(lda_model2,bow_corpus2[0:1000],dictionary2),
    "lda_model_tfidf": getCoherence(lda_model_tfidf,corpus_tfidf[0:1000],dictionary),
    "lda_model_tfidf2": getCoherence(lda_model_tfidf2,corpus_tfidf2[0:1000],dictionary2)
}
# a different measure of how good the model is. Higher is better.
coherences

Question: Why is the version of the data in which we removed stop words performing worse? 

## Return to out-of-sample performance

In [None]:
oos_coherences={
    "lda_model": getCoherence(lda_model,[bow_vector],dictionary),
    "lda_model2": getCoherence(lda_model2,[bow_vector2],dictionary2),
    "lda_model_tfidf": getCoherence(lda_model_tfidf,[bow_vector],dictionary),
    "lda_model_tfidf2": getCoherence(lda_model_tfidf2,[bow_vector2],dictionary2)
};
oos_coherences

Question: what would be the most appropriate way to test an LDA model? Would it make a difference to possess labels? How would you use them?

## Challenge: what is the "best" value of K to use? 
How would you evaluate it? How would you handle the model runtime?
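
One possible starting point, sketched under stated assumptions: fit one model per candidate K and compare a score where higher is treated as better. The `fit` and `score` arguments are hypothetical callables; in this workshop `fit` might wrap `gensim.models.LdaMulticore(..., num_topics=k)` and `score` might wrap `getCoherence` on a subsample of documents to keep the runtime manageable. The stand-ins below exist only so the sketch runs without gensim.

```python
def choose_k(candidate_ks, fit, score):
    """Fit one model per K and return (best_k, scores); higher score is treated as better."""
    scores = {k: score(fit(k)) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Stand-ins so the sketch runs without gensim; replace with real model fitting.
def toy_fit(k):
    return {"num_topics": k}

def toy_score(model):
    return -abs(model["num_topics"] - 10)  # hypothetical score that peaks at K = 10

best_k, scores = choose_k([5, 10, 20], toy_fit, toy_score)
print(best_k, scores)
```

Whether a single score is the right criterion is exactly the question posed above; the sketch only organises the comparison.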

## Conclusions

What conclusions would you draw from this procedure?

## Appendix

NLTK includes synonyms, dictionary definitions, antonyms, and more; all available for automated processing.

Some tasters:

In [None]:
from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

In [None]:
syn

In [None]:
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)