# 8.3.2 Workshop on Topic Modelling (Part 2, Main Content)

```
date: "Block 08"
author: "Daniel Lawson"
email: dan.lawson@bristol.ac.uk
output: html_document
version: 2.0.0
```

# 0. Introduction

We have setup the data and libraries in the Block 08.3.1 section.


Additional references:
A [Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) overview, and a description of 
[Coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/).

In [2]:
import pickle
import pandas as pd

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)

import nltk
import os

# 1.0 Pre-processing data

### 1.1.1 Goal of pre-processing

* Identify or remove special words (emoticons, hashtags),
* Remove common words ("stop words"),
* Lemmatise or stem (standardize endings),
* Where multiple meanings exist, use context to deduce correct one (noun/verb/adjective?).
* We cover these details in the workshop.

## 1.1.2 Data from unusual sources

* Use a converter to 'plain text':
* [textract](https://textract.readthedocs.io/en/stable/):

```python
### **textract** for converting from a wide
### range of sources including MS and pdf
import textract
text = textract.process("path/to/file.extension")
```

* [pdfminer](https://pypi.org/project/pdfminer/):
```python
### dedicated tool: should be better performance
import pdfminer
convert_pdf_to_txt('file name')
```

## 1.2 Overview on regexp

* You need to know the basics of **regular expressions** to cut the text down to the core text.
	* Regular expressions are a very general syntax for specifying search patterns.
	* Remove the **punctuation** marks: ',.;:?!'
	* Remove the **stop-words**, like "I", "and", and "the"
	* Remove too **common words**
	* **Standardize** spacing: double spaces, tabs, newlines
	* What do you want to do with special words and characters? e.g. Twitter "rt", "@user", "#hashtag!"

* Correct cleaning is **context specific**.
  * Legal documents are different to tweets, html, blog posts, etc!
* It is unlikely that the same subject discussed in two different fora will look the same to a topic model!


## 1.2.1 Regexp 

* Essential for pre-cleaning your data.
* See the [Python Documentation](https://docs.python.org/3/library/re.html).
* Regular expressions can contain both special and ordinary characters.
* Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves.
* Some characters, like '|' or '(', are special.
* Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
* Repetition qualifiers (*, +, ?, {m,n}, etc) define how many characters are wanted.

### 1.2.2 Regexp in python

* Basic usage:
```python
match = re.search(pattern, string)
if match:
    process(match)
```

* Many more complex possibilities exist!
* Search/Replace/Group/Split etc.
* Basic usage is massively helpful.
* Lookup more complex problems.

### 1.2.3 Regexp special characters

* `\`: Escape special character.
* `.` (dot): match any character
  * `r"me."`: matches the string `men` or `met` but not `me` at the end of a word.
* `^` (caret): start of string
  * `r"^me"`: matches `me` at the start only (`meaning`)
* `$` (dollar): end of string/final character before newline
  * `r"me$"`: matches `me` at the end only (`biome`)
* `*` (star): 0 or more matches of preceding RE
  * `r"file.*\.txt"`: matches all strings of the form "file", anything, and ".txt"
* `+` (plus): 1 or more matches of preceding RE
  * `r"file.+\.txt"`: matches "file", any one character, and ".txt"
* `[]`: Set of characters.
  * `r"file[0-9]+\.txt"`: matches forms like "file5.txt"
* `{m}`: match m copies of receding RE
  * `r"file[0-9]{3}\.txt"`: matches forms like "file005.txt"

### 1.2.4 Example of cleaning (Text) data with regexp

```python
import re
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub('[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
	return text
```


# 2. Data

First, load the data. This idea comes from [Susan Li on Towards Data Science](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) and the data is direct from [Kaggle million headlines](https://www.kaggle.com/therohk/million-headlines/data).

A reminder: We downloaded this in Part 1, from the [DSBristol github](https://github.com/dsbristol/dst/tree/master/data).

In [3]:
data = pd.read_csv(os.path.join('..', 'data', 'abcnews-date-text.csv.gz'), compression='gzip',error_bad_lines=False);
data_text = data[['headline_text']]
data_text['index'] = data_text.index
documents = data_text[0:100000]
print(len(documents))




  data = pd.read_csv(os.path.join('..', 'data', 'abcnews-date-text.csv.gz'), compression='gzip',error_bad_lines=False);


100000


## 2.1 Data Pre-processing

Summaries of the data:

In [4]:
print(len(documents))
print(documents[:5])

100000
                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4


The code below defines THREE choices of lemmatizer:
* "normalise_text" uses manual wordnet lemmatisation. It tries to use word position to figure out whether something is and adjective, verb, noun or adverb. 
* "preprocess" doesn't bother with that, it uses a standard lemmatizer.
* "prepare_text_for_lda" also lemmatizes, but it also handles stop words.

In [5]:
## Needed for stop words (only)
en_stop = set(nltk.corpus.stopwords.words('english'))
print(en_stop)

{'above', 'can', 'had', 'or', 'such', 's', 'his', 'so', 'was', 'doing', 'mustn', 'now', "couldn't", 'few', 'theirs', "shouldn't", "don't", 'hers', 'themselves', 'aren', 'yourselves', 'what', 'there', 'below', 'against', 'wouldn', 'he', 'i', 'during', 'weren', 'over', 'needn', 'than', 'these', 'up', 'but', 'they', 'out', 'under', 'should', 'further', 'been', 'no', 'about', 'ours', 'hadn', 'this', 'our', 'all', 'of', 'ourselves', 'have', 'those', 'while', 'how', "wouldn't", 'has', 'own', 'did', 'd', 'other', 'very', 'her', 'were', 'off', 'the', 'isn', 'ain', 'again', 'm', "won't", "you'd", 'any', 'being', 'once', 'by', 'does', 'are', 'a', 'o', 've', 'hasn', "hadn't", 'will', 'that', 'why', 'not', 'most', 'them', "isn't", 'll', "shan't", "it's", 'which', 'as', 'more', 'we', 'with', 'from', "didn't", "hasn't", 'haven', 'herself', 'its', 'down', 'into', 'it', 'for', 'am', 'in', 'both', 'yours', "aren't", 'yourself', 't', 'through', "mustn't", "needn't", 'if', "you'll", 'to', 'just', 'their'

In [6]:
import nltk
from nltk.corpus import wordnet

lmtzr = nltk.WordNetLemmatizer().lemmatize

## We lookup whether a word is and adjective, verb, noun or adverb here.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

    
## This version uses word type. Needs the bigger nltp download ("popular")
def normalize_text(text):
    ## Runs on documents (vector of words)
    word_pos = nltk.pos_tag(nltk.word_tokenize(text))
    lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]

    return [x.lower() for x in lemm_words]

##Â This version doesn't require the "popular" download
def preprocess(text):
    ## Runs on documents (vector of words)
    lemmatizer = nltk.WordNetLemmatizer()
    return([lemmatizer.lemmatize(i) for i in text.split()])

################
## wordnet version
from nltk.corpus import wordnet as wn
def get_lemma(word):
    ## morphy does a lemma lookup and word standardization
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

## lemmatize
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

## This version is for comparison
def prepare_text_for_lda(text):
    ## Runs on documents (vector of words)
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

Applying these to example text:

In [7]:
documents[documents['index'] == 16]

Unnamed: 0,headline_text,index
16,brigadier dismisses reports troops harassed in,16


In [8]:
from gensim import parsing
doc_sample = documents[documents['index'] == 16].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(normalize_text(doc_sample))
print('\n\n simpler tokenized and lemmatized document: ')
print(preprocess(doc_sample))
print('\n\n method removing stop words: ')
print(prepare_text_for_lda(doc_sample))

original document: 
['brigadier', 'dismisses', 'reports', 'troops', 'harassed', 'in']


 tokenized and lemmatized document: 
['brigadier', 'dismisses', 'report', 'troop', 'harass', 'in']


 simpler tokenized and lemmatized document: 
['brigadier', 'dismisses', 'report', 'troop', 'harassed', 'in']


 method removing stop words: 
['brigadier', 'dismiss', 'report', 'troops', 'harass']


Apply this to the dataset as a whole. (warning: takes a little time)

In [9]:
processed_docs = documents['headline_text'].map(preprocess) # preprocess is faster than normalise_text.
processed_docs[:10]

0    [aba, decides, against, community, broadcastin...
1    [act, fire, witness, must, be, aware, of, defa...
2    [a, g, call, for, infrastructure, protection, ...
3    [air, nz, staff, in, aust, strike, for, pay, r...
4    [air, nz, strike, to, affect, australian, trav...
5               [ambitious, olsson, win, triple, jump]
6    [antic, delighted, with, record, breaking, barca]
7    [aussie, qualifier, stosur, waste, four, memph...
8    [aust, address, un, security, council, over, i...
9    [australia, is, locked, into, war, timetable, ...
Name: headline_text, dtype: object

Now we'll make a dictionary and report some of the items in it.

In [10]:
dictionary = gensim.corpora.Dictionary(processed_docs)

count = 0
for k,v  in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 aba
1 against
2 broadcasting
3 community
4 decides
5 licence
6 act
7 aware
8 be
9 defamation
10 fire


It is important to get rid of extremes. This is one way to do it.

In [11]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# 3. The corpus

Now we will map the documents into the bag of words model. As you can see, a corpus is simply a list of documents, each of which is a list of words.

However, creating them is slow enough so that you might want to download them preprocessed from [dst-block7-lda.zip](https://github.com/dsbristol/dst/blob/master/data/dst-block7-lda.zip?raw=true). (Downloaded in part1).

## 3.1 Constructing the corpus

In [12]:
try:
    print("Reading corpus from pickle")
    bow_corpus=pickle.load(open(os.path.join('..', 'data','bow_corpus.pkl'), 'rb'))
except FileNotFoundError:
    print("Creating corpus and saving to pickle")
    bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
    pickle.dump(bow_corpus, open(os.path.join('..', 'data', 'bow_corpus.pkl'), 'wb'))
    pickle.dump(dictionary, open(os.path.join('..', 'data', 'dictionary.pkl'), 'wb'))

bow_corpus[16]

Reading corpus from pickle


[(21, 1), (78, 1), (79, 1), (80, 1)]

In [15]:
bow_doc_16 = bow_corpus[245]

for i in range(len(bow_doc_16)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_16[i][0], 
                                               dictionary[bow_doc_16[i][0]], 
                                                bow_doc_16[i][1]))

Word 13 ("call") appears 1 time.
Word 798 ("distance") appears 1 time.
Word 799 ("it") appears 1 time.
Word 800 ("quits") appears 1 time.
Word 801 ("swimmer") appears 1 time.


The following is a version with stop words removed:

In [16]:
processed_docs2 = documents['headline_text'].map(prepare_text_for_lda) 
processed_docs2[:10]

0           [decide, community, broadcasting, licence]
1                         [witness, aware, defamation]
2           [call, infrastructure, protection, summit]
3                                      [staff, strike]
4              [strike, affect, australian, traveller]
5                          [ambitious, olsson, triple]
6            [antic, delight, record, breaking, barca]
7    [aussie, qualifier, stosur, waste, memphis, ma...
8                         [address, security, council]
9                         [australia, lock, timetable]
Name: headline_text, dtype: object

In [17]:
dictionary2 = gensim.corpora.Dictionary(processed_docs2)

count2 = 0
for k, v in dictionary2.iteritems():
    print(k, v)
    count2 += 1
    if count2 > 10:
        break

0 broadcasting
1 community
2 decide
3 licence
4 aware
5 defamation
6 witness
7 call
8 infrastructure
9 protection
10 summit


In [18]:
dictionary2.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [19]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.6356876402708115),
 (1, 0.35528667888075016),
 (2, 0.42043632764557687),
 (3, 0.5412078105614471)]


## 3.2 Making the corpus

The code below **remakes** the corpus, simply by looping over the words in the document that are in the reduced dictionary, and mapping them to their sparse vector notation.

However, this is quite slow, so we instead check whether it was pre-created and saved to file. If so, we read it from file instead.

In [21]:
try:
    print("Reading corpus from pickle...")
    bow_corpus2=pickle.load(open(os.path.join('..', 'data', 'bow_corpus2.pkl'), 'rb'))
except FileNotFoundError:
    print("Reading corpus failed.")
    print("Creating corpus and saving to pickle")
    bow_corpus2 = [dictionary2.doc2bow(doc) for doc in processed_docs2]
    pickle.dump(bow_corpus2, open(os.path.join('..', 'data', 'bow_corpus2.pkl'), 'wb'))
    pickle.dump(dictionary2, open(os.path.join('..', 'data', 'dictionary2.pkl'), 'wb'))

bow_corpus2[16]

Reading corpus from pickle...


[(48, 1), (49, 1), (50, 1)]

# 4. Making an LDA model

## 4.1 Running gensim/LDA

This is the key component of an LDA model: defining the model with a specified corpus and dictionary.

Note that we also have to specify how many topics we will generate as well as the number of passes through the data. Because the inference algorithm is sensitive to word order, we can get different answers when rerunning.

In order to avoid expensive rerunning, we will save the results to disk. 

In [22]:
try:
    lda_model=pickle.load(open(os.path.join('..', 'data', 'lda_model.pkl'), 'rb'))
    print("Reading lda_model from pickle")
except FileNotFoundError:
    print("Creating lda_model and saving to pickle")
    lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
    pickle.dump(lda_model,open(os.path.join('..', 'data', 'lda_model.pkl'),'wb'))

Reading lda_model from pickle


In [23]:
try:
    lda_model2=pickle.load(open(os.path.join('..', 'data', 'lda_model2.pkl'), 'rb'))
    print("Reading lda_model2 from pickle")
except FileNotFoundError:
    print("Creating lda_model2 and saving to pickle")
    lda_model2 = gensim.models.LdaMulticore(bow_corpus2, num_topics=10, id2word=dictionary2, passes=2, workers=2)
    pickle.dump(lda_model2,open(os.path.join('..', 'data', 'lda_model2.pkl'),'wb'))

Reading lda_model2 from pickle


Now we'll explore the model a little.

First we compare a document to its topic representation:

In [24]:
documents['headline_text'][89]

'man fined after aboriginal tent embassy raid'

In [25]:
lda_model[bow_corpus[89]]

[(0, 0.26754367),
 (1, 0.012505235),
 (2, 0.6324094),
 (3, 0.012504951),
 (4, 0.012504956),
 (5, 0.012504924),
 (6, 0.01250615),
 (7, 0.012505493),
 (8, 0.0125053255),
 (9, 0.01250989)]

Modify the text a little to see how the topics change:

In [26]:
text='woman fined after aboriginal tent embassy raid'
pptext=preprocess(text)
lda_model[dictionary.doc2bow(pptext)]

[(0, 0.2737118),
 (1, 0.012505458),
 (2, 0.62624),
 (3, 0.012504904),
 (4, 0.012504909),
 (5, 0.012504883),
 (6, 0.012507135),
 (7, 0.012505512),
 (8, 0.01250532),
 (9, 0.012510036)]

In [28]:
text='cat fined after aboriginal tent embassy raid'
pptext=preprocess(text)
lda_model[dictionary.doc2bow(pptext)]


[(0, 0.2710001),
 (1, 0.012505789),
 (2, 0.4050364),
 (3, 0.012506005),
 (4, 0.012505366),
 (5, 0.012505312),
 (6, 0.0125067),
 (7, 0.012505885),
 (8, 0.012505906),
 (9, 0.23642255)]

In [52]:
text='man fined after cat tent embassy raid'
pptext=preprocess(text)
lda_model[dictionary.doc2bow(pptext)]


[(0, 0.13960162),
 (1, 0.012510224),
 (2, 0.54823977),
 (3, 0.012510606),
 (4, 0.01250992),
 (5, 0.012509877),
 (6, 0.012510563),
 (7, 0.012509901),
 (8, 0.01251019),
 (9, 0.22458732)]

In [29]:
lda_model.show_topics(20,7)

[(0,
  '0.034*"over" + 0.030*"for" + 0.018*"plan" + 0.016*"to" + 0.012*"govt" + 0.012*"concern" + 0.011*"group"'),
 (1,
  '0.059*"in" + 0.021*"of" + 0.020*"win" + 0.019*"for" + 0.018*"the" + 0.016*"out" + 0.015*"to"'),
 (2,
  '0.060*"in" + 0.033*"police" + 0.031*"man" + 0.028*"over" + 0.022*"to" + 0.021*"court" + 0.017*"after"'),
 (3,
  '0.040*"u" + 0.038*"to" + 0.020*"water" + 0.020*"in" + 0.015*"iraqi" + 0.013*"chief" + 0.011*"of"'),
 (4,
  '0.044*"for" + 0.031*"to" + 0.023*"up" + 0.020*"iraq" + 0.016*"of" + 0.016*"say" + 0.016*"pm"'),
 (5,
  '0.050*"to" + 0.018*"health" + 0.013*"budget" + 0.010*"study" + 0.009*"minister" + 0.009*"in" + 0.009*"on"'),
 (6,
  '0.030*"in" + 0.027*"of" + 0.022*"for" + 0.021*"police" + 0.015*"to" + 0.013*"crash" + 0.013*"missing"'),
 (7,
  '0.092*"to" + 0.042*"on" + 0.023*"govt" + 0.016*"urged" + 0.014*"council" + 0.010*"of" + 0.010*"plan"'),
 (8,
  '0.081*"to" + 0.023*"for" + 0.018*"off" + 0.015*"a" + 0.014*"on" + 0.014*"be" + 0.012*"return"'),
 (9,
  '0

## 4.2 tf-idf model

Now we rerun with the tf-idf data pre-processing.

We keep the exact same dictionaries, but use the tf-idf weights.

In [30]:
try:
    lda_model_tfidf=pickle.load(open(os.path.join('..', 'data', 'lda_model_tfidf.pkl'), 'rb'))
    print("Reading lda_model_tfidf from pickle")
except FileNotFoundError:
    print("Creating lda_model_tfidf and saving to pickle")
    lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
    pickle.dump(lda_model,open(os.path.join('..', 'data', 'lda_model_tfidf.pkl'),'wb'))

Reading lda_model_tfidf from pickle


In [31]:
lda_model_tfidf.show_topics(10,7)

[(0,
  '0.034*"over" + 0.030*"for" + 0.018*"plan" + 0.016*"to" + 0.012*"govt" + 0.012*"concern" + 0.011*"group"'),
 (1,
  '0.059*"in" + 0.021*"of" + 0.020*"win" + 0.019*"for" + 0.018*"the" + 0.016*"out" + 0.015*"to"'),
 (2,
  '0.060*"in" + 0.033*"police" + 0.031*"man" + 0.028*"over" + 0.022*"to" + 0.021*"court" + 0.017*"after"'),
 (3,
  '0.040*"u" + 0.038*"to" + 0.020*"water" + 0.020*"in" + 0.015*"iraqi" + 0.013*"chief" + 0.011*"of"'),
 (4,
  '0.044*"for" + 0.031*"to" + 0.023*"up" + 0.020*"iraq" + 0.016*"of" + 0.016*"say" + 0.016*"pm"'),
 (5,
  '0.050*"to" + 0.018*"health" + 0.013*"budget" + 0.010*"study" + 0.009*"minister" + 0.009*"in" + 0.009*"on"'),
 (6,
  '0.030*"in" + 0.027*"of" + 0.022*"for" + 0.021*"police" + 0.015*"to" + 0.013*"crash" + 0.013*"missing"'),
 (7,
  '0.092*"to" + 0.042*"on" + 0.023*"govt" + 0.016*"urged" + 0.014*"council" + 0.010*"of" + 0.010*"plan"'),
 (8,
  '0.081*"to" + 0.023*"for" + 0.018*"off" + 0.015*"a" + 0.014*"on" + 0.014*"be" + 0.012*"return"'),
 (9,
  '0

In [32]:
tfidf2 = models.TfidfModel(bow_corpus2)
corpus_tfidf2 = tfidf[bow_corpus2]

In [33]:
try:
    lda_model_tfidf2=pickle.load(open(os.path.join('..', 'data', 'lda_model_tfidf2.pkl'), 'rb'))
    print("Reading lda_model_tfidf2 from pickle")
except FileNotFoundError:
    print("Creating lda_model_tfidf2 and saving to pickle")
    lda_model_tfidf2 = gensim.models.LdaMulticore(corpus_tfidf2, num_topics=10, id2word=dictionary2, passes=2, workers=4)
    pickle.dump(lda_model2,open(os.path.join('..', 'data', 'lda_model_tfidf2.pkl'),'wb'))

Reading lda_model_tfidf2 from pickle


In [32]:
## Testing on out-of-sample data
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
#unseen_document='american coppers found eating donuts once again'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
bow_vector2 = dictionary2.doc2bow(prepare_text_for_lda(unseen_document))
print(lda_model2[bow_vector2])
print(lda_model_tfidf2[bow_vector2])

[(0, 0.025005715), (1, 0.2744563), (2, 0.27502477), (3, 0.025005715), (4, 0.025005717), (5, 0.27547815), (6, 0.025005724), (7, 0.025006486), (8, 0.025005715), (9, 0.025005717)]
[(0, 0.025005713), (1, 0.27445617), (2, 0.27502474), (3, 0.025005713), (4, 0.025005715), (5, 0.27547827), (6, 0.025005722), (7, 0.025006488), (8, 0.025005713), (9, 0.025005715)]


In [34]:
lda_model_tfidf2.show_topics(10,7)

[(0,
  '0.047*"council" + 0.025*"miss" + 0.020*"appeal" + 0.018*"continue" + 0.017*"search" + 0.017*"study" + 0.015*"business"'),
 (1,
  '0.024*"south" + 0.018*"first" + 0.014*"threat" + 0.014*"company" + 0.014*"fight" + 0.013*"title" + 0.012*"clash"'),
 (2,
  '0.020*"abuse" + 0.020*"release" + 0.019*"chief" + 0.018*"former" + 0.018*"school" + 0.015*"delay" + 0.014*"return"'),
 (3,
  '0.022*"final" + 0.022*"centre" + 0.020*"indigenous" + 0.015*"welcome" + 0.015*"award" + 0.014*"israeli" + 0.013*"highlight"'),
 (4,
  '0.030*"kill" + 0.029*"attack" + 0.019*"arrest" + 0.018*"defend" + 0.017*"three" + 0.015*"troops" + 0.014*"blast"'),
 (5,
  '0.103*"police" + 0.032*"probe" + 0.027*"crash" + 0.017*"begin" + 0.014*"meeting" + 0.013*"launch" + 0.012*"investigate"'),
 (6,
  '0.047*"court" + 0.027*"claim" + 0.026*"charge" + 0.019*"reject" + 0.018*"murder" + 0.017*"budget" + 0.016*"health"'),
 (7,
  '0.025*"boost" + 0.019*"change" + 0.018*"call" + 0.017*"urge" + 0.016*"group" + 0.015*"public" + 

## 4.3 Visualisation

The pyLDAvis package has a nice interactive visualisation designed for gensim.

We have to prepare the data, which is again a bit slow so I provide the pkl versions of these objects.

In [35]:
import pyLDAvis
pyLDAvis.enable_notebook()

  from imp import reload


In [36]:
try:
    lda_display=pickle.load(open(os.path.join('..', 'data', 'lda_display.pkl'), 'rb'))
    print("Reading lda_display from pickle")
except FileNotFoundError:
    print("Creating lda_display and saving to pickle")
    lda_display = pyLDAvis.gensim.prepare(lda_model, bow_corpus, 
                                          dictionary, mds='mmds')
    pickle.dump(lda_display,open(os.path.join('..', 'data', 'lda_display.pkl'),'wb'))

Reading lda_display from pickle


In [37]:
try:
    lda_display2=pickle.load(open(os.path.join('..', 'data', 'lda_display2.pkl'), 'rb'))
    print("Reading lda_display2 from pickle")
except FileNotFoundError:
    print("Creating lda_display2 and saving to pickle")
    lda_display2 = pyLDAvis.gensim.prepare(lda_model2, bow_corpus2, 
                                          dictionary2, mds='mmds')
    pickle.dump(lda_display2,open(os.path.join('..', 'data', 'lda_display2.pkl'),'wb'))


Reading lda_display2 from pickle


In [38]:
try:
    lda_display_tfidf2=pickle.load(open(os.path.join('..', 'data', 'lda_display_tfidf2.pkl'), 'rb'))
    print("Reading lda_display_tfidf2 from pickle")
except FileNotFoundError:
    print("Creating lda_display_tfidf2 and saving to pickle")
    lda_display_tfidf2 = pyLDAvis.gensim.prepare(lda_model_tfidf2, 
                                                 corpus_tfidf2, dictionary2, sort_topics=False)
    pickle.dump(lda_display_tfidf2,open(os.path.join('..', 'data', 'lda_display_tfidf2.pkl'),'wb'))


Reading lda_display_tfidf2 from pickle


Now we will use visualisation, explained at many places including:

https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21


In [39]:
# NB under some circumstances you need show, under others you need display. It appears to be a known bug.
pyLDAvis.display(lda_display, template_type='notebook')

In [39]:
pyLDAvis.display(lda_display2, template_type='notebook') # NB under some circumstances you need show, under others you need display. It appears to be a known bug.

Now visualise the tfidf2 model, which "should" be our best model.

In [40]:
pyLDAvis.display(lda_display_tfidf2, template_type='notebook') # NB under some circumstances you need show, under others you need display. It appears to be a known bug.

##Â 4.4 Return to topics


In [41]:
for index, score in sorted(lda_model2[bow_vector2], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model2.print_topic(index, 10)))

NameError: name 'bow_vector2' is not defined

In [42]:
for index, score in sorted(lda_model2[bow_vector2], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model2.print_topic(index, 5)))

Score: 0.2754786014556885	 Topic: 0.103*"police" + 0.032*"probe" + 0.027*"crash" + 0.017*"begin" + 0.014*"meeting"
Score: 0.27502474188804626	 Topic: 0.020*"abuse" + 0.020*"release" + 0.019*"chief" + 0.018*"former" + 0.018*"school"
Score: 0.27445584535598755	 Topic: 0.024*"south" + 0.018*"first" + 0.014*"threat" + 0.014*"company" + 0.014*"fight"
Score: 0.025006506592035294	 Topic: 0.025*"boost" + 0.019*"change" + 0.018*"call" + 0.017*"urge" + 0.016*"group"
Score: 0.025005724281072617	 Topic: 0.047*"court" + 0.027*"claim" + 0.026*"charge" + 0.019*"reject" + 0.018*"murder"
Score: 0.02500571683049202	 Topic: 0.030*"kill" + 0.029*"attack" + 0.019*"arrest" + 0.018*"defend" + 0.017*"three"
Score: 0.02500571683049202	 Topic: 0.024*"death" + 0.019*"water" + 0.015*"question" + 0.015*"accident" + 0.015*"concern"
Score: 0.02500571496784687	 Topic: 0.047*"council" + 0.025*"miss" + 0.020*"appeal" + 0.018*"continue" + 0.017*"search"
Score: 0.02500571496784687	 Topic: 0.022*"final" + 0.022*"centre" +

In [43]:
for index, score in sorted(lda_model_tfidf2[bow_vector2], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf2.print_topic(index, 5)))

Score: 0.27547821402549744	 Topic: 0.103*"police" + 0.032*"probe" + 0.027*"crash" + 0.017*"begin" + 0.014*"meeting"
Score: 0.27502474188804626	 Topic: 0.020*"abuse" + 0.020*"release" + 0.019*"chief" + 0.018*"former" + 0.018*"school"
Score: 0.2744562327861786	 Topic: 0.024*"south" + 0.018*"first" + 0.014*"threat" + 0.014*"company" + 0.014*"fight"
Score: 0.025006486102938652	 Topic: 0.025*"boost" + 0.019*"change" + 0.018*"call" + 0.017*"urge" + 0.016*"group"
Score: 0.025005722418427467	 Topic: 0.047*"court" + 0.027*"claim" + 0.026*"charge" + 0.019*"reject" + 0.018*"murder"
Score: 0.02500571496784687	 Topic: 0.030*"kill" + 0.029*"attack" + 0.019*"arrest" + 0.018*"defend" + 0.017*"three"
Score: 0.02500571496784687	 Topic: 0.024*"death" + 0.019*"water" + 0.015*"question" + 0.015*"accident" + 0.015*"concern"
Score: 0.02500571310520172	 Topic: 0.047*"council" + 0.025*"miss" + 0.020*"appeal" + 0.018*"continue" + 0.017*"search"
Score: 0.02500571310520172	 Topic: 0.022*"final" + 0.022*"centre" +

##Â 4.5 Perplexity and Coherence

Question: What does a "good" model prediction look like?

The below examines the scores that we have.

In [44]:
## Sadly very slow, so we only look at the first few documents.
log_perplexities = {"lda_model": lda_model.log_perplexity(bow_corpus[0:1000]), 
     "lda_model2" : lda_model2.log_perplexity(bow_corpus2[0:1000]),
     "lda_model_tfidf" : lda_model_tfidf.log_perplexity(corpus_tfidf[0:1000]),
     "lda_model_tfidf2" : lda_model_tfidf2.log_perplexity(corpus_tfidf2[0:1000])
    };
# a measure of how good the model is. lower the better.
log_perplexities

{'lda_model': -14.897495582694685,
 'lda_model2': -16.94209941970645,
 'lda_model_tfidf': -27.977380333079466,
 'lda_model_tfidf2': -25.68284363960556}

Question: Why is the tf-idf model so much better in this measure? Does the performance measure capture your intuition about what a good topic model is?

Now we compute the intrinsic coherence to check the quality of the fit.

In [45]:
from gensim.models.coherencemodel import CoherenceModel
def getCoherence(m,c,d):
    coherence_model_lda = CoherenceModel(model=m,corpus=c, dictionary=d, coherence='u_mass')
    coherence_lda = coherence_model_lda.get_coherence()
    return(coherence_lda)

In [46]:
### Compute Coherence Score
coherences={
    "lda_model": getCoherence(lda_model,bow_corpus[0:1000],dictionary),
    "lda_model2": getCoherence(lda_model2,bow_corpus2[0:1000],dictionary2),
    "lda_model_tfidf": getCoherence(lda_model_tfidf,corpus_tfidf[0:1000],dictionary),
    "lda_model_tfidf2": getCoherence(lda_model_tfidf2,corpus_tfidf2[0:1000],dictionary2)
}
# a different measure of how good the model is. Higher is better.
coherences

{'lda_model': -16.141460380438886,
 'lda_model2': -20.439275147176623,
 'lda_model_tfidf': -16.141460380438886,
 'lda_model_tfidf2': -20.439275147176623}

Question: Why is the version of the data in which we removed stop words performing worse? 

## 4.6 Return to out-of-sample performance

In [54]:
oos_coherences={
    "lda_model": getCoherence(lda_model,[bow_vector],dictionary),
    "lda_model2": getCoherence(lda_model2,[bow_vector2],dictionary2),
    "lda_model_tfidf": getCoherence(lda_model_tfidf,[bow_vector],dictionary),
    "lda_model_tfidf2": getCoherence(lda_model_tfidf2,[bow_vector2],dictionary2)
};
oos_coherences

{'lda_model': -2.414078686970596,
 'lda_model2': -0.08725585615556383,
 'lda_model_tfidf': -2.414078686970596,
 'lda_model_tfidf2': -0.08725585615556383}

Question: what would be the most appropriate way to test an LDA model? Would it make a difference to possess labels? How would you use them?

## Challenge: what is the "best" value of K to use? 
How would you evaluate it? How would you handle the model runtime?

## Conclusions

What conclusions would you draw from this procedure?

## Appendix

NLTK includes synonyms, dictionary definitions, antonyms, and more; all available for automated processing.

Some tasters:

In [48]:
from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']


In [49]:
syn

[Synset('pain.n.01'),
 Synset('pain.n.02'),
 Synset('pain.n.03'),
 Synset('pain.n.04'),
 Synset('annoyance.n.04'),
 Synset('trouble.v.05'),
 Synset('pain.v.02')]

In [50]:
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']


## Some references

### Data science topic modelling
* [Preparing Data for Topic Modelling](https://publish.illinois.edu/commonsknowledge/2017/11/16/preparing-your-data-for-topic-modeling/)
* [NLP for legal documents](https://towardsdatascience.com/nlp-for-topic-modeling-summarization-of-legal-documents-8c89393b1534)
* [Machine-Learning-In-Law github repo](https://github.com/chibueze07/Machine-Learning-In-Law/tree/master)

### Judging topic models
* Chang, Jonathan, Jordan Boyd-Graber, Sean Gerrish, Chong Wang and David M. Blei. 2009. [Reading Tea Leaves: How Humans Interpret Topic Models](http://umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf). NIPS.
* Stevens, Kegelmeyer, Andrzejewsk and Buttler [Exploring Topic Coherence over many models and many topics](https://www.aclweb.org/anthology/D/D12/D12-1087.pdf)
