# Workshop on Topic Modelling (Part 1, Prereqs)

```
date: "Block 07"
author: "Daniel Lawson"
email: dan.lawson@bristol.ac.uk
output: html_document
version: 1.0.1
```

## NLP Environment

Here we set up the NLP environment in Python.


## The libraries that are needed

The main library for topic modelling is called "gensim". However, several other libraries implement better natural language processing tools, and we use "nltk" alongside it. The key preprocessing step is "stemming" or "lemmatizing", both of which reduce each word to a common root form.

For the difference between the two, see https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/

* **Stemming**: Cutting off the start and end of words to make (hopefully) consistent word stems
* **Lemmatization**: Looking up the correct word-roots in dictionaries to find the morphological root of words

Stemming may work across languages and dialects, and is computationally cheaper. However, it may merge words that are functionally different; for example, *operation*, *operational*, *operand* and *opera* might all become *oper*!

Lemmatization works only as well as its dictionary and requires large databases to function. Where a good dictionary is available, it is preferred for many language tasks.
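The contrast between the two approaches can be sketched with a toy example. This is not how nltk implements either technique: the suffix list `SUFFIXES`, the dictionary `TOY_LEMMAS`, and both function names are hypothetical, chosen only to reproduce the *oper* example above.

```python
# Toy illustration only: real stemmers and lemmatizers are far more careful.
SUFFIXES = ("ational", "ation", "and", "a")  # hypothetical rules, longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical miniature dictionary mapping inflected forms to their lemma.
TOY_LEMMAS = {"operations": "operation", "operas": "opera", "operands": "operand"}


def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)


# Stemming merges four functionally different words into one stem.
stems = [toy_stem(w) for w in ["operation", "operational", "operand", "opera"]]
# Lemmatization keeps them apart, but only for words in its dictionary.
lemmas = [toy_lemmatize(w) for w in ["operations", "operas", "operands"]]
print(stems)
print(lemmas)
```

The stemmer needs no dictionary but merges unrelated words, while the lemmatizer is only as good as the entries in `TOY_LEMMAS`.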

An important additional step is to remove **stop words**. These are common words that link other words but have little intrinsic meaning by themselves, such as "the", "it", "on", "and", etc.
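As a minimal sketch of stop word removal, assuming a tiny hand-written stop list (`TOY_STOPWORDS` and `remove_stop_words` are hypothetical names; gensim and nltk ship much larger lists):

```python
# Hypothetical miniature stop list, using the examples given in the text.
TOY_STOPWORDS = {"the", "it", "on", "and"}


def remove_stop_words(text: str) -> list:
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in TOY_STOPWORDS]


print(remove_stop_words("The cat sat on the mat and it purred"))
```

Only the words that carry meaning on their own survive the filter.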

## Installation of software

Here we install the required packages: nltk, gensim and pyLDAvis. Note that this does not complete the setup, because nltk contains sub-modules that need to be downloaded separately.

In [None]:
!pip3 install nltk
!pip3 install gensim
!pip3 install pyLDAvis

Now we'll check our gensim version.

You can usually skip this check, but some of the functions need gensim v3 rather than v0.

If you need to update, the command-line update is `conda install -c anaconda gensim`.
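The version requirement could also be checked in code. The sketch below uses two hypothetical helpers (`major_version` and `is_supported` are not part of gensim) and assumes a dotted version string; the default minimum of 3 is taken from the note above.

```python
def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string such as '3.8.3'."""
    return int(version.split(".")[0])


def is_supported(version: str, minimum_major: int = 3) -> bool:
    """True when the major version is at least minimum_major (assumed to be 3)."""
    return major_version(version) >= minimum_major


print(is_supported("3.8.3"))
print(is_supported("0.13.4"))
```

In the notebook you would call `is_supported(gensim.__version__)` after importing gensim.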

Now we are ready to test that we can load all the software we need:

In [None]:
import pickle
import pandas as pd
import requests
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)

import nltk

In [None]:
print(gensim.__version__)
## If you need to reinstall and reload:
##from importlib import reload
##reload(gensim)

Finally, we download the nltk data modules that we need.

In [None]:
### Do this once! Then leave commented next time you run the script.
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('popular') # Get the "popular" package with the advanced WordNet dictionary in it. There are ways to avoid this if you want.

## Data

First, load the data. This idea comes from [Susan Li on Towards Data Science](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) and the data comes directly from [Kaggle million headlines](https://www.kaggle.com/therohk/million-headlines/data). Because the documents are short, this is a useful dataset for teaching.

We will download it from the [Data Science GitHub Data Directory](https://github.com/dsbristol/dst/tree/master/data) via the [Direct Download link](https://github.com/dsbristol/dst/blob/master/data/abcnews-date-text.csv.gz?raw=true), using the Python below.

We will also get the [dst-block7-lda.zip](https://github.com/dsbristol/dst/blob/master/data/dst-block7-lda.zip?raw=true) zip file, which contains some intermediate results that we don't want to have to run in real time. They take about 10-15 minutes to regenerate in total, and you will have the code to generate them.
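The download cells write into `../data`, so they assume that directory already exists. If it might not, a small helper along these lines could create it first. This is a sketch only: `save_bytes` is a hypothetical name, and the demonstration writes to a temporary directory rather than to the course data folder.

```python
import tempfile
from pathlib import Path


def save_bytes(path: str, content: bytes) -> int:
    """Create the parent directory if needed, write content, return bytes written."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.write_bytes(content)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    written = save_bytes(str(Path(tmp) / "data" / "example.bin"), b"abc")
print(written)
```

In the notebook, the call would look like `save_bytes('../data/abcnews-date-text.csv.gz', r.content)`.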

In [None]:
url = 'https://github.com/dsbristol/dst/blob/master/data/abcnews-date-text.csv.gz?raw=true'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open('../data/abcnews-date-text.csv.gz', 'wb') as f:  # close the file when done
    f.write(r.content)

In [None]:
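# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('../data/abcnews-date-text.csv.gz', compression='gzip', on_bad_lines='skip')
data_text = data[['headline_text']].copy()  # copy so the next line does not warn about modifying a slice
data_text['index'] = data_text.index
documents = data_text[0:100000]  # keep the first 100,000 headlines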

Summaries of the data:

In [None]:
print(len(documents))
print(documents[:5])

In [None]:
url = 'https://github.com/dsbristol/dst/blob/master/data/dst-block7-lda.zip?raw=true'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open('../data/dst-block7-lda.zip', 'wb') as f:  # close the file when done
    f.write(r.content)

Check that the file is not corrupted: it should have an MD5 hash of `793d48054fa8ec271ed6c683295f1122`.

This is important because the zip file contains pickle (`.pkl`) files, and loading a pickle can execute arbitrary code, so a tampered file could be malicious.

In [None]:
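import hashlib
expected_md5 = '793d48054fa8ec271ed6c683295f1122'
actual_md5 = hashlib.md5(r.content).hexdigest()
print(actual_md5, actual_md5 == expected_md5)  # the second value should be True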
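
If you re-run the notebook later without downloading again, the same check can be made against the file saved on disk. The sketch below assumes nothing beyond the standard library; `file_md5` is a hypothetical helper name, and the demonstration hashes a small temporary file rather than the course archive.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_digest = file_md5(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

For the course archive, the comparison would be `file_md5('../data/dst-block7-lda.zip') == '793d48054fa8ec271ed6c683295f1122'`.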

Now we'll unzip these files so you don't have to regenerate them:

In [None]:
import zipfile
with zipfile.ZipFile('../data/dst-block7-lda.zip', 'r') as zip_ref:
    zip_ref.extractall('../data/')
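
To see what an archive contains, for example to confirm which `.pkl` files it holds, the members can be listed without extracting them. This is a sketch using the standard `zipfile` module; `archive_contents` is a hypothetical helper name and the demonstration builds its own throwaway archive.

```python
import os
import tempfile
import zipfile


def archive_contents(path: str) -> list:
    """Return (name, uncompressed size in bytes) for each member of a zip file."""
    with zipfile.ZipFile(path, "r") as zip_ref:
        return [(info.filename, info.file_size) for info in zip_ref.infolist()]


# Demonstration on a throwaway archive (not the course file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_path, "w") as zip_ref:
        zip_ref.writestr("model.pkl", b"12345")
    demo_listing = archive_contents(demo_path)
print(demo_listing)
```

For the course data, the call would be `archive_contents('../data/dst-block7-lda.zip')`.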

That is it: we're ready to do NLP in anger.