08 Topic Models and Bayes
In this block we cover:
- Simple topic models
- The Bag Of Words
- Term frequency-inverse document frequency (tf-idf) representation
- N-grams
- Navigating Bayesian Methodology
- Bayes Theorem
- Bayesian Motivations for Smoothing and Regularisation
- False Positive Rates
- How MCMC, SMC, ABC and Variational Inference fit together
- Latent Dirichlet Allocation
- Perplexity and coherence
- Text cleaning
- Practical use of LDA
- Regexp (regular expressions)
- Pipelines for data processing
- Example application
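As a small taste of the first topic above, here is a minimal sketch of a bag-of-words and tf-idf representation using only the Python standard library. The function names (`tokenize`, `bag_of_words`, `tf_idf`) and the three toy documents are hypothetical, chosen for illustration. The weighting is one common variant (relative term frequency times the unsmoothed log inverse document frequency); libraries such as gensim or scikit-learn use slightly different smoothing and normalisation, so their numbers will not match exactly.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Represent each document as a Counter of token counts (word order is discarded)."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df),
    where df is the number of documents containing the term.
    """
    bags = bag_of_words(docs)
    n_docs = len(bags)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    weights = []
    for bag in bags:
        length = sum(bag.values())
        weights.append(
            {
                term: (count / length) * math.log(n_docs / doc_freq[term])
                for term, count in bag.items()
            }
        )
    return weights


# Toy corpus (hypothetical), just to show the shape of the output.
toy_docs = ["The cat sat", "The dog sat", "The bird flew"]
toy_weights = tf_idf(toy_docs)
```

In this toy corpus "the" appears in every document, so its idf (and hence its tf-idf weight) is zero, while a word unique to one document, such as "cat", receives the largest weight in that document. This is the behaviour that makes tf-idf useful for down-weighting uninformative words.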
Lectures:
Workshop:
The workshop is split into two sections. The first installs gensim and uses NLTK (Natural Language Toolkit) to install some useful tools; it also downloads the data. The second is the main workshop, containing a full text modelling example.
- Python Notebook: 8.3.1 Topic Models (Software and downloading data)
- Python Notebook: 8.3.2 Topic Models
Assessments:
- Portfolio 08 of the full Portfolio.
- Block08 on Noteable via Blackboard:
References:
Bag of Words
- Python Bag of Words: p259 Python Machine Learning (Raschka & Mirjalili, 2nd ed 2017).
- Topic Modeling and Latent Dirichlet Allocation: An Overview (Weifeng Li, Sagar Samtani and Hsinchun Chen)
- Stephen Robertson, Microsoft Research. “Understanding Inverse Document Frequency: On theoretical arguments for IDF”
Bayesian Methodology
- Bayesian Programming languages: Software which allows you to specify the model without requiring you to specify the method:
- There is a super useful list of conjugate priors and interpretations on the Conjugate Prior Wikipedia page!
- Monte Carlo:
- Gamerman and Lopes. Markov chain Monte Carlo: stochastic simulation for Bayesian inference.
- Doucet, Godsill, and Andrieu. “On sequential Monte Carlo sampling methods for Bayesian filtering” Statistics and computing 10.3 (2000): 197-208.
- Andrieu, Doucet, and Holenstein. “Particle Markov chain Monte Carlo methods”
- ABC:
- Beaumont, Zhang, and Balding. “Approximate Bayesian computation in population genetics.” Genetics 162.4 (2002): 2025-2035.
- Murray, Ghahramani, and MacKay. “MCMC for doubly-intractable distributions” Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI) (2006).
- Variational Inference:
- Blei, Kucukelbir and McAuliffe. “Variational Inference: A Review for Statisticians”, JASA (2017): 859-877.
- Blei and Jordan. “Variational inference for Dirichlet process mixtures”, Bayesian analysis 1.1 (2006): 121-143.
- A Beginner’s Guide to Variational Methods, by Eric Jang.
Latent Dirichlet Allocation
- B. Barde and A. Bainwad. “An overview of topic modeling methods and tools” (2017) ICICCS 745-750.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation”, Journal of Machine Learning Research 3.Jan (2003): 993-1022.
- Neural Networks approaches to document models:
Data science topic modelling
Judging topic models
- Chang, Jonathan, Jordan Boyd-Graber, Sean Gerrish, Chong Wang and David M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. NIPS.
- Stevens, Kegelmeyer, Andrzejewski and Buttler. “Exploring Topic Coherence over many models and many topics”
Data sources
- Kaggle dataset for fake news
- Intelligence and Security Informatics Data Sets
- Vizsec security data collection
- Threatminer cyber data with NLP
- Phishing data corpus