07 Topic Models and Bayes
In this block we cover:
- Simple topic models
- The Bag Of Words
- Term frequency-inverse document frequency (tf-idf) representation
- N-grams
- Navigating Bayesian Methodology
- Bayes Theorem
- Bayesian Motivations for Smoothing and Regularisation
- False Positive Rates
- How MCMC, SMC, ABC and Variational Inference fit together
- Latent Dirichlet Allocation
- Perplexity and coherence
- Text cleaning
- Practical use of LDA
- Regexp (regular expressions)
- Pipelines for data processing
- Example application
Lectures:
- Topic Models, Bayes, Regularization, Latent Dirichlet Allocation:
- Applying Topic Models:
Worksheets:
Workshop:
The workshop is split into two parts. The first installs gensim, uses NLTK (the Natural Language Toolkit) to install some useful tools, and obtains the data. The second is the main workshop, containing a full text modelling example.
- 7.3.1 Workshop on Topic Modelling (Part 1, Prereqs) (9:57)
- 7.3.2 Workshop on Topic Modelling (Part 2, Main content) (34:09)
Assessments:
- Assessment 3 will be set this week; see Assessments. This is a summative assessment (i.e. it does contribute to your grade) and will be due in Week 16.
References
Bag of Words
- Python Bag of Words: p259 Python Machine Learning (Raschka & Mirjalili, 2nd ed 2017).
- Topic Modeling and Latent Dirichlet Allocation: An Overview (Weifeng Li, Sagar Samtani and Hsinchun Chen)
- Stephen Robertson (Microsoft Research). “Understanding Inverse Document Frequency: On theoretical arguments for IDF”
Bayesian Methodology
- Bayesian Programming languages: Software which allows you to specify the model without requiring you to specify the method:
- There is a super useful list of conjugate priors and interpretations on the Conjugate Prior Wikipedia page!
- Monte Carlo:
- Gamerman and Lopes. Markov chain Monte Carlo: stochastic simulation for Bayesian inference.
- Doucet, Godsill, and Andrieu. “On sequential Monte Carlo sampling methods for Bayesian filtering” Statistics and computing 10.3 (2000): 197-208.
- Andrieu, Doucet, and Holenstein. “Particle Markov chain Monte Carlo methods”
- ABC:
- Beaumont, Zhang, and Balding. “Approximate Bayesian computation in population genetics.” Genetics 162.4 (2002): 2025-2035.
- Murray, Ghahramani, and MacKay. “MCMC for doubly-intractable distributions” Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI) (2006).
- Variational Inference:
- Blei, Kucukelbir and McAuliffe. “Variational Inference: A Review for Statisticians”, JASA (2017): 859-877.
- Blei and Jordan. “Variational inference for Dirichlet process mixtures”, Bayesian analysis 1.1 (2006): 121-143.
- A Beginner’s Guide to Variational Methods, by Eric Jang.
Latent Dirichlet Allocation
- B. Barde and A. Bainwad. “An overview of topic modeling methods and tools” (2017) ICICCS 745-750.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation”, Journal of Machine Learning Research 3.Jan (2003): 993-1022.
- Neural network approaches to document models:
Data science topic modelling
Judging topic models
- Chang, Jonathan, Jordan Boyd-Graber, Sean Gerrish, Chong Wang and David M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. NIPS.
- Stevens, Kegelmeyer, Andrzejewski and Buttler. “Exploring Topic Coherence over many models and many topics”