12 Parallel Infrastructure and Spark
This Block is unassessed except where it overlaps with other blocks. You may find the Parallel Data lecture helpful for Block 10 on parallel algorithms.
In this block we cover:
- Big Data
- Streaming
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Spark overview
- Resilient Distributed Datasets (RDDs)
- Spark
- Accessing Spark through pyspark
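As a rough illustration of the MapReduce pattern listed above, the following is a minimal sketch in plain Python. It is not Hadoop or Spark code and runs on a single machine. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key; here, sum the counts per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases into a single word count."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, standing in for lines read from a file.
sample_lines = ["spark and hadoop", "spark streaming"]
counts = word_count(sample_lines)
print(counts)
```

In pyspark the same computation is commonly expressed on an RDD by chaining `flatMap`, `map` and `reduceByKey`, with the grouping step handled by the framework across machines rather than by an in-memory dictionary.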
Lectures
Worksheets:
Workshop:
- The workshop this week involves considerable setup.
- You are advised to do this first - it is discussed in the first video.
- All Workshop Content is accessed via the GitHub repository:
- Workshop Videos:
References
- General parallel algorithms:
- Map Reduce
- Apache Hadoop
- Gentle introduction to MapReduce
- A Q&A
- Lecture@poznan
- Basic MapReduce Algorithms Design
- Tutorialspoint MapReduce
- Hadoop for Streaming applications
- Laws governing data science:
- Privacy:
- The Algorithmic Foundations of Differential Privacy by Dwork and Roth (2014).
- ONS policy on disclosure control.
- Sweeney 1997. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine & Ethics, 25:98–110.
- Narayanan and Shmatikov 2008. Robust de-anonymization of large sparse datasets (how to break anonymity of the Netflix Prize dataset). IEEE Sec. and Priv.
- Statistical Disclosure Attacks by George Danezis.
- Interpretability and fairness:
- Book: “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” Christoph Molnar 2019.
- Algorithmic Bias Tutorial by Francesco Bonchi with Slides from KDD 2016
- Hardt, Price and Srebro 2016. Equality of Opportunity in Supervised Learning, explored in https://blog.acolyer.org/2018/05/07/equality-of-opportunity-in-supervised-learning/.
- Spark: