11 Parallel Infrastructure and Spark
In this block we cover:
- Big Data
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Spark overview
- Resilient Distributed Datasets (RDDs)
- Accessing Spark through pyspark
- Parallel data with MapReduce and Spark
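As a taste of the topics above, the MapReduce pattern that Hadoop and Spark build on can be sketched in plain Python, with no cluster required. This is an illustrative word count; the phase names here are our own, not a library API:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would do between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into one result (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["spark builds on mapreduce", "mapreduce scales on clusters"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
# e.g. counts["mapreduce"] == 2 and counts["on"] == 2
```

The same shape (map over records, group by key, reduce each group) is what HDFS and Spark distribute across machines.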
- The workshop this week involves considerable setup and the use of BlueCrystal Phase 4 (or setting up your own environment…).
- You are advised to do this first; it is discussed in the first video.
- There are two ways to complete the workshop: getting Spark working on your local machine, or using BlueCrystal Phase 4.
- All workshop content is accessed via the GitHub repository:
- Video Lectures:
- General parallel algorithms:
- MapReduce