Welcome to Data Science Toolbox!
This week we will prepare you for your Data Science Journey. It is essential that you prepare before contact time. That means:
- Watch and reflect on the Lectures;
- Look at the worksheets and think about the questions; as a minimum, make notes on how you might go about answering them;
- Most importantly, look at the Workshop content and do the preparation for it.
The first two blocks demand the most work, so that you can hit the ground running. Later blocks have less content, leaving correspondingly more time for group assessments.
In Block 01, we cover:
- What is Data Science Toolbox?
- Use of Group Assessments.
- What is Data Science?
- An overview of Exploratory Data Analysis (EDA).
- Exploratory Data Analysis with R.
- Setting up a basic Data Science Environment with RStudio.
- NB: We cover Python starting Block 06.
- Using Git (via GitHub Desktop) for collaborative projects.
- How to work with Cyber Security data.
- 1.0 - Introduction to the course (22.16)
- 1.1 - Introduction to Data Science (26.59)
- 1.2 - Exploratory Data Analysis (26.45)
- Everyone needs to have followed the Block 01 preparation given in Appendix 1.
- Specifically, you must have installed RStudio and GitHub Desktop, and viewed the appropriate training content.
- You cannot properly use the interaction time unless you have done this preparation in advance!
In the workshop, we will be discussing how to collaborate and work together remotely. We will then discuss Exploratory Data Analysis in practice.
- 1.3.1 - Workshop Lecture for RStudio (29.05)
- 1.3.2 - Workshop Lecture for Exploratory Data Analysis (18.13)
- 1.3.3 Workshop Lecture on Assessments, split into the following parts:
Before the workshop, you should have attempted to work through the Workshop content. The workshop will then discuss any difficulties you encountered with it.
- The Example Assessment should be carefully examined.
- Assessment 0 will be set this week; see Assessments. This is a formative assessment (i.e. it does not contribute to your grade) and will be due in Week 3.
The main references are:
We use the following Cyber Security Data Sources:
- Bro log data from Secrepo
- This can be loaded into R in a nice form with a script (raw) that can be run directly from R using
- The KDD99 dataset, which was created for a competition with a task specification. We normally use the 10% and column names files, which you can download directly.
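
As a minimal sketch of combining the KDD99 data file with its column names file in R: the helper names `parse_kdd_names` and `read_kdd` are hypothetical, and the code assumes the names file lists the class labels on its first line, followed by one `feature: type.` line per column. Check this against the files you actually download. The demonstration uses small made-up rows in temporary files rather than the real dataset.

```r
# Parse feature names from the lines of a KDD99-style column names file.
# Assumed format: the first line lists the class labels; each later line
# looks like "duration: continuous."
parse_kdd_names <- function(lines) {
  feature_lines <- lines[-1]                                      # drop the class-label line
  feature_lines <- feature_lines[nzchar(trimws(feature_lines))]   # drop blank lines
  features <- sub(":.*$", "", feature_lines)                      # keep the text before the colon
  c(features, "label")                                            # final data column holds the class
}

# Read a headerless data file and attach the parsed column names.
read_kdd <- function(data_path, names_path) {
  col_names <- parse_kdd_names(readLines(names_path))
  read.csv(data_path, header = FALSE, col.names = col_names,
           stringsAsFactors = FALSE)
}

# Tiny demonstration with made-up rows written to temporary files.
names_file <- tempfile(fileext = ".names")
data_file  <- tempfile(fileext = ".csv")
writeLines(c("normal,smurf.",
             "duration: continuous.",
             "protocol_type: symbolic."), names_file)
writeLines(c("0,tcp,normal.", "0,icmp,smurf."), data_file)

demo <- read_kdd(data_file, names_file)
str(demo)          # inspect column names and types
table(demo$label)  # count records per class label
```

To use this on the real data, replace the two temporary paths with the locations of your downloaded 10% data file and column names file.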