Wednesday 12 February 2020
Take things to the next level and learn tidyverse packages (dplyr & friends) to master managing, cleaning, and linking datasets in R.
- Introduction to the tidyverse for data processing
- An intro to dplyr and using the pipe (%>%) operator to make your data processing flow clearer
- filter (choosing rows) and select (choosing columns)
- mutate: Calculating or altering variables in dplyr syntax
- merging datasets: binding rows/columns and joining tables on common identifiers
- reshaping data from wide to long format (and back again)
- summarise: getting summaries of variables by sub-groups
- stringr and forcats: a brief introduction to functions for handling strings and factors
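As a rough illustration of how the verbs listed above chain together with the pipe, here is a minimal sketch. The `visits` table, its column names, and its values are invented for this example and are not taken from the course datasets.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
visits <- tibble(
  patient_id = c(1, 2, 3, 4, 5, 6),
  region     = c("North", "North", "South", "South", "South", "North"),
  age        = c(34, 71, 58, 45, 80, 66),
  los_days   = c(2, 10, 4, 1, 7, 3)    # length of stay in days
)

region_summary <- visits %>%
  filter(age >= 40) %>%                      # filter: choose rows
  select(patient_id, region, los_days) %>%   # select: choose columns
  mutate(long_stay = los_days >= 5) %>%      # mutate: calculate a new variable
  group_by(region) %>%                       # define the sub-groups
  summarise(                                 # summarise: one row per sub-group
    n_patients = n(),
    mean_los   = mean(los_days),
    n_long     = sum(long_stay)
  )

print(region_summary)
```

Each step takes the result of the previous one, so the code reads top to bottom in the order the processing happens.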
Style of course
Computer lab - hands-on course combining taught overview, hints on programming practices, and practical exercises. Course attendance is limited to 18 participants.
Who should attend?
Do you do a lot of data manipulation/processing in your work? Are you interested in more efficient and “readable” code for processing your data? Do you use R (a bit or a lot)?
This course is aimed at people with a reasonable amount of experience using R who are interested in learning tools to streamline data management tasks. The overall structure will cover “inputs” (getting data imported and cleaned up for analysis) and “outputs” (summarising and reshaping data). We will also cover how to merge datasets using dplyr syntax.
If you’re not familiar with the concept of the tidyverse packages, you can have a look here (https://www.tidyverse.org/) for more details about the tidyverse philosophy and individual packages. Please note that this course will not be covering the ggplot2 graphing package.
In this course we will cover the main R packages for data manipulation that sit within the tidyverse. We will look at the dplyr and tidyr packages for the overall structure of working with data, including summarising large datasets. We will also have a brief introduction to two related packages that help with specific types of data: stringr for text strings and forcats for factors.
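To give a flavour of how these packages fit together, the sketch below reshapes a small table with tidyr, tidies a text variable with stringr and forcats, and links two tables with a dplyr join. The `people` and `scores_wide` tables and all their values are hypothetical, made up for this example.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)

# Hypothetical toy tables, invented for illustration only
people <- tibble(
  id     = c(1, 2, 3),
  gender = c(" male", "Male ", "FEMALE")   # inconsistent text entry
)
scores_wide <- tibble(
  id         = c(1, 2),
  score_2018 = c(10, 20),
  score_2019 = c(12, 18)
)

# tidyr: wide to long (one row per person per year)
scores_long <- scores_wide %>%
  pivot_longer(
    cols         = starts_with("score_"),
    names_to     = "year",
    names_prefix = "score_",
    values_to    = "score"
  )

# stringr + forcats: clean the strings, then order factor levels by frequency
people_clean <- people %>%
  mutate(
    gender = str_to_lower(str_trim(gender)),
    gender = fct_infreq(factor(gender))
  )

# dplyr: keep every person, adding scores where the id matches
linked <- people_clean %>%
  left_join(scores_long, by = "id")

# tidyr: long back to wide
scores_back <- scores_long %>%
  pivot_wider(names_from = year, values_from = score, names_prefix = "score_")
```

In this sketch person 3 has no scores, so `left_join()` keeps that row with missing values in `year` and `score`; other join types would handle unmatched rows differently.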
During the course we will be working with a couple of different health-related datasets (a set of linked administrative health records and a health survey dataset) that will give a good idea of how these tools apply to real-world health datasets. The practicals will be completed using Windows PCs running RStudio, but the material also applies to people running R on a Mac or Linux system.
Please note that this course is not intended as a follow-on for people who have only just learned R (e.g. at the level covered in the “Introduction to R” course) and is instead aimed at slightly more experienced R users. If you would like more information about whether this course would suit you, please contact James directly (firstname.lastname@example.org).
|Time||Content (note that the structure is being finalised, but the following broad topics will all be covered)||Presenter(s)|
|8:30am||Registration and Coffee|
|9:00am||Introduction and housekeeping; Part 1: Getting started with the tidyverse; Importing data (readr, haven) and an introduction to tibbles; Basic verbs for choosing rows and columns (filter and select)|
|11:00am||Part 2: New functions for data processing; Using dplyr and an introduction to the pipe %>%; mutate: making changes to variables and calculating new variables; Mass mutation: making similar changes to several variables|
|1:15pm||Part 3: Joining data tables and reshaping data; Binding rows and columns; Joining tables based on common identifiers; Reshaping data with tidyr (converting between long and wide data formats)|
|3:30pm||Part 4: Summarising data in tables; The summarise function: getting column summaries for grouped data; Super-summaries: summarising lots of variables in one go; (Brief notes) Using stringr and forcats to work with strings and factors|
|4:50pm||Summary of course and evaluation|
- Dr James Stanley is a Kairuruku Matua / Kaitātari Koiora (Research Associate Professor and biostatistician) at the University of Otago, Wellington. His research work involves working with large administrative health datasets and complex survey sample data. James has been using R since 2009, and since 2015 has been using dplyr and related tidyverse packages for pretty much all the data wrangling he does in R.
- Maddie White works as a Kairuruku Tuarua (Assistant Research Fellow) with He Kāinga Oranga at the University of Otago, Wellington. Although R wasn't the first statistical programming language she studied, it quickly became her favourite. She now uses R in her day-to-day work understanding large, linked datasets in the Integrated Data Infrastructure (IDI). Maddie enjoys improving her R skills for data management, graphing and other exciting things, and especially likes encouraging others to do the same through helping teach the PHSS R courses.
Course cost and registration
$300 early bird, $400 after 19 December 2019.
A 50% discount is available to full-time students, unwaged people, and University of Otago staff.