
About R and RStudio

R is a flexible, convenient and powerful tool for managing and analysing data, as well as for creating publication materials such as visualisations, reports and dashboards. RStudio is an IDE (Integrated Development Environment) for using R. R is free to install, as is the open source edition of RStudio (there are professional and hosted versions of RStudio which are not free). In my opinion, R and RStudio provide at least as much functionality as you would see in other, quite expensive, statistical computing platforms.

About this guide

This guide will be just enough to get you going. It does not show the full range of features for every function. It uses tidyverse and isn’t very concerned with technical stuff going on under the hood.

Getting help

The help function in R provides great documentation and examples for all functions. Use it by running ? followed by the name of the function. For example, to find out more about the function select just run ?select.

There are very good cheatsheets for different packages on the RStudio website. One cheatsheet that you won’t find on that page but which I think is really useful for beginners is the one on data wrangling (although some of the functions on that are out of date).

Finally there are a couple of good online forums where users post their coding problems and other users try to provide solutions. One of the most active of these is Stack Overflow. Try searching or posting your problem there if you are getting stuck. Users are more likely to be able to help if you provide a reproducible example, so try to provide enough code so that they can recreate the problem themselves.

A word on web restrictions

R needs to access the web in order to download packages from CRAN. You may also want to import data from the web, e.g. through the csodata package for importing data from the CSO website. If you are having trouble with these steps then it may be because your organisation has not allowed R to access the web, so you might need to contact your IT department. If you are in the same organisation as me and you are experiencing difficulties, let me know, there are some settings that I can pass on that will help.

Packages and Tidyverse

About packages

The capabilities of R are extended through a wide array of packages. As a beginner with R, it is likely that all of the extra packages that you use will be downloaded from an online repository of packages called CRAN, which stands for the Comprehensive R Archive Network. Packages are downloaded from CRAN from within the R environment, so you rarely have to visit the CRAN website at all. There is a stringent process for getting a package included on CRAN, which includes a series of checks and tests, so there is a very low likelihood of you ever downloading malicious code from CRAN.

Tidyverse and Base R

Tidyverse is a very widely used collection of packages which makes it easier to write and read code in R. It is so commonly used that it is almost ubiquitous, so I wouldn’t hesitate to recommend that you begin learning R through tidyverse. If you are not using tidyverse and you are only using the handful of default packages in R then you would say that you are using ‘base R’, where ‘base’ is the main default package (I think there are seven default packages). This is not an ‘either-or’ decision, when you load tidyverse it is perfectly fine to write code using the ‘base’ set of functions. It’s good to understand base R, but I think it’s ok to pick this up as you go along rather than starting out with exclusive use of base R.

Installing and loading

To install a package you use the function install.packages, so to install the tidyverse packages you run install.packages("tidyverse"). Then to use the package you have to load it into your environment using the function library, so to use the tidyverse functions and features you run library(tidyverse) (note you need inverted commas for install.packages but not for library). You only have to install a package once, or again if you want to update the package, but you have to load the package every time you start a new session in R. Note that the default packages that come with R do not have to be installed or loaded using install.packages and library.
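As a rough sketch of the install-once, load-every-session pattern described above, the snippet below lists the packages attached in the current session and checks whether tidyverse is installed before loading it. The variable names (attached, tidyverse_installed) are only for illustration, and the install line is commented out on the assumption that you may not have web access when you run it.

```r
# The default packages are attached in every session without any library() call
attached <- search()
print(attached)

# TRUE if tidyverse is installed; this check does not attach it
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)

if (tidyverse_installed) {
  # Load (attach) the core tidyverse packages for this session
  library(tidyverse)
} else {
  # One-off install from CRAN (needs web access); uncomment to run
  # install.packages("tidyverse")
  message("tidyverse is not installed yet")
}
```

Running search() again after library(tidyverse) should show the newly attached packages at the front of the list.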

Core tidyverse and other tidyverse

Within tidyverse there are eight core packages (nine from tidyverse 2.0.0 onwards, which added lubridate to the core), and then a number of extra tidyverse packages. All of these are installed when you run install.packages("tidyverse"), but only the core packages are loaded when you run library(tidyverse) (plus a couple of back-end ones that you don’t really see). This is not something that you have to worry too much about; it just means that you have to run an extra library line when you want to use one of those extra tidyverse packages that are beyond the core ones. Extra tidyverse packages that I would often use include scales for formatting axes, lubridate for handling dates and times (which needs its own library line only with tidyverse versions before 2.0.0), and readxl for reading Excel files. Some packages I like which aren’t from tidyverse (and require separate installation) include sf for maps, tictoc for timing how long it takes code to run, beepr for making ‘ding’ noises after code has finished running, and knitr for making html documents like this one.

Click for a summary on types of packages

A diagram which summarises the different types of package is shown below, followed by a description of the five main types (numbered 1 to 5). As a beginner you will probably only encounter types 1 and 2.

Diagram summarising types of packages

  1. Base R: These packages are already installed with R and always loaded by default, so you don’t need to use install.packages or library. Examples include base which has lots of fundamental functions and stats which has functions for statistical analysis.
  2. Core Tidyverse: These are the main tidyverse packages. They are all installed using install.packages("tidyverse") and loaded using library(tidyverse). Examples include dplyr for data manipulation and stringr for manipulating strings.
  3. Other Tidyverse: These are part of tidyverse but not among the core packages. They are also all installed using install.packages("tidyverse") but they are not loaded with library(tidyverse); you have to use a separate library line to load these. Examples include readxl for reading Excel files and haven for reading more unusual file types (lubridate, for managing dates and times, was also in this group before tidyverse 2.0.0). You might use these kinds of packages as an intermediate user but not so much as a beginner.
  4. Other packages on CRAN: There are over 18,000 packages on CRAN made by developers and organisations all over the world. These can be downloaded using install.packages and loaded using library. Examples include csodata for downloading CSO data, and caret for machine learning. As a beginner you will not encounter many of these.
  5. Other packages not on CRAN: Many people make packages which for one reason or another are not on CRAN. For example a company may make a package for internal use only, or an individual might store several functions that they made themselves as a package on their own device (or maybe on GitHub) so that they can be easily added to their projects. These would be installed using a special configuration of install.packages or some other function, but once installed they can be loaded using library.

The RStudio environment

A screenshot of RStudio is shown below. There are four main panels. This guide will not go through every single feature in RStudio, but will hopefully be enough to get you started.

  • The top left panel is where you write and run code and where you view your datasets. You can open up a new scripting window by clicking File > New File > R Script (or ‘Ctrl Shift n’). You can save these as ‘.R’ files. How you actually go about running code that you have written here is covered further on under ‘How do I run code?’ To view datasets in this window you can click them in the ‘Environment’ panel (top right) or run View(dataset_name).
  • The bottom left panel is the console, which does two things. The first is that it outputs useful information after you have run some code, such as warnings or any print statements. The second is that you can also write code here. Generally I write code in the Console only when I don’t need to keep it, so I might run quick spot checks there, and it is also where I would run code for installing packages. There are tabs for ‘Terminal’, ‘R Markdown’ and ‘Jobs’ which you don’t need to worry about for now.
  • I would call the top right panel the environment panel, even though ‘Environment’ is only one of the six tabs in that panel, because I would rarely have occasion to use the other five tabs. The environment panel shows you all the datasets and other variables that you have created. There is a blue arrow next to the datasets which allows you to browse their structure. Note that there are some datasets which exist and which you can use, but they don’t appear in the environment panel. These are usually associated with training or as demo datasets to play with a package. One of these is mtcars, which contains information on 32 cars. In the code in the screenshot, I have used mtcars to make a new duplicate dataset called my_data which does appear in the environment panel.
  • The bottom right panel shows plots and help, and also allows you to browse files and installed packages. I don’t think it is too difficult to find your way around this panel. If you create a plot or run a help command then the appropriate tab will become active.

A screenshot of the RStudio environment

Data objects in R


Vectors are one-dimensional objects containing the same type of data, so all entries are either numerical, character (strings) or logical (TRUE or FALSE). These are created using the function c(), with the elements separated by commas. A vector of three names would look like: c("Andy", "Betty" , "Carol"). A vector of three numbers would look like: c(71, 8.5 , 0.13). You can make a vector of consecutive integers by putting a colon (:) between the low and high integer, so 5:8 is equivalent to c(5, 6, 7, 8). Finally, you can make a logical vector like so: c(TRUE, FALSE, FALSE).
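The short sketch below builds one vector of each type described above and checks its type with class(). The object names (people, amounts, flags, run, mixed) are made up for illustration. The last two lines show what happens if you mix types: R does not raise an error, it quietly converts everything to a single type.

```r
# Hypothetical example vectors, one of each basic type
people  <- c("Andy", "Betty", "Carol")   # character
amounts <- c(71, 8.5, 0.13)              # numeric
flags   <- c(TRUE, FALSE, FALSE)         # logical
run     <- 5:8                           # consecutive integers

class(people)
class(amounts)
class(flags)

# 5:8 holds the same values as c(5, 6, 7, 8)
all(run == c(5, 6, 7, 8))

# Mixing types: every element is converted to one common type (here, character)
mixed <- c(1, "a", TRUE)
class(mixed)
```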


Most of the data objects you’ll encounter will be dataframes, which can be thought of as tables with rows and columns. Each column has a particular type which can be numerical, character (strings) or logical (TRUE or FALSE). The columns of dataframes have names. The rows of dataframes can have names too, but this is a stupid feature, it makes more sense to just use another column for whatever information is stored as the row name.

The columns of dataframes are vectors. So what? It is useful to know that the columns of dataframes are themselves vectors. Why is that useful to know? Two reasons. One is that you can extract a column from a dataframe and manipulate it as a vector. The second reason is that programming in R is generally geared towards working with vectors as the basic unit. It is easier to program in R with “long” datasets, where you have data for different groups stacked vertically, so you might have a column for X, a column for Y, and a column for group. If you were working in Excel you would probably arrange things differently, maybe with a “wide” dataset, having a column for X, then a column for Y_1 (representing group 1), a column for Y_2 (representing group 2), and so on. You will find that row-wise operations in R are not as straightforward as column-wise operations. Writing for loops over a dataframe is pretty slow and discouraged. You might notice other effects like this as you become more proficient.
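To make the point about columns being vectors concrete, here is a minimal sketch using the built-in mtcars dataset. The names mpg and kpl, and the conversion factor, are only for illustration. The multiplication is applied to the whole column at once, which is the column-wise style that R is geared towards.

```r
# mtcars is one of the built-in demo datasets (32 cars)
mpg <- mtcars$mpg            # extract one column

is.vector(mpg)               # the column is an ordinary numeric vector
length(mpg) == nrow(mtcars)  # one element per row of the dataframe

# Column-wise (vectorised) operation: applied to every element at once, no loop needed.
# 0.425 is an approximate miles-per-gallon to km-per-litre factor (illustrative only).
kpl <- mpg * 0.425

mean(mpg)
head(kpl)
```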

In tidyverse, there is an improved version of a dataframe called a tibble. The differences between a tibble and a regular dataframe are quite subtle from the beginner’s perspective and they can more-or-less be handled in the same way, so I wouldn’t worry about it for now. Just be aware that if you see a reference to a ‘tibble’ then it’s a special type of a dataframe.

Other structures

There are other objects in R including matrices, arrays and lists. You can skip this section if you like as it is unlikely you will need to deal with them as a beginner.

Click for info on other structures

A matrix is a 2D table with only one type of data (usually all numeric). An array is more general than a matrix, as it can have any number of dimensions. I don’t think I have ever had to use matrices or arrays.

A list is a 1D structure where the elements don’t have to be of the same type, i.e. the first element could be a number and the second element could be a string. But moreover, the elements of lists can be vectors, dataframes, or even more lists. Dealing with lists would be for an intermediate or advanced course, but even as a beginner you might find yourself working with a package which uses lists. Often in these cases, the lists have a specific structure, or ‘class’, which is recognised by the package. Don’t panic, there will usually be a helpful vignette for you to follow.
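If you are curious, the following sketch shows what a small matrix and a small list look like. All of the object and element names (m, my_list, id, label, scores, table) are invented for the example.

```r
# A matrix: two dimensions, one type of data
m <- matrix(1:6, nrow = 2)
dim(m)

# A list: elements can be of different types, including vectors and dataframes
my_list <- list(id = 1,
                label = "first",
                scores = c(10, 20, 30),
                table = data.frame(x = 1:2, y = c("a", "b")))

length(my_list)
my_list$label           # access an element by name with $
my_list[["scores"]][2]  # or with double square brackets, then index into the vector
class(my_list$table)
```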

Fundamentals: Running Code and Creating Objects

This section covers just a couple of fundamental tools from Base R. I prefer to do all my data filtering, manipulation and aggregation using tidyverse functions, so those kinds of processes are not covered here.

How do I run code?

After you write some code in the scripting window you will want to run it. My preferred way to run code is to place the cursor somewhere in the line of code and press ‘Ctrl Enter’. As well as running the line of code, the cursor will then jump to the next line of code, which is useful because it lets you run several consecutive lines of code by holding ‘Ctrl’ and tapping ‘Enter’.

You can also write code directly in the console. After you write your code there, simply hit ‘Enter’ and it will run.

Click to see other ways of running code

There are other ways of running code from your scripting window. Pressing the ‘Run’ button at the top of the scripting window has the same effect as hitting ‘Ctrl Enter’. Instead of placing the cursor in the line of code you can select the whole line of code. You can select several lines of code and then hit ‘Ctrl Enter’ or ‘Run’, and they will all run in the order they appear.

Finally, you can run all of the code in the script by pressing ‘Ctrl Alt r’. Note that this also saves the script.

The assignment operator <-

To create objects, we use the assignment operator <-. You can quickly write this operator by hitting ‘Alt -’. Let’s make an object x which has a value of 5.

x <- 5

After you run this line of code you will see the code appear in the Console, and x appear in the Environment panel along with its value, 5.

To output the value of x, we can write x in a line on its own and run that line. The value 5 will appear in the Console. We can also run x + 2 to return 7.

x
## [1] 5

x + 2
## [1] 7

Let’s make a vector containing a couple of numeric values using the function c, and output it as well.

numbers_vec <- c(5, 6, 7, 8, 9)
numbers_vec
## [1] 5 6 7 8 9
Click to see example of string and logical vectors

We can make a vector of strings by putting each string in inverted commas. We can use single or double inverted commas.

names_vec <- c("Mary", "Louise", "Tom")
names_vec
## [1] "Mary"   "Louise" "Tom"

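And we can make a logical vector like so. Note that I am writing T and F, which are abbreviations of TRUE and FALSE. They normally behave in the same way, although T and F are ordinary variables that can be overwritten, whereas TRUE and FALSE are reserved words, so the full names are the safer choice in scripts.

logical_vec <- c(T, F, F, T)
logical_vec
## [1]  TRUE FALSE FALSE  TRUE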


I find that I very rarely need to make dataframes from scratch. As a beginner you usually work with built-in sample dataframes, and as a regular user you usually work with dataframes created from data from files or from the web.

Although the built-in datasets don’t appear in the Environment panel, they are all there waiting to be used. We will make our own copy of a dataset from a study on oesophageal cancer which is called esoph. Our copy of this dataset will be called cancer. Note that cancer will appear in the Environment panel.

cancer <- esoph
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 0-39g/day 10-19 0 10
25-34 0-39g/day 20-29 0 6
25-34 0-39g/day 30+ 0 5
25-34 40-79 0-9g/day 0 27
25-34 40-79 10-19 0 7
25-34 40-79 20-29 0 4
25-34 40-79 30+ 0 7
25-34 80-119 0-9g/day 0 2
25-34 80-119 10-19 0 1
25-34 80-119 30+ 0 2
25-34 120+ 0-9g/day 0 1
25-34 120+ 10-19 1 0
25-34 120+ 20-29 0 1
25-34 120+ 30+ 0 2
35-44 0-39g/day 0-9g/day 0 60
35-44 0-39g/day 10-19 1 13
35-44 0-39g/day 20-29 0 7
35-44 0-39g/day 30+ 0 8
35-44 40-79 0-9g/day 0 35
35-44 40-79 10-19 3 20
35-44 40-79 20-29 1 13
35-44 40-79 30+ 0 8
35-44 80-119 0-9g/day 0 11
35-44 80-119 10-19 0 6
35-44 80-119 20-29 0 2
35-44 80-119 30+ 0 1
35-44 120+ 0-9g/day 2 1
35-44 120+ 10-19 0 3
35-44 120+ 20-29 2 2
45-54 0-39g/day 0-9g/day 1 45
45-54 0-39g/day 10-19 0 18
45-54 0-39g/day 20-29 0 10
45-54 0-39g/day 30+ 0 4
45-54 40-79 0-9g/day 6 32
45-54 40-79 10-19 4 17
45-54 40-79 20-29 5 10
45-54 40-79 30+ 5 2
45-54 80-119 0-9g/day 3 13
45-54 80-119 10-19 6 8
45-54 80-119 20-29 1 4
45-54 80-119 30+ 2 2
45-54 120+ 0-9g/day 4 0
45-54 120+ 10-19 3 1
45-54 120+ 20-29 2 1
45-54 120+ 30+ 4 0
55-64 0-39g/day 0-9g/day 2 47
55-64 0-39g/day 10-19 3 19
55-64 0-39g/day 20-29 3 9
55-64 0-39g/day 30+ 4 2
55-64 40-79 0-9g/day 9 31
55-64 40-79 10-19 6 15
55-64 40-79 20-29 4 13
55-64 40-79 30+ 3 3
55-64 80-119 0-9g/day 9 9
55-64 80-119 10-19 8 7
55-64 80-119 20-29 3 3
55-64 80-119 30+ 4 0
55-64 120+ 0-9g/day 5 5
55-64 120+ 10-19 6 1
55-64 120+ 20-29 2 1
55-64 120+ 30+ 5 1
65-74 0-39g/day 0-9g/day 5 43
65-74 0-39g/day 10-19 4 10
65-74 0-39g/day 20-29 2 5
65-74 0-39g/day 30+ 0 2
65-74 40-79 0-9g/day 17 17
65-74 40-79 10-19 3 7
65-74 40-79 20-29 5 4
65-74 80-119 0-9g/day 6 7
65-74 80-119 10-19 4 8
65-74 80-119 20-29 2 1
65-74 80-119 30+ 1 0
65-74 120+ 0-9g/day 3 1
65-74 120+ 10-19 1 1
65-74 120+ 20-29 1 0
65-74 120+ 30+ 1 0
75+ 0-39g/day 0-9g/day 1 17
75+ 0-39g/day 10-19 2 4
75+ 0-39g/day 30+ 1 2
75+ 40-79 0-9g/day 2 3
75+ 40-79 10-19 1 2
75+ 40-79 20-29 0 3
75+ 40-79 30+ 1 0
75+ 80-119 0-9g/day 1 0
75+ 80-119 10-19 1 0
75+ 120+ 0-9g/day 2 0
75+ 120+ 10-19 1 0
Click to see how to make a dataframe from scratch

As I said, this is something you rarely need to do. Dataframes are created using a set of named vectors separated by commas, within the function data.frame(). Here I am making a character variable called name, a numerical variable called age, and a logical variable called married.

some_dataframe <- data.frame(name = c("Mary", "Louise", "Tom"),
                             age = c(30, 40, 50),
                             married = c(FALSE, TRUE, TRUE))

Accessing elements with [] and $

We can use the square brackets [] to access specific elements within vectors and dataframes. Let’s create the vector numbers_vec as before and access the second element using numbers_vec[2].

numbers_vec <- c(5, 6, 7, 8, 9)
numbers_vec[2]
## [1] 6

Similarly we can access the ith row and the jth column of a dataframe using [i,j]. Let’s go to the fifth row of the second column of the cancer dataframe using cancer[5,2].

cancer <- esoph
cancer[5,2]
## [1] 40-79
## Levels: 0-39g/day < 40-79 < 80-119 < 120+

You’ll see in the output above that as well as giving the contents of that cell (which is 40-79), it shows: Levels: 0-39g/day < 40-79 < 80-119 < 120+. This is because this particular variable (alcgp) is stored as a factor. A full discussion of factors would be better placed in an intermediate guide to R, but in this context a factor is a variable where each element can have one of several possible values or levels. The allowed levels of the variable alcgp are 0-39g/day, 40-79, 80-119 and 120+. The levels of a factor can be given an order, in which case we would say that the variable is an ‘ordinal variable’. Factors have many uses. In particular they can be used for sorting string variables in an order that is not alphabetical, for example Low, Medium, High.
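As a small illustration of the Low, Medium, High case mentioned above, the sketch below compares sorting a plain character vector with sorting an ordered factor. The vector names (priority, priority_fct) are made up for the example.

```r
# A character vector sorts alphabetically
priority <- c("Medium", "Low", "High", "Low")
sort(priority)

# An ordered factor sorts by the order of its levels instead
priority_fct <- factor(priority,
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)
levels(priority_fct)
sort(priority_fct)

# With an ordered factor, comparisons between levels are meaningful
priority_fct[1] < priority_fct[3]   # is "Medium" lower than "High"?
```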

We can access a whole row or a whole column of a dataframe by using square brackets with either i or j left blank. For example, we can access the whole fifth row of cancer using cancer[5,]:

cancer[5,]
##   agegp alcgp    tobgp ncases ncontrols
## 5 25-34 40-79 0-9g/day      0        27

To access a whole column we can leave i blank and include the value for j (like cancer[,2]). Alternatively, and perhaps more practically, we can refer to the column by name. Here we will output the column agegp using cancer$agegp, and this will print all 88 elements (along with the list of levels, which are ordered).

cancer$agegp
##  [1] 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34 25-34
## [13] 25-34 25-34 25-34 35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44 35-44
## [25] 35-44 35-44 35-44 35-44 35-44 35-44 45-54 45-54 45-54 45-54 45-54 45-54
## [37] 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 45-54 55-64 55-64
## [49] 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64 55-64
## [61] 55-64 55-64 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74 65-74
## [73] 65-74 65-74 65-74 65-74 65-74 75+   75+   75+   75+   75+   75+   75+  
## [85] 75+   75+   75+   75+  
## Levels: 25-34 < 35-44 < 45-54 < 55-64 < 65-74 < 75+

As mentioned in one of the collapsible sections above, the columns of dataframes are themselves vectors. So we can access the fifth element of the column agegp using square brackets, e.g.:

cancer$agegp[5]
## [1] 25-34
## Levels: 25-34 < 35-44 < 45-54 < 55-64 < 65-74 < 75+
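The access patterns in this section can be tried side by side on a tiny dataframe. This is only a sketch: the dataframe staff is invented for the example and mirrors the name, age and married columns used earlier.

```r
staff <- data.frame(name = c("Mary", "Louise", "Tom"),
                    age = c(30, 40, 50),
                    married = c(FALSE, TRUE, TRUE))

staff[2, 2]    # row 2, column 2: a single value
staff[2, ]     # whole second row (still a dataframe)
staff[, 2]     # whole second column (a vector)
staff$age      # the same column, referred to by name
staff$age[3]   # third element of that column
```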

Viewing dataframes

The easiest way to view a dataframe is by clicking on it in the Environment panel. The dataframe will then appear in its own window in the top-left panel, and you can scroll and search within that window. Note that View(dataset name) will appear in the console. Running that line of code would have the same effect.

Beside the name of each dataframe in the environment panel is a blue arrow which allows you to expand the dataset to look at the variable names and the first couple of values. This view also tells you the type of each value (logical (logi), numeric (num) or character (chr)). It also tells you if a variable is stored as a factor, as is the case for three of the variables in the cancer dataset.

A very similar view to that produced by expanding the values in the environment panel can be achieved using the glimpse function (e.g. run glimpse(cancer)). This is output to console.

You can output the first or last couple of lines of a dataframe to console using the head and tail functions (e.g. run head(cancer)). By default they print the first (or last) 6 rows, but you can change that using the extra argument n (e.g. run head(cancer, n = 10)).

You can print an entire dataframe to console by simply running the name of the dataframe on its own.
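Here is a brief sketch of these viewing functions on the built-in mtcars dataset. The object names are made up, and str is used here as a base R stand-in for glimpse (an assumption made so that the snippet runs without loading tidyverse); the two give a broadly similar summary.

```r
first_six <- head(mtcars)           # first 6 rows by default
first_ten <- head(mtcars, n = 10)   # first 10 rows
last_three <- tail(mtcars, n = 3)   # last 3 rows

first_six
last_three

# Compact overview of column names, types and first few values
str(mtcars)
```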

The pipe and five key functions

We’ll go through five of the most important functions in R. These are all from tidyverse, so you will need to load the tidyverse package before progressing.

library(tidyverse)


Most of the examples below start with the dataset cancer which was created as a copy of esoph as shown earlier. In each case the dataset cancer is pushed into a function and the result is shown below. In practice what you might do is create a new dataset to capture the output, by writing the name of your dataset and the assignment operator <- at the top, so instead of…

cancer %>% 

…you might have…

my_new_dataset <- cancer %>% 

The pipe: %>%

The pipe is a tool which allows you to pass an object such as a dataframe through a series of operations. It makes code much easier to read and write. The way it works is that it takes whatever is on the left hand side and pushes it through the function on the right hand side. The shortcut for writing a pipe is ‘Ctrl Shift m’. Suppose we have a dataset called my_dataset and we want to apply a function called some_function, the following two lines of code would be equivalent:

my_dataset %>% some_function()

# the following line is equivalent:
some_function(my_dataset)

It is clearer why the pipe makes code easier to read and write when you consider a series of functions which may each have additional arguments. Suppose we want to subsequently apply another function called another_function, and furthermore that the two functions require argument_A and argument_B. Compare the following two equivalent lines of code.

my_dataset %>% 
  some_function(argument_A = 10) %>% 
  another_function(argument_B = 20) 

# the following line is equivalent, but much harder to read!
another_function( some_function( my_dataset , argument_A = 10), argument_B = 20 )

The pipe always passes the left-hand side as the first argument to the function on the right hand side. What does this mean? It means that functions that work with the pipe have to be set up so that their first argument is the object that is being manipulated (usually a dataframe). All of the tidyverse functions are set up in this way, so it’s not something that you need to worry too much about. If you were creating your own function for manipulating a dataframe in some way (something for an intermediate course) and wanted it to be compatible with the pipe, you would set it up so that the first argument to your function was the incoming dataframe.
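The "first argument" rule can be seen with a small function of your own. The sketch below is illustrative only: add_total and the counts dataframe are hypothetical, and magrittr is loaded directly because it is the package that provides %>% (it is also made available by library(tidyverse)).

```r
library(magrittr)  # provides %>%

# Hypothetical helper: the dataframe is the FIRST argument, so it works with the pipe
add_total <- function(df, col_a, col_b) {
  df$total <- df[[col_a]] + df[[col_b]]
  df
}

counts <- data.frame(ncases = c(0, 1, 3), ncontrols = c(40, 13, 20))

# The pipe passes counts in as the first argument of add_total
piped <- counts %>% add_total("ncases", "ncontrols")

# The same call written without the pipe
nested <- add_total(counts, "ncases", "ncontrols")

identical(piped, nested)
piped
```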


The select function allows you to select specific columns from a dataframe. Like other tidyverse functions, the first argument is the dataframe, so it can be used with the pipe. Then you can pass the names of variables that you want to keep. Here we’ll take the cancer dataframe and keep just the variables agegp and alcgp.

cancer %>% 
  select(agegp , alcgp)
agegp alcgp
25-34 0-39g/day
25-34 0-39g/day
25-34 0-39g/day
25-34 0-39g/day
25-34 40-79
25-34 40-79
25-34 40-79
25-34 40-79
25-34 80-119
25-34 80-119
25-34 80-119
25-34 120+
25-34 120+
25-34 120+
25-34 120+
35-44 0-39g/day
35-44 0-39g/day
35-44 0-39g/day
35-44 0-39g/day
35-44 40-79
35-44 40-79
35-44 40-79
35-44 40-79
35-44 80-119
35-44 80-119
35-44 80-119
35-44 80-119
35-44 120+
35-44 120+
35-44 120+
45-54 0-39g/day
45-54 0-39g/day
45-54 0-39g/day
45-54 0-39g/day
45-54 40-79
45-54 40-79
45-54 40-79
45-54 40-79
45-54 80-119
45-54 80-119
45-54 80-119
45-54 80-119
45-54 120+
45-54 120+
45-54 120+
45-54 120+
55-64 0-39g/day
55-64 0-39g/day
55-64 0-39g/day
55-64 0-39g/day
55-64 40-79
55-64 40-79
55-64 40-79
55-64 40-79
55-64 80-119
55-64 80-119
55-64 80-119
55-64 80-119
55-64 120+
55-64 120+
55-64 120+
55-64 120+
65-74 0-39g/day
65-74 0-39g/day
65-74 0-39g/day
65-74 0-39g/day
65-74 40-79
65-74 40-79
65-74 40-79
65-74 80-119
65-74 80-119
65-74 80-119
65-74 80-119
65-74 120+
65-74 120+
65-74 120+
65-74 120+
75+ 0-39g/day
75+ 0-39g/day
75+ 0-39g/day
75+ 40-79
75+ 40-79
75+ 40-79
75+ 40-79
75+ 80-119
75+ 80-119
75+ 120+
75+ 120+

We can also drop variables by putting a minus sign in front of them. Here we’ll drop the variables alcgp and tobgp.

cancer %>%
  select(-alcgp , -tobgp)
agegp ncases ncontrols
25-34 0 40
25-34 0 10
25-34 0 6
25-34 0 5
25-34 0 27
25-34 0 7
25-34 0 4
25-34 0 7
25-34 0 2
25-34 0 1
25-34 0 2
25-34 0 1
25-34 1 0
25-34 0 1
25-34 0 2
35-44 0 60
35-44 1 13
35-44 0 7
35-44 0 8
35-44 0 35
35-44 3 20
35-44 1 13
35-44 0 8
35-44 0 11
35-44 0 6
35-44 0 2
35-44 0 1
35-44 2 1
35-44 0 3
35-44 2 2
45-54 1 45
45-54 0 18
45-54 0 10
45-54 0 4
45-54 6 32
45-54 4 17
45-54 5 10
45-54 5 2
45-54 3 13
45-54 6 8
45-54 1 4
45-54 2 2
45-54 4 0
45-54 3 1
45-54 2 1
45-54 4 0
55-64 2 47
55-64 3 19
55-64 3 9
55-64 4 2
55-64 9 31
55-64 6 15
55-64 4 13
55-64 3 3
55-64 9 9
55-64 8 7
55-64 3 3
55-64 4 0
55-64 5 5
55-64 6 1
55-64 2 1
55-64 5 1
65-74 5 43
65-74 4 10
65-74 2 5
65-74 0 2
65-74 17 17
65-74 3 7
65-74 5 4
65-74 6 7
65-74 4 8
65-74 2 1
65-74 1 0
65-74 3 1
65-74 1 1
65-74 1 0
65-74 1 0
75+ 1 17
75+ 2 4
75+ 1 2
75+ 2 3
75+ 1 2
75+ 0 3
75+ 1 0
75+ 1 0
75+ 1 0
75+ 2 0
75+ 1 0

There are a couple of really useful ‘selection helper’ functions that help you to keep or drop variables which contain certain string patterns. For example, we can use select with the selection helper function contains with the argument "gp" to select only variables whose names contain the string pattern “gp” (so agegp, alcgp and tobgp). The functions starts_with and ends_with are similar but the variable name has to start or end with the string pattern.

Here we will put a minus in front of contains to drop variables containing the string pattern “gp”.

cancer %>%
  select(-contains("gp"))
ncases ncontrols
0 40
0 10
0 6
0 5
0 27
0 7
0 4
0 7
0 2
0 1
0 2
0 1
1 0
0 1
0 2
0 60
1 13
0 7
0 8
0 35
3 20
1 13
0 8
0 11
0 6
0 2
0 1
2 1
0 3
2 2
1 45
0 18
0 10
0 4
6 32
4 17
5 10
5 2
3 13
6 8
1 4
2 2
4 0
3 1
2 1
4 0
2 47
3 19
3 9
4 2
9 31
6 15
4 13
3 3
9 9
8 7
3 3
4 0
5 5
6 1
2 1
5 1
5 43
4 10
2 5
0 2
17 17
3 7
5 4
6 7
4 8
2 1
1 0
3 1
1 1
1 0
1 0
1 17
2 4
1 2
2 3
1 2
0 3
1 0
1 0
1 0
2 0
1 0

You can select all the columns of a particular type (numerical, character, etc.) using the selection helper function where. It takes as its argument the name of another function which performs a logical check on a variable, examples include is.numeric, is.character, is.logical, is.factor. Let’s select the numerical variables.

cancer %>%
  select(where(is.numeric))
ncases ncontrols
0 40
0 10
0 6
0 5
0 27
0 7
0 4
0 7
0 2
0 1
0 2
0 1
1 0
0 1
0 2
0 60
1 13
0 7
0 8
0 35
3 20
1 13
0 8
0 11
0 6
0 2
0 1
2 1
0 3
2 2
1 45
0 18
0 10
0 4
6 32
4 17
5 10
5 2
3 13
6 8
1 4
2 2
4 0
3 1
2 1
4 0
2 47
3 19
3 9
4 2
9 31
6 15
4 13
3 3
9 9
8 7
3 3
4 0
5 5
6 1
2 1
5 1
5 43
4 10
2 5
0 2
17 17
3 7
5 4
6 7
4 8
2 1
1 0
3 1
1 1
1 0
1 0
1 17
2 4
1 2
2 3
1 2
0 3
1 0
1 0
1 0
2 0
1 0

Note that you can select the opposite (non-numeric columns, for example), by putting a minus or the ‘NOT’ logical operator ! in front of the where function.
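The selection helpers above can be compared on a tiny dataframe, where it is easy to see which columns survive. This is a sketch under a few assumptions: the dataframe toy is invented for the example, dplyr is loaded on its own (it is one of the core tidyverse packages), and a reasonably recent version of dplyr is installed so that where is available.

```r
library(dplyr)

# Hypothetical two-row dataframe with the same style of column names as cancer
toy <- data.frame(agegp = c("25-34", "35-44"),
                  alcgp = c("40-79", "120+"),
                  ncases = c(0, 2),
                  ncontrols = c(27, 1))

toy %>% select(contains("gp")) %>% names()       # agegp, alcgp
toy %>% select(-contains("gp")) %>% names()      # ncases, ncontrols
toy %>% select(starts_with("n")) %>% names()     # ncases, ncontrols
toy %>% select(ends_with("gp")) %>% names()      # agegp, alcgp
toy %>% select(where(is.numeric)) %>% names()    # ncases, ncontrols
toy %>% select(!where(is.numeric)) %>% names()   # agegp, alcgp
```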


The filter function allows you to select specific rows from a dataframe. Again the first argument is the dataframe so that it can be used with the pipe. Then you provide some logical expression based on the variables in the dataframe, and rows where that expression is true are retained in the output. Let’s take the cancer dataframe again and keep only the rows where tobgp is equal to "30+". Note the use of the double equals == which is the comparison operator for ‘equals’.

cancer %>%
  filter(tobgp == "30+")
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 30+ 0 5
25-34 40-79 30+ 0 7
25-34 80-119 30+ 0 2
25-34 120+ 30+ 0 2
35-44 0-39g/day 30+ 0 8
35-44 40-79 30+ 0 8
35-44 80-119 30+ 0 1
45-54 0-39g/day 30+ 0 4
45-54 40-79 30+ 5 2
45-54 80-119 30+ 2 2
45-54 120+ 30+ 4 0
55-64 0-39g/day 30+ 4 2
55-64 40-79 30+ 3 3
55-64 80-119 30+ 4 0
55-64 120+ 30+ 5 1
65-74 0-39g/day 30+ 0 2
65-74 80-119 30+ 1 0
65-74 120+ 30+ 1 0
75+ 0-39g/day 30+ 1 2
75+ 40-79 30+ 1 0

You can specify several conditions within a single filter function. If you have two conditions and want both of them to apply, you can separate them using a comma or the boolean ‘AND’ symbol which is the ampersand &.

If you have two conditions and want either of them to apply you separate them using the boolean ‘OR’ symbol which is the pipe | (unfortunately this character has the same name as the %>% tool in R, but it would be read as ‘OR’).
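The difference between ‘AND’ and ‘OR’ is easiest to see on a handful of rows. The sketch below uses an invented four-row dataframe (toy) and loads dplyr on its own; the result names are also just for illustration.

```r
library(dplyr)

# Hypothetical small dataset
toy <- data.frame(ncases = c(0, 1, 3, 0),
                  ncontrols = c(40, 13, 20, 5))

# AND: a comma and & give the same result (only the row with 3 cases, 20 controls)
both_comma <- toy %>% filter(ncases != 0, ncontrols > 15)
both_amp   <- toy %>% filter(ncases != 0 & ncontrols > 15)

# OR: rows where at least one condition holds (every row except the last)
either <- toy %>% filter(ncases != 0 | ncontrols > 15)

both_comma
both_amp
either
```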

Let’s filter the cancer dataset keeping cases where ncases is not equal to zero or ncontrols is greater than 20.

cancer %>%
  filter(ncases != 0 | ncontrols > 20)
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 40-79 0-9g/day 0 27
25-34 120+ 10-19 1 0
35-44 0-39g/day 0-9g/day 0 60
35-44 0-39g/day 10-19 1 13
35-44 40-79 0-9g/day 0 35
35-44 40-79 10-19 3 20
35-44 40-79 20-29 1 13
35-44 120+ 0-9g/day 2 1
35-44 120+ 20-29 2 2
45-54 0-39g/day 0-9g/day 1 45
45-54 40-79 0-9g/day 6 32
45-54 40-79 10-19 4 17
45-54 40-79 20-29 5 10
45-54 40-79 30+ 5 2
45-54 80-119 0-9g/day 3 13
45-54 80-119 10-19 6 8
45-54 80-119 20-29 1 4
45-54 80-119 30+ 2 2
45-54 120+ 0-9g/day 4 0
45-54 120+ 10-19 3 1
45-54 120+ 20-29 2 1
45-54 120+ 30+ 4 0
55-64 0-39g/day 0-9g/day 2 47
55-64 0-39g/day 10-19 3 19
55-64 0-39g/day 20-29 3 9
55-64 0-39g/day 30+ 4 2
55-64 40-79 0-9g/day 9 31
55-64 40-79 10-19 6 15
55-64 40-79 20-29 4 13
55-64 40-79 30+ 3 3
55-64 80-119 0-9g/day 9 9
55-64 80-119 10-19 8 7
55-64 80-119 20-29 3 3
55-64 80-119 30+ 4 0
55-64 120+ 0-9g/day 5 5
55-64 120+ 10-19 6 1
55-64 120+ 20-29 2 1
55-64 120+ 30+ 5 1
65-74 0-39g/day 0-9g/day 5 43
65-74 0-39g/day 10-19 4 10
65-74 0-39g/day 20-29 2 5
65-74 40-79 0-9g/day 17 17
65-74 40-79 10-19 3 7
65-74 40-79 20-29 5 4
65-74 80-119 0-9g/day 6 7
65-74 80-119 10-19 4 8
65-74 80-119 20-29 2 1
65-74 80-119 30+ 1 0
65-74 120+ 0-9g/day 3 1
65-74 120+ 10-19 1 1
65-74 120+ 20-29 1 0
65-74 120+ 30+ 1 0
75+ 0-39g/day 0-9g/day 1 17
75+ 0-39g/day 10-19 2 4
75+ 0-39g/day 30+ 1 2
75+ 40-79 0-9g/day 2 3
75+ 40-79 10-19 1 2
75+ 40-79 30+ 1 0
75+ 80-119 0-9g/day 1 0
75+ 80-119 10-19 1 0
75+ 120+ 0-9g/day 2 0
75+ 120+ 10-19 1 0

What if we want to filter cases where a variable matches one of a selection of different values? Using multiple OR statements would become messy after about 3 options. A better way is to use the value matching tool %in%. The format is x %in% c(a, b, c, ... ) where c(a, b, c, ... ) is a vector of options which are the same type as the variable x. Let’s filter the cancer dataset keeping only the rows where ncases is equal to 3, 4, 5 or 6.

cancer %>%
  filter(ncases %in% c(3,4,5,6))
agegp alcgp tobgp ncases ncontrols
35-44 40-79 10-19 3 20
45-54 40-79 0-9g/day 6 32
45-54 40-79 10-19 4 17
45-54 40-79 20-29 5 10
45-54 40-79 30+ 5 2
45-54 80-119 0-9g/day 3 13
45-54 80-119 10-19 6 8
45-54 120+ 0-9g/day 4 0
45-54 120+ 10-19 3 1
45-54 120+ 30+ 4 0
55-64 0-39g/day 10-19 3 19
55-64 0-39g/day 20-29 3 9
55-64 0-39g/day 30+ 4 2
55-64 40-79 10-19 6 15
55-64 40-79 20-29 4 13
55-64 40-79 30+ 3 3
55-64 80-119 20-29 3 3
55-64 80-119 30+ 4 0
55-64 120+ 0-9g/day 5 5
55-64 120+ 10-19 6 1
55-64 120+ 30+ 5 1
65-74 0-39g/day 0-9g/day 5 43
65-74 0-39g/day 10-19 4 10
65-74 40-79 10-19 3 7
65-74 40-79 20-29 5 4
65-74 80-119 0-9g/day 6 7
65-74 80-119 10-19 4 8
65-74 120+ 0-9g/day 3 1

Note that since these four options are consecutive integers, we could have written the vector simply as 3:6, so an equivalent piece of code would be:

cancer %>%
  filter(ncases %in% 3:6)
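
If you want to see what %in% is doing on its own, outside of filter, the sketch below applies it to a small made-up vector (x is hypothetical, not part of the cancer dataset). It returns one TRUE or FALSE per element of x, which is exactly what filter needs.

```r
# Hypothetical vector of values to test
x <- c(0, 3, 7, 6, 4)

# TRUE where the element of x appears in the vector of options
x %in% c(3, 4, 5, 6)   # FALSE TRUE FALSE TRUE TRUE

# The shorthand 3:6 gives the same answer
x %in% 3:6             # FALSE TRUE FALSE TRUE TRUE

# Confirm the two versions agree
same_result <- identical(x %in% c(3, 4, 5, 6), x %in% 3:6)
same_result            # TRUE
```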


Simple variable creation

The mutate function is used to edit variables or to create new ones. Let’s take the cancer dataset and create a new variable called new_var which is just equal to ncases plus 5.

cancer %>%
  mutate(new_var = ncases + 5)
agegp alcgp tobgp ncases ncontrols new_var
25-34 0-39g/day 0-9g/day 0 40 5
25-34 0-39g/day 10-19 0 10 5
25-34 0-39g/day 20-29 0 6 5
25-34 0-39g/day 30+ 0 5 5
25-34 40-79 0-9g/day 0 27 5
25-34 40-79 10-19 0 7 5
25-34 40-79 20-29 0 4 5
25-34 40-79 30+ 0 7 5
25-34 80-119 0-9g/day 0 2 5
25-34 80-119 10-19 0 1 5
25-34 80-119 30+ 0 2 5
25-34 120+ 0-9g/day 0 1 5
25-34 120+ 10-19 1 0 6
25-34 120+ 20-29 0 1 5
25-34 120+ 30+ 0 2 5
35-44 0-39g/day 0-9g/day 0 60 5
35-44 0-39g/day 10-19 1 13 6
35-44 0-39g/day 20-29 0 7 5
35-44 0-39g/day 30+ 0 8 5
35-44 40-79 0-9g/day 0 35 5
35-44 40-79 10-19 3 20 8
35-44 40-79 20-29 1 13 6
35-44 40-79 30+ 0 8 5
35-44 80-119 0-9g/day 0 11 5
35-44 80-119 10-19 0 6 5
35-44 80-119 20-29 0 2 5
35-44 80-119 30+ 0 1 5
35-44 120+ 0-9g/day 2 1 7
35-44 120+ 10-19 0 3 5
35-44 120+ 20-29 2 2 7
45-54 0-39g/day 0-9g/day 1 45 6
45-54 0-39g/day 10-19 0 18 5
45-54 0-39g/day 20-29 0 10 5
45-54 0-39g/day 30+ 0 4 5
45-54 40-79 0-9g/day 6 32 11
45-54 40-79 10-19 4 17 9
45-54 40-79 20-29 5 10 10
45-54 40-79 30+ 5 2 10
45-54 80-119 0-9g/day 3 13 8
45-54 80-119 10-19 6 8 11
45-54 80-119 20-29 1 4 6
45-54 80-119 30+ 2 2 7
45-54 120+ 0-9g/day 4 0 9
45-54 120+ 10-19 3 1 8
45-54 120+ 20-29 2 1 7
45-54 120+ 30+ 4 0 9
55-64 0-39g/day 0-9g/day 2 47 7
55-64 0-39g/day 10-19 3 19 8
55-64 0-39g/day 20-29 3 9 8
55-64 0-39g/day 30+ 4 2 9
55-64 40-79 0-9g/day 9 31 14
55-64 40-79 10-19 6 15 11
55-64 40-79 20-29 4 13 9
55-64 40-79 30+ 3 3 8
55-64 80-119 0-9g/day 9 9 14
55-64 80-119 10-19 8 7 13
55-64 80-119 20-29 3 3 8
55-64 80-119 30+ 4 0 9
55-64 120+ 0-9g/day 5 5 10
55-64 120+ 10-19 6 1 11
55-64 120+ 20-29 2 1 7
55-64 120+ 30+ 5 1 10
65-74 0-39g/day 0-9g/day 5 43 10
65-74 0-39g/day 10-19 4 10 9
65-74 0-39g/day 20-29 2 5 7
65-74 0-39g/day 30+ 0 2 5
65-74 40-79 0-9g/day 17 17 22
65-74 40-79 10-19 3 7 8
65-74 40-79 20-29 5 4 10
65-74 80-119 0-9g/day 6 7 11
65-74 80-119 10-19 4 8 9
65-74 80-119 20-29 2 1 7
65-74 80-119 30+ 1 0 6
65-74 120+ 0-9g/day 3 1 8
65-74 120+ 10-19 1 1 6
65-74 120+ 20-29 1 0 6
65-74 120+ 30+ 1 0 6
75+ 0-39g/day 0-9g/day 1 17 6
75+ 0-39g/day 10-19 2 4 7
75+ 0-39g/day 30+ 1 2 6
75+ 40-79 0-9g/day 2 3 7
75+ 40-79 10-19 1 2 6
75+ 40-79 20-29 0 3 5
75+ 40-79 30+ 1 0 6
75+ 80-119 0-9g/day 1 0 6
75+ 80-119 10-19 1 0 6
75+ 120+ 0-9g/day 2 0 7
75+ 120+ 10-19 1 0 6

Conditional variables

We can create a variable whose value is conditional on another variable using if_else. This function takes a logical expression as its first argument, and then its second and third arguments provide the result if the expression is true or false respectively. Let’s make a variable called lots_of_controls which is "Y" if ncontrols is greater than 15, and "N" otherwise. We’ll begin by selecting just the column ncontrols to make the output easier to read.

cancer %>%
  select(ncontrols) %>% 
  mutate(lots_of_controls = if_else(ncontrols > 15 , "Y" , "N"))
ncontrols lots_of_controls
40 Y
10 N
6 N
5 N
27 Y
7 N
4 N
7 N
2 N
1 N
2 N
1 N
0 N
1 N
2 N
60 Y
13 N
7 N
8 N
35 Y
20 Y
13 N
8 N
11 N
6 N
2 N
1 N
1 N
3 N
2 N
45 Y
18 Y
10 N
4 N
32 Y
17 Y
10 N
2 N
13 N
8 N
4 N
2 N
0 N
1 N
1 N
0 N
47 Y
19 Y
9 N
2 N
31 Y
15 N
13 N
3 N
9 N
7 N
3 N
0 N
5 N
1 N
1 N
1 N
43 Y
10 N
5 N
2 N
17 Y
7 N
4 N
7 N
8 N
1 N
0 N
1 N
1 N
0 N
0 N
17 Y
4 N
2 N
3 N
2 N
3 N
0 N
0 N
0 N
0 N
0 N

Let’s say I wanted to do some conditional editing on one of those factor variables…

Suppose you wanted to change the variable tobgp so that instead of having 30+ it would read 30 or more. If tobgp was a regular string variable, you could use the following. Note that the ‘false’ option (the third argument to if_else) is simply tobgp, meaning that tobgp is left as-is if the logical expression is false. This is a common structure, at least for me.

cancer %>%
  mutate(tobgp = if_else(tobgp == "30+" , "30 or more" , tobgp))

However, if you try to run that piece of code you may get an error (older versions of dplyr are strict about this). This is because tobgp is a factor with defined levels, whereas "30 or more" is a plain character string and is not one of those levels.

There are two options here. The first is to change the variable tobgp into a regular character variable using mutate with as.character, and then do the switcheroo.

cancer %>%
  mutate(tobgp = as.character(tobgp)) %>% 
  mutate(tobgp = if_else(tobgp == "30+" , "30 or more" , tobgp))
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 0-39g/day 10-19 0 10
25-34 0-39g/day 20-29 0 6
25-34 0-39g/day 30 or more 0 5
25-34 40-79 0-9g/day 0 27
25-34 40-79 10-19 0 7
25-34 40-79 20-29 0 4
25-34 40-79 30 or more 0 7
25-34 80-119 0-9g/day 0 2
25-34 80-119 10-19 0 1
25-34 80-119 30 or more 0 2
25-34 120+ 0-9g/day 0 1
25-34 120+ 10-19 1 0
25-34 120+ 20-29 0 1
25-34 120+ 30 or more 0 2
35-44 0-39g/day 0-9g/day 0 60
35-44 0-39g/day 10-19 1 13
35-44 0-39g/day 20-29 0 7
35-44 0-39g/day 30 or more 0 8
35-44 40-79 0-9g/day 0 35
35-44 40-79 10-19 3 20
35-44 40-79 20-29 1 13
35-44 40-79 30 or more 0 8
35-44 80-119 0-9g/day 0 11
35-44 80-119 10-19 0 6
35-44 80-119 20-29 0 2
35-44 80-119 30 or more 0 1
35-44 120+ 0-9g/day 2 1
35-44 120+ 10-19 0 3
35-44 120+ 20-29 2 2
45-54 0-39g/day 0-9g/day 1 45
45-54 0-39g/day 10-19 0 18
45-54 0-39g/day 20-29 0 10
45-54 0-39g/day 30 or more 0 4
45-54 40-79 0-9g/day 6 32
45-54 40-79 10-19 4 17
45-54 40-79 20-29 5 10
45-54 40-79 30 or more 5 2
45-54 80-119 0-9g/day 3 13
45-54 80-119 10-19 6 8
45-54 80-119 20-29 1 4
45-54 80-119 30 or more 2 2
45-54 120+ 0-9g/day 4 0
45-54 120+ 10-19 3 1
45-54 120+ 20-29 2 1
45-54 120+ 30 or more 4 0
55-64 0-39g/day 0-9g/day 2 47
55-64 0-39g/day 10-19 3 19
55-64 0-39g/day 20-29 3 9
55-64 0-39g/day 30 or more 4 2
55-64 40-79 0-9g/day 9 31
55-64 40-79 10-19 6 15
55-64 40-79 20-29 4 13
55-64 40-79 30 or more 3 3
55-64 80-119 0-9g/day 9 9
55-64 80-119 10-19 8 7
55-64 80-119 20-29 3 3
55-64 80-119 30 or more 4 0
55-64 120+ 0-9g/day 5 5
55-64 120+ 10-19 6 1
55-64 120+ 20-29 2 1
55-64 120+ 30 or more 5 1
65-74 0-39g/day 0-9g/day 5 43
65-74 0-39g/day 10-19 4 10
65-74 0-39g/day 20-29 2 5
65-74 0-39g/day 30 or more 0 2
65-74 40-79 0-9g/day 17 17
65-74 40-79 10-19 3 7
65-74 40-79 20-29 5 4
65-74 80-119 0-9g/day 6 7
65-74 80-119 10-19 4 8
65-74 80-119 20-29 2 1
65-74 80-119 30 or more 1 0
65-74 120+ 0-9g/day 3 1
65-74 120+ 10-19 1 1
65-74 120+ 20-29 1 0
65-74 120+ 30 or more 1 0
75+ 0-39g/day 0-9g/day 1 17
75+ 0-39g/day 10-19 2 4
75+ 0-39g/day 30 or more 1 2
75+ 40-79 0-9g/day 2 3
75+ 40-79 10-19 1 2
75+ 40-79 20-29 0 3
75+ 40-79 30 or more 1 0
75+ 80-119 0-9g/day 1 0
75+ 80-119 10-19 1 0
75+ 120+ 0-9g/day 2 0
75+ 120+ 10-19 1 0

That would work, but you would lose the ordered factor levels of tobgp, which are useful for sorting. You could re-create the factor again (not hard but beyond the scope of a crash course). A better way might be to recode the level "30+" in the original factor using the function fct_recode. The format is new_level = old_level. Here’s how you would do that:

cancer %>%
  mutate(tobgp = fct_recode(tobgp , "30 or more" = "30+"))
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 0-39g/day 10-19 0 10
25-34 0-39g/day 20-29 0 6
25-34 0-39g/day 30 or more 0 5
25-34 40-79 0-9g/day 0 27
25-34 40-79 10-19 0 7
25-34 40-79 20-29 0 4
25-34 40-79 30 or more 0 7
25-34 80-119 0-9g/day 0 2
25-34 80-119 10-19 0 1
25-34 80-119 30 or more 0 2
25-34 120+ 0-9g/day 0 1
25-34 120+ 10-19 1 0
25-34 120+ 20-29 0 1
25-34 120+ 30 or more 0 2
35-44 0-39g/day 0-9g/day 0 60
35-44 0-39g/day 10-19 1 13
35-44 0-39g/day 20-29 0 7
35-44 0-39g/day 30 or more 0 8
35-44 40-79 0-9g/day 0 35
35-44 40-79 10-19 3 20
35-44 40-79 20-29 1 13
35-44 40-79 30 or more 0 8
35-44 80-119 0-9g/day 0 11
35-44 80-119 10-19 0 6
35-44 80-119 20-29 0 2
35-44 80-119 30 or more 0 1
35-44 120+ 0-9g/day 2 1
35-44 120+ 10-19 0 3
35-44 120+ 20-29 2 2
45-54 0-39g/day 0-9g/day 1 45
45-54 0-39g/day 10-19 0 18
45-54 0-39g/day 20-29 0 10
45-54 0-39g/day 30 or more 0 4
45-54 40-79 0-9g/day 6 32
45-54 40-79 10-19 4 17
45-54 40-79 20-29 5 10
45-54 40-79 30 or more 5 2
45-54 80-119 0-9g/day 3 13
45-54 80-119 10-19 6 8
45-54 80-119 20-29 1 4
45-54 80-119 30 or more 2 2
45-54 120+ 0-9g/day 4 0
45-54 120+ 10-19 3 1
45-54 120+ 20-29 2 1
45-54 120+ 30 or more 4 0
55-64 0-39g/day 0-9g/day 2 47
55-64 0-39g/day 10-19 3 19
55-64 0-39g/day 20-29 3 9
55-64 0-39g/day 30 or more 4 2
55-64 40-79 0-9g/day 9 31
55-64 40-79 10-19 6 15
55-64 40-79 20-29 4 13
55-64 40-79 30 or more 3 3
55-64 80-119 0-9g/day 9 9
55-64 80-119 10-19 8 7
55-64 80-119 20-29 3 3
55-64 80-119 30 or more 4 0
55-64 120+ 0-9g/day 5 5
55-64 120+ 10-19 6 1
55-64 120+ 20-29 2 1
55-64 120+ 30 or more 5 1
65-74 0-39g/day 0-9g/day 5 43
65-74 0-39g/day 10-19 4 10
65-74 0-39g/day 20-29 2 5
65-74 0-39g/day 30 or more 0 2
65-74 40-79 0-9g/day 17 17
65-74 40-79 10-19 3 7
65-74 40-79 20-29 5 4
65-74 80-119 0-9g/day 6 7
65-74 80-119 10-19 4 8
65-74 80-119 20-29 2 1
65-74 80-119 30 or more 1 0
65-74 120+ 0-9g/day 3 1
65-74 120+ 10-19 1 1
65-74 120+ 20-29 1 0
65-74 120+ 30 or more 1 0
75+ 0-39g/day 0-9g/day 1 17
75+ 0-39g/day 10-19 2 4
75+ 0-39g/day 30 or more 1 2
75+ 40-79 0-9g/day 2 3
75+ 40-79 10-19 1 2
75+ 40-79 20-29 0 3
75+ 40-79 30 or more 1 0
75+ 80-119 0-9g/day 1 0
75+ 80-119 10-19 1 0
75+ 120+ 0-9g/day 2 0
75+ 120+ 10-19 1 0
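
To see that fct_recode keeps the levels in their original order, here is a small sketch on a stand-alone factor. The factor tob is made up for illustration, with the same four levels as tobgp appears to have in the output above; treat that as an assumption rather than a description of the real dataset.

```r
library(forcats)

# Hypothetical ordered factor with the same levels as tobgp
tob <- factor(
  c("0-9g/day", "30+", "10-19", "30+"),
  levels  = c("0-9g/day", "10-19", "20-29", "30+"),
  ordered = TRUE
)

# Recode one level; the format is new_level = old_level
tob_recoded <- fct_recode(tob, "30 or more" = "30+")

# The relabelled level stays in last position
levels(tob_recoded)

# The values themselves have been relabelled too
as.character(tob_recoded)
```

The level that was "30+" is still the last level, so sorting by the recoded factor behaves as before.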

The function case_when provides even greater (potentially unlimited) options for conditionally defining a value. The format for this function is a logical expression (condition) followed by a tilde (~) followed by the value to be assigned in the event of that expression being true. This is repeated for further conditions, with each option separated by a comma, and usually written on separate lines for ease of reading. Below is an example where amount_of_controls has a value of "very few" if ncontrols is less than 10, "a couple" if ncontrols is between 10 and 19, and "lots" for ncontrols equal to 20 or more.

cancer %>%
  select(ncontrols) %>% 
  mutate(amount_of_controls = case_when(
    ncontrols < 10 ~ "very few",
    ncontrols < 20 ~ "a couple",
    ncontrols >= 20 ~ "lots"
  ))
ncontrols amount_of_controls
40 lots
10 a couple
6 very few
5 very few
27 lots
7 very few
4 very few
7 very few
2 very few
1 very few
2 very few
1 very few
0 very few
1 very few
2 very few
60 lots
13 a couple
7 very few
8 very few
35 lots
20 lots
13 a couple
8 very few
11 a couple
6 very few
2 very few
1 very few
1 very few
3 very few
2 very few
45 lots
18 a couple
10 a couple
4 very few
32 lots
17 a couple
10 a couple
2 very few
13 a couple
8 very few
4 very few
2 very few
0 very few
1 very few
1 very few
0 very few
47 lots
19 a couple
9 very few
2 very few
31 lots
15 a couple
13 a couple
3 very few
9 very few
7 very few
3 very few
0 very few
5 very few
1 very few
1 very few
1 very few
43 lots
10 a couple
5 very few
2 very few
17 a couple
7 very few
4 very few
7 very few
8 very few
1 very few
0 very few
1 very few
1 very few
0 very few
0 very few
17 a couple
4 very few
2 very few
3 very few
2 very few
3 very few
0 very few
0 very few
0 very few
0 very few
0 very few

Note that the value returned by case_when is determined by the first true expression. This means that for the second condition we don’t have to specify that ncontrols is greater than or equal to 10: we write ncontrols < 20 rather than ncontrols >= 10 & ncontrols < 20. This is because the possibility that ncontrols is less than 10 is already covered by the first expression. If ncontrols was equal to 5, for example, it would never make it past the first condition into the second condition.
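
Because the first true expression wins, the order of the conditions matters. The sketch below uses a made-up vector of three values (not the cancer dataset) to show what happens if the first two conditions are swapped.

```r
library(dplyr)

# Hypothetical values: one from each band
ncontrols <- c(5, 12, 25)

# Conditions in the intended order
in_order <- case_when(
  ncontrols < 10  ~ "very few",
  ncontrols < 20  ~ "a couple",
  ncontrols >= 20 ~ "lots"
)

# First two conditions swapped: 5 is caught by "< 20" before "< 10" is tried
swapped <- case_when(
  ncontrols < 20  ~ "a couple",
  ncontrols < 10  ~ "very few",
  ncontrols >= 20 ~ "lots"
)

in_order  # "very few" "a couple" "lots"
swapped   # "a couple" "a couple" "lots"
```

So the narrowest condition should generally come first.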

For the third expression (ncontrols >= 20) it shouldn’t have been necessary to specify any logical expression at all, since all cases remaining after the first two expressions must be 20 or more and should be categorised as "lots". We can therefore replace ncontrols >= 20 with the word TRUE and get the same result (output not shown but identical to the above).

cancer %>%
  select(ncontrols) %>% 
  mutate(amount_of_controls = case_when(
    ncontrols < 10 ~ "very few",
    ncontrols < 20 ~ "a couple",
    TRUE ~ "lots"
  ))
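
It is worth knowing what happens when there is no catch-all at the end: any value that matches none of the conditions comes back as NA. Here is a minimal sketch on a made-up vector (again, not the cancer dataset).

```r
library(dplyr)

# Hypothetical values: one from each band
ncontrols <- c(5, 12, 25)

# No condition covers 25, so it is returned as NA
no_catch_all <- case_when(
  ncontrols < 10 ~ "very few",
  ncontrols < 20 ~ "a couple"
)

# TRUE is always true, so it catches everything left over
with_catch_all <- case_when(
  ncontrols < 10 ~ "very few",
  ncontrols < 20 ~ "a couple",
  TRUE           ~ "lots"
)

no_catch_all    # "very few" "a couple" NA
with_catch_all  # "very few" "a couple" "lots"
```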

Summary, size and row number functions

Before we move on from mutate I want to mention a few other functions that will be useful later. We can create a variable equal to the minimum, maximum or sum of another variable using the functions min, max and sum. Let’s calculate the sum of ncases as a separate variable. Since there is only one total, the same value is repeated on every row.

cancer %>%
  mutate(total_cases = sum(ncases))
agegp alcgp tobgp ncases ncontrols total_cases
25-34 0-39g/day 0-9g/day 0 40 200
25-34 0-39g/day 10-19 0 10 200
25-34 0-39g/day 20-29 0 6 200
25-34 0-39g/day 30+ 0 5 200
25-34 40-79 0-9g/day 0 27 200
25-34 40-79 10-19 0 7 200
25-34 40-79 20-29 0 4 200
25-34 40-79 30+ 0 7 200
25-34 80-119 0-9g/day 0 2 200
25-34 80-119 10-19 0 1 200
25-34 80-119 30+ 0 2 200
25-34 120+ 0-9g/day 0 1 200
25-34 120+ 10-19 1 0 200
25-34 120+ 20-29 0 1 200
25-34 120+ 30+ 0 2 200
35-44 0-39g/day 0-9g/day 0 60 200
35-44 0-39g/day 10-19 1 13 200
35-44 0-39g/day 20-29 0 7 200
35-44 0-39g/day 30+ 0 8 200
35-44 40-79 0-9g/day 0 35 200
35-44 40-79 10-19 3 20 200
35-44 40-79 20-29 1 13 200
35-44 40-79 30+ 0 8 200
35-44 80-119 0-9g/day 0 11 200
35-44 80-119 10-19 0 6 200
35-44 80-119 20-29 0 2 200
35-44 80-119 30+ 0 1 200
35-44 120+ 0-9g/day 2 1 200
35-44 120+ 10-19 0 3 200
35-44 120+ 20-29 2 2 200
45-54 0-39g/day 0-9g/day 1 45 200
45-54 0-39g/day 10-19 0 18 200
45-54 0-39g/day 20-29 0 10 200
45-54 0-39g/day 30+ 0 4 200
45-54 40-79 0-9g/day 6 32 200
45-54 40-79 10-19 4 17 200
45-54 40-79 20-29 5 10 200
45-54 40-79 30+ 5 2 200
45-54 80-119 0-9g/day 3 13 200
45-54 80-119 10-19 6 8 200
45-54 80-119 20-29 1 4 200
45-54 80-119 30+ 2 2 200
45-54 120+ 0-9g/day 4 0 200
45-54 120+ 10-19 3 1 200
45-54 120+ 20-29 2 1 200
45-54 120+ 30+ 4 0 200
55-64 0-39g/day 0-9g/day 2 47 200
55-64 0-39g/day 10-19 3 19 200
55-64 0-39g/day 20-29 3 9 200
55-64 0-39g/day 30+ 4 2 200
55-64 40-79 0-9g/day 9 31 200
55-64 40-79 10-19 6 15 200
55-64 40-79 20-29 4 13 200
55-64 40-79 30+ 3 3 200
55-64 80-119 0-9g/day 9 9 200
55-64 80-119 10-19 8 7 200
55-64 80-119 20-29 3 3 200
55-64 80-119 30+ 4 0 200
55-64 120+ 0-9g/day 5 5 200
55-64 120+ 10-19 6 1 200
55-64 120+ 20-29 2 1 200
55-64 120+ 30+ 5 1 200
65-74 0-39g/day 0-9g/day 5 43 200
65-74 0-39g/day 10-19 4 10 200
65-74 0-39g/day 20-29 2 5 200
65-74 0-39g/day 30+ 0 2 200
65-74 40-79 0-9g/day 17 17 200
65-74 40-79 10-19 3 7 200
65-74 40-79 20-29 5 4 200
65-74 80-119 0-9g/day 6 7 200
65-74 80-119 10-19 4 8 200
65-74 80-119 20-29 2 1 200
65-74 80-119 30+ 1 0 200
65-74 120+ 0-9g/day 3 1 200
65-74 120+ 10-19 1 1 200
65-74 120+ 20-29 1 0 200
65-74 120+ 30+ 1 0 200
75+ 0-39g/day 0-9g/day 1 17 200
75+ 0-39g/day 10-19 2 4 200
75+ 0-39g/day 30+ 1 2 200
75+ 40-79 0-9g/day 2 3 200
75+ 40-79 10-19 1 2 200
75+ 40-79 20-29 0 3 200
75+ 40-79 30+ 1 0 200
75+ 80-119 0-9g/day 1 0 200
75+ 80-119 10-19 1 0 200
75+ 120+ 0-9g/day 2 0 200
75+ 120+ 10-19 1 0 200

We can also create a variable equal to the row number using row_number. We’ll create my_id using this function. Finally we can get the number of entries in the group using n(). Let’s call this variable number_of_rows. We’ll create both of these variables at the same time. Note that you can create multiple variables in a single mutate function (separated by commas). You can even create a new variable in a mutate function and create another variable which depends on the first one within the same mutate function.

We will see later that row_number and n are pretty versatile functions.

cancer %>%
  mutate(my_id = row_number() , number_of_rows = n())
agegp alcgp tobgp ncases ncontrols my_id number_of_rows
25-34 0-39g/day 0-9g/day 0 40 1 88
25-34 0-39g/day 10-19 0 10 2 88
25-34 0-39g/day 20-29 0 6 3 88
25-34 0-39g/day 30+ 0 5 4 88
25-34 40-79 0-9g/day 0 27 5 88
25-34 40-79 10-19 0 7 6 88
25-34 40-79 20-29 0 4 7 88
25-34 40-79 30+ 0 7 8 88
25-34 80-119 0-9g/day 0 2 9 88
25-34 80-119 10-19 0 1 10 88
25-34 80-119 30+ 0 2 11 88
25-34 120+ 0-9g/day 0 1 12 88
25-34 120+ 10-19 1 0 13 88
25-34 120+ 20-29 0 1 14 88
25-34 120+ 30+ 0 2 15 88
35-44 0-39g/day 0-9g/day 0 60 16 88
35-44 0-39g/day 10-19 1 13 17 88
35-44 0-39g/day 20-29 0 7 18 88
35-44 0-39g/day 30+ 0 8 19 88
35-44 40-79 0-9g/day 0 35 20 88
35-44 40-79 10-19 3 20 21 88
35-44 40-79 20-29 1 13 22 88
35-44 40-79 30+ 0 8 23 88
35-44 80-119 0-9g/day 0 11 24 88
35-44 80-119 10-19 0 6 25 88
35-44 80-119 20-29 0 2 26 88
35-44 80-119 30+ 0 1 27 88
35-44 120+ 0-9g/day 2 1 28 88
35-44 120+ 10-19 0 3 29 88
35-44 120+ 20-29 2 2 30 88
45-54 0-39g/day 0-9g/day 1 45 31 88
45-54 0-39g/day 10-19 0 18 32 88
45-54 0-39g/day 20-29 0 10 33 88
45-54 0-39g/day 30+ 0 4 34 88
45-54 40-79 0-9g/day 6 32 35 88
45-54 40-79 10-19 4 17 36 88
45-54 40-79 20-29 5 10 37 88
45-54 40-79 30+ 5 2 38 88
45-54 80-119 0-9g/day 3 13 39 88
45-54 80-119 10-19 6 8 40 88
45-54 80-119 20-29 1 4 41 88
45-54 80-119 30+ 2 2 42 88
45-54 120+ 0-9g/day 4 0 43 88
45-54 120+ 10-19 3 1 44 88
45-54 120+ 20-29 2 1 45 88
45-54 120+ 30+ 4 0 46 88
55-64 0-39g/day 0-9g/day 2 47 47 88
55-64 0-39g/day 10-19 3 19 48 88
55-64 0-39g/day 20-29 3 9 49 88
55-64 0-39g/day 30+ 4 2 50 88
55-64 40-79 0-9g/day 9 31 51 88
55-64 40-79 10-19 6 15 52 88
55-64 40-79 20-29 4 13 53 88
55-64 40-79 30+ 3 3 54 88
55-64 80-119 0-9g/day 9 9 55 88
55-64 80-119 10-19 8 7 56 88
55-64 80-119 20-29 3 3 57 88
55-64 80-119 30+ 4 0 58 88
55-64 120+ 0-9g/day 5 5 59 88
55-64 120+ 10-19 6 1 60 88
55-64 120+ 20-29 2 1 61 88
55-64 120+ 30+ 5 1 62 88
65-74 0-39g/day 0-9g/day 5 43 63 88
65-74 0-39g/day 10-19 4 10 64 88
65-74 0-39g/day 20-29 2 5 65 88
65-74 0-39g/day 30+ 0 2 66 88
65-74 40-79 0-9g/day 17 17 67 88
65-74 40-79 10-19 3 7 68 88
65-74 40-79 20-29 5 4 69 88
65-74 80-119 0-9g/day 6 7 70 88
65-74 80-119 10-19 4 8 71 88
65-74 80-119 20-29 2 1 72 88
65-74 80-119 30+ 1 0 73 88
65-74 120+ 0-9g/day 3 1 74 88
65-74 120+ 10-19 1 1 75 88
65-74 120+ 20-29 1 0 76 88
65-74 120+ 30+ 1 0 77 88
75+ 0-39g/day 0-9g/day 1 17 78 88
75+ 0-39g/day 10-19 2 4 79 88
75+ 0-39g/day 30+ 1 2 80 88
75+ 40-79 0-9g/day 2 3 81 88
75+ 40-79 10-19 1 2 82 88
75+ 40-79 20-29 0 3 83 88
75+ 40-79 30+ 1 0 84 88
75+ 80-119 0-9g/day 1 0 85 88
75+ 80-119 10-19 1 0 86 88
75+ 120+ 0-9g/day 2 0 87 88
75+ 120+ 10-19 1 0 88 88
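
The last point above, creating a variable and then using it straight away within the same mutate, can be sketched as follows. The dataframe toy and the variable names total and case_share are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the cancer dataframe
toy <- data.frame(
  ncases    = c(1, 3, 0),
  ncontrols = c(9, 7, 5)
)

result <- toy %>%
  mutate(
    total      = ncases + ncontrols,  # created first
    case_share = ncases / total       # uses the variable created on the line above
  )

result$total       # 10 10 5
result$case_share  # 0.1 0.3 0.0
```

The variables are created in the order they are written, so case_share has to come after total.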


The function summarise is used to produce aggregated data from a dataframe. In practice it is almost always used in combination with summary functions such as max, sum, etc.

Let’s calculate the sum of ncases and ncontrols from the cancer dataset. We will also calculate the number of rows in the dataset using the function n().

cancer %>%
  summarise(total_cases = sum(ncases) , 
            total_controls = sum(ncontrols),
            number_of_rows = n())
total_cases total_controls number_of_rows
200 775 88

Note that whereas mutate added the new calculations as additional columns and kept every row, summarise has done away with the original data and returned a single row.
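
A quick way to see the difference between the two functions is to compare the dimensions of what they return. This sketch uses a small made-up dataframe (toy is hypothetical, not the cancer dataset).

```r
library(dplyr)

# Hypothetical toy data: 3 rows, 2 columns
toy <- data.frame(
  ncases    = c(1, 3, 0),
  ncontrols = c(9, 7, 5)
)

# mutate keeps every row and adds a column
mutated <- toy %>%
  mutate(total_cases = sum(ncases))

# summarise collapses the data to one row holding only the summary
summarised <- toy %>%
  summarise(total_cases = sum(ncases))

dim(mutated)     # 3 rows, 3 columns
dim(summarised)  # 1 row, 1 column
```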


The real power of mutate and summarise becomes clear when you start to use them in combination with group_by. This function allows you to calculate counts and summary functions over groups within the data. With mutate the new grouped aggregated data is added to the dataset, and with summarise only the aggregated data for each group remains.

group_by with mutate

Let’s calculate the total number of cases for each of the four bands of tobacco intake (tobgp) and add that as a new column called total_cases.

cancer %>%
  group_by(tobgp) %>%
  mutate(total_cases = sum(ncases))
agegp alcgp tobgp ncases ncontrols total_cases
25-34 0-39g/day 0-9g/day 0 40 78
25-34 0-39g/day 10-19 0 10 58
25-34 0-39g/day 20-29 0 6 33
25-34 0-39g/day 30+ 0 5 31
25-34 40-79 0-9g/day 0 27 78
25-34 40-79 10-19 0 7 58
25-34 40-79 20-29 0 4 33
25-34 40-79 30+ 0 7 31
25-34 80-119 0-9g/day 0 2 78
25-34 80-119 10-19 0 1 58
25-34 80-119 30+ 0 2 31
25-34 120+ 0-9g/day 0 1 78
25-34 120+ 10-19 1 0 58
25-34 120+ 20-29 0 1 33
25-34 120+ 30+ 0 2 31
35-44 0-39g/day 0-9g/day 0 60 78
35-44 0-39g/day 10-19 1 13 58
35-44 0-39g/day 20-29 0 7 33
35-44 0-39g/day 30+ 0 8 31
35-44 40-79 0-9g/day 0 35 78
35-44 40-79 10-19 3 20 58
35-44 40-79 20-29 1 13 33
35-44 40-79 30+ 0 8 31
35-44 80-119 0-9g/day 0 11 78
35-44 80-119 10-19 0 6 58
35-44 80-119 20-29 0 2 33
35-44 80-119 30+ 0 1 31
35-44 120+ 0-9g/day 2 1 78
35-44 120+ 10-19 0 3 58
35-44 120+ 20-29 2 2 33
45-54 0-39g/day 0-9g/day 1 45 78
45-54 0-39g/day 10-19 0 18 58
45-54 0-39g/day 20-29 0 10 33
45-54 0-39g/day 30+ 0 4 31
45-54 40-79 0-9g/day 6 32 78
45-54 40-79 10-19 4 17 58
45-54 40-79 20-29 5 10 33
45-54 40-79 30+ 5 2 31
45-54 80-119 0-9g/day 3 13 78
45-54 80-119 10-19 6 8 58
45-54 80-119 20-29 1 4 33
45-54 80-119 30+ 2 2 31
45-54 120+ 0-9g/day 4 0 78
45-54 120+ 10-19 3 1 58
45-54 120+ 20-29 2 1 33
45-54 120+ 30+ 4 0 31
55-64 0-39g/day 0-9g/day 2 47 78
55-64 0-39g/day 10-19 3 19 58
55-64 0-39g/day 20-29 3 9 33
55-64 0-39g/day 30+ 4 2 31
55-64 40-79 0-9g/day 9 31 78
55-64 40-79 10-19 6 15 58
55-64 40-79 20-29 4 13 33
55-64 40-79 30+ 3 3 31
55-64 80-119 0-9g/day 9 9 78
55-64 80-119 10-19 8 7 58
55-64 80-119 20-29 3 3 33
55-64 80-119 30+ 4 0 31
55-64 120+ 0-9g/day 5 5 78
55-64 120+ 10-19 6 1 58
55-64 120+ 20-29 2 1 33
55-64 120+ 30+ 5 1 31
65-74 0-39g/day 0-9g/day 5 43 78
65-74 0-39g/day 10-19 4 10 58
65-74 0-39g/day 20-29 2 5 33
65-74 0-39g/day 30+ 0 2 31
65-74 40-79 0-9g/day 17 17 78
65-74 40-79 10-19 3 7 58
65-74 40-79 20-29 5 4 33
65-74 80-119 0-9g/day 6 7 78
65-74 80-119 10-19 4 8 58
65-74 80-119 20-29 2 1 33
65-74 80-119 30+ 1 0 31
65-74 120+ 0-9g/day 3 1 78
65-74 120+ 10-19 1 1 58
65-74 120+ 20-29 1 0 33
65-74 120+ 30+ 1 0 31
75+ 0-39g/day 0-9g/day 1 17 78
75+ 0-39g/day 10-19 2 4 58
75+ 0-39g/day 30+ 1 2 31
75+ 40-79 0-9g/day 2 3 78
75+ 40-79 10-19 1 2 58
75+ 40-79 20-29 0 3 33
75+ 40-79 30+ 1 0 31
75+ 80-119 0-9g/day 1 0 78
75+ 80-119 10-19 1 0 58
75+ 120+ 0-9g/day 2 0 78
75+ 120+ 10-19 1 0 58

The row number function can be used with group_by to produce a counter or unique number for each group. Here we’ll create a counter called age_group_counter for each value of agegp.

cancer %>%
  group_by(agegp) %>% 
  mutate(age_group_counter = row_number())
agegp alcgp tobgp ncases ncontrols age_group_counter
25-34 0-39g/day 0-9g/day 0 40 1
25-34 0-39g/day 10-19 0 10 2
25-34 0-39g/day 20-29 0 6 3
25-34 0-39g/day 30+ 0 5 4
25-34 40-79 0-9g/day 0 27 5
25-34 40-79 10-19 0 7 6
25-34 40-79 20-29 0 4 7
25-34 40-79 30+ 0 7 8
25-34 80-119 0-9g/day 0 2 9
25-34 80-119 10-19 0 1 10
25-34 80-119 30+ 0 2 11
25-34 120+ 0-9g/day 0 1 12
25-34 120+ 10-19 1 0 13
25-34 120+ 20-29 0 1 14
25-34 120+ 30+ 0 2 15
35-44 0-39g/day 0-9g/day 0 60 1
35-44 0-39g/day 10-19 1 13 2
35-44 0-39g/day 20-29 0 7 3
35-44 0-39g/day 30+ 0 8 4
35-44 40-79 0-9g/day 0 35 5
35-44 40-79 10-19 3 20 6
35-44 40-79 20-29 1 13 7
35-44 40-79 30+ 0 8 8
35-44 80-119 0-9g/day 0 11 9
35-44 80-119 10-19 0 6 10
35-44 80-119 20-29 0 2 11
35-44 80-119 30+ 0 1 12
35-44 120+ 0-9g/day 2 1 13
35-44 120+ 10-19 0 3 14
35-44 120+ 20-29 2 2 15
45-54 0-39g/day 0-9g/day 1 45 1
45-54 0-39g/day 10-19 0 18 2
45-54 0-39g/day 20-29 0 10 3
45-54 0-39g/day 30+ 0 4 4
45-54 40-79 0-9g/day 6 32 5
45-54 40-79 10-19 4 17 6
45-54 40-79 20-29 5 10 7
45-54 40-79 30+ 5 2 8
45-54 80-119 0-9g/day 3 13 9
45-54 80-119 10-19 6 8 10
45-54 80-119 20-29 1 4 11
45-54 80-119 30+ 2 2 12
45-54 120+ 0-9g/day 4 0 13
45-54 120+ 10-19 3 1 14
45-54 120+ 20-29 2 1 15
45-54 120+ 30+ 4 0 16
55-64 0-39g/day 0-9g/day 2 47 1
55-64 0-39g/day 10-19 3 19 2
55-64 0-39g/day 20-29 3 9 3
55-64 0-39g/day 30+ 4 2 4
55-64 40-79 0-9g/day 9 31 5
55-64 40-79 10-19 6 15 6
55-64 40-79 20-29 4 13 7
55-64 40-79 30+ 3 3 8
55-64 80-119 0-9g/day 9 9 9
55-64 80-119 10-19 8 7 10
55-64 80-119 20-29 3 3 11
55-64 80-119 30+ 4 0 12
55-64 120+ 0-9g/day 5 5 13
55-64 120+ 10-19 6 1 14
55-64 120+ 20-29 2 1 15
55-64 120+ 30+ 5 1 16
65-74 0-39g/day 0-9g/day 5 43 1
65-74 0-39g/day 10-19 4 10 2
65-74 0-39g/day 20-29 2 5 3
65-74 0-39g/day 30+ 0 2 4
65-74 40-79 0-9g/day 17 17 5
65-74 40-79 10-19 3 7 6
65-74 40-79 20-29 5 4 7
65-74 80-119 0-9g/day 6 7 8
65-74 80-119 10-19 4 8 9
65-74 80-119 20-29 2 1 10
65-74 80-119 30+ 1 0 11
65-74 120+ 0-9g/day 3 1 12
65-74 120+ 10-19 1 1 13
65-74 120+ 20-29 1 0 14
65-74 120+ 30+ 1 0 15
75+ 0-39g/day 0-9g/day 1 17 1
75+ 0-39g/day 10-19 2 4 2
75+ 0-39g/day 30+ 1 2 3
75+ 40-79 0-9g/day 2 3 4
75+ 40-79 10-19 1 2 5
75+ 40-79 20-29 0 3 6
75+ 40-79 30+ 1 0 7
75+ 80-119 0-9g/day 1 0 8
75+ 80-119 10-19 1 0 9
75+ 120+ 0-9g/day 2 0 10
75+ 120+ 10-19 1 0 11
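
The same idea works for n() and sum as well as row_number, and you can mix them in one mutate. Here is a minimal sketch on a made-up dataframe with two age groups; toy and the new variable names are my own, chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data with two age groups of different sizes
toy <- data.frame(
  agegp  = c("25-34", "25-34", "35-44", "35-44", "35-44"),
  ncases = c(0, 1, 2, 0, 3)
)

result <- toy %>%
  group_by(agegp) %>%
  mutate(
    counter     = row_number(),  # restarts at 1 within each group
    group_size  = n(),           # number of rows in the group
    group_cases = sum(ncases)    # total cases within the group
  ) %>%
  ungroup()

result$counter      # 1 2 1 2 3
result$group_size   # 2 2 3 3 3
result$group_cases  # 1 1 5 5 5
```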

group_by with summarise

Let’s get the total number of cases by age group (agegp) and tobacco intake group (tobgp).

Note that we group by two variables using a single group_by statement. If we use a second group_by it will overwrite the previous grouping; the function is not ‘additive’ in that way.

cancer %>%
  group_by(agegp, tobgp) %>% 
  summarise(total_cases = sum(ncases))
agegp tobgp total_cases
25-34 0-9g/day 0
25-34 10-19 1
25-34 20-29 0
25-34 30+ 0
35-44 0-9g/day 2
35-44 10-19 4
35-44 20-29 3
35-44 30+ 0
45-54 0-9g/day 14
45-54 10-19 13
45-54 20-29 8
45-54 30+ 11
55-64 0-9g/day 25
55-64 10-19 23
55-64 20-29 12
55-64 30+ 16
65-74 0-9g/day 31
65-74 10-19 12
65-74 20-29 10
65-74 30+ 2
75+ 0-9g/day 6
75+ 10-19 5
75+ 20-29 0
75+ 30+ 2
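
If you are ever unsure how a dataframe is currently grouped, the dplyr function group_vars will tell you. The sketch below uses it on a made-up dataframe (toy is hypothetical) to show that a second group_by replaces the first grouping rather than adding to it.

```r
library(dplyr)

# Hypothetical toy data standing in for the cancer dataframe
toy <- data.frame(
  agegp  = c("25-34", "25-34", "35-44"),
  tobgp  = c("10-19", "30+", "10-19"),
  ncases = c(1, 0, 4)
)

# Both variables in one group_by
both <- toy %>%
  group_by(agegp, tobgp)

# A second group_by overwrites the first
overwritten <- toy %>%
  group_by(agegp) %>%
  group_by(tobgp)

group_vars(both)         # "agegp" "tobgp"
group_vars(overwritten)  # "tobgp"
```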


After using a group_by you may wish to apply the function ungroup(). This could avoid unexpected outcomes later, for example if you were using a summary function on a dataset that you had grouped for the purposes of some mutate process. Here’s how you would use it (results not shown):

cancer %>%
  group_by(tobgp) %>% 
  mutate(total_cases = sum(ncases)) %>%
  ungroup()
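
To illustrate the kind of unexpected outcome that ungroup avoids, here is a minimal sketch on a made-up dataframe (toy is hypothetical). The same summarise gives a different answer depending on whether the grouping is still in place.

```r
library(dplyr)

# Hypothetical toy data standing in for the cancer dataframe
toy <- data.frame(
  tobgp  = c("10-19", "10-19", "30+"),
  ncases = c(1, 2, 4)
)

# Grouped for the purposes of a mutate, and left grouped
still_grouped <- toy %>%
  group_by(tobgp) %>%
  mutate(total_cases = sum(ncases))

# The same data with the grouping removed
ungrouped <- still_grouped %>%
  ungroup()

# Counting rows: one answer per group versus one overall answer
count_grouped   <- still_grouped %>% summarise(n_rows = n())
count_ungrouped <- ungrouped %>% summarise(n_rows = n())

nrow(count_grouped)     # 2, one row for each value of tobgp
nrow(count_ungrouped)   # 1, a single overall row
count_ungrouped$n_rows  # 3
```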

Other useful tools

Renaming variables

You will probably need to rename variables at some stage. We use the rename function to do that. This is another tidyverse function, so its first argument is the dataframe to be edited and therefore it can be used with the pipe. You can rename more than one variable in a single rename function. Let’s take the cancer dataset and rename the variable agegp as age_group, and alcgp as alcohol_group. Note that the order of the variable names in this function is new_name = old_name.

cancer %>%
  rename(age_group = agegp , alcohol_group = alcgp)
age_group alcohol_group tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 0-39g/day 10-19 0 10
25-34 0-39g/day 20-29 0 6
25-34 0-39g/day 30+ 0 5
25-34 40-79 0-9g/day 0 27
25-34 40-79 10-19 0 7
25-34 40-79 20-29 0 4
25-34 40-79 30+ 0 7
25-34 80-119 0-9g/day 0 2
25-34 80-119 10-19 0 1
25-34 80-119 30+ 0 2
25-34 120+ 0-9g/day 0 1
25-34 120+ 10-19 1 0
25-34 120+ 20-29 0 1
25-34 120+ 30+ 0 2
35-44 0-39g/day 0-9g/day 0 60
35-44 0-39g/day 10-19 1 13
35-44 0-39g/day 20-29 0 7
35-44 0-39g/day 30+ 0 8
35-44 40-79 0-9g/day 0 35
35-44 40-79 10-19 3 20
35-44 40-79 20-29 1 13
35-44 40-79 30+ 0 8
35-44 80-119 0-9g/day 0 11
35-44 80-119 10-19 0 6
35-44 80-119 20-29 0 2
35-44 80-119 30+ 0 1
35-44 120+ 0-9g/day 2 1
35-44 120+ 10-19 0 3
35-44 120+ 20-29 2 2
45-54 0-39g/day 0-9g/day 1 45
45-54 0-39g/day 10-19 0 18
45-54 0-39g/day 20-29 0 10
45-54 0-39g/day 30+ 0 4
45-54 40-79 0-9g/day 6 32
45-54 40-79 10-19 4 17
45-54 40-79 20-29 5 10
45-54 40-79 30+ 5 2
45-54 80-119 0-9g/day 3 13
45-54 80-119 10-19 6 8
45-54 80-119 20-29 1 4
45-54 80-119 30+ 2 2
45-54 120+ 0-9g/day 4 0
45-54 120+ 10-19 3 1
45-54 120+ 20-29 2 1
45-54 120+ 30+ 4 0
55-64 0-39g/day 0-9g/day 2 47
55-64 0-39g/day 10-19 3 19
55-64 0-39g/day 20-29 3 9
55-64 0-39g/day 30+ 4 2
55-64 40-79 0-9g/day 9 31
55-64 40-79 10-19 6 15
55-64 40-79 20-29 4 13
55-64 40-79 30+ 3 3
55-64 80-119 0-9g/day 9 9
55-64 80-119 10-19 8 7
55-64 80-119 20-29 3 3
55-64 80-119 30+ 4 0
55-64 120+ 0-9g/day 5 5
55-64 120+ 10-19 6 1
55-64 120+ 20-29 2 1
55-64 120+ 30+ 5 1
65-74 0-39g/day 0-9g/day 5 43
65-74 0-39g/day 10-19 4 10
65-74 0-39g/day 20-29 2 5
65-74 0-39g/day 30+ 0 2
65-74 40-79 0-9g/day 17 17
65-74 40-79 10-19 3 7
65-74 40-79 20-29 5 4
65-74 80-119 0-9g/day 6 7
65-74 80-119 10-19 4 8
65-74 80-119 20-29 2 1
65-74 80-119 30+ 1 0
65-74 120+ 0-9g/day 3 1
65-74 120+ 10-19 1 1
65-74 120+ 20-29 1 0
65-74 120+ 30+ 1 0
75+ 0-39g/day 0-9g/day 1 17
75+ 0-39g/day 10-19 2 4
75+ 0-39g/day 30+ 1 2
75+ 40-79 0-9g/day 2 3
75+ 40-79 10-19 1 2
75+ 40-79 20-29 0 3
75+ 40-79 30+ 1 0
75+ 80-119 0-9g/day 1 0
75+ 80-119 10-19 1 0
75+ 120+ 0-9g/day 2 0
75+ 120+ 10-19 1 0

Sorting datasets

You can sort datasets using the function arrange. This is another tidyverse function so has the dataframe as the first argument and is pipe-friendly. You can sort by as many variables as you like. Let’s sort the cancer dataset by alcgp and then by ncases. Note that since alcgp is an ordered factor (ordinal variable), it is sorted according to the order of the levels rather than alphabetically.

cancer %>%
  arrange(alcgp, ncases)
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 0-39g/day 10-19 0 10
25-34 0-39g/day 20-29 0 6
25-34 0-39g/day 30+ 0 5
35-44 0-39g/day 0-9g/day 0 60
35-44 0-39g/day 20-29 0 7
35-44 0-39g/day 30+ 0 8
45-54 0-39g/day 10-19 0 18
45-54 0-39g/day 20-29 0 10
45-54 0-39g/day 30+ 0 4
65-74 0-39g/day 30+ 0 2
35-44 0-39g/day 10-19 1 13
45-54 0-39g/day 0-9g/day 1 45
75+ 0-39g/day 0-9g/day 1 17
75+ 0-39g/day 30+ 1 2
55-64 0-39g/day 0-9g/day 2 47
65-74 0-39g/day 20-29 2 5
75+ 0-39g/day 10-19 2 4
55-64 0-39g/day 10-19 3 19
55-64 0-39g/day 20-29 3 9
55-64 0-39g/day 30+ 4 2
65-74 0-39g/day 10-19 4 10
65-74 0-39g/day 0-9g/day 5 43
25-34 40-79 0-9g/day 0 27
25-34 40-79 10-19 0 7
25-34 40-79 20-29 0 4
25-34 40-79 30+ 0 7
35-44 40-79 0-9g/day 0 35
35-44 40-79 30+ 0 8
75+ 40-79 20-29 0 3
35-44 40-79 20-29 1 13
75+ 40-79 10-19 1 2
75+ 40-79 30+ 1 0
75+ 40-79 0-9g/day 2 3
35-44 40-79 10-19 3 20
55-64 40-79 30+ 3 3
65-74 40-79 10-19 3 7
45-54 40-79 10-19 4 17
55-64 40-79 20-29 4 13
45-54 40-79 20-29 5 10
45-54 40-79 30+ 5 2
65-74 40-79 20-29 5 4
45-54 40-79 0-9g/day 6 32
55-64 40-79 10-19 6 15
55-64 40-79 0-9g/day 9 31
65-74 40-79 0-9g/day 17 17
25-34 80-119 0-9g/day 0 2
25-34 80-119 10-19 0 1
25-34 80-119 30+ 0 2
35-44 80-119 0-9g/day 0 11
35-44 80-119 10-19 0 6
35-44 80-119 20-29 0 2
35-44 80-119 30+ 0 1
45-54 80-119 20-29 1 4
65-74 80-119 30+ 1 0
75+ 80-119 0-9g/day 1 0
75+ 80-119 10-19 1 0
45-54 80-119 30+ 2 2
65-74 80-119 20-29 2 1
45-54 80-119 0-9g/day 3 13
55-64 80-119 20-29 3 3
55-64 80-119 30+ 4 0
65-74 80-119 10-19 4 8
45-54 80-119 10-19 6 8
65-74 80-119 0-9g/day 6 7
55-64 80-119 10-19 8 7
55-64 80-119 0-9g/day 9 9
25-34 120+ 0-9g/day 0 1
25-34 120+ 20-29 0 1
25-34 120+ 30+ 0 2
35-44 120+ 10-19 0 3
25-34 120+ 10-19 1 0
65-74 120+ 10-19 1 1
65-74 120+ 20-29 1 0
65-74 120+ 30+ 1 0
75+ 120+ 10-19 1 0
35-44 120+ 0-9g/day 2 1
35-44 120+ 20-29 2 2
45-54 120+ 20-29 2 1
55-64 120+ 20-29 2 1
75+ 120+ 0-9g/day 2 0
45-54 120+ 10-19 3 1
65-74 120+ 0-9g/day 3 1
45-54 120+ 0-9g/day 4 0
45-54 120+ 30+ 4 0
55-64 120+ 0-9g/day 5 5
55-64 120+ 30+ 5 1
55-64 120+ 10-19 6 1
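
As a small sketch of sorting in descending order, the block below wraps a variable in desc inside arrange. The dataframe toy_cancer is made up for illustration (it is not the cancer dataset), and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: an ordered factor and a count column
toy_cancer <- data.frame(
  alcgp = factor(c("40-79", "0-39g/day", "40-79", "0-39g/day"),
                 levels = c("0-39g/day", "40-79"), ordered = TRUE),
  ncases = c(3, 0, 9, 2)
)

# Sort by the factor levels of alcgp, then by ncases from highest to lowest
sorted <- toy_cancer %>%
  arrange(alcgp, desc(ncases))

print(sorted)
```

Within each alcgp level, the rows with the most cases now come first.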

Combining datasets


You will probably need to join datasets together by some common id or other variable. We don’t have anything to join onto the cancer dataset yet, so let’s make a dataframe called ‘data_set_to_join’. It has one variable in common with the cancer dataset, tobgp, but with one extra value of tobgp, "All amounts", that does not appear in cancer. It also has another variable called tobacco_code.

data_set_to_join <- data.frame(
  tobgp = c("0-9g/day", "10-19", "20-29", "30+", "All amounts"),
  tobacco_code = c("A", "B", "C", "D", "X")
)
tobgp tobacco_code
0-9g/day A
10-19 B
20-29 C
30+ D
All amounts X

The join functions from tidyverse are inner_join, left_join, right_join and full_join. You can probably guess what each of these does (if you’re not sure check the help, e.g. ?left_join). Let’s do a left join of the cancer dataset onto data_set_to_join. The first two arguments to the join function are the two datasets to be joined, and then there is a by argument which is the joining variable wrapped in quotes. If there is more than one joining variable then they are passed to by as a vector (e.g. by = c("var_1", "var_2")). If you leave out the by argument, the join uses all of the variables that the two datasets have in common and prints a message telling you which ones it used.

We’ll output the resulting dataframe into a new object called ‘combined_dataset’. The value of "All amounts" for tobgp does not appear in combined_dataset since this is a left join, but it would appear if you did a full_join or a right_join instead.

combined_dataset <- left_join(cancer, data_set_to_join , by = "tobgp")
agegp alcgp tobgp ncases ncontrols tobacco_code
25-34 0-39g/day 0-9g/day 0 40 A
25-34 0-39g/day 10-19 0 10 B
25-34 0-39g/day 20-29 0 6 C
25-34 0-39g/day 30+ 0 5 D
25-34 40-79 0-9g/day 0 27 A
25-34 40-79 10-19 0 7 B
25-34 40-79 20-29 0 4 C
25-34 40-79 30+ 0 7 D
25-34 80-119 0-9g/day 0 2 A
25-34 80-119 10-19 0 1 B
25-34 80-119 30+ 0 2 D
25-34 120+ 0-9g/day 0 1 A
25-34 120+ 10-19 1 0 B
25-34 120+ 20-29 0 1 C
25-34 120+ 30+ 0 2 D
35-44 0-39g/day 0-9g/day 0 60 A
35-44 0-39g/day 10-19 1 13 B
35-44 0-39g/day 20-29 0 7 C
35-44 0-39g/day 30+ 0 8 D
35-44 40-79 0-9g/day 0 35 A
35-44 40-79 10-19 3 20 B
35-44 40-79 20-29 1 13 C
35-44 40-79 30+ 0 8 D
35-44 80-119 0-9g/day 0 11 A
35-44 80-119 10-19 0 6 B
35-44 80-119 20-29 0 2 C
35-44 80-119 30+ 0 1 D
35-44 120+ 0-9g/day 2 1 A
35-44 120+ 10-19 0 3 B
35-44 120+ 20-29 2 2 C
45-54 0-39g/day 0-9g/day 1 45 A
45-54 0-39g/day 10-19 0 18 B
45-54 0-39g/day 20-29 0 10 C
45-54 0-39g/day 30+ 0 4 D
45-54 40-79 0-9g/day 6 32 A
45-54 40-79 10-19 4 17 B
45-54 40-79 20-29 5 10 C
45-54 40-79 30+ 5 2 D
45-54 80-119 0-9g/day 3 13 A
45-54 80-119 10-19 6 8 B
45-54 80-119 20-29 1 4 C
45-54 80-119 30+ 2 2 D
45-54 120+ 0-9g/day 4 0 A
45-54 120+ 10-19 3 1 B
45-54 120+ 20-29 2 1 C
45-54 120+ 30+ 4 0 D
55-64 0-39g/day 0-9g/day 2 47 A
55-64 0-39g/day 10-19 3 19 B
55-64 0-39g/day 20-29 3 9 C
55-64 0-39g/day 30+ 4 2 D
55-64 40-79 0-9g/day 9 31 A
55-64 40-79 10-19 6 15 B
55-64 40-79 20-29 4 13 C
55-64 40-79 30+ 3 3 D
55-64 80-119 0-9g/day 9 9 A
55-64 80-119 10-19 8 7 B
55-64 80-119 20-29 3 3 C
55-64 80-119 30+ 4 0 D
55-64 120+ 0-9g/day 5 5 A
55-64 120+ 10-19 6 1 B
55-64 120+ 20-29 2 1 C
55-64 120+ 30+ 5 1 D
65-74 0-39g/day 0-9g/day 5 43 A
65-74 0-39g/day 10-19 4 10 B
65-74 0-39g/day 20-29 2 5 C
65-74 0-39g/day 30+ 0 2 D
65-74 40-79 0-9g/day 17 17 A
65-74 40-79 10-19 3 7 B
65-74 40-79 20-29 5 4 C
65-74 80-119 0-9g/day 6 7 A
65-74 80-119 10-19 4 8 B
65-74 80-119 20-29 2 1 C
65-74 80-119 30+ 1 0 D
65-74 120+ 0-9g/day 3 1 A
65-74 120+ 10-19 1 1 B
65-74 120+ 20-29 1 0 C
65-74 120+ 30+ 1 0 D
75+ 0-39g/day 0-9g/day 1 17 A
75+ 0-39g/day 10-19 2 4 B
75+ 0-39g/day 30+ 1 2 D
75+ 40-79 0-9g/day 2 3 A
75+ 40-79 10-19 1 2 B
75+ 40-79 20-29 0 3 C
75+ 40-79 30+ 1 0 D
75+ 80-119 0-9g/day 1 0 A
75+ 80-119 10-19 1 0 B
75+ 120+ 0-9g/day 2 0 A
75+ 120+ 10-19 1 0 B
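
To see the difference between a left join and a full join in miniature, here is a sketch using two made-up dataframes (toy_cancer and toy_codes are hypothetical names, not objects from this guide). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: the lookup table has one value ("All amounts")
# that never appears in the main table
toy_cancer <- data.frame(
  tobgp = c("0-9g/day", "10-19", "0-9g/day"),
  ncases = c(0, 1, 2)
)
toy_codes <- data.frame(
  tobgp = c("0-9g/day", "10-19", "All amounts"),
  tobacco_code = c("A", "B", "X")
)

# Left join: keeps every row of toy_cancer, so "All amounts" is dropped
left_result <- left_join(toy_cancer, toy_codes, by = "tobgp")

# Full join: keeps rows from both sides, so "All amounts" appears
# with NA for ncases
full_result <- full_join(toy_cancer, toy_codes, by = "tobgp")

print(left_result)
print(full_result)
```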


We can also combine datasets by binding them vertically using bind_rows and horizontally using bind_cols. These functions take the datasets to be bound together as arguments. We don’t have any datasets to bind ready to hand, so let’s produce ‘cancer_young’ containing the rows from ‘cancer’ where the age is "25-34" and ‘cancer_old’ where the age is "75+".

cancer_young <- cancer %>% 
  filter(agegp == "25-34")

cancer_old <- cancer %>% 
  filter(agegp == "75+")

Now we can bind these together vertically using bind_rows (sorry that this example is a bit artificial).

combined_dataset <- bind_rows(cancer_young, cancer_old)
agegp alcgp tobgp ncases ncontrols
25-34 0-39g/day 0-9g/day 0 40
25-34 0-39g/day 10-19 0 10
25-34 0-39g/day 20-29 0 6
25-34 0-39g/day 30+ 0 5
25-34 40-79 0-9g/day 0 27
25-34 40-79 10-19 0 7
25-34 40-79 20-29 0 4
25-34 40-79 30+ 0 7
25-34 80-119 0-9g/day 0 2
25-34 80-119 10-19 0 1
25-34 80-119 30+ 0 2
25-34 120+ 0-9g/day 0 1
25-34 120+ 10-19 1 0
25-34 120+ 20-29 0 1
25-34 120+ 30+ 0 2
75+ 0-39g/day 0-9g/day 1 17
75+ 0-39g/day 10-19 2 4
75+ 0-39g/day 30+ 1 2
75+ 40-79 0-9g/day 2 3
75+ 40-79 10-19 1 2
75+ 40-79 20-29 0 3
75+ 40-79 30+ 1 0
75+ 80-119 0-9g/day 1 0
75+ 80-119 10-19 1 0
75+ 120+ 0-9g/day 2 0
75+ 120+ 10-19 1 0

If there was a variable in one dataset but not in the other, then it would appear in the resulting dataset with values of NA for the rows coming from the dataset where it did not exist.
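
Here is a quick sketch of that NA-filling behaviour, using two made-up dataframes (df_a and df_b are hypothetical) that share only the id column. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: score exists only in df_a, grade only in df_b
df_a <- data.frame(id = 1:2, score = c(10, 20))
df_b <- data.frame(id = 3:4, grade = c("x", "y"))

# Columns are matched by name; anything missing on one side becomes NA
stacked <- bind_rows(df_a, df_b)

print(stacked)
```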

The function bind_cols works in a similar fashion. One or more of the arguments can be a vector. It produces an error if the dataframes/vectors have varying numbers of rows/elements.
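
A minimal sketch of bind_cols, again with made-up dataframes (left_part and right_part are hypothetical). The second half uses tryCatch just to show that a mismatch in the number of rows is reported as an error rather than silently padded. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with the same number of rows
left_part <- data.frame(id = 1:3)
right_part <- data.frame(value = c(2.5, 3.5, 4.5))

# Rows are matched by position, not by any key
side_by_side <- bind_cols(left_part, right_part)
print(side_by_side)

# Binding 3 rows to 2 rows fails; we catch the error to show that it happens
mismatch <- tryCatch(
  bind_cols(left_part, data.frame(value = 1:2)),
  error = function(e) "error"
)
print(mismatch)
```

Because rows are matched purely by position, it is worth checking that both inputs are sorted the same way before binding.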

File input and output

As well as data stored in R’s own file format (.rds) and fairly common data file types like .csv and .txt, you can import a very wide range of data files from other programs through different packages. The package readxl provides functions for reading in Excel files, and the package haven has functions read_sas for SAS files, read_stata for Stata files, and read_sav for SPSS files. haven also has ‘write’ versions of some of its functions (e.g. write_dta and write_sav) that allow you to output these types of files from R; readxl only reads, so writing Excel files needs a different package such as writexl. Both readxl and haven are installed as part of the tidyverse collection but they are not loaded with library(tidyverse), so you would have to load them separately to use these functions (e.g. library(readxl)).

Here we’ll just look at CSVs and the native data file format for R (.rds).


We can read in a CSV using read_csv. The only argument it really needs is the filename, so you could use it like:

my_dataframe <- read_csv("my_folder/my_csv_file.csv")

By default it assumes that there is a header row in your CSV which becomes the column names for the dataframe. If that is not the case then use the additional argument col_names = FALSE, and it will come up with some default column names on its own.

To write a dataframe to CSV you use write_csv, which takes a dataframe as the first argument and the filename as the second argument, so you could use it like:

write_csv(my_dataframe, "my_folder/my_csv_file.csv")
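
If you want to try these without touching your own files, here is a sketch that writes a made-up dataframe to a temporary file and reads it back, with and without treating the first row as a header. The names example_df and csv_path are hypothetical, and show_col_types = FALSE (which only silences the column-type message) assumes a reasonably recent version of readr.

```r
library(readr)

# Hypothetical toy data written to a temporary file
example_df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))
csv_path <- tempfile(fileext = ".csv")
write_csv(example_df, csv_path)

# Default: the first row becomes the column names
read_back <- read_csv(csv_path, show_col_types = FALSE)

# col_names = FALSE: the header row is treated as data and
# default column names are generated
no_header <- read_csv(csv_path, col_names = FALSE, show_col_types = FALSE)

print(names(read_back))
print(names(no_header))
```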

One thing to watch with read_csv and read_excel: when these functions read a file, they guess the type of each column by looking at the first rows. The number of rows examined is set by the argument guess_max, which defaults to 1,000. I had a case where I had a very large Excel file and there were some columns with more than 1,000 missing cells before it got to an actual non-missing value. The function read_excel then mistook the type of the column and it wasn’t read correctly. So I had to set guess_max = Inf (Inf meaning infinity) so that it checked all the rows to correctly determine column type.


To save a single dataframe to file, use saveRDS. It works the same way as write_csv, e.g.

saveRDS(my_dataframe, "my_folder/my_rds_file.rds")

And to read such a file use readRDS, like so:

a_new_dataframe <- readRDS("my_folder/my_rds_file.rds")

save and load functions

You may also see the functions save and load, though I don’t use them much. The key differences are that a single call to save or load can handle as many objects as you like, and that load brings back all of the objects under the names they were saved with, so you don’t use load with the assignment operator to make a new object.
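
To make the contrast concrete, here is a sketch that round-trips made-up objects through temporary files using both approaches. The object and file names (example_df, other_object, rds_path, rdata_path) are hypothetical; everything used is base R.

```r
# Hypothetical toy objects
example_df <- data.frame(id = 1:3, label = c("a", "b", "c"))
other_object <- c(10, 20)

# saveRDS / readRDS: one object per file, and you choose the name on reading
rds_path <- tempfile(fileext = ".rds")
saveRDS(example_df, rds_path)
restored_df <- readRDS(rds_path)

# save / load: several objects per file, restored under their original names
rdata_path <- tempfile(fileext = ".RData")
save(example_df, other_object, file = rdata_path)
rm(example_df, other_object)      # remove them to show that load restores them
loaded_names <- load(rdata_path)  # load returns the names of what it restored

print(loaded_names)
```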


Pivoting dataframes can be tricky but I need to do it a lot with data imported from the CSO’s open data portal so I’ll give a guide here.

To keep things simple we will only consider pivots involving one column for names and one for values when pivoting wider, and one resultant column for names and one resultant column for values when pivoting longer.


Let’s take our cancer dataset again, forget about ncases (we’ll drop that) and pivot the table so that we have four new columns, one for each of the tobgp values, and the values within those columns will be ncontrols for the appropriate combination of tobgp, agegp and alcgp. For this we use pivot_wider. The two important arguments to this are names_from, which is the name of the variable that will provide names for our new columns, and values_from, which is the variable whose values will sit in those new columns. We’ll pop the output into a new dataframe called wide_cancer_dataset.

wide_cancer_dataset <- cancer %>% 
  select(-ncases) %>% 
  pivot_wider(names_from = tobgp, values_from = ncontrols)
agegp alcgp 0-9g/day 10-19 20-29 30+
25-34 0-39g/day 40 10 6 5
25-34 40-79 27 7 4 7
25-34 80-119 2 1 NA 2
25-34 120+ 1 0 1 2
35-44 0-39g/day 60 13 7 8
35-44 40-79 35 20 13 8
35-44 80-119 11 6 2 1
35-44 120+ 1 3 2 NA
45-54 0-39g/day 45 18 10 4
45-54 40-79 32 17 10 2
45-54 80-119 13 8 4 2
45-54 120+ 0 1 1 0
55-64 0-39g/day 47 19 9 2
55-64 40-79 31 15 13 3
55-64 80-119 9 7 3 0
55-64 120+ 5 1 1 1
65-74 0-39g/day 43 10 5 2
65-74 40-79 17 7 4 NA
65-74 80-119 7 8 1 0
65-74 120+ 1 1 0 0
75+ 0-39g/day 17 4 NA 2
75+ 40-79 3 2 3 0
75+ 80-119 0 0 NA NA
75+ 120+ 0 0 NA NA

You might notice some NA values in the new dataset. This is because wide_cancer_dataset includes some combinations that weren’t in the original cancer dataset, e.g. agegp = "25-34", alcgp = "80-119" and tobgp = "20-29" didn’t appear in the cancer dataset. If you wanted to set those NA values to zero at the time of pivoting, you could use the extra argument values_fill = 0.

You can also add a prefix to your new columns using the additional argument names_prefix, which takes a string, e.g. names_prefix = "tobgp_".
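
Here is a sketch of those two extra arguments on a tiny made-up dataframe (toy_long is hypothetical, not the cancer dataset). It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the combination agegp "35-44" / tobgp "10-19" is missing
toy_long <- data.frame(
  agegp = c("25-34", "25-34", "35-44"),
  tobgp = c("0-9g/day", "10-19", "0-9g/day"),
  ncontrols = c(40, 10, 60)
)

# values_fill = 0 replaces the NA for the missing combination with zero;
# names_prefix puts "tobgp_" in front of each new column name
toy_wide <- toy_long %>%
  pivot_wider(names_from = tobgp,
              values_from = ncontrols,
              values_fill = 0,
              names_prefix = "tobgp_")

print(toy_wide)
```

Only use values_fill = 0 when a missing combination genuinely means zero; otherwise it is safer to leave the NA in place.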


Let’s take wide_cancer_dataset and convert it back into a longer format. We’ll use the function pivot_longer. It has three key arguments:

- cols: this specifies the columns to be pivoted into rows
- names_to: the name for the new column which contains the names of the columns that you are pivoting
- values_to: the name for the new column which contains the values from the columns that you are pivoting

wide_cancer_dataset %>% 
  pivot_longer(cols = c("0-9g/day", "10-19",  "20-29","30+"), 
               names_to = "tobgp", 
               values_to = "ncontrols")
agegp alcgp tobgp ncontrols
25-34 0-39g/day 0-9g/day 40
25-34 0-39g/day 10-19 10
25-34 0-39g/day 20-29 6
25-34 0-39g/day 30+ 5
25-34 40-79 0-9g/day 27
25-34 40-79 10-19 7
25-34 40-79 20-29 4
25-34 40-79 30+ 7
25-34 80-119 0-9g/day 2
25-34 80-119 10-19 1
25-34 80-119 20-29 NA
25-34 80-119 30+ 2
25-34 120+ 0-9g/day 1
25-34 120+ 10-19 0
25-34 120+ 20-29 1
25-34 120+ 30+ 2
35-44 0-39g/day 0-9g/day 60
35-44 0-39g/day 10-19 13
35-44 0-39g/day 20-29 7
35-44 0-39g/day 30+ 8
35-44 40-79 0-9g/day 35
35-44 40-79 10-19 20
35-44 40-79 20-29 13
35-44 40-79 30+ 8
35-44 80-119 0-9g/day 11
35-44 80-119 10-19 6
35-44 80-119 20-29 2
35-44 80-119 30+ 1
35-44 120+ 0-9g/day 1
35-44 120+ 10-19 3
35-44 120+ 20-29 2
35-44 120+ 30+ NA
45-54 0-39g/day 0-9g/day 45
45-54 0-39g/day 10-19 18
45-54 0-39g/day 20-29 10
45-54 0-39g/day 30+ 4
45-54 40-79 0-9g/day 32
45-54 40-79 10-19 17
45-54 40-79 20-29 10
45-54 40-79 30+ 2
45-54 80-119 0-9g/day 13
45-54 80-119 10-19 8
45-54 80-119 20-29 4
45-54 80-119 30+ 2
45-54 120+ 0-9g/day 0
45-54 120+ 10-19 1
45-54 120+ 20-29 1
45-54 120+ 30+ 0
55-64 0-39g/day 0-9g/day 47
55-64 0-39g/day 10-19 19
55-64 0-39g/day 20-29 9
55-64 0-39g/day 30+ 2
55-64 40-79 0-9g/day 31
55-64 40-79 10-19 15
55-64 40-79 20-29 13
55-64 40-79 30+ 3
55-64 80-119 0-9g/day 9
55-64 80-119 10-19 7
55-64 80-119 20-29 3
55-64 80-119 30+ 0
55-64 120+ 0-9g/day 5
55-64 120+ 10-19 1
55-64 120+ 20-29 1
55-64 120+ 30+ 1
65-74 0-39g/day 0-9g/day 43
65-74 0-39g/day 10-19 10
65-74 0-39g/day 20-29 5
65-74 0-39g/day 30+ 2
65-74 40-79 0-9g/day 17
65-74 40-79 10-19 7
65-74 40-79 20-29 4
65-74 40-79 30+ NA
65-74 80-119 0-9g/day 7
65-74 80-119 10-19 8
65-74 80-119 20-29 1
65-74 80-119 30+ 0
65-74 120+ 0-9g/day 1
65-74 120+ 10-19 1
65-74 120+ 20-29 0
65-74 120+ 30+ 0
75+ 0-39g/day 0-9g/day 17
75+ 0-39g/day 10-19 4
75+ 0-39g/day 20-29 NA
75+ 0-39g/day 30+ 2
75+ 40-79 0-9g/day 3
75+ 40-79 10-19 2
75+ 40-79 20-29 3
75+ 40-79 30+ 0
75+ 80-119 0-9g/day 0
75+ 80-119 10-19 0
75+ 80-119 20-29 NA
75+ 80-119 30+ NA
75+ 120+ 0-9g/day 0
75+ 120+ 10-19 0
75+ 120+ 20-29 NA
75+ 120+ 30+ NA

That’s all fine, but specifying the four column names is a bit fiddly. Instead we can specify the columns which will not be pivoted, which are agegp and alcgp. To do that we pop in these column names with the argument cols = but put a minus in front of the vector containing these names. The output isn’t shown here but the result is the same as the above.

wide_cancer_dataset %>% 
  pivot_longer(cols = -c("alcgp", "agegp"), 
               names_to = "tobgp", 
               values_to = "ncontrols")

You should know that the selection helper functions (like contains, starts_with and ends_with) can be used when specifying the list of columns which are to be pivoted. Suppose in making wide_cancer_dataset that we had used names_prefix = "tobgp_" so that each of our tobacco columns started with the string "tobgp_". Then we could have pivoted longer by specifying our column selection as cols = starts_with("tobgp_").
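
A sketch of that idea on a made-up wide dataframe (toy_wide is hypothetical). I have also used names_prefix inside pivot_longer, which strips the prefix from the column names as they are turned into values. It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data whose tobacco columns all start with "tobgp_"
toy_wide <- data.frame(
  agegp = c("25-34", "75+"),
  `tobgp_0-9g/day` = c(40, 17),
  `tobgp_20-29` = c(6, NA),
  check.names = FALSE   # keep the column names exactly as written
)

# Select the columns to pivot by their shared prefix, then strip the prefix
toy_long <- toy_wide %>%
  pivot_longer(cols = starts_with("tobgp_"),
               names_to = "tobgp",
               names_prefix = "tobgp_",
               values_to = "ncontrols")

print(toy_long)
```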

We could also have used the where selection helper together with is.numeric to pivot only the numerical type columns. The code for this is shown below but the results are not since they are the same as before.

wide_cancer_dataset %>% 
  pivot_longer(cols = where(is.numeric), 
               names_to = "tobgp", 
               values_to = "ncontrols")

One last thing is that the NA values from wide_cancer_dataset have appeared in the output. If you didn’t want that to be the case you could use the additional argument values_drop_na, setting it equal to TRUE.
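
Here is a small sketch comparing the two behaviours on a made-up wide dataframe with one missing value (toy_wide is hypothetical). It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a single NA
toy_wide <- data.frame(
  agegp = c("25-34", "75+"),
  `0-9g/day` = c(40, 17),
  `20-29` = c(6, NA),
  check.names = FALSE   # keep the column names exactly as written
)

# Default: the NA is carried into the long format as its own row
kept <- toy_wide %>%
  pivot_longer(cols = -agegp, names_to = "tobgp", values_to = "ncontrols")

# values_drop_na = TRUE: rows whose value is NA are left out
dropped <- toy_wide %>%
  pivot_longer(cols = -agegp, names_to = "tobgp", values_to = "ncontrols",
               values_drop_na = TRUE)

print(nrow(kept))
print(nrow(dropped))
```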


Introduction to ggplot

Plotting is done through the ggplot2 package, which is a core tidyverse package and is loaded with library(tidyverse).

There are other handy functions for plotting like plot and qplot, but I think it’s best to start with ggplot2 because the format is so general.

The way it works is something like the pseudocode shown below. It starts out with the function ggplot which just takes the dataframe as its argument. Then there is a geometry type function which specifies the kind of plot that will be made. There is geom_point for points (scatterplot), geom_line for lines, geom_col which I use for both column and bar charts, and lots more besides. I’ve just used geom_XXX as a placeholder here, there is no function called geom_XXX.

Within the geometry function is the aesthetic mapping function aes. This is where you specify what variables in the data are associated with x and y as well as other variables you might have like size, shape, colour, fill etc.

You can have more than one geometry function, e.g. geom_line and geom_point for a plot with both points and lines. You only ever have one ggplot function per plot though.

The other_functions alluded to below refers to functions for altering the appearance of the plot. Don’t worry about that for now.

You’ll notice all the separate parts are connected by a plus sign. I like to insert a new line after each + for ease of reading. The indentation is automatic.

ggplot(data = the_dataframe) +
  geom_XXX(mapping = aes(x = var_1 , y = var_2 , <other_variables>) , <options_for_this_geom> ) + 
  other_functions ...

Just so you know, the dataset can be defined within the geometry function (you can put data = within geom_XXX) and the aesthetic mapping function can be defined within ggplot (you can put mapping = aes(...) within ggplot).
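
To illustrate, the sketch below builds the same scatterplot three ways, differing only in where data and mapping are supplied. The dataframe toy_data and its numbers are made up for illustration, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical toy data (made-up values)
toy_data <- data.frame(year = 2010:2013,
                       population = c(4.55, 4.58, 4.59, 4.60))

# 1. data in ggplot, mapping in the geometry function
plot_a <- ggplot(data = toy_data) +
  geom_point(mapping = aes(x = year, y = population))

# 2. data and mapping both in ggplot, so every geom inherits them
plot_b <- ggplot(data = toy_data, mapping = aes(x = year, y = population)) +
  geom_point()

# 3. data and mapping both in the geometry function
plot_c <- ggplot() +
  geom_point(data = toy_data, mapping = aes(x = year, y = population))

print(plot_b)
```

The second form tends to be the handiest once a plot has more than one geom, since the mapping only has to be written once.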

Here we’ll use the population dataset that comes loaded with tidyverse, but doesn’t appear in the Environment panel. You can view it by running View(population). We’ll make plots using subsets of data from this, making a dataset called dataset_for_plot each time.

country year population
Croatia 2006 4378601
Croatia 2007 4369337
Croatia 2008 4360126
Croatia 2009 4349930
Croatia 2010 4338027
Croatia 2011 4323822
Croatia 2012 4307422
Croatia 2013 4289714
Cuba 1995 10932013
Cuba 1996 10980758
Cuba 1997 11024249
Cuba 1998 11063917
Cuba 1999 11101580
Cuba 2000 11138416
Cuba 2001 11175465
Cuba 2002 11212125
Cuba 2003 11245926
Cuba 2004 11273363
Cuba 2005 11292078
Cuba 2006 11301100
Cuba 2007 11301674
Cuba 2008 11296355
Cuba 2009 11288826
Cuba 2010 11281768
Cuba 2011 11276053
Cuba 2012 11270957
Cuba 2013 11265629
Curaçao 2010 147560
Curaçao 2011 151523
Curaçao 2012 155293
Curaçao 2013 158760
Cyprus 1995 855389
Cyprus 1996 873246
Cyprus 1997 890733
Cyprus 1998 908040
Cyprus 1999 925491
Cyprus 2000 943287
Cyprus 2001 961481
Cyprus 2002 979877
Cyprus 2003 998142
Cyprus 2004 1015820
Cyprus 2005 1032586
Cyprus 2006 1048314
Cyprus 2007 1063095
Cyprus 2008 1077089
Cyprus 2009 1090553
Cyprus 2010 1103685
Cyprus 2011 1116513
Cyprus 2012 1128994
Cyprus 2013 1141166
Czech Republic 1995 10339223
Czech Republic 1996 10327823
Czech Republic 1997 10311228
Czech Republic 1998 10291153
Czech Republic 1999 10270096
Czech Republic 2000 10250398
Czech Republic 2001 10231435
Czech Republic 2002 10214070
Czech Republic 2003 10203846
Czech Republic 2004 10207862
Czech Republic 2005 10230683
Czech Republic 2006 10274906
Czech Republic 2007 10337782
Czech Republic 2008 10411836
Czech Republic 2009 10486434
Czech Republic 2010 10553701
Czech Republic 2011 10611037
Czech Republic 2012 10660051

Some simple plots

Ok, let’s make a dataset just containing country == "Ireland" and make a scatterplot with year on the x and population on the y.

dataset_for_plot <- population %>% 
  filter(country == "Ireland")

ggplot(data = dataset_for_plot) + 
  geom_point(mapping = aes(x = year, y = population))

Nice. We can easily make a line plot by swapping geom_point for geom_line in the above.

Let’s make a line plot with the countries France, Germany and Spain each having a different line colour, plotting their populations over time.

dataset_for_plot <- population %>% 
  filter(country %in% c("France", "Germany", "Spain"))
ggplot(data = dataset_for_plot) + 
  geom_line(mapping = aes(x = year, y = population, colour = country))

Great. Let’s make a column plot using geom_col. We’ll filter a couple of countries and take the data just from 2010.

dataset_for_plot <- population %>% 
  filter(country %in% c("France", "Germany", "Spain", "Italy", "Poland"), year == 2010)
ggplot(data = dataset_for_plot) + 
  geom_col(mapping = aes(x = country, y = population))

We can turn this into a bar plot easily by adding coord_flip():

ggplot(data = dataset_for_plot) + 
  geom_col(mapping = aes(x = country, y = population)) +
  coord_flip()

Let’s return to our line plot for France, Germany and Spain. If we wanted lines and points it would make more sense to put the aes mapping function within the ggplot rather than in both the geom_line and geom_point:

dataset_for_plot <- population %>% 
  filter(country %in% c("France", "Germany", "Spain"))
ggplot(data = dataset_for_plot , mapping = aes(x = year, y = population, colour = country)) + 
  geom_line() +
  geom_point()

Adjusting labels

Let’s do the same plot, but change some of the labels. We’ll change the y-axis label to “Persons”, leave both the x-axis title and legend title blank, and set the heading to “Population of selected countries over time”. All of that can be done through the labs function.

ggplot(data = dataset_for_plot , mapping = aes(x = year, y = population, colour = country)) + 
  geom_line() +
  geom_point() +
  labs(x = "",
       y = "Persons", 
       colour = "", 
       title = "Population of selected countries over time")

Adjusting axes

Let’s do the same plot (we’ll forget about the labels) but make two adjustments to the y-axis. We will use the function scale_y_continuous which has lots of tools within it for tailoring a continuous y-axis.

The function scale_y_continuous is one of a large family of functions with the format scale_DIM_FORMAT where DIM is the dimension (x, y, fill, colour, etc.) and FORMAT is the type of axis. See the second page of the ggplot2 cheatsheet for more information.

We’ll change the limits of the y-axis so that it starts at zero. This is done using the limits argument to scale_y_continuous, and limits takes a vector of length two, the first entry being the lower limit and the second entry being the upper limit. I’ll set the lower limit to zero, but I’ll leave the upper limit as NA, meaning that I’ll leave it up to the data to determine how high the y-axis should go.

I’ll change the labels on the y-axis so that they show the whole number with comma separators rather than the scientific format which has appeared. That is done with the labels argument to scale_y_continuous. I’m setting that argument as being equal to comma, which is actually a function from the package scales. This package provides a bunch of tools that make it easier to tailor axes and legends. The scales package is installed with tidyverse but needs to be loaded in using library(scales). If the y-axis needed to be formatted as a percentage, then you’d replace comma with percent (percent is another function from scales).


ggplot(data = dataset_for_plot , mapping = aes(x = year, y = population, colour = country)) + 
  geom_line() +
  geom_point() + 
  scale_y_continuous(limits = c(0, NA), labels = comma)
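
If you want to see what comma and percent do on their own, the sketch below calls them directly on single numbers and then uses comma on a plot of made-up data. The dataframe toy_data and its values are hypothetical, and it assumes ggplot2 and scales are installed.

```r
library(ggplot2)
library(scales)

# The formatting functions turn numbers into display strings
formatted_count <- comma(1234567)   # "1,234,567"
formatted_share <- percent(0.25)    # "25%"
print(formatted_count)
print(formatted_share)

# Hypothetical toy data (made-up values)
toy_data <- data.frame(year = 2010:2012,
                       population = c(4500000, 4600000, 4700000))

# limits = c(0, NA) starts the y-axis at zero and lets the data set the top;
# labels = comma formats each axis label with comma separators
toy_plot <- ggplot(toy_data, aes(x = year, y = population)) +
  geom_line() +
  scale_y_continuous(limits = c(0, NA), labels = comma)

print(toy_plot)
```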

There are tons of other ways to edit the appearance of plots through ggplot, but these would be beyond a crash course. I’ll wrap up with one tool for altering the appearance which is to add a theme. There are 8 themes including the default theme_grey(). There is theme_light(), theme_dark(), theme_bw(). My favourite is theme_minimal(). Use the help (?theme_grey) to find out more.


ggplot(data = dataset_for_plot , mapping = aes(x = year, y = population, colour = country)) + 
  geom_line() +
  geom_point() + 
  scale_y_continuous(limits = c(0, NA), labels = comma) + 