I had a situation where I needed to calculate the sum of a value for
the seven previous rows (including the current row). I thought about
adding value
to lag(value)
and
lag(value, 2)
and lag(value, 3)
etc. all the
way to lag(value ,6)
, but I realised that this would be
messy. I also didn’t want to write a for loop.
Then I found out about the slider
package, which includes the function slide
and related
functions with specific output types. These functions allow you to
specify a ‘window’ of rows before and/or after the current row, and to
perform some operation on values within those rows. The window position
is defined relative to each specific row so that the window ‘slides’
down the dataframe. Here I’m only interested in applying slider
functions to columns in dataframes, though they can be used on other
structures. The primary slide
function returns a list which
I probably wouldn’t commonly use, but I have made use of
slide_dbl
which returns a (double) numeric value. There
might also be cases where I would use slide_chr
which
returns string variables.
The first argument to your slide
function (we’ll use
slide_dbl
here) is the column that you’re operating on, and
the second argument is the function you want to apply. Then you define
the window around each row using the arguments .before
and
.after
. There are other arguments and options which were
beyond my needs.
Let’s take a simple example using the starwars dataset that comes
with the dplyr package. Let’s say we want to calculate a running sum of
the three most recent values of height
(inclusive of
current row). The first argument is height
, and the second
argument will be the function to apply which is sum
. Then
to define the window of rows to include I use .before=2
to
set the start of the window to two rows before the current row. The
default value for .after
is zero so I don’t need to adjust
that. I’m also setting .complete
to TRUE
. This
means that the window has to be complete to return a value, and I will
get NA for the first two rows. If I had not made this adjustment (the
default for .complete
is FALSE
), the value for
the first and second rows would have included only one and two input
values respectively.
library(tidyverse)
library(slider)
starwars_heights <- starwars %>%
mutate(height_sum_3 = slide_dbl(height, sum, .before=2, .complete=T)) %>%
select(name , height, height_sum_3)
head(starwars_heights)
## # A tibble: 6 × 3
## name height height_sum_3
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 NA
## 2 C-3PO 167 NA
## 3 R2-D2 96 435
## 4 Darth Vader 202 465
## 5 Leia Organa 150 448
## 6 Owen Lars 178 530
There you go. I think it’s pretty neat. You can include a
group_by
so that the window begins again when you get to
the next group. If I had added group_by(sex)
before the
mutate
line above, then the window would include only the
two most recent values for the same value of sex
(it might
look clearer if it was also sorted by sex
).
It’s hard for me to think of an application for
slide_chr
, but let’s say you wanted to concatenate the two
previous values of a string. Let’s do that with name
from
the starwars dataset. Note I’m setting .after
to minus one;
positive values of .after
indicate subsequent
rows, whereas I want to stop the window on the previous
row.
starwars_names <- starwars %>%
mutate(recent_names = slide_chr(name,
~paste(., collapse=" "),
.before = 2,
.after = -1,
.complete = T)) %>%
select(name, recent_names)
head(starwars_names)
## # A tibble: 6 × 2
## name recent_names
## <chr> <chr>
## 1 Luke Skywalker <NA>
## 2 C-3PO <NA>
## 3 R2-D2 Luke Skywalker C-3PO
## 4 Darth Vader C-3PO R2-D2
## 5 Leia Organa R2-D2 Darth Vader
## 6 Owen Lars Darth Vader Leia Organa