Running and using shell in R

The shell function allows you to run commands under a shell, as if you were using the Windows console. One thing I found very useful is that you can turn the output into an object in R for analysis or manipulation. I used this approach when I had a folder with 1,000 files and I wanted to quickly get the names and sizes of each one. The standard route was to use the list.files function and pass that vector of names to the function file.size:

library(tidyverse)

name <- list.files("folder_with_lots_of_files")
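# list.files() returns bare file names, so prepend the folder before asking for sizes
size <- file.size(file.path("folder_with_lots_of_files", name))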

file_data <- bind_cols(name, size )

head(file_data)

## # A tibble: 6 × 2
##   ...1            ...2
##   <chr>          <dbl>
## 1 120000226.pdf 604283
## 2 120000231.pdf 533831
## 3 120000235.pdf 502477
## 4 120000241.pdf 875071
## 5 120001240.pdf 305258
## 6 120001630.pdf  46381

However, I found the file.size step of this process to be pretty slow: for my 1,000 PDFs it took 36 seconds.
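If you want to check how long this step takes on your own machine, one option is to wrap it in system.time(). The following is only a minimal sketch: the folder under tempdir() and the file names are hypothetical stand-ins rather than my real data, and the timing you see will depend on your disk and on where the files are stored.

```r
# Create a temporary folder with a few small files (hypothetical stand-in data)
demo_dir <- file.path(tempdir(), "timing_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:20) {
  writeLines("some text", file.path(demo_dir, paste0("file_", i, ".txt")))
}

# full.names = TRUE returns paths that file.size() can resolve
# from any working directory
paths <- list.files(demo_dir, full.names = TRUE)

# system.time() reports user, system and elapsed time in seconds
timing <- system.time(sizes <- file.size(paths))
print(timing[["elapsed"]])
```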

Instead, we can pass the command dir folder_with_lots_of_files to the shell using the shell function. This command produces a list of the files in the folder, together with their sizes and the date and time when they were last modified. The argument intern = TRUE captures the output as an R object (a character vector with one element per line of output). I go a step further here and convert it into a data frame using data.frame().

Note that to explore subfolders in the Windows console you would usually use a backslash \ but in R this is an escape character. So to include a backslash in your folder path you can either write a double backslash \\ or use a forward slash together with the additional argument translate = TRUE, which changes forward slashes to backslashes in the command passed to shell.
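As a rough illustration of the two options, the sketch below builds both forms of the command. The subfolder name sub is hypothetical and used only for illustration. Because shell() exists only in R for Windows, the calls themselves are guarded so that the snippet does not fail elsewhere.

```r
# "sub" is a hypothetical subfolder name used only for illustration
cmd_backslash <- "dir folder_with_lots_of_files\\sub"  # \\ in R source is one literal backslash
cmd_forward   <- "dir folder_with_lots_of_files/sub"

# Printing with cat() shows that the string holds a single backslash
cat(cmd_backslash, "\n")

# shell() is only available in R for Windows, so guard the calls
if (.Platform$OS.type == "windows") {
  out_backslash <- shell(cmd_backslash, intern = TRUE)
  out_forward   <- shell(cmd_forward, intern = TRUE, translate = TRUE)
}
```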

file_info <- shell("dir folder_with_lots_of_files" , intern = TRUE) %>% 
  data.frame()

# Here I'll print the first 10 lines and the last 5 lines. 
# I've changed the first 4 lines to read HEADER MATERIAL
bind_rows(head(file_info, n=10) , tail(file_info, n=5))

##                                                      .
## 1                                      HEADER MATERIAL
## 2                                      HEADER MATERIAL
## 3                                      HEADER MATERIAL
## 4                                      HEADER MATERIAL
## 5                                                     
## 6                23/02/2022  11:09    <DIR>          .
## 7               23/02/2022  11:09    <DIR>          ..
## 8    04/01/2022  12:15           604,283 120000226.pdf
## 9    04/01/2022  12:15           533,831 120000231.pdf
## 10   04/01/2022  12:15           502,477 120000235.pdf
## 11   04/01/2022  13:37         1,391,634 120849342.pdf
## 12   04/01/2022  13:37         1,318,437 120852400.pdf
## 13   04/01/2022  13:37           587,916 120853353.pdf
## 14                    211 File(s)    234,314,197 bytes
## 15                2 Dir(s)  178,821,488,640 bytes free

From here, the goal is to process each of these strings so that you can parse the date, time, size and filename. Provided that you don’t have spaces in your filenames, the following should work.

file_info <- file_info %>% 
  rename(string = 1) %>% 
  mutate(string = string %>% str_replace_all( "\\s+" , " ") %>% trimws()) %>% 
  mutate(date = word(string , 1) , 
         time = word(string , 2) , 
         size = word(string , 3) ,
         filename = word(string , 4)) %>% 
  select(-string)

bind_rows(head(file_info, n=10) , tail(file_info, n=5))

##          date     time            size      filename
## 1      HEADER MATERIAL            <NA>          <NA>
## 2      HEADER MATERIAL            <NA>          <NA>
## 3      HEADER MATERIAL            <NA>          <NA>
## 4      HEADER MATERIAL            <NA>          <NA>
## 5                 <NA>            <NA>          <NA>
## 6  23/02/2022    11:09           <DIR>             .
## 7  23/02/2022    11:09           <DIR>            ..
## 8  04/01/2022    12:15         604,283 120000226.pdf
## 9  04/01/2022    12:15         533,831 120000231.pdf
## 10 04/01/2022    12:15         502,477 120000235.pdf
## 11 04/01/2022    13:37       1,391,634 120849342.pdf
## 12 04/01/2022    13:37       1,318,437 120852400.pdf
## 13 04/01/2022    13:37         587,916 120853353.pdf
## 14        211  File(s)     234,314,197         bytes
## 15          2   Dir(s) 178,821,488,640         bytes
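If some of your filenames do contain spaces, word(string, 4) would only pick up the first part of the name. One possible workaround, sketched below under the assumption that the date, time and size fields never contain spaces themselves, is to capture the first three fields with a regular expression and treat everything after them as the filename. The example lines and the name file_info_spaces are hypothetical.

```r
library(stringr)

# Hypothetical lines in the same layout as the dir output above
lines <- c(
  "04/01/2022  12:15           604,283 120000226.pdf",
  "04/01/2022  12:15           533,831 annual report 2021.pdf"
)

# Three whitespace-free fields, then everything that remains as the filename
parts <- str_match(lines, "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.+)$")

# Column 1 of the match matrix is the full match; columns 2 to 5 are the groups
file_info_spaces <- data.frame(
  date     = parts[, 2],
  time     = parts[, 3],
  size     = parts[, 4],
  filename = parts[, 5]
)

file_info_spaces
```

Lines that do not fit the pattern, such as blank lines, come back as NA in every column, so they can be filtered out in the same way as the header and footer rows.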

There are a number of ways that you could top and tail this dataset, getting rid of the header and footer rows and leaving only one row per file. Since my folder contains only PDFs, it could simply be done by keeping rows where the filename contains the pattern “.pdf”. Another way would be to drop the first seven lines and the last two lines, provided you were sure that the command always produced this number of header and footer lines (e.g. %>% filter(row_number() > 7, row_number() <= nrow(.) - 2)). Here I’ll use a more agnostic approach, and keep any row where the first character of size is a digit (which gets rid of the header) and where filename is not equal to ‘bytes’ (which gets rid of the footer).

Once the header and footer rows are gone, you could use the functions in the lubridate package to parse the date and time columns.

file_info <- file_info %>% 
  filter(str_sub(size,1,1) %in% 0:9 , filename != "bytes")

bind_rows(head(file_info, n=10) , tail(file_info, n=5))

##          date  time      size      filename
## 1  04/01/2022 12:15   604,283 120000226.pdf
## 2  04/01/2022 12:15   533,831 120000231.pdf
## 3  04/01/2022 12:15   502,477 120000235.pdf
## 4  04/01/2022 12:15   875,071 120000241.pdf
## 5  04/01/2022 12:15   305,258 120001240.pdf
## 6  04/01/2022 12:15    46,381 120001630.pdf
## 7  04/01/2022 12:15   116,524 120002198.pdf
## 8  04/01/2022 12:15   124,794 120002387.pdf
## 9  04/01/2022 12:15   797,116 120002788.pdf
## 10 04/01/2022 12:15   259,259 120003133.pdf
## 11 04/01/2022 13:37   148,138 120833843.pdf
## 12 04/01/2022 13:37   709,884 120837616.pdf
## 13 04/01/2022 13:37 1,391,634 120849342.pdf
## 14 04/01/2022 13:37 1,318,437 120852400.pdf
## 15 04/01/2022 13:37   587,916 120853353.pdf
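As a sketch of that last step, the snippet below parses the date and time with lubridate and also converts size to a number. It assumes the day/month/year order shown in the output above (other Windows locale settings may print dates differently) and that sizes use a comma as the thousands separator. The names file_info_demo and modified are made up for this example.

```r
library(dplyr)
library(lubridate)

# Two rows copied from the output above
file_info_demo <- data.frame(
  date     = c("04/01/2022", "04/01/2022"),
  time     = c("12:15", "13:37"),
  size     = c("604,283", "1,391,634"),
  filename = c("120000226.pdf", "120849342.pdf")
)

file_info_demo <- file_info_demo %>%
  mutate(
    # Assumes day/month/year order; mdy_hm() would suit month-first output
    modified = dmy_hm(paste(date, time)),
    # Strip the thousands separators before converting to a number
    size = as.numeric(gsub(",", "", size))
  ) %>%
  select(filename, size, modified)

file_info_demo
```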

Brendan O’Dowd

2022-05-23