The shell() function allows you to run commands under a shell, as if
you were using the Windows console. One thing I found very useful is
that you can turn the output into an object in R for analysis or
manipulation. I used this approach when I had a folder with 1,000 files
and I wanted to quickly get the name and size of each one. The standard
route is to use the list.files() function and pass that vector of names
to the function file.size():
library(tidyverse)
name <- list.files("folder_with_lots_of_files")
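# list.files() returns bare names, so rebuild the path before asking for the size
size <- file.size(file.path("folder_with_lots_of_files", name))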
file_data <- bind_cols(name, size )
head(file_data)
## # A tibble: 6 × 2
## ...1 ...2
## <chr> <dbl>
## 1 120000226.pdf 604283
## 2 120000231.pdf 533831
## 3 120000235.pdf 502477
## 4 120000241.pdf 875071
## 5 120001240.pdf 305258
## 6 120001630.pdf 46381
However, I found the file.size() part of the above process to be pretty
slow: for my 1,000 PDFs it took 36 seconds.
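If you want to check how long file.size() takes on your own machine, a minimal sketch along the following lines should do. The temporary folder and the 20 small text files are made up purely so that the example runs anywhere, so the elapsed time it reports will not resemble the 36 seconds quoted above.

```r
# Create a few small temporary files so the sketch runs anywhere
# (tmp_dir and the file names are hypothetical, not the PDFs from this post)
tmp_dir <- tempfile("size_demo_")
dir.create(tmp_dir)
for (i in 1:20) {
  writeLines("example", file.path(tmp_dir, paste0("file_", i, ".txt")))
}

paths <- list.files(tmp_dir, full.names = TRUE)

# system.time() reports user, system and elapsed time in seconds
timing <- system.time(sizes <- file.size(paths))
timing["elapsed"]
```

Swapping tmp_dir for your own folder gives a timing for the files you actually care about.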
Instead, we can send the command dir folder_with_lots_of_files to the
shell using the shell() function. This command produces a list of the
files in the folder, together with their size and the date and time
when they were last modified. The argument intern = TRUE captures the
output as an R object (a character vector with one element per line). I
go a step further here and convert it into a dataframe using
data.frame().
Note that to explore subfolders in the Windows console you would
usually use a backslash (\), but in an R string this is the escape
character. So to put a backslash in your folder location you can either
use a double backslash (\\) or use a forward slash together with the
additional argument translate = TRUE, which changes forward slashes to
backslashes in the command passed to shell().
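As a rough illustration of the two options for subfolders, the sketch below builds both versions of the command. The name "subfolder" is a placeholder, and because shell() only exists in R for Windows the actual calls are guarded so the rest of the code still runs elsewhere.

```r
# Two ways to point dir at a subfolder ("subfolder" is a placeholder name)
cmd_backslash <- "dir folder_with_lots_of_files\\subfolder"  # escaped backslash
cmd_forward   <- "dir folder_with_lots_of_files/subfolder"   # forward slash

# shell() is only available in R for Windows, so guard the calls
if (.Platform$OS.type == "windows") {
  out_backslash <- shell(cmd_backslash, intern = TRUE)
  out_forward   <- shell(cmd_forward, intern = TRUE, translate = TRUE)
}

# What translate = TRUE does to the command string, reproduced by hand
chartr("/", "\\", cmd_forward) == cmd_backslash
```

The call actually used in this post, for the top-level folder, follows.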
file_info <- shell("dir folder_with_lots_of_files" , intern = TRUE) %>%
data.frame()
# Here I'll print the first 10 lines and the last 5 lines.
# I've changed the first 4 lines to read HEADER MATERIAL
bind_rows(head(file_info, n=10) , tail(file_info, n=5))
## .
## 1 HEADER MATERIAL
## 2 HEADER MATERIAL
## 3 HEADER MATERIAL
## 4 HEADER MATERIAL
## 5
## 6 23/02/2022 11:09 <DIR> .
## 7 23/02/2022 11:09 <DIR> ..
## 8 04/01/2022 12:15 604,283 120000226.pdf
## 9 04/01/2022 12:15 533,831 120000231.pdf
## 10 04/01/2022 12:15 502,477 120000235.pdf
## 11 04/01/2022 13:37 1,391,634 120849342.pdf
## 12 04/01/2022 13:37 1,318,437 120852400.pdf
## 13 04/01/2022 13:37 587,916 120853353.pdf
## 14 211 File(s) 234,314,197 bytes
## 15 2 Dir(s) 178,821,488,640 bytes free
From here, the goal is to process each of these strings so that you can parse the date, time, size and filename. Provided that you don’t have spaces in your filenames, the following should work.
file_info <- file_info %>%
rename(string = 1) %>%
mutate(string = string %>% str_replace_all( "\\s+" , " ") %>% trimws()) %>%
mutate(date = word(string , 1) ,
time = word(string , 2) ,
size = word(string , 3) ,
filename = word(string , 4)) %>%
select(-string)
bind_rows(head(file_info, n=10) , tail(file_info, n=5))
## date time size filename
## 1 HEADER MATERIAL <NA> <NA>
## 2 HEADER MATERIAL <NA> <NA>
## 3 HEADER MATERIAL <NA> <NA>
## 4 HEADER MATERIAL <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 23/02/2022 11:09 <DIR> .
## 7 23/02/2022 11:09 <DIR> ..
## 8 04/01/2022 12:15 604,283 120000226.pdf
## 9 04/01/2022 12:15 533,831 120000231.pdf
## 10 04/01/2022 12:15 502,477 120000235.pdf
## 11 04/01/2022 13:37 1,391,634 120849342.pdf
## 12 04/01/2022 13:37 1,318,437 120852400.pdf
## 13 04/01/2022 13:37 587,916 120853353.pdf
## 14 211 File(s) 234,314,197 bytes
## 15 2 Dir(s) 178,821,488,640 bytes
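If some of your filenames do contain spaces, splitting on words will cut them short. One possible alternative, sketched here on a few made-up lines in the same layout as the output above, is to match the whole line with a regular expression and let the last group take everything after the size. The filename "my report.pdf" is hypothetical, and the pattern assumes the two-digit day/month/year and 24-hour time format seen in this post; other Windows locale settings may print dates and times differently.

```r
library(stringr)

# Sample lines in the layout shown above; "my report.pdf" is a hypothetical name
dir_lines <- c(
  "04/01/2022  12:15           604,283 120000226.pdf",
  "23/02/2022  11:09    <DIR>          ..",
  "04/01/2022  13:37         1,391,634 my report.pdf",
  "             211 File(s)    234,314,197 bytes"
)

# Date, time, a comma-grouped size, then the rest of the line as the filename
pattern <- "^(\\d{2}/\\d{2}/\\d{4})\\s+(\\d{2}:\\d{2})\\s+([\\d,]+)\\s+(.+)$"
parts <- str_match(dir_lines, pattern)

parsed <- data.frame(
  date     = parts[, 2],
  time     = parts[, 3],
  size     = parts[, 4],
  filename = parts[, 5]
)

# Lines that do not match (directories, header, footer) come back as NA
parsed <- parsed[!is.na(parsed$filename), ]
parsed
```

A side effect of this approach is that header, footer and directory rows fail to match and drop out in the same step. The rest of this post carries on with the word()-based version.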
There are a number of ways that you could top and tail this dataset,
getting rid of the header and footer rows and leaving you only with rows
for each filename. Since my folder contains only PDFs, it could simply
be done by filtering rows where the filename contains the pattern
“.pdf”. Another way would be to drop the first seven lines and the last
two lines, provided you were sure that the command always produced this
number of header and footer lines
(e.g. %>% filter(row_number() > 7 , row_number() <= nrow(.) - 2)).
Here I’ll use a more agnostic approach, and keep any row where the first
character of size is a digit (which gets rid of the header and the
directory rows) and where filename is not equal to ‘bytes’ (which gets
rid of the footer).
From there, you could use the functions in the lubridate package to parse the date and time columns.
file_info <- file_info %>%
filter(str_sub(size,1,1) %in% 0:9 , filename != "bytes")
bind_rows(head(file_info, n=10) , tail(file_info, n=5))
## date time size filename
## 1 04/01/2022 12:15 604,283 120000226.pdf
## 2 04/01/2022 12:15 533,831 120000231.pdf
## 3 04/01/2022 12:15 502,477 120000235.pdf
## 4 04/01/2022 12:15 875,071 120000241.pdf
## 5 04/01/2022 12:15 305,258 120001240.pdf
## 6 04/01/2022 12:15 46,381 120001630.pdf
## 7 04/01/2022 12:15 116,524 120002198.pdf
## 8 04/01/2022 12:15 124,794 120002387.pdf
## 9 04/01/2022 12:15 797,116 120002788.pdf
## 10 04/01/2022 12:15 259,259 120003133.pdf
## 11 04/01/2022 13:37 148,138 120833843.pdf
## 12 04/01/2022 13:37 709,884 120837616.pdf
## 13 04/01/2022 13:37 1,391,634 120849342.pdf
## 14 04/01/2022 13:37 1,318,437 120852400.pdf
## 15 04/01/2022 13:37 587,916 120853353.pdf
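The lubridate step mentioned above might look something like the sketch below. It rebuilds two of the rows by hand so that it runs on its own, and it assumes the dates are day/month/year, which the 23/02/2022 entry in the raw output suggests. The names file_info_example and file_info_typed, the new modified column, and the use of readr's parse_number() to strip the commas from size are my own additions for illustration.

```r
library(dplyr)
library(lubridate)
library(readr)

# Two rows copied from the output above, rebuilt by hand for a standalone example
file_info_example <- tibble(
  date     = c("04/01/2022", "04/01/2022"),
  time     = c("12:15", "13:37"),
  size     = c("604,283", "1,391,634"),
  filename = c("120000226.pdf", "120849342.pdf")
)

file_info_typed <- file_info_example %>%
  mutate(
    # Assumes day/month/year order; dmy_hm() returns a date-time in UTC
    modified = dmy_hm(paste(date, time)),
    # parse_number() drops the thousands separators and returns a number
    size = parse_number(size)
  ) %>%
  select(filename, size, modified)

file_info_typed
```

With size stored as a number and modified as a date-time, the usual sorting, filtering and summarising functions work on both columns.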