Summary
- Extolling the virtues of replicable, reproducible, and distributable research is commonplace.
- However, there is a shortage of current, descriptive, and detailed guides for enacting such workflows.
- In this series of vignettes, we provide detailed guides for several key components of replicable, reproducible, and distributable workflows.
This vignette is an excerpt from the DANTE Project’s beta release of Open, Reproducible, and Distributable Research with R Packages. To view the entire current release, please visit the bookdown site. If you would like to contribute to this bookdown project, please visit the project GitLab repository.
Creating and Documenting Functions
It’s common for researchers learning a programming language, especially those without a computer science background, to develop all of their data processing, modeling, and visualizations in a loose collection of scripts. In the best case, these are labeled by workflow step (get-data.R, merge-data.R, modeling.R, plots.R) and contain a smattering of comments providing context and notes to yourself. While this is how many of us carry out research workflows when first learning, it does not lend itself to open science. Moreover, it limits the scope of what that code may be used for, and can be a real nightmare to revisit months or years later.
A better approach is to functionalize as much of your code as possible. Turning raw scripts into proper functions increases the re-usability of your code, expands the scope of your code’s applications to other datasets and workflows, and allows for proper documentation that increases efficiency when you return to your research at a later date or integrate new members into your team.
Script to Function
In the context of R packages for reproducible research, I find it helpful to work out the overall workflow in a set of loose commented scripts inside the myresearch/raw-scripts/ directory. Once I’m fairly certain my plan for processing and modeling works, I begin to functionalize as much of the code as possible and place each completed function in the myresearch/R/ directory. An example of a raw script for data processing is in duplicator’s raw-scripts folder. Here is an excerpt from this script:
#----
# Pull in the administrative boundaries
gadm.study.countries <- list()
for (i in 1:length(study.countries$iso3)) {
  iso <- as.character(study.countries[i, "iso3"])
  country <- sf::st_cast(sf::st_as_sf(raster::getData('GADM',
                                                      country = iso,
                                                      level = 0,
                                                      path = "raw-data/")),
                         'MULTIPOLYGON')
  gadm.study.countries[[i]] <- country
  names(gadm.study.countries)[[i]] <- paste0(country$GID_0)
}
gadm.study.countries <- sf::st_as_sf(data.table::rbindlist(gadm.study.countries))
This code creates a shapefile of administrative boundaries used later to carry out zonal statistics on raster layers of interest. It creates an empty list and populates it with individual country shapefiles downloaded with the raster::getData() function. The countries are identified by a vector of ISO3 country codes (study.countries$iso3). They are downloaded to the specified directory and renamed with their respective country codes. Finally, the list of individual country boundaries is concatenated with data.table::rbindlist() and converted with sf::st_as_sf() into a single shapefile for all the countries of interest.
This is a fine piece of research-grade code that will carry out the desired task. However, creating a single shapefile from a list of country codes is a task that is often repeated. Instead of copying this script dozens of times over the next 5-10 years, it could be converted into a proper function and called from this package whenever needed. This centralizes the code in a single location, which reduces the potential for typos or other errors, and provides documentation for additional users.
There are numerous detailed resources on developing functions in R; at their core, functions consist of 1) a name, 2) arguments, and 3) the body. Here is the functionalized version of the above code.
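Before converting the script, it can help to see those three parts in a toy function. The name add_margin and its arguments here are hypothetical, invented purely for illustration, and are not part of the myresearch package:

```r
# Name: add_margin; arguments: x and margin (with a default); body: inside { }
add_margin <- function(x, margin = 0.1) {
  # The value of the last evaluated expression is returned
  x + x * margin
}

add_margin(100)                # 110, using the default 10% margin
add_margin(100, margin = 0.25) # 125, overriding the default
```

Defaults, like margin = 0.1 above, let callers omit arguments they usually leave unchanged; the combineGADM function below uses the same pattern for level and gadm.dir.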
combineGADM <- function(iso3, level = 0, gadm.dir = "gadm-data") {
  dir.create(file.path(gadm.dir), showWarnings = FALSE)
  gadm.countries <- list()
  cat("Acquiring: ")
  for (i in 1:length(iso3)) {
    iso <- as.character(iso3[i])
    cat(paste(iso), " ")
    country <- sf::st_cast(sf::st_as_sf(
      raster::getData(
        'GADM',
        country = iso,
        level = level,
        path = paste0(getwd(), "/", gadm.dir)
      )
    ),
    'MULTIPOLYGON')
    gadm.countries[[i]] <- country
    names(gadm.countries)[[i]] <- paste0(country$GID_0)
  }
  cat("... Concatenating country boundaries. ")
  gadm.countries <- do.call(rbind, gadm.countries)
  cat("Complete")
  return(gadm.countries)
}
The name is combineGADM; the arguments, iso3, level, and gadm.dir, are listed inside of function( ); and the body is all the code comprising the function between the { }. To use this function, you provide it a character vector of the countries you want to download, the administrative level of interest, and the name of the directory where the files will be downloaded. To create a shapefile object named fra.usa.shp of admin 1 units (states/provinces) of France and the United States in a directory called "my-data", the user would enter:
fra.usa.shp <- combineGADM(c("FRA", "USA"), level = 1, gadm.dir = "my-data")
The values entered for the arguments will replace the argument names found in the function code body when the function is executed. In essence, the code body becomes:
dir.create(file.path("my-data"), showWarnings = FALSE)
gadm.countries <- list()
cat("Acquiring: ")
for (i in 1:length(c("FRA", "USA"))) {
  iso <- as.character(c("FRA", "USA")[i])
  cat(paste(iso), " ")
  country <- sf::st_cast(sf::st_as_sf(
    raster::getData(
      'GADM',
      country = iso,
      level = 1,
      path = paste0(getwd(), "/", "my-data")
    )
  ),
  'MULTIPOLYGON')
  gadm.countries[[i]] <- country
  names(gadm.countries)[[i]] <- paste0(country$GID_0)
}
cat("... Concatenating country boundaries. ")
gadm.countries <- do.call(rbind, gadm.countries)
cat("Complete")
return(gadm.countries)
Documentation With roxygen2
After ensuring your function behaves as intended, it’s time to create the documentation that will form the basis of the help file and reference manual. R generates function documentation from a “header” written in roxygen2 syntax, and RStudio will automatically create the framework of the roxygen2 header. With your function completed, place the cursor on the first line of the function and, from the RStudio menu, select Code > Insert Roxygen Skeleton or press Ctrl + Alt + Shift + R.
#' Title
#'
#' @param iso3
#' @param level
#' @param gadm.dir
#'
#' @return
#' @export
#'
#' @examples
combineGADM <- function(iso3, level = 0, gadm.dir = "gadm-data") {
  dir.create(file.path(gadm.dir), showWarnings = FALSE)
  gadm.countries <- list()
  cat("Acquiring: ")
  for (i in 1:length(iso3)) {
    iso <- as.character(iso3[i])
    cat(paste(iso), " ")
    country <- sf::st_cast(sf::st_as_sf(
      raster::getData(
        'GADM',
        country = iso,
        level = level,
        path = paste0(getwd(), "/", gadm.dir)
      )
    ),
    'MULTIPOLYGON')
    gadm.countries[[i]] <- country
    names(gadm.countries)[[i]] <- paste0(country$GID_0)
  }
  cat("... Concatenating country boundaries. ")
  gadm.countries <- do.call(rbind, gadm.countries)
  cat("Complete")
  return(gadm.countries)
}
The roxygen block starts with a title; it is customary to skip a line and then provide a more detailed description. There are dozens of roxygen @tags you can read about in the roxygen2 documentation. Continue to add as many as are appropriate for the function you created. The most common are:
- @param: each argument is given a parameter tag, which should be followed by a description of how the user inputs values for that parameter.
- @return: should be followed by a brief description of what type of object the function returns.
- @export: should be left as is. It ensures the function is available to be called by the user with myresearch::combineGADM().
- @examples: is where you list examples of using the code. For personal projects I often lazily remove this line and provide no examples.
An example of this function with the completed roxygen documentation:
#' Create a Singular Object of GADM Boundaries
#'
#' This is a convenience wrapper for \link[raster]{getData} that creates a
#' singular object of class \code{'sf'} given a character vector of ISO3 country
#' codes.
#'
#' @param iso3 A character vector of ISO3 country codes passed to
#' \link[raster]{getData} that will retrieve their respective vector GADM
#' boundaries.
#' @param level Resolution of GADM administrative boundary. Can be country level
#'   (0), state level (1), or county level (2).
#' @param gadm.dir Character string of new directory to create in existing
#' working directory to store downloaded boundaries.
#' @return Returns an object of class \code{sf}.
#' @export
combineGADM <- function(iso3, level = 0, gadm.dir = "gadm-data") {
  dir.create(file.path(gadm.dir), showWarnings = FALSE)
  gadm.countries <- list()
  cat("Acquiring: ")
  for (i in 1:length(iso3)) {
    iso <- as.character(iso3[i])
    cat(paste(iso), " ")
    country <- sf::st_cast(sf::st_as_sf(
      raster::getData(
        'GADM',
        country = iso,
        level = level,
        path = paste0(getwd(), "/", gadm.dir)
      )
    ),
    'MULTIPOLYGON')
    gadm.countries[[i]] <- country
    names(gadm.countries)[[i]] <- paste0(country$GID_0)
  }
  cat("... Concatenating country boundaries. ")
  gadm.countries <- do.call(rbind, gadm.countries)
  cat("Complete")
  return(gadm.countries)
}
Before rebuilding the package, we must add all packages used by the new function to the Imports: list in the DESCRIPTION file. This ensures that users will be able to execute the function when they install the myresearch package. This function uses the raster and sf packages. Here is the updated DESCRIPTION file.
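The exact contents will depend on your package, but the relevant portion of an updated DESCRIPTION file would look something like the sketch below; the Title and Version fields are illustrative placeholders, while the Imports: field is the part this step actually changes:

```
Package: myresearch
Title: A Reproducible Research Compendium
Version: 0.1.0
Imports:
    raster,
    sf
```

Note that package names under Imports: are separated by commas, and listing a package here does not attach it; that is why the function body calls raster::getData() and sf::st_cast() with the :: prefix.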
Now that the function is complete and the DESCRIPTION file is updated, save the function as its own script named combineGADM.R and place it in the myresearch/R/ directory. The function is finished, but not yet available for use; you must rebuild the package before it can be called. Click the Build tab and then Install and Restart. Once the rebuild completes, the function will be available with a call to myresearch::combineGADM().
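If you prefer working from the console to the RStudio Build pane, the same document-and-rebuild cycle can be run with the devtools package. This is a sketch assuming devtools is installed and the working directory is the package root; it is an alternative to, not part of, the workflow described above:

```r
# Regenerate the .Rd help files in man/ from the roxygen2 headers
devtools::document()

# Rebuild and install the package so the new function becomes callable
devtools::install()

# The function and its help page are now available
# myresearch::combineGADM(c("FRA", "USA"), level = 1, gadm.dir = "my-data")
# ?combineGADM
```

During development, devtools::load_all() is a faster way to test changes without a full install; a full Install and Restart (or devtools::install()) is still needed before the package is usable in a fresh session.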