Joshua Brinks, ISciences, LLC

Summary

  • Spouting the virtues of replicable, reproducible, and distributable research is commonplace.
  • However, there is a shortage of current, descriptive, and detailed guides for enacting such workflows.
  • In this series of vignettes, we provide detailed guides for several key components of replicable, reproducible, and distributable workflows.

This vignette is an excerpt from the DANTE Project’s beta release of Open, Reproducible, and Distributable Research with R Packages. To view the entire current release, please visit the bookdown site. If you would like to contribute to this bookdown project, please visit the project GitLab repository.

Default Package Files

RStudio leaves you with a handful of files and directories after creating a new package. We’ll review the new files and create some additional commonly used directories.

  • .Rbuildignore The build ignore file is where you list files that you do not want bundled with your package but that live inside the package root directory because they are used for package development. These may include images, notes, files used for pre-processing larger embedded datasets, or any other file that is non-essential to the final package. By default the RStudio project files are listed in .Rbuildignore. Build ignore uses regular expression syntax, but if you’re not comfortable with regular expressions you can use usethis::use_build_ignore(), which escapes the file names for you.
  • DESCRIPTION The package description file contains basic information about your package. By default it’s fairly simple. The only mandatory fields are Package, Version, License, Description, Title, Author, and Maintainer; however, it would be very rare for a package not to have Imports and Suggests fields.
  • The man/ directory contains the manual and reference pages automatically generated by roxygen2. You do not have to edit this directory. It will populate every time you Install and Restart your package, as long as you followed the steps above to Configure Build Tools....
  • The NAMESPACE file is also automatically generated by roxygen2 when you Install and Restart. It’s not something to cloud your mind with as a beginner; more information is available here.
  • The R/ directory contains all of your functions. By default it will contain the hello.R file for the hello() function. This directory should only have function files, and a data.R file that we will discuss later. Current best practices are for each function to be in a single file named after the function, but you may also place multiple functions in a single file.
  • The final default file is the RStudio project file (myresearch.Rproj). You can execute this file from anywhere to open up an RStudio session for your package project.
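
As a rough illustration of how the regular expressions in .Rbuildignore behave, the sketch below checks a handful of hypothetical top-level paths against .Rbuildignore-style patterns. The paths and the last two patterns are made up for illustration, and the matching (Perl-style and case-insensitive, via grepl()) only approximates what R CMD build does.

```r
# Patterns as they might appear in .Rbuildignore (one per line in the file).
# The first two are the RStudio defaults; the last two are hypothetical.
patterns <- c("^.*\\.Rproj$", "^\\.Rproj\\.user$", "^raw-data$", "^notes\\.md$")

# Hypothetical top-level paths in the package root.
paths <- c("myresearch.Rproj", ".Rproj.user", "raw-data", "notes.md",
           "DESCRIPTION", "R")

# A path is ignored if any pattern matches it.
is_ignored <- function(path) {
  any(vapply(patterns, grepl, logical(1), x = path,
             perl = TRUE, ignore.case = TRUE))
}
ignored <- vapply(paths, is_ignored, logical(1))

# Paths that would still be bundled with the package.
paths[!ignored]
```

Calling usethis::use_build_ignore("raw-data") from the package project should write the escaped pattern ^raw-data$ for you, so you rarely need to type these by hand.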

The DESCRIPTION

The DESCRIPTION merits additional discussion as one of the primary package files you edit directly. We can address important fields in more detail:

Default Fields

  • Title: is a slightly more explanatory title for your project beyond the package name.
  • Version: is not terribly important in this context. I usually leave it at the default. You can read more about R package versioning here.
  • Author: is self-explanatory and may be written in plain text; however, it’s strongly suggested that you replace it with the Authors@R: field. This sets each author’s email and roles in a more programmatic way (author "aut", creator "cre", contributor "ctb", copyright holder "cph"). Continuation lines in the DESCRIPTION must be indented.
Authors@R: person("Joshua", "Brinks", email = "jbrinks@isciences.com",
    role = c("aut", "cre"))
  • Maintainer: is the package maintainer. Typically the same as the author. Written in plain text followed by the email address: Joshua Brinks <jbrinks@isciences.com>.
  • Description: is a comprehensive description of your package functions. I usually include a few sentences for context and functionality.
  • License: determines how, and by whom, your package may legally be used. Because this is an article on open science, we strongly recommend using a Free or Open Source Software (FOSS) license when possible; however, there are several contexts where this simply doesn’t work. Comparisons of software licenses are widely discussed online, and I suggest reading up on them before choosing. When possible I implement a GPL-3 open source license with usethis.
usethis::use_gpl3_license()
  • Encoding: determines your package encoding. Usually a good idea to leave this UTF-8.
  • LazyData: determines how the data you embed in your package is loaded when your package is loaded. It’s best to leave this set to true. This ensures that data embedded in your package is only loaded into memory when you call on the dataset. Otherwise any large datasets will use up memory as soon as your package is loaded.
  • RoxygenNote: specifies the version of roxygen2 being used to manage your package documentation. It will be updated automatically.
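
When a package has more than one contributor, the person() entries in Authors@R: are combined with c(). A minimal sketch, in which the second author is a made-up placeholder:

```r
# person() ships with R's utils package, so this runs in any session.
authors <- c(
  person("Joshua", "Brinks", email = "jbrinks@isciences.com",
         role = c("aut", "cre")),
  # Hypothetical contributor, for illustration only.
  person("Jane", "Doe", role = "ctb")
)

# Print each person with their roles to check the entry before pasting
# the c(...) expression into the Authors@R: field of the DESCRIPTION.
format(authors, include = c("given", "family", "email", "role"))
```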

Additional Fields

These are other common fields.

  • URL: Any appropriate package or personal website. I usually list the Git pkgdown website here.
  • Imports: is a list of packages that your package depends on to carry out the core functions found in the R/ directory. If you have a function in the R/ directory that uses data.table::fread(), dplyr::filter(), and ggplot2::geom_point(), these packages must be listed in Imports:. This ensures that these dependencies are installed whenever your package is installed. The syntax for Imports: and Suggests: is:
Imports: data.table,
    dplyr,
    ggplot2
  • Suggests: is similar to Imports:, but for packages that are not needed by your core functions. These are typically packages used in your vignettes (rmarkdown, leaflet), but you may also list a package used by a rarely called function in the R/ directory that you don’t want installed automatically, as a courtesy to your users.
  • Remotes: is used to specify packages your package depends on that are not released on CRAN but are available on GitHub or GitLab. The syntax is gitsite::user/repository.
Remotes: gitlab::dante-sttr/commonCodes,
    gitlab::dante-sttr/untools

The simplest way to add a package dependency is with usethis, although I typically edit the DESCRIPTION file directly.

usethis::use_package("ggplot2")
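
For a package listed under Suggests: rather than Imports:, a common pattern is to check that it is available before calling it from a function in R/. The sketch below uses a hypothetical helper name, check_suggested(), that is not part of any package; usethis::use_package() takes a type argument for recording the dependency in the right field.

```r
# Recording a suggested dependency (run from the package project, not here):
#   usethis::use_package("leaflet", type = "Suggests")

# Hypothetical helper: stop with a readable message if a suggested
# package is not installed, otherwise return TRUE invisibly.
check_suggested <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is needed for this function. ",
         "Please install it.", call. = FALSE)
  }
  invisible(TRUE)
}

# 'stats' ships with R, so this check passes.
check_suggested("stats")
```

Inside a package function you would call check_suggested("leaflet") first and then use the package with explicit namespacing, for example leaflet::leaflet().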

Here is an example of a completed DESCRIPTION from the duplicator package.

RStudio Build Tools and Roxygen Options interfaces.

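
As a text sketch of how the fields discussed above fit together, the code below writes a small DESCRIPTION for the myresearch package to a temporary directory and reads it back with base R's read.dcf(). The title, description, version, and dependency lists are illustrative placeholders, not the contents of any real package.

```r
# Illustrative DESCRIPTION contents; every value is a placeholder.
desc_lines <- c(
  'Package: myresearch',
  'Title: Data and Functions for My Research Project',
  'Version: 0.1.0',
  'Authors@R: person("Joshua", "Brinks", email = "jbrinks@isciences.com",',
  '    role = c("aut", "cre"))',
  'Description: Bundles the data, functions, and vignettes for a project.',
  'License: GPL-3',
  'Encoding: UTF-8',
  'LazyData: true',
  'Imports:',
  '    data.table,',
  '    ggplot2',
  'Suggests:',
  '    rmarkdown'
)

# Write to a scratch location so no real package is touched.
desc_file <- file.path(tempdir(), "DESCRIPTION")
writeLines(desc_lines, desc_file)

# read.dcf() parses the "Field: value" format into a one-row matrix.
desc <- read.dcf(desc_file)

# Author and Maintainer are generated from Authors@R at build time,
# so only the remaining mandatory fields are checked here.
mandatory <- c("Package", "Version", "License", "Description", "Title")
missing_fields <- setdiff(mandatory, colnames(desc))

# Split the comma-separated Imports field into package names.
imports <- trimws(strsplit(desc[, "Imports"], ",")[[1]])
imports
```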

Importing data.table and tidyverse Packages

When importing either data.table or the tidyverse packages you must accommodate their special operators and naming conventions (for example, data.table lets you refer to columns by unquoted names), which are not part of base R programming. For the tidyverse this refers to the %>% (pipe) operator that comes from the magrittr package. data.table implements several additional special symbols and operators, including .N, .I, and :=. If these are not addressed, your package will return warnings and errors when executing build checks. usethis has functions to assist in setting these up.

usethis::use_data_table()

usethis::use_pipe()

These functions will adjust the Imports: field of your DESCRIPTION. Additionally, they will each create a non-function file in your R/ directory (utils-data-table.R and utils-pipe.R). The utils-data-table.R file needs an addendum to handle the special operators. The base file created is:

# data.table is generally careful to minimize the scope for namespace
# conflicts (i.e., functions with the same name as in other packages);
# a more conservative approach using @importFrom should be careful to
# import any needed data.table special symbols as well, e.g., if you
# run DT[ , .N, by='grp'] in your package, you'll need to add
# @importFrom data.table .N to prevent the NOTE from R CMD check.
# See ?data.table::`special-symbols` for the list of such symbols
# data.table defines; see the 'Importing data.table' vignette for more
# advice (vignette('datatable-importing', 'data.table')).
#
#' @import data.table
NULL

As stated, you must add an additional line for the special operators. I would add the most common ones:

# data.table is generally careful to minimize the scope for namespace
# conflicts (i.e., functions with the same name as in other packages);
# a more conservative approach using @importFrom should be careful to
# import any needed data.table special symbols as well, e.g., if you
# run DT[ , .N, by='grp'] in your package, you'll need to add
# @importFrom data.table .N to prevent the NOTE from R CMD check.
# See ?data.table::`special-symbols` for the list of such symbols
# data.table defines; see the 'Importing data.table' vignette for more
# advice (vignette('datatable-importing', 'data.table')).
#
#' @import data.table
#' @importFrom data.table .N .I :=
NULL

Additional Directories

There are additional directories that are both common constructs in the R community and helpful for research-specific workflows. These include raw-data/, raw-scripts/, and inst/. Click on New Folder in the RStudio Files window to create these directories.

  • The /raw-data/ folder is where you place scripts used to import, pre-process, and embed datasets into your package. This will be explained in greater detail later.
  • The /raw-scripts/ directory is where you keep standard scripts with notes as you work out your workflow and the code you will eventually wrap up and document in a function. This directory is less common and the naming is not widely accepted; however, it’s good practice to keep rough drafts of the code prior to wrapping it up into a function.
  • The /inst/ folder contains additional files vital to your package that are not scripts or vignettes and cannot be directly embedded as .RData files. These may be complex copyright or licensing agreements that cannot be captured by the DESCRIPTION, external and unprocessed data, the package citation, and code from other languages. These files will be installed along with the package when someone else installs your package, so think carefully before including massive amounts of data or potentially harmful or sensitive scripts and data. When your package is installed, everything in the /inst/ folder is moved up to the package’s root level. This is somewhat confusing at first. For example, while working directly on your package you may have:
myresearch/inst/COPYRIGHT.TXT
myresearch/inst/extdata/france.shp

When your package is installed locally or on another computer these files are accessible at:

myresearch/COPYRIGHT.TXT
myresearch/extdata/france.shp

We will discuss how to programmatically access /inst/ data in the embedded data section.
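
As a brief preview, base R's system.file() is the usual way to resolve these installed paths without hard-coding them. The sketch below assumes the hypothetical myresearch package and france.shp file from the example above; because that package is unlikely to be installed where you run this, the same lookup is also shown against the utils package that ships with R.

```r
# Hypothetical: resolves only once 'myresearch' is installed with
# inst/extdata/france.shp. system.file() returns "" when nothing is found.
shp_path <- system.file("extdata", "france.shp", package = "myresearch")
if (!nzchar(shp_path)) {
  message("myresearch (or the file) is not installed on this machine.")
}

# The same lookup against a package that ships with R: every installed
# package has a DESCRIPTION file at its root.
desc_path <- system.file("DESCRIPTION", package = "utils")
file.exists(desc_path)
```

Note that the path passed to system.file() omits inst/, mirroring the installed layout shown above.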

At this time you may also create the data/ and vignettes/ folders, but usethis will do this automatically with functions specifically designed to embed data and create vignettes.
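
If you prefer code to the New Folder button, the directories can also be created from the console. The sketch below works in a scratch copy under tempdir() rather than a real project, and it assumes you want the two development-only folders listed in .Rbuildignore; in a real package you would run the equivalent calls from the package root.

```r
# Scratch location standing in for the package root (illustration only).
pkg_root <- file.path(tempdir(), "myresearch")
unlink(pkg_root, recursive = TRUE)  # start from a clean scratch copy

# Create raw-data/, raw-scripts/, and inst/extdata/ in one pass.
dirs <- file.path(pkg_root,
                  c("raw-data", "raw-scripts", file.path("inst", "extdata")))
created <- vapply(dirs, dir.create, logical(1),
                  recursive = TRUE, showWarnings = FALSE)

# Keep the development-only folders out of the built package by appending
# escaped patterns to .Rbuildignore (one pattern per line).
ignore_file <- file.path(pkg_root, ".Rbuildignore")
write(c("^raw-data$", "^raw-scripts$"), file = ignore_file, append = TRUE)

list.files(pkg_root, all.files = TRUE, no.. = TRUE)
```

The inst/ folder is deliberately left out of .Rbuildignore, since its contents are meant to be installed with the package.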
