Joshua Brinks, ISciences, LLC

Summary

  • Spouting the virtues of replicable, reproducible, and distributable research is commonplace.
  • However, there is a shortage of current, descriptive, and detailed guides for enacting such workflows.
  • In this series of vignettes, we provide detailed guides for several key components of replicable, reproducible, and distributable workflows.

This vignette is an excerpt from the DANTE Project’s beta release of Open, Reproducible, and Distributable Research with R Packages. To view the entire current release, please visit the bookdown site. If you would like to contribute to this bookdown project, please visit the project GitLab repository.

Reports & Manuscripts

Several of the most popular packages readily make use of vignettes to provide long form documentation and demonstrations of package functionality. Vignettes provide greater context for intended use of package functions beyond what’s available in the help-files. Some of my favorites include the sf and data.table packages. While these are the most common uses for package vignettes, they may also be used for research workflows and creating professional manuscripts. It may be helpful to develop vignettes that narrate individual components of your research workflow. These vignettes weave written narratives with code that documents any difficulties and idiosyncrasies of the data, functions, and packages used for the workflow. I typically create vignettes documenting:

  • Data acquisition and initial processing. This includes difficulties dealing with automated downloads or interfacing with APIs. It also covers establishing naming conventions, identifying coding schemes, ensuring numeric categorical variables are properly processed, and producing exploratory figures to visualize distributions, rare values, and correlation structures.
  • More intensive data processing and preparation methods such as imputation procedures, spatial data manipulations and summaries, and verifying the integrity of complicated merges.
  • Trial runs for statistical modeling and machine learning that will form the bulk of the final analysis. This may include developing models for a truncated set of data, documenting the impacts of function settings on results, and light variable selection exercises.
  • Exploring post-hoc analysis and visualizations expected to be part of a final manuscript.

These vignettes are far more beneficial than raw scripts filled with comments, minimal context, and no ready results or visualizations. It’s also helpful to begin constructing cited introductory or methods passages in these vignettes that may be carried over to the final manuscript. Starting off a workflow vignette with 1-2 cited paragraphs introducing the employed packages and underlying comparative methodologies lessens the workload of developing the final manuscript or technical report. Additionally, workflow vignettes are easily shared with colleagues, stakeholders, students, and clients. More importantly, they serve as detailed notes when you return to the research some months or years later. After you develop the workflow and desired results in individual vignettes, they can be concatenated into a single professional report or manuscript that also exists as a vignette within your package.

Utilizing vignettes to establish and report your research also helps ensure that your code works. It’s not uncommon, even for experienced researchers, to produce erroneous results or workflows when working with a loose collection of scripts being fed through the IDE console. Your local environment can quickly become cluttered with hundreds of objects, renamed datasets, and testing iterations that don’t represent your intended workflow. Vignettes are executed in their own “clean” environment that only contains the packages, data, and code inside the document. Moreover, vignettes are processed when the package is rebuilt, and if they fail to compile, the rebuild is halted with an identifying error. This acts as a safeguard against your workflow silently breaking. If your vignette uses an embedded dataset you have since altered, a function that’s no longer operating as intended, or any other unforeseen downstream consequence of a code change or typo, the rebuilt vignette will change or the build will fail.
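
The clean-environment idea can be checked directly. The following is a minimal sketch, assuming the callr package is installed (callr is not part of the workflow described here, and the object name stale_data and the vignette path in the comments are hypothetical). It shows that an object left over in your interactive session is not visible to a fresh R process, which is roughly the situation a vignette finds itself in when it is built.

```r
# Assumption: the callr package is installed; it is used here only for illustration.
library(callr)

# Simulate a cluttered interactive session with a leftover object.
stale_data <- data.frame(x = 1:3)

# The leftover object is visible in the current session...
visible_here <- exists("stale_data")

# ...but a fresh R process starts without it.
visible_in_clean_session <- callr::r(function() exists("stale_data"))

print(c(current = visible_here, clean = visible_in_clean_session))

# The same idea applied to a vignette (hypothetical path), rendered in isolation:
# callr::r(function(path) rmarkdown::render(path),
#          args = list("vignettes/data-acquisition.Rmd"))
```

If a vignette only renders when run from your cluttered console, rendering it in a separate process like this is one way to surface the hidden dependency before a package rebuild does.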

Creating a Vignette

R Markdown can produce outputs in several file formats, but we will focus on the two most common: HTML and PDF. The easiest way to create a new vignette is with the usethis package.

usethis::use_vignette("data-acquisition")

If it doesn’t already exist, usethis will create the myresearch/vignettes/ directory, create a new R Markdown vignette file (data-acquisition.Rmd) using the quoted name provided in the function, and make a few additions to the DESCRIPTION file (Suggests, VignetteBuilder). As a quick review from an earlier section, vignettes are created with rmarkdown, knitr, and Pandoc as follows:

  1. Vignettes are written in mostly plain text with code inside of “chunks” in an rmarkdown file (.Rmd).
  2. knitr executes any embedded code in the rmarkdown file (.Rmd), “knits” them together with the text, and produces a markdown file (.md).
  3. Pandoc converts the markdown (.md) file into the specified output format.
  4. For PDF outputs, Pandoc converts the markdown file into a LaTeX file (.tex), which is then compiled with your local LaTeX distribution. It is highly recommended that you use the tinytex R package as your LaTeX distribution.
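
The steps above can be sketched by running each stage by hand. This is an illustrative sketch rather than how you would normally build a vignette (rmarkdown::render() chains the stages for you). It assumes knitr and rmarkdown are installed, and the temporary file and its contents are made up for the example.

```r
library(knitr)
library(rmarkdown)

# Step 1: write a tiny R Markdown file to a temporary location.
fence <- strrep("`", 3)  # three backticks, built indirectly for readability
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(
  c("# Demo", "", paste0(fence, "{r}"), "1 + 1", fence),
  rmd_file
)

# Step 2: knitr executes the chunk and writes a markdown file (.md).
md_file <- knitr::knit(
  rmd_file,
  output = sub("\\.Rmd$", ".md", rmd_file),
  quiet = TRUE
)

# Step 3: Pandoc converts the markdown file to HTML (skipped if Pandoc is not found).
html_file <- sub("\\.md$", ".html", md_file)
if (rmarkdown::pandoc_available()) {
  rmarkdown::pandoc_convert(md_file, to = "html", output = html_file)
}

# Step 4 (PDF) additionally needs a LaTeX distribution, for example:
# tinytex::install_tinytex()
# rmarkdown::render(rmd_file, output_format = "pdf_document")
```

Inspecting the intermediate .md file is often the quickest way to tell whether a rendering problem comes from your R code (step 2) or from the document conversion (steps 3 and 4).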

Vignette Structure

R Markdown (.Rmd) files consist of two sections: the YAML header and the body. Generic vignettes created by usethis have condensed versions of both sections.

The base vignette created by `usethis::use_vignette()`.

Everything at the beginning of the document between the two sets of --- is the YAML header, and everything after is the body. Within the body, everything between pairs of ``` is a code chunk that is parsed by R. The first code chunk sets document-wide defaults for all code chunks. The default chunk options are:

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

You may establish a variety of document-wide settings for images, figures, and code parsing. Setting include = FALSE on a chunk ensures the chunk is run, but neither its code nor its output is displayed in the document. For more detailed information on available chunk options, refer to the knitr Chunk Options and Package Options reference guide.
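
As an illustration of extending the defaults, the sketch below adds a few commonly used knitr chunk options to the setup chunk. The specific values (figure size, resolution) are arbitrary choices for the example, not recommendations from this guide.

```r
library(knitr)

# Document-wide chunk defaults; individual chunks can still override any of these.
knitr::opts_chunk$set(
  collapse   = TRUE,    # merge code and its output into one block
  comment    = "#>",    # prefix for printed output
  fig.width  = 7,       # default figure width in inches (illustrative value)
  fig.height = 5,       # default figure height in inches (illustrative value)
  dpi        = 300,     # resolution for raster figures (illustrative value)
  message    = FALSE,   # hide package start-up and other messages
  warning    = FALSE    # hide warnings in the rendered document
)

# Check what is currently set.
current <- knitr::opts_chunk$get(c("fig.width", "fig.height", "dpi"))
print(current)
```

Hiding messages and warnings keeps reports tidy, but for workflow vignettes you may prefer to leave them visible so that problems in the data or code are not overlooked.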

YAML Header

The YAML header is used to provide document metadata and specify numerous options for document structure. YAML syntax implements a nesting structure for related options; an example of a typical HTML and PDF YAML is provided following the review of common fields:

General / HTML Document
  • title: The report or manuscript title.
  • author: The author(s). You may separate authors with commas. Alternatively, for more complex authors and affiliations, multiple authors may be listed with - using the following syntax.
---
author:
- John Doe
- Jane Doe
---
  • date: The date may be listed manually or automatically updated with inline R code such as `r Sys.time()`.
  • output: Establishes the output format. The most commonly used is html_document.
    • theme: Sets the desired theme and styling for the document. Several Bootswatch themes are available by default, but R Markdown offers lots of additional customization options; some of which will be discussed later.
    • toc: Establishes the table of contents when set to true.
    • toc_float: Places the table of contents to the left of the main body.
    • css: Name of a CSS file with optional custom styling for the HTML document.
    • dev: Sets the image output format. Vector formats (svg, pdf) maintain the highest detail, but raster formats (png) are often smaller.
    • number_sections: Set to true for numbered sections.
  • abstract: The document abstract, written inside double quotes.
  • bibliography: The BibTeX file used for the document citations.

This is the YAML header for this document.

---
title: "Open, Reproducible, and Distributable Research With R Packages"
author: "Joshua Brinks"
date: "June 1, 2020"
output:
  rmarkdown::html_document:
    toc: true
    toc_float: true
    theme: flatly
vignette: >
  %\VignetteIndexEntry{r-for-repro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: "repro.bib"
link-citations: yes
number_sections: false
csl: nature.csl
---
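
Because YAML nesting is expressed through indentation, it can help to check how R actually reads a header. The sketch below assumes the rmarkdown package is installed. It writes a shortened version of the header above to a temporary file (the file itself is made up for the example) and parses it with rmarkdown::yaml_front_matter().

```r
library(rmarkdown)

# Write a shortened header and a one-line body to a temporary .Rmd file.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(
  c(
    "---",
    'title: "Open, Reproducible, and Distributable Research With R Packages"',
    "output:",
    "  rmarkdown::html_document:",
    "    toc: true",
    "    theme: flatly",
    "---",
    "",
    "Body text."
  ),
  rmd_file
)

# Parse the header into a nested R list.
header <- rmarkdown::yaml_front_matter(rmd_file)

# Nested fields are reached by walking down the list.
html_options <- header$output[["rmarkdown::html_document"]]
print(header$title)
print(html_options)
```

If toc or theme were not indented beneath the output format, they would be parsed as separate top-level fields and would not be passed to html_document.
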
PDF Documents

When creating PDF documents there are additional YAML fields of note. Some of these are familiar LaTeX settings that can be specified by YAML fields.

  • output: Must be set to pdf_document.
    • fig_width, fig_height, fig_caption: These set document-wide figure defaults, but figure size and captions may also be set individually at each code chunk that generates an image or figure.
    • template: When developing PDF documents, the template is where the user specifies a custom TeX file that contains additional formatting options. You may specify a system path or even a file you place inside of /myresearch/inst/, but to spare you from troubleshooting differing system paths across Windows, Linux, and macOS, place this file in the vignettes directory until you are comfortable with basic package development. This is not a required field; however, if you are an experienced LaTeX user, place most of your standard preamble in this file. In most cases, your personal macros and customizations will work seamlessly with R Markdown and tinytex. One of the few exceptions is tables, which are best implemented using the kableExtra package.1 That being said, I suggest adding components one at a time. Start with the Default Pandoc LaTeX Template and edit it to your liking. This can be taken a step further by creating your own custom variables in the Pandoc template that link back to the YAML header. This allows you to set custom options and styling directly from the YAML header. For more information, review the Pandoc User’s Guide section on Template Syntax.
  • citation_package: Sets the desired citation back-end to either natbib or biblatex.
  • csl: The file containing the document citation style.
  • fontsize, fontenc, documentclass, geometry, mathfont, link-citations, linkcolor, urlcolor, colorlinks, citecolor: These are all common LaTeX options that can be specified within the YAML header. Alternatively, you may set them in a custom TeX template.
  • pkgdown: To properly render your PDF document in a pkgdown website you must list additional fields.
    • as_is: Must be set to true for pkgdown to not override stylings.
    • extension: Must be set to pdf so pkgdown does not override the document to an HTML when compiled on the package website.
    • resource_files: Specify files needed to properly render the PDF. This typically includes your bibliography and any custom TEX file(s). Multiple files are nested under the resource_files: field and separated with -.

This is a sample YAML header for a PDF document.

---
# Universal Fields
title: Modeling Conflict, Climate, and Human Migration
date: February 28, 2019
abstract: "Amazing work and fantastic findings."
resource_files:
  - josh-latex-pan-temp.latex
  - josh-references.bib
output:
  pdf_document:
    toc: true
    toc_depth: 3
    fig_crop: no
    template: josh-latex-pan-temp.latex
    citation_package: biblatex
    number_sections: true
pkgdown:
  as_is: true
  extension: pdf
fontsize: 11pt
geometry: = 2in
biblio-title: "References"
bibfile: josh-references.bib

# Custom YAML Pandoc Variables
line-numbers: true
list-tables: true
list-figures: true
cover-logo: "isciences.png"
corporate-disclaimer: true
header-text: "Modeling Conflict, Climate, and Human Migration: Phase I"

# Package indexing
vignette: >
  %\VignetteIndexEntry{split-duration-hindcast}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
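
The options nested under pdf_document in the header correspond to arguments of rmarkdown::pdf_document(). The sketch below assumes the rmarkdown package is installed and builds the equivalent format object in R, which can be handy when rendering from a script. The custom template is left out because that file is specific to the author's project, and the vignette path in the final comment is hypothetical.

```r
library(rmarkdown)

# Build a PDF output format that mirrors the nested YAML options above.
pdf_format <- rmarkdown::pdf_document(
  toc              = TRUE,
  toc_depth        = 3,
  number_sections  = TRUE,
  fig_crop         = FALSE,
  citation_package = "biblatex"
)

# The result is an output format object that render() accepts.
print(class(pdf_format))

# Rendering requires a LaTeX distribution such as the one tinytex installs:
# tinytex::install_tinytex()
# rmarkdown::render("vignettes/my-report.Rmd", output_format = pdf_format)
```

Top-level fields such as fontsize and geometry are not arguments of pdf_document(). They are Pandoc variables, so they stay in the YAML header or in your template.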

It’s beyond the scope of this tutorial to thoroughly review R Markdown syntax, R Markdown code chunk options, all potential YAML fields, and Pandoc Customization. These are some of the best resources for more advanced development:

  • RStudio’s R Markdown Basics is a quick guide for getting started with R Markdown documents and formatting.
  • Yihui Xie’s R Markdown: The Definitive Guide is, simply put, the definitive guide for all things R Markdown; an excellent resource for beginners and advanced R Markdown customization.2
  • The ymlthis package3 provides helper functions for YAML development in R.
    • The ymlthis YAML Fieldguide4 is a great quick reference for available fields across all R Markdown output formats.
  • The Pandoc User’s Guide is a comprehensive resource for developing custom Pandoc documents to integrate with R Markdown. The Pandoc manual does not provide any R Markdown specific documentation. It’s best used in conjunction with Yihui Xie’s R Markdown: The Definitive Guide, which indicates at which points the user should refer to the Pandoc manual for additional options.

Writing a Vignette

Research vignettes typically fall under 2 categories: 1) analysis demonstrations and tutorials, and 2) manuscripts and technical reports. When detailing methods and workflows, it’s more common to walk through processing steps individually in code chunks interwoven with written commentary. Every chunk, including loading libraries, should be visible in the final document.

Conversely, when creating technical reports and manuscripts, data processing and modeling are often front-loaded in the document within chunks that are not visible in the final document (echo = FALSE). This makes processing and modeling code easily accessible while under development. For manuscripts and technical reports, chunks within the greater body of text are raw code used to construct tables and figures. As opposed to developing manuscripts with static saved images in Word or a standard LaTeX distribution, these code chunks are dynamically linked to the processing code prefacing the written body. This ensures that your figures and tables are always representative of your data processing and modeling workflow. Moreover, when data processing and visualization code are properly functionalized, the workflow becomes centralized to a handful of documented scripts inside the myresearch/R/ directory. This greatly minimizes mistakes and results in more reliable and distributable research.
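
A minimal sketch of this pattern follows. The function name summarise_groups, the file it would live in, and the use of the built-in mtcars data are all stand-ins for illustration; in practice the function would sit in myresearch/R/ and operate on your project data.

```r
# Hypothetical contents of myresearch/R/summarise_groups.R:
# a documented helper that computes group means for one variable.
summarise_groups <- function(data, group, value) {
  stats::aggregate(data[value], by = data[group], FUN = mean)
}

# Front-loaded processing, placed in a hidden chunk near the top of the
# manuscript (chunk header: {r processing, echo = FALSE}).
mpg_by_cyl <- summarise_groups(mtcars, group = "cyl", value = "mpg")

# Later, inside the written body, a second hidden chunk builds the table
# from the processed object (chunk header: {r mpg-table, echo = FALSE}).
# knitr::kable(mpg_by_cyl, digits = 1, caption = "Mean mpg by cylinder count")
print(mpg_by_cyl)
```

Because the table is generated from mpg_by_cyl every time the document is rendered, a change to the data or to summarise_groups() is reflected in the manuscript automatically.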

References

1. Zhu, H. et al. kableExtra: Construct Complex Table with 'kable' and Pipe Syntax. (2019).
2. Xie, Y., Allaire, J. J. & Grolemund, G. R Markdown: The Definitive Guide. (2020).
3. Barrett, M. & Iannone, R. ymlthis: Write YAML for R Markdown, bookdown, blogdown, and More. https://ymlthis.r-lib.org/ (2020).
4. Barrett, M. & Iannone, R. The YAML Fieldguide. https://cran.r-project.org/web/packages/ymlthis/vignettes/yaml-fieldguide.html (2020).
