Submitting a Dataset Page to the DANTE Project

Joshua Brinks, ISciences, LLC

Summary

The DANTE Project is a community-based effort designed to accept contributions from researchers, practitioners, and students.
These vignettes assist users in contributing to the DANTE Project’s automated GitLab “back end”, which is ingested into the live website.

Introduction

The DANTE Project provides an open-source community platform to lower the barriers to entry for climate security research and policy making. One of the core components of the project is the dataset library. Although DANTE does not host or distribute datasets, we provide a catalog of data widely used in human geography, political science, and global climate research. Moreover, in contrast to the Socioeconomic Data and Applications Center (SEDAC), the Humanitarian Data Exchange, and similar data-hosting sites, we pair each dataset with complementary tools and commentary tailored to it. In addition to standard information describing dataset authors, hosting information, and spatial and temporal extents, each dataset page may include:

Discussion points on the dataset’s use in the research and practitioner communities.
Critical commentary on where the dataset excels.
Conversely, commentary on the data’s potential biases, drawbacks, or other methodological flaws.
R packages or other software developed specifically to complement the dataset.
Vignettes and tutorials utilizing the data.
Direct commentary from DANTE users.

Completing the Template

Accessing the RMarkdown Template

The danteSubmit package can be installed via GitLab with the devtools package using the following command:

devtools::install_gitlab("dante-sttr/danteSubmit")

Following installation, the dataset template is accessible in RStudio through the File > New File > R Markdown ... > From Template > Dante Dataset Submission menu. This will create a new directory in your home directory containing the dataset skeleton.Rmd file, which provides the framework of the submission with several DANTE-specific metadata fields.

The rmarkdown template interface.
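
For users working outside the RStudio menus, a draft can in principle be created from the console as well. The following is a minimal sketch, not the project's documented workflow: it assumes danteSubmit stores its template in the conventional inst/rmarkdown/templates location, looks up whatever template directories the installed package provides rather than guessing a name, and uses my-dataset.Rmd as a placeholder file name.

```r
# Assumption: danteSubmit stores its template under inst/rmarkdown/templates,
# the conventional location for R Markdown templates in a package.
template_dir <- system.file("rmarkdown", "templates", package = "danteSubmit")

if (nzchar(template_dir) && requireNamespace("rmarkdown", quietly = TRUE)) {
  # Directory names of the templates the installed package provides
  available <- list.files(template_dir)
  print(available)

  if (length(available) > 0) {
    # "my-dataset.Rmd" is a placeholder file name for this sketch.
    rmarkdown::draft("my-dataset.Rmd",
                     template = available[1],
                     package = "danteSubmit",
                     edit = FALSE)
  }
} else {
  message("danteSubmit (or rmarkdown) is not installed; install it first.")
}
```

If the package is not installed, the lookup returns an empty string and the sketch only prints a message.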

Template Fields

RMarkdown skeleton.Rmd files are typically composed of two sections: 1) the YAML metadata header, and 2) the text body. The skeleton.Rmd begins with the YAML header, which is demarcated by two sets of ---. The text body uses traditional sectioning with the RMarkdown language. More information on RMarkdown formatting can be found on the official R Markdown site. DANTE dataset submissions require no prior knowledge of RMarkdown syntax, and users may delete any YAML fields not relevant to the dataset (additional authors, strengths, weaknesses, spatial information for non-spatial data, etc.). Nearly every metadata field for DANTE dataset submissions is adapted from the Federal Geographic Data Committee’s (FGDC’s) Content Standard for Digital Geospatial Metadata. This standard is widely used and employs thoroughly vetted nomenclature and definitions.
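
To make the two-part layout concrete, the sketch below splits a toy .Rmd into its YAML header and text body using only base R. The file contents are illustrative assumptions (the real skeleton.Rmd contains many more fields, and the license value is made up); the only convention relied on is that the header sits between the first two lines consisting of ---.

```r
# A toy .Rmd; the real DANTE skeleton has many more fields (values are made up).
rmd_lines <- c(
  "---",
  "metadata-date: 2021-06-15",
  "use-constraints: CC BY 4.0",
  "---",
  "",
  "The written body starts here."
)

# The YAML header sits between the first two lines consisting only of '---'.
delims <- grep("^---\\s*$", rmd_lines)
stopifnot(length(delims) >= 2)

yaml_header <- rmd_lines[(delims[1] + 1):(delims[2] - 1)]
text_body   <- rmd_lines[(delims[2] + 1):length(rmd_lines)]

print(yaml_header)
print(text_body)
```

Deleting an unused field amounts to removing its line (and any indented sub-fields) from the header portion; the body is untouched.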

YAML Metadata

metadata-contact: Name, email, and affiliation (if applicable), of the individual or institution completing the DANTE dataset submission. Multiple authors are separated by - (demonstrated in the template; delete unused slots).
metadata-date: Date the dataset submission was prepared. Enter the date manually (for example, 2021-06-15) rather than using a dynamic date that is regenerated each time the document is knit.
citation-information: Citation information of the dataset. When possible, populate these fields with the official citation metadata. When no accompanying manuscript or official citation exists, populate the fields with the best available information.
- title: The official title of the dataset or accompanying manuscript. If the official title is long enough to disrupt the website presentation, it may be abbreviated to a common name.
- edition: Current version or edition of the dataset.
- publication-date: Date of the most recent release of the dataset.
- geospatial-data-presentation-form: The form or datatype of the dataset. A minimum of one word describing the data format, e.g. raster, tabular, spatial points, shapefile, country-year, dyadic, etc.
- publisher: Name of institution responsible for publication of the dataset.
- online-linkage: URL for the location of the current version of the dataset.
- dante-citekey: If the dataset already exists in the DANTE Citation Repository, list its citation key in the form AuthorYEAR. Otherwise, delete this field.
contact-information: Contact information for the dataset authors. Multiple authors are separated by a - (demonstrated in the template; delete unused slots).
contact-person: Name(s), email(s), and affiliation(s) (if applicable), of the individual(s) or institution(s) who authored the dataset. When possible these should match the information for any peer reviewed manuscript that accompanied the release of the dataset. Multiple authors are separated by - (demonstrated in the template; delete unused slots).
dataset-highlights: Approximately 2-3 bullet points highlighting key points of the dataset. This section is best served by illustrating brief strengths and weaknesses of the dataset.
abstract: The official abstract for the dataset, if one exists. This may be copied verbatim as long as the DANTE submission contains a direct URL to the dataset hosting site. If no abstract exists, delete this section and make sure the discussion section of the submission template contains adequate descriptive information.
use-constraints: Dataset license specification or written text describing dataset use restrictions.
spatial-information:
- bounding-coordinates: Geographic scope of the dataset relayed as a four-point bounding box. When using R, these coordinates can be extracted using raster::extent().
- spatial-reference-information:
  - coordinate-system: Dataset coordinate system (UTM, Latitude-Longitude, etc.)
  - resolution: Dataset spatial resolution.
  - units: Dataset resolution units (meters, decimal degrees, etc.).
  - geodetic-model: Geodetic model used for projection (commonly WGS1984).
time-period-information:
- beginning-date: First date of observations.
- ending-date: Final date of observations.
- resolution: Integration period or temporal resolution of dataset (annual, monthly, weekly, daily, etc.).
related-packages: R packages designed to acquire, process, analyze, or visualize the dataset. Related links utilize raw HTML. Examples for internal and external packages are populated in the rmarkdown template. Replace dante-package-name with the package name found in the associated DANTE url. For example, the entry for demcon would be:

- <a href="/demcon">demcon</a>

External packages without a DANTE page can be linked to their hosting site:

- <a href="https://github.com/vdeminstitute/vdemdata">vdemdata</a>

related-datasets: Similar datasets to the one being presented in the submission. Examples for internal and external datasets are populated in the rmarkdown template. Replace dante-dataset-name with the dataset name found in the associated DANTE url. For example, the entry for The Standardized Precipitation Evapotranspiration Index (SPEI) would be:

- <a href="/datasets/spei">Standardized Precipitation Evapotranspiration Index</a>

External datasets without a DANTE page can be linked to their hosting site:

- <a href="https://ldas.gsfc.nasa.gov/gldas">GLDAS 2.1</a>

related-vignettes: Vignettes or other tutorials featuring the dataset. Examples for internal and external vignettes are populated in the rmarkdown template. Replace dante-vignette-name with the vignette name found in the associated DANTE url. For example, the entry for Country Coding Considerations for Dataset Harmonization and Applied Uses, with its title truncated to improve website rendering, would be:

- <a href="/vignettes/ccode-considerations">Applied Country Code Uses</a>

External vignettes without a DANTE page can be linked to their hosting site:

- <a href="https://r-spatial.github.io/sf/articles/sf4.html">Manipulating Simple Features</a>

bibliography: File name for the bibliography used to properly cite the “Discussion” section. If there are no citations (there should be at minimum a citation for the dataset being submitted) delete this section.

The following fields may be left blank and filled in by DANTE staff. If you would like to supply the imagery used for catalog browsing and dataset page rendering on the website, complete them as described below.

browse-image: File name for the image to be used while browsing on the DANTE website. This may be left blank for project administrators to handle. If you would like to provide an image, please crop it to 300 x 225 pixels. The file name must match that of the dataset .Rmd submission using the format rmd-submission-browse.jpg. For example, the SPEI file is named spei-browse.jpg.
subhead-image: File name for the image to be used as a subheading of the dataset page on the DANTE website. This may be left blank for project administrators to handle. If you would like to provide an image, please crop it to 920 x 180 pixels. This should be the same image as the browse image, re-cropped. The file name must match that of the dataset .Rmd submission using the format rmd-submission-subhead.jpg. For example, the SPEI file is named spei-subhead.jpg.
image-attribution: All submitted website imagery must be properly attributed using raw HTML. The attribution snippet is almost always available from the open-source photo website where the imagery was acquired; paste it into the slot verbatim. Here is the attribution for the MapSPAM DANTE page:

image-attribution: <span>Photo by <a href="https://markusspiske.com/">Markus Spiske</a></span>

output: This identifies the rmarkdown template used to compile the submission. It should not be altered by the user.
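
The bounding-coordinates and resolution fields described above can be read straight from a raster object. The following is a minimal sketch, assuming the raster package is installed; the toy global grid stands in for a real dataset, and the labels west, east, south, and north are chosen for illustration only (use the field names supplied in the template).

```r
library(raster)

# A toy global raster stands in for a real dataset (illustrative values only).
r <- raster::raster(nrows = 18, ncols = 36,
                    xmn = -180, xmx = 180, ymn = -90, ymx = 90)

# The extent holds the four edges of the dataset's bounding box.
e <- raster::extent(r)

# Labels below are chosen for this sketch; use the template's own field names.
bbox <- c(west  = raster::xmin(e),
          east  = raster::xmax(e),
          south = raster::ymin(e),
          north = raster::ymax(e))
print(bbox)

# Cell size in the units of the coordinate system, relevant to 'resolution'.
cell_size <- raster::res(r)
print(cell_size)
```

For a real dataset, replace the toy raster with one read from file, for example with raster::raster() pointed at the downloaded data.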

Written Body

The written body is filled out after the second set of --- that demarcates the end of the YAML metadata header. In the body of the text, attempt to provide a brief description of the dataset in question. If possible, you should also provide additional context for this dataset. This may include:

Limitations in data quality, temporal coverage, or resolution.
Comparisons and contrasts with similar datasets.
Additional positive and negative aspects beyond bullet points listed in the YAML header.
Brief references to prominent peer reviewed or commissioned reports featuring the dataset.
Brief passage describing functionality of related R packages listed in the YAML header. Do they provide API interfaces, data processing, analysis, or visualization functionality? Do they work with the current release of the data or are they deprecated?
Brief passage relating the nature of the vignettes listed in the YAML header.

This section should be properly cited using rmarkdown conventions when possible.

Submissions should also provide a visual representation of the dataset. Ideally this would be a figure that helps the reader, but if this is not applicable or beyond your skillset, a table would suffice. Alternatively, you may include a representative image from a .PNG file; make sure this image is properly credited. Whether you’re displaying a standalone image or generating a figure with code, it’s best to produce it inside of a code chunk.

A figure produced with code:

library(ggplot2)

# Built-in dataset: heights and weights for 15 women
data(women)

# Scatter plot with a fitted linear trend and its confidence band
women.scatter <- ggplot2::ggplot(women, ggplot2::aes(x = height, y = weight)) +
  ggplot2::geom_point(size = 3) +
  ggplot2::stat_smooth(method = "lm", formula = y ~ x) +
  ggplot2::labs(title = "Weight as a Function of Height for 15 Women",
                x = "Height (Inches)",
                y = "Weight (lbs.)") +
  ggplot2::theme_minimal()
women.scatter

Figure 2: Weight as a function of height for a subset of 15 women. Blue line represents linear relationship with 95% confidence interval (grey bands).

The following is the current best practice on DANTE for submitting a pre-configured figure from a .PNG file:

library(png)

img<-png::readPNG("screenshot.png")
grid::grid.raster(img)

Reference

This should not be altered by the user. It will generate the full citation for any references listed in the written body. If you did not include citations, delete this section heading.

Submitting the Template

If you’re still uncertain about how to proceed, please review the completed dataset pages on our GitLab Project Page. You can see several examples for a variety of datasets and types of written discussion.

After all relevant fields are complete, compile and review the HTML output by knitting the document in RStudio. The submission is then ready: push dataset-name.html, dataset-name.Rmd, and all other relevant files to the development section of danteSubmit, in a folder with the same name as the .Rmd submission, using a merge request.
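
As a quick check before opening the merge request, the naming conventions described in this vignette can be written down in a few lines of base R. The helper below is hypothetical (it is not part of danteSubmit); it only derives the folder and file names implied by the name of the .Rmd submission.

```r
# Hypothetical helper, not part of danteSubmit: derive the folder and file
# names that the naming conventions in this vignette imply for a submission.
expected_submission_files <- function(rmd_file) {
  stem <- tools::file_path_sans_ext(basename(rmd_file))
  list(
    folder        = stem,
    rmd           = paste0(stem, ".Rmd"),
    html          = paste0(stem, ".html"),
    browse_image  = paste0(stem, "-browse.jpg"),   # optional, 300 x 225 pixels
    subhead_image = paste0(stem, "-subhead.jpg")   # optional, 920 x 180 pixels
  )
}

files <- expected_submission_files("spei.Rmd")
str(files)

# Knitting from the console instead of the RStudio button (not run here):
# rmarkdown::render(files$rmd)
```

The two image entries apply only if you chose to supply imagery; otherwise project administrators handle them.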

If you’re not comfortable with Git, you may simply reach out to me, Josh Brinks, by email.