Joshua Brinks, ISciences, LLC

Summary

The importance of replication and reproducibility are at the core of the scientific process, but these components are readily overlooked.
Not only are they overlooked, but the difficulty in attempting to reproduce a peer reviewed study is rarely demonstrated.
In this series of vignettes we attempt to demonstrate reverse-engineering a recent high-impact peer review publication in the eco-security sector.

Introduction

Welcome to the 3^rd segment of our series replicating Missirian and Schenkler’s 2017: Asylum applications respond to temperature fluctuations. In our first entry, we introduced the premise of their analysis and presented several outstanding questions after reviewing the supplementary materials describing the methods. In the second installment we carried out data pre-processing by calculating weighted mean surface temperature and total precipitation by country and summarized outgoing annual asylum application from the authors’ study countries. In our 3^rd entry we’ll generate core figures from the authors’ supplementary materials.

The authors did not provide raw or processed data with their supplementary materials, however, the manuscript presents several exploratory visualizations. By replicating these visualizations we can be more confident our data processing steps are congruent with their approach. We will reproduce a selection of these graphics to build confidence we have similar underlying datasets for modeling. These include: histograms of temperature and applications anomalies (Figure S4), the global maize cropping fraction dataset (Figure S1), a choropleth of mean baseline temperature for study countries weighted by maize cropping fraction (Figure S2), and time series plots depicting the countries producing the highest cumulative number of applicants over our study period (Figure S5).

Maize Cropping Fraction

The authors produce a rudimentary map of the maize cropping calendar dataset. We’ll modernize the map by re-creating it as an interactive display with leaflet. For dramatic effect, the authors manipulate the scale of fraction data. This isn’t out of bounds by any means, but the log scale of cropping fraction is a bit silly and uninterpretable at minuscule values. We’ll start by generating the legend scale and then map it to a color ramp similar to theirs.

library(leaflet)

# The maize fraction raster
bins<-emdbook::lseq(0.0000000000000181,1.73, length.out = 26)
pal <- colorBin(c("whitesmoke","yellow", "orange", "red", "red4"), raster::values(duplicator::maize.frac), bins=bins,
 na.color = "transparent")

The call to leaflet for this map is quite simple. We can embellish their design with a Stamen basemap and opacity to the raster layer.

leaflet::leaflet() %>%
 leaflet::addProviderTiles(leaflet::providers$Stamen.TonerLite) %>%
 leaflet::addRasterImage(duplicator::maize.frac,
 opacity = 0.8,
 colors = pal) %>%
 leaflet::addLegend(
 pal = pal,
 values = raster::values(duplicator::maize.frac),
 title = "Maize Cropping Fraction"
 ) %>%
 leaflet::setView(zoom = 2.25, lat = 0, lng = 0) %>%
 addScaleBar(position = c("bottomleft"),
 options = scaleBarOptions(metric = TRUE))

$Full interactive figure can be found on [duplicator website](https://dante-sttr.gitlab.io/duplicator/articles/miss-schenk-2017-vis.html#maize-cropping-fraction).$

Full interactive figure can be found on duplicator website.

The result is visually compelling and faithfully reproduces their plot, however, leaflet’s native rounding behavior reveals the absurdity of the log scale legend. More than 20 of the legend entries are rounded to 0 with 3 significant digits. Their values are listed in scientific notation, but we let leaflet round the values. This is raw data so it should be a near replica, however, these datasets are regularly updated or move hosting locations so it’s good to verify, and it is a good opportunity to demonstrate a simple leaflet map.

Mean Temperature Choropleth

Re-creating their weighted mean temperature choropleth with some embellishments in leaflet is a little more complicated. First we must merge the prepared model.dat data with a shapefile of our study country boundaries so our prepared data is associated with the relevant polygon. If you cleared your workspace since we processed the temperature and asylum data, the output from the previous vignette can be retrieved with duplicator::model.dat. You may also need to regenerate the shapefile of our study countries. By default, GADM shapefiles have a high level of detail, which often results in extremely long rendering times within interactive plotting frameworks like leaflet, tmap, and mapview. Therefore, it’s good practice to simplify complex shapefiles before employing them in these environments. The study_countries.shp file included with duplicator has been pre-processed with sf::st_simplify() for faster plotting.

model.dat<-duplicator::model.dat
gadm.study.countries<-sf::st_read(system.file("shape/study_countries.shp", package="duplicator"))
#> Reading layer `study_countries' from data source 
#> `C:\Users\jbrin\Documents\R\win-library\4.0\duplicator\shape\study_countries.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 103 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -180 ymin: -34.83292 xmax: 180 ymax: 81.85468
#> Geodetic CRS: WGS 84

country.temps <-
 model.dat[, .(temp = mean(temp.base), apps = exp(sum(apps))), by = .(name, iso3)]
country.temps <-
 merge(
 gadm.study.countries,
 country.temps,
 by.x = "GID_0",
 by.y = "iso3",
 all.x = TRUE
 )

Next, we mimic their legend and palette. The legend is way too big for the 5.5" rmarkdown window so it’s commented out. You can “turn” it back on if you’re viewing in a larger window.

bins<-seq(-6,38,1)
pal <- leaflet::colorBin(c("blue4","blue","cyan2","green", "yellow", "orange", "red", "red4"),
 domain=country.temps$temp, bins=bins,
 na.color = "transparent")

We can create some leaflet embellishments like country outlines, hover highlighting, and country information popups, but first we need to generate the labels.

labels <- sprintf(
 "<strong>%s</strong><br/>%g Mean Temp (C)",
 country.temps$name, country.temps$temp
) %>% lapply(htmltools::HTML)

Now we can produce the map.

leaflet::leaflet(country.temps) %>%
 leaflet::addProviderTiles(providers$Stamen.TonerLite) %>%
 leaflet::addPolygons(
 fillColor = ~ pal(temp),
 weight = 2,
 opacity = 1,
 color = "lightgrey",
 dashArray = 3,
 fillOpacity = 0.8,
 highlight = leaflet::highlightOptions(
 weight = 2,
 color = "#666",
 dashArray = "",
 fillOpacity = 0.7,
 bringToFront = TRUE
 ),
 label = labels,
 labelOptions = leaflet::labelOptions(
 style = list("font-weight" = "normal", padding = "3px 8px"),
 textsize = "15px",
 direction = "auto"
 )) %>%
 # leaflet::addLegend(
 # pal = pal,
 # values = country.temps$temp,
 # title = "Mean Temp (C)",
 # labels = as.character(seq(-6,38,2))
 # ) %>%
 leaflet::setView(zoom = 2.25, lat = 0, lng = 0) %>%
 leaflet::addScaleBar(position = c("bottomleft"),
 options = scaleBarOptions(metric = TRUE))

Full interactive figure can be found on duplicator website.

Ideally, we would generate a scatterplot of our weighted temperature calculations by country vs. the values listed in the manuscript, but the authors do not present a supplementary table of weighted mean temperatures for each country. Therefore, comparing our leaflet choropleth to Figure S2 in their manuscript is our only means by which to judge if we processed the data in a similar manner. The general pattern looks very similar and most of the countries that stand out in their choropleth (Russia, Mongolia, Peru, and Algeria) have similar estimates in our leaflet map. Although there may be slight differences in the weighted estimations, we can be confident moving forward.

Anomoly Histograms

The authors create annual histograms of country temperature and asylum anomalies. This will help verify that we’ve processed the data in a similar fashion. These are fairly simple to make, but I’ll spruce them up with ggplot2 with a more modern palette, a minimal theme, and some transparency. We’ll start with the asylum applications.

ggplot2::ggplot(model.dat, ggplot2::aes(x=apps.anom))+
 ggplot2::geom_vline(xintercept=0, color = "#266197", size=1)+
 ggplot2::geom_histogram(ggplot2::aes(y=..density..),
 color="#a90aa1", 
 fill = "#edceec", 
 alpha=0.6,
 bins = 40)+
 ggplot2::facet_wrap(year~.)+
 ggplot2::labs(title="Annual Asylum Application Anomalies",
 subtitle="For Non-OECD Countries to EU Members from 2000-2014",
 x="Asylum Application Anomalies (Log)",
 y="Density")+
 ggplot2::scale_y_continuous(breaks = seq(0,2,0.5), labels=as.character(seq(0,2,0.5)))+
 ggplot2::theme_minimal()

Our annual distributions closely resemble theirs. I would chalk most of the slight differences up to the specified bin-width of the respective plots. The authors appear to use variable bin numbers across the annual plots. I specified bins = 40 to approximate their plots a little better than with geom_histogram()’s default setting of bins = 30. The United Nations asylum application data has undergone significant changes since the manuscript was published; there is a possibility certain countries retroactively redacted or made other corrections to their reported values.

And now the temperature anomalies with nearly identical code.

ggplot2::ggplot(model.dat, ggplot2::aes(x=temp.anom))+
 ggplot2::geom_vline(xintercept=0, color = "#266197", size=1)+
 ggplot2::geom_histogram(ggplot2::aes(y=..density..),
 color="#a90aa1", 
 fill = "#edceec", 
 alpha=0.6,
 bins = 40)+
 ggplot2::facet_wrap(year~., scales = "fixed")+
 ggplot2::labs(title="Annual Temperature Anomalies",
 subtitle="For Non-OECD Countries w/ Asylum Applications to the EU from 2000-2014",
 x="Temperature Anomalies",
 y="Density")+
 ggplot2::scale_y_continuous(breaks = seq(0,2.5,0.5), 
 labels=as.character(seq(0,2.5,0.5)))+
 ggplot2::theme_minimal()

The authors truncate the lower bounds of their temperature histograms to left-censor an outlier at (-3). No explicit reason is given, but I left the outliers in to get a better sense of the underlying data. The point of exploratory visualizations is to identify potential outliers, underlying distributions, and additional data abnormalities that may affect our modeling results.

A closer comparative examination of our temperature anomaly histograms suggests we diverted paths at some point while processing the weighted mean temperature. While most years are similar in shape, the magnitude of our negative outliers in 2011 and 2012 are much greater, and our plot for the year 2000 exhibits disparities that may be more than a function of just differential binning.

Time Series for Top Asylum Countries

Lastly, we’ll reproduce annual time series plots depicting mean weighted temperature and asylum applicants from the 4 countries that produced the most asylum seekers over our study period. We’ll start by creating a subset data.table with only the 4 highest source countries.

top.apps<-model.dat[,.(total.apps=sum(apps)), by=.(name,iso3)][order(-total.apps)][1:4][,iso3]
ts.app.plot<-model.dat[iso3 %in% top.apps]

The rest is straightforward, however, we’ll utilize calls to ggplot2::scale_y_continuous() and ggplot2::scale_x_continuous() to match their axes. First, the annual application anomalies.

palette<-c("#266197", "#82dc59", "#a90aa1", "#6dc5dd", "#794a98", "#57ecc0")
ggplot2::ggplot(ts.app.plot, ggplot2::aes(x=year, y=apps.anom, group=iso3))+
 ggplot2::geom_hline(yintercept=0, color = "darkgrey", size=1)+
 ggplot2::geom_line(ggplot2::aes(color=iso3), size=1.5, alpha=0.8)+
 ggplot2::geom_point(ggplot2::aes(color=iso3), size=2, alpha=0.8)+
 ggplot2::labs(title="Application Anomalies to EU Members",
 subtitle = "From Top 4 Source Countries During 2000-2014",
 y="Asylum Application Anomalies (log)",
 x="",
 color = "Country (ISO3):")+
 ggplot2::scale_color_manual(values=palette[1:4])+
 ggplot2::scale_x_continuous(breaks=seq(2000,2014,5), minor_breaks = seq(2000,2014,1))+
 ggplot2::scale_y_continuous(breaks=seq(-1,1.25,0.25), minor_breaks = seq(-1,1.25,0.25))+
 ggplot2::theme_minimal()+
 ggplot2::theme(legend.position="bottom")

Our plot is very similar to to Figure S5 (Panel B), but there are very slight differences in the magnitude of certain points that should not be present. For example, our plot shows Russia with nearly -0.5 in the first year while the author’s plot shows less than -0.25. The general shapes hold true across the plot but there are slight differences suggesting either the underling dataset or the processing are different.

And now the temperature anomalies.

ts.temp.plot<-model.dat[iso3 %in% top.apps]

palette<-c("#266197", "#82dc59", "#a90aa1", "#6dc5dd", "#794a98", "#57ecc0")
ggplot2::ggplot(ts.temp.plot, ggplot2::aes(x=year, y=temp.anom, group=iso3))+
 ggplot2::geom_hline(yintercept=0, color = "darkgrey", size=1)+
 ggplot2::geom_line(ggplot2::aes(color=iso3), size=1.5, alpha=0.8)+
 ggplot2::geom_point(ggplot2::aes(color=iso3), size=2, alpha=0.8)+
 ggplot2::labs(title="Annual Temperature Anomalies",
 subtitle = "From Top 4 Application Source Countries During 2000-2014",
 y="Temperature Anomalies",
 x="",
 color = "Country (ISO3):")+
 ggplot2::scale_color_manual(values=palette[1:4])+
 ggplot2::scale_x_continuous(breaks=seq(2000,2014,5), minor_breaks = seq(2000,2014,1))+
 ggplot2::scale_y_continuous(breaks = seq(-1,1.5,0.5), minor_breaks = seq(-1,1.5,0.5))+
 ggplot2::theme_minimal()+
 ggplot2::theme(legend.position="bottom")

Similar to the asylum application time series, our temperature anomalies exhibit a very similar shape, but there are slight differences. Differences in temperature anomalies are more understandable. There are several ways to carry out the weighted temperature calculations, and variations in the country boundary shapefiles, aggregation of cropping fraction rasters, zonal statistics methodologies, adjusting weighted mean calculations for raster cell area, and handling of null cropping fraction data may all result in slight discrepancies.

Now that we’ve processed our data, performed some visual exploration, and are reasonably satisfied we’re working with similar data as Missirian and Schenkler, we can move on to the statistical modeling.

Replicating Missirian & Schenkler (2017): Data Visualization

Summary

Introduction

Maize Cropping Fraction

Mean Temperature Choropleth

Anomoly Histograms

Time Series for Top Asylum Countries

Add new comment

Plain text

DANTE - Data Analytics and Tools for Ecosecurity