Joshua Brinks, ISciences, LLC

Summary

  • Replication and reproducibility are at the core of the scientific process, yet these components are readily overlooked.
  • Not only are they overlooked, but the difficulty of attempting to reproduce a peer-reviewed study is rarely demonstrated.
  • In this series of vignettes we attempt to reverse-engineer a recent high-impact, peer-reviewed publication in the eco-security sector.

Introduction

This is the first entry in the DANTE Project’s efforts to demonstrate modern analytical techniques through replication, and to develop a set of generalized R functions that make these analyses more accessible. Over the course of multiple vignettes, I will attempt to reverse-engineer Missirian and Schlenker’s fantastic 2017 publication examining European asylum applications in response to temperature fluctuations in non-OECD countries during 2000–2014. This well-received Science paper has, according to Google Scholar, accumulated nearly 100 citations since its release. As with most Science, Nature, and PNAS submissions, a detailed supplementary materials section accompanies the truncated featured manuscript.

My intent is to make high-impact research more accessible to researchers who are just starting out, or who are intimidated by modern quantitative techniques. Moreover, I hope to demonstrate the difficulty of replicating modern publications even when detailed supplementary materials are provided. The methods and supplementary materials for this publication are detailed in comparison to many peer-reviewed publications. I started by thoroughly reviewing them, but after planning an analytical workflow to mimic the manuscript, I was left with several questions regarding the authors’ data processing and statistical modeling:

  1. Why was the Monfreda and Ramankutty cropping data used over more recent data from MapSPAM? Presumably to maintain congruency with the cropping calendar data, which MapSPAM does not provide?

  2. What software and methodology were used to extract raster data to the ESRI/Garmin vector country boundaries? Zonal statistics methodologies vary widely in their handling of cells that are not entirely contained within the boundaries of the target polygon (a short comparison is sketched after this list).

  3. Were interpolated planting and harvest data used? Several countries listed as source countries in the supplementary materials have no valid planting or harvest data. If interpolated data were used, were they validated in any way? The authors of the planting and harvest dataset specifically warn against using the interpolated planting data, as it may contain wild inaccuracies.

  4. It’s not clear precisely how the weighted mean temperature was calculated. Was each temperature cell weighted by the “underlying” cropping fraction cell? The narrated portion of the methods leaves this open to interpretation.

  5. If the temperature was weighted by the spatially corresponding cropping fraction cell, was the cropping fraction data aggregated to match the resolution of the temperature data? The mean surface temperature data is at 0.5 x 0.5 degree resolution, while the cropping fraction data is at 5 arc-minute resolution. If the cropping fraction data was aggregated to match the surface temperature raster, what method was used? (One plausible workflow is sketched after this list.)

  6. Were cropping weights adjusted for cell area? The ground area covered by a raster cell of fixed angular resolution shrinks as you move poleward. With source countries ranging from Russia to South America, this will have a large impact on a weighted zonal extraction.

  7. What was the specific parameterization of the top model presented in the primary manuscript, and what software or packages were used in its implementation? It’s clear the preferred model employed quadratic terms for mean temperature, but it’s not clear exactly what the remaining “country fixed-effects” were. These remaining effects are also not listed in the coefficient table provided in the supplementary materials.

  8. Lastly, while significance levels for the parameters were provided, what, if any, out-of-sample goodness-of-fit tests were carried out to assess the suitability of the model? This is of particular importance because a large portion of the written narrative focuses on predicting future levels of asylum applications under varying climate scenarios.
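To illustrate the ambiguity in question 2, below is a minimal sketch contrasting two common zonal-statistics approaches in R: a cell-center extraction with terra and a coverage-fraction extraction with exactextractr. The file names are hypothetical stand-ins; this is not the authors’ workflow, just a demonstration of how this choice alone can shift country means.

```r
library(terra)          # raster data handling
library(sf)             # vector data handling
library(exactextractr)  # coverage-fraction zonal statistics

# Hypothetical inputs: a 0.5 degree temperature raster and country polygons.
temp      <- rast("tmp_2000_mean.tif")
countries <- st_read("country_bounds.shp")

# Approach 1: a cell contributes only if its center falls inside the polygon.
center_means <- extract(temp, vect(countries), fun = mean, na.rm = TRUE)

# Approach 2: every intersected cell contributes, weighted by the fraction
# of its area that falls inside the polygon.
fraction_means <- exact_extract(temp, countries, "mean")

# The difference is largest for small or coastline-heavy countries.
summary(center_means[[2]] - fraction_means)
```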
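Questions 4 through 6 compound one another. Below is a sketch of one plausible reading of the weighting scheme: aggregate the 5 arc-minute cropping fraction up to the 0.5 degree temperature grid, convert fractions to true cropped area to account for poleward cell shrinkage, and take a crop-area-weighted mean per country. Again, the file names are hypothetical, and this is an assumption about the procedure rather than a confirmed reconstruction.

```r
library(terra)
library(sf)
library(exactextractr)

temp      <- rast("tmp_2000_mean.tif")   # 0.5 x 0.5 degree temperature
crop_frac <- rast("crop_fraction.tif")   # 5 arc-minute cropping fraction
countries <- st_read("country_bounds.shp")

# 0.5 degrees / 5 arc-minutes = 6, so a factor-6 mean aggregation aligns
# the cropping grid with the temperature grid (a resample() step may be
# needed if the two grids do not share an origin).
crop_agg <- aggregate(crop_frac, fact = 6, fun = "mean", na.rm = TRUE)

# Cells of fixed angular size cover less ground poleward; multiplying the
# fraction by true cell area (km^2) yields cropped area per cell.
crop_area <- crop_agg * cellSize(crop_agg, unit = "km")

# Crop-area-weighted mean temperature for each country.
weighted_temp <- exact_extract(temp, countries, "weighted_mean",
                               weights = crop_area)
```

Whether the authors weighted by the raw fraction or by an area-adjusted fraction is exactly the kind of detail the supplementary materials leave unstated.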

Although some of these points are matters of procedure that may have limited effect on the model inputs, differences in determining the weighted mean surface temperature and the final model specification can have profound downstream effects. I will attempt to replicate their core model with these considerations in mind. In doing so, I will walk the reader through the data processing steps required to create the core quadratic temperature model (a minimal sketch of one plausible specification follows below). At this time, I will not demonstrate their sensitivity checks, which include the addition of cumulative precipitation data, the use of alternative climate data, the inclusion of conflict data, and future predictions under various climate scenarios. I will carry out this procedure in 3 steps: 1) data acquisition and pre-processing, 2) visual exploration of the processed data, and 3) enacting the core model and diagnostics.
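To make the target concrete, here is a minimal sketch of one plausible specification of the core model, assuming a processed panel dat with one row per country-year and hypothetical columns asylum_apps, wtemp (the crop-weighted temperature), country, and year. I am not claiming this matches the authors’ exact parameterization; pinning that down is the point of the exercise. The second half addresses question 8 with a simple leave-one-year-out check.

```r
library(fixest)  # fast fixed-effects estimation

# Quadratic temperature response with country fixed effects.
m1 <- feols(log(asylum_apps) ~ wtemp + I(wtemp^2) | country, data = dat)
summary(m1)

# A rough out-of-sample diagnostic: hold out each year in turn, refit,
# and compute the RMSE of predictions for the held-out year.
years <- sort(unique(dat$year))
rmse <- sapply(years, function(yr) {
  train <- subset(dat, year != yr)
  test  <- subset(dat, year == yr)
  fit   <- feols(log(asylum_apps) ~ wtemp + I(wtemp^2) | country, data = train)
  sqrt(mean((log(test$asylum_apps) - predict(fit, newdata = test))^2,
            na.rm = TRUE))
})
round(setNames(rmse, years), 3)
```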

We’ll begin in the next section by preparing the data:

Part II: Data Processing
