Summary
- Zonal extractions of raster data are carried out in several ways depending on the tool being used.
- Some methods, unbeknownst to the user, can result in significantly compounded errors. This is especially true of meandering coastal regions.
- This tutorial walks through addressing these issues using the
exactextractr
package.
Introduction
This vignette describes the use of exactextractr
to summarize populationand elevation data from the Gridded Population of the World and EU-DEM datasets.
The exactextractr
package includes samples of both of these datasets, cropped to the extent of São Miguel; the largest and most populous island of the Azores archipelago.
This vignette uses the following packages:
library(dplyr)
library(exactextractr)
library(raster)
library(sf)
Loading the sample data
To begin, we load the population count file from GPW. This raster provides the total population in each pixel for the calendar year 2020. Then we overlay the boundaries for the six municipalities, or concelhos, into which the island is divided on the GPW grid. The population is concentrated along the coastlines with smaller inland communities located in volcanic calderas.
<- pop_count
::raster(system.file('sao_miguel/gpw_v411_2020_count_2020.tif', raster
package = 'exactextractr'))
<- sf::st_read(system.file('sao_miguel/concelhos.gpkg', concelhos
package = 'exactextractr'),
quiet = TRUE)
<-sf::st_geometry(concelhos) concelhos_geom
::plot(pop_count, axes = FALSE) sp
::plot(concelhos_geom, add = TRUE) sp
Calculating total population
Because the population count raster is cropped and contains no land area outside of São Miguel, we can calculate the total population of the island using the cellStats
function from the raster
package.
::cellStats(pop_count, 'sum') raster
## [1] 145603
Calculating population from the GPW population count file
We might also attempt to use exact_extract
with the population count raster to see how the population is divided among concelhos:
::exact_extract(pop_count, concelhos, 'sum', progress = FALSE) exactextractr
## [1] 14539.875 4149.851 66866.711 5293.968 31920.496 9093.449
The result is a vector with one entry for each feature in concelhos
. The order of the population summaries is consistent with the order of the concelhos
in the sf
data object, so we can easily assign the exact_extract
output to a new column in concelhos
.
To calculate the populations, we used fun = 'sum'
, where 'sum'
is a named summary operation recognized by exactextractr
. A full list of supported operations can be found in the function documentation for exact_extract
. If none of the named operations are suitable, we can set fun
equal to an R function such as:
function(pixel_value, coverage_fraction) sum(pixel_value * coverage_fraction)
However, named operations are generally faster than R equivalents and use less memory when using large raster or shapefiles.
To review the results more easily, we can use the append_cols
argument to copy columns from the input sf
object into the result of exact_extract
. We also use some dplyr
operations to add a column for the total population of all concelhos:
<- exactextractr::exact_extract(pop_count, concelhos, 'sum', concelho_pop
append_cols = 'name', progress = FALSE) %>%
::rename(pop = sum) %>% dplyr
::arrange(dplyr::desc(pop)) %>% dplyr
::bind_rows(dplyr::summarize(., name = 'Total', pop = sum(pop))) dplyr
This produces the following table:
name | pop |
---|---|
Ponta Delgada | 66,867 |
Ribeira Grande | 31,920 |
Lagoa | 14,540 |
Vila Franca do Campo | 9,093 |
Povoação | 5,294 |
Nordeste | 4,150 |
Total | 131,864 |
We might reasonably expect the total population to equal the value of 145,603 we previously obtained using cellStats
, but it doesn’t. In fact, 9% of the population is unaccounted for in the concelho totals.
The cause of the discrepancy can be seen by looking closely at the densely populated Ponta Delgada region on the southern coast. Many of the cells containing population are only partially covered by the concelho boundaries, so some of the total population calculated by cellStats
is missing from the totals.
Calculating population from the GPW population density file
It turns out that we need a more complex solution to acquire accurate population counts when polygons follow coastlines. Instead of using the population count raster, we bring in the population density raster, which provides the number of persons per square kilometer of land area in each pixel.
<- raster::raster(system.file('sao_miguel/gpw_v411_2020_density_2020.tif', pop_density
package = 'exactextractr'))
To get a population count, we can multiply the population density by the area of each cell that is covered by the polygon. This can be done by providing the cell areas as a weighting raster to exact_extract
and in conjunction with a custom summary function. Weighted summary functions with exact_extract
have the signature:
function(values, coverage_fractions, weights)
This can be written as follows:
<- exactextractr::exact_extract(pop_density, concelhos, concelho_pop2
function(density, frac, area) {
sum(density * frac * area)
},
weights = raster::area(pop_density),
append_cols = 'name',
progress = FALSE)
This produces the following table:
name | pop |
---|---|
Ponta Delgada | 70,982 |
Ribeira Grande | 35,935 |
Lagoa | 15,702 |
Vila Franca do Campo | 11,704 |
Povoação | 5,965 |
Nordeste | 4,513 |
Total | 144,801 |
The total population obtained using this method is remarkably close (within 0.55%) to the expected value from cellStats
.
While this solution works well for the sample data, it has a couple of disadvantages for larger data sets:
- calling
raster::area(x)
generates an in-memory raster of the same size asx
. For a raster like GPW at 30 arc-second resolution, this would consume several gigabytes of memory. - passing extracted raster values to a summary function written in R requires that
exactextractr
load all values associated with a given polygon into memory at once. This presents no problem when working with the concelho boundaries, but could cause excessive memory usage when working with large national boundaries.
Alternatively, we can use the weighted_sum
summary operation with the weights = 'area'
argument instead of an R function. This instructs exact_extract
to compute its own cell areas based on the projection of pop_density
.
::exact_extract(pop_density, concelhos, 'weighted_sum', weights = 'area') exactextractr
Population-weighted statistics
Suppose that we are interested in calculating the average elevation of a residence in each of the six concelhos. Loading the island’s EU-DEM elevation data, we can see that each concelho is, at least partly, occupied by interior mountains. This indicates that the results of a simple mean would be unrepresentative of the (primarily) coastal population.
<- raster::raster(system.file('sao_miguel/eu_dem_v11.tif', package = 'exactextractr')) elev
::plot(elev, axes = FALSE, box = FALSE) sp
::plot(concelhos_geom, add = TRUE) sp
As in the previous section, we avoid working with the population count raster to prevent population loss along the coastline. We can formulate the population-weighted average elevation in terms of population density and pixel areas as:
\[ \bar{x}_\mathrm{pop} = \frac{ \Sigma_{i=0}^n {x_ic_id_ia_i}}{\Sigma_{i=0}^n{c_id_ia_i}} \] where \(x_i\) is the population of pixel \(i\), \(c_i\) is the fraction of pixel \(i\) that is covered by a polygon, \(d_i\) is the population density of pixel \(i\), and \(a_i\) is the area of pixel \(i\).
If we are working with projected data, or geographic data over a small area like São Miguel, we can assume all pixel areas to be equivalent, in which case the \(a_i\) components cancel each other out and we are left with the direct usage of population density as a weighting raster:
::exact_extract(elev, concelhos, 'weighted_mean', weights = pop_density) exactextractr
What if pixel areas do vary across the region of our analysis?
One option is to create a scaled population count raster by multiplying the population density by the pixel area. For pixels that are partly covered by water, this inflates the pixel population such that we obtain the correct population when only the land area is covered by a polygon. This requires that we create and maintain a separate raster data set.
Another option is to create a RasterStack
of pop_density
and area(pop_density)
, and write a summary function to handle the necessary processing. We use the summarize_df = TRUE
argument to combine the elevation, population density, pixel area, and pixel coverage fraction into a single data frame that is passed to the summary function.
::exact_extract(elev, concelhos, exactextractr
function(df) {
weighted.mean(
x = df$value,
w = df$coverage_fraction * df$pop_density * df$area,
na.rm = TRUE
)
},
weights = raster::stack(list(
pop_density = pop_density,
area = raster::area(pop_density)
)),
summarize_df = TRUE,
progress = FALSE)
This solution shares the same limitations with the previous example using an R summary function with raster::area()
: we must precompute an area raster and store it in memory, and load all raster values intersecting a given polygon into memory at once.
A better solution is to use the coverage_area
argument to exact_extract
, which specifies that all calculations use the area of each cell that is covered by the polygon instead of the fraction of each cell that is covered by the polygon.
<- exactextractr::exact_extract(elev, concelho_mean_elev
concelhos,
c('mean', 'weighted_mean'),
weights = pop_density,
coverage_area = TRUE,
append_cols = 'name',
progress = FALSE)
Here we also calculate the unweighted mean for comparison. We can see that the population-weighted mean elevation is substantially lower than the mean elevation in all concelhos.
name | mean_elev | pop_weighted_mean_elev |
---|---|---|
Lagoa | 233.7098 | 76.87321 |
Nordeste | 453.8504 | 192.47522 |
Ponta Delgada | 274.4062 | 97.71867 |
Povoação | 375.4573 | 170.45435 |
Ribeira Grande | 312.0619 | 74.84953 |
Vila Franca do Campo | 418.7338 | 92.20170 |
Add new comment