Joshua BrinksISciences, LLC 
Image by Jim Black from Pixabay

Summary

  • Network analysis is often used for causal inference, but here we employ it for visualizing refugee flows.
  • Managing the de-centralized R spatial processing and mapping can be difficult.
  • There’s no unified R platform for creating geo-rectified network maps, but we demonstrate a workflow.

Introduction

Note: This vignette was updated August, 2020 for changes to the UNHCR Data API and subsequent streamlining of the untools package.

Network analysis is a powerful tool for visualizing and analyzing dyadic data. They can illuminate interesting trends and underlying causal mechanisms. Although much of network theory was born out of sociological research, it has expanded into computational and evolutionary biology, linguistics, economics, and international relations. Network analysis and visualizations are well suited for dyadic international data. This may include trade, conflict, migration flows, or a variety of dyadic data. For this vignette we will create network plots for dyadic refugee flows.

Packages and Data

When it comes to network analysis in R, there is no singular package for all your network needs. In fact, there are numerous packages and I struggled to define an efficient workflow in preparation of this tutorial. Some of the most discussed in the blogospher and forumsphere are igraph, network, ggraph, ggnet, tidygraph. At first glance, ggnet, tidygraph, or ggraph would presumably garner the benefits of tidyverse management and plotting, however, while partially true, none of them are fully mature packages with complete analytical and visual integration. Moreover, they are mostly wrappers for igraph, which appears to be the leading choice for network analysis with a slew of vignettes and blog posts scattered across the web. After some trial and error I discovered you need to leverage several packages for aesthetically pleasing geo-rectified visualizations using tile mapping services. We’ll start by loading a variety of packages and a palette.

require('data.table')
require('untools')
require('ggmap')
require('ggraph')
require('igraph')
require('tidygraph')
require('untools')
require('raster')
require('maps')
require('countrycode')
palette<-c("#266197", "#82dc59", "#a90aa1", "#6dc5dd", "#794a98", "#57ecc0", "#808080")

Preparing the Dyadic Refugee Flows

Let’s begin by acquiring the dyadic refugee flow data. You can use the getUNref() function to acquire the dataset, or download the United Nations Population Statistics Time Series data from here. For more information processing and visualizing this dataset using the untools package please refer to Exploring United Nations Refugee & Asylum Data With the untools Package. Once you’ve acquired the data and it’s in your home directory, read it into R and prepare it with the prepUNref() function. The time series dates back to 1961 and contains multiple subgroups (refugees, asylum seekers, stateless persons, internally displaced persons, etc.). For a more manageable dataset and visualizations, lets only analyze cumulative refugee flows from 2010–2019.

unhcr <- untools::getUNref()
unhcr <-
untools::prepUNref(
unhcr,
groups = 'refugees',
range = c(2010, 2019),
sum_years = TRUE
)
unhcr <- unhcr[, .(persons = sum(persons)), by = .(coo_iso, coa_iso, coo, coa)]

Verify we got what we wanted:

kable(unhcr[1:10])
coo_iso coa_iso coo coa persons
AFG AFG AFG AFG 0
IRN AFG IRN AFG 340
IRQ AFG IRQ AFG 11
PAK AFG PAK AFG 880265
EGY ALB ARE ALB 45
CHN ALB CHI ALB 107
PSE ALB GAZ ALB 61
IRQ ALB IRQ ALB 71
SRB ALB SRB ALB 573
TUR ALB TUR ALB 45

Although we have cumulative dyadic flows for the 8 years we specified, I suspect this will still be too many observations for an interpretable analysis and any visualizations will be littered flows in every direction. Let’s grab some additional information.

nrow(unhcr)
## [1] 6999
length(unique(unhcr$coo))
## [1] 207
length(unique(unhcr$coa))
## [1] 189

Nearly 7,000 flows between roughly 200 origin and destination countries will be unruly. To make the data more manageable let’s identify the top 15 origin countries, top 15 destination countries, and extract only dyadic flows involving each other. First examine the top sources of refugees in our time period.

top.coo<-unhcr[,.(persons=sum(persons)), by=coo][order(-persons)][1:15]
coo persons
SYR 37101075
AFG 26658825
SOM 10185164
SSD 9993105
MYA 6525045
IRQ 6280791
SUD 6202692
COD 5716794
CAR 3865707
ERT 3793487
SRV 3268464
COL 3113290
BDI 2312688
CHI 2027631
RWA 1808329

Not a lot of surprises with the sources of refugees since 2010. Now lets look at the destination countries that received the most refugees.

top.coa<-unhcr[,.(persons=sum(persons)), by=coa][order(-persons)][1:15]
kable(top.coa)
coa persons
TUR 18641177
PAK 15494239
IRN 9564619
LEB 7110001
UGA 6372775
GFR 6326598
ETH 5966441
JOR 5949922
KEN 4917621
SUD 4679016
BGD 4355016
CHD 4041606
CHI 3070283
COD 3042235
USA 2810439

We see that Pakistan, Turkey, Iran, Lebanon, and Jordan have received hundreds of thousands of refugees since 2010 with Pakistan and Turkey receiving more than 1 million. Now we can create a character vector of the unique countries in these two subsets and use it to filter our original data.

countries<-unique(c(as.character(top.coo$coo), as.character(top.coa$coa)))
countries
## [1] "SYR" "AFG" "SOM" "SSD" "MYA" "IRQ" "SUD" "COD" "CAR" "ERT" "SRV" "COL" "BDI" "CHI" "RWA" "TUR" "PAK"
## [18] "IRN" "LEB" "UGA" "GFR" "ETH" "JOR" "KEN" "BGD" "CHD" "USA"
unhcr<-unhcr[coo %in% countries & coa %in% countries]
nrow(unhcr)
## [1] 343
length(unique(unhcr$coo))
## [1] 27
length(unique(unhcr$coa))
## [1] 26

This will be a lot easier to work with, but 343 flows may still be a bit much for a visualization. Let’s examine the distribution of flows to potentially eliminate smaller observations

ggplot(unhcr, aes(persons))+
geom_histogram(fill=palette[1])+
labs(y='Counts',
x='Refugee Flows')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

quantile(unhcr$persons, probs=seq(0,1,.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 
## 0.0 0.0 5.0 23.8 59.4 173.0 673.0 7374.8 38499.0
## 90% 100%
## 351360.4 18342503.0

Yikes. The majority of the data is clustered near 0 along with some massive outliers. The quantile distribution exhibits massive spikes at nearly every decile after 50%. A reasonable cutoff for visualization might be 3000. Let’s view a subset of at least 3000 refugees.

ggplot(unhcr[persons>=3000], aes(persons))+
geom_histogram(fill=palette[1])+
labs(y='Counts',
x='Refugee Flows')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

nrow(unhcr[persons>3000])
## [1] 118

The distribution is still rather skewed, but the total number of observations is down to 117 and we’re focused on visualization, not modeling, so let’s carry on with a proper subset of the data.

unhcr.3k<-unhcr[persons>=3000]

Setting up the Network Objects

Network analysis is defined by edges and nodes. In our scenario, the nodes are the countries while the edges are the flows of refugees between them. Furthermore, the inputs to network functions are separated into distinct objects defining the nodes and edges. The object defining the edges is a 3 column data frame with origin, destination, and edge (refugee flow). The object defining the nodes is, at minimum, a single column data frame with a list of the nodes. The node object may also include additional columns describing the nodes. For example, they may include total population, national economic indicators, or climate statistics. Because we want to create geo-rectified plots, our node object must contain spatial information defining the country locations. This can be achieved in multiple ways, but we will use the latitude and longitude of the capital cities. We’ll establish the node and edge objects before adding the spatial information to the nodes; this simplifies the process and makes the objects easier to work with. From here on we’ll be working with just the ISO country codes; full country names are not good for visualizations.

nodes<-data.table(nodes=unique(c(as.character(unhcr.3k$coo_iso),as.character(unhcr.3k$coa_iso))))
edges<-unhcr.3k[,.(coo_iso,coa_iso,persons)]

Merging Spatial Information to the Nodes

Several R packages come pre-loaded with useful data that would otherwise take several processing steps to acquire. We will be using spatial attributes of world cities from the maps package to merge capital city latitude and longitude onto our refugee source and origin nodes. If you were interested in subsetting countries by world regions as opposed to the greatest sources and destinations for refugee flows you might want to use raster::ccodes() to acquire United Nations regional affiliations. Let’s pull in the cities data to an object, subset only the capital cities, use the countrycode package to create a new field with ISO codes, and finally merge only the spatial information with our node object using the newly established ISO3 country code.

world.cities<-data.table::setDT(maps::world.cities)
capitals<-world.cities[capital==1]
capitals[,iso3:=countrycode::countrycode(country.etc, origin = 'country.name', destination = 'iso3c')]
## Warning in countrycode::countrycode(country.etc, origin = "country.name", : Some values were not matched unambiguously: Micronesia, Netherlands Antilles, Serbia and Montenegro
nodes<-merge(nodes, capitals[,.(iso3,lat,long)], by.x='nodes', by.y='iso3', all.x=TRUE)

Not surprisingly, countrycode kicked out an error; it didn’t recognize a few countries in the capitals dataset. Thankfully they are not part of our top countries. While that is good news, a brief look at the post merged data reveals NA values for South Sudan. This is not surprising as South Sudan was established after the 2011 Sudanese Civil War.

nodes[nodes=='SSD']
nodes lat long
SSD NA NA

The capital city of South Sudan is Juda, which is listed under Sudan, not South Sudan, in the world.cities dataset. We can fill in the values ad-hoc.

nodes[nodes=='SSD', lat:=world.cities[name=='Juba',lat]][nodes=='SSD', long:=world.cities[name=='Juba',long]]

Creating the Network Object

In my experience, tidygraph, ggraph, and ggmap are sensitive to the naming structures present in the node and edge objects. I’m not certain of the specific mechanisms, but through trial and error I discovered the correct conventions for the desired output. The objects should be named edges and nodes, the lat/long coordinates should be renamed y/x, and the flows or edges should be named weights. Let’s rename the relevant variables.

names(edges)[3]<-'weight'
names(nodes)[2:3]<-c('y','x')

Now we can use tbl_graph() to set up our network and ggraph() for an initial look at the network.

flows<-tidygraph::tbl_graph(nodes=nodes, edges = edges, directed = TRUE)
plot(flows)

The first view of our network; not very illuminating. There are additional options to spruce this up, but for now we’ll move on to place it in geo-rectified space. Next we’ll convert our tidygraph object created by tbl_graph() into an object capable of being plotted on a map using ggraph().

gg<-ggraph(flows, layout='nicely')
## Warning: Existing variables `x`, `y` overwritten by layout variables

Acquiring the Basemap

Next we need to establish the bounding box for the basemap and pad it a bit so the capital city is not directly on the edge of the map. Then we’ll use those coordinates to acquire the basemap from the tile server

Putting It All Together

Now we’ll overlay our network object onto the basemap along with points for the capital cities using ggmap().

network.map<-ggmap(basemap, base_layer = gg) +
geom_node_point() +
geom_edge_fan()

network.map

It worked! It looks terrible! We’ll modernize this with a slew of additional options.

network.map<-ggmap(basemap, base_layer = gg) +
geom_node_point(size = 1.5, show.legend = F, alpha=0.75, colour=palette[6]) +
geom_edge_arc(colour=palette[1],
curvature = 0.15,
lineend = 'butt',
aes(width = log(weight), alpha = log(weight)),
arrow = arrow(length = unit(1.5, 'mm'), type='closed', angle = 25),
end_cap = circle(3, 'mm'),
start_cap = circle(3, 'mm'),
show.legend = F)+
scale_edge_width(range = c(0.1, 1)) +
scale_edge_alpha(range = c(0.01, 1)) +
labs(#title = 'Dyadic Refugee Flows During 2010 - 2019',
#subtitle = 'For the Top 15 Origin and Destination Countries',
y="",
x="")+
theme_graph()+
theme(axis.ticks=element_blank(),legend.position = 'bottom',
panel.border = element_rect(colour = palette[7], fill=NA, size=2))
## Warning: The curvature argument has been deprecated in favour of strength
network.map

Looks much better. It’s small in the browser but you can save the map to a vectorized format to view it in all of its glory.

ggsave(network.map, file='ref_network.png', dpi=600)

Add new comment

Plain text

  • Allowed HTML tags: <a href hreflang> <em> <strong> <cite> <blockquote cite> <code> <ul type> <ol start type> <li> <dl> <dt> <dd>