Summary
- Network analysis is often used for causal inference, but here we employ it for visualizing refugee flows.
- Managing the de-centralized R spatial processing and mapping can be difficult.
- There’s no unified R platform for creating geo-rectified network maps, but we demonstrate a workflow.
Introduction
Note: This vignette was updated August, 2020 for changes to the UNHCR Data API and subsequent streamlining of the untools
package.
Network analysis is a powerful tool for visualizing and analyzing dyadic data. They can illuminate interesting trends and underlying causal mechanisms. Although much of network theory was born out of sociological research, it has expanded into computational and evolutionary biology, linguistics, economics, and international relations. Network analysis and visualizations are well suited for dyadic international data. This may include trade, conflict, migration flows, or a variety of dyadic data. For this vignette we will create network plots for dyadic refugee flows.
Packages and Data
When it comes to network analysis in R, there is no singular package for all your network needs. In fact, there are numerous packages and I struggled to define an efficient workflow in preparation of this tutorial. Some of the most discussed in the blogospher and forumsphere are igraph, network, ggraph, ggnet, tidygraph. At first glance, ggnet, tidygraph, or ggraph would presumably garner the benefits of tidyverse management and plotting, however, while partially true, none of them are fully mature packages with complete analytical and visual integration. Moreover, they are mostly wrappers for igraph, which appears to be the leading choice for network analysis with a slew of vignettes and blog posts scattered across the web. After some trial and error I discovered you need to leverage several packages for aesthetically pleasing geo-rectified visualizations using tile mapping services. We’ll start by loading a variety of packages and a palette.
require('data.table')
require('untools')
require('ggmap')
require('ggraph')
require('igraph')
require('tidygraph')
require('untools')
require('raster')
require('maps')
require('countrycode')
<-c("#266197", "#82dc59", "#a90aa1", "#6dc5dd", "#794a98", "#57ecc0", "#808080") palette
Preparing the Dyadic Refugee Flows
Let’s begin by acquiring the dyadic refugee flow data. You can use the getUNref()
function to acquire the dataset, or download the United Nations Population Statistics Time Series data from here. For more information processing and visualizing this dataset using the untools package please refer to Exploring United Nations Refugee & Asylum Data With the untools Package. Once you’ve acquired the data and it’s in your home directory, read it into R and prepare it with the prepUNref()
function. The time series dates back to 1961 and contains multiple subgroups (refugees, asylum seekers, stateless persons, internally displaced persons, etc.). For a more manageable dataset and visualizations, lets only analyze cumulative refugee flows from 2010–2019.
<- untools::getUNref() unhcr
<- unhcr
::prepUNref( untools
unhcr,
groups = 'refugees',
range = c(2010, 2019),
sum_years = TRUE
)
<- unhcr[, .(persons = sum(persons)), by = .(coo_iso, coa_iso, coo, coa)] unhcr
Verify we got what we wanted:
kable(unhcr[1:10])
coo_iso | coa_iso | coo | coa | persons |
---|---|---|---|---|
AFG | AFG | AFG | AFG | 0 |
IRN | AFG | IRN | AFG | 340 |
IRQ | AFG | IRQ | AFG | 11 |
PAK | AFG | PAK | AFG | 880265 |
EGY | ALB | ARE | ALB | 45 |
CHN | ALB | CHI | ALB | 107 |
PSE | ALB | GAZ | ALB | 61 |
IRQ | ALB | IRQ | ALB | 71 |
SRB | ALB | SRB | ALB | 573 |
TUR | ALB | TUR | ALB | 45 |
Although we have cumulative dyadic flows for the 8 years we specified, I suspect this will still be too many observations for an interpretable analysis and any visualizations will be littered flows in every direction. Let’s grab some additional information.
nrow(unhcr)
## [1] 6999
length(unique(unhcr$coo))
## [1] 207
length(unique(unhcr$coa))
## [1] 189
Nearly 7,000 flows between roughly 200 origin and destination countries will be unruly. To make the data more manageable let’s identify the top 15 origin countries, top 15 destination countries, and extract only dyadic flows involving each other. First examine the top sources of refugees in our time period.
<-unhcr[,.(persons=sum(persons)), by=coo][order(-persons)][1:15] top.coo
coo | persons |
---|---|
SYR | 37101075 |
AFG | 26658825 |
SOM | 10185164 |
SSD | 9993105 |
MYA | 6525045 |
IRQ | 6280791 |
SUD | 6202692 |
COD | 5716794 |
CAR | 3865707 |
ERT | 3793487 |
SRV | 3268464 |
COL | 3113290 |
BDI | 2312688 |
CHI | 2027631 |
RWA | 1808329 |
Not a lot of surprises with the sources of refugees since 2010. Now lets look at the destination countries that received the most refugees.
<-unhcr[,.(persons=sum(persons)), by=coa][order(-persons)][1:15] top.coa
kable(top.coa)
coa | persons |
---|---|
TUR | 18641177 |
PAK | 15494239 |
IRN | 9564619 |
LEB | 7110001 |
UGA | 6372775 |
GFR | 6326598 |
ETH | 5966441 |
JOR | 5949922 |
KEN | 4917621 |
SUD | 4679016 |
BGD | 4355016 |
CHD | 4041606 |
CHI | 3070283 |
COD | 3042235 |
USA | 2810439 |
We see that Pakistan, Turkey, Iran, Lebanon, and Jordan have received hundreds of thousands of refugees since 2010 with Pakistan and Turkey receiving more than 1 million. Now we can create a character vector of the unique countries in these two subsets and use it to filter our original data.
<-unique(c(as.character(top.coo$coo), as.character(top.coa$coa))) countries
countries
## [1] "SYR" "AFG" "SOM" "SSD" "MYA" "IRQ" "SUD" "COD" "CAR" "ERT" "SRV" "COL" "BDI" "CHI" "RWA" "TUR" "PAK"
## [18] "IRN" "LEB" "UGA" "GFR" "ETH" "JOR" "KEN" "BGD" "CHD" "USA"
<-unhcr[coo %in% countries & coa %in% countries] unhcr
nrow(unhcr)
## [1] 343
length(unique(unhcr$coo))
## [1] 27
length(unique(unhcr$coa))
## [1] 26
This will be a lot easier to work with, but 343 flows may still be a bit much for a visualization. Let’s examine the distribution of flows to potentially eliminate smaller observations
ggplot(unhcr, aes(persons))+
geom_histogram(fill=palette[1])+
labs(y='Counts',
x='Refugee Flows')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantile(unhcr$persons, probs=seq(0,1,.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80%
## 0.0 0.0 5.0 23.8 59.4 173.0 673.0 7374.8 38499.0
## 90% 100%
## 351360.4 18342503.0
Yikes. The majority of the data is clustered near 0 along with some massive outliers. The quantile distribution exhibits massive spikes at nearly every decile after 50%. A reasonable cutoff for visualization might be 3000. Let’s view a subset of at least 3000 refugees.
ggplot(unhcr[persons>=3000], aes(persons))+
geom_histogram(fill=palette[1])+
labs(y='Counts',
x='Refugee Flows')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
nrow(unhcr[persons>3000])
## [1] 118
The distribution is still rather skewed, but the total number of observations is down to 117 and we’re focused on visualization, not modeling, so let’s carry on with a proper subset of the data.
.3k<-unhcr[persons>=3000] unhcr
Setting up the Network Objects
Network analysis is defined by edges and nodes. In our scenario, the nodes are the countries while the edges are the flows of refugees between them. Furthermore, the inputs to network functions are separated into distinct objects defining the nodes and edges. The object defining the edges is a 3 column data frame with origin, destination, and edge (refugee flow). The object defining the nodes is, at minimum, a single column data frame with a list of the nodes. The node object may also include additional columns describing the nodes. For example, they may include total population, national economic indicators, or climate statistics. Because we want to create geo-rectified plots, our node object must contain spatial information defining the country locations. This can be achieved in multiple ways, but we will use the latitude and longitude of the capital cities. We’ll establish the node and edge objects before adding the spatial information to the nodes; this simplifies the process and makes the objects easier to work with. From here on we’ll be working with just the ISO country codes; full country names are not good for visualizations.
<-data.table(nodes=unique(c(as.character(unhcr.3k$coo_iso),as.character(unhcr.3k$coa_iso)))) nodes
<-unhcr.3k[,.(coo_iso,coa_iso,persons)] edges
Merging Spatial Information to the Nodes
Several R packages come pre-loaded with useful data that would otherwise take several processing steps to acquire. We will be using spatial attributes of world cities from the maps package to merge capital city latitude and longitude onto our refugee source and origin nodes. If you were interested in subsetting countries by world regions as opposed to the greatest sources and destinations for refugee flows you might want to use raster::ccodes()
to acquire United Nations regional affiliations. Let’s pull in the cities data to an object, subset only the capital cities, use the countrycode package to create a new field with ISO codes, and finally merge only the spatial information with our node object using the newly established ISO3 country code.
<-data.table::setDT(maps::world.cities) world.cities
<-world.cities[capital==1] capitals
:=countrycode::countrycode(country.etc, origin = 'country.name', destination = 'iso3c')] capitals[,iso3
## Warning in countrycode::countrycode(country.etc, origin = "country.name", : Some values were not matched unambiguously: Micronesia, Netherlands Antilles, Serbia and Montenegro
<-merge(nodes, capitals[,.(iso3,lat,long)], by.x='nodes', by.y='iso3', all.x=TRUE) nodes
Not surprisingly, countrycode kicked out an error; it didn’t recognize a few countries in the capitals dataset. Thankfully they are not part of our top countries. While that is good news, a brief look at the post merged data reveals NA
values for South Sudan. This is not surprising as South Sudan was established after the 2011 Sudanese Civil War.
=='SSD'] nodes[nodes
nodes | lat | long |
---|---|---|
SSD | NA | NA |
The capital city of South Sudan is Juda, which is listed under Sudan, not South Sudan, in the world.cities
dataset. We can fill in the values ad-hoc.
=='SSD', lat:=world.cities[name=='Juba',lat]][nodes=='SSD', long:=world.cities[name=='Juba',long]] nodes[nodes
Creating the Network Object
In my experience, tidygraph
, ggraph
, and ggmap
are sensitive to the naming structures present in the node and edge objects. I’m not certain of the specific mechanisms, but through trial and error I discovered the correct conventions for the desired output. The objects should be named edges
and nodes
, the lat/long coordinates should be renamed y/x, and the flows or edges should be named weights. Let’s rename the relevant variables.
names(edges)[3]<-'weight'
names(nodes)[2:3]<-c('y','x')
Now we can use tbl_graph()
to set up our network and ggraph()
for an initial look at the network.
<-tidygraph::tbl_graph(nodes=nodes, edges = edges, directed = TRUE) flows
plot(flows)
The first view of our network; not very illuminating. There are additional options to spruce this up, but for now we’ll move on to place it in geo-rectified space. Next we’ll convert our tidygraph object created by tbl_graph()
into an object capable of being plotted on a map using ggraph()
.
<-ggraph(flows, layout='nicely') gg
## Warning: Existing variables `x`, `y` overwritten by layout variables
Acquiring the Basemap
Next we need to establish the bounding box for the basemap and pad it a bit so the capital city is not directly on the edge of the map. Then we’ll use those coordinates to acquire the basemap from the tile server
Putting It All Together
Now we’ll overlay our network object onto the basemap along with points for the capital cities using ggmap()
.
<-ggmap(basemap, base_layer = gg) + network.map
geom_node_point() +
geom_edge_fan()
network.map
It worked! It looks terrible! We’ll modernize this with a slew of additional options.
<-ggmap(basemap, base_layer = gg) + network.map
geom_node_point(size = 1.5, show.legend = F, alpha=0.75, colour=palette[6]) +
geom_edge_arc(colour=palette[1],
curvature = 0.15,
lineend = 'butt',
aes(width = log(weight), alpha = log(weight)),
arrow = arrow(length = unit(1.5, 'mm'), type='closed', angle = 25),
end_cap = circle(3, 'mm'),
start_cap = circle(3, 'mm'),
show.legend = F)+
scale_edge_width(range = c(0.1, 1)) +
scale_edge_alpha(range = c(0.01, 1)) +
labs(#title = 'Dyadic Refugee Flows During 2010 - 2019',
#subtitle = 'For the Top 15 Origin and Destination Countries',
y="",
x="")+
theme_graph()+
theme(axis.ticks=element_blank(),legend.position = 'bottom',
panel.border = element_rect(colour = palette[7], fill=NA, size=2))
## Warning: The curvature argument has been deprecated in favour of strength
network.map
Looks much better. It’s small in the browser but you can save the map to a vectorized format to view it in all of its glory.
ggsave(network.map, file='ref_network.png', dpi=600)
Add new comment