Input data preparation (MADRaT)
Level: ⚫⚫⚫⚫⚪ Advanced
Requirements
Basic knowledge of R programming language
Install madrat and mrtutorial R packages
Fork and clone https://github.com/pik-piam/mrtutorial
Content
Introduction to the MADRaT framework
magclass object functionality
MADRaT-based input data preprocessing
Portable Unaggregated Collections (PUCs) in MADRaT
Overview
MADRaT and mrtutorial installation
In this exercise, we will dig deeper into magclass objects and the MADRaT framework. Libraries organized through this framework do the bulk of the processing of the data that goes in and comes out of the MAgPIE model, and are standardized for consistency. For this exercise, we will work with the mrtutorial package as an example of the mr- package structure.
First, please install and load the madrat package. The installation may prompt you to set a main folder for madrat data. Name this folder ‘/inputdata/’, and put it somewhere easily accessible, such as “C:/PIK/inputdata”.
install.packages("madrat")
library(madrat)
getConfig()
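If the installation does not prompt for a main folder, it can also be set explicitly. The snippet below is a small sketch that assumes the example path from the text above; adjust it to your own system.

```r
library(madrat)

# Point madrat to the main data folder (example path from the text above).
setConfig(mainfolder = "C:/PIK/inputdata")

# Check which main folder is currently configured.
getConfig("mainfolder")
```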
We’ll also work with the mrtutorial package to show how the input data processing pipeline works for feeding into a model such as MAgPIE.
install.packages("mrtutorial")
library(mrtutorial)
Furthermore, we will look into the source code of mrtutorial, so please 1) fork the repository and 2) clone your own fork locally from https://github.com/pik-piam/mrtutorial
Let’s open the package folder. The one important thing to note is that the package has the madrat.R file manually placed in the mrtutorial/R/ folder.
This links the package with the MADRaT framework, and one can also create new MADRaT-based libraries by placing this file in the package’s respective R folder.
MADRaT Functions - Downloading, Reading, Calculating Model Inputs
We will look closely into the workflow of processing new data sources to be ready for use in the MAgPIE model. MADRaT splits this workflow into download, read, correct, convert, and calculate steps, each of which has a specific function wrapper.
Download function
Please open the downloadTutorialWDI R script. Note that this download script requires the WDI package.
install.packages("WDI")
downloadSource("TutorialWDI")
The folder ‘sources’ should now have been created in the ‘inputdata’ folder, with the WDI source folder in this directory. Metadata on where the data was obtained, how it was downloaded, etc. should also be documented in the download script. For another example, please see downloadTau in the MADRaT package.
Note that if a direct download is not possible, data files can be placed manually in the inputdata/sources folder. This is not the preferred implementation, but in this case a download function is not necessary. The name of the source folder and the name of the read function must match.
Note again that although the function is itself named downloadTutorialWDI, we call it via the downloadSource() wrapper.
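The naming convention behind the wrappers can be illustrated with a few lines of base R. The helper madratFunctionName() below is hypothetical and not part of madrat; it only shows how a step prefix and a type name combine into the name of the function that a wrapper looks for.

```r
# Hypothetical helper for illustration only; madrat does not export this.
# It builds the name of the function that belongs to a workflow step.
madratFunctionName <- function(step, type) {
  step <- match.arg(step, c("download", "read", "correct", "convert", "calc"))
  paste0(step, type)
}

madratFunctionName("download", "TutorialWDI")  # function behind downloadSource("TutorialWDI")
madratFunctionName("read", "TutorialWDI")      # function behind readSource("TutorialWDI")
madratFunctionName("calc", "AgGDP")            # function behind calcOutput("AgGDP")
```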
Read function
Read functions are the first step in transforming input data into magclass objects. They should be as simple as possible, with most steps of data cleaning, filling in, and transforming reserved for the convert and correct functions. The read function should be able to distinguish between indicators (subtypes).
Remember that magclass objects are arrays with spatial information in the first dimension, temporal information in the second dimension, and data values in the third dimension(s) (3.1, 3.2…).
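As a reminder of this structure, the following sketch builds a small toy magclass object and inspects its three dimensions. All country codes, years, and values are made up for illustration.

```r
library(magclass)

# Toy object: 2 countries x 2 years x 2 data items (values are placeholders).
x <- new.magpie(cells_and_regions = c("DEU", "FRA"),
                years = c(2000, 2010),
                names = c("population", "gdp"),
                fill = 1)

dim(x)                 # sizes of the spatial, temporal, and data dimensions
getItems(x, dim = 1)   # items of the spatial dimension
getItems(x, dim = 2)   # items of the temporal dimension
getItems(x, dim = 3)   # items of the data dimension
```

The same calls work on any magclass object, including those returned by readSource().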
Now let’s look at readTutorialWDI.R.
Note: WDI indicator names contain full stops (e.g. SP.POP.TOTL). Because MADRaT internally uses “.” as the separator between subdimensions, such names would be interpreted as multiple subdimensions. Full stops in data names should therefore be avoided so that names and dimensions are not confused; we rename the indicator at line 39 of the script.
Run this function:
#this script requires dplyr and tidyr packages, so install these if not already available:
install.packages(c("dplyr", "tidyr"))
pop_no_conv <- readSource("TutorialWDI", subtype="SP.POP.TOTL", convert=FALSE)
Again, readTutorialWDI is called here via readSource() and not directly. How can we see the number of countries, years, and data columns that the pop_no_conv object has?
Convert Function
The convert function completes the magclass object: for country-level MAgPIE input, all 249 countries represented in MAgPIE need to be present, identified by their ISO3 country codes. toolCountryFill(), called in the convert function, adds missing countries and also removes any country that it cannot match.
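The effect of toolCountryFill() can be tried on a toy object without downloading any data. In this sketch the input has two valid ISO3 codes plus one made-up code ("XXX"); the fill value of 0 is an arbitrary choice for illustration.

```r
library(madrat)
library(magclass)

# Toy input: two valid ISO3 codes and one invented code that cannot be matched.
x <- new.magpie(cells_and_regions = c("DEU", "FRA", "XXX"),
                years = 2010, names = "population", fill = 1)

# Add all missing countries (filled with 0 here) and drop the unmatched entry.
# The removal normally triggers a warning, which is suppressed in this sketch.
filled <- suppressWarnings(toolCountryFill(x, fill = 0))

length(getItems(filled, dim = 1))      # number of countries after filling
"XXX" %in% getItems(filled, dim = 1)   # whether the invented code survived
```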
Note that we have omitted a correct function here. The correct step is off by default, but can be run by setting convert = "onlycorrect" in readSource().
Please open convertTutorialWDI.R
We can run the read and convert script together by setting ‘convert = TRUE’.
pop_conv <- readSource("TutorialWDI", subtype = "SP.POP.TOTL", convert = TRUE)
What are some of the main differences between the 2 pop objects we have created? Either in terms of data structure, or the values themselves?
calcOutput
Magclass objects are consistent in structure, making calculations easy. The calcOutput() wrapper calls the functions used to transform input data. It is called as calcOutput("Type", otherArguments, …), where the type is the function name without the calc prefix.
Note! During calculations with magclass objects, if the dimensions of the two objects do not match, the operation expands the output by combining every item of one object with every item of the other (a full cross product across the dimensions) instead of matching items.
For example, given magclass object A with regions ‘Greece’, ‘Italy’ and magclass object B with regions ‘Italy’, ‘Poland’,
A * B will give an object with regions ‘Greece.Italy’, ‘Greece.Poland’, ‘Italy.Italy’, ‘Italy.Poland’.
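The expansion in the example above can be mimicked in plain base R. This is only an analogy using named vectors, not magclass code, and the numbers are invented; it shows how the number of items grows and how restricting both objects to their common items first avoids the expansion.

```r
# Named vectors standing in for two objects with different regions.
a <- c(Greece = 1, Italy = 2)
b <- c(Italy = 10, Poland = 20)

# Full cross product: every region of a combined with every region of b.
values <- outer(a, b)
labels <- outer(names(a), names(b), paste, sep = ".")
expanded <- setNames(as.vector(t(values)), as.vector(t(labels)))
expanded

# Restricting both inputs to their common regions keeps a one-to-one match.
common <- intersect(names(a), names(b))
matched <- a[common] * b[common]
matched
```

For magclass objects, the same precaution would be to subset both objects to a common set of items before combining them.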
Let’s open calcAgGDP, which calculates agricultural GDP as a share of total GDP,
and run two calls of this function, again using the calcOutput() wrapper.
ag_gdp_agg <- calcOutput("AgGDP")
ag_gdp <- calcOutput("AgGDP", aggregate = FALSE)
By default, calcOutput functions aggregate to the regional level; aggregate = FALSE is required to keep the original spatial resolution.
The region mapping can also be changed in setConfig().
Note that calc functions return a list of objects. x is the main magclass object to be returned. Since the aggregation of non-absolute values (e.g. shares) generally requires weighting, a weight can be specified in the list returned by the calc function itself. unit and description are important outputs for proper documentation of the function and object.
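To illustrate the returned list and why a weight matters, here is a minimal sketch of a calc-style function. The function calcToyAgShare() and all its numbers are hypothetical; the manual aggregation at the end only mimics what a weighted regional aggregation does and is not madrat's own implementation.

```r
library(magclass)

# Hypothetical calc-style function; names and values are invented.
calcToyAgShare <- function() {
  countries <- c("DEU", "FRA")
  share <- new.magpie(countries, 2010, fill = 0)   # agricultural share of GDP
  share["DEU", , ] <- 0.01
  share["FRA", , ] <- 0.02
  gdp <- new.magpie(countries, 2010, fill = 0)     # total GDP, used as weight
  gdp["DEU", , ] <- 3000
  gdp["FRA", , ] <- 2500
  list(x = share,
       weight = gdp,
       unit = "share of GDP",
       description = "Toy agricultural share of total GDP")
}

out <- calcToyAgShare()

# A share must be aggregated with weights; a plain mean gives a different result.
weighted <- dimSums(out$x * out$weight, dim = 1) / dimSums(out$weight, dim = 1)
unweighted <- dimSums(out$x, dim = 1) / 2
```

Comparing weighted and unweighted shows how the larger economy dominates the aggregated share once GDP is used as the weight.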
Now try running some of the same functions again. You may notice that the functions now run by loading a cache.
Cache files are stored as .Rds files in ./inputdata/cache_folder after the first time they are run, meaning time-consuming functions do not need to be re-run from scratch.
The caching functionality also detects any changes to the function content and/or arguments, the source data, or any mappings called within the function. In such cases, the function is re-run and a new cache file for the new settings is created. Caching behavior can be adjusted via setConfig(), e.g. setConfig(forcecache = TRUE) to force the use of existing cache files.
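The idea of invalidating a cache when the function code or its arguments change can be sketched in base R. This is a conceptual illustration only and not how madrat implements caching: the helpers fingerprint() and cachedCall() are hypothetical, the cache lives in memory rather than in .Rds files, and changes to source data or mappings are not tracked.

```r
# In-memory cache; madrat itself stores cache files on disk.
cache <- new.env()

# Hypothetical helper: hash the function code together with its arguments.
fingerprint <- function(fun, args) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(c(deparse(fun), deparse(args)), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: re-use a stored result if the fingerprint is known.
cachedCall <- function(fun, args = list()) {
  key <- fingerprint(fun, args)
  if (!is.null(cache[[key]])) {
    message("loading from cache")
    return(cache[[key]])
  }
  message("computing")
  result <- do.call(fun, args)
  cache[[key]] <- result
  result
}

slowSquare <- function(x) x^2
r1 <- cachedCall(slowSquare, list(x = 3))   # computed
r2 <- cachedCall(slowSquare, list(x = 3))   # same code and arguments: cached
r3 <- cachedCall(slowSquare, list(x = 4))   # new argument: computed again

slowSquare <- function(x) x * x             # changed function content
r4 <- cachedCall(slowSquare, list(x = 3))   # new fingerprint: computed again
```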
Exercises:
- What is the agricultural GDP of Germany? (Germany’s ISO3 code is DEU)
- Using the functions we have seen, calculate the amount of agricultural GDP generated by each person employed in agriculture for each country, assuming globally an employment-to-population ratio of 75%.
Model Preprocessing
fullMODEL() functions and retrieveData() wrapper
Data from mr-libraries are aggregated and bundled together into a .tgz file to be used as model input. This is done via the “full” functions, which call all the calcOutput() functions that need to be included.
For MAgPIE these are 3 separate calls for regional, cellular, and validation data: fullMAGPIE(), fullCELLULARMAGPIE(), and fullVALIDATION().
“fullMODEL” functions are called via the retrieveData() wrapper, and create a .tgz file containing the processed (aggregated) function outputs in the /inputdata/output/ folder.
These files are aggregated to the regional level in the case of fullMAGPIE() and the cluster level in the case of fullCELLULARMAGPIE().
The dev argument allows for running development-phase preprocessings, with the contents of the dev argument appended to the name of the .tgz output.
If the dev argument is used, the PUC is by default not created.
retrieveData(model = "tutorial", rev=1, dev = "")
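The “full” function pattern described above could be sketched as follows. The function fullTOY() and the file name are hypothetical and only illustrate the structure; the real functions in mrtutorial may look different, and such a function would normally live in an mr- package rather than in a script.

```r
library(madrat)

# Hypothetical "full" function for an imaginary model called "toy".
# It bundles the calcOutput() calls whose results should end up in the .tgz file.
fullTOY <- function(rev = 0, dev = "") {
  # Each call computes one input and writes it to a file in the output bundle.
  # calcAgGDP comes from mrtutorial, which must be loaded when this is run.
  calcOutput("AgGDP", file = "toy_ag_gdp.cs4")
}
```

Once available in a package, such a function would be triggered via retrieveData() with the matching model name rather than being called directly.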
Portable Unaggregated Collections (PUCs)
The retrieveData() function also creates a Portable Unaggregated Collection (PUC), an archive of the function outputs at their original resolution, in the /inputdata/puc/ folder. This makes it easier to share processed input data, which the user can then aggregate as they prefer using the function pucAggregate().
Extra arguments can be specified to give additional instructions to the preprocessing:
In the case of MAgPIE regional PUCs, the main argument to be changed is the regionmapping.
For cellular data, ctype (the number of clusters) and clusterweight (weighting for clusters) can additionally be changed.
PUCs thus allow the user to locally process data flexibly for any region mapping, and number and weight of clusters.
Exercise: Recalculate a tgz file with a different regional aggregation, using the OECD region mapping found in folder mrtutorial/inst/extdata and the .puc file we have already created.
fullVALIDATION
As stated, retrieveData() also serves to create data for validation, via the shinyresults application or otherwise. In mrtutorial, the example case is fullTUTORIALVALIDATION. This creates a .tgz that contains the validation.mif file, which can be used as input into shinyresults::appResults(), as shown in Tutorial 7.
retrieveData(model = "tutorialvalidation", rev=1, dev = "")
Exercise solutions
1. What is the agricultural GDP of Germany? (Germany's ISO3 code is DEU)
ag_gdp["DEU", 2010, ]
2. Calculate the amount of agricultural GDP generated by each person employed in agriculture for each country, assuming globally an employment-to-population ratio of 75%.
ag_employment <- readSource("TutorialWDI", subtype = "SL.AGR.EMPL.ZS", convert = TRUE)
common_years <- intersect(getItems(ag_employment, dim = 2), getItems(pop_conv, dim = 2))
# people employed in agriculture = population * employment ratio * agricultural employment share
# (if the share is stored in percent rather than as a fraction, additionally divide by 100)
ag_pop <- pop_conv[, common_years, ] * 0.75 * ag_employment[, common_years, ]
# GDP per person employed in agriculture: divide GDP by the number of people
ag_gdp_per_ag_capita <- ag_gdp[, common_years, ] / ag_pop
3. Recalculate a tgz file with a different regional aggregation, using the OECD region mapping found in folder mrtutorial/inst/extdata and the .puc file we have already created.
pucAggregate(puc = 'C:/PIK/inputdata/puc/rev1_extra_tutorial_tag.puc', regionmapping = 'regionmappingOECD.csv')