%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#5f7991",
"edgeLabelBackground":"#ffffff",
"lineColor": "#e40000",
"textColor": "#000000",
"fontSize": "26px"
},
"flowchart": {"curve": "linear"}
}}%%
flowchart LR
%% Stakeholders
subgraph STAKE["Stakeholders"]
direction TB
R[Researcher]
C[CompNet / MDI Network]
S[Statistical<br/>Institute]
R -- "Research project and payload" --> C
C -- "Code to build the infrastructure" --> S
S -- "Metadata preparation" --> C
C -- "Metadata and tools" --> R
end
%% Remote environment
subgraph REMOTE["Environment"]
direction TB
RA["Remote access (AT FR GB NL SI)"]
MP["MDI partner (FI IT)"]
RE["Remote execution (PT DE)"]
RA <--> MP <--> RE
end
%% Outcomes
subgraph OUT["Outcomes"]
direction TB
O1[Special research and publication]
O2[Standard moments and indicators - publication]
O1 <-- "Output is obtained by CompNet/MDI Network" --> O2
end
%% Rocket
Ro[🚀 **Rocket** 🚀]
%% Graph flow
STAKE --> Ro
STAKE -- "Obtains the output" --> OUT
Ro --> REMOTE
REMOTE --> OUT
%% Style
classDef remote stroke:#e40000,stroke-width:2px;
class RA,MP,RE remote;
classDef output fill:#5f7991,color:#ffffff;
class O1,O2 output;
classDef whitebg fill:#ffffff,stroke:#000000,color:#000000;
class STAKE,REMOTE,OUT whitebg;
MDI Manual
Comprehensive guidance for everyone who builds and uses the Micro Data Infrastructure
1 Introduction to MDI
This user guide provides users of the Micro Data Infrastructure (MDI) with all the information they need to conduct research using the MDI, to set up the MDI in a new country, or to develop the infrastructure further.
- If you are interested in running your research on the MDI, do not miss the section Using MDI.
- If you have access to a given country's firm-level data and need to build the MDI, do not miss the section Setting up the MDI.
- If you work on Nuvolos creating mock data, do not miss the Nuvolos-related sections.
1.1 What
The Microdata Infrastructure (MDI) is a platform for cross-country microdata access, developed by CompNet in collaboration with European National Statistical Institutes (NSIs), National Statistical Systems (NSS) and other partners.
The MDI began in 2018, as described in “Creating an EU-wide Micro Data Infrastructure (MDI): a handbook for Micro-Data Linking”. Since then, pilots evolved into a maintained infrastructure that is periodically launched at NSIs. This manual provides operational guidance for current and future MDI launches, building the infrastructure in new countries and defining a medium-term horizon of continuous improvement within each 3–6 month deployment cycle.
The MDI is designed with a dual objective: to harmonize firm-level data across countries and to streamline the research process for conducting cross-country analyses on a wide range of topics.
At its core, MDI provides a standardized environment that enables researchers to perform identical analyses across multiple countries. It ensures microdata comparability and accessibility within a unified framework. The infrastructure supports functions ranging from data importation and harmonization to advanced analytical outputs, all within a secure environment that safeguards data confidentiality. In a nutshell, MDI does the following:
Raw data \(\rightarrow\) Data harmonization \(\rightarrow\) Comparable cross-country microdata
Raw data: refers to the task of compiling all available datasets and variables from each NSI into a detailed metadata inventory.
Data harmonization: refers to the entire process of constructing variables that are comparable across countries. This involves establishing a standardized set of variable names and definitions (MD metadata). Based on this standard, the raw data from each NSI is used to generate corresponding variables and files aligned with the MD metadata. The harmonization process also includes creating concordance tables to standardize categorical codes.
Comparable cross-country microdata: refers to the tools and guidelines provided to researchers for effective data use. This includes best practices for writing research code (module) and the provision of mock data (designed to replicate the structure of the real data) for testing purposes.
1.2 Who
The MDI is a joint initiative by CompNet, National Statistical Institutes (NSIs), and other partners. CompNet staff lead the technical maintenance and development of the infrastructure, and provide training and guidelines on how to use it. Together with NSIs and partners, they access firm-level data across countries and operate the MDI infrastructure to generate research outputs. Please see below the MDI stakeholders and process:
Note: The diagram shows stakeholder roles, execution environments, and outputs. Rocket is the codebase deployed at NSIs. Two access models exist: direct remote access and indirect remote execution. Arrows indicate code, metadata, and output flows. All outputs are subject to NSI disclosure control before publication.
- National Statistical Institutes (NSIs) and other Partners
- NSI remote execution
- NSI remote access
- Partners with country-specific (remote) access
NSIs provide the underlying data and support either remote execution or access to confidential firm-level data. While legal access rules, data availability, and technical infrastructure vary across countries, NSIs form the backbone of the standardized MDI research environment.
- Module writers (MDI users)
- Productivity Boards
- External Academic and Policy
- MDI ‘Theme’ research staff
MDI users include productivity boards, external researchers, and thematic research staff. They are responsible for designing research modules that harness MDI’s infrastructure for cross-country analysis.
- MDI staff
- Country specialists
- Thematic research personnel
- Infrastructure support teams
MDI staff ensure the effective development and operation of the MDI environment. They support NSIs with data preparation and documentation, and assist module writers by providing expertise on data, tools, and research themes.
1.3 How
The MDI infrastructure is a continuously evolving codebase, known as Rocket, that is periodically deployed within the secure environments of NSIs. Its main function is to process and harmonize raw data, execute research code (modules), and export results, all while strictly complying with NSIs' disclosure rules. This process is referred to as a launch, and it occurs every 4 to 6 months depending on country readiness. Please see below the MDI launch pipeline:
flowchart LR
R["<b>Rocket</b><br>Contains research codes<br>(<i>modules</i>)<br>+<br>All needed R scripts to<br> harmonize the raw data"]
D[("Harmonized data<br>↑<br><b>Raw data</b><br>↓<br>Metadata<br><small>constantly updated</small>")]
O["<b>Output</b><br><small>CSVs outside the NSI<br> protected environment</small>"]
R --> D
D -->|export| O
D -.->|Metadata feed rocket| R
classDef rocket fill:#ffcccc,stroke:#333,stroke-width:2px;
classDef raw fill:#ccffcc,stroke:#333,stroke-width:2px;
classDef output fill:#ccccff,stroke:#333,stroke-width:2px;
class R rocket;
class D raw;
class O output;
Note: The diagram shows the MDI launch pipeline. The Rocket represents the deployed codebase containing harmonization scripts and research modules. It processes raw data and constantly updated metadata to produce harmonized data within the secure NSI environment. The harmonized datasets are then exported as output files outside the protected environment, only after passing disclosure checks.
- Access models
- Direct access: Researchers connect to the NSI secure environment with user credentials and run approved code on site.
- Indirect access: NSI staff or MDI staff execute the approved code and return only disclosure-safe outputs.
MDI infrastructure: some terms
- Class: Describes classification variables in the datasets, such as industry or product codes.
- Codebook: Maps categorical variable values to their corresponding descriptions.
- Data centers: Technical environments managed by NSS components that host, process, and secure microdata.
- Datafiles: Lists all available NSI firm-level data files, including their names and years covered.
- Disclosure Criteria: Rules designed by the NSIs to protect the confidentiality of firm-level data, ensuring that no output allows the identification of individual firms or the disclosure of sensitive information, even in aggregated form.
- Hierarchy: a table that maps a classification at different aggregation levels. E.g., NACE 4-digit code 6491 corresponds to NACE 3-digit 649, NACE 2-digit 64, and section K.
- MD metadata: standardized set of variable names (MD_varname, i.e., firmid, capital, etc.) and respective definitions that forms microdata (MD) panels, or the MD_dataset (i.e., BS, SBS, ENER, etc.) set by the MDI team.
- MDI: Microdata Infrastructure.
- MDI data catalogue: catalog containing all variables and their year range availability by country.
- MDI launch: process of running the modules within the Rocket at each scheduled deployment.
- MDI tools: set of R functions created by the MDI team to generate the MD_datasets, manipulate them and execute modules.
- Module: research code.
- Modules names are defined with an acronym (“res_group”). For example, a module about firm dynamics is called FD (res_group=FD).
- NSS: The coordinated institutional and technical framework encompassing the NSI and associated data centers.
- NSIs: National Statistical Institutes. These are the public authorities responsible for official statistics in each country. They host the confidential microdata, set legal rules, run disclosure control, and provide the secure environments where MDI operates.
- Nuvolos: cloud server platform where MDI users develop and test their code. This space is designed for training, practicing, and familiarizing with the MDI infrastructure.
- Rocket: codebase containing modules and scripts that are periodically deployed within the secure environments of NSIs to process and harmonize raw data, execute modules, and export results.
- Varnames: Documents the variables and their descriptions for each raw data file listed in datafile.
2 Using MDI
This section focuses on using the MDI and is meant primarily for research groups and module writers. It outlines all steps involved in conducting research with the MDI - from formulating a research question to selecting variables and preparing data files. It also provides information about launches, including the research execution process and the overall timeline.
2.1 MDI Users
MDI users (or module writers) include productivity boards, external academic and policy researchers, and MDI ‘Theme’ research staff. They are responsible for developing research modules that leverage MDI’s infrastructure for data analysis.
2.2 Setup for Researcher
Module writers develop and test their research code using mock data on the Nuvolos platform (see Nuvolos section). This process relies on a standardized metadata structure initialized through an R setup program. If a researcher has direct access to the microdata, they may also develop and test their modules directly using real data. Once development is complete, MDI staff consolidate and stack country-level outputs to enable cross-country analysis without granting direct access to firm-level data.
2.3 Workflow for writing modules
Writing modules for the MDI launch is an iterative process that moves from conceptualization to execution. It is a staged process designed for reproducibility and cross-country comparability. Start from a clear research question, select MD variables that exist across countries, prototype on Nuvolos mock data, validate disclosure compliance in-code, and prepare exports with complete metadata.

Deadlines and launch schedules
MDI modules are executed every four months through pre-scheduled launches. Accordingly, the MDI team communicates specific deadlines to all researchers for submitting their research modules and alerts the NSI staff accordingly.
The following table contains an estimate of the duration of a whole launch (between brackets, in the first column, a reference to the items in the diagram above):
| Task | Estimated duration |
| 1) Research module preparation (1. - 4.) | one month |
| 2) Module testing and submission (5. & 6.) | two/three weeks |
| 3) Launch preparation (7.) | a few days to a week |
| 4) Launch execution (7.) | two months |
| 5) Extraction of the results and consolidation of the output (8. & 9.) | a few days to a couple of months |
Hence, a researcher can expect to receive all the consolidated cross-country results between three to six months after the module submission.
2.3.1 Define your research question
Every module begins with a clear and concise research question, designed to leverage MDI’s cross-country data and produce meaningful analytical insights.
Before writing the analytical code, you must define a research acronym for your module (specified as res_group <- '(some 2-letter string)') and communicate it to the MDI team.
2.3.2 Data selection
Use the MDI data catalog to identify and select the most relevant datasets and variables for your analysis.
Ensure that all MD variables used in the module are available across all countries, especially the employment variable.
Keep in mind, however, that the harmonized version of the classification variable nace, called MDnace, is not present in the catalog.
If you want to use harmonized industry codes in your module, make sure you use MDnace instead of nace.
Conversely, the harmonized versions of product and trade codes (prodcom and cn08, respectively) keep the original classification variable names.
If you want to use the original non-harmonized codes, make sure you use NSI_(classname) in your code.
Check the dedicated section below for more details.
Additionally, module writers can consult information on data source, firm sample, and other details (taken from the NSI_datafiles tables) about the raw data files underlying each MD panel by using the interactive tool Datafiles Info Viewer.
Once the final selection of MD variable names has been made for the module, a file named (res_group)_MDnames_select.csv (see example below) must be submitted to the MDI team. This needs to have the column names as shown below.
| MD_dataset | MD_varname |
|---|---|
| BR | firmid |
| BR | plantid |
| BR | entid |
| BR | entgrp |
| BR | year |
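As a sketch, the selection file can also be generated programmatically rather than by hand (assuming the data.table package; the FD module and its variable list are purely illustrative):

```r
library(data.table)

# Hypothetical example: a firm-dynamics module (res_group = "FD")
# selecting identifiers from the BR dataset, as in the table above.
res_group <- "FD"

selection <- data.table(
  MD_dataset = "BR",
  MD_varname = c("firmid", "plantid", "entid", "entgrp", "year")
)

# File name follows the (res_group)_MDnames_select.csv convention.
fwrite(selection, paste0(res_group, "_MDnames_select.csv"))
```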
2.3.3 Analysis
2.3.3.1 Libraries, packages and the MDI R tools
Make use of the MDI R packages (see ../rocket/Rtools/Rpackages/Rpackage_info_v2.3_.csv) and the Rtools (see ../docs/MDI_Rpackage_1.0.0.pdf). The R package libraries currently installed at NSIs and loaded at runtime by the launcher are listed in ../rocket/Rtools/Rpackages/record_package_info.csv and below.
mdi R library manual
If you need a package that is not part of the current list of R libraries, notify the MDI staff so it can be added to the NSI requirements. When preparing output, use standardized functions from the MDI Rtools (see directory ../rocket/Rtools/R) whenever possible.
The MDI team also maintains an overview of:
- Installed R versions at each NSI, along with details on package installation policies. This information is available in Rversions_countries.csv, located in /MDI/docs/.
- Package conflict resolution preferences, documented in conflicts_prefer.csv, located under /MDI/rocket/Rtools/Rpackages/.
In the event of package conflicts, we follow the preferences outlined in this file. However, if a module requires a function from a non-preferred package, authors must explicitly use the package::function() syntax to avoid ambiguity. This syntax is generally encouraged to ensure clarity and compatibility across systems.
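As an illustration of the explicit-namespacing convention (a common conflict in any R session, not specific to the MDI package list), dplyr::filter() masks stats::filter(); qualifying each call removes the ambiguity:

```r
library(dplyr)  # attaching dplyr masks stats::filter()

x <- c(1, 2, 3, 4)

# Explicit namespacing: no ambiguity about which filter() runs.
smoothed <- stats::filter(x, rep(1/2, 2))            # time-series moving average
kept     <- dplyr::filter(data.frame(x = x), x > 2)  # row subsetting
```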
While we aim for a harmonized environment, some variation between countries may persist due to local constraints. Any such discrepancies are documented and communicated to module writers before deployment.
There are five general categories of R tools:
- Metadata: for generating, verifying, and manipulating metadata
- Infra: used by MDI staff for data importation, harmonization, and disclosure checks
- MDI: mostly for module writers, e.g. merge_datatables, regressions, aggregations, export
- Analysis: support analytical tasks and reporting
- Programmer: assist with R coding tasks
All tools are documented using Roxygen2 and exported as an R package. You can access documentation via the standard ?function() syntax, or by clicking on the mdi package in the RStudio package tab to view the full list of available functions.
2.3.4 Importing data
The NSI metadata enables the creation of standardized microdata panels (MD), which are harmonized and managed by the Launcher based on the logic defined in countdown.R.
MD datasets can be imported from the dirTMPSAVE folder, which is predefined in the environment. For example, the MD dataset BR can be imported using the following code snippet:
Code
BR <- readRDS(paste0(dirTMPSAVE, 'BR.rds'))
2.3.5 Manipulating data
You can freely manipulate linked panels using R and its libraries, such as data.table, dplyr, and the broader tidyverse.
When writing your analytical module, always check the unit of observation in each MD dataset for all countries where your code will run. This can be verified in the MDI Metadata Viewer.
For example, if using the ENER dataset, note that the unit of observation may vary between countries—such as between France and Portugal. Your R code must account for these differences to ensure analytical consistency. For details on the units used in each MD dataset, please refer to the metadata file MD_idInfo.csv.
To assist with this, the MDI toolkit includes a utility for aggregating or disaggregating between different key IDs. This tool is located in rocket/Rtools/R/mdi_key_id_switch.R and uses the metadata file *NSI*_firmid_entgrp_conc.csv.
2.3.5.1 Working with classifications
Classification variables, such as industry or product codes, are key in microdata work. We use tools that allow classification lists to be coherent over time.
In those harmonized MD datasets where a classification variable is present, be aware that these include both the raw classification variable and the time-concorded one. Hence, make sure you keep this in mind when designing your code!
In particular, when you prepare your (res_group)_MDnames_select.csv, please use the original MDnames for all classification variables, but feel free to use the concorded MDnames for the concorded classification variables.
For more details, check the dedicated section.
2.3.5.2 Merging data
Additionally, when merging data from two files, use the mdi_mergedatatables() function. This helps prevent memory issues and ensures that merges are performed correctly.
When working with the MD_dataset CIS, given that the data come from bi-yearly surveys, it is suggested to always merge CIS onto BR, for example:
Code
DT <- merge(BR, CIS, by = c('firmid','year'), all.x = T)
Then the user can decide how to interpolate the values in the missing years for the same firmid.
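A minimal sketch of one such choice, carrying the last observed CIS value forward within each firm (assumes data.table; the CIS variable rd_exp and the toy data are illustrative, not actual column names):

```r
library(data.table)

# Toy data standing in for BR (yearly) and CIS (bi-yearly), with a
# hypothetical CIS variable rd_exp; real column names may differ.
BR  <- data.table(firmid = 1L, year = 2015:2019)
CIS <- data.table(firmid = 1L, year = c(2015L, 2017L, 2019L),
                  rd_exp = c(10, 12, 15))

DT <- merge(BR, CIS, by = c("firmid", "year"), all.x = TRUE)
setorder(DT, firmid, year)

# Last-observation-carried-forward within firm: one way to fill the
# survey's missing years; the interpolation rule is the researcher's choice.
DT[, rd_exp := nafill(rd_exp, type = "locf"), by = firmid]
```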
Last but not least, be a smart coder: clean up unnecessary datasets, and avoid writing code that calls the operating system, creates (sub)directories, or installs R packages.
2.3.6 Analytical tools
The MDI toolkit includes a set of functions designed to help researchers efficiently carry out common tasks in firm-level microdata research. In this section, we present some of the key tools available. For more details, please refer to the MDI R tools section above and consult the mdi library PDF.
mdi_aggregate()
This tool aggregates variables in a data.table by group, allowing customizable statistics (e.g., sum, mean, HHI; check the PDF manual of the mdi library for the full list of methods in the related section), optional merging with the original dataset, and built-in disclosure checks.
Note that any output containing data points (such as plots or tables with quantiles) cannot be exported due to disclosure restrictions. Hence, quantiles cannot be exported as such.
However, keep in mind that mdi_aggregate() allows you to compute the mean of the minimum number of observations allowed for disclosure (the function uses MDIminNumObs; check the related section below) around the observation that is closest to the first (q25), second (median), or third (q75) quartile value.
The diagram below illustrates how this value is calculated for a series of values (3 to 11), in case MDIminNumObs is an odd number:
timeline
title Odd: `MDIminNumObs` = 5 → pick 2 below, 1 at quantile, 2 above
3 : |
5 : 🔵 (2nd below)
6 : 🔵 (1st below)
7 : 🔴 (closest to q)
8 : 🔵 (1st above)
9 : 🔵 (2nd above)
11 : |
and in case MDIminNumObs is an even number:
timeline
title Even: `MDIminNumObs` = 6 → pick 3 below, 3 above (bias below)
3 : 🔵 (3rd below)
5 : 🔵 (2nd below)
6 : 🔵 (1st below)
7 : 🔴 (closest to q)
8 : 🔵 (1st above)
9 : 🔵 (2nd above)
11 : |
Note that if the number of observations in the aggregate is small, the resulting value might be very different from the true quantile value.
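The selection logic above can be sketched in a few lines of base R (a simplified illustration, not the actual mdi_aggregate() implementation):

```r
# Mean of the k observations nearest (in rank) to a quantile q,
# taking the extra point from below when k is even - a simplified
# illustration, not the actual mdi_aggregate() implementation.
quantile_neighborhood_mean <- function(x, q = 0.5, k = 5) {
  x  <- sort(x)
  i  <- which.min(abs(x - quantile(x, q)))  # index closest to the quantile
  lo <- max(1, i - ceiling((k - 1) / 2))    # extra neighbor below when k is even
  hi <- min(length(x), i + floor((k - 1) / 2))
  mean(x[lo:hi])
}

quantile_neighborhood_mean(c(3, 5, 6, 7, 8, 9, 11), q = 0.5, k = 5)
# averages 5, 6, 7, 8, 9: the two neighbors on each side of the median
```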
estimate_markup()
This function computes firm-level markups following the De Loecker and Warzynski (2012) approach: the markup equals the output elasticity of a variable input divided by that input's revenue share (equivalently, the elasticity times revenue over input expenditure). The result is returned as a new variable.
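In symbols, for a variable input \(X\) with output elasticity \(\theta^{X}\), the markup of firm \(i\) in year \(t\) is

\[
\mu_{it} \;=\; \theta^{X}_{it} \Big/ \frac{P^{X}_{it} X_{it}}{P_{it} Q_{it}}
\;=\; \theta^{X}_{it} \, \frac{P_{it} Q_{it}}{P^{X}_{it} X_{it}},
\]

where \(P_{it} Q_{it}\) is revenue and \(P^{X}_{it} X_{it}\) is expenditure on the input.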
estimate_prod()
This tool estimates firm-level production function parameters (such as input elasticities and/or TFP) using OLS, ACF (Ackerberg-Caves-Frazer), or OP (Olley-Pakes) methods under Cobb-Douglas or translog specifications. It offers flexible options for fixed effects, instruments, and grouped estimation.
mdi_regress()
This function runs one or more regressions using feols or feglm from the fixest package, performs automatic disclosure checks to ensure the minimum observation threshold is met, and optionally exports LaTeX regression tables with accompanying metadata logs.
pim_capital()
This tool estimates firm-level capital stock using the Perpetual Inventory Method (PIM), based either on a user-specified depreciation rate or an inferred asset type. It returns the original data.table with an added capital stock variable.
Researchers may wish to conduct their analysis at various levels of sectoral aggregation. The MDI infrastructure supports this by providing classification concordances such as MD_nace_hier.csv and MD_naceR2_CNind_classconc.csv, which allow NACE Rev.2 industry codes to be mapped to broader industry groupings—such as 3-digit, 2-digit, 1-digit levels, and the CompNet macroindustry classification.
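A minimal sketch of such a mapping (the column names of the hierarchy table are assumed here for illustration and may differ from the real MD_nace_hier.csv):

```r
library(data.table)

# Toy hierarchy table standing in for MD_nace_hier.csv; real column
# names may differ. Mirrors the glossary example: 6491 -> 649 -> 64 -> K.
nace_hier <- data.table(nace4d = "6491", nace3d = "649",
                        nace2d = "64",   section = "K")

firms <- data.table(firmid = 1L, nace4d = "6491")

# Merge to attach the broader groupings used in the analysis.
firms <- merge(firms, nace_hier, by = "nace4d", all.x = TRUE)
```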
2.3.7 Exporting results
Once a results table is generated, the researcher must extract the file at the end of the module. After the launch is fully executed in a given country, the country leader submits an export request to the NSI, which then verifies compliance with disclosure rules for each output file (see disclosure criteria for details).
The mdi_export() function facilitates this process by exporting a data.table to a specified file format and logging comprehensive metadata—including variables used, purpose, and dataset context—into a central description file (OutputDescription.csv), which is also extracted. The function includes optional disclosure checks for summary statistics.
Below is a description of all parameters required for mdi_export():
mdi_export()
It is fundamental, for disclosure reasons, that the module writer fills in exhaustive information related to each output file when using the function mdi_export(). In particular, please provide:
- format: Character string specifying the format of the export (‘csv’, ‘RDS’, ‘txt’, ‘dta’, ‘xlsx’, ‘sas’).
- output_name: The name of the file to be created, without the file extension and the country code.
- datasets_used: The name of the MD_dataset(s) used for the analysis.
- purpose: Describe the research purpose of the analysis.
- share_0_1: Explain whether the output contains any shares equal to 0 or 1 (i.e., 0% or 100% of the group share the same characteristic). Such cases are not allowed according to the output guidelines and must therefore be suppressed or explicitly justified.
- zeroes: If the output contains zero values, provide an explanation of why these zeroes are not revealing additional information. Otherwise, this information must be suppressed.
- rel_other_output: Describe how this output file relates to other previously exported or requested files, for instance whether it performs the analysis in a different way and, if so, how.
- selection: Describe whether the results contained in the file were derived from a specific selection of the available sample (if so, explain which selection) or from the full sample.
- export_type: Character string indicating the type of output (‘sum_stat’ for summary statistics, ‘reg_tab’ for regression tables, or ‘other’).
- description: A string providing additional explanation of the output file.
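A hypothetical call is sketched below. The parameter names follow the list above, but the object results_table and all argument values are illustrative; consult the mdi library PDF for the authoritative signature and defaults.

```r
# Hypothetical usage sketch; see the mdi library manual for the
# authoritative signature and defaults.
mdi_export(
  results_table,
  format           = "csv",
  output_name      = "FD_entry_rates_by_industry",
  datasets_used    = "BR",
  purpose          = "Firm entry rates by 2-digit industry and year",
  share_0_1        = "No shares equal to 0 or 1 are reported",
  zeroes           = "No zero cells in the output",
  rel_other_output = "First output requested by this module",
  selection        = "Full sample",
  export_type      = "sum_stat",
  description      = "Entry rates aggregated with mdi_aggregate()"
)
```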
2.3.8 Consolidation of MDI Module Output
Once the files pass the disclosure checks:
Country leaders/NSIs will upload each country’s output to their designated Teams folder.
MDI staff will consolidate (stack) the outputs of each module by country and place both the module-specific and general launcher outputs in the appropriate Teams folder for module writers.
2.3.8.1 Procedure to stack MDI Module Output
For cross-country analysis, the individual country exports need to be identified, and consolidated into combined stacked datasets per module.
This is accomplished in three steps, using a sequence of scripts that are stored in dirROCKET/MDIprogs.
- Step 1: get_output_file_list.R Generate Country-Specific File Lists
The first script creates a file inventory for each country.
Inputs:
- Country code (CC, e.g., FR, FI, NL)
- Launch version number (2.3)
- Local path to the country’s upload directory
Process:
1. Iterates through all module output folders for the selected country.
2. Extracts the names of all .csv files, excluding descriptive files (e.g., OutputDescription.txt).
3. Adds metadata:
- Launch version number
- Country code
- A numeric flag indicating the format of the file name (1, 2, or 3).
4. Saves the resulting inventory as launch_<n>_file_list_<CC>.csv in the directory specified at the start of the script.
Output:
A CSV file listing all valid exported files for a single country, annotated with launch and country metadata.
- Step 2: generate_stacked_files.R Combine File Lists Across Countries
The second script consolidates the individual country inventories into one master file list.
Inputs:
- File lists generated by Script 1 (launch_<n>_file_list_<CC>.csv for each country).
Process:
1. Reads each country’s file list.
2. Appends a Country column to identify the file’s origin.
3. Stacks the inventories into one combined dataset.
Output:
A single file: launch_<n>_file_list_combined.csv containing metadata on all exported files across participating countries.
- Step 3: consolidate_output.R Consolidate Module-Level Outputs
The third script merges the exported data across countries for a chosen module.
Inputs:
- The combined file list from Step 2.
- Module name (e.g., EN for Energy).
- Country-specific Export directories (Most likely a Teams path).
Process:
1. Filters the combined file list for the specified module.
2. Iterates through each country’s export path and retrieves the corresponding .csv files.
3. Reads each file, cleans it, and appends a Country identifier column.
4. Binds all country datasets into one consolidated file.
Output:
A module-specific cross-country combined file (e.g., EN_combined.csv) stored in the specified Research Agenda folder that you input at the start of the script.
To summarise module exports consolidation
- get_output_file_list.R → Generate a country-level export file list.
- generate_stacked_files.R → Combine these lists into a cross-country file inventory.
- consolidate_output.R → Use the inventory to locate, clean, and stack module-level data exports across countries.
By running these 3 scripts, all outputs are systematically catalogued, reproducible, and readily available for post-launch comparative analysis.
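The core of Step 3 amounts to binding rows of the same export across countries while tagging the origin, for example (a simplified sketch, not the consolidate_output.R script itself; the toy tables stand in for files read from the Teams export paths):

```r
library(data.table)

# Toy stand-ins for the per-country CSVs located via the combined file
# list; in practice these would be read with fread() from the export paths.
exports <- list(
  FR = data.table(industry = "64", value = 1.2),
  FI = data.table(industry = "64", value = 0.9)
)

stacked <- rbindlist(
  lapply(names(exports), function(cc) {
    dt <- copy(exports[[cc]])
    dt[, Country := cc]   # tag each row with its country of origin
    dt
  }),
  use.names = TRUE, fill = TRUE  # tolerate column differences across countries
)
```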
2.4 Running Order & How-To (Quick Reference)
2.4.1 Prerequisites
R packages: data.table, dplyr, readr
Directory layout must follow:
.../MDI Data Providers Forum - CC - CC/Upload/Launch_<n>/<CC>_output_Launch_<n>_<MODULE>/...
- Central outputs:
.../CompNet MDI Research Agenda - General/Launch_<n>
Researchers can run ../launchpad/interactive_MDI.R to initialize their MDI environment in a standardized way. Researchers then analyze the output, optionally using standardized tools for statistical analysis, graphing, and reporting.
2.5 Dealing with classifications
A key feature of firm-level research is the use of classifications, such as industry codes (NACE codes), product codes (PRODCOM codes) and trade codes (combined nomenclature codes). Given that the official set of codes in a classification can vary across the years, we developed some tools that allow us to have a consistent list of codes over time in each country. Specifically, we make use of two tools:
make_conc()
This tool is currently used to harmonize PRODCOM and ITGS codes over time.
Firstly, it takes the time concordance tables for each pair of consecutive years and reproduces the development of each code over time; in this way, the yearly concordances trace all possible changes of the codes from the first to the last year of the relevant period. Secondly, it links all the paths of codes that have common codes, harmonizing each such group of codes to a common code from the last year of the relevant period.
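A toy illustration of the chaining step (columns are renamed here for clarity; real NSI concordance tables use a left/right column convention and many more codes):

```r
library(data.table)

# Toy yearly concordance tables: code_y1 -> code_y2 (first transition)
# and code_y2 -> code_y3 (second transition).
conc_t1 <- data.table(code_y1 = c("A1", "A2"), code_y2 = c("B1", "B1"))
conc_t2 <- data.table(code_y2 = "B1", code_y3 = "C1")

# Chain the two mappings: every year-1 code ends at its year-3 code,
# so A1 and A2 are both harmonized to the last-year code C1.
path <- merge(conc_t1, conc_t2, by = "code_y2")[, .(code_y1, code_y3)]
```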
Note: Column left in a time concordance table (the one we receive from the NSI) might not contain all codes we observe in the dataset at time year-1. Hence, it is advised to use the tool mdi_timeconc_update() from the mdi package, which makes sure that any missing mappings are present in the time concordance table for that dataset.
As inputs, it requires the yearly concordance tables of the classification (in data.table format), the numeric vector of the years of interest, and the character name of the classification. It returns the data.table that concords each code to the harmonized code, for each year.
It returns the original MD dataset with the old NSI class code (under column NSI_(classname)) and the harmonized code (under the column using the classification name).
concord_nace()
This tool harmonizes the NSI NACE classification over time.
First, it detects the year with the most NACE code changes, the year with a possible break in the classification. Then, it uses the mode NACE code in the post-break year as harmonized NACE code and harmonizes previous year codes accordingly, by firm. For firms present only before the break year, their codes are harmonized depending on the changes of the surviving firms, which are used to create a concordance table between codes in the pre- and post-break year.
As inputs, it requires the character dataset name; a logical to decide whether or not to weight code matches of surviving firms by employment (instead of number of firms); the number indicating the cumulative residual share of firms deleted for the construction of the pre- and post-break-year concordance; and the number indicating the share of firms deleted for the construction of such concordance.
It returns the original MD dataset with the old NSI NACE code (under column nace) and the harmonized NACE (under column MDnace). It will then be possible to add the MD NACE through the concordance table between NSI NACE codes and the MD NACE codes.
2.6 Disclosure Procedures
2.6.1 What are Disclosure Criteria
Disclosure criteria at National Statistical Institutes (NSIs) are rules designed to protect the confidentiality of firm-level data. They ensure that no output allows the identification of individual firms or the disclosure of sensitive information, even in aggregated form. These criteria are crucial for complying with national privacy and data protection regulations.
2.6.2 How Disclosure Criteria Are Applied:
MDI tools automate disclosure control by applying primary and secondary confidentiality rules (such as minimum observation thresholds and dominance criteria) before any output is released. These rules ensure that sensitive data is suppressed or flagged, in line with the parameters defined in file payload/Launch_v2.3/MDmetadata/MD_disclosure_info.csv.
Learn more about how this is done
MDI tools such as mdi_aggregate(), disclose(), disclosCrit(), mdi_regress(), and mdi_export() help automate disclosure control by enforcing rules based on parameters set in the Countdown, ensuring compliance before output is released.
Primary disclosure (Step 1) requires suppression of all cells that fail the dominance criterion or contain fewer than the minimum number of observations (minNumObs).
Secondary disclosure (Step 2) involves suppressing additional cells to protect those flagged in Step 1, following the minimum frequency rule. This typically means suppressing the smallest unsuppressed cell if only one cell was suppressed in Step 1 (applicable to totals/sums where the parent node is available).
For example, a cell that does not meet NumObs, or that exceeds domPerc, is suppressed. Outputs violating these criteria are flagged or excluded from export.
2.6.3 Components of Disclosure Criteria in the MDI
Four main variables are created by MDI tools to assess disclosure criteria:
Dominance Share (MDIdomSh)
The maximum share of the total (e.g., employment, sales) contributed by the largest ‘X’ firms (the number ‘X’ is defined by domNr) in a cell. Example: if domPerc is 0.75, the top ‘X’ firms cannot contribute more than 75% to the cell’s total.
Minimum Number of Observations (MDIminNumObs)
The minimum number of firms required in a cell for it to be included in the output. Example: If NumObs is 3, at least 3 firms must contribute to a cell.
Top Firms Count (MDIdomNr)
Specifies how many top firms’ shares are considered when applying the dominance criterion (domPerc). Example: If domNr is 1, the dominance is based on the largest firm; if 2, the top two firms are considered.
Dominance Variable (MDIdomVar)
The variable on which the dominance criterion is applied, such as employment (emp) or sales (nq). Different NSIs may apply criteria to different variables, depending on their legal requirements. Note: The domVar can be ‘var’ in the countdown. If so, the domPerc is computed for all variables for which an aggregate is computed.
Show dominance percentiles (show_domPerc)
This is a dummy variable indicating whether the dominance percentile columns need to be included (1) or not (0) in the output file.
Hide or not hide values post-disclosure (show_values)
This is a dummy variable indicating whether the aggregates in the output file need to be hidden (0) or not (1) in case they don’t comply with the disclosure rules of the NSI.
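Taken together, the components above define a simple pass/fail check per cell. The sketch below is an illustrative Python toy, not the MDI’s R tooling (disclosCrit()/disclose() implement the real logic); the function name and arguments are hypothetical:

```python
def cell_passes(n_obs, shares, min_num_obs, dom_sh, dom_nr):
    """One cell's primary disclosure check (toy illustration).

    n_obs       -- number of firms contributing to the cell
    shares      -- firms' shares of the cell total, sorted descending
    min_num_obs -- minimum-observations threshold (MDIminNumObs)
    dom_sh      -- dominance-share threshold (MDIdomSh)
    dom_nr      -- how many top firms to consider (MDIdomNr)
    """
    if n_obs < min_num_obs:           # minimum observations rule
        return False
    dom_perc = sum(shares[:dom_nr])   # share held by the top dom_nr firms
    return dom_perc < dom_sh          # standard (non-DE) dominance rule
```

For instance, with min_num_obs = 3, dom_sh = 0.75, and dom_nr = 2, a five-firm cell whose top two firms hold 80% of the total fails the dominance rule.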
2.6.4 Disclosure Criteria in MDI Countries
Below are the disclosure criteria in MDI countries:
| disclosure_variable | AT | EL | FI | FR | DE | NL | PTx | PT | SI | GB | MT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MDIminNumObs | 10 | 5 | 3 | 4 | NA | 10 | 1 | 1 | 5 | 10 | 3 |
| MDIdomVar | var | var | persons_br | var | NA | var | var | var | var | var | var |
| MDIdomSh | 0.8 | 0.85 | 0.75 | 0.85 | NA | 0.5 | 1 | 1 | 0.5 | 0.4375 | 0.9 |
| MDIdomNr | 2 | 2 | 1 | 1 | NA | 1 | 1 | 1 | 1 | 1 | 2 |
| show_domPerc | 1 | 1 | 1 | 1 | NA | 1 | 0 | 0 | 1 | 1 | 1 |
| show_values | 0 | 0 | 0 | 1 | NA | 0 | 1 | 0 | 1 | 0 | 0 |
Some countries apply additional disclosure criteria. For instance, the Netherlands (NL) and Slovenia (SI) require that all exported variables (not just employment (emp) and sales (nq)) comply with the dominance share criterion. In such cases, the parameter domVar is set to ‘var’ in the Countdown file. These disclosure parameters are configured during the execution of the infrastructure at an NSI, either by MDI or NSI staff.
2.6.5 Disclosure Routines in MDI
The MDI tools listed below operate using the disclosure parameters defined by the user in the Countdown file.
| Tool | Purpose | Used by Researchers | Uses Other MDI Tools |
|---|---|---|---|
| mdi_aggregate.R | Aggregate data with optional disclosure checks: dominance threshold (`domPerc`) and minimum observations (`NumObs`). | Yes | Yes (`disclosCrit`, `disclose`) |
| mdi_regress.R | Performs regression analysis and automatically checks whether the number of observations meets the required minimum threshold for disclosure. Skips regressions that fail the check. | Yes | No |
| mdi_export.R | Exports datasets with optional disclosure compliance and metadata logging. | Yes | Yes (`disclose`) |
| disclose.R | Performs primary and secondary disclosure checks. | No | No |
| disclosCrit.R | Adds disclosure metrics (`domPerc`, `NumObs`) to datasets. | No | No |
Module writers are strongly encouraged to use MDI tools to ensure compliance with the disclosure criteria of all countries where the module is intended to run.
Primary and Secondary Disclosure with disclose
The disclose tool applies two levels of disclosure control to aggregated statistics to ensure compliance with confidentiality requirements.
Suppressed values are replaced with the sentinel value -999, and disclosure flags (discflag1, discflag1_*, discflag2) record which suppression criteria were triggered.
2.6.5.1 1. Primary Disclosure
Primary disclosure applies two main suppression rules to protect confidentiality:
Minimum Observations Rule
Any aggregate based on fewer than the required minimum number of observations (MDIminNumObs) is suppressed.
All affected variables in that row, including the NumObs column, are replaced with -999.
Dominance Rule
For sum-type variables, the function evaluates the dominance share of the largest contributors (domPerc_*), calculated by disclosCrit().
- For most countries: a cell is suppressed if the dominance share exceeds the threshold (domPerc >= MDIdomSh).
- Germany (DE): the rule is inverted; a cell is suppressed if the dominance share falls below the threshold (domPerc < MDIdomSh). This reflects German statistical disclosure practice, where low dominance values indicate high concentration risk.
All cells suppressed in this step are flagged with discflag1 (and variable-specific flags discflag1_<var>).
2.6.5.2 2. Secondary Disclosure (Hierarchical Totals Only)
When the aggregated data includes hierarchical levels — for example, industry or regional totals — an additional secondary disclosure step prevents back-calculation of suppressed values.
- The hierarchy file (hhfile) must be a wide table, with one column per hierarchical level (e.g., h_0, h_1, h_2, …).
- The node variable in the dataset identifies the child level.
- The parent level is determined by the next column to the right of the child in the sorted hierarchy (e.g., if node = h_1, the parent is h_2).
Suppression rule:
If within a parent group exactly one child cell was suppressed in the primary step, the tool suppresses one additional child — the non-suppressed cell with the smallest number of observations (NumObs).
This prevents the originally suppressed value from being reconstructed by subtraction from the total.
All such cases are flagged with discflag2.
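The secondary rule amounts to a little bookkeeping per parent group. A hedged sketch follows (illustrative Python, not the actual disclose() R code; names and data layout are hypothetical):

```python
def secondary_suppress(children):
    """Apply the secondary rule within one parent group (toy illustration).

    children -- list of (node, num_obs, suppressed) tuples as left
                by the primary step."""
    flagged = [c for c in children if c[2]]
    if len(flagged) != 1:          # rule applies only when exactly one gap
        return children
    visible = [c for c in children if not c[2]]
    # additionally suppress the non-suppressed cell with the smallest NumObs
    victim = min(visible, key=lambda c: c[1])
    return [(n, o, s or n == victim[0]) for n, o, s in children]
```

With children [("a", 5, True), ("b", 12, False), ("c", 7, False)], cell "c" (the smallest visible cell) is suppressed as well, so "a" can no longer be recovered by subtracting "b" and "c" from the parent total.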
2.6.5.3 Germany-Specific Note on Dominance Percentiles
The dominance share (domPerc_*) used in the primary disclosure rule is computed differently for Germany in disclosCrit().
Instead of using the standard top-n share (sum of the top domNr values divided by the total), Germany applies the following ratio:
\(\text{domPerc} = \frac{\text{Total} - x_1 - x_2}{x_1}\)
where \(x_1\) and \(x_2\) are the two largest firm values in each aggregation group.
This yields a dominance percentile that decreases as concentration increases — hence, in Germany, smaller values of domPerc indicate greater dominance and trigger suppression (domPerc < MDIdomSh).
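To see why the German rule inverts, consider the ratio directly (hypothetical Python, not the disclosCrit() implementation):

```python
def dom_perc_germany(values):
    """German dominance ratio (Total - x1 - x2) / x1 (toy illustration),
    with x1 >= x2 the two largest firm values in the group."""
    x1, x2 = sorted(values, reverse=True)[:2]
    return (sum(values) - x1 - x2) / x1
```

For [50, 30, 20] the ratio is (100 - 80) / 50 = 0.4; for the far more concentrated [90, 5, 5] it falls to 5/90, roughly 0.056. Concentration drives the ratio toward zero, which is why suppression triggers when domPerc < MDIdomSh.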
2.6.5.4 Output
The tool returns the dataset with all required suppressions applied and two disclosure flags:
- discflag1: primary disclosure
- discflag2: secondary disclosure
2.7 Auxiliary Files
Deflators are constructed using data extracted from Eurostat via the eurostat R package. These include deflators for:
- value-added (pnv, at NACE level 2)
- capital depreciation (pnc, at NACE level 2)
- gross fixed-assets (pgrK, at NACE level 1)
- investment (pni, at NACE level 2)
- GDP (pngdp, at NACE level 2)
- the harmonized consumer price index (HCPI) (pnhcpi, at NACE level 2),
all normalized to a 2010 base year (set to 1). The underlying Eurostat datasets (nama_10_a64, nama_10_a64_p5, nama_10_gdp, prc_hicp_aind, and nama_10_nfa) cover national accounts and price indices.
The processed deflator file is structured by country code (cc), industry code (DEFind, which can be linked to the MD variable nace using the table nace_DEFind), and year (year). The table also contains asset-specific deflators (e.g., construction, machinery, intellectual property) and includes growth and depreciation rates, offering a detailed dataset for robust analytical use.
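The base-year normalization is straightforward to sketch (illustrative Python; the actual pipeline builds these series in R from the eurostat package):

```python
def rebase(series, base_year=2010):
    """Normalize a deflator series (year -> index) so that the base year
    equals 1 (toy sketch)."""
    base = series[base_year]
    return {year: value / base for year, value in series.items()}
```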
2.8 Nuvolos: where MDI users develop and test their codes
To support testing of both modules and the full MDI infrastructure, a dedicated environment has been set up on the server Nuvolos. This environment includes several separate spaces for different purposes—for example, an internal development space for the MDI team, and a testing space for internal and external users to validate module code.
This server replicates the environment of a national statistical institute and includes the MDI infrastructure. The Nuvolos server provides all the necessary tools, scripts, and libraries required to develop research modules. Each space includes mock data, which consists of artificially generated datasets that mimic real NSI data in structure and naming conventions. These datasets allow for realistic and consistent testing of modules and infrastructure. (Details on how the mock data is created can be found here.)
2.8.1 How to access
Nuvolos is the server environment that we use for training, testing, developing, and debugging. There are three separate environments:
1. MDI Training Environment: this space is meant for MDI users, module writers who want to test their scripts, and people who want to learn about the MDI.
2. Nuvolos Developer Space: this space is meant for the MDI infra team to develop and test the MDI infrastructure.
3. Portugal Data Access Space: this is a space exclusively for people who have access to the Portuguese data. This is where the MDI for PTx is executed by the MDI team.
If you want to get access to either of the environments, reach out to Johanna via email or Teams.
2.8.1.1 MDI Training Environment
This space is designed for training, practicing, and becoming familiar with the MDI infrastructure. It’s intended mostly for external people, as it already contains all relevant files, i.e. the whole MDI infrastructure and mock data. It’s an environment in which externals can test their code using the MDI infrastructure and mock data. It is updated periodically, so it doesn’t always reflect the latest version of the MDI. It is not intended for bug-fixing and working on the infrastructure, as it doesn’t link to GitHub.
As this is a practice space, module writers can import or write their scripts, develop their module, run it as part of the MDI, and export their files if needed. The final module needs to be sent to the MDI team before each launch.
In the training space each user has a separate copy of the MDI and data files. That means that if scripts are altered, moved, or added, this is only reflected in the user’s space: deleting a file will not affect any other user, and no other user can see your modifications.
2.8.2 How to use
You will find a folder structure such as the one found in the NSI environments containing the actual microdata.
Files structure
In the “Files” section you’ll find the following folders:
- MDI: contains the MDI infrastructure
- output: if a script generates any output, it can be saved here
These additional folders can only be accessed through RStudio:
- space_mount/mockdata/NSIdata/: contains NSI mock data for several countries
- space_mount/mockdata/TMP/: contains MD mock data for several countries
Before writing code, users first need to follow a few steps to load the environment, i.e., run functions that import the mock data and all the necessary auxiliary files mimicking the NSI environments.
- Go to the “Applications” section on the left navigation bar and open the RStudio application.
- Run the countdown:
  - In RStudio, in the files pane on the right, look for MDI > launchpad and open countdown.R.
  - Click on ‘Source’ to execute it.
  - You will be asked to choose a program to run; choose interactive MDI by entering 4.
  - Wait until the script is done. The metadata, Rtools, and libraries are imported; you can now create and run your script.
- Create/execute a module:
  - Add a new file.
  - Add your module code or any code that you want to run.
  - Execute your code either line-by-line by clicking ‘Run’ or all at once by clicking ‘Source’.
- When you run your script, any errors will appear in the Console section. The executed lines are highlighted in blue, while errors are displayed in red. To resolve an issue, identify the problematic line in your script, make the necessary corrections, and run it again.
2.8.2.1 Portugal Data Access Space
This space is set up for direct access to the Portuguese data; this is where the launch execution for Portugal takes place. The space is reserved for the PTx country leaders. It contains the raw PTx data and the MDI infrastructure. The MDI infrastructure in this space is updated regularly but is not necessarily the most recent version found on GitHub.
The data is stored in the large file storage (folder space_mounts/NSIdata). Everyone who has access to the space has read and write permissions on the data. Any changes to the data should be made with the utmost caution.
All files - infrastructure and data - are shared across all users of the space. That means any modifications will be visible to all users of the space.
2.9 The complete MDI pipeline
The complete MDI pipeline is visualized below:
Code
%%{init: {
"theme": "base",
"themeVariables": {
"background": "#ffffff",
"textColor": "#000000",
"lineColor": "#000000",
"fontSize": "26px"
}
}}%%
flowchart TB
%% Setting MDI
A([🚀 **Microdata Infrastructure** 🚀]) --> |to initialize it...|A1([**launchpad/countdown.R**])
A1 --> A2(set country code)
A1 --> A3(set paths to directories)
A1 --> A4(set disclosure parameters)
A2 -.-> A5[AT, DE, FI, FR, NL, PT, SI]
style A fill:#003366,stroke:#000000,color:#ffffff
style A1 fill:#228B22,stroke:#ffffff,color:#ffffff
style A2 fill:#90EE90,stroke:#ffffff,color:#000000
style A3 fill:#90EE90,stroke:#ffffff,color:#000000
style A4 fill:#90EE90,stroke:#ffffff,color:#000000
style A5 fill:#E6FFE6,stroke:#ffffff,color:#000000
%% Options MDI
A3 ---> B1([there are four options:])
A4 ---> B1([there are four options:])
A5 ---> B1([there are four options:])
B1 --> C[(**pre_launch_checker.R**)]
B1 --> D[(**liftoff.R**)]
B1 --> E[(**interactive_mdi.R**)]
B1 --> F[(**prepare_NSI.R**)]
style B1 fill:#F0F0F0,stroke:#FFFFFF,color:#000000
%% Pre-launch Checker
C --> C1[checking if metadata corresponds to what we have in the environment]
C1 --> C2(**Database**)
C1 --> C3(**Varnames**)
C1 --> C4(**Codebooks**)
C1 --> C5(**Classfiles**)
C2 -.-> C6[Do they all exist?]
C3 -.-> C7[Do all varnames exist?]
C3 -.-> C8[Are all varnames of the listed data type?]
style C fill:#7A5DC7,stroke:#000000,color:#ffffff
style C1 fill:#E6E6FA,stroke:#ffffff
style C2 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
style C3 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
style C4 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
style C5 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
style C6 fill:#E6E6FA,stroke:#ffffff
style C7 fill:#E6E6FA,stroke:#ffffff
style C8 fill:#E6E6FA,stroke:#ffffff
%% Liftoff
D ---> D1(🚀 **main script that completes the launch sequence: loads essential R libraries, pulls in MDI resources and NSI metadata, and brings raw firm data plus concordance tables into the environment. This script launches the rocket** 🚀)
D1 --> D2(read R libraries, import mdi library, import NSI metadata, import concordance tables)
D1 --> D3(import raw firm data)
D1 ---> |if data is already harmonized...|D4[**execute research modules**]
D4 --> D5[M0]
D4 --> D6[CN]
D4 --> D7[EN]
D4 --> D8[FD]
D4 --> D9[MP]
D4 --> D10[TC]
D5 --> D11[🎊 **extract results** 🎊]
D6 --> D11
D7 --> D11
D8 --> D11
D9 --> D11
D10 --> D11
style D fill:#8B0000,stroke:#000000,color:#ffffff
style D1 fill:#8B0000,stroke:#ffffff,color:#ffffff
style D2 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D3 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D4 fill:#8B0000,stroke:#ffffff,color:#ffffff
style D5 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D6 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D7 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D8 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D9 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D10 fill:#FFD6D6,stroke:#ffffff,color:#000000
style D11 fill:#8B0000,stroke:#ffffff,color:#ffffff
%% Interactive MDI
E --> E1[set up directories, call R libraries, load Rtools, and interactively explore the MDI environment]
style E fill:#FF8C00,stroke:#000000,color:#ffffff
style E1 fill:#FFF2DC,stroke:#ffffff
%% Prepare NSI
F --> F1[first time preparation of data and metadata, updating of metadata when new file years and file types become available]
style F fill:#654321,stroke:#000000,color:#ffffff
style F1 fill:#ECD9C6,stroke:#ffffff
%% Harmonize raw data
D2 --> G(🔩 **harmonize raw data to MDI** 🔩)
D3 --> G
G --> G1(there are 4 methods for the harmonization process)
G1 -.-> G2(**revalue**: transform value of a unique variable)
G2 -.-> G3(**recode/reclass**: concord codebook/class variables as desired)
G3 -.-> G4(**redefine**: aggregate one or more variables)
G4 -.-> G5(**remap**: assign new name to a raw variable)
G ---> D4
style G fill:#4B4B4B,stroke:#ffffff,color:#ffffff
style G1 fill:#E0E0E0,stroke:#ffffff
style G2 fill:#F5F5F5,stroke:#ffffff
style G3 fill:#F5F5F5,stroke:#ffffff
style G4 fill:#F5F5F5,stroke:#ffffff
style G5 fill:#F5F5F5,stroke:#ffffff
Note. This scheme shows how to initialize and run the MDI. Start launchpad/countdown.R and set the country, paths, and disclosure parameters. Then choose one of four programs: pre_launch_checker validates metadata and varnames; liftoff loads libraries, imports metadata and raw data, harmonizes them using four methods (revalue, recode/reclass, redefine, remap), and then executes the modules and extracts results after disclosure checks; interactive_mdi explores the environment; prepare_NSI handles first-time setup and metadata updates. Boxes indicate steps; arrows indicate control and data flow.
3 Setting up the MDI
This section covers the technical details of the MDI - including preparation for implementation in a new country, the modifications required for a launch, and how to develop the infrastructure further. It provides information for the MDI team and the country leads responsible for setting up the MDI in their respective countries.
3.1 Introduction & Setup to MDI Infrastructure
The MDI (Microdata Infrastructure) provides a unified research environment implemented identically at all national statistical institutes (NSIs), including the mock data site. This consistent setup allows researchers to analyze standardized microdata (MD) panels, constructed from diverse national sources.
These MD panels are harmonized through detailed metadata, which ensures legal compliance, transparency, and comparability of statistical outputs across countries. For each NSI, the metadata specifies the available source files, variables, and classification lists, and maps them to the shared MD format. As a result, the datasets are syntactically identical across countries, even when the underlying data differs.
NSIs vary in their legal frameworks, technical setups, and the types of data they maintain—from registers and surveys to administrative sources. The MDI infrastructure addresses this heterogeneity by applying a common structure and metadata standard across all participating institutes.
To execute research, individual researchers write analysis modules—typically in R—the payload. This payload is executed inside the secure MDI environment. During a launch, the rocket reads the metadata and data, harmonizes them, constructs the MD panels, and then runs the payload modules. The outputs comply with disclosure rules, enabling valid cross-country comparisons.
3.1.1 Launch Preparation by the MDI team
These are the steps the MDI team must follow, in this order, before each launch:
1. Lock the R package list. Freeze the package versions to ensure consistency across all NSI environments.
2. Create a dedicated GitHub branch, e.g. post_Launch_vX.X. This branch will track all changes made during the launch at the NSIs.
3. Create an error tracking file alldocs/Launch_vX.X_errors_overview.csv. This shared file is used by all country leaders to document errors and changes during the launch process.
4. Generate documentation with roxygenize. Run roxygenize(paste0(dirROCKET, "Rtools")) to generate documentation. No further changes should be made to the R tools after this step.
5. Capture the Git commit details. Run rocket/MDIprogs/get_commit_details.R and re-commit the branch to lock the exact version.
6. Deploy to the NSI Teams folder. Use the appropriate scripts to copy the finalized GitHub branch to the NSI-specific Teams folders.
7. Notify NSIs to download and set up. Ask NSI system administrators to download the folder, install or update required R packages, and install the MDI package.
3.1.2 Launch Sequence Overview
This section outlines the main steps for executing a full MDI launch.
1. Import MDI files. Copy the complete MDI folder from the country Teams directory into a local working directory of your choice. Ensure that both the user and R have read access to the raw data files in that location.
2. Configure the Countdown script. Open launchpad/countdown.R and update the required parameters to match your site-specific setup. Find details and explanations here.
3. Run pre_launch_checker.R. Begin by running countdown.R from your working directory and selecting the pre_launch_checker.R option. This checks for inconsistencies between the metadata expected by MDI and the actual metadata at the NSI site. It generates concordance files and a report listing issues to fix. Find details and explanations here.
4. Run the Post-Harmonization Checker. In countdown.R, set the option MDImoduleRun = FALSE and run the countdown again, selecting liftoff.R. This executes the full MDI rocket without running the analytical modules. During this phase, the Post-Harmonization Checker (PHC) is triggered to validate the harmonized data. The PHC script performs quality checks on the harmonized microdata, such as detecting duplicates, format mismatches, date inconsistencies, and structural breaks. The results of these checks are saved to two files: <CountryCode>_phc_results.txt and breaks_report.pdf (in dirTMPSAVE). These files must be reviewed and issues resolved before proceeding. Please have a look at the detailed section on post-harmonization checks for instructions for country leaders.
5. Full module execution. After resolving all issues flagged by the PHC, set MDImoduleRun = TRUE and rerun liftoff.R to execute the full set of research modules. Iterations with the MDI staff may be needed for fixes and patches to the rocket and payload, until the final results are written to the dirOUTPUT directory.
6. Export. The files in dirOUTPUT need to be checked for disclosure. After disclosure checks are completed, the approved files from this directory can be uploaded to the MDI TEAMS cloud directory designated for the NSI staff.
3.1.3 Additional Programs (Not Part of Launch Sequence)
The following programs are not part of the formal launch sequence but support metadata setup and interactive work:
- prepare_NSI.R: Used to generate and structure metadata at the NSI. Should be run before any launch steps if metadata is not yet available.
- interactive_MDI.R: Enables interactive work inside the MDI environment (e.g. for metadata exploration or manual testing). Not used during automated launches.
3.1.4 The structure of the MDI
These are the main directories in the MDI folder:
- docs: Documentation of the MDI system, including the MDI Manual.
- rocket: Code supporting and controlling MDI rocket launches, including NSI metadata and auxiliary data.
- payload: Research modules, including metadata and NSI-specific NSI_MD concordances.
- launchpad: NSI-specific information for controlling MDI code and rocket launches.
Files in the MDI folder
.
├── docs
├── launchpad
├── payload
├── rocket
└── site
directory launchpad (with files to launch MDI)
launchpad
├── README.md
├── countdown.R
├── interactive_MDI.R
├── liftoff.R
├── pre_launch_checker.R
├── prepare_NSI.R
└── report_file_changes.R
subdirectories of rocket (with code, and (meta)data to support MDI)
rocket
├── CompNet
├── MDIprogs
├── NSImetadata
├── Rtools
├── auxdata
└── control
subdirectories of payload (with analytical code and MD (meta)data)
payload
├── Launch_v2.0
├── Launch_v2.1
├── Launch_v2.2
├── Launch_v2.3
└── Launch_v2.mini
3.1.5 Importing to, and exporting from, the NSI site
Whenever an import or export operation is required from or to an environment, it is important to consider both the time it takes and whether the operation incurs any monetary cost.
3.2 Metadata
This section provides an overview of the structure of NSI and MD metadata, how to construct them and to establish the connection between them, ensuring that country-specific data sources are accurately mapped to the standardized MD panel structure.
3.2.1 Specifications for the NSI Metadata
This section summarizes the structure and content of the NSI metadata files. These files document, in both machine- and human-readable formats, the available data files, the unit of observation (i.e., the description of each row), the names and descriptions of the variables (i.e., columns) in each file, and the valid values for each variable, including their class and domain. The following paragraphs offer guidance on how to prepare country-specific metadata files accordingly.
Once created, the NSI metadata files must be uploaded to the designated TEAMS directory. After the NSI downloads the updated rocket, these metadata files will be located in the rocket/NSImetadata/*NSI*/ directory of the MDI infrastructure. The MDI program pre_launch_checker.R, which should be run whenever the MDI is updated, will identify inconsistencies and other issues in the metadata.
The main types of NSI metadata files prepared include:
- datafiles: Lists all available NSI firm-level data files, including their names and years covered.
- varnames: Documents the variables and their descriptions for each raw data file listed in datafiles.
- codebook: Maps categorical variable values to their corresponding descriptions.
- class: Describes classification variables in the datasets, such as industry or product codes.
- classvart0_classvart1_conc: Details how classification variables evolve over time, providing concordance between versions.
- keyID1_keyID2_conc: Maps a firm identifier (keyID1) to a higher-level identifier (keyID2) by year.
It is advised to construct the files in the same order as in the above list.
Together with the MDI team, the NSI prepares metadata to support the harmonization of NSI data to the MD specification. The MDI team supplies metadata, potentially specific to each launch, that describes the MD datasets and their variables. In addition, the MDI team and the NSI jointly provide concordances used to align NSI data files with the standardized MD format.
To facilitate this process, the MDI team also provides tools for creating the required metadata files. These tools can be found in the directory /rocket/MDIprogs/metadata_tools/.
In the filenames for the metadata, the acronym NSI is used. This should be substituted with the 2-letter country code for the country in question (using the ISO3166-2 standard, e.g. NSI = PT). For the MDI metadata, the two letters MD are used.
3.2.1.1 List of NSI datafiles – NSI_datafiles.csv
This file contains the list of all available raw data files on a country’s environment. The file has the following columns:
NSI_datafile,NSI_dataset,yearvar,year_start,year_end,format,path,details,firm_unit,data_source,firm_sample,preprocessing
where:
- `NSI_dataset` is the ‘generic’ name of the NSI datafile.
- `NSI_datafile` is the name of the file in the NSI environment.
- `yearvar` gives the name of the year variable if `NSI_datafile` is a panel; empty otherwise.
- `year_start` is the starting year of the data file.
- `year_end` is the last year of the data file.
- `format` is the file extension (csv, sas, stata, etc.) of the file (i.e. also the storage format of the data).
- `path` indicates the path of the datafile relative to the NSI data directory (given by the parameter `dirINPUTDATA` in launchpad/countdown.R).
- `details` contains additional notes on the file.
- `firm_unit` indicates the type of firm observation unit. There can be four types of units; below we provide a definition for each, taken from the Eurostat glossary and hierarchically ordered:
  - `plant`: A single-location enterprise or part of one, primarily engaged in one main productive activity. Also often known as an ‘establishment’. This corresponds to `plantid` in the MD data.
  - `legal_unit`: Either a legal person recognized by law or a natural person conducting economic activity independently. This corresponds to `firmid` in the MD data.
  - `enterprise`: An organizational unit producing goods or services, with decision-making autonomy, possibly spanning multiple activities, locations, or legal units. Hence, one enterprise might be constituted by more than one legal unit. This corresponds to `entid` in the MD data.
  - `enterprise_group`: A set of legally or financially linked enterprises, controlled by a group head, forming an economic entity with shared or centralized decision-making. This corresponds to `entgrp` in the MD data.
- `data_source` refers to the origin of the data. Three options are possible:
  - `survey`: the data was collected through surveys.
  - `administrative_source`: information collected by public authorities from firms as part of legal or regulatory requirements, such as tax records, employment filings, or financial statements.
  - `mixed`: the data comes from a combination of surveys, administrative sources, or other collection methods.
- `firm_sample`: information about the population of firms present in the datafile (usually a short piece of text, kept as concise as possible).
- `preprocessing`: instructions on how to perform a data preprocessing operation on the raw datafile. For more details, check the dedicated box.
An example (for NSI = FI, 2018) of the metadata for the raw data files (the columns yearvar, year_end, path and details are omitted for readability):
| NSI_datafile | NSI_dataset | year_start | format | firm_unit | data_source | preprocessing |
|---|---|---|---|---|---|---|
| bd2018 | bd | 2018 | csv | legal_unit | administrative_source | NA |
| br2018 | br | 2018 | csv | legal_unit | administrative_source | NA |
| bs2018 | bs | 2018 | csv | legal_unit | administrative_source | NA |
| cis2018 | cis | 2018 | csv | legal_unit | survey | NA |
| ictec2018 | ictec | 2018 | csv | legal_unit | survey | NA |
| ifats2018 | ifats | 2018 | csv | legal_unit | administrative_source | NA |
| *Note: Only the first 5 rows are displayed. | ||||||
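To make the allowed values concrete, the sketch below shows, in Python rather than the R of the actual pre_launch_checker.R, the kind of validation that could be run on NSI_datafiles.csv. The function name, messages, and trimmed column set are hypothetical.

```python
import csv
import io

# Allowed values taken from the specification above.
FIRM_UNITS = {"plant", "legal_unit", "enterprise", "enterprise_group"}
DATA_SOURCES = {"survey", "administrative_source", "mixed"}

def check_datafiles(csv_text):
    """Return a list of (row_number, message) problems found in an NSI_datafiles table."""
    problems = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        if row["firm_unit"] not in FIRM_UNITS:
            problems.append((i, f"unknown firm_unit: {row['firm_unit']!r}"))
        if row["data_source"] not in DATA_SOURCES:
            problems.append((i, f"unknown data_source: {row['data_source']!r}"))
        if row["year_start"] and row["year_end"] and int(row["year_start"]) > int(row["year_end"]):
            problems.append((i, "year_start is after year_end"))
    return problems

example = (
    "NSI_datafile,NSI_dataset,year_start,year_end,firm_unit,data_source\n"
    "bd2018,bd,2018,2018,legal_unit,administrative_source\n"
    "cis2018,cis,2018,2018,legal_unit,questionnaire\n"  # invalid data_source
)
print(check_datafiles(example))  # [(2, "unknown data_source: 'questionnaire'")]
```

A real checker would also validate format, path, and the cross-references to the other metadata files.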
3.2.1.2 File-specific metadata – NSI_varnames.csv
This file contains the list of all variables in each raw datafile appearing in the column NSI_datafile of NSI_datafiles.csv. It has the following columns:
[1] NSI_datafile,NSI_varname,is_key,description,class,domain
where:
- `NSI_datafile` is the name of the file in the NSI environment.
- `NSI_varname` is the name (hopefully mnemonic) of the variable in the raw file.
- `is_key` is a boolean stating whether the variable belongs to the (possibly joint) unique key of the datafile; e.g. `firmid`, or `firmid,year`, are often the unique key(s).
- `description` contains a description of the variable, if possible using the Eurostat convention.
- `class` is the type of value that the variable holds. The following data types can be encountered:
  - numeric: numbers with or without decimals (e.g., 3, 4.5).
  - character: text or string values (e.g., “apple”).
  - date: calendar dates stored as `Date` objects (e.g., 2023-05-09).
  - logical / boolean: `TRUE`/`FALSE` values used in conditions and comparisons.
- `domain` provides information on the values of the variable. See the examples below:
  - classification: e.g. a list of industry, region, or product codes (the value is a metadata filename, e.g. NSI_classname_class.csv, which provides a list of permissible values and descriptions).
  - file-specific codebook of categorical answers (the value is a metadata filename, e.g. *NSI*_codebook.csv, containing permissible values such as ‘yes’, ‘no’, ‘maybe’, or ‘small’, ‘medium’, ‘large’).
  - For other values:
    - For monetary values: “1000” (for 1000 Euros).
    - For dates: “%m%d%Y” (the R date format for mmddyyyy). For a ‘year’ variable, we use “%Y”.
    - For real units, choose from: “ton” (weight, 1000 kg), “m3” (volume), “GJ” (energy), “unit” (1 item).
3.2.1.2.1 Domain: Expenditures, Quantities, Dates
| Measure | Domain Entry | Description |
|---|---|---|
| Expenditure | 1000 | … or 1 Euro; 10000000 Euro; etc. |
| Foreign currency | 1*FXC | … or 1000 etc.; Where FXC is an ISO 4217 3-letter currency code |
| Employment | 1 | 1 here refers to 1 FTE; or 1000; … or 1 Emp if in persons |
| Numerical | 1 | 1 here refers to 1 unit; … or 10; 100; where ‘unit’ gives unit in lowercase for the variable in the NSI data file |
| Date | %Y-%m-%d | Use the R date format that matches the values for the NSI date or year variable |
3.2.1.2.2 Domain: Classification or Categorical (factor) variables
| Variable | Domain_Entry | Description |
|---|---|---|
| Classification variable | NSI_classname_class | An (official) list, e.g. NL_nace |
| Categorical variable | NSI_codebook | Contains permissible values for categorical (factor) variables, e.g. ‘yes’, ‘no’,‘maybe’ |
3.2.1.2.3 Example: Netherlands (SBS, 2018): NL_varnames.
| NSI_datafile | NSI_varname | is_key | description | class | domain |
|---|---|---|---|---|---|
| sbs2018 | ent_id | 1 | Enterprise ID (identification | character | |
| sbs2018 | sbs_12110 | 0 | Turnover | numeric | 1000 |
| sbs2018 | sbs_12150 | 0 | Value added at factor cost | numeric | 1000 |
| sbs2018 | sbs_12170 | 0 | Gross operating surplus | numeric | 1000 |
| sbs2018 | sbs_13110 | 0 | Total purchases of goods and s | numeric | 1000 |
| *Note: Only the first 5 rows are displayed. | |||||
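One useful cross-file check, in the spirit of pre_launch_checker.R, is that every NSI_datafile named in NSI_varnames.csv also appears in NSI_datafiles.csv. A minimal Python sketch (the helper name is hypothetical; the actual checker is written in R):

```python
import csv
import io

def undeclared_datafiles(datafiles_csv, varnames_csv):
    """Names used in NSI_varnames whose NSI_datafile is missing from NSI_datafiles."""
    declared = {r["NSI_datafile"] for r in csv.DictReader(io.StringIO(datafiles_csv))}
    used = {r["NSI_datafile"] for r in csv.DictReader(io.StringIO(varnames_csv))}
    return sorted(used - declared)

datafiles = "NSI_datafile,NSI_dataset\nsbs2018,sbs\n"
varnames = ("NSI_datafile,NSI_varname\n"
            "sbs2018,ent_id\n"
            "sbs2019,ent_id\n")  # sbs2019 is never declared

print(undeclared_datafiles(datafiles, varnames))  # ['sbs2019']
```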
3.2.1.3 Codebook for categorical variables – NSI_codebook.csv
This file contains the possible values of a categorical variable and the description that belongs to each value. The rows give the possible values occurring in firm data for a particular NSI_datafile and NSI_varname. The name of the codebook should be given in the ‘domain’ column of NSI_varnames for the relevant categorical variable.
[1] NSI_dataset,NSI_varname,year,code,description
where:
NSI_datasetis the name of the generic dataset in the NSI environment.NSI_varnameis the name of the variable of that specific raw dataset.yearis the year for which codebook values hold. If empty, holds for all years of the NSI dataset.codegives all the values of the categorical variable that occur for thatNSI_varnamein thatNSI_datafile.descriptiongives the description explaining each code value.
As mentioned, if a specific mapping holds for all available years of a specific NSI_dataset, then the year column for that mapping needs to be empty. For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112 being a categorical variable taking values ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ for all years. In this case, the NSI_codebook will contain only three rows for these three mappings, without any reference to the years. In practice:
| NSI_dataset | NSI_varname | year | code | description |
|---|---|---|---|---|
| data_ictec | var112 | 0 | no | |
| data_ictec | var112 | 2 | yes | |
| data_ictec | var112 | 999 | not available |
That said, if a mapping is not constant across all years of an NSI_dataset, then the year column needs to have a value for all mappings reported. In this context, there can be two cases:
- The codes differ by year for the same variable: `var_112` takes values, say, ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ in 2006, and in 2007 the mapping changes to ‘1’, ‘2’, ‘9’ for ‘no’, ‘yes’, ‘not available’, respectively. Hence, we need to indicate year-specific mappings in the codebook table:
| NSI_dataset | NSI_varname | year | code | description |
|---|---|---|---|---|
| data_ictec | var112 | 2006 | 0 | no |
| data_ictec | var112 | 2006 | 2 | yes |
| data_ictec | var112 | 2006 | 999 | not available |
| data_ictec | var112 | 2007 | 1 | no |
| data_ictec | var112 | 2007 | 2 | yes |
| data_ictec | var112 | 2007 | 9 | not available |
- The NSI_varname isn’t available for all years: As an example, let `var_112` be available only in 2005 and 2006, but dropped or renamed in the other years. Then we need to include it for both 2005 and 2006, regardless of whether the codes are identical or not:
| NSI_dataset | NSI_varname | year | code | description |
|---|---|---|---|---|
| data_ictec | var112 | 2005 | 0 | no |
| data_ictec | var112 | 2005 | 2 | yes |
| data_ictec | var112 | 2005 | 999 | not available |
| data_ictec | var112 | 2006 | 0 | no |
| data_ictec | var112 | 2006 | 2 | yes |
| data_ictec | var112 | 2006 | 999 | not available |
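The lookup rule implied by the year column (a year-specific row wins; an empty year means the mapping holds for all years of the NSI dataset) can be sketched as follows. The decode helper is hypothetical, written in Python for illustration rather than the R used by the MDI:

```python
import csv
import io

def decode(codebook_csv, dataset, varname, year, code):
    """Look up a code's description: a year-specific row wins; an empty
    year field means the mapping holds for all years of the dataset."""
    rows = [r for r in csv.DictReader(io.StringIO(codebook_csv))
            if r["NSI_dataset"] == dataset
            and r["NSI_varname"] == varname
            and r["code"] == code]
    for r in rows:                       # prefer the year-specific mapping
        if r["year"] == str(year):
            return r["description"]
    for r in rows:                       # fall back to the all-years mapping
        if r["year"] == "":
            return r["description"]
    return None

codebook = ("NSI_dataset,NSI_varname,year,code,description\n"
            "data_ictec,var112,,0,no\n"
            "data_ictec,var112,,2,yes\n"
            "data_ictec,var112,,999,not available\n")
print(decode(codebook, "data_ictec", "var112", 2010, "2"))  # yes
```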
Below is an example of the values occurring for the unit of measurement in the SI PRODCOM data for 2012.
| NSI_dataset | year | NSI_varname | code | description |
|---|---|---|---|---|
| MIKRO_INDL_razST | NA | ME | 1000 SIT | thousands of Slovenian tolars |
| MIKRO_INDL_razST | NA | ME | EUR | euros |
| MIKRO_INDL_razST | NA | ME | GJ | Gigajoule - a unit of energy |
| MIKRO_INDL_razST | NA | ME | MWh | Megawatt-hour - a unit of energy |
| MIKRO_INDL_razST | NA | ME | TJ | terajoules |
| *Note: Only the first 5 rows are displayed. | ||||
3.2.1.4 Classification lists – NSI_classvar_class.csv
This file contains the unique list of codes per year of a specific classification variable in a country. Note that there should be a list for every classification variable in each dataset. The related table has the following columns:
[1] code,year,description
where:
- `code` is the list of values of the classification variable observed in the data.
- `year` is the related year. If the mapping does not change across all years available for that NSI_varname, then the `year` column will be filled with `NA`.
- `description` gives the description for each code value.
A sample of rows from the table of PRODCOM codes (in this case some randomly selected rows from the list of codes of Finland):
| code | year |
|---|---|
| 27512630 | 2020 |
| 22214180 | 2019 |
| 19301352 | 2005 |
| 20142320 | 2021 |
| 26702490 | 2016 |
| 20165490 | 2013 |
| 24521090 | 2019 |
| 28133200 | 2010 |
| 15842150 | 2005 |
| 19303150 | 2006 |
| 13301380 | 2016 |
| 26518110 | 2021 |
| 20101032 | 2004 |
| 16292320 | 2012 |
| 26518550 | 2011 |
3.2.1.5 Time concordances for classifications – NSI_classvart0_classvart1_conc.csv
This file contains the concordance table for classification lists between a code at time t-1 and the corresponding code at time t. This means that all codes appearing in the reference year should be present in column left, regardless of whether they change or disappear in the following year. The reference year is indicated in column year; in other words, t = year. This file is currently used when concording a classification variable to the latest year observed in the data, as explained here.
[1] left,right,year
where:
- `left` is the code at time t-1.
- `right` is the value(s) the code in column `left` can take at time t.
- `year` is the reference year.
The table needs to contain the full history of all codes. In other words, if a code is present in column right at year t, then a row with that code in column left must be present where year is equal to t+1.
For example, let code ‘5678’ turn into ‘1234’ in year 2003. In the table, we would then have the following row:
| left | right | year |
|---|---|---|
| 5678 | 1234 | 2003 |
This now means that ‘1234’ is present in the data for year 2003. As a result, ‘1234’ must appear in column left when year is ‘2004’.
In this case, four things could happen:
- The code survives:
| left | right | year |
|---|---|---|
| 1234 | 1234 | 2004 |
- The code is substituted by one or more new codes:
If ‘1234’ is substituted by ‘9012’:
| left | right | year |
|---|---|---|
| 1234 | 9012 | 2004 |
If ‘1234’ is substituted by ‘9012’ and ‘3456’ (‘one-to-many’ case):
| left | right | year |
|---|---|---|
| 1234 | 9012 | 2004 |
| 1234 | 3456 | 2004 |
- The code is substituted by one or more new codes but it can still be found in 2004:
If ‘1234’ is partially substituted by ‘9012’ but appears in 2004 (special ‘one-to-many’ case):
| left | right | year |
|---|---|---|
| 1234 | 9012 | 2004 |
| 1234 | 1234 | 2004 |
- The code is dropped:
If ‘1234’ doesn’t appear in the data in 2004:
| left | right | year |
|---|---|---|
| 1234 | 2004 |
For example, the following table shows what such a table looks like for a subset of ITGS codes in PT:
| left | right | year |
|---|---|---|
| 39172991 | 39172990 | 2005 |
| 85243990 | 85243980 | 2001 |
| 48064090 | 48239090 | 2001 |
| 90309010 | 90309085 | 2005 |
| 8802 20 00 | 8806 91 00 | 2021 |
| 85319020 | 85423910 | 2006 |
| 3215 19 90 | 3215 19 00 | 2017 |
| 84717090 | 84717098 | 2005 |
| 3901 90 90 | 3901 40 00 | 2016 |
| 0302 90 00 | 0302 91 00 | 2016 |
| 86073099 | 86073000 | 2010 |
| 72254020 | 72254012 | 2003 |
| 37061010 | 37061011 | 1994 |
| 85066010 | 85066000 | 2010 |
| 22042183 | 22042142 | 2009 |
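The full-history requirement above makes it possible to follow any code forward to the latest year by chaining left-to-right mappings. The sketch below illustrates this in Python (the real MDI concording code is R, and the function name is hypothetical); per the specification every code should have a row for every year, but the sketch lets an unlisted code survive unchanged:

```python
def concord_forward(conc_rows, code, start_year, end_year):
    """Follow a classification code through successive left->right mappings.

    conc_rows: list of (left, right, year) tuples, where 'year' is the
    reference year t and 'left' is the code at t-1. Returns the set of
    codes at end_year. An empty 'right' means the code was dropped.
    """
    current = {code}
    for year in range(start_year + 1, end_year + 1):
        nxt = set()
        for c in current:
            successors = [r for (l, r, y) in conc_rows if l == c and y == year]
            if successors:
                nxt.update(r for r in successors if r)  # empty right = dropped
            else:
                nxt.add(c)  # no row: assume the code survives unchanged
        current = nxt
    return current

conc = [
    ("5678", "1234", 2003),   # 5678 becomes 1234 in 2003
    ("1234", "9012", 2004),   # 1234 splits into 9012 and 3456 in 2004
    ("1234", "3456", 2004),
]
print(sorted(concord_forward(conc, "5678", 2002, 2004)))  # ['3456', '9012']
```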
3.2.1.6 Key ID concordance – NSI_keyID1_keyID2_conc.csv
This file contains a concordance table between two firm unit codes by year.
[1] "keyID1" "keyID2" "year"
where:
- `keyID1` is the first firm unit code.
- `keyID2` is the second firm unit code.
- `year` is the reference year.
For example, a concordance between units firmid and entgrp could look like:
| firmid | entgrp | year |
|---|---|---|
| EoYncPX1QK | ZDfQhvv | 2019 |
| Zn1yzAeYA4 | oOUz7Ep | 2021 |
| B4iCIhzgPi | oOUz7Ep | 2013 |
| 9sJnqQM0lo | ZDfQhvv | 2002 |
| FPSgiOjwA7 | MEVq1XW | 2007 |
| 0g0AFLyHCe | MEVq1XW | 2015 |
| hlmrG4AyLu | oOUz7Ep | 2011 |
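Using such a table, a firm-level record can be linked to its higher-level unit with a simple (keyID1, year) lookup. A minimal sketch with made-up records:

```python
# Hypothetical firm-year records and a firmid -> entgrp concordance keyed by year.
firm_data = [
    {"firmid": "EoYncPX1QK", "year": 2019, "rev": 120.0},
    {"firmid": "Zn1yzAeYA4", "year": 2021, "rev": 80.0},
]
conc = {
    ("EoYncPX1QK", 2019): "ZDfQhvv",
    ("Zn1yzAeYA4", 2021): "oOUz7Ep",
}

# Attach the higher-level identifier via a (keyID1, year) lookup.
for rec in firm_data:
    rec["entgrp"] = conc.get((rec["firmid"], rec["year"]))

print([r["entgrp"] for r in firm_data])  # ['ZDfQhvv', 'oOUz7Ep']
```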
3.2.2 Specifications for the MD Metadata
Work in this section is a collaboration between NSI and MDI staff.
In an iterative process, using NSI metadata for each country, and taking into account research needs of MD users, a specification is made of the MD panels and variables.
- MD_datafiles.csv describe the harmonized panel datasets generated in each launch
- MD_varnames.csv describe the variables per dataset, with their description, class, and domain.
- MD classifications: versions of official classifications, such as EU NaceR2 activities or NUTS2 regions
- MD codebooks: valid values for categorical variables
3.2.2.1 List of micro-dataset (MD) panels – MD_datafiles.csv
This file contains the list of all harmonized firm-level micro data panels (MD datafiles) generated by the MDI code in each launch. Researchers can use these panels via code modules, either individually or linked at the firm-year level. The file has the following columns:
[1] MD_dataset,description,details
where:
- `MD_dataset` is the name of the MD panel (an R data.table) at runtime of the launch.
- `description` is a description of the panel and its underlying source data.
- `details` contains additional notes on the file.
Below is the list of currently available MD panels:
| MD_dataset | description | details |
|---|---|---|
| BR | Business Register | see: https://ec.europa.eu/eurostat/ |
| BS | Balance Sheet | Balance Sheet on Enterprise groups |
| BD | Business Dynamics | |
| SBS | Structural Business Statistics | |
| CIS | Community Innovation Survey | (only available in even-numbered years) |
| ICTEC | ICT Usage in Enterprises Survey | https://ec.europa.eu/eurostat/cache |
| ITGS | International Trade in Goods | |
| ITS | International Trade in Services | |
| OFATS | Outgoing Foreign Affiliates Statistics | |
| IFATS | Incoming Foreign Affiliates Statistics | |
| ENER | Energy Use at Firms | in-progress harmonization across countries |
| PRODCOM | Production Communitaire by firm and | https://ec.europa.eu/eurostat/web/p |
3.2.2.2 Micro-dataset (MD) variables – MD_varnames.csv
This file contains the list of all variables available in all the MD firm-level panel datasets that have been generated by the MDI code using the NSI datafiles, NSI metadata, and the NSI-MD concordances. The file has the following columns:
[1] MD_varname,MD_dataset,is_key,description,class,domain
where:
- `MD_dataset` is the name of the MD firm-level panel dataset, i.e. BR, SBS, etc.
- `MD_varname` is the name of the variable in the virtual firm-level dataset.
- `is_key` is a boolean stating whether the variable belongs to the (possibly joint) unique key of the dataset; e.g. `firmid`, or `firmid,year`, are often the unique key(s).
Given that an MD dataset can have a different unique identifier depending on the raw data it is based on (be it plant, legal unit, enterprise, or enterprise group; see the NSI_datafiles section under firm_unit), is_key takes value 1 for each of the four possible unit types.
That said, in the harmonized MD dataset researchers will work on, only one of the four units is available, allowing module writers to aggregate or disaggregate across units (when possible) using the tool mdi_key_id_switch() from the mdi package. The tool applies different aggregation or disaggregation methods depending on the MD_varname, as indicated in the file MD_aggr_disaggr_methods.csv.
- `description` contains a description of the variable, if possible using the Eurostat convention.
- `class` is the type of value that the variable holds (e.g. integer, character, boolean, etc.).
- `domain`:
  - classification: e.g. a list of industry, region, or product codes (the value is a metadata filename, e.g. MD_filename_varname_list.csv, which provides a list of permissible values and descriptions).
  - MD-specific codebook of categorical answers (the value is a metadata filename, e.g. MD_codebookname_codes.csv, containing permissible values such as ‘yes’, ‘no’, ‘maybe’).
  - For other values:
    - For monetary values: “1000” (for 1000 Euros).
    - For dates: “%m%d%Y” (the R date format for mmddyyyy). For a ‘year’ variable, we use “%Y”.
    - For real units, choose from: “ton” (weight, 1000 kg), “m3” (volume), “GJ” (energy), “unit” (1 item).
3.2.2.2.1 Domain: Expenditures, Quantities, Dates
| Measure | Domain_Entry | Description |
|---|---|---|
| Expenditure | 1000 Euro | |
| Employment | 1 FTE | … or 1 Emp if in persons |
| Numerical | 1 ‘unit’ | ‘Unit’ gives unit used in NSI data file, or is left blank if just a count. |
| Date | %Y | For now, we use a R format for 4-digit year as the date variable |
| Weight | 1 kg | |
| Volume | 1 m3 | |
| Area | 1 m2 | |
| Length | 1 m | |
| Energy | 1 GJ | GigaJoule |
3.2.2.2.2 Domain: Classification or Categorical (factor) variables
| Variable | Domain_Entry | Description |
|---|---|---|
| Classification variable | NSI_classname_class | An (official) list, e.g. NL_nace |
| Categorical variable | NSI_codebook | Contains permissible values for categorical (factor) variables, e.g. ‘yes’, ‘no’,‘maybe’ |
Below is a sample of 5 rows of the file MD_varnames with harmonized MD variables
| MD_dataset | MD_varname | description | domain |
|---|---|---|---|
| CIS | inpssu | Introduced onto the marke | |
| ICTEC | RBTS | Use service robots | |
| ICTEC | CRMSTR | share of information with | |
| BD | merger | Enterprise merged with an | |
| CIS | year | Year | %Y |
3.2.2.3 Classification lists – MD_classvar_class.csv
This file contains the unique list of codes per year of a specific classification variable from the MD panels. Note that there should be a list for every classification variable in each MD dataset. The related table has the following columns:
[1] code,description
where:
- `code` is the list of values of the classification variable observed in the data.
- `description` gives the description for each code value.
An example of the table for NACE codes (in this case the official EU NaceR2 classification):
| code | description |
|---|---|
| C17.2.2 | ____Manufacture of household and sanitary goods and of toilet requisit |
| C17.2.3 | ____Manufacture of paper stationery |
| C17.2.4 | ____Manufacture of wallpaper |
| C17.2.9 | ____Manufacture of other articles of paper and paperboard |
| C18 | __Printing and reproduction of recorded media |
| C18.1 | ___Printing and service activities related to printing |
| *Note: Only 5 rows are displayed. | |
3.2.2.4 Hierarchy files for classifications – MD_classvar_hier.csv
This file contains a series of columns that refer to different nodes of the classification variable in question. With this file, the user can easily aggregate or disaggregate the data based on the different nodes of the classification variable.
The columns of the file are labelled h_X, where X runs from 0 (the most detailed code) to N (the top of the hierarchy), each denoting one of the available levels of the classification variable.
An example of a hierarchy table for NACE codes (in this case the official EU NaceR2 classification):
| h_0 | h_1 | h_2 | h_3 | h_4 |
|---|---|---|---|---|
| 6491 | 649 | 64 | K | TOT |
| 2222 | 222 | 22 | C | TOT |
| 9810 | 981 | 98 | T | TOT |
| 2331 | 233 | 23 | C | TOT |
| 4743 | 474 | 47 | G | TOT |
| *Note: Only 5 rows are displayed. | ||||
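The h_X columns make level switches a simple lookup: map each detailed code to the desired column, then aggregate. A hypothetical Python sketch using NACE-style rows like those above:

```python
# Hypothetical hierarchy rows (h_0 = most detailed code, h_4 = total economy).
hier = [
    {"h_0": "6491", "h_1": "649", "h_2": "64", "h_3": "K", "h_4": "TOT"},
    {"h_0": "2222", "h_1": "222", "h_2": "22", "h_3": "C", "h_4": "TOT"},
    {"h_0": "2331", "h_1": "233", "h_2": "23", "h_3": "C", "h_4": "TOT"},
]
lookup = {row["h_0"]: row for row in hier}

# Aggregate made-up firm-level values from 4-digit codes to the section level (h_3).
firms = [("6491", 10.0), ("2222", 5.0), ("2331", 7.0)]
by_section = {}
for code, value in firms:
    section = lookup[code]["h_3"]
    by_section[section] = by_section.get(section, 0.0) + value

print(by_section)  # {'K': 10.0, 'C': 12.0}
```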
3.2.2.5 Codebook for categorical variables – MD_codebook.csv
This file contains the possible values of a categorical variable and the description that belongs to each value. Note that sometimes a particular codebook is ‘re-used’ for multiple variables. The name of the codebook should be given in the ‘domain’ column of the metadata for the file containing the categorical variable.
[1] MD_dataset,MD_varname,code,description
where:
- `MD_dataset` is the name of the MD firm-level panel dataset, i.e. BR, SBS, etc.
- `MD_varname` is the name of the variable of that specific MD dataset.
- `code` gives the valid values of the categorical variable.
- `description` gives the description for each code value.
Below is an example of the values given for some variables in the MD BD (business dynamics) and BR (business register) datasets:
| MD_dataset | MD_varname | code | description |
|---|---|---|---|
| BD | status | 1 | born in reference year |
| BD | status | 2 | active entire reference year |
| BD | status | 3 | dead in reference year |
| BD | status | 4 | born and dead in reference year |
| BR | demo | 0 | No demographic relation in ref. year |
| BR | demo | 1 | Receiving employment from other enterprise in ref. year |
| *Note: Only 5 rows are displayed. | |||
3.2.2.6 Key ID overview – MD_idInfo.csv
As we need to coordinate our data work across multiple countries, there are differences in what the key identifiers of the different MD datasets are. The table below illustrates the situation for the countries to which we currently have access.
| MD_dataset | AT | DE | FI | FR | NL | PTx | PT | SI | GB |
|---|---|---|---|---|---|---|---|---|---|
| BR | firmid | firmid | firmid | firmid | entid | entid | firmid | plantid | NA |
| BD | firmid | firmid | entid | firmid | NA | ||||
| BS | firmid | firmid | entgrp | entid | firmid | firmid | NA | ||
| CIS | firmid | firmid | firmid | entid | firmid | NA | |||
| ENER | plantid | firmid | plantid | entid | firmid | plantid | NA | ||
| ICTEC | firmid | firmid | firmid | entid | firmid | entid | NA | ||
| IFATS | firmid | firmid | entid | firmid | NA | ||||
| ITGS | firmid | firmid | firmid | firmid | entid | entid | firmid | firmid | NA |
| ITS | firmid | NA | |||||||
| OFATS | firmid | firmid | firmid | NA | |||||
| PRODCOM | plantid | firmid | firmid | firmid | entid | entid | firmid | plantid | NA |
| SBS | firmid | firmid | firmid | firmid | entid | entid | firmid | firmid | NA |
3.2.2.7 MD_aggr_disaggr_methods.csv
This table contains instructions on how a specific MD_varname is aggregated or disaggregated to a higher or lower firm unit. It is used only by the mdi_keyID_switch.R file, when a module writer wants to perform such an operation on a given harmonized MD dataset.
[1] MD_dataset,MD_varname,class,aggregation_method,disaggregation_method
- `MD_dataset` is the dataset name the variable belongs to (e.g. SBS, BS, PRODCOM, BR). This determines the source of the variable within the integrated microdata framework.
- `MD_varname` is the standardized variable identifier, harmonized across datasets (e.g. `emp`, `rev`, `pay`, `assets`). Used to link equivalent variables across datasets.
- `class` is the variable type (numeric, categorical, boolean, date). It defines which operations are logically and statistically valid for the variable.
- `aggregation_method` is the rule for aggregating data from a lower level to a higher level (e.g. plant \(\rightarrow\) firm, firm \(\rightarrow\) group). Specifies how observations are collapsed across identifiers during aggregation.
- `disaggregation_method` is the rule for splitting or allocating data from a higher level to a lower level (e.g. firm \(\rightarrow\) plant). Indicates which weighting logic or fallback hierarchy is used to distribute values.
Aggregation Methods
- `sum`: Adds up all values in the group. Used for additive variables such as employment, turnover, pay, or total assets.
- `mean`: Calculates the simple arithmetic mean. Used for ratio or intensity variables (e.g. productivity, profitability ratios).
- `weighted_avg:<var1>|<var2>|...`: Computes a weighted average using one or more candidate weighting variables; the first available candidate is used. Example: `weighted_avg:emp|rev` weights by `emp` if available, otherwise by `rev`.
- `mode`: Returns the most frequent category (the statistical mode). Used for qualitative variables like ownership type or legal form.
- `weighted_mode:<var1>|<var2>|...`: Returns the category that maximizes the weighted frequency count. Example: `weighted_mode:rev|emp` gives greater weight to categories from larger firms.
- `any`: Logical aggregation returning `TRUE` if any record in the group is `TRUE`. Used for indicators such as export participation.
- `all`: Logical aggregation returning `TRUE` only if all records in the group are `TRUE`. Useful for group-level flags (e.g. all plants meet environmental certification).
- `min`: Returns the smallest value or earliest date in the group. Useful for start dates or minimum rates.
- `max`: Returns the largest value or latest date in the group. Useful for end dates or maximum thresholds.
Disaggregation Methods
- `equal`: Splits the higher-level total equally across all lower-level entities. Example: 100 employees across 4 plants \(\rightarrow\) each gets 25.
- `replicate`: Copies the same value across all sub-entities. Used for categorical variables like region, legal form, or activity code.
- `weighted_alloc:<dataset.var1>|<dataset.var2>|...|equal`: Allocates a higher-level value proportionally using variables from other datasets that exist at the disaggregated level. The listed candidates are checked in order, and the first available is used. Example: `weighted_alloc:PRODCOM.rev|ITGS.ntrade|SBS.emp|equal` uses product-level revenue; if unavailable, trade value; then employment; then an equal split.
- `proportional_alloc` (optional): Variant of `weighted_alloc` where weights are normalized within each group. Usually equivalent to `weighted_alloc` in implementation.
This design ensures that numerical variables preserve total consistency, while categorical and boolean fields retain logical coherence during aggregation and disaggregation.
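Both weighted_avg and weighted_alloc specify a fallback chain of candidate weights separated by ‘|’, with the first available candidate used. Resolving such a spec can be sketched as follows (the helper name is hypothetical; the MDI implementation is R):

```python
def pick_weight(spec, available):
    """Resolve a method spec such as 'weighted_avg:emp|rev' or
    'weighted_alloc:PRODCOM.rev|ITGS.ntrade|equal' to the first
    candidate weight that is actually available; 'equal' always matches."""
    _, _, chain = spec.partition(":")
    for candidate in chain.split("|"):
        if candidate == "equal" or candidate in available:
            return candidate
    return None

print(pick_weight("weighted_avg:emp|rev", {"rev"}))                        # rev
print(pick_weight("weighted_alloc:PRODCOM.rev|ITGS.ntrade|equal", set()))  # equal
```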
Below a sample of five rows contained in the table:
| MD_varname | MD_dataset | aggregation_method | disaggregation_method |
|---|---|---|---|
| rdemp | SBS | sum | weighted_alloc:BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal |
| start_nace | BR | mode | replicate |
| distr_heat_noenerg | ENER | sum | weighted_alloc:SBS.nv|SBS.emp|BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal |
| inpdgd | CIS | any | replicate |
| fte | SBS | sum | weighted_alloc:BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal |
| *Note: Only 5 rows are displayed. | |||
3.2.3 Specifications for metadata needed for the NSI to MD harmonization
Harmonization of MD panels entails harmonization of units of observation, variable definitions, and variable values.
The key to harmonization is NSI metadata, MD metadata, and NSI to MDI concordances.
The MD standard metadata is found ‘iteratively’ and can evolve as countries join and as new MDI research users and MDI launches have different data requirements.
- The MD metadata and NSI to MDI concordances allow live updates of the MDI data documentation.
Mapping units of ‘firms’, enterprises, legal units requires knowledge of NSI source data: registers, (weighted) sampling, sample designs.
Harmonizing variable definitions and nomenclature is done through renaming, revaluing, or combining NSI variables.

- In the *NSI*\_MD\_conc.csv file, information is available showing how an MD variable (from a particular MD dataset) is generated from NSI variables, through the harmonization operations remap, revalue, or redefine.

Harmonizing values of classification variables is done by reclassifying values over time to the MD standard.

- A concordance for each NSI classification version to the MDI standard is needed. Each observed value of the classification code in the raw data needs to be mapped to the MD classification, otherwise the raw data observations are lost. This is done using the concordance file *NSI*_*classname*_MD_*classname*_classconc.csv.

Harmonizing categorical variables is done by recoding between conforming values from codebooks.

- To harmonize data values for categorical variables, a concordance between NSI and MD code values is made in *NSI*__MD_codeconc.csv.

To concord other data values ((currency) units, date values), R functions are used to revalue.

- E.g. if the domain of the variable in the NSI data is 1000 and in the MD data 1, then the NSI value is multiplied by 1000. If the NSI value is in an R date format, say %d%m%Y, an R date function is used to convert it to the required R date value.
Storing an MDvarname in an MD panel
Only remap and redefine rows store columns in the final MD panel. Hence, for an MD variable to be present in the output data, one of these two methods needs to be used.
The reason for this is that revalue, recode and reclass only change the content of the NSI_varname, given that the NSI_varname on which the operation has been applied could be used for multiple mappings, be it a remap or a redefine, in the same concordance year.
Therefore, if you would like to store an MD_varname after a revalue, recode or reclass operation, make sure you add a row for the same NSI_varname-MD_varname mapping with method=remap.
3.2.3.1 Concordance file – NSI_MD_conc.csv
This file contains the list of all variables in a particular MD panel, with information on how to map the NSI variables from one or more raw datafiles (often one per year) to the MD variable. The related table has the following column names:
[1] MD_dataset,MD_varname,NSI_dataset,NSI_varname,method,detail,year
where:
- `NSI_dataset` is the generic name of the data, which together with `year` specifies the NSI datafile that hosts the variable to be used in concording. If year is empty, the concordance does not change over the years.
- `year` is the year for which the concordance holds. If the mapping involves an NSI_datafile which is a panel, the column needs to be filled with just the first year available. If the NSI_datafile is a cross-section, the column needs to be filled with the year it refers to (in other words, there has to be one set of mapping rows per NSI_datafile).
- `NSI_varname` is the name of the variable in the NSI datafile.
- `MD_dataset` is the name of the MD firm-level panel dataset, i.e. BR, SBS, etc.
- `MD_varname` is the name of the variable in the MD data source to be generated. Make sure the `year` variable is not included in the concordance table, since it is harmonized separately by the infrastructure.
- `method` is the method used to harmonize the data. The value of this categorical variable determines how the harmonized variable `MD_varname` is generated:
  - `revalue`: the values of the variable are changed using an R function and parameters in the column `detail`, and possibly the `class` and `domain` entries from the relevant `_varnames` files.
  - `recode`: the values of the variable are changed using a codebook concordance, whose name is given in the `detail` column, e.g. ‘NSI_filename_MD_dataset_codeconc.csv’. Only values that need to be changed require a row in the `_codeconc`.
  - `reclass`: the values of the variable are changed using a classification concordance, whose name is given in the `detail` column, ‘NSI_classname_MD_classname_classconc.csv’. This is used to reclassify e.g. industry or region classifications.
  - `remap`: the name of the variable is changed, in a one-to-one mapping from `NSI_varname` to `MD_varname`.
  - `redefine`: the MD variable is generated as a linear combination of the NSI variables. The column `detail` specifies the linear combination (i.e. ‘+’ or ‘-’) in the many-to-one `NSI_varname` to `MD_varname` mapping.
detailcontains the function for revalue, the concordance for codebook or classification for recode and remap, and the linear operations for redefine. For revalue, any valid operation operating on the NSI_varname (referred to asx) is good. If the domains of the variable in NSI data is 1000 and in MD data 1, then the NSI value is multiplied by 1000, so detail =x*1000. If the NSI value is in an R date-value, say%d%m%Y, an R date function is used convert to the required R date-value, format(as.Date(x,“%d%m%Y”),“%Y”)NSI_datafileis the name of the raw dataset from where theNSI_variableis taken fromyearis the reference year for that specific row, which will be used to construct the MD-cross section for that year
The year variable for each MD_dataset is automatically mapped to the harmonized data based on the metadata and the year value assigned for the corresponding concordance table rows. Hence, please do not add any row in the concordance table where MD_varname = 'year'.
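The revalue detail expressions mentioned above can be tested interactively in R before they are entered in the concordance table. A minimal sketch (input values are hypothetical):

```r
# Domain conversion: NSI domain 1000 vs MD domain 1  ->  detail = x*1000
x <- 12.5
x * 1000                              # 12500

# Date conversion: NSI date stored as %d%m%Y, MD wants the year
x <- "31121999"
format(as.Date(x, "%d%m%Y"), "%Y")    # "1999"
```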
3.2.3.1.1 A note on preprocessed datafiles
Given that some raw datafiles require specific preprocessing, in very special cases some NSI_varnames might end up being different from those appearing in the file NSI_varnames. Hence, if you would like to concord variables from a dataset that requires preprocessing (which you can verify by looking at the preprocessing column of that datafile in NSI_datafiles), please keep this in mind. For more information, check the box on datafile preprocessing or get in touch with the MDI team.
3.2.3.1.2 Examples by harmonization method
As said, a revalue row in the concordance table simply transforms the content of the raw data’s column, without changing the column name to the desired MD_varname. For instance, say that you want to remove all dots from a string in raw variable var1 from NSI_dataset data_firm for year 2010. To do that, we add a row in the concordance table which looks as follows:
| NSI_dataset | year | NSI_varname | MD_dataset | MD_varname | method | detail |
|---|---|---|---|---|---|---|
| data_firm | 2010 | var1 | BR | nace | revalue | gsub("\\.", "", x) |
In practical terms, this operation will transform the raw datafile from
| … | var1 | … |
|---|---|---|
| … | 10.40 | … |
| … | 20.59 | … |
| … | 01.45 | … |
| … | 32.10 | … |
| … | … | … |
to
| … | var1 | … |
|---|---|---|
| … | 1040 | … |
| … | 2059 | … |
| … | 0145 | … |
| … | 3210 | … |
| … | … | … |
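With `x` bound to the raw column, the detail of the row above corresponds to the following R call (note that backslashes must be doubled inside R strings):

```r
x <- c("10.40", "20.59", "01.45", "32.10")
gsub("\\.", "", x)    # removes every literal dot
# -> "1040" "2059" "0145" "3210"
```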
A reclass/recode row in the concordance table maps the values of a categorical variable (be it a class or codebook variable) to some specified objective values, as indicated in the corresponding class/codeconc table. For instance, say that you want to change the mapping of categorical variable var2 from NSI_dataset survey_firm for year 2012. The raw variable can take values 1, 2 and 9, which link to 'no', 'yes', 'not available'. To do that, we assign recode in the method column (given that this is a codebook variable; we would indicate reclass for a class variable), as follows:
| NSI_dataset | year | NSI_varname | MD_dataset | MD_varname | method | detail |
|---|---|---|---|---|---|---|
| survey_firm | 2012 | var2 | CIS | inpdsv | recode | NSI_MD_codeconc |
The harmonization tool will open the codeconc file and transform the values as shown below
1 → 0
2 → 1
9 → 9
In practical terms, this operation will transform the raw datafile from
| … | var2 | … |
|---|---|---|
| … | 1 | … |
| … | 9 | … |
| … | 2 | … |
| … | 1 | … |
| … | … | … |
to
| … | var2 | … |
|---|---|---|
| … | 0 | … |
| … | 9 | … |
| … | 1 | … |
| … | 0 | … |
| … | … | … |
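Internally, a recode amounts to a left-to-right lookup in the codeconc rows. A minimal sketch of that substitution (the actual harmonization tool may implement it differently):

```r
# codeconc rows: only values that change need a row (9 -> 9 is omitted)
conc <- data.frame(left = c("1", "2"), right = c("0", "1"))

x <- c("1", "9", "2", "1")            # raw var2 values
idx <- match(x, conc$left)            # which codeconc row applies, if any
x[!is.na(idx)] <- conc$right[idx[!is.na(idx)]]
x
# -> "0" "9" "1" "0"
```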
A redefine row in the concordance table aggregates two or more NSI_varnames to create an objective MD_varname. As mentioned, the aggregation function is not restricted to a specific form: it can be a sum or subtraction of all non-NA values of the raw variables (detail = + or -) or a custom function (detail = fn('content of the function in R syntax')).
For example, if we want to sum the values of var3, var4 and var5 from datafile 2005_bs to create MD_varname nv, we add the following rows to the table
| NSI_dataset | year | NSI_varname | MD_dataset | MD_varname | method | detail |
|---|---|---|---|---|---|---|
| 2005_bs | 2005 | var3 | BS | nv | redefine | + |
| 2005_bs | 2005 | var4 | BS | nv | redefine | + |
| 2005_bs | 2005 | var5 | BS | nv | redefine | + |
This operation will transform the raw datafile from
| … | var3 | var4 | var5 | … |
|---|---|---|---|---|
| … | 12 | 4 | 15 | … |
| … | NA | 2 | 16 | … |
| … | 9 | 32 | 19 | … |
| … | 8 | 14 | NA | … |
| … | … | … | … | … |
to (the original raw variables are removed as well)
| … | nv | … |
|---|---|---|
| … | 31 | … |
| … | 18 | … |
| … | 60 | … |
| … | 22 | … |
| … | … | … |
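The `+` aggregation behaves like a row-wise sum over all non-NA values. A sketch reproducing the example above (assuming this NA handling):

```r
d <- data.frame(var3 = c(12, NA, 9, 8),
                var4 = c(4, 2, 32, 14),
                var5 = c(15, 16, 19, NA))
d$nv <- rowSums(d[, c("var3", "var4", "var5")], na.rm = TRUE)
d$nv
# -> 31 18 60 22
```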
On the other hand, if we want to sum var3 to var4 and divide the result by var5, we need to build a custom function, as in the below concordance rows:
| NSI_dataset | year | NSI_varname | MD_dataset | MD_varname | method | detail |
|---|---|---|---|---|---|---|
| 2005_bs | 2005 | var3 | BS | nv | redefine | fn((var3+var4)/var5) |
| 2005_bs | 2005 | var4 | BS | nv | redefine | fn((var3+var4)/var5) |
| 2005_bs | 2005 | var5 | BS | nv | redefine | fn((var3+var4)/var5) |
This operation will transform the raw datafile from
| … | var3 | var4 | var5 |
|---|---|---|---|
| … | 12 | 4 | 15 |
| … | NA | 2 | 16 |
| … | 9 | 32 | 19 |
| … | 8 | 14 | NA |
| … | … | … | … |
to (the original raw variables are removed as well)
| … | nv | … |
|---|---|---|
| … | 1.067 | … |
| … | NA | … |
| … | 2.158 | … |
| … | NA | … |
| … | … | … |
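Unlike the `+`/`-` aggregation, a custom fn() follows standard R arithmetic, so an NA in any input propagates to the result:

```r
d <- data.frame(var3 = c(12, NA, 9, 8),
                var4 = c(4, 2, 32, 14),
                var5 = c(15, 16, 19, NA))
with(d, (var3 + var4) / var5)
# -> 1.067 NA 2.158 NA (rounded)
```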
A remap row in the concordance table assigns the name of a given MD_varname to an NSI_varname without changing the values of the variable itself. This method is typically used to store variables in the objective MD panel either unchanged or after they were subject to a revalue or recode/reclass operation.
For example, if we want to store var6 from datafile ener_2001 to create MD_varname firmid, we add the following row to the table
| NSI_dataset | year | NSI_varname | MD_dataset | MD_varname | method | detail |
|---|---|---|---|---|---|---|
| ener_2001 | 2001 | var6 | ENER | firmid | remap | |
This operation will transform the raw datafile from
| … | var6 | … |
|---|---|---|
| … | nwejn | … |
| … | aios2 | … |
| … | cjnje | … |
| … | 29hbd | … |
| … | … | … |
to
| … | firmid | … |
|---|---|---|
| … | nwejn | … |
| … | aios2 | … |
| … | cjnje | … |
| … | 29hbd | … |
| … | … | … |
An example of the table for a few variables needed for Slovenian harmonized MD BR for year 2007 (column year is omitted):
| MD_dataset | MD_varname | NSI_dataset | NSI_varname | method | detail |
|---|---|---|---|---|---|
| BR | firmid | MIKRO_PRS_razST | MS10_razST | remap | |
| BR | entgrp | MIKRO_PRS_razST | MS10_IZP_MS7_razST | remap | |
| BR | birthyr | MIKRO_PRS_razST | Datum_prv_vnosa | revalue | as.Date(as.character(x), '%d.%m.%Y') |
| BR | exityr | MIKRO_PRS_razST | Datum_izbrisa | revalue | as.Date(as.character(x), '%d.%m.%Y') |
| BR | nace | MIKRO_PRS_razST | Skd | remap | |
| BR | soe | MIKRO_PRS_razST | Vrsta_lastnine | recode | SI_MD_codeconc |
| BR | birthyr | MIKRO_PRS_razST | Datum_prv_vnosa | remap | |
| BR | exityr | MIKRO_PRS_razST | Datum_izbrisa | remap | |
| BR | soe | MIKRO_PRS_razST | Vrsta_lastnine | remap | |
| BR | nace | MIKRO_PRS_razST | Skd | revalue | sub('^(\\d{2})\\.(\\d{2})\\d$', '\\1\\2', x) |
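The nace revalue in the table can be checked on a hypothetical Skd code (backslashes doubled as required in R strings):

```r
# "dd.ddd" Skd code -> 4-digit NACE code, dropping the dot and the last digit
sub('^(\\d{2})\\.(\\d{2})\\d$', '\\1\\2', '10.401')
# -> "1040"
```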
3.2.3.2 NSI_MD_codeconc.csv
[1] NSI_dataset,year,NSI_varname,MD_dataset,MD_varname,left,right
where:
- `NSI_dataset` is the generic name of the data that, together with `year`, specifies the NSI datafile hosting the variable to be used in concording. If `year` is empty, the concordance does not change over the years.
- `year` is the year for which the concordance holds. If empty, the same concordance rows are used for all NSI datafiles associated with the generic `NSI_dataset`.
- `NSI_varname` is the name of the variable of the specific NSI datafile associated with the dataset and year.
- `MD_dataset` is the name of the MD dataset the variable belongs to.
- `MD_varname` is the name of the corresponding MD variable.
- `left` gives the valid values of the categorical variable in the raw NSI dataset.
- `right` gives the corresponding MD dataset values to map to.
If the mapping of a categorical variable already corresponds to that of the objective MD_varname, then there is no need to add the related row to the codeconc. For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112 being a categorical variable taking the value '0' to mean 'no'. Then let the variable be mapped to MD_varname IACC of MD_dataset ICTEC. Given that, as indicated in the MD metadata, code '0' is linked to 'no' for this MD_varname, we don't need to add any row for this specific mapping.
However, if the other mappings don’t correspond, the rows in the codeconc file need to be present for those!
As mentioned, if a specific mapping holds for all years available of a specific NSI_dataset, then the year column for that mapping needs to be empty.
For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112 being a categorical variable taking values '0', '2', '999', referring to 'no', 'yes', 'not available' for all years. This variable will be harmonized to IACC of the MD_dataset ICTEC. In this case, the NSI_MD_codeconc will contain rows for these mappings without any reference to the years. In practice:
| NSI_dataset | NSI_varname | year | MD_dataset | MD_varname | left | right |
|---|---|---|---|---|---|---|
| data_ictec | var112 | | ICTEC | IACC | 2 | 1 |
| data_ictec | var112 | | ICTEC | IACC | 999 | NA |
Note that the mapping for ‘0’ - ‘no’ is missing given that it already corresponds to the objective MD mapping.
That said, if a mapping is not constant across all years of an NSI_dataset, then the year column needs to have a value for all mappings reported. In this context, there can be two cases:
- The codes differ by year for the same variable: this means that var112 takes values, say, '0', '2', '999', referring to 'no', 'yes', 'not available' in 2006, while in 2007 the mapping changes to '1', '2', '9' for 'no', 'yes', 'not available', respectively. Hence, we need to indicate year-specific mappings in the codebook table:
| NSI_dataset | NSI_varname | year | MD_dataset | MD_varname | left | right |
|---|---|---|---|---|---|---|
| data_ictec | var112 | 2006 | ICTEC | IACC | 2 | 1 |
| data_ictec | var112 | 2006 | ICTEC | IACC | 999 | NA |
| data_ictec | var112 | 2007 | ICTEC | IACC | 1 | 0 |
| data_ictec | var112 | 2007 | ICTEC | IACC | 2 | 1 |
| data_ictec | var112 | 2007 | ICTEC | IACC | 9 | NA |
- The NSI_varname isn't available for all years: as an example, let var112 be available only in 2005 and 2006, while it is dropped or has a different name in the other years. Then we need to include it for both 2005 and 2006, regardless of whether the codes are identical:
| NSI_dataset | NSI_varname | year | MD_dataset | MD_varname | left | right |
|---|---|---|---|---|---|---|
| data_ictec | var112 | 2005 | ICTEC | IACC | 2 | 1 |
| data_ictec | var112 | 2005 | ICTEC | IACC | 999 | NA |
| data_ictec | var112 | 2006 | ICTEC | IACC | 2 | 1 |
| data_ictec | var112 | 2006 | ICTEC | IACC | 999 | NA |
As the right column of an NSI_MD_codeconc file needs to have entries that are present in the MD_codebook, there can be cases in which no corresponding value can be found for a categorical value in a country's dataset. For example, say that a very specific value for the variable unit of MD dataset PRODCOM appears in a country's data and no corresponding value can be found in the MD codebook. In that case, please reach out to the MDI team, as we might consider adding that value to the MD_codebook.
Below is an example of the codebook concordance table for Portugal:
| NSI_dataset | NSI_varname | year | MD_dataset | MD_varname | left | right |
|---|---|---|---|---|---|---|
| ifats | imputeifats | NA | IFATS | imputed | 1 | 0 |
| ifats | imputeifats | NA | IFATS | imputed | 2 | 1 |
| itgs | exim | NA | ITGS | exim | 1 | 0 |
| itgs | exim | NA | ITGS | exim | 2 | 1 |
| itgs | imputeitgs | NA | ITGS | imputed | 1 | 0 |
*Note: Only 5 rows are displayed.*
3.2.3.2.1 NSI_classname_MD_classname_classconc.csv
[1] year,left,right
where:
- `year` is the year for which the concordance holds.
- `left` is the code in the current NSI classification code list for variable `NSI_classname`.
- `right` is the corresponding code in the MD classification code list the user wants to concord to.
Below is a sample of the concordance from the raw data's Combined Nomenclature code list (left) to the harmonized one (right).
| year | left | right |
|---|---|---|
| 2005 | 61019090 | 61019090D |
| 2004 | 72249014 | 72249014 |
| 2005 | 09104090 | 09104090D |
| 2004 | 85407200 | 85407200D |
| 2005 | 84145910 | 84145910D |
*Note: Only 5 rows are displayed.*
3.2.4 Data documentation and MDI implementation process in phases
Constructing the necessary metadata for the raw data and the concordance tables needed to produce the harmonized MD datasets is a particularly long and tedious process.
A necessary requirement for installing the MDI in a country's remote environment is sufficient RAM on the server. Naturally, the amount of RAM needed depends on the size of the data. An indicative measure is the ratio between the available RAM and the number of observations in the BR, which should be roughly 2 or larger.
To make the experience more manageable, as well as to give it more structure, we developed the following timeline divided into eight phases:
| Phase | Completed files |
|---|---|
| I | |
| II | |
| III | |
| IV | |
| V | |
| VI | |
| VII | |
| VIII | |
Each phase refers to the construction of a specific file, as described in the above sections. There are a few elements that haven’t been explicitly mentioned yet. A brief explanation is provided below:
Raw files cleanup: making sure the raw files directory is tidy and usable
Firm unit analysis: most granular firm identifier (plant, legal unit, enterprise, enterprise group) of each raw file (see the NSI_datafiles section to check the list of possible units)
Unique keys: the uniquely identifying columns of each raw file (see the NSI_varnames section)
Disclosure rules: detailed description of the disclosure routines in place in the NSI
import script that harmonizes variables: after Phase IV, given that the NSI_MD_conc file mappings for BR, BS (if available) and SBS are ready, those files can be harmonized. To do so, we do not (yet) import the whole infrastructure; instead, we provide a script that reads your metadata files and the relevant raw data and produces the MD panels.
upload the MDI CN module, execute it on BR, BS and SBS, and import the CompNet-related files (Phase V): after having harmonized BR, BS (if available) and SBS, we ask you to import a few more files.
CN module: This script produces some files under a specific directory
questionnaire: The Questionnaire is an Excel file that contains:
- paths
- variable names
- confidentiality routine settings
These fields need to be filled in; in particular, the variable names must reflect the country-specific variable mapping.
country file: A Stata .dta file at country-year-industry2d-sizeclass level that contains:
- population firm numbers from Eurostat
- industry-level deflators from Eurostat/EU KLEMS/AMECO
- some additional measures from public sources (e.g. 10-year government bond yields from Eurostat)
- one predefined measure from us (i.e. not from public sources; if this is an issue we could leave it out)
Stata files for CompNet: The Stata (.do) files take as input the output of the CN module, the questionnaire and the country file, and produce a limited first version of some CompNet indicators.
Executing the CompNet .do files requires that
- Stata can be used in the same remote environment as the MDI
- The NSI agrees to export the CompNet indicators and to have them published to third parties
The timeline was built based on our past experience. It is meant, first and foremost, as a help when creating metadata and concordance tables from scratch.
3.2.5 Constructing NSI metadata
This section is a guide on how to build the NSI metadata files. It references scripts that can assist users in creating the files from scratch. Naturally, these scripts alone are not enough, as many fields of the tables need to be specified manually.
1. Constructing NSI_datafiles.csv
This script scans the raw data directory in the protected NSI environment and builds the metadata table required by the CompNet rocket. It creates the *_datafiles.csv file according to the specification mentioned above.
The script can be found in this dropdown menu:
# NSI_datafiles builder (spec §3.2.1.1)
# author: AM-MM, date: 2025-09-29
library(data.table)
library(stringr)
# ---- Inputs you must define upstream ----
# dirINPUTDATA: the NSI data root (manual's reference base dir)
# dirROCKET: rocket root to use for storing NSI metadata
# CountryCode: 2-letter ISO code (e.g., "IT")
file_path <- dirINPUTDATA # use the actual base for relative paths
# ---- List files (absolute) ----
abs_files <- list.files(file_path, recursive = TRUE, full.names = TRUE)
# keep only files (exclude dirs)
abs_files <- abs_files[file.info(abs_files)$isdir == FALSE]
# ---- Build table ----
DT <- data.table(abs_path = abs_files)
# relative path to dirINPUTDATA (allow trailing slash in file_path; escape regex)
file_path_norm <- normalizePath(file_path, winslash = "/", mustWork = FALSE)
file_path_esc <- gsub("([\\^$.|?*+()\\[\\]{}\\\\])", "\\\\\\1", file_path_norm)
DT[, rel := sub(paste0("^", file_path_esc, "/?"), "", normalizePath(abs_path, winslash="/"))]
# split dir / filename / extension
DT[, filename := basename(rel)]
DT[, path := dirname(rel)]
DT[, format := tools::file_ext(filename)]
DT[, NSI_datafile := tools::file_path_sans_ext(filename)] # spec name
# ---- Derive NSI_dataset (generic) by stripping all 4-digit years and separators ----
DT[, NSI_dataset := NSI_datafile |>
str_remove_all("\\d{4}") |>
str_replace_all("[-_.]+", "_") |>
str_replace_all("^_|_$", "") |>
str_to_lower()
]
# ---- Years: extract all 4-digit tokens from the filename.
# Take min/max if present; if none found, leave NA.
# IMPORTANT: always double-check that year_start / year_end are correct,
# especially for files named like "bd2018" (single year) or with unusual patterns.
extract_years <- function(x) as.integer(str_extract_all(x, "\\d{4}")[[1]])
yrs <- lapply(DT$NSI_datafile, extract_years)
DT[, year_start := vapply(yrs, function(v) if (length(v)) min(v) else NA_integer_, integer(1))]
DT[, year_end := vapply(yrs, function(v) if (length(v)) max(v) else NA_integer_, integer(1))]
# ---- Fields to be filled manually or via later tools ----
DT[, yearvar := NA_character_] # name of year column if panel; else ""/NA
DT[, details := NA_character_]
DT[, firm_unit := NA_character_] # one of: plant, legal_unit, enterprise, enterprise_group
DT[, data_source := NA_character_] # one of: survey, administrative_source, mixed
DT[, firm_sample := NA_character_]
DT[, preprocessing:= NA_character_] # instruction string per §3.2.4–3.2.9
# ---- Final spec order EXACTLY as in the manual ----
out_cols <- c(
"NSI_datafile","NSI_dataset","yearvar","year_start","year_end",
"format","path","details","firm_unit","data_source","firm_sample","preprocessing"
)
NSI_datafiles <- DT[, ..out_cols]
# ---- Write CSV ----
outdir <- file.path(dirROCKET, "NSImetadata")
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
fwrite(NSI_datafiles, file.path(outdir, paste0(CountryCode, "_datafiles.csv")))
Steps:
- Set inputs at the top of the script
  - dirINPUTDATA: the base directory containing the raw data.
  - dirROCKET: the base directory containing the NSImetadata/ folder.
  - CountryCode: the two-letter ISO country code (e.g. "IT").
- Run the script
The script will automatically collect:
  - NSI_datafile (filename without extension)
  - NSI_dataset (generic dataset name, stripped of years and underscores)
  - year_start / year_end (from 4-digit tokens in the filename)
  - format (file extension)
  - path (relative path to dirINPUTDATA)
- Fields requiring manual completion
The following columns are created but left empty (NA). They must be filled manually by the NSI team:
  - yearvar: name of the year column if the file is a panel (leave empty otherwise).
  - details: clarifications on dataset coverage or specific notes.
  - firm_unit: one of {plant, legal_unit, enterprise, enterprise_group}.
  - data_source: one of {survey, administrative_source, mixed}.
  - firm_sample: information on the sampling scheme.
  - preprocessing: description of preprocessing steps, if any (see section below).
- Double-check the automatic fields
- Years: the script extracts all 4-digit numbers from the filename and assigns the minimum to year_start and the maximum to year_end.
⚠️ Always verify that these values match the actual time coverage of the dataset. For example, bd2018 will yield both year_start=2018 and year_end=2018, which may or may not be correct.
- NSI_dataset: confirm that the generic dataset name is harmonised and consistent across files.
- Export
The script writes the CSV into: <dirROCKET>/NSImetadata/<CountryCode>_datafiles.csv
Key reminders
- The script is a first pass only: it automates extraction of filenames, formats, and candidate years.
- Most metadata fields must be filled manually by the NSI staff who know the data.
- Always double-check the final file before uploading to the rocket.
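A typical run, assuming the script is saved as NSI_datafiles_builder.R (a hypothetical filename), would look like:

```r
# Hypothetical paths; adjust to your environment
dirINPUTDATA <- "/data/raw"
dirROCKET    <- "/home/user/rocket"
CountryCode  <- "IT"

source("NSI_datafiles_builder.R")
# writes <dirROCKET>/NSImetadata/IT_datafiles.csv
```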
2. Constructing NSI_varnames.csv
This script produces the *_varlist.csv files, one for each dataset listed in NSI_datafiles.csv.
It should always be run after the NSI_datafiles script.
The script can be found in this dropdown menu:
# Script to generate NSI_varnames metadata (per §3.2.1.2 of manual)
# author EB-MH-MM-AM, rev 2025-09-29
library(data.table)
library(dplyr)
library(readxl)
library(tools)
# ---- Inputs you must define upstream ----
# dirINPUTDATA: the NSI data root (manual's reference base dir)
# dirROCKET: rocket root to use for storing NSI metadata
# CountryCode: 2-letter ISO code (e.g., "IT")
# ---- Inputs ----
md_folder <- file.path(dirROCKET, "NSImetadata")
NSI_datafiles <- fread(file.path(md_folder, paste0(CountryCode, "_datafiles.csv")))
file_path <- dirINPUTDATA # base folder for raw data
# Helper: test if a set of columns uniquely identifies rows
is_key_id <- function(data, cols) {
n_distinct <- data %>% select(all_of(cols)) %>% distinct() %>% nrow()
return(n_distinct == nrow(data))
}
# Process per dataset
for (DS in unique(NSI_datafiles$NSI_dataset)) {
NSI_datafiles_filtered <- unique(NSI_datafiles[NSI_dataset == DS,])
# Build file paths
abs_paths <- file.path(file_path,
NSI_datafiles_filtered$path,
paste0(NSI_datafiles_filtered$NSI_datafile, ".", NSI_datafiles_filtered$format))
# Load files (import_data is assumed to be the MDI import helper; if running
# standalone, replace with e.g. data.table::fread for csv files)
file_list <- lapply(abs_paths, function(f) {
import_data(dir = dirname(f), file = basename(f), format = file_ext(f))
})
var_names_list <- lapply(file_list, function(df) data.table(NSI_varname = names(df)))
for (i in seq_along(var_names_list)) {
rawdata <- file_list[[i]]
# Load variable descriptions
desc_file <- file.path(file_path,
NSI_datafiles_filtered$path[i],
paste0(NSI_datafiles_filtered$NSI_dataset[i], "_descr.csv"))
if (!file.exists(desc_file)) {
stop(paste("Description file not found:", desc_file,
"Please create it as required by the manual."))
}
description <- fread(desc_file)
if (!"NSI_varname" %in% colnames(description)) {
stop("Description file must contain a column named 'NSI_varname'")
}
# Add class, domain, NSI_datafile
var_names_list[[i]]$class <- sapply(rawdata, function(x) paste(class(x), collapse=","))
var_names_list[[i]]$domain <- NA_character_ # manual input required
var_names_list[[i]]$NSI_datafile <- file_path_sans_ext(basename(abs_paths[i]))
# Merge with descriptions
var_names_list[[i]] <- merge(var_names_list[[i]], description,
by = "NSI_varname", all.x = TRUE)
# Report missing variables in description
missing_vars <- setdiff(names(rawdata), description$NSI_varname)
if (length(missing_vars) > 0) {
message("Missing variable descriptions for ", var_names_list[[i]]$NSI_datafile[1], ": ",
paste(missing_vars, collapse=", "))
}
# Identify key variables
colnms <- colnames(rawdata)
max_cols <- min(4, length(colnms))
found <- NULL
for (k in 1:max_cols) {
for (comb in combn(colnms, k, simplify = FALSE)) {
if (is_key_id(rawdata, comb)) {
found <- comb
break
}
}
if (!is.null(found)) break
}
var_names_list[[i]]$is_key <- var_names_list[[i]]$NSI_varname %in% found
message("++ ", var_names_list[[i]]$NSI_datafile[1], " processed.")
}
stacked_df <- bind_rows(var_names_list)
# Final column order per manual
stacked_df <- stacked_df[, c("NSI_datafile","NSI_varname","description","is_key","class","domain")]
# Export
fwrite(stacked_df, file.path(md_folder, paste0(CountryCode, "_", DS, "_varlist.csv")))
message("List for dataset ", DS, " exported.")
}
Steps:
- Inputs required
  - dirROCKET: base directory containing the NSImetadata/ folder.
  - CountryCode: the two-letter ISO code (e.g. "IT").
  - dirINPUTDATA: main folder containing the raw NSI data.
- Run the script
For each dataset in NSI_datafiles.csv, the script will:
  - Load the raw files listed for that dataset.
  - Extract the variable names (NSI_varname).
  - Read the corresponding description file <dataset>_descr.csv (must be provided by the NSI).
  - Record the variable class (data type).
  - Attempt to infer which variables form a key (is_key).
  - Create an empty domain column to be filled manually.
  - Export the compiled metadata to: <dirROCKET>/NSImetadata/<CountryCode>_<dataset>_varlist.csv
- Fields requiring manual completion
  - description: ensure that the description file is complete and correctly labelled.
  - domain: must always be filled manually (see manual §3.2.1.2 for details).
  - is_key: the automatic detection may fail or give false positives. Double-check and adjust manually.
- Double-check the automatic fields
  - Verify that all variables in the raw data are listed in the description file. Missing variables are reported in the console when running the script.
  - Confirm that the class column is meaningful and consistent.
Key reminders
- A description file <dataset>_descr.csv is mandatory. If missing, the script stops with an error.
- The is_key detection is heuristic. Always verify manually which variables uniquely identify records.
- The domain classification cannot be automated. It must be completed by the NSI staff.
- Always inspect the final *_varlist.csv files before uploading them to the rocket.
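For reference, a minimal <dataset>_descr.csv could look like this (variable names and labels are hypothetical):

```
NSI_varname,description
var1,Firm identifier
var2,Main NACE activity code
var3,Turnover in EUR
```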
3. Constructing NSI_class.csv
This script produces the classification metadata files *_class.csv required by the rocket.
It should always be run after the NSI_datafiles script.
The script can be found in this dropdown menu:
# Script to generate NSI_class metadata (per §3.2.1.3 of manual)
# Run only after NSI_datafiles.R
library(readr)
library(dplyr)
library(stringr)
# ---- Inputs ----
CountryCode <- "IT" # set your 2-letter code
dirROCKET <- "your_dir" # base folder for rocket
dirINPUTDATA <- "your_data_folder" # main raw data folder
# Load metadata from NSI_datafiles
data_files <- read_csv(file.path(dirROCKET, "NSImetadata", paste0(CountryCode, "_datafiles.csv")),
show_col_types = FALSE)
# ---- Specify dataset and classification variable ----
class_dataset <- "your_dataset" # must match NSI_dataset in datafiles
class_name <- "name_class_variable" # e.g. "NACE2"
NSI_datafiles_filtered <- filter(data_files, NSI_dataset == class_dataset)
if (nrow(NSI_datafiles_filtered) == 0) {
stop("Dataset not found in NSI_datafiles: ", class_dataset)
}
# ---- Function to read classification data ----
extract_columns <- function(file_path) {
data <- read_csv(file_path, show_col_types = FALSE)
required_columns <- c(class_name, "year", "description")
if (!all(required_columns %in% names(data))) {
stop("Missing one or more required columns in: ", file_path,
". Expected: ", paste(required_columns, collapse=", "))
}
out <- select(data, all_of(required_columns))
# Rename classification variable to generic name 'classvar'
colnames(out)[1] <- "classvar"
return(out)
}
# ---- Process files ----
results <- list()
for (i in seq_len(nrow(NSI_datafiles_filtered))) {
f <- file.path(dirINPUTDATA,
NSI_datafiles_filtered$path[i],
paste0(NSI_datafiles_filtered$NSI_datafile[i], ".",
NSI_datafiles_filtered$format[i]))
if (!file.exists(f)) {
message("File not found: ", f)
next
}
extracted <- extract_columns(f)
dataset_name <- tolower(NSI_datafiles_filtered$NSI_dataset[i])
output_file <- file.path(dirROCKET, "NSImetadata",
paste0(CountryCode, "_", dataset_name, "_class.csv"))
# enforce column order
extracted <- extracted[, c("classvar", "year", "description")]
write_csv(extracted, output_file)
message("Exported classification metadata to ", output_file)
results[[dataset_name]] <- extracted
}
Steps:
- Inputs required
  - dirROCKET: base directory containing the NSImetadata/ folder.
  - CountryCode: the two-letter ISO code (e.g. "IT").
  - dirINPUTDATA: main folder containing the raw NSI data.
  - class_dataset: the dataset where the classification variable is found (must match an NSI_dataset in NSI_datafiles.csv).
  - class_name: the name of the classification variable (e.g. "nace").
- Run the script
For the specified dataset, the script will:
  - Load the raw files linked to the dataset.
  - Extract three required fields: classvar (the classification variable, renamed from the raw variable class_name), year (validity year), and description (text label of the classification code).
  - Export the results into: <dirROCKET>/NSImetadata/<CountryCode>_<dataset>_class.csv
- Fields requiring manual completion / verification
  - Ensure that the classification variable chosen (class_name) matches the raw file.
  - Check that the year and description columns exist and are correctly populated in the raw data.
  - Confirm that the classvar column has been renamed properly and contains only the classification codes.
- Double-check the automatic fields
  - The script will stop if any of the required columns (class_name, year, description) are missing.
  - Even if the file is created, NSIs must review the exported *_class.csv carefully to verify that:
    - year corresponds to the reference period of the classification.
    - description correctly describes each classification code.
    - No codes are missing or duplicated.
Key reminders
- Each dataset that includes a classification variable must have a corresponding *_class.csv file.
- Column order in the final CSV must be exactly: classvar, year, description.
- Always inspect the final file manually before uploading it to the rocket.
4. Constructing NSI_codebook.csv
This script produces the *_codebook.csv files required by the rocket.
It should always be run after the NSI_datafiles script.
The script can be found in this dropdown menu:
# Script to generate NSI_codebook metadata (per §3.2.1.4 of manual)
# Produces a single consolidated <CountryCode>_codebook.csv
# Run only after NSI_datafiles.R
library(data.table)
library(tools)
# ---- Inputs ----
CountryCode <- "IT" # two-letter code
dirROCKET <- "your_dir" # rocket base folder
dirINPUTDATA <- "your_data_folder"
# ---- Import function ----
import_data <- function(file_path) {
fread(file_path, stringsAsFactors = FALSE)
}
# ---- Helper: detect large digit variation ----
has_large_digits_variation <- function(values, threshold = 1) {
values <- na.omit(values)
digits <- nchar(as.character(values))
digit_diff <- abs(digits - min(digits, na.rm = TRUE)) > 3
sum(digit_diff) > threshold
}
# ---- Create codebook for one dataset ----
create_codebook <- function(df, dataset_name,
max_unique_values = 50,
digit_variation_threshold = 1) {
codebook <- data.table(NSI_dataset = character(),
NSI_varname = character(),
code = character(),
year = character(),
description = character())
for (var_name in names(df)) {
unique_values <- unique(df[[var_name]])
# Skip high-cardinality vars or numerics with wide digit variation
if (length(unique_values) > max_unique_values ||
(is.numeric(df[[var_name]]) &&
has_large_digits_variation(unique_values, digit_variation_threshold))) {
next
}
temp_dt <- data.table(
NSI_dataset = dataset_name,
NSI_varname = var_name,
code = as.character(unique_values),
year = "", # ++++ to be reviewed manually ++++
description = NA_character_ # ++++ to be filled manually ++++
)
codebook <- rbind(codebook, temp_dt, fill = TRUE)
}
return(codebook)
}
# ---- Create single consolidated codebook for all datasets ----
create_codebook_all <- function(rd_folder, md_folder, CountryCode,
max_unique_values = 50, digit_variation_threshold = 1) {
csv_files <- list.files(path = rd_folder, pattern = "\\.csv$", full.names = TRUE)
all_codebooks <- list()
for (file_path in csv_files) {
dataset_name <- tools::file_path_sans_ext(basename(file_path))
dataset <- import_data(file_path)
cb <- create_codebook(dataset, dataset_name,
max_unique_values, digit_variation_threshold)
all_codebooks[[dataset_name]] <- cb
message("Processed dataset: ", dataset_name)
}
# Stack all datasets together
codebook_all <- rbindlist(all_codebooks, fill = TRUE)
# Enforce manual’s column order
codebook_all <- codebook_all[, c("NSI_dataset","NSI_varname","code","year","description")]
# Export single consolidated file
output_file <- file.path(md_folder, paste0(CountryCode, "_codebook.csv"))
fwrite(codebook_all, output_file, quote = FALSE)
message("Exported consolidated codebook: ", output_file)
}
# ---- Execute ----
md_folder <- file.path(dirROCKET, "NSImetadata")
dir.create(md_folder, showWarnings = FALSE, recursive = TRUE)
create_codebook_all(dirINPUTDATA, md_folder, CountryCode)

Steps
- Inputs required
  - `dirROCKET`: base directory containing the `NSImetadata/` folder.
  - `CountryCode`: the two-letter ISO code (e.g. `"IT"`).
  - `dirINPUTDATA`: main folder containing the raw NSI data.
- Run the script
  For each raw dataset (CSV) in the folder, the script will:
  - Extract variable names (`NSI_varname`).
  - Collect their observed values (`code`).
  - Create empty `year` and `description` fields.
  - Stack all rows and record which `NSI_dataset` they relate to.
  - Export the result to: `<dirROCKET>/NSImetadata/<CountryCode>_codebook.csv`
- Fields requiring manual completion
  - `year`: must be reviewed and filled manually where relevant.
  - `description`: must always be filled manually (label for each code).
- Double-check the automatic fields
  - The script excludes variables with too many unique values or with large numeric variation.
  - Ensure that important categorical variables were not skipped.
  - Verify that codes are consistent across years.
Key reminders
- Column order in the final CSV must be exactly: `NSI_dataset`, `NSI_varname`, `code`, `year`, `description`
- This script only provides a first draft. Most of the meaningful content (`year`, `description`) must be added manually by NSI staff.
- Always inspect the final file carefully before uploading to the rocket.
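A short sketch can help track the remaining manual work. It is illustrative only (the file name is hypothetical) and assumes the consolidated codebook has already been exported; it lists the variables whose `description` is still empty.

```r
# Sketch: review the draft codebook before manual completion
# (file name is illustrative).
library(data.table)

cb <- fread("IT_codebook.csv")

# Count rows still waiting for a manual description, per dataset/variable
todo <- cb[is.na(description) | description == "",
           .N, by = .(NSI_dataset, NSI_varname)]
print(todo[order(-N)])
```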
5. Constructing the timeconc table
The timeconc table is part of the metadata required by the rocket.
Unlike the other metadata files, it cannot be generated from the raw microdata.
Key points
- The `timeconc` table provides official information on the time coverage of the data.
- It must be obtained directly from an authoritative NSI or official source.
- Once collected, the table should be stored and maintained in the `NSImetadata` folder with the naming convention: `<classvar>t0_<classvar>t1_conc`
Responsibilities
- The NSI staff must provide the `timeconc` table using official sources (e.g. methodological notes, published documentation, internal validation).
- The role of the CompNet rocket is only to read and integrate this file; it does not generate it.
Key reminder
⚠️ Always ensure that the timeconc table comes from an officially validated source and is kept up to date. This file underpins the correct interpretation of the temporal dimension of the datasets and cannot be replaced by automated extraction.
6. Constructing NSI_keyID1_keyID2_conc.csv
The firm ID concordance table establishes the link between two firm identifiers (among plantid, firmid, entid, entgrp) used in different datasets.
It is essential for ensuring consistent longitudinal tracking of firms and dataset merging.
The full script is shown below:
# Pseudo-code: Building the firm ID concordance table (NSI_firmid.csv)
library(data.table)
# ---- Inputs ----
CountryCode <- "IT" # two-letter code
dirROCKET <- "your_dir" # rocket base
dirINPUTDATA <- "your_data_folder" # raw data
dirNSIMETA <- file.path(dirROCKET, "NSImetadata") # metadata folder used below
# Step 1: Identify dataset(s) that contain both ID variables
# Example: suppose "id_old" and "id_new" are two firm identifiers
candidate_datasets <- c("dataset_with_ids")
# Step 2: For each dataset, load and stack across years if not a panel
firmid_list <- list()
# Read NSI_datafiles
datafiles <- fread(file.path(dirNSIMETA, paste0(CountryCode, '_datafiles.csv')))
for (ds in candidate_datasets) {
# Build file path(s) from NSI_datafiles.csv
files <- datafiles[ds == NSI_dataset,]$path
files <- paste0(dirINPUTDATA, files)
# If multiple cross-sections: bind them into a long panel (add year column!)
if (length(files) > 1) {
raw <- rbindlist(lapply(files, fread), fill = TRUE) # Works only if datafiles have the same column names!
} else {
raw <- fread(files)
}
year_var <- '...' # Define the variable name for the year variable
# Step 3: Extract only the two ID columns + year column ---> manually fix the id column names
firmid_sub <- unique(raw[, c("id_old", "id_new", year_var), with = FALSE])
# Step 4: Standardise column names ---> manually fix the column names
setnames(firmid_sub, old = c("id_old", "id_new", year_var),
         new = c("firmid_old", "firmid_new", "year"))
firmid_list[[ds]] <- firmid_sub
}
# Step 5: Combine all datasets (if more than one provides concordance)
firmid_all <- rbindlist(firmid_list, fill = TRUE)
# Step 6: Export to NSImetadata
fwrite(firmid_all, file.path(dirNSIMETA,
paste0(CountryCode, "_keyID1_keyID2_conc.csv"))) # Substitute the keyIDs with their proper names!

Key principles
- The table can only be created if at least one dataset contains both identifiers in the same file.
- If the dataset is not a panel but a set of yearly cross-sections, it must be stacked into a long format before extracting IDs.
- The table must always contain unique triples: `keyID1`, `keyID2`, `year`, where the ID names need to be picked from `plantid`, `firmid`, `entid`, `entgrp`.
Steps
- Identify dataset(s)
  - Review `NSI_datafiles.csv` and raw data.
  - Find which dataset(s) include the two firm ID variables (e.g. `id_old` and `id_new`).
- Stack data if needed
  - If the dataset is stored as separate cross-sections by year, stack them and add a `year` column.
  - If the dataset is a panel, the year is already present.
- Extract unique concordance
  - Keep only the two ID columns and the `year` column.
  - Deduplicate (`unique`) to avoid duplicates across files.
- Rename columns
  - Use the standard names: `keyID1`, `keyID2`, `year`.
- Export
  Save as: `<dirROCKET>/NSImetadata/<CountryCode>_keyID1_keyID2_conc.csv`
Key reminders
- This file is not always available — it depends on the data structure in the NSI.
- The NSI staff must verify that the mapping is correct and covers the relevant years.
- Always check that:
- Both ID variables are properly harmonised.
- No spurious duplicates or mismatches exist.
- Cross-sections have been stacked correctly.
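The checks listed above can be sketched in a few lines of R. Everything here is illustrative: the file name and the `keyID1`/`keyID2` labels are placeholders for the actual identifier names.

```r
# Sketch of consistency checks on the firm ID concordance table
# (file and column names are illustrative).
library(data.table)

conc <- fread("IT_firmid_entid_conc.csv")
setnames(conc, c("keyID1", "keyID2", "year"))

# 1. Triples must be unique
stopifnot(anyDuplicated(conc) == 0)

# 2. Flag one-to-many mappings within a year (may be legitimate, but review them)
multi <- conc[, .N, by = .(keyID1, year)][N > 1]
if (nrow(multi) > 0)
  message(nrow(multi), " keyID1/year pairs map to more than one keyID2")

# 3. Year coverage at a glance
print(conc[, .N, keyby = year])
```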
3.2.6 Data Pre-Processing
In the Netherlands and France, NSIs have already harmonized their raw data files to resemble MD datasets, so minimal harmonization work is required from the Launcher.
However, there is a strategic intention to shift the boundary between the responsibilities of NSIs and the MDI infrastructure. Two approaches are under consideration:
1. NSIs document their raw files, and the Launcher—guided by this metadata—performs the harmonization and constructs the MD panels.
2. NSIs carry out the full harmonization to MD standards, and the Launcher simply reads the pre-harmonized files into R.
Some raw datasets might require specific preprocessing. The infrastructure handles this with the preprocessing tool (rocket/MDIprogs/datafile_preprocessing_tool.R), which is applied right after the launcher imports the raw datafile and before the datafile is harmonized.
The tool is a general-purpose function designed to apply one or more preprocessing transformations to raw datasets (stored as data.table objects). It enables modular, rule-based data cleaning and transformation by interpreting a structured string called preprocessing_string.
How it works
- The instruction string (`preprocessing_string`) encodes all preprocessing steps.
- The string is split into separate operations using `||`.
- Each operation is parsed and executed in sequence.
- The data is modified in-place step by step, and the final `data.table` is returned.
Syntax rules
- Operations are separated by `||`
- Parameters within each operation are separated by `|`
- Multiple elements in a parameter (e.g., multiple column names) are separated by a hash symbol `#`
Supported Operations
| Operation | Format | Description |
|---|---|---|
| `dedup` | `dedup\|id_col1#id_col2\|method\|[optional:dedup_col]` | Removes duplicates by ID(s). Methods: `na`, `min`, `max`, `meanmode`, `random`. |
| `filter` | `filter\|column\|operator\|value` | Filters rows based on logical conditions. Operators: `eq`, `neq`, `gt`, `gte`, `lt`, `lte`, `in`, `nin`. |
| `agg` | `agg\|group_col1#group_col2\|var1:func#var2:func` | Aggregates rows over groups using functions: `sum`, `mean`, `min`, `max`, `median`, `sd`, `mode`, `pickmaxby-refcol`. |
| `restruct` | `restruct\|column_to_remove\|col1#col2#col3` | Drops a column and deduplicates rows based on remaining selected columns. |
| `reshape` | `reshape\|id_col1#id_col2\|names_from\|values_from1#values_from2` | Reshapes data from long to wide format using `dcast()`. |
| `derive` | `derive\|new_col\|condmap\|cond1:val1#cond2:val2#...\|default:<default_val>` | Creates a new column from conditional logic. Conditions use standard R syntax; values can be column names or literals. A default must be specified. |
| `scaleif` | `scaleif\|condition_col\|val1:factor1#val2:factor2#...\|col1#col2#...` | Conditionally multiplies one or more columns by a factor based on a categorical column’s value. |
| `trimchars` | `trimchars\|col1#col2#...\|n` | Trims the last `n` characters from each specified character column. Useful to normalize identifiers or string variables. |
| `mergefrom` | `mergefrom\|datafile_name\|join_key1#join_key2#...\|col1#col2#...\|[join_type]` | Imports one or more raw files belonging to `datafile_name` (as defined in column `NSI_datafile` of file `NSI_datafiles`), stacks them if multiple, and merges the specified columns into the current datafile using the provided join keys. By default a left join is performed. If `join_type` is set to `outer`, a full outer join is applied. |
Special Feature in agg: pickmaxby-refcol
You can specify that a categorical column should take the value from the row with the highest value in another column.
Syntax:
my_categorical_col:pickmaxby-SCORE_col
This selects the value of my_categorical_col from the row that has the highest SCORE_col within each group.
Example
Code
preprocessing_string <-
"dedup|firm_id#year|random||
filter|year|gt|2010||
agg|firm_id|sales:sum#country:pickmaxby-sales||
trimchars|vat_id|2||
mergefrom|employment_survey_08|FIRM_ID#year|industry_code#size_class|left"

In case you need clarifications regarding the tool, please reach out to the MDI team.
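To make the three separator levels concrete, the sketch below decomposes a `preprocessing_string` with plain string splitting. This is only an illustration of the syntax rules, not the actual parser implemented in `datafile_preprocessing_tool.R`.

```r
# Illustrative parser for the three separator levels (||, |, #).
parse_preprocessing_string <- function(s) {
  ops <- strsplit(s, "||", fixed = TRUE)[[1]]        # level 1: operations
  lapply(trimws(ops), function(op) {
    parts <- strsplit(op, "|", fixed = TRUE)[[1]]    # level 2: parameters
    list(op = parts[1],
         params = lapply(parts[-1], function(p)
           strsplit(p, "#", fixed = TRUE)[[1]]))     # level 3: elements
  })
}

parsed <- parse_preprocessing_string(
  "filter|year|gt|2010||dedup|firm_id#year|random")
parsed[[2]]$op           # "dedup"
parsed[[2]]$params[[1]]  # "firm_id" "year"
```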
The advantage of having the Launcher perform the harmonization is a reduction in maintenance costs for NSIs, particularly for recurring annual updates. It also improves the codification and reproducibility of the conceptual work done by NSI staff. However, this approach entails higher initial costs, as it requires NSIs to adopt a more rigid system of metadata documentation and to coordinate more closely with the MDI team.
3.3 Run MDI: Harmonization & Modules
This section describes the main steps to configure and run the MDI system. It covers how to prepare the countdown.R script, perform metadata checks, and execute harmonization and analysis modules. Each step is explained in detail below. An overview of all the steps can be found in the previous section.
3.3.1 Countdown
The countdown (countdown.R) is the starting point for all use of the MDI. It requires the user to make a number of adjustments to ensure that the MDI can be executed successfully for the selected purpose. The following parameters must be reviewed and set by the user.
- MDI Installation Directory (`dirMDI`): Set the full path to the directory where all MDI files are installed. Make sure the path ends with a “/”, e.g. `dirMDI = "my/dir/"`.
- Country code (`CountryCode`): Specify a 2-letter country code following the ISO 3166-1 alpha-2 standard.
- NSI data directory (`dirINPUTDATA`): Set the full path to the directory containing the NSI firm-level data files (or mock data files). These are the raw input files provided by the National Statistical Institute (NSI).
- Output directory (`dirOUTPUT`): Define the directory to which all generated output files will be exported. This directory must have read/write permissions and will contain results, module outputs, and other exported files.
- Temporary storage directory (`dirTMPSAVE`): Set the directory used for temporary storage of MDI virtual longitudinal datasets. This directory is used to store intermediate datasets and allows reuse of processed data without re-importing raw NSI files.
- Optional – flags for temporary files: These flags control how the MDI process handles raw data imports and temporary files.
  - `MDIimportFlag`: Set to TRUE to import raw NSI data files. If FALSE, existing virtual datasets stored in `dirTMPSAVE` are used.
  - `MDIcleanTMP`: If TRUE, the temporary directory `dirTMPSAVE` is cleaned before execution.
- Optional – mock data flag: This flag controls whether the MDI is executed using mock data.
  - `IsMOCK`: Set to TRUE to run the MDI in a test scenario using mock data. When TRUE, temporary files are stored in a country-specific subdirectory to avoid overwriting files when switching countries.
- Optional – execution control flags: These flags influence how the MDI scripts run and how much output is produced.
  - `MDImoduleRUN`: Set to TRUE only after post-harmonization checks have been completed and research modules are ready to run. It should be FALSE during the first execution.
  - `MDIdebug`: Set to TRUE to display logs, warnings, and errors. Use FALSE for a quieter run.
  - `MDIimputeFlag`: Reserved for potential data imputation routines (currently not in active use).
  - `filteredHarmonization`: Set to TRUE if harmonization should be restricted to variables listed in the current `MDnames_select` file.
The entire countdown.R script is shown below.
# This file is used to start MDI
# fill in all the parameters and save this file: countdown.R
# run the program in R and then choose to execute:
# 1. run pre_launch_checker.R to run after an update of MDI at NSI, to check and fix metadata
# 2. run liftoff.R to run rocket: execute MD harmonizer and run payload modules
# 3. run prepare_NSI.R to run things to aid in getting metadata in good shape
# 4. run interactive_MDI.R to initialize environment to test/debug/explore/write module code.
rm(list = ls())
MDI_launch_version <- "v2.3"
########################################
# Compulsory steps
########################################
########################################
# 10. Set the full path to the directory where you install the MDI files
########################################
dirMDI <- "/files/MDI/"
########################################
# 9. Give 2 letter country code for your site ("ISO 3166-1 alpha-2" standard)
########################################
CountryCode <- "PT"
########################################
# 8. Set the full path to the directory with NSI firm-level data files (or mockdata files)
########################################
dirINPUTDATA <- "/files/NSIdatafiles/"
########################################
# 7. Set the full path to the directory to which generated files are exported (dirOUTPUT)
########################################
dirOUTPUT <- "/files/output/"
########################################
# 6. Set the full path to the directory for temporary storage of MDI virtual longitudinal datasets (dirTMPSAVE)
########################################
dirTMPSAVE <- "/files/TMP/"
########################################
# Optional steps (steps 5, 4, 3 & 2)
########################################
#####################################
# 5. Flag for temporary MDI files #
#####################################
# set ImportFlag=TRUE if you want to import raw NSI data files (if FALSE: reads MDI virtual data from dirTMPSAVE)
MDIimportFlag <- TRUE
################################################
# 4. Flag for cleaning the temporary folder #
################################################
# set cleanTMP=TRUE if you would like to clean the dirTMPSAVE before running
MDIcleanTMP <- FALSE
#####################################
# 3. Flag for mock data use #
#####################################
IsMOCK <- TRUE
# If isMOCK, temporary files are stored in CountryCode folder,
# so that files aren't overwritten when switching country
if (IsMOCK) {
dirTMPSAVE <- paste0(dirTMPSAVE, CountryCode, "/")
if (!dir.exists(dirTMPSAVE)) {
dir.create(dirTMPSAVE)
}
}
#####################################
# 2. Flags to control execution #
#####################################
# Set MDImoduleRUN = TRUE if the post_harmonization script has been run and checked and modules are ready to be run.
# Should be set to FALSE when running the launch for the first time.
MDImoduleRUN <- FALSE
# set debug = TRUE if you don't want to suppress logs, warnings and errors
MDIdebug <- TRUE
## NOTE EB: Nothing done at the moment with the imputeflag (was called inputeflag in early versions)
MDIimputeFlag <- FALSE
# If you want the harmonization to be done only for the variables included in the current
# launch's MDnames_select file
filteredHarmonization <- FALSE
##############################
# 1. Liftoff #
##############################
# save the program countdown.R to your work directory.
# Run the file to choose which program/feature to execute:
# Now, pick the program to be executed
# 1. run pre_launch_checker.R to run after an update of MDI at NSI, to check and fix metadata
# 2. run liftoff.R to run rocket: execute MD harmonizer and run payload modules
# 3. run prepare_NSI.R to run things to aid in getting metadata in good shape
# 4. run interactive_MDI.R to initialize environment to test/debug/explore/write module code.
# ---> Choose below with number of the selected program
# Check if the session is interactive (works both in RStudio and console)
if (interactive()) {
# Use select.list() for interactive selection
user_input <- select.list(c("pre_launch_checker.R", "liftoff.R", "prepare_NSI.R", "interactive_MDI.R"), title = "Choose a program to run:")
if (user_input != "") {
# Source the corresponding script
source(paste0(dirMDI, "launchpad/", user_input))
} else {
cat("No selection made. Exiting.\n")
}
} else {
# If not in an interactive session, use readline()
user_input <- as.integer(readline(prompt = "Please enter an integer (1 for pre_launch_checker.R, 2 for liftoff.R, 3 prepare_NSI.R, 4 interactive_MDI.R): "))
if (!is.na(user_input) && user_input %in% 1:4) {
# Map the user input to the corresponding script name
scripts <- c("pre_launch_checker.R", "liftoff.R", "prepare_NSI.R", "interactive_MDI.R")
# Source the corresponding script
source(paste0(dirMDI, "launchpad/", scripts[user_input]))
} else {
cat("Invalid input. Please enter a valid integer (1, 2, 3 or 4).\n")
}
}

3.3.2 Pre-Launch-Checker
The program pre_launch_checker.R (run countdown.R and choose this program) needs to be run before anything else. It performs various checks on the NSI metadata to avoid errors later on. The results of the checks can be found in the file pre_launch_checker_results.txt in the output directory, which lists possible errors that should be fixed in the NSI metadata. Additionally, two concordance files (NSI_pcc8t0_pcc8t1_conc.csv and NSI_MD_nace_conc.csv) are created using existing concordance files and updating them with the data at the NSI. These concordance tables might contain empty values if no value was previously defined; missing values need to be filled in manually. When the concordance files are ready to be used, they need to be moved to the directory indicated in pre_launch_checker_results.txt.
3.3.3 Post-Harmonization Quality Checks
After harmonizing your country’s microdata to the MD format, the Post-Harmonization Checker (PHC) script is automatically executed in the rocket to ensure that the harmonized datasets meet essential quality and consistency standards. This diagnostic process validates whether the resulting data is clean, correctly structured, and ready for module execution.
The script performs the following checks on the harmonized datasets:
| Check Type | Description |
|---|---|
| 1. Duplicate Check | Identifies rows where the key ID variable (e.g., firmid) is duplicated. |
| 2. Variable Class Check | Verifies that each variable matches its expected R data type (e.g., numeric, character, date). |
| 3. Date Format Check | Ensures date variables are correctly formatted and parseable (e.g., %Y, %d%m%Y). |
| 4. Date Range Check | Extracts the minimum and maximum detected dates per variable to check date range matches. |
| 5. Break Detection | Identifies structural breaks in aggregate-level distributions over time (over 10% jumps). |
Each of these checks outputs either a summary table (.txt) or a visual diagnostic (.pdf) to help identify problems.
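For intuition, the duplicate check boils down to logic like the following sketch. The data and the reported values are illustrative; the actual PHC script may differ in detail.

```r
# Sketch of the duplicate-check logic (illustrative, not the PHC code).
library(data.table)

dt <- data.table(firmid = c("A", "A", "B"), year = 2020, turnover = 1:3)
id_var <- "firmid"

# All rows whose key value occurs more than once
dup_rows <- dt[duplicated(dt, by = id_var) |
               duplicated(dt, by = id_var, fromLast = TRUE)]

data.table(
  id_var                     = id_var,
  has_duplicates             = nrow(dup_rows) > 0,          # TRUE
  num_duplicated_rows        = nrow(dup_rows),              # 2
  num_unique_duplicated_keys = uniqueN(dup_rows[[id_var]])  # 1
)
```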
3.3.3.1 PHC Output Files Generated
After the script runs successfully, you will find the following two files:
`<CountryCode>_phc_results.txt`
- Location: `dirTMPSAVE`
- Contents: Duplicate summary, class and format mismatches, detected date ranges.
Duplicate Check Table
| Column Name | Description |
|---|---|
| `dataset` | MD dataset (e.g., BR, SBS, ICTEC) |
| `id_var` | The country-specific ID variable used to identify unique records, taken from `MD_idInfo` |
| `has_duplicates` | TRUE if duplicated rows are found based on `id_var`, FALSE otherwise |
| `num_duplicated_rows` | Total number of rows that are duplicates (may include multiple per key) |
| `num_unique_duplicated_keys` | Number of unique key values (`id_var`) that are duplicated |
Variable Class Check Table
| Column Name | Description |
|---|---|
| `dataset` | MD dataset |
| `variable` | Variable name being checked |
| `expected_class` | Class assigned to this variable in the metadata (`MD_varnames`) |
| `actual_class` | Actual class detected in the harmonized `.RDS` file |
| `class_match` | TRUE if expected and actual class match, FALSE otherwise |
Date Format & Class Check Table
| Column Name | Description |
|---|---|
| `dataset` | MD dataset |
| `variable` | Variable name being checked |
| `expected_class` | Expected class (usually `"date"`) |
| `actual_class` | Class detected in the file |
| `expected_format` | Date format expected (e.g., `%Y`, `%d%m%Y`) |
| `actual_format` | Detected format based on sample values |
| `format_valid` | TRUE if values can be parsed using `expected_format`, FALSE otherwise |
| `class_match` | Whether the variable is stored as a `Date` object |
Date Range Check Table
| Column Name | Description |
|---|---|
| `dataset` | MD dataset |
| `variable` | Date variable being checked |
| `actual_format` | Detected format used to parse the variable |
| `actual_min_date` | Earliest parsed date in the variable |
| `actual_max_date` | Latest parsed date in the variable |
| `expected_range` | Expected range of years (as specified in `MD_catalogue`) |
`breaks_report.pdf`
- Location: `dirTMPSAVE`
- Contents: Plots showing time-series breaks for each numeric variable by dataset. The red dots mark a structural break in the time series, defined as a jump of at least 10%.
Break Summary Table (PDF)
| Column Name | Description |
|---|---|
| `dataset` | Dataset name |
| `variable` | Numeric variable being assessed for breaks |
| `stat` | Statistic showing the break (e.g., mean, p50, sd) |
| `year` | Year in which a structural break was detected |
| `growth` | Relative change from previous year (e.g., +0.25 = 25% increase, -1.0 = 100% drop) |
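The break-detection rule can be understood with a small sketch: compute the relative change of an aggregate statistic over time and flag jumps of at least 10%. The data are illustrative; this is not the PHC implementation.

```r
# Sketch of the break-detection rule (illustrative data).
library(data.table)

agg <- data.table(year = 2015:2019,
                  mean_turnover = c(100, 103, 150, 152, 149))

# Relative change from the previous year
agg[, growth := mean_turnover / shift(mean_turnover) - 1]

# Years with a jump/drop of at least 10% (these would appear as red dots)
breaks <- agg[abs(growth) > 0.10]
print(breaks)  # 2017, growth ~ +0.46
```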
3.3.3.2 Instructions for Country Leaders for Reviewing and Fixing PHC Errors
- Duplicate Check
  - Check: Whether any rows share the same key (e.g., `firmid`) more than once.
  - Look for: `has_duplicates == TRUE` and high values in `num_duplicated_rows` or `num_unique_duplicated_keys`.
  - Fix: Review your harmonization step and ensure that each firm-year observation is uniquely identified. If intentional (e.g., due to panel structure), document it clearly.
- Variable Class Check
  - Check: Compares expected vs. actual data types.
  - Look for: `class_match == FALSE`
  - Fix: In your country metadata, ensure each variable is explicitly cast to the correct type using functions like `as.numeric()`, `as.character()`, or `as.Date()` using the revalue method.
- Date Format Check
  - Check: Whether date variables match expected formats (e.g., `%d%m%Y`).
  - Look for: `format_valid == FALSE` or `actual_format == "unknown"`
  - Fix: Recheck how date strings are parsed in your harmonization script and in the metadata. Use `as.Date()` with the proper format string.
- Date Range Check
  - Check: Compares detected date range with expected year coverage.
  - Look for: Min or max dates far outside the expected range (e.g., year 1001 or 9122).
  - Fix: Likely due to incorrect parsing. Verify input formats and metadata.
- Break Detection
  - Check: Identifies abrupt jumps/drops in: p25, p50, p75; mean; standard deviation.
  - Look for: Large positive/negative `growth` values in the break summary and red dots in the plots.
  - Fix: Review input consistency across years (e.g., variable definitions, missing categories). Cross-check with national data providers to see if breaks are expected due to methodology changes.
You are ready to proceed with running the MDI modules (e.g., setting MDImoduleRUN = TRUE) only after:
- All critical issues (e.g., duplicate rows, format mismatches, corrupted dates) are resolved.
- You’ve documented any justified exceptions (e.g., expected breaks).
- You’ve shared updates or escalated open issues to the MDI team.
- Please keep backup copies of your harmonized `.RDS` files before making changes.
3.4 Developing & Testing
3.4.1 Nuvolos Developer Space
This space is intended for code development, module creation, and script testing on mock data. It is designed for MDI team members and module writers who are familiar with the MDI infrastructure and have access to the MDI GitHub repository. (More information on Nuvolos: where MDI users develop and test their codes)
Each user must connect their GitHub account to enable pushing and pulling changes. Note the following:
- Each user works in their own isolated space—your changes remain private until you explicitly push them to GitHub.
- It is your responsibility to ensure you are working on the latest version of the MDI codebase by pulling updates from GitHub when you start your RStudio session (how to work with Git).
- The workspace includes both NSI mock data and harmonized MD mock data, which you can use for testing and developing your modules. You can find the NSI mock data in `space_mounts/mockdata/NSIdata/` and the harmonized MD mock data in `space_mounts/mockdata/TMP/`. (Note: these folders are only accessible via RStudio and won’t show up in the Files section.)
If you’re using the Nuvolos Developer Space for the first time you need to connect your GitHub account. Follow the steps below:
Open Nuvolos, navigate to “applications” (left menu) and open RStudio
Go to the terminal
- Generate a public/private key pair by executing this command: `ssh-keygen -t ed25519` (no need to change the suggested location or create a password; press Enter three times)
- Navigate to the folder where both keys are saved. You can do that in the file section on the right side; you might have to click on the “/” to see all directories. The folder `.ssh` is hidden, so click on the gear symbol and select the option “Show Hidden Files”.
Open the id_ed25519.pub file and copy its content
Open Github in the browser
- In GitHub: go to Settings > SSH and GPG keys
- Click “add new SSH key”
- Paste the copied public key into the key field and add a title, e.g. “Nuvolos MDI test environment”, then save
Back in Nuvolos, in the RStudio Terminal, clone the branch using this command (from within the /files folder, which is the default):
`git clone --branch pre_Launch_v2.2_backup --single-branch git@github.com:Secretariat-CompNet/MDI.git`
- The MDI with all files will show up in the files section on the right.
- Then go to home/datahub/ in the file section and open the .gitconfig file. The file will look like this:

[user]
  email = 12345678+Name@users.noreply.github.com
  name = YourName
[credential]
  helper = cache --timeout 64800

Make sure that the email address is the one from your GitHub account, not e.g. the iwh email address. To check which is the right one, go to your GitHub account > Settings > Emails. Copy the email address ending in @users.noreply.github.com into the .gitconfig file and save the file.
The MDI is correctly set up now. You can run your code with mock data, edit files, and pull/push changes to GitHub.
To test codes: execute countdown and select interactive_MDI (option 4) before running your own script.
3.4.2 General Workflow with Git
If you’re in the right branch and your repository is up-to-date, this is the normal workflow:
- You change a file / add a module / add metadata.
- You save your edited file.
- (Best practice: check the status (`git status`) and make sure there are no recent updates on the branch.)
- You add your file(s) to a commit (`git add file_name`).
- You create a commit with a commit message (`git commit -m "this is the commit message"`).
- You push your commit to GitHub (`git push`).
- You can verify your commit with `git log`. This shows a list of all recent commits (with the one you just made on top).
If you haven’t worked with the MDI in a while, the repository might be outdated or you might be in an old branch. Below are the steps to make sure you’re working in the right branch and have the latest updates.
- Navigate into your MDI repository: `cd /files/MDI`
- Verify the status of your MDI version: `git status`
  This tells you what branch you’re on, e.g. `On branch branch_name`, and whether you’re up to date with the latest changes. There are three possible options:
  - `Your branch is up to date with 'origin/branch_name'.` You have all the latest changes of that branch. No need to do anything else.
  - `Your branch is ahead of 'origin/branch_name' by x commits.` You have changes that you haven’t pushed to GitHub yet.
  - `Your branch is behind 'origin/branch_name' by x commits.` There are updates that you haven’t pulled yet.
- If you want to change the branch: `git switch new_branch_name`
- If you want to pull changes: `git pull`
- If you want to push your changes:
  - To add updated files to a commit use: `git add name_of_your_changed_file` (use that command for each file individually, or use `git add .` to add all changed files)
  - To create a commit use: `git commit -m "Add your commit message here"` (make sure your commit message describes your updates well)
  - Push your commit to GitHub: `git push`
- Check the status again to verify that you’re on the latest version of your desired branch: `git status`
  This should now give:
  `On branch branch_name`
  `Your branch is up to date with 'origin/branch_name'. nothing to commit, working tree clean`
3.4.3 How to add your research module to the MDI infrastructure
If you want to add your module to the MDI infrastructure via the Nuvolos Developer Space, you need access to the MDI GitHub repository and a GitHub account set up in Nuvolos (see: first-time users). Then open RStudio in the Developer Space and follow these steps to add your module:
Add a module folder
In the files section on the right, navigate to the folder
MDI/payload/Launch_vX.X/Rmodules/(Replace X.X with the current launch version). In that folder create a new folder with a two-character name, abbreviating your research module, eg. “XY”.Add your MDnames_select file
Inside your module folder add the MDnames_select file. This file contains a list of the variables that your module uses (more information here) and it needs to be named like this:
```
(res_group)_MDnames_select.csv
```

where res_group is your module abbreviation, e.g. `XY_MDnames_select.csv`.

Add your main script
Inside your module folder add your main script. This script needs to be named the following way:
```
Launch_X.X_(res_group).R
```

where X.X is the current launch version and res_group is your module abbreviation, e.g. `Launch_2.3_XY.R`. This script will be executed by the code when your module is run. That does not mean all your code needs to be in that one script: you can add as many scripts as you like and call them using `source("path/to/your/script")`.

Add any other scripts/files/folders
If you need any additional scripts or files place them in your module folder.
This is an example of a module folder:
The folder has the module abbreviation CN and contains the main script Launch_2.3_CN.R, the CN_MDnames_select.csv file and two additional files EU_countries.csv and Questionnaire.xlsx.
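For orientation, a minimal main script for a hypothetical module "XY" could look like the sketch below. The printed indicator table and the commented helper-script path are purely illustrative, not part of the MDI API.

```r
# Hypothetical main script: payload/Launch_v2.3/Rmodules/XY/Launch_2.3_XY.R
# Additional scripts in the module folder can be called via source(), e.g.:
# source("payload/Launch_v2.3/Rmodules/XY/helpers.R")  # illustrative path

# Toy computation standing in for the module's actual analysis:
result <- data.frame(
  indicator = c("mean_employment", "exit_rate"),
  value     = c(12.4, 0.08)
)
print(result)
```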
3.4.4 How to develop and test your module using (mock) data
To develop your module you first need to adjust and run the countdown. This will import all libraries and variables you might want to use in your module. To do so, navigate to launchpad/countdown.R.
The countdown script functions as a configuration file that, for example, sets up paths and flags for the MDI execution. You need to adjust the parameters in the script to fit your needs. For example, set the flag isMOCK to TRUE if you’re working with mock data, or set dirTMP to the directory where the harmonized mock data is stored (space_mounts/mockdata/TMP/). You can find all parameters and flags in the countdown section of this manual.
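For a mock-data development run, the relevant assignments in `launchpad/countdown.R` might look like this (only `isMOCK` and `dirTMP` are named in this manual; the full parameter list is in the countdown section):

```r
# Parameters for developing against mock data (sketch):
isMOCK <- TRUE                            # work with mock data
dirTMP <- "space_mounts/mockdata/TMP/"    # harmonized mock data location
```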
After adjusting the parameters, run the countdown script and select option 4 “Interactive MDI”. This will set up the environment for you to develop and test your module, but it will not run any MDI module. You can then run your module script (e.g. Launch_2.3_XY.R) to test your module using the mock data.
If you want to mimic a launch as it would happen at an NSI, set the flag MDIimportFlag to FALSE and MDImoduleRUN to TRUE. Then run the countdown and select option 2 “liftoff”. This will set up the environment and run all modules with the selected mock data.
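Put together, the flag settings for mimicking a launch might read (a sketch; only `MDIimportFlag` and `MDImoduleRUN` are named above, `isMOCK` is carried over from the mock-data setup):

```r
isMOCK        <- TRUE    # use mock data
MDIimportFlag <- FALSE   # skip the import step
MDImoduleRUN  <- TRUE    # run all modules on the selected data
```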
3.4.5 Mockdata
To ensure robustness, consistency, and functionality across the MDI infrastructure, the development and use of mock data is essential.
Specifications for mock data.
- All files from `NSI_datafiles.csv` are covered, with all years as given by `NSI_varnames`.
- All variables from all NSI files must be represented with the correct format and domain, including classifications, codebooks, and value labels.
Underlying ‘firm’ datasets.
- SBS/BS Cobb-Douglas model: deterministic framework with stochastic draws; firm size is used to infer capital (k), materials (m), and output (y) based on productivity shocks and capital-labor moments.
- SBS/BS forward-looking Hopenhayn model: includes stochastic productivity draws and shocks to productivity and demand, allowing for endogenous firm exit.
- SBS/BS Aglio–Bartelsman-type firms: based on parameter draws for A/g, η, and ρ.
- Firm dynamics with innovation: models firms’ extensive choices in innovative activities (e.g., ICTEC, R&D).
- Firms with trade behavior: captures extensive and intensive trade choices across modules such as ITGS, ITS, OFATS, and IFATS.
BLOCK0: Prepare Auxiliary Files
- Define the country, the sample periods, the datasets and read country-specific NSI metadata (datafile, varname, codebook).
- Create a table specifying the hierarchical structure among variables.
- Develop a table defining the concordance between fundamental model variables and NSI variables.
- Compile a file detailing auxiliary regressions for predicting numerical, logical, and categorical variables.
BLOCK0: Obtain Data Moments from the Data or by Simulation
- Calculate the sample mean and variance of employment for each NACE 2-digit sector.
- Determine the average exit rate for each NACE 2-digit sector.
- Extract regression coefficients for auxiliary regressions.
- Gather information on sample sizes of surveys.
- Compute key economic ratios and rates: capital-labor ratio, capital rental rate (interest rate), wage rate, and capital depreciation rate.
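The first two moments above can be computed along these lines. This is a sketch with a hypothetical firm-level data frame; the column names `nace2`, `empl`, and `exit` are illustrative, as the real input comes from the NSI files.

```r
# Hypothetical firm-level data:
firms <- data.frame(
  nace2 = c("10", "10", "25", "25", "25"),
  empl  = c(4, 12, 7, 30, 9),
  exit  = c(0, 1, 0, 0, 1)   # 1 = firm exits the sample
)

# Sample mean and variance of employment per NACE 2-digit sector:
moments <- aggregate(empl ~ nace2, firms,
                     function(x) c(mean = mean(x), var = var(x)))

# Average exit rate per NACE 2-digit sector:
exit_rate <- aggregate(exit ~ nace2, firms, mean)
print(moments)
print(exit_rate)
```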
BLOCK1: Simulate an Unbalanced Panel Dataset
- Generate an unbalanced panel dataset for firms over time, incorporating firm entry and exit dynamics based on Hopenhayn (1992).
- Estimate model parameters: \(\alpha\) (output elasticity of labor), \(\sigma\) (standard deviation of the TFP process), and `z_exit` (exit threshold for firms' productivity) by targeting the sample mean and variance of the firm size (employment) distribution and the exit probability.
- The simulated panel data includes firm ID, year, productivity, labor, capital, depreciation, and EBITDA.
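A stripped-down version of this simulation, with an AR(1) log-productivity process and an exit threshold in the spirit of Hopenhayn (1992), is sketched below. All parameter values and the labor rule are purely illustrative, not MDI defaults.

```r
set.seed(1)
n_firms <- 100; n_years <- 5
rho <- 0.9; sigma <- 0.2; z_exit <- -0.5   # illustrative parameter values

panel <- do.call(rbind, lapply(seq_len(n_firms), function(id) {
  z <- 0
  rows <- list()
  for (t in seq_len(n_years)) {
    z <- rho * z + rnorm(1, sd = sigma)    # AR(1) log productivity
    if (z < z_exit) break                  # endogenous exit below threshold
    rows[[t]] <- data.frame(firm_id = id, year = 2000 + t, log_tfp = z,
                            labor = round(exp(z) * 10))
  }
  if (length(rows)) do.call(rbind, rows) else NULL
}))
head(panel)  # unbalanced: firms crossing z_exit drop out of the panel
```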
BLOCK2: Predict BR and BS Variables
- Use concordance tables between model variables and NSI variables, as well as auxiliary regressions and regression coefficients to predict BR and BS variables.
BLOCK2: Sample from the ‘Universe’ of Firms
- For each survey table, sample from the firm universe and predict NSI variables using the auxiliary regressions and regression coefficients.
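The sampling-and-prediction step can be sketched as follows. The universe size, sample size, variable names, and coefficient vector are all illustrative; in the MDI the coefficients come from the BLOCK0 auxiliary regressions.

```r
set.seed(42)

# Hypothetical 'universe' of simulated firms from BLOCK1:
universe <- data.frame(firm_id = 1:1000, log_labor = rnorm(1000, mean = 2))

# Draw a survey sample without replacement:
survey <- universe[sample(nrow(universe), 100), ]

# Predict an NSI survey variable using illustrative
# auxiliary-regression coefficients:
beta <- c(intercept = 1.2, log_labor = 0.8)
survey$log_turnover_hat <- beta[["intercept"]] +
  beta[["log_labor"]] * survey$log_labor
nrow(survey)  # 100 sampled firms
```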
4 Acknowledgement
We gratefully acknowledge the support of the European Union, whose funding made this project possible. We also thank all National Statistical Institutes (NSIs), National Statistical Systems, National Productivity Boards (NPBs), and other collaborators for their valuable contributions to the development of the MDI project and this manual.