MDI Manual

Comprehensive guidance for everyone who builds and uses the Micro Data Infrastructure

Author
Affiliation

Eric Bartelsman and MDI team

Vrije Universiteit Amsterdam; Tinbergen Institute; Halle Institute for Economic Research (IWH); Competitiveness Research Network (CompNet); Bocconi University; Centre for Business and Productivity Dynamics (CBPD)

Published

March 11, 2026

1 Introduction to MDI

This user guide provides users of the Micro Data Infrastructure (MDI) with the information needed to conduct research with the MDI, set up the MDI at a new institution, or develop the infrastructure further.

1.1 What

The Microdata Infrastructure (MDI) is a platform for cross-country microdata access, developed by CompNet in collaboration with European National Statistical Institutes (NSIs), National Statistical Systems (NSS) and other partners.

The MDI began in 2018, as described in “Creating an EU-wide Micro Data Infrastructure (MDI): a handbook for Micro-Data Linking”. Since then, the pilots have evolved into a maintained infrastructure that is periodically launched at NSIs. This manual provides operational guidance for current and future MDI launches, for building the infrastructure in new countries, and for defining a medium-term horizon of continuous improvement within each 3–6 month deployment cycle.

The MDI is designed with a dual objective: to harmonize firm-level data across countries and to streamline the research process for conducting cross-country analyses on a wide range of topics.

At its core, MDI provides a standardized environment that enables researchers to perform identical analyses across multiple countries. It ensures microdata comparability and accessibility within a unified framework. The infrastructure supports functions ranging from data importation and harmonization to advanced analytical outputs, all within a secure environment that safeguards data confidentiality. In a nutshell, MDI does the following:

Raw data \(\rightarrow\) Data harmonization \(\rightarrow\) Comparable cross-country microdata

Raw data: refers to the task of compiling all available datasets and variables from each NSI into a detailed metadata inventory.

Data harmonization: refers to the entire process of constructing variables that are comparable across countries. This involves establishing a standardized set of variable names and definitions (MD metadata). Based on this standard, the raw data from each NSI is used to generate corresponding variables and files aligned with the MD metadata. The harmonization process also includes creating concordance tables to standardize categorical codes.

Comparable cross-country microdata: refers to the tools and guidelines provided to researchers for effective data use. This includes best practices for writing research code (module) and the provision of mock data (designed to replicate the structure of the real data) for testing purposes.
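As a stylized illustration of the harmonization step, raw NSI variables are renamed and recoded against a shared dictionary. The column names and mapping below are invented for this sketch and are not the actual MD metadata:

```r
# Toy raw extract from an NSI (column names are hypothetical)
raw <- data.frame(
  vat_id        = c("A1", "A2"),
  turnover_keur = c(120, 45),
  yr            = c(2020, 2020)
)

# Stylized MD metadata: maps raw names to standardized MD_varnames
md_map <- c(vat_id = "firmid", turnover_keur = "turnover", yr = "year")

# Apply the mapping to obtain an MD-style panel
names(raw) <- unname(md_map[names(raw)])
raw
```

In the real infrastructure this mapping is driven by the NSI metadata tables rather than a hard-coded vector, but the principle is the same: one standardized name per concept, applied uniformly across countries.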

1.2 Who

The MDI is a joint initiative by CompNet, National Statistical Institutes (NSIs), and other partners. CompNet staff lead the technical maintenance and development of the infrastructure, and provide training and guidelines on how to use it. Together with NSIs and partners, they access firm-level data across countries and operate the MDI infrastructure to generate research outputs. Please see below the MDI stakeholders and process:

%%{init: {
  "theme": "base",
  "themeVariables": {
    "primaryColor": "#5f7991",
    "edgeLabelBackground":"#ffffff",
    "lineColor": "#e40000",
    "textColor": "#000000",
    "fontSize": "26px"
  },
  "flowchart": {"curve": "linear"}
}}%%

flowchart LR

%% Stakeholders

subgraph STAKE["Stakeholders"]
  direction TB
  R[Researcher]
  C[CompNet / MDI Network]
  S[Statistical<br/>Institute]
  
  R  -- "Research project and payload" --> C
  C  -- "Code to build the infrastructure" --> S
  S  -- "Metadata preparation" --> C
  C  -- "Metadata and tools" --> R
end
  
%% Remote environment 

subgraph REMOTE["Environment"]
  direction TB
  RA["Remote access (AT FR GB NL SI)"]
  MP["MDI partner (FI IT)"]
  RE["Remote execution (PT DE)"]
  
  RA <--> MP <--> RE
end
  
%% Outcomes

subgraph OUT["Outcomes"]
  direction TB 
  O1[Special research and publication]
  O2[Standard moments and indicators ‑ publication]
  
  O1 <-- "Output is obtained by CompNet / MDI Network" --> O2
end

%% Rocket 

Ro[🚀 **Rocket** 🚀]

%% Graph flow 

STAKE --> Ro
STAKE -- "Obtains the output" --> OUT

Ro --> REMOTE
REMOTE --> OUT

%% Style  

classDef remote stroke:#e40000,stroke-width:2px;
class RA,MP,RE remote;

classDef output fill:#5f7991,color:#ffffff;
class O1,O2 output;

classDef whitebg fill:#ffffff,stroke:#000000,color:#000000;
class STAKE,REMOTE,OUT whitebg;
  

Note: The diagram shows stakeholder roles, execution environments, and outputs. Rocket is the codebase deployed at NSIs. Two access models exist: direct remote access and indirect remote execution. Arrows indicate code, metadata, and output flows. All outputs are subject to NSI disclosure control before publication.

  • National Statistical Institutes (NSIs) and other Partners
    • NSI remote execution
    • NSI remote access
    • Partners with country-specific (remote) access

NSIs provide the underlying data and support either remote execution or access to confidential firm-level data. While legal access rules, data availability, and technical infrastructure vary across countries, NSIs form the backbone of the standardized MDI research environment.

  • Module writers (MDI users)
    • Productivity Boards
    • External Academic and Policy
    • MDI ‘Theme’ research staff

MDI users include productivity boards, external researchers, and thematic research staff. They are responsible for designing research modules that harness MDI’s infrastructure for cross-country analysis.

  • MDI staff
    • Country specialists
    • Thematic research personnel
    • Infrastructure support teams

MDI staff ensure the effective development and operation of the MDI environment. They support NSIs with data preparation and documentation, and assist module writers by providing expertise on data, tools, and research themes.

1.3 How

The MDI infrastructure is a continuously evolving codebase, known as Rocket, that is periodically deployed within the secure environments of NSIs. Its main function is to process and harmonize raw data, execute research code (modules), and export results, all while strictly complying with NSIs’ disclosure rules. This process is referred to as a launch; it occurs every 4 to 6 months, depending on country readiness. Please see below the MDI launch pipeline:

flowchart LR
    R["<b>Rocket</b><br>Contains research codes<br>(<i>modules</i>)<br>+<br>All needed R scripts to<br> harmonize the raw data"]
    D[("Harmonized data<br>&uarr;<br><b>Raw data</b><br>&darr;<br>Metadata<br><small>constantly updated</small>")]
    O["<b>Output</b><br><small>CSVs outside the NSI<br> protected environment</small>"]

    R --> D
    D -->|export| O
    D -.->|Metadata feed rocket| R
    
    classDef rocket fill:#ffcccc,stroke:#333,stroke-width:2px;
    classDef raw fill:#ccffcc,stroke:#333,stroke-width:2px;
    classDef output fill:#ccccff,stroke:#333,stroke-width:2px;

    class R rocket;
    class D raw;
    class O output;

Note: The diagram shows the MDI launch pipeline. The Rocket represents the deployed codebase containing harmonization scripts and research modules. It processes raw data and constantly updated metadata to produce harmonized data within the secure NSI environment. The harmonized datasets are then exported as output files outside the protected environment, only after passing disclosure checks.

  • Access models
    • Direct access: Researchers connect to the NSI secure environment with user credentials and run approved code on site.
    • Indirect access: NSI staff or MDI staff execute the approved code and return only disclosure-safe outputs.
  • Class: Describes classification variables in the datasets, such as industry or product codes.
  • Codebook: Maps categorical variable values to their corresponding descriptions.
  • Data centers: Technical environments managed by NSS components that host, process, and secure microdata.
  • Datafiles: Lists all available NSI firm-level data files, including their names and years covered.
  • Disclosure Criteria: Rules designed by the NSIs to protect the confidentiality of firm-level data, ensuring that no output allows the identification of individual firms or the disclosure of sensitive information, even in aggregated form.
  • Hierarchy: a table that maps a classification at different aggregation levels. E.g., NACE 4-digit code 6491 corresponds to NACE 3-digit 649, NACE 2-digit 64, and industry K.
  • MD metadata: standardized set of variable names (MD_varname, i.e., firmid, capital, etc.) and respective definitions that forms microdata (MD) panels, or the MD_dataset (i.e., BS, SBS, ENER, etc.) set by the MDI team.
  • MDI: Microdata Infrastructure.
  • MDI data catalogue: catalog containing all variables and their year range availability by country.
  • MDI launch: the recurring process of running the modules within the rocket every few months.
  • MDI tools: set of R functions created by the MDI team to generate the MD_datasets, manipulate them and execute modules.
  • Module: research code. Module names are defined with an acronym (“res_group”). For example, a module about firm dynamics is called FD (res_group=FD).
  • NSS: The coordinated institutional and technical framework encompassing the NSI and associated data centers.
  • NSIs: National Statistical Institutes. These are the public authorities responsible for official statistics in each country. They host the confidential microdata, set legal rules, run disclosure control, and provide the secure environments where MDI operates.
  • Nuvolos: cloud server platform where MDI users develop and test their code. This space is designed for training, practicing, and familiarizing oneself with the MDI infrastructure.
  • Rocket: codebase containing modules and scripts that are periodically deployed within the secure environments of NSIs to process and harmonize raw data, execute modules, and export results.
  • Varnames: Documents the variables and their descriptions for each raw data file listed in datafile.

2 Using MDI

This section focuses on using the MDI and is meant primarily for research groups and module writers. It outlines all steps involved in conducting research with the MDI - from formulating a research question to selecting variables and preparing data files. It also provides information about launches, including the research execution process and the overall timeline.

2.1 MDI Users

MDI users (or module writers) include productivity boards, external academic and policy researchers, and MDI ‘Theme’ research staff. They are responsible for developing research modules that leverage MDI’s infrastructure for data analysis.

2.2 Setup for Researcher

Module writers develop and test their research code using mock data on the Nuvolos platform (see Nuvolos section). This process relies on a standardized metadata structure initialized through an R setup program. If a researcher has direct access to the microdata, they may also develop and test their modules directly using real data. Once development is complete, MDI staff consolidate and stack country-level outputs to enable cross-country analysis without granting direct access to firm-level data.

2.3 Workflow for writing modules

Writing modules for the MDI launch is an iterative process that moves from conceptualization to execution. It is a staged process designed for reproducibility and cross-country comparability. Start from a clear research question, select MD variables that exist across countries, prototype on Nuvolos mock data, validate disclosure compliance in-code, and prepare exports with complete metadata.

Note: MDI Module Writing workflow

Deadlines and launch schedules

MDI modules are executed every four months through pre-scheduled launches. Accordingly, the MDI team communicates specific deadlines to all researchers for submitting their research modules and alerts the NSI staff accordingly.

The following table contains an estimate of the duration of a whole launch (between brackets, in the first column, a reference to the items in the diagram above):

Task Estimated duration
1) Research module preparation (1. - 4.) one month
2) Module testing and submission (5. & 6.) two/three weeks
3) Launch preparation (7.) a few days to a week
4) Launch execution (7.) two months
5) Extraction of the results and consolidation of the output (8. & 9.) a few days to a couple of months

Hence, a researcher can expect to receive all the consolidated cross-country results three to six months after module submission.

2.3.1 Define your research question

Every module begins with a clear and concise research question, designed to leverage MDI’s cross-country data and produce meaningful analytical insights.

Important

Before writing the analytical code, you must define a research acronym for your module (specified as res_group <- '(some 2-letter string)') and communicate it to the MDI team.

2.3.2 Data selection

Use the MDI data catalog to identify and select the most relevant datasets and variables for your analysis.

Important

Ensure that all MD variables used in the module, especially the employment variable, are available across all countries.

However, keep in mind that the harmonized version of the classification variable nace, called MDnace, is not present in the catalog. If you want to use harmonized industry codes in your code, make sure you use MDnace instead of nace.

Conversely, the harmonized versions of the product and trade codes (prodcom and cn08, respectively) keep the original classification variable names. If you want to use the original non-harmonized codes, use NSI_(classname) in your code.

Check the dedicated section below for more details.

Open the MDI Metadata Viewer

Additionally, module writers can look at information on the data source, firm sample, and other details (taken from the NSI_datafiles tables) of the raw datafiles underlying each MD panel by using the interactive Datafiles Info Viewer tool.

Once the final selection of MD variable names has been made for the module, a file named (res_group)_MDnames_select.csv (see example below) must be submitted to the MDI team. This needs to have the column names as shown below.

MD_dataset MD_varname
BR firmid
BR plantid
BR entid
BR entgrp
BR year
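A file with this layout can be produced directly from R. A minimal sketch, assuming a hypothetical module with res_group FD:

```r
# Build the variable-selection table for a hypothetical module (res_group = "FD")
selection <- data.frame(
  MD_dataset = rep("BR", 5),
  MD_varname = c("firmid", "plantid", "entid", "entgrp", "year")
)

# File name follows the (res_group)_MDnames_select.csv convention;
# tempdir() stands in for the module folder in this sketch
out_file <- file.path(tempdir(), "FD_MDnames_select.csv")
write.csv(selection, out_file, row.names = FALSE)
```

Generating the file programmatically avoids typos in MD_varname entries, which would otherwise only surface at launch time.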

2.3.3 Analysis

2.3.3.1 Libraries, packages and the MDI R tools

Make use of the MDI R packages (see ../rocket/Rtools/Rpackages/Rpackage_info_v2.3_.csv) and the MDI Rtools (see ../docs/MDI_Rpackage_1.0.0.pdf). The R package libraries currently installed at NSIs, and loaded at runtime by the launcher, are listed in ../rocket/Rtools/Rpackages/record_package_info.csv and in the table below.

If you need a package that is not part of the current list of R libraries, notify the MDI staff so it can be added to the NSI requirements. When preparing output, use standardized functions from the MDI Rtools (see directory ../rocket/Rtools/R) whenever possible.

Package Version Title NL_version
abind 1.4-5 Combine Multidimensional Arrays
broom 1.0.6 Convert Statistical Objects into Tidy Tibbles 1.0.7
cluster 2.1.6 "Finding Groups in Data": Cluster Analysis Extended Rousseeuw et al.
conflicted 1.2.0 An Alternative Conflict Resolution Strategy
data.table 1.17.2 Extension of `data.frame` 1.16.2
devtools 2.4.5 Tools to Make Developing R Packages Easier
DiagrammeR 1.0.11 Graph/Network Visualization
dplyr 1.1.4 A Grammar of Data Manipulation
factoextra 1.0.7 Extract and Visualize the Results of Multivariate Data Analyses
fixest 0.12.1 Fast Fixed-Effects Estimations
FNN 1.1.4.1 Fast Nearest Neighbor Search Algorithms and Applications
foreign 0.8-86 Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
frontier 1.1-8 Stochastic Frontier Analysis
fs 1.6.6 Cross-Platform File System Operations Based on 'libuv' 1.6.4
ggplot2 3.5.2 Create Elegant Data Visualisations Using the Grammar of Graphics 3.5.1
git2r 0.36.2 Provides Access to Git Repositories 0.33.0
gmm 1.8 Generalized Method of Moments and Generalized Empirical Likelihood
gridExtra 2.3 Miscellaneous Functions for "Grid" Graphics
gt 0.11.0 Easily Create Presentation-Ready Display Tables 0.11.1
haven 2.5.4 Import and Export 'SPSS', 'Stata' and 'SAS' Files
igraph 2.0.3 Network Analysis and Visualization
knitr 1.50 A General-Purpose Package for Dynamic Report Generation in R 1.49
lfe 3.0-0 Linear Group Fixed Effects
mFilter 0.1-5 Miscellaneous Time Series Filters
modelsummary 2.2.0 Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready
momentfit 0.5 Methods of Moments
openxlsx 4.2.5.2 Read, Write and Edit xlsx Files 4.2.7.1
pander 0.6.5 An R 'Pandoc' Writer
plm 2.6-4 Linear Models for Panel Data
poLCA 1.6.0.1 Polytomous Variable Latent Class Analysis
readr 2.1.5 Read Rectangular Text Data
readxl 1.4.3 Read Excel Files
Rtsne 0.17 T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation
shiny 1.10.0 Web Application Framework for R 1.9.1
stargazer 5.2.3 Well-Formatted Regression and Summary Statistics Tables
stringr 1.5.1 Simple, Consistent Wrappers for Common String Operations
tidyr 1.3.1 Tidy Messy Data
zip 2.3.1 Cross-Platform 'zip' Compression
zoo 1.8-12 S3 Infrastructure for Regular and Irregular Time Series (Z's Ordered Observations)

The MDI team also maintains an overview of:

  • Installed R versions at each NSI, along with details on package installation policies. This information is available in Rversions_countries.csv, located in /MDI/docs/.
Code
library(readr)
library(knitr)
library(kableExtra)

# Read the rversions data
rversions_data <- read_csv("Rversions_countries.csv", show_col_types = FALSE)

# Create a scrollable table
kable(rversions_data, format = "html", escape = TRUE) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(ncol(rversions_data), width = "25em") %>%  
  scroll_box(width = "100%", height = "auto")
Country R-version Date cran_install (Yes/No) install_method (CRAN, IT-managed, Other specify) version_control (Yes/No/Partial specify) r_update_freq library_restore (Yes/No/Manual) Additional info
SI 4.4.2 Jul.25 Yes Yes Yes a few times per year Yes NA
FR 4.3.1 Aug.24 No Local CRAN Mirror No. Only the latest version of the packages are available by default A few times per year Manual NA
AT 4.1.3 Aug.24 No IT-managed No They provide all Users with a shared library so they always install the newest version. Ad hoc NA
FI 4.3.1 Apr.25 No Local CRAN Mirror. Package updates may take a few days No. Only the latest version of the packages are available by default Infrequent (less than once per year). The packages will need to be reinstalled after each R update. No NA
PTx 4.5.1 Jan.26 Yes CRAN Yes no restrictions No NA
NL 4.2.3 Sept.24 No IT-managed Partial Every 1.5 months Yes During each scheduled maintenance weekend, all packages are updated to their latest versions. Older versions are archived, allowing rollback to a specific version if needed (CBS will assist with this on request). However, if a desired version was never the latest at the time of a maintenance update, it won’t be available on the server.
GB 4.4.0 Apr.25 No The package files need to be imported to the environment and manually installed Yes but it can be annoying to handle with dependencies a few times per year No NA
MT NA Jan.26 Yes CRAN Yes NA NA NA
EL 4.4.2 Jan.26 Yes IT-Managed Yes Ad hoc Manual RStudio Server 2024.09.1+394 running on Ubuntu.
Code
# Display table with column definitions
rversions_def <- read_csv("Rversions_columns_definitions.csv", show_col_types = FALSE)

# Display it as a regular table
kable(rversions_def)
Column name Definition
R-version The current R-version installed in the NSI remote env.
Date Date of the last information update on the R-version
cran_install (Yes/No) Whether packages can be installed directly from CRAN (Yes/No)
install_method (CRAN, IT-managed, Other specify) Method used if not CRAN (e.g., Manual, IT-managed, Custom script)
version_control (Yes/No/Partial specify) Can specific package versions be installed (Yes/No/Partial)
r_update_freq Frequency of R updates (e.g., Quarterly, Annually, Ad hoc)
library_restore (Yes/No/Manual) Whether initial libraries are restored after update (Yes/No/Manual)
Additional info Any additional notes or info
  • Package conflict resolution preferences are documented in conflicts_prefer.csv, located under /MDI/rocket/Rtools/Rpackages/.
Code
library(readr)
library(knitr)
library(kableExtra)

# Read the rversions data
setwd('..')
setwd('rocket/Rtools/Rpackages/')
rconflicts_data <- read_csv("conflicts_prefer.csv", show_col_types = FALSE)

# Create a scrollable table
kable(rconflicts_data, format = "html", escape = TRUE) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(ncol(rconflicts_data), width = "25em") %>%  
  scroll_box(width = "100%", height = "auto")
Function ConflictingPackages Count PreferredPackage
%>% dplyr, tidyr, stringr, DiagrammeR, gt 5 dplyr
Position ggplot2, base 2 ggplot2
all_of dplyr, tidyr 2 dplyr
any_of dplyr, tidyr 2 dplyr
as.Date zoo, base 2 base
as.Date.numeric zoo, base 2 base
as.data.frame git2r, base 2 git2r
as_label dplyr, ggplot2 2 dplyr
as_tibble dplyr, tidyr 2 dplyr
between data.table, dplyr, plm 3 data.table
body<- methods, base 2 base
bread fixest, momentfit, sandwich 3 fixest
coef momentfit, stats 2 momentfit
combine dplyr, gridExtra 2 dplyr
confint momentfit, stats 2 momentfit
contains dplyr, tidyr, gt 3 dplyr
diff git2r, base 2 git2r
ends_with dplyr, tidyr, gt 3 dplyr
enexpr dplyr, ggplot2 2 dplyr
enexprs dplyr, ggplot2 2 dplyr
enquo dplyr, ggplot2 2 dplyr
enquos dplyr, ggplot2 2 dplyr
ensym dplyr, ggplot2 2 dplyr
ensyms dplyr, ggplot2 2 dplyr
estfun fixest, sandwich 2 fixest
everything dplyr, tidyr, gt 3 dplyr
expr dplyr, ggplot2 2 dplyr
filter dplyr, stats 2 dplyr
first data.table, dplyr 2 data.table
fixef fixest, plm 2 fixest
head git2r, utils 2 git2r
index zoo, plm 2 plm
intersect dplyr, base 2 dplyr
kernapply momentfit, stats 2 momentfit
kronecker methods, base 2 base
lag dplyr, plm, stats 3 dplyr
last data.table, dplyr 2 data.table
last_col dplyr, tidyr 2 dplyr
lead dplyr, plm 2 dplyr
matches dplyr, tidyr, gt 3 dplyr
merge momentfit, git2r, base 3 momentfit
model.matrix momentfit, stats 2 momentfit
nobs plm, stats 2 plm
npk MASS, datasets 2 MASS
num_range dplyr, tidyr, gt 3 dplyr
one_of dplyr, tidyr, gt 3 dplyr
p shiny, pander 2 shiny
plot momentfit, graphics, base 3 momentfit
print momentfit, base 2 momentfit
pull dplyr, git2r 2 dplyr
quo dplyr, ggplot2 2 dplyr
quo_name dplyr, ggplot2 2 dplyr
quos dplyr, ggplot2 2 dplyr
reset lmtest, git2r 2 git2r
residuals momentfit, stats 2 momentfit
select dplyr, MASS 2 dplyr
setdiff dplyr, base 2 dplyr
setequal dplyr, base 2 dplyr
show momentfit, methods 2 momentfit
starts_with dplyr, tidyr, gt 3 dplyr
subset momentfit, base 2 momentfit
summary momentfit, base 2 momentfit
sym dplyr, ggplot2 2 dplyr
syms dplyr, ggplot2 2 dplyr
tag shiny, git2r 2 shiny
tags shiny, git2r 2 shiny
tibble dplyr, tidyr 2 dplyr
tribble dplyr, tidyr 2 dplyr
union dplyr, base 2 dplyr
update momentfit, stats 2 momentfit
vars dplyr, ggplot2, gt 3 dplyr
vcov momentfit, stats 2 momentfit
vcovHAC momentfit, sandwich 2 momentfit
vcovHC sandwich, plm 2 plm
yearmon data.table, zoo 2 data.table
yearqtr data.table, zoo 2 data.table

In the event of package conflicts, we follow the preferences outlined in this file. However, if a module requires a function from a non-preferred package, authors must explicitly use the package::function() syntax to avoid ambiguity. This syntax is generally encouraged to ensure clarity and compatibility across systems.
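The `package::function()` syntax can be illustrated with base R alone. Several attached packages may export a function named `filter` (e.g. `stats` and `dplyr`); qualifying the call removes any ambiguity regardless of load order:

```r
x <- c(1, 5, 2, 8, 3)

# Without qualification, filter() could resolve to stats::filter() (time series)
# or dplyr::filter() (row subsetting), depending on what is attached.
# A qualified call makes the intent explicit:
smoothed <- stats::filter(x, rep(1/2, 2))  # two-term moving average from base R

# If dplyr were attached, row filtering would be written analogously as
# dplyr::filter(some_data_frame, year >= 2015)
```

The qualified form works even when the package is installed but not attached, which also makes the module's dependencies visible at a glance.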

While we aim for a harmonized environment, some variation between countries may persist due to local constraints. Any such discrepancies are documented and communicated to module writers before deployment.

There are five general categories of R tools:

  • Metadata: for generating, verifying, and manipulating metadata
  • Infra: used by MDI staff for data importation, harmonization, and disclosure checks
  • MDI: mostly for module writers, e.g. merge_datatables, regressions, aggregations, export
  • Analysis: support analytical tasks and reporting
  • Programmer: assist with R coding tasks

All tools are documented using Roxygen2 and exported as an R package. You can access the documentation via the standard ?function_name syntax, or by clicking on the mdi package in the RStudio Packages tab to view the full list of available functions.

2.3.4 Importing data

The NSI metadata enables the creation of standardized microdata panels (MD), which are harmonized and managed by the Launcher based on the logic defined in countdown.R.

MD datasets can be imported from the dirTMPSAVE folder, which is predefined in the environment. For example, the MD dataset BR can be imported using the following code snippet:

Code
BR <- readRDS(paste0(dirTMPSAVE, 'BR.rds'))
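Outside the NSI environment the same pattern can be rehearsed with a mock file. In this sketch dirTMPSAVE is simulated with a temporary directory and the BR contents are invented:

```r
# Simulate the predefined dirTMPSAVE folder with a temp directory
dirTMPSAVE <- paste0(tempdir(), "/")

# Create a mock BR panel and save it where the launcher would
mock_BR <- data.frame(firmid = c("f1", "f2"), year = c(2020, 2020), emp = c(10, 250))
saveRDS(mock_BR, paste0(dirTMPSAVE, "BR.rds"))

# Import exactly as a module would inside the NSI environment
BR <- readRDS(paste0(dirTMPSAVE, "BR.rds"))
```

Developing against a mock file of this shape means the identical readRDS line runs unchanged at launch time, when dirTMPSAVE points to the real harmonized data.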

2.3.5 Manipulating data

You can freely manipulate linked panels using R and its libraries, such as data.table, dplyr, and the broader tidyverse.

Managing firm units

When writing your analytical module, always check the unit of observation in each MD dataset for all countries where your code will run. This can be verified in the MDI Metadata Viewer.

For example, if using the ENER dataset, note that the unit of observation may vary between countries—such as between France and Portugal. Your R code must account for these differences to ensure analytical consistency. For details on the units used in each MD dataset, please refer to the metadata file MD_idInfo.csv.

To assist with this, the MDI toolkit includes a utility for aggregating or disaggregating between different key IDs. This tool is located in rocket/Rtools/R/mdi_key_id_switch.R and uses the metadata file *NSI*_firmid_entgrp_conc.csv.
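The mechanics of switching between key IDs can be sketched in base R. Given a plant-to-firm concordance (the table and variable names below are invented for illustration; the real tool reads *NSI*_firmid_entgrp_conc.csv), plant-level values are aggregated up to the firm level:

```r
# Hypothetical concordance between plant IDs and firm IDs
conc <- data.frame(plantid = c("p1", "p2", "p3"),
                   firmid  = c("f1", "f1", "f2"))

# Plant-level observations (energy_use is an invented variable)
plants <- data.frame(plantid = c("p1", "p2", "p3"),
                     energy_use = c(10, 20, 5))

# Attach the firm key, then aggregate plant values to the firm level
plants <- merge(plants, conc, by = "plantid")
firms  <- aggregate(energy_use ~ firmid, data = plants, FUN = sum)
```

Disaggregation (firm to plant) works the same way in reverse, but requires a rule for splitting firm totals across plants, which is why checking the unit of observation per country matters.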

2.3.5.1 Working with classifications

Classification variables, such as industry or product codes, are key in microdata work. We use tools that allow classification lists to be coherent over time.

In those harmonized MD datasets where a classification variable is present, be aware that these include both the raw classification variable and the time-concorded one. Make sure you keep this in mind when designing your code!

In particular, when you prepare your (res_group)_MDnames_select.csv, please use the original MDnames for all classification variables, but feel free to use the concorded MDnames for the concorded classification variables.

For more details, check the dedicated section.

2.3.5.2 Merging data

Additionally, when merging data from two files, use the mdi_mergedatatables() function. This helps prevent memory issues and ensures that merges are performed correctly.

Tip

When working with the MD_dataset CIS, given that the data come from surveys run every second year, it is recommended to always merge CIS with BR, as follows:

Code
DT <- merge(BR, CIS, by = c('firmid','year'), all.x = TRUE)

Then the user can decide on how to interpolate the values in the missing years for the same firmid.
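One simple interpolation choice is to carry the last reported CIS value forward into the intervening year. A base-R sketch of last-observation-carried-forward within each firm (innov_exp is an invented variable name):

```r
# Mock merged panel: CIS observed every other year, NA in between
DT <- data.frame(firmid = "f1", year = 2018:2021,
                 innov_exp = c(100, NA, 140, NA))

# Carry the last non-missing value forward within the firm (LOCF)
locf <- function(x) {
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}
DT$innov_exp <- ave(DT$innov_exp, DT$firmid, FUN = locf)
```

Whether carry-forward, linear interpolation, or dropping the gap years is appropriate depends on the research question; the point is only that the choice should be made explicitly in the module.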

Last but not least, be a smart coder: clean up unnecessary datasets, and avoid writing code that calls the operating system, creates (sub)directories, or installs R packages.

2.3.6 Analytical tools

The MDI toolkit includes a set of functions designed to help researchers efficiently carry out common tasks in firm-level microdata research. In this section, we present some of the key tools available. For more details, please refer to the MDI R tools section above and consult the mdi library PDF.

  • mdi_aggregate()

    This tool aggregates variables in a data.table by group, allowing customizable statistics (e.g., sum, mean, HHI – check the PDF manual of the mdi library for the full list of methods in the related section), optional merging with the original dataset, and built-in disclosure checks.

Working with quantiles

Note that any output containing individual data points (such as plots or tables with exact quantiles) cannot be exported due to disclosure restrictions. Hence, quantiles cannot be exported as such.

However, keep in mind that mdi_aggregate() allows you to compute the mean of the minimum number of observations allowed for disclosure (the function uses MDIminNumObs; check the related section below) around the observation that is closest to the first quartile (q25), the median, or the third quartile (q75).

The diagram below illustrates how this value is calculated for a series of values (3 to 11), in case

  • MDIminNumObs is an odd number

timeline
    title Odd: `MDIminNumObs` = 5 → pick 2 below, 1 at quantile, 2 above
    3  : |
    5  : 🔵 (2nd below)
    6  : 🔵 (1st below)
    7  : 🔴 (closest to q)
    8  : 🔵 (1st above)
    9  : 🔵 (2nd above)
    11 : |

  • MDIminNumObs is an even number

timeline
    title Even: `MDIminNumObs` = 6 → pick 3 below, 1 at quantile, 2 above (bias below)
    3  : 🔵 (3rd below)
    5  : 🔵 (2nd below)
    6  : 🔵 (1st below)
    7  : 🔴 (closest to q)
    8  : 🔵 (1st above)
    9  : 🔵 (2nd above)
    11 : |

Note that if the number of observations in the aggregate is small, the resulting mean might be very different from the quantile value.
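The selection logic illustrated in the diagrams can be sketched in base R. This is an illustrative reimplementation under the stated picking rule, not the mdi_aggregate() source:

```r
# Disclosure-safe "quantile": mean of min_obs observations around the quantile.
# Assumes length(x) is comfortably larger than min_obs (no bounds handling here).
safe_quantile <- function(x, p, min_obs = 5) {
  x <- sort(x)
  q <- quantile(x, p, names = FALSE)
  center <- which.min(abs(x - q))        # observation closest to the quantile
  below  <- ceiling((min_obs - 1) / 2)   # extra pick goes below when min_obs is even
  idx <- (center - below):(center + (min_obs - 1 - below))
  mean(x[idx])
}

# Series from the diagrams: median is 7, window is 5,6,7,8,9
safe_quantile(c(3, 5, 6, 7, 8, 9, 11), 0.5, min_obs = 5)  # mean(5,6,7,8,9) = 7
```

With min_obs = 6 the window extends one further observation below (3, 5, 6, 7, 8, 9), matching the "bias below" rule in the even case.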

  • estimate_markup()

    This function computes firm-level markups following De Loecker and Warzynski (2012): the output elasticity of a variable input is multiplied by revenue and divided by that input’s cost, and the resulting markup is returned as a new variable.

  • estimate_prod()

    This tool estimates firm-level production function parameters (such as input elasticities and/or TFP) using OLS, ACF, or OP methods under Cobb-Douglas or translog specifications. It offers flexible options for fixed effects, instruments, and grouped estimation.

  • mdi_regress()

    This function runs one or more regressions using feols or feglm from the fixest package, performs automatic disclosure checks to ensure the minimum observation threshold is met, and optionally exports LaTeX regression tables with accompanying metadata logs.

  • pim_capital()

    This tool estimates firm-level capital stock using the Perpetual Inventory Method (PIM), based either on a user-specified depreciation rate or an inferred asset type. It returns the original data.table with an added capital stock variable.
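The recursion behind the Perpetual Inventory Method is K_t = (1 − δ)·K_{t−1} + I_t. A minimal base-R sketch under an assumed depreciation rate (this illustrates the accumulation identity only, not the pim_capital() interface):

```r
# PIM recursion: K_t = (1 - delta) * K_{t-1} + I_t
pim <- function(investment, k0, delta = 0.1) {
  k <- numeric(length(investment))
  prev <- k0
  for (t in seq_along(investment)) {
    prev <- (1 - delta) * prev + investment[t]
    k[t] <- prev
  }
  k
}

# With investment exactly offsetting depreciation, the stock is stationary
pim(c(10, 10, 10), k0 = 100, delta = 0.1)  # 100 100 100
```

In practice the initial stock k0 and the depreciation rate are the sensitive inputs; pim_capital() lets the rate be user-specified or inferred from the asset type.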

Industry Aggregations

Researchers may wish to conduct their analysis at various levels of sectoral aggregation. The MDI infrastructure supports this by providing classification concordances such as MD_nace_hier.csv and MD_naceR2_CNind_classconc.csv, which allow NACE Rev.2 industry codes to be mapped to broader industry groupings—such as 3-digit, 2-digit, 1-digit levels, and the CompNet macroindustry classification.
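Because NACE Rev.2 codes are hierarchical, the digit-based levels can be obtained by truncation; only the CompNet macroindustry mapping needs the concordance CSV. A sketch of the truncation step (codes chosen to echo the hierarchy example in the glossary):

```r
# 4-digit NACE codes, stored as character to preserve leading zeros
nace4 <- c("6491", "6419", "2410")

# Coarser levels by truncating digits
nace3 <- substr(nace4, 1, 3)  # 3-digit groups
nace2 <- substr(nace4, 1, 2)  # 2-digit divisions
```

Keeping codes as character strings matters: converting "0620" to numeric would silently drop the leading zero and break any merge against the concordance tables.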

2.3.7 Exporting results

Once a results table is generated, the researcher must extract the file at the end of the module. After the launch is fully executed in a given country, the country leader submits an export request to the NSI, which then verifies compliance with disclosure rules for each output file (see disclosure criteria for details).

The mdi_export() function facilitates this process by exporting a data.table to a specified file format and logging comprehensive metadata—including variables used, purpose, and dataset context—into a central description file (OutputDescription.csv), which is also extracted. The function includes optional disclosure checks for summary statistics.

Below is a description of all parameters required for mdi_export():

It is fundamental, for disclosure reasons, that the module writer fill in exhaustive information for each output file when using the function mdi_export(). In particular, please provide:

  1. format
    Character string specifying the format of the export (‘csv’, ‘RDS’, ‘txt’, ‘dta’, ‘xlsx’, ‘sas’).

  2. output_name
    The name of the file to be created, without the file extension and the country code.

  3. datasets_used
    The name of the MD_dataset(s) used for the analysis.

  4. purpose
    Describe the research purpose of the analysis.

  5. share_0_1
    Explain whether the output contains any shares equal to 0 or 1 (i.e. 0% or 100% of the group share the same characteristic). Such cases are not allowed according to the output guidelines and must therefore be suppressed or explicitly justified.

  6. zeroes
    If the output contains zero values, provide an explanation of why these zeroes are not revealing additional information. Otherwise, this information must be suppressed.

  7. rel_other_output
    Describe how this output file relates to other previously exported or requested files, for instance whether it performs the analysis in a different way and, if so, how.

  8. selection
    Describe if the results contained in the file were derived from a specific selection of the sample available (if so, explain which selection) or if the full sample is used.

  9. export_type
    Character string indicating the type of output (‘sum_stat’ for summary statistics, ‘reg_tab’ for regression tables, or ‘other’).

  10. description
    A string providing additional explanation of the output file.
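Putting the parameters together, a call might look as follows. This is a hypothetical example: the output name, dataset names, and the name of the data argument (`dt`) are illustrative, and the exact signature should be checked against the tool's documentation.

```r
# Hypothetical mdi_export() call; argument values are illustrative.
mdi_export(dt               = results_table,   # 'dt' is an assumed argument name
           format           = "csv",
           output_name      = "EN_energy_intensity_by_nace2",
           datasets_used    = "MD_sbs, MD_energy",
           purpose          = "Energy intensity by 2-digit industry for the EN module",
           share_0_1        = "No shares equal to 0 or 1 in the output",
           zeroes           = "Zeroes appear only where an industry reports no energy use",
           rel_other_output = "Complements the macro-sector version with finer industry detail",
           selection        = "Full sample of firms with non-missing energy expenditure",
           export_type      = "sum_stat",
           description      = "Mean and median energy cost shares by nace2 and year")
```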

2.3.8 Consolidation of MDI Module Output

Once the files pass the disclosure checks:

  • Country leaders/NSIs will upload each country’s output to their designated Teams folder.

  • MDI staff will consolidate (stack) the outputs of each module by country and place both the module-specific and general launcher outputs in the appropriate Teams folder for module writers.

2.3.8.1 Procedure to stack MDI Module Output

For cross-country analysis, the individual country exports need to be identified and consolidated into combined, stacked datasets per module.

This is accomplished in three steps, using a sequence of scripts that are stored in dirROCKET/MDIprogs.

  1. Step 1: get_output_file_list.R Generate Country-Specific File Lists

The first script creates a file inventory for each country.

Inputs:
- Country code (CC, e.g., FR, FI, NL)
- Launch version number (e.g., 2.3)
- Local path to the country’s upload directory

Process:
1. Iterates through all module output folders for the selected country.
2. Extracts the names of all .csv files, excluding descriptive files (e.g., OutputDescription.txt).
3. Adds metadata:
- Launch version number
- Country code
- A numeric flag indicating the format of the file name (1, 2, or 3).
4. Saves the resulting inventory as launch_<n>_file_list_<CC>.csv in the directory specified at the start of the script.

Output:
A CSV file listing all valid exported files for a single country, annotated with launch and country metadata.


  2. Step 2: generate_stacked_files.R Combine File Lists Across Countries

The second script consolidates the individual country inventories into one master file list.

Inputs:
- File lists generated by Script 1 (launch_<n>_file_list_<CC>.csv for each country).

Process:
1. Reads each country’s file list.
2. Appends a Country column to identify the file’s origin.
3. Stacks the inventories into one combined dataset.

Output:
A single file, launch_<n>_file_list_combined.csv, containing metadata on all exported files across participating countries.


  3. Step 3: consolidate_output.R Consolidate Module-Level Outputs

The third script merges the exported data across countries for a chosen module.

Inputs:
- The combined file list from Step 2.
- Module name (e.g., EN for Energy).
- Country-specific Export directories (Most likely a Teams path).

Process:
1. Filters the combined file list for the specified module.
2. Iterates through each country’s export path and retrieves the corresponding .csv files.
3. Reads and cleans each filename and appends a Country identifier column.
4. Binds all country datasets into one consolidated file.

Output:
A module-specific cross-country combined file (e.g., EN_combined.csv), stored in the Research Agenda folder specified at the start of the script.


To summarise the module export consolidation:

  1. get_output_file_list.R → Generate a country-level export file list.
  2. generate_stacked_files.R → Combine these lists into a cross-country file inventory.
  3. consolidate_output.R → Use the inventory to locate, clean, and stack module-level data exports across countries.

Running these three scripts ensures that all outputs are systematically catalogued, reproducible, and readily available for post-launch comparative analysis.
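The core of the consolidation step can be sketched as follows (file names and columns are illustrative; the actual scripts also clean file names and filter by module):

```r
# Minimal sketch of Step 3: read the same module output from each country's
# export folder and stack it with a Country identifier column.
stack_country_files <- function(paths, countries) {
  parts <- Map(function(p, cc) {
    dt <- read.csv(p, stringsAsFactors = FALSE)
    dt$Country <- cc   # record the file's origin
    dt
  }, paths, countries)
  do.call(rbind, parts)  # bind all country datasets into one consolidated table
}

# Demo with two temporary files standing in for FR and NL exports:
f_fr <- tempfile(fileext = ".csv")
f_nl <- tempfile(fileext = ".csv")
write.csv(data.frame(nace2 = "10", value = 1.2), f_fr, row.names = FALSE)
write.csv(data.frame(nace2 = "10", value = 0.9), f_nl, row.names = FALSE)

combined <- stack_country_files(c(f_fr, f_nl), c("FR", "NL"))
combined$Country
# "FR" "NL"
```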


2.4 Running Order & How-To (Quick Reference)

2.4.1 Prerequisites

  • R packages: data.table, dplyr, readr

  • Directory layout must follow:

    • .../MDI Data Providers Forum - CC - CC/Upload/Launch_<n>/<CC>_output_Launch_<n>_<MODULE>/...
    • Central outputs: .../CompNet MDI Research Agenda - General/Launch_<n>
  • Researchers can run ../launchpad/interactive_MDI.R to initialize their MDI environment in a standardized way.

  • Researchers then analyze the output, optionally using standardized tools for statistical analysis, graphing, and reporting.

2.5 Dealing with classifications

A key feature of firm-level research is the use of classifications, such as industry codes (NACE codes), product codes (PRODCOM codes) and trade codes (combined nomenclature codes). Because the official set of codes in a classification can change over the years, we developed tools that produce a consistent list of codes over time in each country. Specifically, we make use of two tools:

  • make_conc()

    This tool is currently used to harmonize PRODCOM and ITGS codes over time.

    First, it takes the time concordance table for each pair of consecutive years and traces the development of each code over time, so that the yearly concordances capture all possible changes of the codes from the first year to the last year of the relevant period. Second, it links all code paths that share common codes, harmonizing each such group to a common code from the last year of the relevant period.

    Note: the left column of a time concordance table (the one received from the NSI) might not contain all codes observed in the dataset at time year-1. It is therefore advised to use the tool mdi_timeconc_update() from the mdi package, which ensures that any missing mappings are added to the time concordance table for that dataset.

    As inputs, it requires the yearly concordance tables of the classification (in data.table format), the numeric vector of the years of interest, and the character name of the classification. It returns the data.table that concords each code to the harmonized code, for each year.

    When applied to an MD dataset, it returns the original data with the old NSI class code (under column NSI_(classname)) and the harmonized code (under the column named after the classification).

  • concord_nace()

    This tool harmonizes NSI NACE classification over time.

    First, it detects the year with the most NACE code changes, i.e. the year of a possible break in the classification. Then, for each firm, it uses the modal NACE code in the post-break years as the harmonized NACE code and harmonizes the codes of earlier years accordingly. For firms present only before the break year, codes are harmonized based on the code changes of surviving firms, which are used to build a concordance table between codes in the pre- and post-break years.

    As inputs, it requires the character dataset name; a logical indicating whether to weight code matches of surviving firms by employment (instead of by number of firms); the number indicating the cumulative residual share of firms deleted for the construction of the pre- and post-break year concordance; and the number indicating the share of firms deleted for the construction of such concordance.

    It returns the original MD dataset with the old NSI NACE code (under column nace) and the harmonized NACE (under column MDnace).

    The MD NACE can then be added through the concordance table between NSI NACE codes and MD NACE codes.
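The path-tracing idea behind make_conc() can be illustrated with two toy yearly concordance tables (column names are assumptions, not the real interface): chaining them maps every first-year code to its final-year counterpart.

```r
# Two toy year-to-year concordances: in 2011 codes A and B merge into B1,
# and in 2012 B1 is renamed C. Column names are illustrative.
conc_10_11 <- data.frame(code_2010 = c("A", "B"), code_2011 = c("B1", "B1"))
conc_11_12 <- data.frame(code_2011 = "B1",        code_2012 = "C")

# Chain the yearly steps: 2010 codes -> 2011 codes -> 2012 codes
chained <- merge(conc_10_11, conc_11_12, by = "code_2011")
chained[, c("code_2010", "code_2012")]
# Both 2010 codes "A" and "B" harmonize to the final-year code "C"
```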

2.6 Disclosure Procedures

2.6.1 What are Disclosure Criteria

Disclosure criteria at National Statistical Institutes (NSIs) are rules designed to protect the confidentiality of firm-level data. They ensure that no output allows the identification of individual firms or the disclosure of sensitive information, even in aggregated form. These criteria are crucial for complying with national privacy and data protection regulations.

2.6.2 How Disclosure Criteria Are Applied:

MDI tools automate disclosure control by applying primary and secondary confidentiality rules (such as minimum observation thresholds and dominance criteria) before any output is released. These rules ensure that sensitive data is suppressed or flagged, in line with the parameters defined in file payload/Launch_v2.3/MDmetadata/MD_disclosure_info.csv.

Learn more about how this is done:
  • MDI tools such as mdi_aggregate(), disclose(), DisclosCrit(), mdi_regress(), and mdi_export() help automate disclosure control by enforcing rules based on parameters set in the Countdown, ensuring compliance before output is released.

  • Primary disclosure (Step 1) requires suppression of all cells that fail the dominance criterion or contain fewer than the minimum number of observations (minNrObs).

  • Secondary disclosure (Step 2) involves suppressing additional cells to protect those flagged in Step 1, following the minimum frequency rule. This typically means suppressing the smallest unsuppressed cell if only one cell was suppressed in Step 1 (applicable to totals/sums where the parent node is available).

For example, a cell not meeting NumObs or exceeding domPerc is suppressed. Outputs violating these criteria are flagged or excluded from export.

2.6.3 Components of Disclosure Criteria in the MDI:

Four main variables are created by MDI tools to assess disclosure criteria, together with two display settings:

Dominance Share (MDIdomSh)

The maximum share of the total (e.g., employment, sales) contributed by the largest ‘X’ firms (number ‘X’ defined by domNr) in a cell. Example: if domPerc is 0.75, the top ‘X’ firms cannot contribute more than 75% to the cell’s total.

Minimum Number of Observations (MDIminNumObs)

The minimum number of firms required in a cell for it to be included in the output. Example: If NumObs is 3, at least 3 firms must contribute to a cell.

Top Firms Count (MDIdomNr)

Specifies how many top firms’ shares are considered when applying the dominance criterion (domPerc). Example: If domNr is 1, the dominance is based on the largest firm; if 2, the top two firms are considered.

Dominance Variable (MDIdomVar)

The variable on which the dominance criterion is applied, such as employment (emp) or sales (nq). Different NSIs may apply criteria to different variables, depending on their legal requirements. Note: The domVar can be ‘var’ in the countdown. If so, the domPerc is computed for all variables for which an aggregate is computed.

Show dominance percentiles (show_domVar)

This is a dummy variable indicating whether the dominance percentile columns need to be included (1) or not (0) in the output file.

Hide or not hide values post-disclosure (show_values)

This is a dummy variable indicating whether the aggregates in the output file need to be hidden (0) or not (1) in case they don’t comply with the disclosure rules of the NSI.
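The standard top-n dominance share behind MDIdomSh can be computed as in the sketch below; in practice disclosCrit() computes it per aggregation cell.

```r
# Share of the cell total contributed by the largest 'domNr' firms.
top_n_share <- function(values, domNr) {
  sum(sort(values, decreasing = TRUE)[seq_len(domNr)]) / sum(values)
}

# Four firms; the two largest hold 80 + 15 = 95 out of 100:
top_n_share(c(80, 15, 3, 2), domNr = 2)
# 0.95 -> would breach a dominance threshold of, say, 0.75
```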

2.6.4 Disclosure Criteria in MDI Countries

Below are the disclosure criteria in MDI countries:

| disclosure_variable | AT | EL | FI | FR | DE | NL | PTx | PT | SI | GB | MT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MDIminNumObs | 10 | 5 | 3 | 4 | NA | 10 | 1 | 1 | 5 | 10 | 3 |
| MDIdomVar | var | var | persons_br | var | NA | var | var | var | var | var | var |
| MDIdomSh | 0.8 | 0.85 | 0.75 | 0.85 | NA | 0.5 | 1 | 1 | 0.5 | 0.4375 | 0.9 |
| MDIdomNr | 2 | 2 | 1 | 1 | NA | 1 | 1 | 1 | 1 | 1 | 2 |
| show_domPerc | 1 | 1 | 1 | 1 | NA | 1 | 0 | 0 | 1 | 1 | 1 |
| show_values | 0 | 0 | 0 | 1 | NA | 0 | 1 | 0 | 1 | 0 | 0 |

Some countries apply additional disclosure criteria. For instance, the Netherlands (NL) and Slovenia (SI) require that all exported variables (not just employment (emp) and sales (nq)) comply with the dominance share criterion. In such cases, the parameter domVar is set to ‘var’ in the Countdown file. These disclosure parameters are configured during the execution of the infrastructure at an NSI, either by MDI or NSI staff.

2.6.5 Disclosure Routines in MDI

The MDI tools listed below operate using the disclosure parameters defined by the user in the Countdown file.

| Tool | Purpose | Use by Researchers | Use by Other MDI Tools |
|---|---|---|---|
| mdi_aggregate.R | Aggregates data with optional disclosure checks: dominance threshold (`domPerc`) and minimum observations (`NumObs`). | Yes | Yes (`disclosCrit`, `disclose`) |
| mdi_regress.R | Performs regression analysis and automatically checks whether the number of observations meets the required minimum disclosure threshold; skips regressions that fail the check. | Yes | No |
| mdi_export.R | Exports datasets with optional disclosure compliance and metadata logging. | Yes | Yes (`disclose`) |
| disclose.R | Performs primary and secondary disclosure checks. | No | No |
| disclosCrit.R | Adds disclosure metrics (`domPerc`, `NumObs`) to datasets. | No | No |

Module writers are strongly encouraged to use MDI tools to ensure compliance with the disclosure criteria of all countries where the module is intended to run.

Primary and Secondary Disclosure with disclose

The disclose tool applies two levels of disclosure control to aggregated statistics to ensure compliance with confidentiality requirements.
Suppressed values are replaced with the sentinel value -999, and disclosure flags (discflag1, discflag1_*, discflag2) record which suppression criteria were triggered.


2.6.5.1 Primary Disclosure

Primary disclosure applies two main suppression rules to protect confidentiality:

  • Minimum Observations Rule
    Any aggregate based on fewer than the required minimum number of observations (MDIminNumObs) is suppressed.
    All affected variables in that row, including the NumObs column, are replaced with -999.

  • Dominance Rule
    For sum-type variables, the function evaluates the dominance share of the largest contributors (domPerc_*), calculated by disclosCrit().

    • For most countries: a cell is suppressed if the dominance share exceeds the threshold (domPerc >= MDIdomSh).
    • Germany (DE): the rule is inverted — a cell is suppressed if the dominance share falls below the threshold (domPerc < MDIdomSh).
      This reflects German statistical disclosure practice, where low dominance values indicate high concentration risk.

All cells suppressed in this step are flagged with discflag1 (and variable-specific flags discflag1_<var>).
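A minimal sketch of the two primary rules for the standard (non-German) case, using the -999 sentinel described above. Column names are illustrative; the real disclose() tool additionally produces per-variable flags and handles country specifics.

```r
# Three aggregation cells with observation counts and dominance shares.
cells <- data.frame(nace2   = c("10", "22", "25"),
                    NumObs  = c(12, 2, 8),
                    domPerc = c(0.40, 0.30, 0.90),
                    emp_sum = c(540, 35, 410))
MDIminNumObs <- 3
MDIdomSh     <- 0.75

# Primary suppression: too few observations OR dominance breach.
cells$discflag1 <- cells$NumObs < MDIminNumObs | cells$domPerc >= MDIdomSh
cells$emp_sum[cells$discflag1] <- -999   # sentinel for suppressed values
cells$NumObs[cells$discflag1]  <- -999
cells
# "22" fails the minimum-observations rule, "25" fails the dominance rule
```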


2.6.5.2 Secondary Disclosure (Hierarchical Totals Only)

When the aggregated data includes hierarchical levels — for example, industry or regional totals — an additional secondary disclosure step prevents back-calculation of suppressed values.

  • The hierarchy file (hhfile) must be a wide table, with one column per hierarchical level (e.g., h_0, h_1, h_2, …).
  • The node variable in the dataset identifies the child level.
  • The parent level is determined by the next column to the right of the child in the sorted hierarchy (e.g., if node = h_1, the parent is h_2).

Suppression rule:
If within a parent group exactly one child cell was suppressed in the primary step, the tool suppresses one additional child — the non-suppressed cell with the smallest number of observations (NumObs).
This prevents the originally suppressed value from being reconstructed by subtraction from the total.
All such cases are flagged with discflag2.
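The "suppress one more child" rule can be sketched for a single parent group as follows (column names are illustrative; the real tool works across all parent groups in the hierarchy):

```r
# One parent with three children; the first child was suppressed in the
# primary step (discflag1 = TRUE). Column names are illustrative.
children <- data.frame(parent    = "C10",
                       child     = c("C101", "C102", "C103"),
                       NumObs    = c(2, 5, 9),
                       discflag1 = c(TRUE, FALSE, FALSE))

children$discflag2 <- FALSE
if (sum(children$discflag1) == 1) {
  open   <- which(!children$discflag1)
  victim <- open[which.min(children$NumObs[open])]  # smallest unsuppressed cell
  children$discflag2[victim] <- TRUE
}
children[, c("child", "discflag1", "discflag2")]
# C102 (NumObs = 5) receives the secondary suppression flag, so C101 cannot
# be recovered by subtracting the remaining children from the parent total.
```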


2.6.5.3 Germany-Specific Note on Dominance Percentiles

The dominance share (domPerc_*) used in the primary disclosure rule is computed differently for Germany in disclosCrit().
Instead of using the standard top-n share (sum of the top domNr values divided by the total), Germany applies the following ratio:

\(\text{domPerc} = \frac{\text{Total} - x_1 - x_2}{x_1}\)

where \(x_1\) and \(x_2\) are the two largest firm values in each aggregation group.
This yields a dominance percentile that decreases as concentration increases — hence, in Germany, smaller values of domPerc indicate greater dominance and trigger suppression (domPerc < MDIdomSh).
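A worked example of the German ratio under the definition above:

```r
# Worked example of the Germany-specific dominance ratio:
#   domPerc = (Total - x1 - x2) / x1
values <- c(80, 15, 3, 2)                   # firm values in one aggregation group
x <- sort(values, decreasing = TRUE)[1:2]   # x1 = 80, x2 = 15
domPerc_DE <- (sum(values) - x[1] - x[2]) / x[1]
domPerc_DE
# (100 - 80 - 15) / 80 = 0.0625 -> a small value signals high concentration
```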


2.6.5.4 Output

The tool returns the dataset with all required suppressions applied and two disclosure flags:
- discflag1: primary disclosure
- discflag2: secondary disclosure

2.7 Auxiliary Files

Deflators are constructed using data extracted from Eurostat via the eurostat R package. These include deflators for

  • Value-added (pnv, at NACE level 2)

  • Capital depreciation (pnc, at NACE level 2)

  • Gross fixed-assets (pgrK, at NACE level 1)

  • Investment (pni, at NACE level 2)

  • GDP (pngdp, at NACE level 2)

  • Harmonized consumer price index (HCPI) (pnhcpi, at NACE level 2),

all normalized to a 2010 base year (set to 1). The underlying Eurostat datasets—nama_10_a64, nama_10_a64_p5, nama_10_gdp, prc_hicp_aind, and nama_10_nfa—cover national accounts and price indices.

The processed deflator file is structured by country code (cc), industry code (DEFind, which can be linked to the MD variable nace using table nace_DEFind), and year (year). The table also contains asset-specific deflators (e.g., construction, machinery, intellectual property) and includes growth and depreciation rates, offering a detailed dataset for robust analytical use.
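Applying a deflator is then a merge-and-divide operation. In the sketch below, the deflator table columns (cc, DEFind, year, pnv) follow the description above, but the link between nace and DEFind via nace_DEFind is simplified to a direct match, and all values are made up:

```r
# Sketch of deflating nominal value added with the value-added deflator pnv.
firms <- data.frame(cc = "NL", DEFind = "C10", year = c(2010, 2015),
                    va_nominal = c(100, 120))
defl  <- data.frame(cc = "NL", DEFind = "C10", year = c(2010, 2015),
                    pnv = c(1.00, 1.08))

out <- merge(firms, defl, by = c("cc", "DEFind", "year"))
out$va_real <- out$va_nominal / out$pnv   # value added in 2010 prices
round(out$va_real, 2)
# 100.00 111.11
```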

2.8 Nuvolos: where MDI users develop and test their codes

To support testing of both modules and the full MDI infrastructure, a dedicated environment has been set up on the server Nuvolos. This environment includes several separate spaces for different purposes—for example, an internal development space for the MDI team, and a testing space for internal and external users to validate module code.

This server replicates the environment of a national statistical institute and includes the MDI infrastructure. The Nuvolos server provides all the necessary tools, scripts, and libraries required to develop research modules. Each space includes mock data, which consists of artificially generated datasets that mimic real NSI data in structure and naming conventions. These datasets allow for realistic and consistent testing of modules and infrastructure. (Details on how the mock data is created can be found here.)

2.8.1 How to access

Nuvolos is the server environment that we use for training, testing, development, and debugging. There are three separate environments:

1. MDI Training Environment: this space is meant for MDI users, module writers who want to test their scripts, and people who want to learn about the MDI.

2. Nuvolos Developer Space: this space is meant for the MDI infra team to develop and test the MDI infrastructure.

3. Portugal Data Access Space: this space is exclusively for people who have access to the Portuguese data. This is where the MDI for PTx is executed by the MDI team.

If you want to get access to either of the environments, reach out to Johanna via email or Teams.

2.8.1.1 MDI Training Environment

This space is designed for training, practicing, and familiarizing yourself with the MDI infrastructure. It is intended mostly for external users, as it already contains all relevant files, i.e. the whole MDI infrastructure and the mock data. It is an environment in which externals can test their code using the MDI infrastructure and mock data. It is updated periodically, so it does not always reflect the latest version of the MDI. It is not intended for bug-fixing or working on the infrastructure, as it does not link to GitHub.

As this is a practice space, module writers can import or write their scripts, develop their module, run it as part of the MDI, and export their files if needed. The final module needs to be sent to the MDI team before each launch.

In the training space each user has a separate copy of the MDI and data files. That means that if scripts are altered, moved, or added, this is only reflected in the user’s space: deleting a file will not affect any other user, and no other user can see your modifications.

2.8.2 How to use

You will find a folder structure similar to the one found in the NSI environments that contain the actual microdata.

Files structure

  1. “Files” section

     You’ll find the following folders:

     • MDI: contains the MDI infrastructure
     • output: any output generated by a script can be saved here

  2. Additional folders, accessible only through RStudio:

     • space_mount/mockdata/NSIdata/: this folder contains NSI mockdata of several countries
     • space_mount/mockdata/TMP/: this folder contains MD mockdata of several countries

To write their code, users first need to follow some steps to load the environment, i.e. run the functions that import the mockdata and all the necessary auxiliary files mimicking the NSI environments.

Go to the “Applications” section on the left navigation bar and open the RStudio application.

  • Run the countdown:
    • In RStudio, in the files section on the right, look for MDI > launchpad and open countdown.R.
    • Click on ‘Source’ to execute it.
    • You will be asked to choose a program to run; choose interactive MDI by entering 4.
    • Wait until the script is done. The metadata, Rtools, and libraries are imported; you can now create and run your script.
  • Create/execute a module:
    • Add a new file.
    • Add your module code or any code that you want to run.
    • Execute your code either line-by-line by clicking ‘Run’ or all at once by clicking ‘Source’.
  • When you run your script, any errors will appear in the Console section. The executed lines are highlighted in blue, while errors are displayed in red. To resolve an issue, identify the problematic line in your script, make the necessary corrections, and run it again.

2.8.2.1 Portugal Data Access Space

This space is set up for direct access to the Portuguese data; this is where the launch execution for Portugal takes place. The space is reserved for the PTx country leaders. It contains the raw PTx data and the MDI infrastructure. The MDI infrastructure in this space is updated regularly but is not necessarily the most recent version found on GitHub.

The data is stored in the large file storage (folder space_mounts/NSIdata). Everyone who has access to the space has read and write permission on the data. Any changes to the data should be made with utmost caution.

All files - infrastructure and data - are shared across all users of the space. That means any modifications will be visible to all users of the space.

2.9 The complete MDI pipeline

Below is a visualization of the complete MDI pipeline:

Code
%%{init: {
  "theme": "base", 
  "themeVariables": {
    "background": "#ffffff",
    "textColor": "#000000",
    "lineColor": "#000000",
    "fontSize": "26px"
  }
}}%%

flowchart TB

%% Setting MDI

A([🚀 **Microdata Infrastructure** 🚀]) --> |to initialize it...|A1([**launchpad/countdown.R**])

A1 --> A2(set country code)
A1 --> A3(set paths to directories)
A1 --> A4(set disclosure parameters)

A2 -.-> A5[AT, DE, FI, FR, NL, PT, SI]

  style A fill:#003366,stroke:#000000,color:#ffffff  
  style A1 fill:#228B22,stroke:#ffffff,color:#ffffff
  style A2 fill:#90EE90,stroke:#ffffff,color:#000000
  style A3 fill:#90EE90,stroke:#ffffff,color:#000000   
  style A4 fill:#90EE90,stroke:#ffffff,color:#000000  
  style A5 fill:#E6FFE6,stroke:#ffffff,color:#000000


%% Options MDI

A3 ---> B1([there are four options:])
A4 ---> B1([there are four options:])
A5 ---> B1([there are four options:])

B1 --> C[(**pre_launch_checker.R**)]
B1 --> D[(**liftoff.R**)]
B1 --> E[(**interactive_mdi.R**)]
B1 --> F[(**prepare_NSI.R**)]

  style B1 fill:#F0F0F0,stroke:#FFFFFF,color:#000000

%% Pre-launch Checker

C --> C1[checking if metadata corresponds to what we have in the environment]

C1 --> C2(**Database**)
C1 --> C3(**Varnames**)
C1 --> C4(**Codebooks**)
C1 --> C5(**Classfiles**)

C2 -.-> C6[Do they all exist?]
C3 -.-> C7[Do all varnames exist?]
C3 -.-> C8[Are all varnames of the listed data type?]

  style C fill:#7A5DC7,stroke:#000000,color:#ffffff
  style C1 fill:#E6E6FA,stroke:#ffffff
  style C2 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C3 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C4 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C5 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C6 fill:#E6E6FA,stroke:#ffffff
  style C7 fill:#E6E6FA,stroke:#ffffff
  style C8 fill:#E6E6FA,stroke:#ffffff

%% Liftoff

D ---> D1(🚀 **main script that completes the launch sequence: loads essential R libraries, pulls in MDI resources and NSI metadata, and brings raw firm data plus concordance tables into the environment. This script launches the rocket** 🚀)

D1 --> D2(read R libraries, import mdi library, import NSI metadata, import concordance tables)
D1 --> D3(import raw firm data)

D1 ---> |if data is already harmonized...|D4[**execute research modules**]

D4 --> D5[M0]
D4 --> D6[CN]
D4 --> D7[EN]
D4 --> D8[FD]
D4 --> D9[MP]
D4 --> D10[TC]

D5 --> D11[🎊 **extract results** 🎊]
D6 --> D11
D7 --> D11
D8 --> D11
D9 --> D11
D10 --> D11

  style D fill:#8B0000,stroke:#000000,color:#ffffff  
  style D1 fill:#8B0000,stroke:#ffffff,color:#ffffff 
  style D2 fill:#FFD6D6,stroke:#ffffff,color:#000000  
  style D3 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D4 fill:#8B0000,stroke:#ffffff,color:#ffffff 
  style D5 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D6 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D7 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D8 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D9 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D10 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D11 fill:#8B0000,stroke:#ffffff,color:#ffffff 

%% Interactive MDI

E --> E1[set up directories, call R libraries, load Rtools, and interactively explore the MDI environment]

  style E fill:#FF8C00,stroke:#000000,color:#ffffff   
  style E1 fill:#FFF2DC,stroke:#ffffff   

%% Prepare NSI

F --> F1[first time preparation of data and metadata, updating of metadata when new file years and file types become available]

  style F fill:#654321,stroke:#000000,color:#ffffff   
  style F1 fill:#ECD9C6,stroke:#ffffff  
  
%% Harmonize raw data

D2 --> G(🔩 **harmonize raw data to MDI** 🔩)
D3 --> G

G --> G1(there are 4 methods for the harmonization process)
G1 -.-> G2(**revalue**: transform value of a unique variable)
G2 -.-> G3(**recode/reclass**: concord codebook/class variables as desired)
G3 -.-> G4(**redefine**: aggregate one or more variables)
G4 -.-> G5(**remap**: assign new name to a raw variable)

G ---> D4

style G fill:#4B4B4B,stroke:#ffffff,color:#ffffff   
style G1 fill:#E0E0E0,stroke:#ffffff   
style G2 fill:#F5F5F5,stroke:#ffffff   
style G3 fill:#F5F5F5,stroke:#ffffff  
style G4 fill:#F5F5F5,stroke:#ffffff    
style G5 fill:#F5F5F5,stroke:#ffffff


Note. This scheme shows how to initialize and run MDI. Start launchpad/countdown.R, set country, paths, and disclosure parameters. Choose one of four programs: pre_launch_checker to validate metadata and varnames. liftoff to load libraries, import metadata and raw data, harmonize using four methods (revalue, recode/reclass, redefine, remap), then execute modules and extract results after disclosure checks. interactive_mdi to explore the environment. prepare_NSI for first-time setup and metadata updates. Boxes indicate steps. Arrows indicate control and data flow.

3 Setting up the MDI

This section covers the technical details of the MDI - including preparation for implementation in a new country, the modifications required for a launch, and how to develop the infrastructure further. It provides information for the MDI team and the country leads responsible for setting up the MDI in their respective countries.

3.1 Introduction to and Setup of the MDI Infrastructure

The MDI (Microdata Infrastructure) provides a unified research environment implemented identically at all national statistical institutes (NSIs), including the mock data site. This consistent setup allows researchers to analyze standardized microdata (MD) panels, constructed from diverse national sources.

These MD panels are harmonized through detailed metadata, which ensures legal compliance, transparency, and comparability of statistical outputs across countries. For each NSI, the metadata specifies the available source files, variables, and classification lists, and maps them to the shared MD format. As a result, the datasets are syntactically identical across countries, even when the underlying data differs.

NSIs vary in their legal frameworks, technical setups, and the types of data they maintain—from registers and surveys to administrative sources. The MDI infrastructure addresses this heterogeneity by applying a common structure and metadata standard across all participating institutes.

To execute research, individual researchers write analysis modules, typically in R; together these modules form the payload. The payload is executed inside the secure MDI environment. During a launch, the rocket reads the metadata and data, harmonizes them, constructs the MD panels, and then runs the payload modules. The outputs comply with disclosure rules, enabling valid cross-country comparisons.

3.1.1 Launch Preparation by the MDI team

These are the steps the MDI team must follow in this order before each launch.

  • Lock the R package list. Freeze the package versions to ensure consistency across all NSI environments.

  • Create a dedicated GitHub branch, e.g. post_Launch_vX.X. This branch will track all changes made during the launch at the NSIs.

  • Create an error tracking file alldocs/Launch_vX.X_errors_overview.csv. This shared file is used by all country leaders to document errors and changes during the launch process.

  • Generate documentation with roxygenize. Run roxygenize(paste0(dirROCKET, "Rtools")) to generate documentation. No further changes should be made to the R tools after this step.

  • Capture the Git commit details. Run rocket/MDIprogs/get_commit_details.R and re-commit the branch to lock the exact version.

  • Deploy to NSI Teams folder. Use the appropriate scripts to copy the finalized GitHub branch to the NSI-specific Teams folders.

  • Notify NSIs to download and set up. Ask NSI system administrators to download the folder, install or update required R packages, and install the MDI package.

3.1.2 Launch Sequence Overview

This section outlines the main steps for executing a full MDI launch.

  1. Import MDI Files
    Copy the complete MDI folder from the country Teams directory into a local working directory of your choice. Ensure that both the user and R have read access to the raw data files in that location.

  2. Configure Countdown Script
    Open launchpad/countdown.R and update the required parameters to match your site-specific setup. Find details and explanations here

  3. Run pre_launch_checker.R
    Begin by running countdown.R from your working directory and selecting the pre_launch_checker.R option. This checks for inconsistencies between the metadata expected by MDI and the actual metadata at the NSI site. It generates concordance files and a report listing issues to fix. Find details and explanations here

  4. Run the Post-Harmonization Checker
    In countdown.R, set the option MDImoduleRun = FALSE and run the countdown again, selecting liftoff.R. This executes the full MDI rocket without running the analytical modules. During this phase, the Post-Harmonization Checker (PHC) is triggered to validate the harmonized data. The PHC script performs quality checks on the harmonized microdata, such as detecting duplicates, format mismatches, date inconsistencies, and structural breaks. The results of these checks are saved to two files: <CountryCode>_phc_results.txt and breaks_report.pdf (in dirTMPSAVE). These files must be reviewed and issues resolved before proceeding. Please see the detailed section on post-harmonization checks for instructions for country leaders.

  5. Full module execution
    After resolving all issues flagged by the PHC, set MDImoduleRun = TRUE and rerun liftoff.R to execute the full set of research modules. Iterations with the MDI staff may be needed for fixes and patches to the rocket and payload, until the final results are written to the dirOUTPUT directory.

  6. Export
    The files in dirOUTPUT need to be checked for disclosure. After disclosure checks are completed, the approved files from this directory can be uploaded to the MDI TEAMS cloud directory designated for the NSI staff.
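As an illustration, the site-specific parameters referenced in the steps above might be set in launchpad/countdown.R roughly as follows. This is a hedged sketch: MDImoduleRun and the dir* names appear in this manual, while the exact name of the country parameter and all values are placeholders.

```r
# Hypothetical sketch of a countdown.R parameter block; values are placeholders.
country      <- "PT"             # 2-letter country code (exact parameter name may differ)
dirINPUTDATA <- "D:/mdi/raw"     # raw NSI data files (see NSI_datafiles 'path')
dirTMPSAVE   <- "D:/mdi/tmp"     # PHC results and breaks_report.pdf are written here
dirOUTPUT    <- "D:/mdi/output"  # final module results awaiting disclosure checks
MDImoduleRun <- FALSE            # FALSE: harmonize + run PHC only; TRUE: run modules
```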

3.1.3 Additional Programs (Not Part of Launch Sequence)

The following programs are not part of the formal launch sequence but support metadata setup and interactive work:

  • prepare_NSI.R
    Used to generate and structure metadata at the NSI. Should be run before any launch steps if metadata is not yet available.

  • interactive_MDI.R
    Enables interactive work inside the MDI environment (e.g. for metadata exploration or manual testing). Not used during automated launches.

3.1.4 The structure of the MDI

These are the main directories in the MDI folder:

  • docs: Documentation of the MDI system, including the MDI Manual.
  • rocket: Code supporting and controlling MDI rocket launches, including NSI metadata and auxiliary data.
  • payload: Research modules, including metadata and NSI-specific NSI_MD concordances.
  • launchpad: NSI-specific information for controlling MDI code and rocket launches.

Files in the MDI folder

.
├── docs
├── launchpad
├── payload
├── rocket
└── site

directory launchpad (with files to launch MDI)

launchpad
├── README.md
├── countdown.R
├── interactive_MDI.R
├── liftoff.R
├── pre_launch_checker.R
├── prepare_NSI.R
└── report_file_changes.R

subdirectories of rocket (with code, and (meta)data to support MDI)

rocket
├── CompNet
├── MDIprogs
├── NSImetadata
├── Rtools
├── auxdata
└── control

subdirectories of payload (with analytical code and MD (meta)data)

payload
├── Launch_v2.0
├── Launch_v2.1
├── Launch_v2.2
├── Launch_v2.3
└── Launch_v2.mini

3.1.5 Importing to, and exporting from, the NSI site

Whenever an import or export operation is required from or to an environment, it is important to consider both the time it takes and whether the operation incurs any monetary cost.

Country Type Time Cost
AT Export 45-60 days 343.92 Euro
AT Import 2-3 days 124.00 Euro / hour
FI Export 2-3 days 0.00 Euro
FI Import 2-3 days 0.00 Euro
FR Export 1-2 days 31.80 Euro / 30 minutes
FR Import 1-2 days 0.00 Euro
NL Export 1-2 days 125.00 Euro - 266.00 Euro
NL Import 1-2 days 0.00 Euro
PT Export 2-3 days 0.00 Euro
PT Import 2-3 days 0.00 Euro
SI Export 1 day 0.00 Euro
SI Import 30 days 0.00 Euro

In general, files to be downloaded from MDI to an NSI are stored in a TEAMS directory accessible to the respective NSI team. These files are located in the download folder of a SharePoint directory, which can be synced to your local machine. For example: ../OneDrive-SharedLibraries-IWHEconomicStudiesLab/MDI Data Providers Forum - PT. Each NSI also has an MDI TEAMS ‘upload’ directory, used to upload output generated by the rocket.

3.2 Metadata

This section provides an overview of the structure of NSI and MD metadata, how to construct them and to establish the connection between them, ensuring that country-specific data sources are accurately mapped to the standardized MD panel structure.

3.2.1 Specifications for the NSI Metadata

This section summarizes the structure and content of the NSI metadata files. These files document, in both machine- and human-readable formats, the available data files, the unit of observation (i.e., the description of each row), the names and descriptions of the variables (i.e., columns) in each file, and the valid values for each variable, including their class and domain. The following paragraphs offer guidance on how to prepare country-specific metadata files accordingly.

Once created, the NSI metadata files must be uploaded to the designated TEAMS directory. After the NSI downloads the updated rocket, these metadata files will be located in the rocket/NSImetadata/*NSI*/ directory of the MDI infrastructure. The MDI program pre_launch_checker.R, which should be run whenever the MDI is updated, will identify inconsistencies and other issues in the metadata.

The main types of NSI metadata files prepared include:

  • datafiles: Lists all available NSI firm-level data files, including their names and years covered.

  • varnames: Documents the variables and their descriptions for each raw data file listed in datafiles.

  • codebook: Maps categorical variable values to their corresponding descriptions.

  • class: Describes classification variables in the datasets, such as industry or product codes.

  • classvart0_classvart1_conc: Details how classification variables evolve over time, providing concordance between versions.

  • keyID1_keyID2_conc: Maps a firm identifier (keyID1) to higher level identifier (keyID2) by year.

It is advised to construct the files in the same order as in the above list.

Together with the MDI team, the NSI prepares metadata to support the harmonization of NSI data to the MD specification. The MDI team supplies metadata, potentially specific to each launch, that describes the MD datasets and their variables. In addition, the MDI team and the NSI jointly provide concordances used to align NSI data files with the standardized MD format.

To facilitate this process, the MDI team also provides tools for creating the required metadata files. These tools can be found in the directory /rocket/MDIprogs/metadata_tools/.

In the filenames for the metadata, the acronym NSI is used. This should be substituted with the 2-letter country code for the country in question (using the ISO 3166-1 alpha-2 standard, e.g. NSI = PT). For the MDI metadata, the two letters MD are used.

3.2.1.1 List of NSI datafiles – NSI_datafiles.csv

This file contains the list of all available raw data files on a country’s environment. The file has the following columns:

NSI_datafile,NSI_dataset,yearvar,year_start,year_end,format,path,details,firm_unit,data_source,firm_sample,preprocessing 

where:

  • NSI_dataset is the ‘generic’ name of the NSI datafile.

  • NSI_datafile is the name of the file in the NSI environment.

  • yearvar gives the name of the year variable if NSI_datafile is a panel, and is empty otherwise.

  • year_start is the starting year of the data file.

  • year_end is the last year of the data file.

  • format is the file extension (csv, sas, stata, etc) of the file (i.e. also the storage format of the data).

  • path indicates path of the datafile relative to the NSI data directory (given by the parameter dirINPUTDATA in launchpad/countdown.R).

  • details contains additional notes on the file.

  • firm_unit indicates the type of firm observation unit. There can be four types of units. Below we provide a definition for each, taken from the Eurostat glossary, ordered hierarchically from smallest to largest:

    • plant: A single-location enterprise or part of one, primarily engaged in one main productive activity. Also often known as ‘establishment’. This corresponds to plantid in the MD data
    • legal_unit: Either legal persons recognized by law or natural persons conducting economic activity independently. This corresponds to firmid in the MD data
    • enterprise: An organizational unit producing goods or services, with decision-making autonomy, possibly spanning multiple activities, locations, or legal units. Hence, one enterprise might be constituted by more than one legal unit. This corresponds to entid in the MD data
    • enterprise_group: A set of legally or financially linked enterprises, controlled by a group head, forming an economic entity with shared or centralized decision-making. This corresponds to entgrp in the MD data
  • data_source: refers to the origin of the data. Three options are possible:

    • survey: If the data was collected through surveys
    • administrative_source: Information collected by public authorities from firms as part of legal or regulatory requirements, such as tax records, employment filings, or financial statements
    • mixed: If the data comes from surveys, administrative sources, or other collection methods.
  • firm_sample: Information about the population of firms present in the datafile (a short free-text description, kept as concise as possible)

  • preprocessing: Instructions on how to perform a data preprocessing operation on the raw datafile. For more details, check the dedicated box.

An example (for NSI=FI, 2018) of the metadata for the raw data files (the columns yearvar, year_end, path and details are omitted for viewing):

NSI_datafile NSI_dataset year_start format firm_unit data_source preprocessing
bd2018 bd 2018 csv legal_unit administrative_source NA
br2018 br 2018 csv legal_unit administrative_source NA
bs2018 bs 2018 csv legal_unit administrative_source NA
cis2018 cis 2018 csv legal_unit survey NA
ictec2018 ictec 2018 csv legal_unit survey NA
ifats2018 ifats 2018 csv legal_unit administrative_source NA
*Note: Only the first rows are displayed.

3.2.1.2 File-specific metadata – NSI_varnames.csv

This file contains the list of all variables in each raw datafile appearing in the column NSI_datafile of NSI_datafiles.csv. It has the following columns:

[1] NSI_datafile,NSI_varname,is_key,description,class,domain

where:

  • NSI_datafile is the name of the file in the NSI environment.

  • NSI_varname is the name (hopefully mnemonic) of the variable in the raw file.

  • is_key is a boolean stating whether variable belongs to the (possibly joint) unique keys of the datafile, e.g. firmid, or firmid,year are often the unique key(s).

  • description contains a description of the variable, if possible using Eurostat convention.

  • class is the type of value that the variable holds. The following data types can be encountered:

    • numeric: Numbers with or without decimals (e.g., 3, 4.5).
    • character: Text or string values (e.g., “apple”).
    • date: Calendar dates stored as Date objects (e.g., 2023-05-09).
    • logical / boolean: TRUE/FALSE values used in conditions and comparisons.
  • domain provides information on the values of the variable. See examples below:

    • classification: e.g. a list of industry, region, or product codes (value is a metadata filename, e.g. NSI_classname_class.csv, which provides a list of permissible values and descriptions)
    • file-specific codebook of categorical answers (value is a metadata filename, e.g. NSI_codebook.csv, containing permissible values such as ‘yes’, ‘no’, ‘maybe’, or ‘small’, ‘medium’, ‘large’)
    • For other values:
      • For monetary values, “1000” (for 1000 Euros)
      • For dates: “%m%d%Y” (R date-format for mmddyyyy). For ‘year’ variable, we use “%Y”
      • For real units, choose from: “ton” (weight, 1000kg), “m3” (volume), “GJ” (energy), “unit” (1 item).
3.2.1.2.1 Domain: Expenditures, Quantities, Dates
Measure Domain Entry Description
Expenditure 1000 … or 1 Euro; 10000000 Euro; etc.
Foreign currency 1*FXC … or 1000 etc.; Where FXC is an ISO 4217 3-letter currency code
Employment 1 1 here refers to 1 FTE; or 1000; … or 1 Emp if in persons
Numerical 1 1 here refers to 1 unit; … or 10; 100; where ‘unit’ gives unit in lowercase for the variable in the NSI data file
Date %Y-%m-%d Use the R date format that matches the values for the NSI date or year variable
Format Description Example
%a Abbreviated weekday Sun, Thu
%A Full weekday Sunday, Thursday
%b or %h Abbreviated month May, Jul
%B Full month May, July
%d Day of the month 01-31 27, 07
%j Day of the year 001-366 148, 188
%m Month 01-12 05, 07
%U Week 01-53, (start Sunday) 22, 27
%w Weekday 0-6 (Sunday= 0) 0, 4
%W Week 00-53 (start Monday) 21, 27
%x Date, locale-specific
%y Year 2-digit 00-99 (69-99 read as 19xx) 84, 05
%Y Year 4-digit 1984, 2005
%C Century 19, 20
%D Date formatted %m/%d/%y 05/27/84, 07/07/05
%u Weekday 1-7 (Monday=1) 7, 4
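The % codes above follow strptime conventions, so the behavior of a domain entry such as %Y-%m-%d can be checked in any strptime implementation. A minimal check, shown in Python purely for illustration (Python's datetime shares these % codes with R):

```python
from datetime import datetime

# The domain entry for a date variable is a strptime-style format string.
# "%Y-%m-%d" matches ISO dates such as 2018-05-27.
fmt = "%Y-%m-%d"
d = datetime.strptime("2018-05-27", fmt)
print(d.year, d.month, d.day)  # -> 2018 5 27
```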
3.2.1.2.2 Domain: Classification or Categorical (factor) variables
Variable Domain_Entry Description
Classification variable NSI_classname_class An (official) list, ie NL_nace
Categorical variable NSI_codebook Contains permissible values for categorical (factor) variables, e.g. ‘yes’, ‘no’,‘maybe’
3.2.1.2.3 Example: Netherlands (SBS, 2018): NL_varnames.csv
NSI_datafile NSI_varname is_key description class domain
sbs2018 ent_id 1 Enterprise ID (identification character
sbs2018 sbs_12110 0 Turnover numeric 1000
sbs2018 sbs_12150 0 Value added at factor cost numeric 1000
sbs2018 sbs_12170 0 Gross operating surplus numeric 1000
sbs2018 sbs_13110 0 Total purchases of goods and s numeric 1000
*Note: Only the first 5 rows are displayed.

3.2.1.3 Codebook for categorical variables – NSI_codebook.csv

This file contains the possible values of a categorical variable and the description belonging to each value. The rows give the possible values occurring in firm data for a particular NSI_datafile and NSI_varname. The name of the codebook should be given in the ‘domain’ column of NSI_varnames for the relevant categorical variable.

[1] NSI_dataset,NSI_varname,year,code,description

where:

  • NSI_dataset is the name of the generic dataset in the NSI environment.

  • NSI_varname is the name of the variable of that specific raw dataset.

  • year is the year for which codebook values hold. If empty, holds for all years of the NSI dataset.

  • code gives all the values of the categorical variable that occur for that NSI_varname in that NSI_datafile.

  • description gives the description explaining each code value.

Dealing with the year column

As mentioned, if a specific mapping holds for all available years of a specific NSI_dataset, then the year column for that mapping needs to be empty. For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112 being a categorical variable taking values ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ for all years. In this case, the NSI_codebook will contain only three rows for these three mappings, without any reference to the years. In practice:

NSI_dataset NSI_varname year code description
data_ictec var112 0 no
data_ictec var112 2 yes
data_ictec var112 999 not available

That said, if a mapping is not constant across all years of an NSI_dataset, then the year column needs to have a value for all mappings reported. In this context, there can be two cases:

  1. The codes differ by year for the same variable: This means that var112 takes values, say, ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ in 2006, while in 2007 the mapping changes to ‘0’, ‘1’, ‘9’ for ‘no’, ‘yes’, ‘not available’, respectively. Hence, we need to indicate year-specific mappings in the codebook table:
NSI_dataset NSI_varname year code description
data_ictec var112 2006 0 no
data_ictec var112 2006 2 yes
data_ictec var112 2006 999 not available
data_ictec var112 2007 0 no
data_ictec var112 2007 1 yes
data_ictec var112 2007 9 not available
  1. The NSI_varname isn’t available for all years: As an example, let var_112 be only available in 2005 and 2006 but it’s dropped or has a different name in the other years. Then, we need to include it for both 2005 and 2006, regardless of whether the codes are identical or not:
NSI_dataset NSI_varname year code description
data_ictec var112 2005 0 no
data_ictec var112 2005 2 yes
data_ictec var112 2005 999 not available
data_ictec var112 2006 0 no
data_ictec var112 2006 2 yes
data_ictec var112 2006 999 not available
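The year-fallback rule just described can be sketched as a small lookup: year-specific codebook rows take precedence, and rows with an empty year hold for all years. An illustrative sketch in Python (the MDI itself is R-based; the function and the sample rows are hypothetical):

```python
# Hedged sketch of the year-fallback rule for codebook lookups; names hypothetical.
# Rows with year=None hold for all years; year-specific rows take precedence.
codebook = [
    ("data_ictec", "var112", None, "0", "no"),
    ("data_ictec", "var112", None, "2", "yes"),
    ("data_ictec", "var112", None, "999", "not available"),
    ("data_ictec", "var113", 2007, "1", "yes"),   # a year-specific mapping
]

def decode(dataset, varname, year, code, rows=codebook):
    # Try the year-specific mapping first, then fall back to the all-years rows.
    for want_year in (year, None):
        for ds, vn, yr, cd, desc in rows:
            if (ds, vn, yr, cd) == (dataset, varname, want_year, code):
                return desc
    return None

print(decode("data_ictec", "var112", 2006, "2"))   # -> yes
print(decode("data_ictec", "var113", 2007, "1"))   # -> yes
```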

Below is an example of the values occurring for the unit of measurement in the SI PRODCOM data for 2012.

NSI_dataset year NSI_varname code description
MIKRO_INDL_razST NA ME 1000 SIT thousands of slovenian tolars
MIKRO_INDL_razST NA ME EUR euros
MIKRO_INDL_razST NA ME GJ Gigajoule - a unit of energy
MIKRO_INDL_razST NA ME MWh Megawatt-hour - a unit of energy
MIKRO_INDL_razST NA ME TJ terajoules
*Note: Only the first 5 rows are displayed.

3.2.1.4 Classification lists – NSI_classvar_class.csv

This file contains the unique list of codes per year of a specific classification variable in a country. Note that there should be a list for every classification variable in each dataset. The related table has the following columns:

[1] code,year,description

where:

  • code is the list values of the classification variable observed in the data.

  • year is the related year. If the mapping does not change across the years available for that NSI_varname, the year column is filled with NA.

  • description gives the description for each code value.

A sample of rows from the table of PRODCOM codes (in this case some randomly selected rows from the list of codes for Finland):

code year
27512630 2020
22214180 2019
19301352 2005
20142320 2021
26702490 2016
20165490 2013
24521090 2019
28133200 2010
15842150 2005
19303150 2006
13301380 2016
26518110 2021
20101032 2004
16292320 2012
26518550 2011
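The class list is what validation checks against: a raw code is valid if it occurs in the list for the matching year. A minimal sketch in Python for illustration (the (code, year) pairs are taken from the sample above; the function is hypothetical):

```python
# Sketch: check raw codes against the classification list for the matching year.
# (code, year) pairs taken from the sample rows above.
class_list = {("27512630", 2020), ("22214180", 2019), ("20142320", 2021)}

def code_is_valid(code, year):
    return (code, year) in class_list

print(code_is_valid("27512630", 2020))  # -> True
print(code_is_valid("27512630", 2019))  # -> False
```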

3.2.1.6 Key ID concordance – NSI_keyID1_keyID2_conc.csv

This file contains a concordance table between two firm unit codes by year.

[1] "keyID1" "keyID2" "year"  

where:

  • keyID1 is the first firm unit code
  • keyID2 is the second firm unit code
  • year is the reference year

For example, a concordance between units firmid and entgrp could look like:

firmid entgrp year
EoYncPX1QK ZDfQhvv 2019
Zn1yzAeYA4 oOUz7Ep 2021
B4iCIhzgPi oOUz7Ep 2013
9sJnqQM0lo ZDfQhvv 2002
FPSgiOjwA7 MEVq1XW 2007
0g0AFLyHCe MEVq1XW 2015
hlmrG4AyLu oOUz7Ep 2011
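Such a concordance is applied by joining on the (keyID1, year) pair to attach the higher-level unit to each firm-year observation. A small sketch in Python for illustration (the identifier pairs come from the sample table above; the value column is invented for the example):

```python
# Sketch: using the keyID1_keyID2 concordance to attach the higher-level unit.
# (firmid, year) -> entgrp pairs are taken from the sample concordance above.
conc = {
    ("EoYncPX1QK", 2019): "ZDfQhvv",
    ("Zn1yzAeYA4", 2021): "oOUz7Ep",
    ("B4iCIhzgPi", 2013): "oOUz7Ep",
}

# Hypothetical firm-year observations: (firmid, year, some value)
firm_rows = [("EoYncPX1QK", 2019, 12.5), ("B4iCIhzgPi", 2013, 4.0)]

# Attach entgrp to each firm-year observation via the concordance.
linked = [(conc[(fid, yr)], yr, val) for fid, yr, val in firm_rows]
print(linked)  # -> [('ZDfQhvv', 2019, 12.5), ('oOUz7Ep', 2013, 4.0)]
```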

3.2.2 Specifications for the MD Metadata

  • Work in this section is a collaboration between NSI and MDI staff.

  • In an iterative process, using NSI metadata for each country, and taking into account research needs of MD users, a specification is made of the MD panels and variables.

    1. MD_datafiles.csv describes the harmonized panel datasets generated in each launch.
    2. MD_varnames.csv describes the variables per dataset, with their description, class, and domain.
    3. MD classifications: versions of official classifications, such as EU NaceR2 activities or NUTS2 regions
    4. MD codebooks: valid values for categorical variables

3.2.2.1 List of micro-dataset (MD) panels – MD_datafiles.csv

This file contains the list of all harmonized firm-level micro data (MD) panels generated by the MDI code. Researchers can use these panels in code modules during an MDI launch, either individually or linked at the firm-year level. The file has the following columns:

[1] MD_dataset,description,details

where:

  • MD_dataset is the name of the MD panel (R data.table) at runtime of the launch.

  • description contains a description of the panel and its underlying source data.

  • details contains additional notes on the file.

Below is the list of currently available MD panels:

MD_dataset description details
BR Business Register see: https://ec.europa.eu/eurostat/
BS Balance Sheet Balance Sheet on Enterprise groups
BD Business Dynamics
SBS Structural Business Statistics
CIS Community Innovation Survey (only available in even numbered years)
ICTEC ICT Usage in Enterprises Survey https://ec.europa.eu/eurostat/cache
ITGS International Trade in Goods
ITS International Trade in Services
OFATS Outgoing Foreign Affiliates Statistics
IFATS Incoming Foreign Affiliates Statistics
ENER Energy Use at Firms in progress: harmonization across countries
PRODCOM Production Communautaire by firm and product https://ec.europa.eu/eurostat/web/p

3.2.2.2 Micro-dataset (MD) variables – MD_varnames.csv

This file contains the list of all variables available in the MD firm-level panel datasets that have been generated by the MDI code using the NSI datafiles, NSI metadata, and the NSI-MD concordances. The file has the following columns:

[1] MD_varname,MD_dataset,is_key,description,class,domain

where:

  • MD_dataset is the name of the MD firm-level panel dataset, ie BR, SBS, etc.

  • MD_varname is the name of the variable in the virtual firm-level dataset.

  • is_key is a boolean stating whether variable belongs to the (possibly joint) unique keys of the dataset, e.g. firmid, or firmid,year are often the unique key(s).

Note

Because an MD dataset can have a different unique identifier (plant, legal unit, enterprise, or enterprise group; see the firm_unit column in the NSI_datafiles section), depending on the raw data it is based on, is_key takes the value 1 for each of the four possible unit types.

That said, in the harmonized MD dataset researchers will work on, only one of the four units will be available, allowing module writers to aggregate or disaggregate across units (when possible) using the tool mdi_key_id_switch() from the mdi package. The tool applies different aggregation or disaggregation methods depending on the MD varname, as indicated in the file MD_aggr_disaggr_methods.csv.

  • description contains a description of the variable, if possible using Eurostat convention.

  • class is the type of value that the variable holds (e.g. integer, character, boolean etc.).

  • domain

    • classification: e.g. a list of industry, region, or product codes (value is a metadata filename, e.g. MD_filename_varname_list.csv, which provides a list of permissible values and descriptions)
    • MD-specific codebook of categorical answers (value is a metadata filename, e.g. MD_codebookname_codes.csv, containing permissible values such as ‘yes’, ‘no’, ‘maybe’)
    • For other values:
      • For monetary values, “1000” (for 1000 Euros)
      • For dates: “%m%d%Y” (R date-format for mmddyyyy). For ‘year’ variable, we use “%Y”
      • For real units, choose from: “ton” (weight, 1000kg), “m3” (volume), “GJ” (energy), “unit” (1 item).
3.2.2.2.1 Domain: Expenditures, Quantities, Dates
Measure Domain_Entry Description
Expenditure 1000 Euro
Employment 1 FTE … or 1 Emp if in persons
Numerical 1 ‘unit’ ‘Unit’ gives unit used in NSI data file, or is left blank if just a count.
Date %Y For now, we use an R format for the 4-digit year as the date variable
Weight 1 kg
Volume 1 m3
Area 1 m2
Length 1 m
Energy 1 GJ GigaJoule
3.2.2.2.2 Domain: Classification or Categorical (factor) variables
Variable Domain_Entry Description
Classification variable NSI_classname_class An (official) list, ie NL_nace
Categorical variable NSI_codebook Contains permissible values for categorical (factor) variables, e.g. ‘yes’, ‘no’,‘maybe’

Below is a sample of 5 rows of the file MD_varnames with harmonized MD variables

MD_dataset MD_varname description domain
CIS inpssu Introduced onto the marke
ICTEC RBTS Use service robots
ICTEC CRMSTR share of information with
BD merger Enterprise merged with an
CIS year Year %Y

3.2.2.3 Classification lists – MD_classvar_class.csv

This file contains the unique list of codes of a specific classification variable from the MD panels. Note that there should be a list for every classification variable in each MD dataset. The related table has the following columns:

[1] code,description

where:

  • code is the list values of the classification variable observed in the data.

  • description gives the description for each code value.

An example of the table for NACE codes (in this case the official EU NaceR2 classification):

code description
C17.2.2 ____Manufacture of household and sanitary goods and of toilet requisites
C17.2.3 ____Manufacture of paper stationery
C17.2.4 ____Manufacture of wallpaper
C17.2.9 ____Manufacture of other articles of paper and paperboard
C18 __Printing and reproduction of recorded media
C18.1 ___Printing and service activities related to printing
*Note: Only the first rows are displayed.

3.2.2.4 Hierarchy files for classifications – MD_classvar_hier.csv

This file contains a series of columns that refer to different nodes of the classification variable in question. With this file, the user can easily aggregate or disaggregate the data based on the different nodes of the classification variable.

The columns of the file are labelled h_X, where X is a number from 0 to N, with h_0 the most detailed node and h_N the most aggregated node of the classification variable.

An example of a hierarchy table for NACE codes (in this case the official EU NaceR2 classification):

h_0 h_1 h_2 h_3 h_4
6491 649 64 K TOT
2222 222 22 C TOT
9810 981 98 T TOT
2331 233 23 C TOT
4743 474 47 G TOT
*Note: Only 5 rows are displayed.
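With the h_X columns, aggregating values up the hierarchy is a simple group-by on the target column. An illustrative sketch in Python (the h_0 to h_3 pairs come from the sample table above; the firm-level values are invented for the example):

```python
# Sketch: aggregating values up the NACE hierarchy using the h_X columns.
# h_0 -> h_3 pairs are taken from the sample hierarchy table above.
h0_to_h3 = {"6491": "K", "2222": "C", "2331": "C", "4743": "G"}
values_by_h0 = {"6491": 10.0, "2222": 5.0, "2331": 7.0}   # hypothetical values

totals = {}
for code, v in values_by_h0.items():
    section = h0_to_h3[code]                     # climb from h_0 to h_3
    totals[section] = totals.get(section, 0.0) + v

print(totals)  # -> {'K': 10.0, 'C': 12.0}
```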

3.2.2.5 Codebook for categorical variables – MD_codebook.csv

This file contains the possible values of a categorical variable and the description belonging to each value. Note that a particular codebook is sometimes ‘re-used’ for multiple variables. The name of the codebook should be given in the ‘domain’ column of the metadata for the file containing the categorical variable.

[1] MD_dataset,MD_varname,code,description

where:

  • MD_dataset is the name of the MD firm-level panel dataset, ie BR, SBS, etc.

  • MD_varname is the name of the variable of that specific MD dataset.

  • code gives the valid values of the categorical variable.

  • description gives the description for each code value.

Below is an example of the values given for some variables in the MD BD and BR datasets:

MD_dataset MD_varname code description
BD status 1 born in reference year
BD status 2 active entire reference year
BD status 3 dead in reference year
BD status 4 born and dead in reference year
BR demo 0 No demographic relation in ref. year
BR demo 1 Receiving employment from other enterprise in ref. year
*Note: Only the first rows are displayed.

3.2.2.6 Key ID overview – MD_idInfo.csv

As we need to coordinate our data work across multiple countries, there are differences in the key identifiers of the different MD datasets. The table below illustrates the situation for the countries to which we currently have access.

MD_dataset AT DE FI FR NL PTx PT SI GB
BR firmid firmid firmid firmid entid entid firmid plantid NA
BD firmid firmid entid firmid NA
BS firmid firmid entgrp entid firmid firmid NA
CIS firmid firmid firmid entid firmid NA
ENER plantid firmid plantid entid firmid plantid NA
ICTEC firmid firmid firmid entid firmid entid NA
IFATS firmid firmid entid firmid NA
ITGS firmid firmid firmid firmid entid entid firmid firmid NA
ITS firmid NA
OFATS firmid firmid firmid NA
PRODCOM plantid firmid firmid firmid entid entid firmid plantid NA
SBS firmid firmid firmid firmid entid entid firmid firmid NA

3.2.2.7 MD_aggr_disaggr_methods.csv

This table contains instructions on how a specific MD_varname is aggregated or disaggregated to a higher or lower firm unit. It is used only by the mdi_keyID_switch.R file, if a module writer wants to perform such an operation on a given harmonized MD dataset.

[1] MD_dataset,MD_varname,NSI_dataset,NSI_varname,method,detail,year

  • MD_dataset is the dataset name that the variable belongs to (e.g. SBS, BS, PRODCOM, BR). This determines the source of the variable within the integrated microdata framework.

  • MD_varname is the standardized variable identifier, harmonized across datasets (e.g. emp, rev, pay, assets). Used to link equivalent variables across datasets.

  • class is the variable type (numeric, categorical, boolean, date). It defines what operations are logically and statistically valid for the variable.

  • aggregation_method is the rule for aggregating data from a lower level to a higher level (e.g. plant \(\rightarrow\) firm, firm \(\rightarrow\) group). Specifies how observations are collapsed across identifiers during aggregation.

  • disaggregation_method is the rule for splitting or allocating data from a higher level to a lower level (e.g. firm \(\rightarrow\) plant). Indicates which weighting logic or fallback hierarchy is used to distribute values.


Aggregation Methods

  • sum
    Adds up all values in the group.
    Used for additive variables such as employment, turnover, pay, or total assets.

  • mean
    Calculates the simple arithmetic mean.
    Used for ratio or intensity variables (e.g. productivity, profitability ratios).

  • weighted_avg:<var1>|<var2>|...
    Computes a weighted average using one or more candidate weighting variables.
    The first available candidate is used.
    Example: weighted_avg:emp|rev \(\rightarrow\) weights by emp if available, otherwise by rev.

  • mode
    Returns the most frequent category (the statistical mode).
    Used for qualitative variables like ownership type or legal form.

  • weighted_mode:<var1>|<var2>|...
    Returns the category that maximizes the weighted frequency count.
    Example: weighted_mode:rev|emp gives greater weight to categories from larger firms.

  • any
    Logical aggregation returning TRUE if any record in the group is TRUE.
    Used for indicators such as export participation.

  • all
    Logical aggregation returning TRUE only if all records in the group are TRUE.
    Useful for group-level flags (e.g. all plants meet environmental certification).

  • min
    Returns the smallest value or earliest date in the group.
    Useful for start dates or minimum rates.

  • max
    Returns the largest value or latest date in the group.
    Useful for end dates or maximum thresholds.


Disaggregation Methods

  • equal
    Splits the higher-level total equally across all lower-level entities.
    Example: 100 employees across 4 plants \(\rightarrow\) each gets 25.

  • replicate
    Copies the same value across all sub-entities.
    Used for categorical variables like region, legal form, or activity code.

  • weighted_alloc:<dataset.var1>|<dataset.var2>|...|equal
    Allocates a higher-level value proportionally using variables from other datasets that exist at the disaggregated level.
    The listed candidates are checked in order, and the first available is used.
    Example:
    weighted_alloc:PRODCOM.rev|ITGS.ntrade|SBS.emp|equal
    \(\rightarrow\) uses product-level revenue, if unavailable uses trade value, then employment, then equal split.

  • proportional_alloc (optional)
    Variant of weighted_alloc where weights are normalized within each group.
    Usually equivalent to weighted_alloc in implementation.

This design ensures that numerical variables preserve total consistency, while categorical and boolean fields retain logical coherence during aggregation and disaggregation.
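As an illustration of how such method strings can be interpreted, the sketch below parses a weighted_avg:&lt;var1&gt;|&lt;var2&gt; specification and applies the fallback logic. This is a minimal, hypothetical sketch in base R, not the actual mdi_keyID_switch.R implementation; the function name apply_weighted_avg and the plants data are invented for the example.

```r
# Hypothetical sketch of the weighted_avg:<var1>|<var2> fallback logic;
# NOT the actual mdi_keyID_switch.R implementation.
apply_weighted_avg <- function(dt, value, spec) {
  # spec looks like "weighted_avg:emp|rev"
  candidates <- strsplit(sub("^weighted_avg:", "", spec), "|", fixed = TRUE)[[1]]
  # the first candidate weight that exists and has non-NA values is used
  w <- NULL
  for (cand in candidates) {
    if (cand %in% names(dt) && any(!is.na(dt[[cand]]))) { w <- dt[[cand]]; break }
  }
  if (is.null(w)) return(mean(dt[[value]], na.rm = TRUE))  # fallback: simple mean
  weighted.mean(dt[[value]], w, na.rm = TRUE)
}

plants <- data.frame(prod = c(2, 4), emp = c(10, 30), rev = c(100, 200))
apply_weighted_avg(plants, "prod", "weighted_avg:emp|rev")
# (2*10 + 4*30) / 40 = 3.5
```

If the emp column were absent or entirely NA, the same call would fall through to rev as the weight, mirroring the candidate-order rule described above.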

Below is a sample of five rows from the table:

MD_varname MD_dataset aggregation_method disaggregation_method
rdemp SBS sum weighted_alloc:BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal
start_nace BR mode replicate
distr_heat_noenerg ENER sum weighted_alloc:SBS.nv|SBS.emp|BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal
inpdgd CIS any replicate
fte SBS sum weighted_alloc:BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal
*Note: Only 5 rows are displayed.

3.2.3 Specifications for metadata needed for the NSI to MD harmonization

  • Harmonization of MD panels entails harmonization of units of observation, variable definitions, and variable values.

  • The key to harmonization is NSI metadata, MD metadata, and NSI to MDI concordances.

  • The MD standard metadata is developed ‘iteratively’ and can evolve as countries join and as new MDI research users and MDI launches bring different data requirements.

    • The MD metadata and NSI to MDI concordances allow live updates of the MDI data documentation.
  • Mapping units of ‘firms’, enterprises, legal units requires knowledge of NSI source data: registers, (weighted) sampling, sample designs.

  • Harmonizing variable definitions and nomenclature is done through renaming, revaluing or combining NSI variables.

    • In the *NSI*\_MD\_conc.csv file, information is available to show how an MD variable (from a particular MD dataset) is generated from NSI variables, through the harmonization operations remap, revalue, or redefine.
  • Harmonizing values of classification variables is done by reclassifying values over time to MD standard.

    • A concordance for each NSI classification version to the MDI standard is needed. Each observed value of the classification code in rawdata needs to be mapped to the MD classification, otherwise the raw data observations are lost. This is done using the concordance file *NSI*_*classname*_MD_*classname*_classconc.csv.
  • Harmonizing categorical variables is done by recoding between conforming values from codebooks.

    • To harmonize data values for categorical variables, a concordance file *NSI*_MD_codeconc.csv is used.
  • To concord other data values (currency units, date values), R functions are used to revalue.

    • E.g. if the domain of the variable is 1000 in the NSI data and 1 in the MD data, the NSI value is multiplied by 1000. If the NSI value is an R date value, say %d%m%Y, an R date function is used to convert it to the required R date value.
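Concretely, the two revalue conversions just described reduce to one-line R expressions (illustration only, with hypothetical values):

```r
# Unit conversion: NSI domain in thousands, MD domain in units
x <- c(1.2, 0.5)   # hypothetical values in thousands
x * 1000           # detail: x*1000 -> 1200 500

# Date conversion: NSI value stored as %d%m%Y, MD requires the year
format(as.Date("31122005", "%d%m%Y"), "%Y")   # -> "2005"
```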

Storing an MD_varname in an MD panel

Only remap and redefine rows store columns in the final MD panel. Hence, for an MD variable to be present in the output data, one of these two methods must be used.

The reason is that revalue, recode and reclass only change the content of the NSI_varname, since the NSI_varname on which the operation has been applied could be used in multiple mappings, be it a remap or a redefine, within the same concordance year.

Therefore, if you would like to store an MD_varname after a revalue, recode or reclass operation, make sure you add a row for the same NSI_varname-MD_varname mapping with method=remap.

3.2.3.1 Concordance file – NSI_MD_conc.csv

This file lists all the variables in a particular MD panel, with information on how to map the NSI variables from one or more raw datafiles (often one per year) to the MD variable. The related table has the following column names:

[1] MD_dataset,MD_varname,NSI_dataset,NSI_varname,method,detail,year

where:

  • NSI_dataset is the generic name of the data that, together with year, specifies the NSI datafile hosting the variable to be used in concording. If year is empty, the concordance does not change over the years.

  • year is the year for which the concordance holds. If the mapping involves an NSI_datafile which is a panel, the column needs to be filled just with the first year available. If the NSI_datafile is a cross-section, the column needs to be filled with the year it is referenced to (in other words, there has to be one set of mapping rows per NSI_datafile).

  • NSI_varname is the name of the variable in the NSI datafile.

  • MD_dataset is the name of the MD firm-level panel dataset, i.e. BR, SBS, etc.

  • MD_varname is the name of the variable in the MD data source to be generated. Make sure the year variable is not included in the concordance table, since it is harmonized separately by the infrastructure.

  • method is the method used to harmonize the data. The value of this categorical variable determines how the harmonized variable MD_varname is generated.

    • revalue The values of the variable are changed using an R function and parameters in the column detail, and possibly the class and domain variables from the relevant _varnames files.
    • recode The values of the variable are changed using a codebook concordance, whose name is given in the detail column, e.g. ‘NSI_filename_MD_dataset_codeconc.csv’. Only values that need to be changed require a row in the _codeconc.
    • reclass The values of the variable are changed using a classification concordance, whose name is given in the detail column, ‘NSI_classname_MD_classname_classconc.csv’. This is used to reclassify e.g. industry or region classifications.
    • remap The name of the variable is changed, in a one-to-one mapping from NSI_varname to MD_varname.
    • redefine The MD variable is generated as a linear combination of the NSI variables. The column detail specifies the linear combination (i.e. ‘+’ or ‘-’) in the many-to-one NSI_varname to MD_varname mapping.
  • detail contains the function for revalue, the concordance filename for recode (codebook) and reclass (classification), and the linear operations for redefine. For revalue, any valid operation on the NSI_varname (referred to as x) is allowed. If the domain of the variable is 1000 in the NSI data and 1 in the MD data, the NSI value is multiplied by 1000, so detail = x*1000. If the NSI value is an R date value, say %d%m%Y, an R date function is used to convert it to the required R date value: format(as.Date(x, "%d%m%Y"), "%Y")

  • NSI_datafile is the name of the raw dataset from which the NSI_varname is taken

  • year is the reference year for that specific row, which will be used to construct the MD cross-section for that year

Storing the year MD_varname in the data

The year variable for each MD_dataset is automatically mapped to the harmonized data based on the metadata and the year value assigned for the corresponding concordance table rows. Hence, please do not add any row in the concordance table where MD_varname = 'year'.

3.2.3.1.1 Data preprocessing

Given that some raw datafiles require specific preprocessing, in very special cases some NSI_varnames might end up being different from those appearing in the file NSI_varnames. Hence, if you would like to concord variables from a dataset for which preprocessing is needed – which you can verify by looking at the preprocessing column of that datafile in NSI_datafiles – please keep this in mind. For more information, check the box on datafile preprocessing or get in touch with the MDI team.

3.2.3.1.2 Examples by harmonization method

As noted above, a revalue row in the concordance table simply transforms the content of the raw data’s column, without changing the column name to the desired MD_varname. For instance, say you want to remove all dots from a string in raw variable var1 from NSI_dataset data_firm for year 2010. To do that, we add a row to the concordance table which looks as follows:

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
data_firm 2010 var1 BR nace revalue gsub("\\.", "", x)

In practical terms, this operation will transform the raw datafile from

var1
10.40
20.59
01.45
32.10

to

var1
1040
2059
0145
3210
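The same transformation can be reproduced in plain R; note that the dot must be escaped in the regular expression:

```r
var1 <- c("10.40", "20.59", "01.45", "32.10")
gsub("\\.", "", var1)
# -> "1040" "2059" "0145" "3210"
```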

A reclass/recode row in the concordance table maps the values of a categorical variable (be it a class or codebook variable) to some specified objective values, as indicated in the corresponding class/codeconc table. For instance, say that you want to change the mapping of categorical variable var2 from NSI_dataset survey_firm for year 2012. The raw variable can take values 1, 2 and 9, which link to ‘yes’, ‘no’, ‘not available’. To do that, we assign recode (given that this is a codebook variable; we would indicate reclass in case of a class variable) in the method column, as follows

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
survey_firm 2012 var2 CIS inpdsv recode NSI_MD_codeconc

The harmonization tool will open the codeconc file and transform the values as shown below

1 → 0

2 → 1

9 → 9

In practical terms, this operation will transform the raw datafile from

var2
1
9
2
1

to

var2
0
9
1
0
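One minimal way to express such a codeconc mapping in R is a named lookup vector; this is an illustration only, not the harmonization tool's actual code:

```r
# codeconc mapping: left -> right (1 -> 0, 2 -> 1, 9 -> 9)
codeconc <- c("1" = "0", "2" = "1", "9" = "9")
var2 <- c("1", "9", "2", "1")
unname(codeconc[var2])
# -> "0" "9" "1" "0"
```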

A redefine row in the concordance table aggregates two or more NSI_varnames to create an objective MD_varname. As mentioned, the aggregation function is not restricted to a specific form: it can be a sum or subtraction of all non-NA values of the raw variables (detail = + or -) or a custom function (detail = fn('content of the function in R syntax')).

For example, if we want to sum the values of var3, var4 and var5 from datafile 2005_bs to create MD_varname nv, we add the following rows to the table

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
2005_bs 2005 var3 BS nv redefine +
2005_bs 2005 var4 BS nv redefine +
2005_bs 2005 var5 BS nv redefine +

This operation will transform the raw datafile from

var3 var4 var5
12 4 15
NA 2 16
9 32 19
8 14 NA

to the following (the original raw variables are also removed):

nv
31
18
60
22
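The additive redefine above behaves like a row-wise sum over the non-NA raw values, as this base-R sketch shows:

```r
raw <- data.frame(var3 = c(12, NA, 9, 8),
                  var4 = c(4, 2, 32, 14),
                  var5 = c(15, 16, 19, NA))
rowSums(raw, na.rm = TRUE)   # row totals: 31 18 60 22
```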

On the other hand, if we want to sum var3 and var4 and divide the result by var5, we need to build a custom function, as in the concordance rows below:

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
2005_bs 2005 var3 BS nv redefine fn((var3+var4)/var5)
2005_bs 2005 var4 BS nv redefine fn((var3+var4)/var5)
2005_bs 2005 var5 BS nv redefine fn((var3+var4)/var5)

This operation will transform the raw datafile from

var3 var4 var5
12 4 15
NA 2 16
9 32 19
8 14 NA

to the following (the original raw variables are also removed):

nv
1.067
NA
2.158
NA
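In contrast to the additive case, the custom function evaluates the expression as written, so NA values propagate rather than being dropped:

```r
raw <- data.frame(var3 = c(12, NA, 9, 8),
                  var4 = c(4, 2, 32, 14),
                  var5 = c(15, 16, 19, NA))
round(with(raw, (var3 + var4) / var5), 3)
# -> 1.067 NA 2.158 NA
```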

A remap row in the concordance table assigns the name of a given MD_varname to an NSI_varname without changing the values of the variable itself. This method is typically used to store variables in the objective MD panel either unchanged or after they were subject to a revalue or recode/reclass operation.

For example, if we want to store var6 from datafile ener_2001 as MD_varname firmid, we add the following row to the table:

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
ener_2001 2001 var6 ENER firmid remap

This operation will transform the raw datafile from

var6
nwejn
aios2
cjnje
29hbd

to

firmid
nwejn
aios2
cjnje
29hbd
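In R terms, a remap is just a column rename; a base-R sketch:

```r
df <- data.frame(var6 = c("nwejn", "aios2", "cjnje", "29hbd"))
names(df)[names(df) == "var6"] <- "firmid"
names(df)
# -> "firmid"
```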

An example of the table for a few variables needed for the Slovenian harmonized MD BR for year 2007 (column year is omitted):

MD_dataset MD_varname NSI_dataset NSI_varname method detail
BR firmid MIKRO_PRS_razST MS10_razST remap
BR entgrp MIKRO_PRS_razST MS10_IZP_MS7_razST remap
BR birthyr MIKRO_PRS_razST Datum_prv_vnosa revalue as.Date(as.character(x), "%d.%m.%Y")
BR exityr MIKRO_PRS_razST Datum_izbrisa revalue as.Date(as.character(x), "%d.%m.%Y")
BR nace MIKRO_PRS_razST Skd remap
BR soe MIKRO_PRS_razST Vrsta_lastnine recode SI_MD_codeconc
BR birthyr MIKRO_PRS_razST Datum_prv_vnosa remap
BR exityr MIKRO_PRS_razST Datum_izbrisa remap
BR soe MIKRO_PRS_razST Vrsta_lastnine remap
BR nace MIKRO_PRS_razST Skd revalue sub("^(\\d{2})\\.(\\d{2})\\d$", "\\1\\2", x)

3.2.3.2 NSI_MD_codeconc.csv

[1] NSI_dataset,year,NSI_varname,MD_dataset,MD_varname,left,right

where:

  • NSI_dataset is the generic name of the data that, together with year, specifies the NSI datafile hosting the variable to be used in concording. If year is empty, the concordance does not change over the years.

  • year is the year for which the concordance holds. If empty, the same concordance rows are used for all NSI datafiles associated with the generic NSI_dataset.

  • NSI_varname is the name of the variable of the specific NSI datafile associated with dataset and year.

  • MD_varname is the name of the corresponding MD variable.

  • left gives the valid values of the categorical variable in the raw NSI dataset.

  • right gives the corresponding MDI dataset values to map.

Note

If the mapping of a categorical variable already corresponds to that of the objective MD_varname, then there’s no need to add the related row in the codeconc. For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112, a categorical variable taking the value ‘0’ to mean ‘no’. Let the variable be mapped to MD_varname IACC from MD_dataset ICTEC. Given that, as indicated in the MD metadata, code ‘0’ is linked to ‘no’ for this MD_varname, we don’t need to add any row for this specific mapping.

However, if the other mappings don’t correspond, the rows in the codeconc file need to be present for those!

Dealing with the year column

As mentioned, if a specific mapping holds for all years available of a specific NSI_dataset, then the year column for that mapping needs to be empty.

For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112, a categorical variable taking values ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ for all years. This variable will be harmonized to IACC of the MD_dataset ICTEC. In this case, the NSI_MD_codeconc will contain the rows for these mappings without any reference to the years. In practice:

NSI_dataset NSI_varname year MD_dataset MD_varname left right
data_ictec var112 ICTEC IACC 2 1
data_ictec var112 ICTEC IACC 999 NA

Note that the mapping for ‘0’ - ‘no’ is missing given that it already corresponds to the objective MD mapping.

That said, if a mapping is not constant across all years of an NSI_dataset, then the year column needs to have a value for all mappings reported. In this context, there can be two cases:

  1. The codes differ by year for the same variable: var112 takes values, say, ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ in 2006, and in 2007 the mapping changes to ‘1’, ‘2’, ‘9’ for ‘no’, ‘yes’, ‘not available’, respectively. Hence, we need to indicate year-specific mappings in the codebook table:
NSI_dataset NSI_varname year MD_dataset MD_varname left right
data_ictec var112 2006 ICTEC IACC 2 1
data_ictec var112 2006 ICTEC IACC 999 NA
data_ictec var112 2007 ICTEC IACC 1 0
data_ictec var112 2007 ICTEC IACC 2 1
data_ictec var112 2007 ICTEC IACC 999 NA
  2. The NSI_varname isn’t available for all years: As an example, let var112 be available only in 2005 and 2006, while it is dropped or has a different name in the other years. Then, we need to include it for both 2005 and 2006, regardless of whether the codes are identical:
NSI_dataset NSI_varname year MD_dataset MD_varname left right
data_ictec var112 2005 ICTEC IACC 2 1
data_ictec var112 2005 ICTEC IACC 999 NA
data_ictec var112 2006 ICTEC IACC 2 1
data_ictec var112 2006 ICTEC IACC 999 NA
Missing codebook entry in MD_codebook

As the right column of an NSI_MD_codeconc file needs to have entries that are present in the MD_codebook, there can be cases in which no corresponding value can be found between a categorical value in a country’s dataset and the MD_codebook. For example, say that a very specific value for variable unit of MD dataset PRODCOM is available in the data of a country and no corresponding value can be found in the MD codebook. In that case, please reach out to the MDI team, as we might consider adding that value to the MD_codebook.

Below is an example of the codebook concordance table for Portugal:

NSI_dataset NSI_varname year MD_dataset MD_varname left right
ifats imputeifats NA IFATS imputed 1 0
ifats imputeifats NA IFATS imputed 2 1
itgs exim NA ITGS exim 1 0
itgs exim NA ITGS exim 2 1
itgs imputeitgs NA ITGS imputed 1 0
*Note: Only 5 rows are displayed.
3.2.3.2.1 NSI_classname_MD_classname_classconc.csv
[1] year,left,right

where:

  • left is the code in the current NSI classification code list for variable NSI\_classname.

  • right is the corresponding code in the MD classification code list the user wants to concord to.

  • year is the year for which the concordance holds.

Below is a sample of the concordance from the raw data’s common nomenclature code list (left) to the harmonized one (right).

year left right
2005 61019090 61019090D
2004 72249014 72249014
2005 09104090 09104090D
2004 85407200 85407200D
2005 84145910 84145910D
*Note: Only 5 rows are displayed.

3.2.4 Data documentation and MDI implementation process in phases

Constructing the necessary metadata for the raw data and the concordance tables needed to produce the harmonized MD datasets is a particularly long and tedious process.

A necessary requirement for installing the MDI in a country’s remote environment is sufficiently large RAM on the server. Naturally, the amount of RAM needed depends on the size of the data. An indicative measure is the ratio of RAM to the number of observations in the BR, which should be approximately 2 or larger.

To make the process more manageable and to give it more structure, we developed the following timeline, divided into eight phases:

Metadata/concordance table construction phases and deliverables
Phase Completed files
I
  • raw files cleanup (paths, list of files needed, etc.)

  • NSI_datafiles

  • firm unit analysis

  • unique keys

  • detailed information on disclosure rules of the NSI

II
  • Phase I

  • NSI_varnames

III
  • Phase II

  • NSI_codebook (just for BR, BS (if available), SBS)

  • NSI_class (just for BR, BS (if available), SBS)

IV
  • Phase III

  • skeleton NSI_MD_conc (just for BR, BS (if available), SBS)

V
  • Phase IV
  • import script that harmonizes variables
  • harmonize BR, BS (if available), SBS
  • import
    • MDI CN module and execute it on BR, BS (if available) and SBS in R
    • Stata packages
    • questionnaire for the CompNet variables
    • country file
    • the CompNet Stata files and run them on Stata for a subset of indicators
  • extract the tables
VI
  • Phase V

  • NSI_codebook (for the remaining datasets)

  • NSI_class (for the remaining datasets)

  • NSI_MD_codeconc for non-surveys

  • NSI_MD_classconc for non-surveys

VII
  • Phase VI

  • NSI_MD_codeconc for surveys (CIS, ICTEC, …), sequentially (from easier to more complex)

  • NSI_MD_classconc for surveys (CIS, ICTEC, …), sequentially (from easier to more complex)

  • NSI_MD_conc (excluding ITGS and PRODCOM)

VIII
  • Phase VII

  • NSI_MD_conc for ITGS and PRODCOM

  • timeconc

  • firm ID conc

  • various cleaning/leftovers

Each phase refers to the construction of a specific file, as described in the above sections. There are a few elements that haven’t been explicitly mentioned yet. A brief explanation is provided below:

  • Raw files cleanup: making sure the raw files directory is tidy and usable

  • Firm unit analysis: most granular firm identifier (plant, legal unit, enterprise, enterprise group) of each raw file (see the NSI_datafiles section to check the list of possible units)

  • Unique keys: the uniquely identifying columns of each raw file (see the NSI_varnames section)

  • Disclosure rules: detailed description of the disclosure routines in place in the NSI

  • import script that harmonizes variables: after Phase IV, given that the NSI_MD_conc file mappings for BR, BS (if available) and SBS are ready, those files can be harmonized. To do so, we don’t import the whole infrastructure (yet); instead, we provide you with a script that reads your metadata files and the relevant raw data and produces the MD panels.

  • upload MDI CN module, execute it on BR, BS and SBS, and import the CompNet-related files (Phase V): after harmonizing BR, BS (if available) and SBS, we ask you to import a few more files.

    • CN module: This script produces some files under a specific directory

    • questionnaire: The Questionnaire is an Excel file that contains:

      1. paths
      2. variable names
      3. confidentiality routine settings
      These fields need to be filled in; in particular, the variable names must reflect the country-specific variable mapping.
    • country file: A Stata .dta file at country-year-industry2d-sizeclass level, which contains

      1. population firm numbers from Eurostat
      2. industry-level deflators from Eurostat/EU KLEMS/AMECO
      3. some additional measures from public sources (e.g. 10year government bond yields from Eurostat)
      4. one predefined measure from us (i.e. not from public sources; if this is an issue we could leave it out)
    • Stata files for CompNet: The Stata (.do) files take as input the output of the CN module, the questionnaire and the country file, and produce a limited first version of some CompNet indicators

Important

Executing the CompNet .do files requires that

  1. Stata can be used in the same remote environment as the MDI
  2. The NSI agrees to export the CompNet indicators and have them published to third parties

The timeline is based on our past experience. It is meant first and foremost as a help when creating metadata and concordance tables from scratch.

3.2.5 Constructing NSI metadata

This section is a guide on how to build the NSI metadata files. It references scripts that can assist users in creating the files from scratch. Naturally, using these scripts alone is not enough, as many fields of the tables need to be specified manually.

1. Constructing NSI_datafiles.csv

This script scans the raw data directory in the protected NSI environment and builds the metadata table required by the CompNet rocket. It creates the *_datafiles.csv file according to the specification mentioned above.

The script can be found in this dropdown menu:

# NSI_datafiles builder (spec §3.2.1.1)
# author: AM-MM, date: 2025-09-29

library(data.table)
library(stringr)

# ---- Inputs you must define upstream ----
# dirINPUTDATA: the NSI data root (manual's reference base dir)
# dirROCKET:    rocket root to use for storing NSI metadata
# CountryCode:  2-letter ISO code (e.g., "IT")
file_path <- dirINPUTDATA  # use the actual base for relative paths

# ---- List files (absolute) ----
abs_files <- list.files(file_path, recursive = TRUE, full.names = TRUE)

# keep only files (exclude dirs)
abs_files <- abs_files[file.info(abs_files)$isdir == FALSE]

# ---- Build table ----
DT <- data.table(abs_path = abs_files)

# relative path to dirINPUTDATA (allow trailing slash in file_path; escape regex)
file_path_norm <- normalizePath(file_path, winslash = "/", mustWork = FALSE)
file_path_esc  <- gsub("([\\^$.|?*+()\\[\\]{}\\\\])", "\\\\\\1", file_path_norm)
DT[, rel := sub(paste0("^", file_path_esc, "/?"), "", normalizePath(abs_path, winslash="/"))]

# split dir / filename / extension
DT[, filename      := basename(rel)]
DT[, path          := dirname(rel)]
DT[, format        := tools::file_ext(filename)]
DT[, NSI_datafile  := tools::file_path_sans_ext(filename)]   # spec name

# ---- Derive NSI_dataset (generic) by stripping all 4-digit years and separators ----
DT[, NSI_dataset := NSI_datafile |>
     str_remove_all("\\d{4}") |>
     str_replace_all("[-_.]+", "_") |>
     str_replace_all("^_|_$", "") |>
     str_to_lower()
]

# ---- Years: extract all 4-digit tokens from the filename.
# Take min/max if present; if none found, leave NA.
# IMPORTANT: always double-check that year_start / year_end are correct,
# especially for files named like "bd2018" (single year) or with unusual patterns.
extract_years <- function(x) as.integer(str_extract_all(x, "\\d{4}")[[1]])
yrs <- lapply(DT$NSI_datafile, extract_years)
DT[, year_start := vapply(yrs, function(v) if (length(v)) min(v) else NA_integer_, integer(1))]
DT[, year_end   := vapply(yrs, function(v) if (length(v)) max(v) else NA_integer_, integer(1))]

# ---- Fields to be filled manually or via later tools ----
DT[, yearvar      := NA_character_]   # name of year column if panel; else ""/NA
DT[, details      := NA_character_]
DT[, firm_unit    := NA_character_]   # one of: plant, legal_unit, enterprise, enterprise_group
DT[, data_source  := NA_character_]   # one of: survey, administrative_source, mixed
DT[, firm_sample  := NA_character_]
DT[, preprocessing:= NA_character_]   # instruction string per §3.2.4–3.2.9

# ---- Final spec order EXACTLY as in the manual ----
out_cols <- c(
  "NSI_datafile","NSI_dataset","yearvar","year_start","year_end",
  "format","path","details","firm_unit","data_source","firm_sample","preprocessing"
)
NSI_datafiles <- DT[, ..out_cols]

# ---- Write CSV ----
outdir <- file.path(dirROCKET, "NSImetadata")
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
fwrite(NSI_datafiles, file.path(outdir, paste0(CountryCode, "_datafiles.csv")))

Steps:

  1. Set inputs at the top of the script
    • dirINPUTDATA: the base directory containing the raw data.
    • dirROCKET: the base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO country code (e.g. "IT").
  2. Run the script
    The script will automatically collect:
    • NSI_datafile (filename without extension)
    • NSI_dataset (generic dataset name, stripped of years and underscores)
    • year_start / year_end (from 4-digit tokens in the filename)
    • format (file extension)
    • path (relative path to dirINPUTDATA)
  3. Fields requiring manual completion
    The following columns are created but left empty (NA). They must be filled manually by the NSI team:
    • yearvar: name of the year column if the file is a panel (leave empty otherwise).
    • details: clarifications on dataset coverage or specific notes.
    • firm_unit: one of {plant, legal_unit, enterprise, enterprise_group}.
    • data_source: one of {survey, administrative_source, mixed}.
    • firm_sample: information on the sampling scheme.
    • preprocessing: description of preprocessing steps, if any (see section below).
  4. Double-check the automatic fields
    • Years: the script extracts all 4-digit numbers from the filename and assigns the minimum to year_start and the maximum to year_end.
      ⚠️ Always verify that these values match the actual time coverage of the dataset. For example, bd2018 will yield both year_start=2018 and year_end=2018, which may or may not be correct.
    • NSI_dataset: confirm that the generic dataset name is harmonised and consistent across files.
  5. Export
    The script writes the CSV to &lt;dirROCKET&gt;/NSImetadata/&lt;CountryCode&gt;_datafiles.csv

Key reminders

  • The script is a first pass only: it automates extraction of filenames, formats, and candidate years.
  • Most metadata fields must be filled manually by the NSI staff who know the data.
  • Always double-check the final file before uploading to the rocket.
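The year-extraction heuristic behind step 4 can be checked in isolation. The function below is a base-R equivalent of the script’s stringr-based extract_years and illustrates why files like bd2018 need manual verification:

```r
# Base-R equivalent of the script's extract_years helper:
# pull every 4-digit token out of a filename stem
extract_years <- function(x) {
  as.integer(regmatches(x, gregexpr("\\d{4}", x))[[1]])
}

extract_years("sbs_2005_2019")  # -> 2005 2019  (year_start=2005, year_end=2019)
extract_years("bd2018")         # -> 2018       (single token: start = end = 2018)
extract_years("bs_data")        # -> integer(0) (year_start/year_end stay NA)
```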

2. Constructing NSI_varnames.csv

This script produces the *_varlist.csv files, one for each dataset listed in NSI_datafiles.csv.
It should always be run after the NSI_datafiles script.

The script can be found in this dropdown menu:

# Script to generate NSI_varnames metadata (per §3.2.1.2 of manual)
# author EB-MH-MM-AM, rev 2025-09-29

library(data.table)
library(dplyr)
library(readxl)
library(tools)

# ---- Inputs you must define upstream ----
# dirINPUTDATA: the NSI data root (manual's reference base dir)
# dirROCKET:    rocket root to use for storing NSI metadata
# CountryCode:  2-letter ISO code (e.g., "IT")

# ---- Inputs ----
md_folder   <- file.path(dirROCKET, "NSImetadata")
NSI_datafiles <- fread(file.path(md_folder, paste0(CountryCode, "_datafiles.csv")))
file_path  <- dirINPUTDATA  # base folder for raw data

# Helper: test if a set of columns uniquely identifies rows
is_key_id <- function(data, cols) {
  n_distinct <- data %>% select(all_of(cols)) %>% distinct() %>% nrow()
  return(n_distinct == nrow(data))
}

# Process per dataset
for (DS in unique(NSI_datafiles$NSI_dataset)) {
  
  NSI_datafiles_filtered <- unique(NSI_datafiles[NSI_dataset == DS,])
  
  # Build file paths
  abs_paths <- file.path(file_path,
                         NSI_datafiles_filtered$path,
                         paste0(NSI_datafiles_filtered$NSI_datafile, ".", NSI_datafiles_filtered$format))
  
  # Load files
  file_list <- lapply(abs_paths, function(f) {
    import_data(dir = dirname(f), file = basename(f), format = file_ext(f))
  })
  
  var_names_list <- lapply(file_list, function(df) data.table(NSI_varname = names(df)))
  
  for (i in seq_along(var_names_list)) {
    rawdata <- file_list[[i]]
    
    # Load variable descriptions
    desc_file <- file.path(file_path,
                           NSI_datafiles_filtered$path[i],
                           paste0(NSI_datafiles_filtered$NSI_dataset[i], "_descr.csv"))
    
    if (!file.exists(desc_file)) {
      stop(paste("Description file not found:", desc_file,
                 "Please create it as required by the manual."))
    }
    description <- fread(desc_file)
    
    if (!"NSI_varname" %in% colnames(description)) {
      stop("Description file must contain a column named 'NSI_varname'")
    }
    
    # Add class, domain, NSI_datafile
    var_names_list[[i]]$class <- sapply(rawdata, function(x) paste(class(x), collapse=","))
    var_names_list[[i]]$domain <- NA_character_  # manual input required
    var_names_list[[i]]$NSI_datafile <- file_path_sans_ext(basename(abs_paths[i]))
    
    # Merge with descriptions
    var_names_list[[i]] <- merge(var_names_list[[i]], description,
                                 by = "NSI_varname", all.x = TRUE)
    
    # Report missing variables in description
    missing_vars <- setdiff(names(rawdata), description$NSI_varname)
    if (length(missing_vars) > 0) {
      message("Missing variable descriptions for ", var_names_list[[i]]$NSI_datafile[1], ": ",
              paste(missing_vars, collapse=", "))
    }
    
    # Identify key variables
    colnms <- colnames(rawdata)
    max_cols <- min(4, length(colnms))
    found <- NULL
    for (k in 1:max_cols) {
      for (comb in combn(colnms, k, simplify = FALSE)) {
        if (is_key_id(rawdata, comb)) {
          found <- comb
          break
        }
      }
      if (!is.null(found)) break
    }
    var_names_list[[i]]$is_key <- var_names_list[[i]]$NSI_varname %in% found
    
    message("++ ", var_names_list[[i]]$NSI_datafile[1], " processed.")
  }
  
  stacked_df <- bind_rows(var_names_list)
  
  # Final column order per manual
  stacked_df <- stacked_df[, c("NSI_datafile","NSI_varname","description","is_key","class","domain")]
  
  # Export
  fwrite(stacked_df, file.path(md_folder, paste0(CountryCode, "_", DS, "_varlist.csv")))
  message("List for dataset ", DS, " exported.")
}

Steps:

  1. Inputs required
    • dirROCKET: base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO code (e.g. "IT").
    • dirINPUTDATA: main folder containing the raw NSI data.
  2. Run the script
    For each dataset in NSI_datafiles.csv, the script will:
    • Load the raw files listed for that dataset.

    • Extract the variable names (NSI_varname).

    • Read the corresponding description file <dataset>_descr.csv (must be provided by the NSI).

    • Record the variable class (data type).

    • Attempt to infer which variables form a key (is_key).

    • Create an empty domain column to be filled manually.

    • Export the compiled metadata to:

      <dirROCKET>/NSImetadata/<CountryCode>_<dataset>_varlist.csv
  3. Fields requiring manual completion
    • description: ensure that the description file is complete and correctly labelled.
    • domain: must always be filled manually (see manual §3.2.1.2 for details).
    • is_key: the automatic detection may fail or give false positives. Double-check and adjust manually.
  4. Double-check the automatic fields
    • Verify that all variables in the raw data are listed in the description file.
      Missing variables are reported in the console when running the script.
    • Confirm that the class column is meaningful and consistent.

Key reminders

  • A description file <dataset>_descr.csv is mandatory. If missing, the script stops with an error.
  • The is_key detection is heuristic. Always verify manually which variables uniquely identify records.
  • The domain classification cannot be automated. It must be completed by the NSI staff.
  • Always inspect the final *_varlist.csv files before uploading them to the rocket.
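
The is_key heuristic can be double-checked by hand on any raw file. A minimal sketch (the data and column names below are hypothetical):

```r
library(data.table)

# Hypothetical raw file: does (firmid, year) uniquely identify rows?
rawdata <- data.table(firmid = c("A", "A", "B"),
                      year   = c(2019, 2020, 2020),
                      sales  = c(10, 12, 7))

# A column set is a valid key if the number of distinct
# combinations equals the number of rows
is_key <- uniqueN(rawdata, by = c("firmid", "year")) == nrow(rawdata)
is_key  # TRUE: every (firmid, year) pair appears exactly once
```

Run the same check with the key columns reported in the *_varlist.csv before accepting the automatic result.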

3. Constructing NSI_class.csv

This script produces the classification metadata files *_class.csv required by the rocket.
It should always be run after the NSI_datafiles script.

The full script is shown below:

# Script to generate NSI_class metadata (per §3.2.1.3 of manual)
# Run only after NSI_datafiles.R

library(readr)
library(dplyr)
library(stringr)

# ---- Inputs ----
CountryCode <- "IT"        # set your 2-letter code
dirROCKET   <- "your_dir"  # base folder for rocket
dirINPUTDATA <- "your_data_folder"  # main raw data folder

# Load metadata from NSI_datafiles
data_files <- read_csv(file.path(dirROCKET, "NSImetadata", paste0(CountryCode, "_datafiles.csv")),
                       show_col_types = FALSE)

# ---- Specify dataset and classification variable ----
class_dataset <- "your_dataset"         # must match NSI_dataset in datafiles
class_name    <- "name_class_variable"  # e.g. "NACE2"

NSI_datafiles_filtered <- filter(data_files, NSI_dataset == class_dataset)

if (nrow(NSI_datafiles_filtered) == 0) {
  stop("Dataset not found in NSI_datafiles: ", class_dataset)
}

# ---- Function to read classification data ----
extract_columns <- function(file_path) {
  data <- read_csv(file_path, show_col_types = FALSE)
  
  required_columns <- c(class_name, "year", "description")
  if (!all(required_columns %in% names(data))) {
    stop("Missing one or more required columns in: ", file_path,
         ". Expected: ", paste(required_columns, collapse=", "))
  }
  
  out <- select(data, all_of(required_columns))
  # Rename classification variable to generic name 'classvar'
  colnames(out)[1] <- "classvar"
  return(out)
}

# ---- Process files ----
results <- list()

for (i in seq_len(nrow(NSI_datafiles_filtered))) {
  f <- file.path(dirINPUTDATA,
                 NSI_datafiles_filtered$path[i],
                 paste0(NSI_datafiles_filtered$NSI_datafile[i], ".",
                        NSI_datafiles_filtered$format[i]))
  
  if (!file.exists(f)) {
    message("File not found: ", f)
    next
  }
  
  results[[i]] <- extract_columns(f)
}

# ---- Combine and export once per dataset ----
# Writing inside the loop would overwrite the output whenever a dataset
# is split across several raw files, so stack the pieces first
extracted <- distinct(bind_rows(results))

# enforce column order
extracted <- extracted[, c("classvar", "year", "description")]

output_file <- file.path(dirROCKET, "NSImetadata",
                         paste0(CountryCode, "_", tolower(class_dataset), "_class.csv"))

write_csv(extracted, output_file)
message("Exported classification metadata to ", output_file)

Steps

  1. Inputs required
    • dirROCKET: base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO code (e.g. "IT").
    • dirINPUTDATA: main folder containing the raw NSI data.
    • class_dataset: the dataset where the classification variable is found (must match an NSI_dataset in NSI_datafiles.csv).
    • class_name: the name of the classification variable (e.g. "nace").
  2. Run the script
    For the specified dataset, the script will:
    • Load the raw files linked to the dataset.

    • Extract three required fields:

      • classvar (the classification variable, renamed from the raw variable class_name),
      • year (validity year),
      • description (text label of the classification code).
    • Export the results into:

      <dirROCKET>/NSImetadata/<CountryCode>_<dataset>_class.csv
  3. Fields requiring manual completion / verification
    • Ensure that the classification variable chosen (class_name) matches the raw file.
    • Check that year and description columns exist and are correctly populated in the raw data.
    • Confirm that the classvar column has been renamed properly and contains only the classification codes.
  4. Double-check the automatic fields
    • The script will stop if any of the required columns (class_name, year, description) are missing.
    • Even if the file is created, NSIs must review the exported *_class.csv carefully to verify that:
      • year corresponds to the reference period of the classification.
      • description correctly describes each classification code.
      • No codes are missing or duplicated.

Key reminders

  • Each dataset that includes a classification variable must have a corresponding *_class.csv file.
  • Column order in the final CSV must be exactly: classvar, year, description
  • Always inspect the final file manually before uploading it to the rocket.
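
One of the manual checks, duplicated codes within a validity year, can be automated. A sketch, assuming the file follows the required column order (classvar, year, description) and using invented codes:

```r
library(data.table)

# Illustrative *_class.csv content (hypothetical codes)
class_dt <- data.table(classvar    = c("01", "02", "02"),
                       year        = c(2015, 2015, 2015),
                       description = c("Crop production", "Forestry", "Forestry (dup)"))

# Flag classification codes that appear more than once in the same year
dups <- class_dt[, .N, by = .(classvar, year)][N > 1]
if (nrow(dups) > 0) {
  message("Duplicated codes found: ", paste(dups$classvar, collapse = ", "))
}
```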

4. Constructing NSI_codebook.csv

This script produces the *_codebook.csv files required by the rocket.
It should always be run after the NSI_datafiles script.

The full script is shown below:

# Script to generate NSI_codebook metadata (per §3.2.1.4 of manual)
# Produces a single consolidated <CountryCode>_codebook.csv
# Run only after NSI_datafiles.R

library(data.table)
library(tools)

# ---- Inputs ----
CountryCode  <- "IT"              # two-letter code
dirROCKET    <- "your_dir"        # rocket base folder
dirINPUTDATA <- "your_data_folder"

# ---- Import function ----
import_data <- function(file_path) {
  fread(file_path, stringsAsFactors = FALSE)
}

# ---- Helper: detect large digit variation ----
has_large_digits_variation <- function(values, threshold = 1) {
  values <- na.omit(values)
  digits <- nchar(as.character(values))
  digit_diff <- abs(digits - min(digits, na.rm = TRUE)) > 3
  sum(digit_diff) > threshold
}

# ---- Create codebook for one dataset ----
create_codebook <- function(df, dataset_name,
                            max_unique_values = 50,
                            digit_variation_threshold = 1) {
  codebook <- data.table(NSI_dataset = character(),
                         NSI_varname = character(),
                         code = character(),
                         year = character(),
                         description = character())
  
  for (var_name in names(df)) {
    unique_values <- unique(df[[var_name]])
    
    # Skip high-cardinality vars or numerics with wide digit variation
    if (length(unique_values) > max_unique_values ||
        (is.numeric(df[[var_name]]) &&
         has_large_digits_variation(unique_values, digit_variation_threshold))) {
      next
    }
    
    temp_dt <- data.table(
      NSI_dataset = dataset_name,
      NSI_varname = var_name,
      code = as.character(unique_values),
      year = "",                         # ++++ to be reviewed manually ++++
      description = NA_character_        # ++++ to be filled manually ++++
    )
    codebook <- rbind(codebook, temp_dt, fill = TRUE)
  }
  
  return(codebook)
}

# ---- Create single consolidated codebook for all datasets ----
create_codebook_all <- function(rd_folder, md_folder, CountryCode,
                                max_unique_values = 50, digit_variation_threshold = 1) {
  csv_files <- list.files(path = rd_folder, pattern = "\\.csv$", full.names = TRUE)
  all_codebooks <- list()
  
  for (file_path in csv_files) {
    dataset_name <- tools::file_path_sans_ext(basename(file_path))
    dataset <- import_data(file_path)
    cb <- create_codebook(dataset, dataset_name,
                          max_unique_values, digit_variation_threshold)
    all_codebooks[[dataset_name]] <- cb
    message("Processed dataset: ", dataset_name)
  }
  
  # Stack all datasets together
  codebook_all <- rbindlist(all_codebooks, fill = TRUE)
  
  # Enforce manual's column order: NSI_dataset, NSI_varname, code, year, description
  codebook_all <- codebook_all[, c("NSI_dataset","NSI_varname","code","year","description")]
  
  # Export single consolidated file
  output_file <- file.path(md_folder, paste0(CountryCode, "_codebook.csv"))
  fwrite(codebook_all, output_file, quote = FALSE)
  message("Exported consolidated codebook: ", output_file)
}

# ---- Execute ----
md_folder <- file.path(dirROCKET, "NSImetadata")
dir.create(md_folder, showWarnings = FALSE, recursive = TRUE)
create_codebook_all(dirINPUTDATA, md_folder, CountryCode)

Steps

  1. Inputs required
    • dirROCKET: base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO code (e.g. "IT").
    • dirINPUTDATA: main folder containing the raw NSI data.
  2. Run the script
    For each raw dataset (CSV) in the folder, the script will:
    • Extract variable names (NSI_varname).

    • Collect their observed values (code).

    • Create empty year and description fields.

    • Stack all rows and record which NSI_dataset each row belongs to.

    • Export the result to:

      <dirROCKET>/NSImetadata/<CountryCode>_codebook.csv
  3. Fields requiring manual completion
    • year: must be reviewed and filled manually where relevant.
    • description: must always be filled manually (label for each code).
  4. Double-check the automatic fields
    • The script excludes variables with too many unique values or with large numeric variation.
    • Ensure that important categorical variables were not skipped.
    • Verify that codes are consistent across years.

Key reminders

  • Column order in the final CSV must be exactly: NSI_dataset, NSI_varname, code, year, description
  • This script only provides a first draft. Most of the meaningful content (year, description) must be added manually by NSI staff.
  • Always inspect the final file carefully before uploading to the rocket.
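
To verify that no important categorical variables were skipped, the codebook can be compared against the variable list. A sketch with hypothetical variable names:

```r
library(data.table)

# Hypothetical inputs: the dataset's varlist and the consolidated codebook
varlist  <- data.table(NSI_varname = c("nace", "legal_form", "sales"))
codebook <- data.table(NSI_varname = c("nace", "legal_form"))

# Variables present in the raw data but absent from the codebook
# (high-cardinality, numeric, or skipped by mistake)
skipped <- setdiff(varlist$NSI_varname, codebook$NSI_varname)
skipped  # review manually: "sales" is numeric, so skipping it is expected
```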

5. Constructing the timeconc table

The timeconc table is part of the metadata required by the rocket.
Unlike the other metadata files, it cannot be generated from the raw microdata.

Key points

  • The timeconc table provides official information on the time coverage of the data.
  • It must be obtained directly from an authoritative NSI or official source.
  • Once collected, the table should be stored and maintained in the NSImetadata folder with the naming convention: <classvar>t0_<classvar>t1_conc (e.g. NSI_pcc8t0_pcc8t1_conc.csv).

Responsibilities

  • The NSI staff must provide the timeconc table using official sources (e.g. methodological notes, published documentation, internal validation).
  • The role of the CompNet rocket is only to read and integrate this file; it does not generate it.

Key reminder

⚠️ Always ensure that the timeconc table comes from an officially validated source and is kept up to date. This file underpins the correct interpretation of the temporal dimension of the datasets and cannot be replaced by automated extraction.

6. Constructing NSI_keyID1_keyID2_conc.csv

The firm ID concordance table establishes the link between two firm identifiers (among plantid, firmid, entid, entgrp) used in different datasets.
It is essential for ensuring consistent longitudinal tracking of firms and dataset merging.

The full script is shown below:

# Pseudo-code: Building the firm ID concordance table (NSI_keyID1_keyID2_conc.csv)

library(data.table)

# ---- Inputs ----
CountryCode  <- "IT"                  # two-letter code
dirROCKET    <- "your_dir"            # rocket base
dirINPUTDATA <- "your_data_folder"    # raw data

# Step 1: Identify dataset(s) that contain both ID variables
# Example: suppose "id_old" and "id_new" are two firm identifiers
candidate_datasets <- c("dataset_with_ids")

# Step 2: For each dataset, load and stack across years if not a panel
firmid_list <- list()

# Read NSI_datafiles
datafiles <- fread(file.path(dirROCKET, "NSImetadata",
                             paste0(CountryCode, "_datafiles.csv")))

for (ds in candidate_datasets) {
  # Build file path(s) from NSI_datafiles.csv (folder + file name + extension)
  ds_rows <- datafiles[NSI_dataset == ds, ]
  files <- file.path(dirINPUTDATA, ds_rows$path,
                     paste0(ds_rows$NSI_datafile, ".", ds_rows$format))
  
  # If multiple cross-sections: bind them into a long panel (add year column!)
  if (length(files) > 1) {
    raw <- rbindlist(lapply(files, fread), fill = TRUE) # Works only if datafiles have the same column names!
  } else {
    raw <- fread(files)
  }
  
  year_var <- '...' # Define the variable name for the year variable
  
  # Step 3: Extract only the two ID columns + year column ---> manually fix the id column names
  firmid_sub <- unique(raw[, c("id_old", "id_new", year_var), with = FALSE])
  
  # Step 4: Standardise column names (pick the keyID names from plantid, firmid, entid, entgrp)
  setnames(firmid_sub, old = c("id_old", "id_new", year_var),
           new = c("keyID1", "keyID2", "year"))
  
  firmid_list[[ds]] <- firmid_sub
}

# Step 5: Combine all datasets (if more than one provides concordance)
firmid_all <- rbindlist(firmid_list, fill = TRUE)

# Step 6: Export to NSImetadata
fwrite(firmid_all, file.path(dirROCKET, "NSImetadata",
                             paste0(CountryCode, "_keyID1_keyID2_conc.csv"))) # Substitute the keyIDs with their proper names!

Key principles

  • The table can only be created if at least one dataset contains both identifiers in the same file.
  • If the dataset is not a panel but a set of yearly cross-sections, it must be stacked into a long format before extracting IDs.
  • The table must always contain unique triples: keyID1, keyID2, year, where the ID names need to be picked from plantid, firmid, entid, entgrp.

Steps

  1. Identify dataset(s)
  • Review NSI_datafiles.csv and raw data.
  • Find which dataset(s) include the two firm ID variables (e.g. id_old and id_new).
  2. Stack data if needed
  • If the dataset is stored as separate cross-sections by year, stack them and add a year column.
  • If the dataset is a panel, the year is already present.
  3. Extract unique concordance
  • Keep only the two ID columns and the year column.
  • Deduplicate (unique) to avoid duplicates across files.
  4. Rename columns
  • Use the standard names:
    • keyID1
    • keyID2
    • year
  5. Export
  • Save as:

    <dirROCKET>/NSImetadata/<CountryCode>_<keyID1>_<keyID2>_conc.csv

Key reminders

  • This file is not always available — it depends on the data structure in the NSI.
  • The NSI staff must verify that the mapping is correct and covers the relevant years.
  • Always check that:
  • Both ID variables are properly harmonised.
  • No spurious duplicates or mismatches exist.
  • Cross-sections have been stacked correctly.
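
The uniqueness requirement on the exported triples can be asserted before uploading. A sketch with invented identifiers:

```r
library(data.table)

# Hypothetical concordance between two firm identifiers
conc <- data.table(keyID1 = c("F001", "F001", "F002"),
                   keyID2 = c("E900", "E900", "E901"),
                   year   = c(2018, 2019, 2018))

# The exported table must contain unique (keyID1, keyID2, year) triples
conc_unique <- unique(conc, by = c("keyID1", "keyID2", "year"))
stopifnot(nrow(conc_unique) == nrow(conc))  # fails if spurious duplicates remain
```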

3.2.6 Data Pre-Processing

In the Netherlands and France, NSIs have already harmonized their raw data files to resemble MD datasets, resulting in minimal harmonization work being required from the Launcher.

However, there is a strategic intention to shift the boundary between the responsibilities of NSIs and the MDI infrastructure. Two approaches are under consideration:

  • NSIs document their raw files, and the Launcher—guided by this metadata—performs the harmonization and constructs the MD panels.

  • NSIs carry out the full harmonization to MD standards, and the Launcher simply reads the pre-harmonized files into R.

Some raw datasets require specific preprocessing. The infrastructure handles this with the preprocessing tool (rocket/MDIprogs/datafile_preprocessing_tool.R), which runs right after the launcher imports a raw datafile and before that file is harmonized.

The tool is a general-purpose function designed to apply one or more preprocessing transformations to raw datasets (stored as data.table objects). It enables modular, rule-based data cleaning and transformation by interpreting a structured string called preprocessing_string.

How it works

  1. Instruction string (preprocessing_string) encodes all preprocessing steps.
  2. The string is split into separate operations using ||.
  3. Each operation is parsed and executed in sequence.
  4. The data is modified in-place step by step, and the final data.table is returned.

Syntax rules

  • Operations are separated by ||
  • Parameters within each operation are separated by |
  • Multiple elements in a parameter (e.g., multiple column names) are separated by a hash symbol #
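
These separators can be illustrated by splitting an instruction string by hand (a sketch; the tool's internal parser may differ):

```r
# Hypothetical instruction string with two operations
preprocessing_string <- "filter|year|gt|2010||trimchars|vat_id#tax_id|2"

# Split into operations on "||", then into parameters on "|"
ops <- strsplit(preprocessing_string, "||", fixed = TRUE)[[1]]
parsed <- lapply(ops, function(op) strsplit(op, "|", fixed = TRUE)[[1]])

parsed[[1]]  # "filter" "year" "gt" "2010"
# Multi-element parameters are split on "#"
strsplit(parsed[[2]][2], "#", fixed = TRUE)[[1]]  # "vat_id" "tax_id"
```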

Supported Operations

  • dedup
    Format: dedup|id_col1#id_col2|method|[optional:dedup_col]
    Removes duplicates by ID(s). Methods: na, min, max, meanmode, random.

  • filter
    Format: filter|column|operator|value
    Filters rows based on logical conditions. Operators: eq, neq, gt, gte, lt, lte, in, nin.

  • agg
    Format: agg|group_col1#group_col2|var1:func#var2:func
    Aggregates rows over groups using functions: sum, mean, min, max, median, sd, mode, pickmaxby-refcol.

  • restruct
    Format: restruct|column_to_remove|col1#col2#col3
    Drops a column and deduplicates rows based on the remaining selected columns.

  • reshape
    Format: reshape|id_col1#id_col2|names_from|values_from1#values_from2
    Reshapes data from long to wide format using dcast().

  • derive
    Format: derive|new_col|condmap|cond1:val1#cond2:val2#...|default:<default_val>
    Creates a new column from conditional logic. Conditions use standard R syntax; values can be column names or literals. A default must be specified.

  • scaleif
    Format: scaleif|condition_col|val1:factor1#val2:factor2#...|col1#col2#...
    Conditionally multiplies one or more columns by a factor based on a categorical column's value.

  • trimchars
    Format: trimchars|col1#col2#...|n
    Trims the last n characters from each specified character column. Useful to normalize identifiers or string variables.

  • mergefrom
    Format: mergefrom|datafile_name|join_key1#join_key2#...|col1#col2#...|[join_type]
    Imports one or more raw files belonging to datafile_name (as defined in column NSI_datafile of file NSI_datafiles), stacks them if multiple, and merges the specified columns into the current datafile using the provided join keys. By default a left join is performed. If join_type is set to outer, a full outer join is applied.

Special Feature in agg: pickmaxby-refcol

You can specify that a categorical column should take the value from the row with the highest value in another column.

Syntax:
my_categorical_col:pickmaxby-SCORE_col

This selects the value of my_categorical_col from the row that has the highest SCORE_col within each group.
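
In plain data.table terms, pickmaxby corresponds roughly to the following (illustrative data; not the tool's actual implementation):

```r
library(data.table)

dt <- data.table(firm_id = c("A", "A", "B"),
                 sales   = c(5, 9, 3),
                 country = c("IT", "DE", "FR"))

# agg|firm_id|sales:sum#country:pickmaxby-sales, roughly:
agg <- dt[, .(sales   = sum(sales),
              country = country[which.max(sales)]),  # value from the max-sales row
          by = firm_id]
agg  # firm A gets country "DE" (the row with sales = 9)
```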


Example

Code
preprocessing_string <- 
  "dedup|firm_id#year|random||
   filter|year|gt|2010||
   agg|firm_id|sales:sum#country:pickmaxby-sales||
   trimchars|vat_id|2||
   mergefrom|employment_survey_08|FIRM_ID#year|industry_code#size_class|left"

In case you need clarifications regarding the tool, please reach out to the MDI team.

The advantage of having the Launcher perform the harmonization is a reduction in maintenance costs for NSIs, particularly for recurring annual updates. It also improves the codification and reproducibility of the conceptual work done by NSI staff. However, this approach entails higher initial costs, as it requires NSIs to adopt a more rigid system of metadata documentation and to coordinate more closely with the MDI team.

3.3 Run MDI: Harmonization & Modules

This section describes the main steps to configure and run the MDI system. It covers how to prepare the countdown.R script, perform metadata checks, and execute harmonization and analysis modules. Each step is explained in detail below. An overview of all the steps can be found in the previous section.

3.3.1 Countdown

The countdown (countdown.R) is the starting point for all use of the MDI. It requires the user to make a number of adjustments to ensure that the MDI can be executed successfully for the selected purpose. The following parameters must be reviewed and set by the user.

  • MDI Installation Directory (dirMDI)
    Set the full path to the directory where all MDI files are installed. Make sure the path ends with a “/”, e.g. dirMDI = "my/dir/".

  • Country code (CountryCode)
    Specify a 2-letter country code following the ISO 3166-1 alpha-2 standard.

  • NSI data directory (dirINPUTDATA)
    Set the full path to the directory containing the NSI firm-level data files (or mock data files). These are the raw input files provided by the National Statistical Institute (NSI).

  • Output directory (dirOUTPUT)
    Define the directory to which all generated output files will be exported. This directory must have read/write permissions and will contain results, module outputs, and other exported files.

  • Temporary storage directory (dirTMPSAVE)
    Set the directory used for temporary storage of MDI virtual longitudinal datasets.

Note

This directory is used to store intermediate datasets and allows reuse of processed data without re-importing raw NSI files.

  • Optional – flags for temporary files
    These flags control how the MDI process handles raw data imports and temporary files.

    • MDIimportFlag: Set to TRUE to import raw NSI data files. If FALSE, existing virtual datasets stored in dirTMPSAVE are used.
    • MDIcleanTMP: If TRUE, the temporary directory dirTMPSAVE is cleaned before execution.
  • Optional – mock data flag
    This flag controls whether the MDI is executed using mock data.

    • IsMOCK: Set to TRUE to run the MDI in a test scenario using mock data. When TRUE, temporary files are stored in a country-specific subdirectory to avoid overwriting files when switching countries.
  • Optional – execution control flags
    These flags influence how the MDI scripts run and how much output is produced.

    • MDImoduleRUN: Set to TRUE only after post-harmonization checks have been completed and research modules are ready to run. It should be FALSE during the first execution.

    • MDIdebug: Set to TRUE to display logs, warnings, and errors. Use FALSE for a quieter run.

    • MDIimputeFlag: Reserved for potential data imputation routines (currently not in active use).

    • filteredHarmonization: Set to TRUE if harmonization should be restricted to variables listed in the current MDnames_select file.

Click here to see the entire countdown.R script
# This file is used to start MDI
# fill in all the parameters and save this file: countdown.R
# run the program in R and then choose to execute:
# 1. run pre_launch_checker.R  to run after an update of MDI at NSI, to check and fix metadata
# 2. run liftoff.R to run rocket: execute MD harmonizer and run payload modules
# 3. run prepare_NSI.R to run things to aid in getting metadata in good shape
# 4. run interactive_MDI.R to initialize environment to test/debug/explore/write module code.


rm(list = ls())

MDI_launch_version <- "v2.3"


########################################
# Compulsory steps
########################################


########################################
# 10. Set the full path to the directory where you install the MDI files
########################################

dirMDI <- "/files/MDI/"


########################################
# 9. Give 2 letter country code for your site ("ISO 3166-1 alpha-2" standard)
########################################

CountryCode <- "PTx"


########################################
# 8. Set the full path to the  directory with NSI firm-level data files (or mockdata files)
########################################

dirINPUTDATA <- "/files/NSIdatafiles/"


########################################
# 7. Set the full path to the directory to which generated files are exported (dirOUTPUT)
########################################

dirOUTPUT <- "/files/output/"


########################################
# 6. Set the full path to the directory for temporary storage of MDI virtual longitudinal datasets (dirTMPSAVE)
########################################

dirTMPSAVE <- "/files/TMP/"


########################################
# Optional steps (steps 5, 4, 3 & 2)
########################################

#####################################
# 5. Flag for temporary MDI files   #
#####################################

# set MDIimportFlag=TRUE if you want to import raw NSI data files (if FALSE: reads MDI virtual data from dirTMPSAVE)

MDIimportFlag <- TRUE

################################################
# 4. Flag for cleaning the temporary folder    #
################################################

# set MDIcleanTMP=TRUE if you would like to clean dirTMPSAVE before running

MDIcleanTMP <- FALSE

#####################################
# 3. Flag for mock data use         #
#####################################

IsMOCK <- TRUE
# If IsMOCK, temporary files are stored in a CountryCode subfolder,
# so that files aren't overwritten when switching country
if (IsMOCK) {
  dirTMPSAVE <- paste0(dirTMPSAVE, CountryCode, "/")
  if (!dir.exists(dirTMPSAVE)) {
    dir.create(dirTMPSAVE)
  }
}

#####################################
# 2. Flags to control execution     #
#####################################

# Set MDImoduleRUN = TRUE if the post_harmonization script has been run and checked and modules are ready to be run.
# Should be set to FALSE when running the launch for the first time.

MDImoduleRUN <- FALSE

# set MDIdebug = TRUE if you don't want to suppress logs, warnings and errors

MDIdebug <- TRUE

## NOTE EB: Nothing done at the moment with the imputeflag (was called inputeflag in early versions)

MDIimputeFlag <- FALSE

# If you want the harmonization to be done only for the variables included in the current
# launch's MDnames_select file
filteredHarmonization <- FALSE

##############################
# 1. Liftoff                 #
##############################

# save the program countdown.R to your work directory.
# Run the file to choose which program/feature to execute:

# Now, pick the program to be executed
# 1. run pre_launch_checker.R  to run after an update of MDI at NSI, to check and fix metadata
# 2. run liftoff.R to run rocket: execute MD harmonizer and run payload modules
# 3. run prepare_NSI.R to run things to aid in getting metadata in good shape
# 4. run interactive_MDI.R to initialize environment to test/debug/explore/write module code.
# ---> Choose below with number of the selected program

# Check if the session is interactive (works both in RStudio and console)

if (interactive()) {
  # Use select.list() for interactive selection

  user_input <- select.list(c("pre_launch_checker.R", "liftoff.R", "prepare_NSI.R", "interactive_MDI.R"), title = "Choose a program to run:")


  if (user_input != "") {
    # Source the corresponding script

    source(paste0(dirMDI, "launchpad/", user_input))
  } else {
    cat("No selection made. Exiting.\n")
  }
} else {
  # If not in an interactive session, use readline()

  user_input <- as.integer(readline(prompt = "Please enter an integer (1 for pre_launch_checker.R, 2 for liftoff.R, 3 prepare_NSI.R, 4 interactive_MDI.R): "))
  if (!is.na(user_input) && user_input %in% 1:4) {
    # Map the user input to the corresponding script name
    scripts <- c("pre_launch_checker.R", "liftoff.R", "prepare_NSI.R", "interactive_MDI.R")
    # Source the corresponding script

    source(paste0(dirMDI, "launchpad/", scripts[user_input]))
  } else {
    cat("Invalid input. Please enter a valid integer (1, 2, 3 or 4).\n")
  }
}

3.3.2 Pre-Launch-Checker

The program pre_launch_checker.R (run countdown.R and choose this program) needs to be run before anything else. It performs various checks on the NSI metadata to avoid errors later on. The results of the checks can be found in the file pre_launch_checker_results.txt in the output directory. It lists possible errors that should be fixed in the NSI metadata. Additionally, two concordance files (NSI_pcc8t0_pcc8t1_conc.csv and NSI_MD_nace_conc.csv) are created using existing concordance files and updating them with the data at the NSI. These concordance tables might contain empty values if no value was previously defined. Missing values need to be filled in manually. When the concordance files are ready to be used, they need to be moved to the directory indicated in pre_launch_checker_results.txt.

3.3.3 Post-Harmonization Quality Checks

After harmonizing your country’s microdata to the MD format, the Post-Harmonization Checker (PHC) script is automatically executed in the rocket to ensure that the harmonized datasets meet essential quality and consistency standards. This diagnostic process validates whether the resulting data is clean, correctly structured, and ready for module execution.

The script performs the following checks on the harmonized datasets:

  1. Duplicate Check: Identifies rows where the key ID variable (e.g., firmid) is duplicated.
  2. Variable Class Check: Verifies that each variable matches its expected R data type (e.g., numeric, character, date).
  3. Date Format Check: Ensures date variables are correctly formatted and parseable (e.g., %Y, %d%m%Y).
  4. Date Range Check: Extracts the minimum and maximum detected dates per variable to check that the date range matches expectations.
  5. Break Detection: Identifies structural breaks in aggregate-level distributions over time (jumps of more than 10%).

Each of these checks outputs either a summary table (.txt) or a visual diagnostic (.pdf) to help identify problems.

3.3.3.1 PHC Output Files Generated

After the script runs successfully, you will find the following two files:

  1. <CountryCode>_phc_results.txt
    Location: dirTMPSAVE
    Contents: Duplicate summary, class and format mismatches, detected date ranges.

Duplicate Check Table

  • dataset: MD dataset (e.g., BR, SBS, ICTEC)
  • id_var: The country-specific ID variable used to identify unique records, taken from MD_idInfo
  • has_duplicates: TRUE if duplicated rows are found based on id_var, FALSE otherwise
  • num_duplicated_rows: Total number of rows that are duplicates (may include multiple per key)
  • num_unique_duplicated_keys: Number of unique key values (id_var) that are duplicated
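As a minimal illustration of how such a summary row can be computed (toy data; the real PHC reads the harmonized .RDS files and takes id_var from MD_idInfo):

```r
# Toy dataset with one duplicated key; the real check runs per MD dataset.
dat <- data.frame(
  firmid = c("A1", "A2", "A2", "B7"),
  year   = c(2020, 2020, 2020, 2020)
)
id_var <- "firmid"

# Mark every row whose key appears more than once
dup_rows <- duplicated(dat[[id_var]]) | duplicated(dat[[id_var]], fromLast = TRUE)

dup_summary <- data.frame(
  id_var                     = id_var,
  has_duplicates             = any(dup_rows),
  num_duplicated_rows        = sum(dup_rows),
  num_unique_duplicated_keys = length(unique(dat[[id_var]][dup_rows]))
)
dup_summary
# has_duplicates = TRUE, num_duplicated_rows = 2, num_unique_duplicated_keys = 1
```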

Variable Class Check Table

  • dataset: MD dataset
  • variable: Variable name being checked
  • expected_class: Class assigned to this variable in the metadata (MD_varnames)
  • actual_class: Actual class detected in the harmonized .RDS file
  • class_match: TRUE if expected and actual class match, FALSE otherwise

Date Format & Class Check Table

  • dataset: MD dataset
  • variable: Variable name being checked
  • expected_class: Expected class (usually "date")
  • actual_class: Class detected in the file
  • expected_format: Date format expected (e.g., %Y, %d%m%Y)
  • actual_format: Detected format based on sample values
  • format_valid: TRUE if values can be parsed using expected_format, FALSE otherwise
  • class_match: Whether the variable is stored as a Date object

Date Range Check Table

  • dataset: MD dataset
  • variable: Date variable being checked
  • actual_format: Detected format used to parse the variable
  • actual_min_date: Earliest parsed date in the variable
  • actual_max_date: Latest parsed date in the variable
  • expected_range: Expected range of years (as specified in MD_catalogue)

  2. breaks_report.pdf
    Location: dirTMPSAVE
    Contents: Plots showing time-series breaks for each numeric variable by dataset. Red dots mark a structural break in the time series, defined as a jump of at least 10%.

Break Summary Table (PDF)

  • dataset: Dataset name
  • variable: Numeric variable being assessed for breaks
  • stat: Statistic showing the break (e.g., mean, p50, sd)
  • year: Year in which a structural break was detected
  • growth: Relative change from the previous year (e.g., +0.25 = 25% increase, -1.0 = 100% drop)
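The break rule described above (a jump of more than 10% in a statistic relative to the previous year) can be sketched in R as follows. This is a simplified illustration; the real script computes this per dataset, variable, and statistic.

```r
# Flag years where a summary statistic changes by more than 10% year-on-year.
detect_breaks <- function(years, stat_values, threshold = 0.10) {
  # Relative change from the previous year
  growth <- diff(stat_values) / head(stat_values, -1)
  flagged <- abs(growth) > threshold
  data.frame(
    year   = years[-1][flagged],
    growth = growth[flagged]
  )
}

detect_breaks(2015:2019, c(100, 103, 140, 142, 70))
# flags 2017 (growth ~ +0.36) and 2019 (growth ~ -0.51)
```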

3.3.3.2 Instructions for Country Leaders for Reviewing and Fixing PHC Errors

  1. Duplicate Check
  • Check: Whether any rows share the same key (e.g., firmid) more than once.
  • Look for: has_duplicates == TRUE and high values in num_duplicated_rows or num_unique_duplicated_keys.
  • Fix: Review your harmonization step and ensure that each firm-year observation is uniquely identified. If intentional (e.g., due to panel structure), document it clearly.
  2. Variable Class Check
  • Check: Compares expected vs. actual data types.
  • Look for: class_match == FALSE
  • Fix: In your country metadata, ensure each variable is explicitly cast to the correct type using functions like as.numeric(), as.character(), or as.Date() in the revalue method.
  3. Date Format Check
  • Check: Whether date variables match expected formats (e.g., %d%m%Y).
  • Look for: format_valid == FALSE or actual_format == "unknown"
  • Fix: Recheck how date strings are parsed in your harmonization script and in the metadata. Use as.Date() with the proper format string.
  4. Date Range Check
  • Check: Compares detected date range with expected year coverage.
  • Look for: Min or max dates far outside expected range (e.g., year 1001 or 9122).
  • Fix: Likely due to incorrect parsing. Verify input formats and metadata.
  5. Break Detection
  • Check: Identifies abrupt jumps/drops in:
    • p25, p50, p75
    • Mean
    • Standard deviation
  • Look for: Large positive/negative growth values in the break summary and red dots in the plots.
  • Fix: Review input consistency across years (e.g., variable definitions, missing categories). Cross-check with national data providers to see if breaks are expected due to methodology changes.
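The class and date fixes described above can be sketched as follows. The variable names and formats in this toy example are hypothetical; the real expected types and formats come from your country metadata (MD_varnames).

```r
# Toy harmonized data with two common PHC failures (names are illustrative).
dat <- data.frame(
  empl   = c("12", "305"),             # stored as character, expected numeric
  birthd = c("31121999", "01012005"),  # date strings in %d%m%Y
  stringsAsFactors = FALSE
)

# Variable class mismatch: cast explicitly to the expected type
dat$empl <- as.numeric(dat$empl)

# Date format problem: parse with the format declared in the metadata
dat$birthd <- as.Date(dat$birthd, format = "%d%m%Y")

sapply(dat, class)  # empl: "numeric", birthd: "Date"
```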
Note

You are ready to proceed with running the MDI modules (e.g., setting MDImoduleRun = TRUE) only after:

  • All critical issues (e.g., duplicate rows, format mismatches, corrupted dates) are resolved.
  • You have documented any justified exceptions (e.g., expected breaks).
  • You have shared updates or escalated open issues to the MDI team.

Please keep backup copies of your harmonized .RDS files before making changes.

3.4 Developing & Testing

3.4.1 Nuvolos Developer Space

This space is intended for code development, module creation, and script testing on mock data. It is designed for MDI team members and module writers who are familiar with the MDI infrastructure and have access to the MDI GitHub repository. (More information on Nuvolos: where MDI users develop and test their code)

Each user must connect their GitHub account to enable pushing and pulling changes. Note the following:

  • Each user works in their own isolated space—your changes remain private until you explicitly push them to GitHub.

  • It is your responsibility to ensure you are working on the latest version of the MDI codebase by pulling updates from GitHub when you start your RStudio session (how to work with Git).

  • The workspace includes both NSI mock data and harmonized MD mock data, which you can use for testing and developing your modules. You can find the NSI mockdata in space_mounts/mockdata/NSIdata/ and the harmonized MD mockdata in space_mounts/mockdata/TMP/. (Note: These folders are only accessible via RStudio and won’t show up in the Files section)

If you’re using the Nuvolos Developer Space for the first time you need to connect your GitHub account. Follow the steps below:

  • Open Nuvolos, navigate to “applications” (left menu) and open RStudio

  • Go to the terminal

    • Generate a public/private key pair by executing this command: ssh-keygen -t ed25519 (no need to change suggested location or create a password > press enter 3 times)
    • Navigate to the folder where both keys are saved. You can do that in the Files section on the right side; you may have to click “/” to see all directories. The folder .ssh is hidden, so click the gear symbol and select “Show Hidden Files”
  • Open id_ed25519.pub and copy its contents

  • Open GitHub in the browser

    • In GitHub: go to Settings > SSH and GPG keys
    • Click “New SSH key”
    • Paste the copied public key into the key field, add a title, e.g. “Nuvolos MDI test environment”, then save
  • Back to Nuvolos, in RStudio, Terminal, clone the branch using this command (from within /files folder, which is default):

    • git clone --branch pre_Launch_v2.2_backup --single-branch git@github.com:Secretariat-CompNet/MDI.git
    • The MDI with all files will show up in the files section on the right
    • Then go to home/datahub/ in the Files section and open the .gitconfig file. The file will look like this:
    [user]
          email = 12345678+Name@users.noreply.github.com
          name = YourName
    [credential]
          helper = cache --timeout 64800
  • Make sure that the email address is the one from your GitHub account, not e.g. your IWH email address. To check which one is correct, go to your GitHub account > Settings > Emails. Copy the email address ending in @users.noreply.github.com into the .gitconfig file as shown above. Save the file.

  • The MDI is now correctly set up. You can run your code with mock data, edit it, and pull/push changes to GitHub.

  • To test code: execute countdown and select interactive_MDI (option 4) before running your own script.

3.4.2 General Workflow with Git

If you’re in the right branch and your repository is up-to-date, this is the normal workflow:

  1. You change a file, add a module, or add metadata.
  2. You save your edited file.
  3. (Best practice: Check status (git status) and make sure there are no recent updates on the branch)
  4. You add your file(s) to a commit (git add file_name)
  5. You create a commit with a commit message (git commit -m "this is the commit message")
  6. You push your commit to GitHub (git push)
  7. You can verify your commit with git log, which lists all recent commits (with the one you just made on top).

If you haven’t worked with the MDI in a while, the repository might be outdated or you might be in an old branch. Below are the steps to make sure you’re working in the right branch and have the latest updates.

  1. Navigate into your MDI repository

    cd /files/MDI

  2. Verify the status of your MDI version

    git status

    This tells you which branch you’re on, e.g.: On branch branch_name

    and if you’re up-to-date with the latest changes. There are three possible options:

    1. Your branch is up to date with 'origin/branch_name'. You have all the latest changes of that branch. No need to do anything else

    2. Your branch is ahead of 'origin/branch_name' by x commits. You have changes that you didn’t push to GitHub yet.

    3. Your branch is behind 'origin/branch_name' by x commits. There are updates that you haven’t pulled yet.

  3. If you want to change the branch: git switch new_branch_name

  4. If you want to pull changes: git pull

  5. If you want to push your changes:

    1. To add updated files to a commit use: git add name_of_your_changed_file (use that command for each file individually, or use git add . to add all changed files)

    2. To create a commit use git commit -m "Add your commit message here" (make sure your commit message describes your updates well)

    3. Push your commit to GitHub: git push

  6. Check the status again to verify that you’re in the latest version of your desired branch: git status

    This should now give:

    On branch branch_name
    Your branch is up to date with 'origin/branch_name'.
    nothing to commit, working tree clean

3.4.3 How to add your research module to the MDI infrastructure

If you want to add your module to the MDI infrastructure via the Nuvolos Developer Space, you need access to the MDI GitHub repository and a GitHub account set up in Nuvolos (see: first-time users). Then open RStudio in the Developer Space and follow these steps to add your module:

  1. Add a module folder

    In the Files section on the right, navigate to the folder MDI/payload/Launch_vX.X/Rmodules/ (replace X.X with the current launch version). In that folder, create a new folder with a two-character name abbreviating your research module, e.g. “XY”.

  2. Add your MDnames_select file

    Inside your module folder, add the MDnames_select file. This file contains a list of the variables that your module uses (more information here) and needs to be named (res_group)_MDnames_select.csv, where res_group is your module abbreviation, e.g. XY_MDnames_select.csv.

  3. Add your main script

    Inside your module folder, add your main script. It must be named Launch_X.X_(res_group).R, where X.X is the current launch version and res_group is your module abbreviation, e.g. Launch_2.3_XY.R. This script is executed when your module is run. That does not mean all your code needs to live in that one script: you can add as many scripts as you like and call them using source("path/to/your/script.R").

  4. Add any other scripts, files, or folders

    If you need any additional scripts or files place them in your module folder.

This is an example of a module folder: The folder has the module abbreviation CN and contains the main script Launch_2.3_CN.R, the CN_MDnames_select.csv file and two additional files EU_countries.csv and Questionnaire.xlsx.
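A hypothetical skeleton of such a main script might look as follows. The dirModule path and the helper script name are illustrative assumptions; in a real launch, the countdown sets up the actual paths and environment.

```r
res_group <- "XY"                              # module abbreviation (example)
dirModule <- file.path("Rmodules", res_group)  # hypothetical module path

# The variable list this module declares in its MDnames_select file
select_file <- file.path(dirModule, paste0(res_group, "_MDnames_select.csv"))

# Additional code can live in separate scripts and be sourced from here:
# source(file.path(dirModule, "prepare_data.R"))

select_file
# "Rmodules/XY/XY_MDnames_select.csv"
```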

3.4.4 How to develop and test your module using (mock) data

To develop your module you first need to adjust and run the countdown. This will import all libraries and variables you might want to use in your module. To do so, navigate to launchpad/countdown.R.
The countdown script functions as a configuration file that, for example, sets up paths and flags for the MDI execution. You need to adjust the parameters in the script to fit your needs. For example, set the flag isMOCK to TRUE if you’re working with mock data, or set dirTMP to the directory where the harmonized mock data is stored (space_mounts/mockdata/TMP/). You can find all parameters and flags in the countdown section of this manual.

After adjusting the parameters, run the countdown script and select option 4 “Interactive MDI”. This will set up the environment for you to develop and test your module, but it will not run any MDI module. You can then run your module script (e.g. Launch_2.3_XY.R) to test your module using the mock data.

If you want to mimic a launch as it would happen at an NSI, set the flag MDIimportFlag to FALSE and MDImoduleRUN to TRUE. Then run the countdown and select option 2 “liftoff”. This will set up the environment and run all modules with the selected mock data.
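For example, a mock launch as described above would use settings along these lines in countdown.R (flag names as used in this manual; exact spelling and defaults are defined in the countdown section):

```r
isMOCK        <- TRUE                          # work with mock data
dirTMP        <- "space_mounts/mockdata/TMP/"  # harmonized mock data location
MDIimportFlag <- FALSE                         # skip import: data already harmonized
MDImoduleRUN  <- TRUE                          # run all selected modules ("liftoff")
```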

3.4.5 Mockdata

To ensure robustness, consistency, and functionality across the MDI infrastructure, the development and use of mock data is essential.

  • Specifications for mock data.

    • All files from NSI_datafiles.csv are covered, with all years as given by NSI_varnames.
    • All variables from all NSI files must be represented with the correct format and domain, including classifications, codebooks, and value labels.
  • Underlying ‘firm’ datasets.

    • SBS/BS Cobb-Douglas model: deterministic framework with stochastic draws; firm size is used to infer capital (k), materials (m), and output (y) based on productivity shocks and capital-labor moments.
    • SBS/BS forward-looking Hopenhayn model: includes stochastic productivity draws and shocks to productivity and demand, allowing for endogenous firm exit.
    • SBS/BS Aglio–Bartelsman-type firms: based on parameter draws for A/g, η, and ρ.
    • Firm dynamics with innovation: models firms’ extensive choices in innovative activities (e.g., ICTEC, R&D).
    • Firms with trade behavior: captures extensive and intensive trade choices across modules such as ITGS, ITS, OFATS, and IFATS.
  • BLOCK0: Prepare Auxiliary Files

    • Define the country, the sample periods, the datasets and read country-specific NSI metadata (datafile, varname, codebook).
    • Create a table specifying the hierarchical structure among variables.
    • Develop a table defining the concordance between fundamental model variables and NSI variables.
    • Compile a file detailing auxiliary regressions for predicting numerical, logical, and categorical variables.
  • BLOCK0: Obtain Data Moments from the Data or by Simulation

    • Calculate the sample mean and variance of employment for each NACE 2-digit sector.
    • Determine the average exit rate for each NACE 2-digit sector.
    • Extract regression coefficients for auxiliary regressions.
    • Gather information on sample sizes of surveys.
    • Compute key economic ratios and rates: capital-labor ratio, capital rental rate (interest rate), wage rate, and capital depreciation rate.
  • BLOCK1: Simulate an Unbalanced Panel Dataset

    • Generate an unbalanced panel dataset for firms over time, incorporating firm entry and exit dynamics based on Hopenhayn (1992).
    • Estimate model parameters: \(\alpha\) (output elasticity of labor), \(\sigma\) (standard deviation of TFP process), and z_exit (exit threshold for firms’ productivity) by targeting the sample mean and variance of the firm size (employment) distribution and the exit probability.
    • The simulated panel data includes firm ID, year, productivity, labor, capital, depreciation, and EBITDA.
  • BLOCK2: Predict BR and BS Variables

    • Use concordance tables between model variables and NSI variables, as well as auxiliary regressions and regression coefficients to predict BR and BS variables.
  • BLOCK2: Sample from the ‘Universe’ of Firms

    • For each survey table, sample from the firm universe and predict NSI variables using the auxiliary regressions and regression coefficients.
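The steps above can be illustrated with a stylized sketch of BLOCK1: an unbalanced firm panel with AR(1) log productivity and an exit threshold, in the spirit of Hopenhayn (1992). Parameter values here are illustrative, not the calibrated MDI values, and this is not the actual mockdata generator.

```r
# Stylized unbalanced panel: firms exit when productivity falls below z_exit.
set.seed(1)
n_firms <- 100; n_years <- 10
rho <- 0.9; sigma <- 0.2; z_exit <- -0.5  # persistence, shock sd, exit cutoff

panel <- do.call(rbind, lapply(seq_len(n_firms), function(i) {
  z <- 0
  rows <- list()
  for (t in seq_len(n_years)) {
    z <- rho * z + rnorm(1, sd = sigma)  # AR(1) log productivity process
    if (z < z_exit) break                # endogenous firm exit
    rows[[t]] <- data.frame(firmid = i, year = 2010 + t, z = z)
  }
  do.call(rbind, rows)
}))

head(panel)  # firm-year observations; exiting firms drop out of the panel
```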

4 Acknowledgement

We gratefully acknowledge the support of the European Union, whose funding made this project possible. We also thank all National Statistical Institutes (NSIs), National Statistical Systems, National Productivity Boards (NPBs), and other collaborators for their valuable contributions to the development of the MDI project and this manual.