MDI Manual

Comprehensive guidance for everyone who builds and uses the Micro Data Infrastructure

Author
Affiliation

Eric Bartelsman and MDI team

Vrije Universiteit Amsterdam; Tinbergen Institute; Halle Institute for Economic Research (IWH); Competitiveness Research Network (CompNet); Bocconi University; Centre for Business and Productivity Dynamics (CBPD)

Published

March 11, 2026

1 Introduction to MDI

This user guide provides users of the Micro Data Infrastructure (MDI) with the information needed to conduct research with the MDI, set up the MDI at a new institution, or develop the infrastructure further.

1.1 What

The Microdata Infrastructure (MDI) is a platform for cross-country microdata access, developed by CompNet in collaboration with European National Statistical Institutes (NSIs), National Statistical Systems (NSS) and other partners.

The MDI began in 2018, as described in “Creating an EU-wide Micro Data Infrastructure (MDI): a handbook for Micro-Data Linking”. Since then, the pilots have evolved into a maintained infrastructure that is periodically launched at NSIs. This manual provides operational guidance for current and future MDI launches, for building the infrastructure in new countries, and for defining a medium-term horizon of continuous improvement within each 3–6 month deployment cycle.

The MDI is designed with a dual objective: to harmonize firm-level data across countries and to streamline the research process for conducting cross-country analyses on a wide range of topics.

At its core, MDI provides a standardized environment that enables researchers to perform identical analyses across multiple countries. It ensures microdata comparability and accessibility within a unified framework. The infrastructure supports functions ranging from data importation and harmonization to advanced analytical outputs, all within a secure environment that safeguards data confidentiality. In a nutshell, MDI does the following:

Raw data \(\rightarrow\) Data harmonization \(\rightarrow\) Comparable cross-country microdata

Raw data: refers to the task of compiling all available datasets and variables from each NSI into a detailed metadata inventory.

Data harmonization: refers to the entire process of constructing variables that are comparable across countries. This involves establishing a standardized set of variable names and definitions (MD metadata). Based on this standard, the raw data from each NSI is used to generate corresponding variables and files aligned with the MD metadata. The harmonization process also includes creating concordance tables to standardize categorical codes.

Comparable cross-country microdata: refers to the tools and guidelines provided to researchers for effective data use. This includes best practices for writing research code (module) and the provision of mock data (designed to replicate the structure of the real data) for testing purposes.
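As a stylized illustration of the harmonization step, raw NSI variables are renamed and recoded against a shared dictionary. The column names and mapping below are invented for this sketch and are not the actual MD metadata:

```r
# Toy raw extract from an NSI (column names are hypothetical)
raw <- data.frame(
  vat_id        = c("A1", "A2"),
  turnover_keur = c(120, 45),
  yr            = c(2020, 2020)
)

# Stylized MD metadata: maps raw names to standardized MD_varnames
md_map <- c(vat_id = "firmid", turnover_keur = "turnover", yr = "year")

# Apply the mapping to obtain an MD-style panel
names(raw) <- unname(md_map[names(raw)])
raw
```

In the real infrastructure this mapping is driven by the NSI metadata tables rather than a hard-coded vector, but the principle is the same: one standardized name per concept, applied uniformly across countries.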

1.2 Who

The MDI is a joint initiative by CompNet, National Statistical Institutes (NSIs), and other partners. CompNet staff lead the technical maintenance and development of the infrastructure, and provide training and guidelines on how to use it. Together with NSIs and partners, they access firm-level data across countries and operate the MDI infrastructure to generate research outputs. Please see below the MDI stakeholders and process:

%%{init: {
  "theme": "base",
  "themeVariables": {
    "primaryColor": "#5f7991",
    "edgeLabelBackground":"#ffffff",
    "lineColor": "#e40000",
    "textColor": "#000000",
    "fontSize": "26px"
  },
  "flowchart": {"curve": "linear"}
}}%%

flowchart LR

%% Stakeholders

subgraph STAKE["Stakeholders"]
  direction TB
  R[Researcher]
  C[CompNet / MDI Network]
  S[Statistical<br/>Institute]
  
  R  -- "Research project and payload" --> C
  C  -- "Code to build the infrastructure" --> S
  S  -- "Metadata preparation" --> C
  C  -- "Metadata and tools" --> R
end
  
%% Remote environment 

subgraph REMOTE["Environment"]
  direction TB
  RA["Remote access (AT FR GB NL SI)"]
  MP["MDI partner (FI IT)"]
  RE["Remote execution (PT DE)"]
  
  RA <--> MP <--> RE
end
  
%% Outcomes

subgraph OUT["Outcomes"]
  direction TB 
  O1[Special research and publication]
  O2[Standard moments and indicators ‑ publication]
  
  O1 <-- "Output is obtained by CompNet / MDI Network" --> O2
end

%% Rocket 

Ro[🚀 **Rocket** 🚀]

%% Graph flow 

STAKE --> Ro
STAKE -- "Obtains the output" --> OUT

Ro --> REMOTE
REMOTE --> OUT

%% Style  

classDef remote stroke:#e40000,stroke-width:2px;
class RA,MP,RE remote;

classDef output fill:#5f7991,color:#ffffff;
class O1,O2 output;

classDef whitebg fill:#ffffff,stroke:#000000,color:#000000;
class STAKE,REMOTE,OUT whitebg;
  

Note: The diagram shows stakeholder roles, execution environments, and outputs. Rocket is the codebase deployed at NSIs. Two access models exist: direct remote access and indirect remote execution. Arrows indicate code, metadata, and output flows. All outputs are subject to NSI disclosure control before publication.

  • National Statistical Institutes (NSIs) and other Partners
    • NSI remote execution
    • NSI remote access
    • Partners with country-specific (remote) access

NSIs provide the underlying data and support either remote execution or access to confidential firm-level data. While legal access rules, data availability, and technical infrastructure vary across countries, NSIs form the backbone of the standardized MDI research environment.

  • Module writers (MDI users)
    • Productivity Boards
    • External Academic and Policy
    • MDI ‘Theme’ research staff

MDI users include productivity boards, external researchers, and thematic research staff. They are responsible for designing research modules that harness MDI’s infrastructure for cross-country analysis.

  • MDI staff
    • Country specialists
    • Thematic research personnel
    • Infrastructure support teams

MDI staff ensure the effective development and operation of the MDI environment. They support NSIs with data preparation and documentation, and assist module writers by providing expertise on data, tools, and research themes.

1.3 How

The MDI infrastructure is a continuously evolving codebase, known as Rocket, that is periodically deployed within the secure environments of NSIs. Its main function is to process and harmonize raw data, execute research code (modules), and export results, all while strictly complying with NSIs’ disclosure rules. This process is referred to as a launch; it occurs every 4 to 6 months, depending on country readiness. Please see below the MDI launch pipeline:

flowchart LR
    R["<b>Rocket</b><br>Contains research codes<br>(<i>modules</i>)<br>+<br>All needed R scripts to<br> harmonize the raw data"]
    D[("Harmonized data<br>&uarr;<br><b>Raw data</b><br>&darr;<br>Metadata<br><small>constantly updated</small>")]
    O["<b>Output</b><br><small>CSVs outside the NSI<br> protected environment</small>"]

    R --> D
    D -->|export| O
    D -.->|Metadata feed rocket| R
    
    classDef rocket fill:#ffcccc,stroke:#333,stroke-width:2px;
    classDef raw fill:#ccffcc,stroke:#333,stroke-width:2px;
    classDef output fill:#ccccff,stroke:#333,stroke-width:2px;

    class R rocket;
    class D raw;
    class O output;

Note: The diagram shows the MDI launch pipeline. The Rocket represents the deployed codebase containing harmonization scripts and research modules. It processes raw data and constantly updated metadata to produce harmonized data within the secure NSI environment. The harmonized datasets are then exported as output files outside the protected environment, only after passing disclosure checks.

  • Access models
    • Direct access: Researchers connect to the NSI secure environment with user credentials and run approved code on site.
    • Indirect access: NSI staff or MDI staff execute the approved code and return only disclosure-safe outputs.
  • Class: Describes classification variables in the datasets, such as industry or product codes.
  • Codebook: Maps categorical variable values to their corresponding descriptions.
  • Data centers: Technical environments managed by NSS components that host, process, and secure microdata.
  • Datafiles: Lists all available NSI firm-level data files, including their names and years covered.
  • Disclosure Criteria: Rules designed by the NSIs to protect the confidentiality of firm-level data, ensuring that no output allows the identification of individual firms or the disclosure of sensitive information, even in aggregated form.
  • Hierarchy: a table that maps a classification at different aggregation levels. E.g., NACE 4-digit code 6491 corresponds to NACE 3-digit 649, NACE 2-digit 64, and industry K.
  • MD metadata: standardized set of variable names (MD_varname, i.e., firmid, capital, etc.) and respective definitions that forms microdata (MD) panels, or the MD_dataset (i.e., BS, SBS, ENER, etc.) set by the MDI team.
  • MDI: Microdata Infrastructure.
  • MDI data catalogue: catalog containing all variables and their year range availability by country.
  • MDI launch: the recurring process of running the modules within the rocket every few months.
  • MDI tools: set of R functions created by the MDI team to generate the MD_datasets, manipulate them and execute modules.
  • Module: research code. Module names are defined with an acronym (“res_group”). For example, a module about firm dynamics is called FD (res_group=FD).
  • NSS: The coordinated institutional and technical framework encompassing the NSI and associated data centers.
  • NSIs: National Statistical Institutes. These are the public authorities responsible for official statistics in each country. They host the confidential microdata, set legal rules, run disclosure control, and provide the secure environments where MDI operates.
  • Nuvolos: cloud server platform where MDI users develop and test their code. This space is designed for training, practicing, and familiarizing oneself with the MDI infrastructure.
  • Rocket: codebase containing modules and scripts that are periodically deployed within the secure environments of NSIs to process and harmonize raw data, execute modules, and export results.
  • Varnames: Documents the variables and their descriptions for each raw data file listed in datafile.

2 Using MDI

This section focuses on using the MDI and is meant primarily for research groups and module writers. It outlines all steps involved in conducting research with the MDI - from formulating a research question to selecting variables and preparing data files. It also provides information about launches, including the research execution process and the overall timeline.

2.1 MDI Users

MDI users (or module writers) include productivity boards, external academic and policy researchers, and MDI ‘Theme’ research staff. They are responsible for developing research modules that leverage MDI’s infrastructure for data analysis.

2.2 Setup for Researcher

Module writers develop and test their research code using mock data on the Nuvolos platform (see Nuvolos section). This process relies on a standardized metadata structure initialized through an R setup program. If a researcher has direct access to the microdata, they may also develop and test their modules directly using real data. Once development is complete, MDI staff consolidate and stack country-level outputs to enable cross-country analysis without granting direct access to firm-level data.

2.3 Workflow for writing modules

Writing modules for the MDI launch is an iterative process that moves from conceptualization to execution. It is a staged process designed for reproducibility and cross-country comparability. Start from a clear research question, select MD variables that exist across countries, prototype on Nuvolos mock data, validate disclosure compliance in-code, and prepare exports with complete metadata.

Note: MDI Module Writing workflow

Deadlines and launch schedules

MDI modules are executed every four months through pre-scheduled launches. Accordingly, the MDI team communicates specific deadlines to all researchers for submitting their research modules and alerts the NSI staff accordingly.

The following table contains an estimate of the duration of a whole launch (between brackets, in the first column, a reference to the items in the diagram above):

Task Estimated duration
1) Research module preparation (1. - 4.) one month
2) Module testing and submission (5. & 6.) two/three weeks
3) Launch preparation (7.) a few days to a week
4) Launch execution (7.) two months
5) Extraction of the results and consolidation of the output (8. & 9.) a few days to a couple of months

Hence, a researcher can expect to receive all the consolidated cross-country results three to six months after module submission.

2.3.1 Define your research question

Every module begins with a clear and concise research question, designed to leverage MDI’s cross-country data and produce meaningful analytical insights.

Important

Before writing the analytical code, you must define a research acronym for your module (specified as res_group <- '(some 2-letter string)') and communicate it to the MDI team.

2.3.2 Data selection

Use the MDI data catalog to identify and select the most relevant datasets and variables for your analysis.

Important

Ensure that all MD variables used in the module, especially the employment variable, are available across all countries.

However, keep in mind that the harmonized version of the classification variable nace, called MDnace, is not present in the catalog. If you want to use harmonized industry codes in your code, make sure you use MDnace instead of nace.

Conversely, the harmonized versions of the product and trade codes (prodcom and cn08, respectively) keep the original classification variable names. If you want to use the original non-harmonized codes, use NSI_(classname) in your code.

Check the dedicated section below for more details.

Open the MDI Metadata Viewer

Additionally, module writers can look at information on the data source, firm sample, and other details (taken from the NSI_datafiles tables) of the raw datafiles underlying each MD panel by using the interactive Datafiles Info Viewer tool.

Once the final selection of MD variable names has been made for the module, a file named (res_group)_MDnames_select.csv (see example below) must be submitted to the MDI team. This needs to have the column names as shown below.

MD_dataset MD_varname
BR firmid
BR plantid
BR entid
BR entgrp
BR year
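A file with this layout can be produced directly from R. A minimal sketch, assuming a hypothetical module with res_group FD:

```r
# Build the variable-selection table for a hypothetical module (res_group = "FD")
selection <- data.frame(
  MD_dataset = rep("BR", 5),
  MD_varname = c("firmid", "plantid", "entid", "entgrp", "year")
)

# File name follows the (res_group)_MDnames_select.csv convention;
# tempdir() stands in for the module folder in this sketch
out_file <- file.path(tempdir(), "FD_MDnames_select.csv")
write.csv(selection, out_file, row.names = FALSE)
```

Generating the file programmatically avoids typos in MD_varname entries, which would otherwise only surface at launch time.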

2.3.3 Analysis

2.3.3.1 Libraries, packages and the MDI R tools

Make use of the MDI R packages (see ../rocket/Rtools/Rpackages/Rpackage_info_v2.3_.csv) and the MDI Rtools (see ../docs/MDI_Rpackage_1.0.0.pdf). The R package libraries currently installed at NSIs, and loaded at runtime by the launcher, are listed in ../rocket/Rtools/Rpackages/record_package_info.csv and in the table below.

If you need a package that is not part of the current list of R libraries, notify the MDI staff so it can be added to the NSI requirements. When preparing output, use standardized functions from the MDI Rtools (see directory ../rocket/Rtools/R) whenever possible.

Package Version Title NL_version
abind 1.4-5 Combine Multidimensional Arrays
broom 1.0.6 Convert Statistical Objects into Tidy Tibbles 1.0.7
cluster 2.1.6 "Finding Groups in Data": Cluster Analysis Extended Rousseeuw et al.
conflicted 1.2.0 An Alternative Conflict Resolution Strategy
data.table 1.17.2 Extension of `data.frame` 1.16.2
devtools 2.4.5 Tools to Make Developing R Packages Easier
DiagrammeR 1.0.11 Graph/Network Visualization
dplyr 1.1.4 A Grammar of Data Manipulation
factoextra 1.0.7 Extract and Visualize the Results of Multivariate Data Analyses
fixest 0.12.1 Fast Fixed-Effects Estimations
FNN 1.1.4.1 Fast Nearest Neighbor Search Algorithms and Applications
foreign 0.8-86 Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
frontier 1.1-8 Stochastic Frontier Analysis
fs 1.6.6 Cross-Platform File System Operations Based on 'libuv' 1.6.4
ggplot2 3.5.2 Create Elegant Data Visualisations Using the Grammar of Graphics 3.5.1
git2r 0.36.2 Provides Access to Git Repositories 0.33.0
gmm 1.8 Generalized Method of Moments and Generalized Empirical Likelihood
gridExtra 2.3 Miscellaneous Functions for "Grid" Graphics
gt 0.11.0 Easily Create Presentation-Ready Display Tables 0.11.1
haven 2.5.4 Import and Export 'SPSS', 'Stata' and 'SAS' Files
igraph 2.0.3 Network Analysis and Visualization
knitr 1.50 A General-Purpose Package for Dynamic Report Generation in R 1.49
lfe 3.0-0 Linear Group Fixed Effects
mFilter 0.1-5 Miscellaneous Time Series Filters
modelsummary 2.2.0 Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready
momentfit 0.5 Methods of Moments
openxlsx 4.2.5.2 Read, Write and Edit xlsx Files 4.2.7.1
pander 0.6.5 An R 'Pandoc' Writer
plm 2.6-4 Linear Models for Panel Data
poLCA 1.6.0.1 Polytomous Variable Latent Class Analysis
readr 2.1.5 Read Rectangular Text Data
readxl 1.4.3 Read Excel Files
Rtsne 0.17 T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation
shiny 1.10.0 Web Application Framework for R 1.9.1
stargazer 5.2.3 Well-Formatted Regression and Summary Statistics Tables
stringr 1.5.1 Simple, Consistent Wrappers for Common String Operations
tidyr 1.3.1 Tidy Messy Data
zip 2.3.1 Cross-Platform 'zip' Compression
zoo 1.8-12 S3 Infrastructure for Regular and Irregular Time Series (Z's Ordered Observations)

The MDI team also maintains an overview of:

  • Installed R versions at each NSI, along with details on package installation policies. This information is available in Rversions_countries.csv, located in /MDI/docs/.
Code
library(readr)
library(knitr)
library(kableExtra)

# Read the rversions data
rversions_data <- read_csv("Rversions_countries.csv", show_col_types = FALSE)

# Create a scrollable table
kable(rversions_data, format = "html", escape = TRUE) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(ncol(rversions_data), width = "25em") %>%  
  scroll_box(width = "100%", height = "auto")
Country R-version Date cran_install (Yes/No) install_method (CRAN, IT-managed, Other specify) version_control (Yes/No/Partial specify) r_update_freq library_restore (Yes/No/Manual) Additional info
SI 4.4.2 Jul.25 Yes Yes Yes a few times per year Yes NA
FR 4.3.1 Aug.24 No Local CRAN Mirror No. Only the latest version of the packages are available by default A few times per year Manual NA
AT 4.1.3 Aug.24 No IT-managed No They provide all Users with a shared library so they always install the newest version. Ad hoc NA
FI 4.3.1 Apr.25 No Local CRAN Mirror. Package updates may take a few days No. Only the latest version of the packages are available by default Infrequent (less than once per year). The packages will need to be reinstalled after each R update. No NA
PTx 4.5.1 Jan.26 Yes CRAN Yes no restrictions No NA
NL 4.2.3 Sept.24 No IT-managed Partial Every 1.5 months Yes During each scheduled maintenance weekend, all packages are updated to their latest versions. Older versions are archived, allowing rollback to a specific version if needed (CBS will assist with this on request). However, if a desired version was never the latest at the time of a maintenance update, it won’t be available on the server.
GB 4.4.0 Apr.25 No The package files need to be imported to the environment and manually installed Yes but it can be annoying to handle with dependencies a few times per year No NA
MT NA Jan.26 Yes CRAN Yes NA NA NA
EL 4.4.2 Jan.26 Yes IT-Managed Yes Ad hoc Manual RStudio Server 2024.09.1+394 running on Ubuntu.
Code
# Display table with column definitions
rversions_def <- read_csv("Rversions_columns_definitions.csv", show_col_types = FALSE)

# Display it as a regular table
kable(rversions_def)
Column name Definition
R-version The current R-version installed in the NSI remote env.
Date Date of the last information update on the R-version
cran_install (Yes/No) Whether packages can be installed directly from CRAN (Yes/No)
install_method (CRAN, IT-managed, Other specify) Method used if not CRAN (e.g., Manual, IT-managed, Custom script)
version_control (Yes/No/Partial specify) Can specific package versions be installed (Yes/No/Partial)
r_update_freq Frequency of R updates (e.g., Quarterly, Annually, Ad hoc)
library_restore (Yes/No/Manual) Whether initial libraries are restored after update (Yes/No/Manual)
Additional info Any additional notes or info
  • Package conflict resolution preferences are documented in conflicts_prefer.csv, located under /MDI/rocket/Rtools/Rpackages/.
Code
library(readr)
library(knitr)
library(kableExtra)

# Read the rversions data
setwd('..')
setwd('rocket/Rtools/Rpackages/')
rconflicts_data <- read_csv("conflicts_prefer.csv", show_col_types = FALSE)

# Create a scrollable table
kable(rconflicts_data, format = "html", escape = TRUE) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(ncol(rconflicts_data), width = "25em") %>%  
  scroll_box(width = "100%", height = "auto")
Function ConflictingPackages Count PreferredPackage
%>% dplyr, tidyr, stringr, DiagrammeR, gt 5 dplyr
Position ggplot2, base 2 ggplot2
all_of dplyr, tidyr 2 dplyr
any_of dplyr, tidyr 2 dplyr
as.Date zoo, base 2 base
as.Date.numeric zoo, base 2 base
as.data.frame git2r, base 2 git2r
as_label dplyr, ggplot2 2 dplyr
as_tibble dplyr, tidyr 2 dplyr
between data.table, dplyr, plm 3 data.table
body<- methods, base 2 base
bread fixest, momentfit, sandwich 3 fixest
coef momentfit, stats 2 momentfit
combine dplyr, gridExtra 2 dplyr
confint momentfit, stats 2 momentfit
contains dplyr, tidyr, gt 3 dplyr
diff git2r, base 2 git2r
ends_with dplyr, tidyr, gt 3 dplyr
enexpr dplyr, ggplot2 2 dplyr
enexprs dplyr, ggplot2 2 dplyr
enquo dplyr, ggplot2 2 dplyr
enquos dplyr, ggplot2 2 dplyr
ensym dplyr, ggplot2 2 dplyr
ensyms dplyr, ggplot2 2 dplyr
estfun fixest, sandwich 2 fixest
everything dplyr, tidyr, gt 3 dplyr
expr dplyr, ggplot2 2 dplyr
filter dplyr, stats 2 dplyr
first data.table, dplyr 2 data.table
fixef fixest, plm 2 fixest
head git2r, utils 2 git2r
index zoo, plm 2 plm
intersect dplyr, base 2 dplyr
kernapply momentfit, stats 2 momentfit
kronecker methods, base 2 base
lag dplyr, plm, stats 3 dplyr
last data.table, dplyr 2 data.table
last_col dplyr, tidyr 2 dplyr
lead dplyr, plm 2 dplyr
matches dplyr, tidyr, gt 3 dplyr
merge momentfit, git2r, base 3 momentfit
model.matrix momentfit, stats 2 momentfit
nobs plm, stats 2 plm
npk MASS, datasets 2 MASS
num_range dplyr, tidyr, gt 3 dplyr
one_of dplyr, tidyr, gt 3 dplyr
p shiny, pander 2 shiny
plot momentfit, graphics, base 3 momentfit
print momentfit, base 2 momentfit
pull dplyr, git2r 2 dplyr
quo dplyr, ggplot2 2 dplyr
quo_name dplyr, ggplot2 2 dplyr
quos dplyr, ggplot2 2 dplyr
reset lmtest, git2r 2 git2r
residuals momentfit, stats 2 momentfit
select dplyr, MASS 2 dplyr
setdiff dplyr, base 2 dplyr
setequal dplyr, base 2 dplyr
show momentfit, methods 2 momentfit
starts_with dplyr, tidyr, gt 3 dplyr
subset momentfit, base 2 momentfit
summary momentfit, base 2 momentfit
sym dplyr, ggplot2 2 dplyr
syms dplyr, ggplot2 2 dplyr
tag shiny, git2r 2 shiny
tags shiny, git2r 2 shiny
tibble dplyr, tidyr 2 dplyr
tribble dplyr, tidyr 2 dplyr
union dplyr, base 2 dplyr
update momentfit, stats 2 momentfit
vars dplyr, ggplot2, gt 3 dplyr
vcov momentfit, stats 2 momentfit
vcovHAC momentfit, sandwich 2 momentfit
vcovHC sandwich, plm 2 plm
yearmon data.table, zoo 2 data.table
yearqtr data.table, zoo 2 data.table

In the event of package conflicts, we follow the preferences outlined in this file. However, if a module requires a function from a non-preferred package, authors must explicitly use the package::function() syntax to avoid ambiguity. This syntax is generally encouraged to ensure clarity and compatibility across systems.
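The `package::function()` syntax can be illustrated with base R alone. Several attached packages may export a function named `filter` (e.g. `stats` and `dplyr`); qualifying the call removes any ambiguity regardless of load order:

```r
x <- c(1, 5, 2, 8, 3)

# Without qualification, filter() could resolve to stats::filter() (time series)
# or dplyr::filter() (row subsetting), depending on what is attached.
# A qualified call makes the intent explicit:
smoothed <- stats::filter(x, rep(1/2, 2))  # two-term moving average from base R

# If dplyr were attached, row filtering would be written analogously as
# dplyr::filter(some_data_frame, year >= 2015)
```

The qualified form works even when the package is installed but not attached, which also makes the module's dependencies visible at a glance.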

While we aim for a harmonized environment, some variation between countries may persist due to local constraints. Any such discrepancies are documented and communicated to module writers before deployment.

There are five general categories of R tools:

  • Metadata: for generating, verifying, and manipulating metadata
  • Infra: used by MDI staff for data importation, harmonization, and disclosure checks
  • MDI: mostly for module writers, e.g. merge_datatables, regressions, aggregations, export
  • Analysis: support analytical tasks and reporting
  • Programmer: assist with R coding tasks

All tools are documented using Roxygen2 and exported as an R package. You can access the documentation via the standard ?function_name syntax, or by clicking on the mdi package in the RStudio Packages tab to view the full list of available functions.

2.3.4 Importing data

The NSI metadata enables the creation of standardized microdata panels (MD), which are harmonized and managed by the Launcher based on the logic defined in countdown.R.

MD datasets can be imported from the dirTMPSAVE folder, which is predefined in the environment. For example, the MD dataset BR can be imported using the following code snippet:

Code
BR <- readRDS(paste0(dirTMPSAVE, 'BR.rds'))
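Outside the NSI environment the same pattern can be rehearsed with a mock file. In this sketch dirTMPSAVE is simulated with a temporary directory and the BR contents are invented:

```r
# Simulate the predefined dirTMPSAVE folder with a temp directory
dirTMPSAVE <- paste0(tempdir(), "/")

# Create a mock BR panel and save it where the launcher would
mock_BR <- data.frame(firmid = c("f1", "f2"), year = c(2020, 2020), emp = c(10, 250))
saveRDS(mock_BR, paste0(dirTMPSAVE, "BR.rds"))

# Import exactly as a module would inside the NSI environment
BR <- readRDS(paste0(dirTMPSAVE, "BR.rds"))
```

Developing against a mock file of this shape means the identical readRDS line runs unchanged at launch time, when dirTMPSAVE points to the real harmonized data.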

2.3.5 Manipulating data

You can freely manipulate linked panels using R and its libraries, such as data.table, dplyr, and the broader tidyverse.

Managing firm units

When writing your analytical module, always check the unit of observation in each MD dataset for all countries where your code will run. This can be verified in the MDI Metadata Viewer.

For example, if using the ENER dataset, note that the unit of observation may vary between countries—such as between France and Portugal. Your R code must account for these differences to ensure analytical consistency. For details on the units used in each MD dataset, please refer to the metadata file MD_idInfo.csv.

To assist with this, the MDI toolkit includes a utility for aggregating or disaggregating between different key IDs. This tool is located in rocket/Rtools/R/mdi_key_id_switch.R and uses the metadata file *NSI*_firmid_entgrp_conc.csv.
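The mechanics of switching between key IDs can be sketched in base R. Given a plant-to-firm concordance (the table and variable names below are invented for illustration; the real tool reads *NSI*_firmid_entgrp_conc.csv), plant-level values are aggregated up to the firm level:

```r
# Hypothetical concordance between plant IDs and firm IDs
conc <- data.frame(plantid = c("p1", "p2", "p3"),
                   firmid  = c("f1", "f1", "f2"))

# Plant-level observations (energy_use is an invented variable)
plants <- data.frame(plantid = c("p1", "p2", "p3"),
                     energy_use = c(10, 20, 5))

# Attach the firm key, then aggregate plant values to the firm level
plants <- merge(plants, conc, by = "plantid")
firms  <- aggregate(energy_use ~ firmid, data = plants, FUN = sum)
```

Disaggregation (firm to plant) works the same way in reverse, but requires a rule for splitting firm totals across plants, which is why checking the unit of observation per country matters.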

2.3.5.1 Working with classifications

Classification variables, such as industry or product codes, are key in microdata work. We use tools that allow classification lists to be coherent over time.

In those harmonized MD datasets where a classification variable is present, be aware that these include both the raw classification variable and the time-concorded one. Make sure you keep this in mind when designing your code!

In particular, when you prepare your (res_group)_MDnames_select.csv, please use the original MDnames for all classification variables, but feel free to use the concorded MDnames for the concorded classification variables.

For more details, check the dedicated section.

2.3.5.2 Merging data

Additionally, when merging data from two files, use the mdi_mergedatatables() function. This helps prevent memory issues and ensures that merges are performed correctly.

Tip

When working with the MD_dataset CIS, given that the data come from surveys run every second year, it is recommended to always merge CIS with BR, as follows:

Code
DT <- merge(BR, CIS, by = c('firmid','year'), all.x = TRUE)

Then the user can decide on how to interpolate the values in the missing years for the same firmid.
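One simple interpolation choice is to carry the last reported CIS value forward into the intervening year. A base-R sketch of last-observation-carried-forward within each firm (innov_exp is an invented variable name):

```r
# Mock merged panel: CIS observed every other year, NA in between
DT <- data.frame(firmid = "f1", year = 2018:2021,
                 innov_exp = c(100, NA, 140, NA))

# Carry the last non-missing value forward within the firm (LOCF)
locf <- function(x) {
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}
DT$innov_exp <- ave(DT$innov_exp, DT$firmid, FUN = locf)
```

Whether carry-forward, linear interpolation, or dropping the gap years is appropriate depends on the research question; the point is only that the choice should be made explicitly in the module.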

Last but not least, be a smart coder: clean up unnecessary datasets, and avoid writing code that calls the operating system, creates (sub)directories, or installs R packages.

2.3.6 Analytical tools

The MDI toolkit includes a set of functions designed to help researchers efficiently carry out common tasks in firm-level microdata research. In this section, we present some of the key tools available. For more details, please refer to the MDI R tools section above and consult the mdi library PDF.

  • mdi_aggregate()

    This tool aggregates variables in a data.table by group, allowing customizable statistics (e.g., sum, mean, HHI – check the PDF manual of the mdi library for the full list of methods in the related section), optional merging with the original dataset, and built-in disclosure checks.

Working with quantiles

Note that any output containing individual data points (such as plots or tables with exact quantiles) cannot be exported due to disclosure restrictions. Hence, quantiles cannot be exported as such.

However, keep in mind that mdi_aggregate() allows you to compute the mean of the minimum number of observations allowed for disclosure (the function uses MDIminNumObs; check the related section below) around the observation that is closest to the first quartile (q25), the median, or the third quartile (q75).

The diagram below illustrates how this value is calculated for a series of values (3 to 11), in case

  • MDIminNumObs is an odd number

timeline
    title Odd: `MDIminNumObs` = 5 → pick 2 below, 1 at quantile, 2 above
    3  : |
    5  : 🔵 (2nd below)
    6  : 🔵 (1st below)
    7  : 🔴 (closest to q)
    8  : 🔵 (1st above)
    9  : 🔵 (2nd above)
    11 : |

  • MDIminNumObs is an even number

timeline
    title Even: `MDIminNumObs` = 6 → pick 3 below, 1 at quantile, 2 above (bias below)
    3  : 🔵 (3rd below)
    5  : 🔵 (2nd below)
    6  : 🔵 (1st below)
    7  : 🔴 (closest to q)
    8  : 🔵 (1st above)
    9  : 🔵 (2nd above)
    11 : |

Note that if the number of observations in the aggregate is small, the resulting mean might be very different from the quantile value.
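The selection logic illustrated in the diagrams can be sketched in base R. This is an illustrative reimplementation under the stated picking rule, not the mdi_aggregate() source:

```r
# Disclosure-safe "quantile": mean of min_obs observations around the quantile.
# Assumes length(x) is comfortably larger than min_obs (no bounds handling here).
safe_quantile <- function(x, p, min_obs = 5) {
  x <- sort(x)
  q <- quantile(x, p, names = FALSE)
  center <- which.min(abs(x - q))        # observation closest to the quantile
  below  <- ceiling((min_obs - 1) / 2)   # extra pick goes below when min_obs is even
  idx <- (center - below):(center + (min_obs - 1 - below))
  mean(x[idx])
}

# Series from the diagrams: median is 7, window is 5,6,7,8,9
safe_quantile(c(3, 5, 6, 7, 8, 9, 11), 0.5, min_obs = 5)  # mean(5,6,7,8,9) = 7
```

With min_obs = 6 the window extends one further observation below (3, 5, 6, 7, 8, 9), matching the "bias below" rule in the even case.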

  • estimate_markup()

    This function computes firm-level markups following De Loecker and Warzynski (2012): the output elasticity of a variable input is multiplied by revenue and divided by that input’s cost, and the resulting markup is returned as a new variable.

  • estimate_prod()

    This tool estimates firm-level production function parameters (such as input elasticities and/or TFP) using OLS, ACF, or OP methods under Cobb-Douglas or translog specifications. It offers flexible options for fixed effects, instruments, and grouped estimation.

  • mdi_regress()

    This function runs one or more regressions using feols or feglm from the fixest package, performs automatic disclosure checks to ensure the minimum observation threshold is met, and optionally exports LaTeX regression tables with accompanying metadata logs.

  • pim_capital()

    This tool estimates firm-level capital stock using the Perpetual Inventory Method (PIM), based either on a user-specified depreciation rate or an inferred asset type. It returns the original data.table with an added capital stock variable.
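The recursion behind the Perpetual Inventory Method is K_t = (1 − δ)·K_{t−1} + I_t. A minimal base-R sketch under an assumed depreciation rate (this illustrates the accumulation identity only, not the pim_capital() interface):

```r
# PIM recursion: K_t = (1 - delta) * K_{t-1} + I_t
pim <- function(investment, k0, delta = 0.1) {
  k <- numeric(length(investment))
  prev <- k0
  for (t in seq_along(investment)) {
    prev <- (1 - delta) * prev + investment[t]
    k[t] <- prev
  }
  k
}

# With investment exactly offsetting depreciation, the stock is stationary
pim(c(10, 10, 10), k0 = 100, delta = 0.1)  # 100 100 100
```

In practice the initial stock k0 and the depreciation rate are the sensitive inputs; pim_capital() lets the rate be user-specified or inferred from the asset type.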

Industry Aggregations

Researchers may wish to conduct their analysis at various levels of sectoral aggregation. The MDI infrastructure supports this by providing classification concordances such as MD_nace_hier.csv and MD_naceR2_CNind_classconc.csv, which allow NACE Rev.2 industry codes to be mapped to broader industry groupings—such as 3-digit, 2-digit, 1-digit levels, and the CompNet macroindustry classification.
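Because NACE Rev.2 codes are hierarchical, the digit-based levels can be obtained by truncation; only the CompNet macroindustry mapping needs the concordance CSV. A sketch of the truncation step (codes chosen to echo the hierarchy example in the glossary):

```r
# 4-digit NACE codes, stored as character to preserve leading zeros
nace4 <- c("6491", "6419", "2410")

# Coarser levels by truncating digits
nace3 <- substr(nace4, 1, 3)  # 3-digit groups
nace2 <- substr(nace4, 1, 2)  # 2-digit divisions
```

Keeping codes as character strings matters: converting "0620" to numeric would silently drop the leading zero and break any merge against the concordance tables.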

2.3.7 Exporting results

Once a results table is generated, the researcher must extract the file at the end of the module. After the launch is fully executed in a given country, the country leader submits an export request to the NSI, which then verifies compliance with disclosure rules for each output file (see disclosure criteria for details).

The mdi_export() function facilitates this process by exporting a data.table to a specified file format and logging comprehensive metadata—including variables used, purpose, and dataset context—into a central description file (OutputDescription.csv), which is also extracted. The function includes optional disclosure checks for summary statistics.

Below is a description of all parameters required for mdi_export():

It is fundamental, for disclosure reasons, that the module writer fill in exhaustive information for each output file when using the function mdi_export(). In particular, please provide:

  1. format
    Character string specifying the format of the export (‘csv’, ‘RDS’, ‘txt’, ‘dta’, ‘xlsx’, ‘sas’).

  2. output_name
    The name of the file to be created, without the file extension and the country code.

  3. datasets_used
    The name of the MD_dataset(s) used for the analysis.

  4. purpose
    Describe the research purpose of the analysis.

  5. share_0_1
    Explain whether the output contains any shares equal to 0 or 1 (i.e. 0% or 100% of the group share the same characteristic). Such cases are not allowed according to the output guidelines and must therefore be suppressed or explicitly justified.

  6. zeroes
    If the output contains zero values, provide an explanation of why these zeroes are not revealing additional information. Otherwise, this information must be suppressed.

  7. rel_other_output
    Describe how this output file relates to other previously exported or requested files, for instance whether it performs the analysis in a different way and, if so, how.

  8. selection
    Describe if the results contained in the file were derived from a specific selection of the sample available (if so, explain which selection) or if the full sample is used.

  9. export_type
    Character string indicating the type of output (‘sum_stat’ for summary statistics, ‘reg_tab’ for regression tables, or ‘other’).

  10. description
    A string providing additional explanation of the output file.
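Putting the parameters together, a call might look as follows. This is a hypothetical example: the output name, dataset names, and the name of the data argument (`dt`) are illustrative, and the exact signature should be checked against the tool's documentation.

```r
# Hypothetical mdi_export() call; argument values are illustrative.
mdi_export(dt               = results_table,   # 'dt' is an assumed argument name
           format           = "csv",
           output_name      = "EN_energy_intensity_by_nace2",
           datasets_used    = "MD_sbs, MD_energy",
           purpose          = "Energy intensity by 2-digit industry for the EN module",
           share_0_1        = "No shares equal to 0 or 1 in the output",
           zeroes           = "Zeroes appear only where an industry reports no energy use",
           rel_other_output = "Complements the macro-sector version with finer industry detail",
           selection        = "Full sample of firms with non-missing energy expenditure",
           export_type      = "sum_stat",
           description      = "Mean and median energy cost shares by nace2 and year")
```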

2.3.8 Consolidation of MDI Module Output

Once the files pass the disclosure checks:

  • Country leaders/NSIs will upload each country’s output to their designated Teams folder.

  • MDI staff will consolidate (stack) the outputs of each module by country and place both the module-specific and general launcher outputs in the appropriate Teams folder for module writers.

2.3.8.1 Procedure to stack MDI Module Output

For cross-country analysis, the individual country exports need to be identified and consolidated into combined, stacked datasets per module.

This is accomplished in three steps, using a sequence of scripts that are stored in dirROCKET/MDIprogs.

  1. Step 1: get_output_file_list.R Generate Country-Specific File Lists

The first script creates a file inventory for each country.

Inputs:
- Country code (CC, e.g., FR, FI, NL)
- Launch version number (e.g., 2.3)
- Local path to the country’s upload directory

Process:
1. Iterates through all module output folders for the selected country.
2. Extracts the names of all .csv files, excluding descriptive files (e.g., OutputDescription.txt).
3. Adds metadata:
- Launch version number
- Country code
- A numeric flag indicating the format of the file name (1, 2, or 3).
4. Saves the resulting inventory as launch_<n>_file_list_<CC>.csv in the directory specified at the start of the script.

Output:
A CSV file listing all valid exported files for a single country, annotated with launch and country metadata.


  2. Step 2: generate_stacked_files.R Combine File Lists Across Countries

The second script consolidates the individual country inventories into one master file list.

Inputs:
- File lists generated by Script 1 (launch_<n>_file_list_<CC>.csv for each country).

Process:
1. Reads each country’s file list.
2. Appends a Country column to identify the file’s origin.
3. Stacks the inventories into one combined dataset.

Output:
A single file, launch_<n>_file_list_combined.csv, containing metadata on all exported files across participating countries.


  3. Step 3: consolidate_output.R Consolidate Module-Level Outputs

The third script merges the exported data across countries for a chosen module.

Inputs:
- The combined file list from Step 2.
- Module name (e.g., EN for Energy).
- Country-specific Export directories (Most likely a Teams path).

Process:
1. Filters the combined file list for the specified module.
2. Iterates through each country’s export path and retrieves the corresponding .csv files.
3. Reads and cleans each filename and appends a Country identifier column.
4. Binds all country datasets into one consolidated file.

Output:
A module-specific cross-country combined file (e.g., EN_combined.csv), stored in the Research Agenda folder specified at the start of the script.


To summarise the module export consolidation:

  1. get_output_file_list.R → Generate a country-level export file list.
  2. generate_stacked_files.R → Combine these lists into a cross-country file inventory.
  3. consolidate_output.R → Use the inventory to locate, clean, and stack module-level data exports across countries.

Running these three scripts ensures that all outputs are systematically catalogued, reproducible, and readily available for post-launch comparative analysis.
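The core of the consolidation step can be sketched as follows (file names and columns are illustrative; the actual scripts also clean file names and filter by module):

```r
# Minimal sketch of Step 3: read the same module output from each country's
# export folder and stack it with a Country identifier column.
stack_country_files <- function(paths, countries) {
  parts <- Map(function(p, cc) {
    dt <- read.csv(p, stringsAsFactors = FALSE)
    dt$Country <- cc   # record the file's origin
    dt
  }, paths, countries)
  do.call(rbind, parts)  # bind all country datasets into one consolidated table
}

# Demo with two temporary files standing in for FR and NL exports:
f_fr <- tempfile(fileext = ".csv")
f_nl <- tempfile(fileext = ".csv")
write.csv(data.frame(nace2 = "10", value = 1.2), f_fr, row.names = FALSE)
write.csv(data.frame(nace2 = "10", value = 0.9), f_nl, row.names = FALSE)

combined <- stack_country_files(c(f_fr, f_nl), c("FR", "NL"))
combined$Country
# "FR" "NL"
```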


2.4 Running Order & How-To (Quick Reference)

2.4.1 Prerequisites

  • R packages: data.table, dplyr, readr

  • Directory layout must follow:

    • .../MDI Data Providers Forum - CC - CC/Upload/Launch_<n>/<CC>_output_Launch_<n>_<MODULE>/...
    • Central outputs: .../CompNet MDI Research Agenda - General/Launch_<n>
  • Researchers can run ../launchpad/interactive_MDI.R to initialize their MDI environment in a standardized way.

  • Researchers then analyze the output, optionally using standardized tools for statistical analysis, graphing, and reporting.

2.5 Dealing with classifications

A key feature of firm-level research is the use of classifications, such as industry codes (NACE codes), product codes (PRODCOM codes) and trade codes (combined nomenclature codes). Because the official set of codes in a classification can change over the years, we developed tools that produce a consistent list of codes over time in each country. Specifically, we make use of two tools:

  • make_conc()

    This tool is currently used to harmonize PRODCOM and ITGS codes over time.

    First, it takes the time concordance table for each pair of consecutive years and traces the development of each code over time, so that the yearly concordances capture all possible changes of the codes from the first year to the last year of the relevant period. Second, it links all code paths that share common codes, harmonizing each such group to a common code from the last year of the relevant period.

    Note: the left column of a time concordance table (the one received from the NSI) might not contain all codes observed in the dataset at time year-1. It is therefore advised to use the tool mdi_timeconc_update() from the mdi package, which ensures that any missing mappings are added to the time concordance table for that dataset.

    As inputs, it requires the yearly concordance tables of the classification (in data.table format), the numeric vector of the years of interest, and the character name of the classification. It returns the data.table that concords each code to the harmonized code, for each year.

    When applied to an MD dataset, it returns the original data with the old NSI class code (under column NSI_(classname)) and the harmonized code (under the column named after the classification).

  • concord_nace()

    This tool harmonizes NSI NACE classification over time.

    First, it detects the year with the most NACE code changes, i.e. the year of a possible break in the classification. Then, for each firm, it uses the modal NACE code in the post-break years as the harmonized NACE code and harmonizes the codes of earlier years accordingly. For firms present only before the break year, codes are harmonized based on the code changes of surviving firms, which are used to build a concordance table between codes in the pre- and post-break years.

    As inputs, it requires the character dataset name; a logical indicating whether to weight code matches of surviving firms by employment (instead of by number of firms); the number indicating the cumulative residual share of firms deleted for the construction of the pre- and post-break year concordance; and the number indicating the share of firms deleted for the construction of such concordance.

    It returns the original MD dataset with the old NSI NACE code (under column nace) and the harmonized NACE (under column MDnace).

    The MD NACE can then be added through the concordance table between NSI NACE codes and MD NACE codes.
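The path-tracing idea behind make_conc() can be illustrated with two toy yearly concordance tables (column names are assumptions, not the real interface): chaining them maps every first-year code to its final-year counterpart.

```r
# Two toy year-to-year concordances: in 2011 codes A and B merge into B1,
# and in 2012 B1 is renamed C. Column names are illustrative.
conc_10_11 <- data.frame(code_2010 = c("A", "B"), code_2011 = c("B1", "B1"))
conc_11_12 <- data.frame(code_2011 = "B1",        code_2012 = "C")

# Chain the yearly steps: 2010 codes -> 2011 codes -> 2012 codes
chained <- merge(conc_10_11, conc_11_12, by = "code_2011")
chained[, c("code_2010", "code_2012")]
# Both 2010 codes "A" and "B" harmonize to the final-year code "C"
```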

2.6 Disclosure Procedures

2.6.1 What are Disclosure Criteria

Disclosure criteria at National Statistical Institutes (NSIs) are rules designed to protect the confidentiality of firm-level data. They ensure that no output allows the identification of individual firms or the disclosure of sensitive information, even in aggregated form. These criteria are crucial for complying with national privacy and data protection regulations.

2.6.2 How Disclosure Criteria Are Applied:

MDI tools automate disclosure control by applying primary and secondary confidentiality rules (such as minimum observation thresholds and dominance criteria) before any output is released. These rules ensure that sensitive data is suppressed or flagged, in line with the parameters defined in file payload/Launch_v2.3/MDmetadata/MD_disclosure_info.csv.

Learn more about how this is done:
  • MDI tools such as mdi_aggregate(), disclose(), DisclosCrit(), mdi_regress(), and mdi_export() help automate disclosure control by enforcing rules based on parameters set in the Countdown, ensuring compliance before output is released.

  • Primary disclosure (Step 1) requires suppression of all cells that fail the dominance criterion or contain fewer than the minimum number of observations (minNrObs).

  • Secondary disclosure (Step 2) involves suppressing additional cells to protect those flagged in Step 1, following the minimum frequency rule. This typically means suppressing the smallest unsuppressed cell if only one cell was suppressed in Step 1 (applicable to totals/sums where the parent node is available).

For example, a cell not meeting NumObs or exceeding domPerc is suppressed. Outputs violating these criteria are flagged or excluded from export.

2.6.3 Components of Disclosure Criteria in the MDI:

Four main variables are created by MDI tools to assess disclosure criteria, together with two display settings:

Dominance Share (MDIdomSh)

The maximum share of the total (e.g., employment, sales) contributed by the largest ‘X’ firms (number ‘X’ defined by domNr) in a cell. Example: if domPerc is 0.75, the top ‘X’ firms cannot contribute more than 75% to the cell’s total.

Minimum Number of Observations (MDIminNumObs)

The minimum number of firms required in a cell for it to be included in the output. Example: If NumObs is 3, at least 3 firms must contribute to a cell.

Top Firms Count (MDIdomNr)

Specifies how many top firms’ shares are considered when applying the dominance criterion (domPerc). Example: If domNr is 1, the dominance is based on the largest firm; if 2, the top two firms are considered.

Dominance Variable (MDIdomVar)

The variable on which the dominance criterion is applied, such as employment (emp) or sales (nq). Different NSIs may apply criteria to different variables, depending on their legal requirements. Note: The domVar can be ‘var’ in the countdown. If so, the domPerc is computed for all variables for which an aggregate is computed.

Show dominance percentiles (show_domVar)

This is a dummy variable indicating whether the dominance percentile columns need to be included (1) or not (0) in the output file.

Hide or not hide values post-disclosure (show_values)

This is a dummy variable indicating whether the aggregates in the output file need to be hidden (0) or not (1) in case they don’t comply with the disclosure rules of the NSI.
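The standard top-n dominance share behind MDIdomSh can be computed as in the sketch below; in practice disclosCrit() computes it per aggregation cell.

```r
# Share of the cell total contributed by the largest 'domNr' firms.
top_n_share <- function(values, domNr) {
  sum(sort(values, decreasing = TRUE)[seq_len(domNr)]) / sum(values)
}

# Four firms; the two largest hold 80 + 15 = 95 out of 100:
top_n_share(c(80, 15, 3, 2), domNr = 2)
# 0.95 -> would breach a dominance threshold of, say, 0.75
```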

2.6.4 Disclosure Criteria in MDI Countries

Below are the disclosure criteria in MDI countries:

| disclosure_variable | AT | EL | FI | FR | DE | NL | PTx | PT | SI | GB | MT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MDIminNumObs | 10 | 5 | 3 | 4 | NA | 10 | 1 | 1 | 5 | 10 | 3 |
| MDIdomVar | var | var | persons_br | var | NA | var | var | var | var | var | var |
| MDIdomSh | 0.8 | 0.85 | 0.75 | 0.85 | NA | 0.5 | 1 | 1 | 0.5 | 0.4375 | 0.9 |
| MDIdomNr | 2 | 2 | 1 | 1 | NA | 1 | 1 | 1 | 1 | 1 | 2 |
| show_domPerc | 1 | 1 | 1 | 1 | NA | 1 | 0 | 0 | 1 | 1 | 1 |
| show_values | 0 | 0 | 0 | 1 | NA | 0 | 1 | 0 | 1 | 0 | 0 |

Some countries apply additional disclosure criteria. For instance, the Netherlands (NL) and Slovenia (SI) require that all exported variables (not just employment (emp) and sales (nq)) comply with the dominance share criterion. In such cases, the parameter domVar is set to ‘var’ in the Countdown file. These disclosure parameters are configured during the execution of the infrastructure at an NSI, either by MDI or NSI staff.

2.6.5 Disclosure Routines in MDI

The MDI tools listed below operate using the disclosure parameters defined by the user in the Countdown file.

| Tool | Purpose | Use by Researchers | Use by Other MDI Tools |
|---|---|---|---|
| mdi_aggregate.R | Aggregates data with optional disclosure checks: dominance threshold (`domPerc`) and minimum observations (`NumObs`). | Yes | Yes (`disclosCrit`, `disclose`) |
| mdi_regress.R | Performs regression analysis and automatically checks whether the number of observations meets the required minimum disclosure threshold; skips regressions that fail the check. | Yes | No |
| mdi_export.R | Exports datasets with optional disclosure compliance and metadata logging. | Yes | Yes (`disclose`) |
| disclose.R | Performs primary and secondary disclosure checks. | No | No |
| disclosCrit.R | Adds disclosure metrics (`domPerc`, `NumObs`) to datasets. | No | No |

Module writers are strongly encouraged to use MDI tools to ensure compliance with the disclosure criteria of all countries where the module is intended to run.

Primary and Secondary Disclosure with disclose

The disclose tool applies two levels of disclosure control to aggregated statistics to ensure compliance with confidentiality requirements.
Suppressed values are replaced with the sentinel value -999, and disclosure flags (discflag1, discflag1_*, discflag2) record which suppression criteria were triggered.


2.6.5.1 Primary Disclosure

Primary disclosure applies two main suppression rules to protect confidentiality:

  • Minimum Observations Rule
    Any aggregate based on fewer than the required minimum number of observations (MDIminNumObs) is suppressed.
    All affected variables in that row, including the NumObs column, are replaced with -999.

  • Dominance Rule
    For sum-type variables, the function evaluates the dominance share of the largest contributors (domPerc_*), calculated by disclosCrit().

    • For most countries: a cell is suppressed if the dominance share exceeds the threshold (domPerc >= MDIdomSh).
    • Germany (DE): the rule is inverted — a cell is suppressed if the dominance share falls below the threshold (domPerc < MDIdomSh).
      This reflects German statistical disclosure practice, where low dominance values indicate high concentration risk.

All cells suppressed in this step are flagged with discflag1 (and variable-specific flags discflag1_<var>).
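A minimal sketch of the two primary rules for the standard (non-German) case, using the -999 sentinel described above. Column names are illustrative; the real disclose() tool additionally produces per-variable flags and handles country specifics.

```r
# Three aggregation cells with observation counts and dominance shares.
cells <- data.frame(nace2   = c("10", "22", "25"),
                    NumObs  = c(12, 2, 8),
                    domPerc = c(0.40, 0.30, 0.90),
                    emp_sum = c(540, 35, 410))
MDIminNumObs <- 3
MDIdomSh     <- 0.75

# Primary suppression: too few observations OR dominance breach.
cells$discflag1 <- cells$NumObs < MDIminNumObs | cells$domPerc >= MDIdomSh
cells$emp_sum[cells$discflag1] <- -999   # sentinel for suppressed values
cells$NumObs[cells$discflag1]  <- -999
cells
# "22" fails the minimum-observations rule, "25" fails the dominance rule
```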


2.6.5.2 Secondary Disclosure (Hierarchical Totals Only)

When the aggregated data includes hierarchical levels — for example, industry or regional totals — an additional secondary disclosure step prevents back-calculation of suppressed values.

  • The hierarchy file (hhfile) must be a wide table, with one column per hierarchical level (e.g., h_0, h_1, h_2, …).
  • The node variable in the dataset identifies the child level.
  • The parent level is determined by the next column to the right of the child in the sorted hierarchy (e.g., if node = h_1, the parent is h_2).

Suppression rule:
If within a parent group exactly one child cell was suppressed in the primary step, the tool suppresses one additional child — the non-suppressed cell with the smallest number of observations (NumObs).
This prevents the originally suppressed value from being reconstructed by subtraction from the total.
All such cases are flagged with discflag2.
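The "suppress one more child" rule can be sketched for a single parent group as follows (column names are illustrative; the real tool works across all parent groups in the hierarchy):

```r
# One parent with three children; the first child was suppressed in the
# primary step (discflag1 = TRUE). Column names are illustrative.
children <- data.frame(parent    = "C10",
                       child     = c("C101", "C102", "C103"),
                       NumObs    = c(2, 5, 9),
                       discflag1 = c(TRUE, FALSE, FALSE))

children$discflag2 <- FALSE
if (sum(children$discflag1) == 1) {
  open   <- which(!children$discflag1)
  victim <- open[which.min(children$NumObs[open])]  # smallest unsuppressed cell
  children$discflag2[victim] <- TRUE
}
children[, c("child", "discflag1", "discflag2")]
# C102 (NumObs = 5) receives the secondary suppression flag, so C101 cannot
# be recovered by subtracting the remaining children from the parent total.
```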


2.6.5.3 Germany-Specific Note on Dominance Percentiles

The dominance share (domPerc_*) used in the primary disclosure rule is computed differently for Germany in disclosCrit().
Instead of using the standard top-n share (sum of the top domNr values divided by the total), Germany applies the following ratio:

\(\text{domPerc} = \frac{\text{Total} - x_1 - x_2}{x_1}\)

where \(x_1\) and \(x_2\) are the two largest firm values in each aggregation group.
This yields a dominance percentile that decreases as concentration increases — hence, in Germany, smaller values of domPerc indicate greater dominance and trigger suppression (domPerc < MDIdomSh).
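A worked example of the German ratio under the definition above:

```r
# Worked example of the Germany-specific dominance ratio:
#   domPerc = (Total - x1 - x2) / x1
values <- c(80, 15, 3, 2)                   # firm values in one aggregation group
x <- sort(values, decreasing = TRUE)[1:2]   # x1 = 80, x2 = 15
domPerc_DE <- (sum(values) - x[1] - x[2]) / x[1]
domPerc_DE
# (100 - 80 - 15) / 80 = 0.0625 -> a small value signals high concentration
```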


2.6.5.4 Output

The tool returns the dataset with all required suppressions applied and two disclosure flags:
- discflag1: primary disclosure
- discflag2: secondary disclosure

2.7 Auxiliary Files

Deflators are constructed using data extracted from Eurostat via the eurostat R package. These include deflators for

  • Value-added (pnv, at NACE level 2)

  • Capital depreciation (pnc, at NACE level 2)

  • Gross fixed-assets (pgrK, at NACE level 1)

  • Investment (pni, at NACE level 2)

  • GDP (pngdp, at NACE level 2)

  • Harmonized consumer price index (HCPI) (pnhcpi, at NACE level 2),

all normalized to a 2010 base year (set to 1). The underlying Eurostat datasets—nama_10_a64, nama_10_a64_p5, nama_10_gdp, prc_hicp_aind, and nama_10_nfa—cover national accounts and price indices.

The processed deflator file is structured by country code (cc), industry code (DEFind, which can be linked to the MD variable nace using table nace_DEFind), and year (year). The table also contains asset-specific deflators (e.g., construction, machinery, intellectual property) and includes growth and depreciation rates, offering a detailed dataset for robust analytical use.
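Applying a deflator is then a merge-and-divide operation. In the sketch below, the deflator table columns (cc, DEFind, year, pnv) follow the description above, but the link between nace and DEFind via nace_DEFind is simplified to a direct match, and all values are made up:

```r
# Sketch of deflating nominal value added with the value-added deflator pnv.
firms <- data.frame(cc = "NL", DEFind = "C10", year = c(2010, 2015),
                    va_nominal = c(100, 120))
defl  <- data.frame(cc = "NL", DEFind = "C10", year = c(2010, 2015),
                    pnv = c(1.00, 1.08))

out <- merge(firms, defl, by = c("cc", "DEFind", "year"))
out$va_real <- out$va_nominal / out$pnv   # value added in 2010 prices
round(out$va_real, 2)
# 100.00 111.11
```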

2.8 Nuvolos: where MDI users develop and test their codes

To support testing of both modules and the full MDI infrastructure, a dedicated environment has been set up on the server Nuvolos. This environment includes several separate spaces for different purposes—for example, an internal development space for the MDI team, and a testing space for internal and external users to validate module code.

This server replicates the environment of a national statistical institute and includes the MDI infrastructure. The Nuvolos server provides all the necessary tools, scripts, and libraries required to develop research modules. Each space includes mock data, which consists of artificially generated datasets that mimic real NSI data in structure and naming conventions. These datasets allow for realistic and consistent testing of modules and infrastructure. (Details on how the mock data is created can be found here.)

2.8.1 How to access

Nuvolos is the server environment that we use for training, testing, development, and debugging. There are three separate environments:

1. MDI Training Environment: this space is meant for MDI users, module writers who want to test their scripts, and people who want to learn about the MDI.

2. Nuvolos Developer Space: this space is meant for the MDI infra team to develop and test the MDI infrastructure.

3. Portugal Data Access Space: this space is exclusively for people who have access to the Portuguese data. This is where the MDI for PTx is executed by the MDI team.

If you want to get access to either of the environments, reach out to Johanna via email or Teams.

2.8.1.1 MDI Training Environment

This space is designed for training, practicing, and familiarizing yourself with the MDI infrastructure. It is intended mostly for external users, as it already contains all relevant files, i.e. the whole MDI infrastructure and the mock data. It is an environment in which externals can test their code using the MDI infrastructure and mock data. It is updated periodically, so it does not always reflect the latest version of the MDI. It is not intended for bug-fixing or working on the infrastructure, as it does not link to GitHub.

As this is a practice space, module writers can import or write their scripts, develop their module, run it as part of the MDI, and export their files if needed. The final module needs to be sent to the MDI team before each launch.

In the training space each user has a separate copy of the MDI and data files. That means that if scripts are altered, moved, or added, this is only reflected in the user’s space: deleting a file will not affect any other user, and no other user can see your modifications.

2.8.2 How to use

You will find a folder structure similar to the one found in the NSI environments that contain the actual microdata.

Files structure

  1. “Files” section

     You’ll find the following folders:

     • MDI: contains the MDI infrastructure
     • output: any output generated by a script can be saved here

  2. Additional folders, accessible only through RStudio:

     • space_mount/mockdata/NSIdata/: this folder contains NSI mockdata of several countries
     • space_mount/mockdata/TMP/: this folder contains MD mockdata of several countries

To write their code, users first need to follow some steps to load the environment, i.e. run the functions that import the mockdata and all the necessary auxiliary files mimicking the NSI environments.

Go to the “Applications” section on the left navigation bar and open the RStudio application.

  • Run the countdown:
    • In RStudio, in the files section on the right, look for MDI > launchpad and open countdown.R.
    • Click on ‘Source’ to execute it.
    • You will be asked to choose a program to run; choose interactive MDI by entering 4.
    • Wait until the script is done. The metadata, Rtools, and libraries are imported; you can now create and run your script.
  • Create/execute a module:
    • Add a new file.
    • Add your module code or any code that you want to run.
    • Execute your code either line-by-line by clicking ‘Run’ or all at once by clicking ‘Source’.
  • When you run your script, any errors will appear in the Console section. The executed lines are highlighted in blue, while errors are displayed in red. To resolve an issue, identify the problematic line in your script, make the necessary corrections, and run it again.

2.8.2.1 Portugal Data Access Space

This space is set up for direct access to the Portuguese data; this is where the launch execution for Portugal takes place. The space is reserved for the PTx country leaders. It contains the raw PTx data and the MDI infrastructure. The MDI infrastructure in this space is updated regularly but is not necessarily the most recent version found on GitHub.

The data is stored in the large file storage (folder space_mounts/NSIdata). Everyone who has access to the space has read and write permission on the data. Any changes to the data should be made with utmost caution.

All files - infrastructure and data - are shared across all users of the space. That means any modifications will be visible to all users of the space.

2.9 The complete MDI pipeline

Below is a visualization of the complete MDI pipeline:

Code
%%{init: {
  "theme": "base", 
  "themeVariables": {
    "background": "#ffffff",
    "textColor": "#000000",
    "lineColor": "#000000",
    "fontSize": "26px"
  }
}}%%

flowchart TB

%% Setting MDI

A([🚀 **Microdata Infrastructure** 🚀]) --> |to initialize it...|A1([**launchpad/countdown.R**])

A1 --> A2(set country code)
A1 --> A3(set paths to directories)
A1 --> A4(set disclosure parameters)

A2 -.-> A5[AT, DE, FI, FR, NL, PT, SI]

  style A fill:#003366,stroke:#000000,color:#ffffff  
  style A1 fill:#228B22,stroke:#ffffff,color:#ffffff
  style A2 fill:#90EE90,stroke:#ffffff,color:#000000
  style A3 fill:#90EE90,stroke:#ffffff,color:#000000   
  style A4 fill:#90EE90,stroke:#ffffff,color:#000000  
  style A5 fill:#E6FFE6,stroke:#ffffff,color:#000000


%% Options MDI

A3 ---> B1([there are four options:])
A4 ---> B1([there are four options:])
A5 ---> B1([there are four options:])

B1 --> C[(**pre_launch_checker.R**)]
B1 --> D[(**liftoff.R**)]
B1 --> E[(**interactive_mdi.R**)]
B1 --> F[(**prepare_NSI.R**)]

  style B1 fill:#F0F0F0,stroke:#FFFFFF,color:#000000

%% Pre-launch Checker

C --> C1[checking if metadata corresponds to what we have in the environment]

C1 --> C2(**Database**)
C1 --> C3(**Varnames**)
C1 --> C4(**Codebooks**)
C1 --> C5(**Classfiles**)

C2 -.-> C6[Do they all exist?]
C3 -.-> C7[Do all varnames exist?]
C3 -.-> C8[Are all varnames of the listed data type?]

  style C fill:#7A5DC7,stroke:#000000,color:#ffffff
  style C1 fill:#E6E6FA,stroke:#ffffff
  style C2 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C3 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C4 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C5 fill:#7A5DC7,stroke:#ffffff,color:#ffffff
  style C6 fill:#E6E6FA,stroke:#ffffff
  style C7 fill:#E6E6FA,stroke:#ffffff
  style C8 fill:#E6E6FA,stroke:#ffffff

%% Liftoff

D ---> D1(🚀 **main script that completes the launch sequence: loads essential R libraries, pulls in MDI resources and NSI metadata, and brings raw firm data plus concordance tables into the environment. This script launches the rocket** 🚀)

D1 --> D2(read R libraries, import mdi library, import NSI metadata, import concordance tables)
D1 --> D3(import raw firm data)

D1 ---> |if data is already harmonized...|D4[**execute research modules**]

D4 --> D5[M0]
D4 --> D6[CN]
D4 --> D7[EN]
D4 --> D8[FD]
D4 --> D9[MP]
D4 --> D10[TC]

D5 --> D11[🎊 **extract results** 🎊]
D6 --> D11
D7 --> D11
D8 --> D11
D9 --> D11
D10 --> D11

  style D fill:#8B0000,stroke:#000000,color:#ffffff  
  style D1 fill:#8B0000,stroke:#ffffff,color:#ffffff 
  style D2 fill:#FFD6D6,stroke:#ffffff,color:#000000  
  style D3 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D4 fill:#8B0000,stroke:#ffffff,color:#ffffff 
  style D5 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D6 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D7 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D8 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D9 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D10 fill:#FFD6D6,stroke:#ffffff,color:#000000
  style D11 fill:#8B0000,stroke:#ffffff,color:#ffffff 

%% Interactive MDI

E --> E1[set up directories, call R libraries, load Rtools, and interactively explore the MDI environment]

  style E fill:#FF8C00,stroke:#000000,color:#ffffff   
  style E1 fill:#FFF2DC,stroke:#ffffff   

%% Prepare NSI

F --> F1[first time preparation of data and metadata, updating of metadata when new file years and file types become available]

  style F fill:#654321,stroke:#000000,color:#ffffff   
  style F1 fill:#ECD9C6,stroke:#ffffff  
  
%% Harmonize raw data

D2 --> G(🔩 **harmonize raw data to MDI** 🔩)
D3 --> G

G --> G1(there are 4 methods for the harmonization process)
G1 -.-> G2(**revalue**: transform value of a unique variable)
G2 -.-> G3(**recode/reclass**: concord codebook/class variables as desired)
G3 -.-> G4(**redefine**: aggregate one or more variables)
G4 -.-> G5(**remap**: assign new name to a raw variable)

G ---> D4

style G fill:#4B4B4B,stroke:#ffffff,color:#ffffff   
style G1 fill:#E0E0E0,stroke:#ffffff   
style G2 fill:#F5F5F5,stroke:#ffffff   
style G3 fill:#F5F5F5,stroke:#ffffff  
style G4 fill:#F5F5F5,stroke:#ffffff    
style G5 fill:#F5F5F5,stroke:#ffffff


Note. This scheme shows how to initialize and run MDI. Start launchpad/countdown.R, set country, paths, and disclosure parameters. Choose one of four programs: pre_launch_checker to validate metadata and varnames. liftoff to load libraries, import metadata and raw data, harmonize using four methods (revalue, recode/reclass, redefine, remap), then execute modules and extract results after disclosure checks. interactive_mdi to explore the environment. prepare_NSI for first-time setup and metadata updates. Boxes indicate steps. Arrows indicate control and data flow.

3 Setting up the MDI

This section covers the technical details of the MDI - including preparation for implementation in a new country, the modifications required for a launch, and how to develop the infrastructure further. It provides information for the MDI team and the country leads responsible for setting up the MDI in their respective countries.

3.1 Introduction to and Setup of the MDI Infrastructure

The MDI (Microdata Infrastructure) provides a unified research environment implemented identically at all national statistical institutes (NSIs), including the mock data site. This consistent setup allows researchers to analyze standardized microdata (MD) panels, constructed from diverse national sources.

These MD panels are harmonized through detailed metadata, which ensures legal compliance, transparency, and comparability of statistical outputs across countries. For each NSI, the metadata specifies the available source files, variables, and classification lists, and maps them to the shared MD format. As a result, the datasets are syntactically identical across countries, even when the underlying data differs.

NSIs vary in their legal frameworks, technical setups, and the types of data they maintain—from registers and surveys to administrative sources. The MDI infrastructure addresses this heterogeneity by applying a common structure and metadata standard across all participating institutes.

To execute research, individual researchers write analysis modules, typically in R; together these modules form the payload. The payload is executed inside the secure MDI environment. During a launch, the rocket reads the metadata and data, harmonizes them, constructs the MD panels, and then runs the payload modules. The outputs comply with disclosure rules, enabling valid cross-country comparisons.

3.1.1 Launch Preparation by the MDI team

These are the steps the MDI team must follow in this order before each launch.

  • Lock the R package list. Freeze the package versions to ensure consistency across all NSI environments.

  • Create a dedicated GitHub branch, e.g. post_Launch_vX.X. This branch will track all changes made during the launch at the NSIs.

  • Create an error tracking file alldocs/Launch_vX.X_errors_overview.csv. This shared file is used by all country leaders to document errors and changes during the launch process.

  • Generate documentation with roxygenize. Run roxygenize(paste0(dirROCKET, "Rtools")) to generate documentation. No further changes should be made to the R tools after this step.

  • Capture the Git commit details. Run rocket/MDIprogs/get_commit_details.R and re-commit the branch to lock the exact version.

  • Deploy to NSI Teams folder. Use the appropriate scripts to copy the finalized GitHub branch to the NSI-specific Teams folders.

  • Notify NSIs to download and set up. Ask NSI system administrators to download the folder, install or update required R packages, and install the MDI package.

3.1.2 Launch Sequence Overview

This section outlines the main steps for executing a full MDI launch.

  1. Import MDI Files
    Copy the complete MDI folder from the country Teams directory into a local working directory of your choice. Ensure that both the user and R have read access to the raw data files in that location.

  2. Configure Countdown Script
    Open launchpad/countdown.R and update the required parameters to match your site-specific setup. Find details and explanations here

  3. Run pre_launch_checker.R
    Begin by running countdown.R from your working directory and selecting the pre_launch_checker.R option. This checks for inconsistencies between the metadata expected by MDI and the actual metadata at the NSI site. It generates concordance files and a report listing issues to fix. Find details and explanations here

  4. Run the Post-Harmonization Checker
    In countdown.R, set the option MDImoduleRun = FALSE and run the countdown again, selecting liftoff.R. This executes the full MDI rocket without running the analytical modules. During this phase, the Post-Harmonization Checker (PHC) is triggered to validate the harmonized data. The PHC script performs quality checks on the harmonized microdata, such as detecting duplicates, format mismatches, date inconsistencies, and structural breaks. The results of these checks are saved to two files: <CountryCode>_phc_results.txt and breaks_report.pdf (in dirTMPSAVE). These files must be reviewed and issues resolved before proceeding. Please see the detailed section on post-harmonization checks for instructions for country leaders.

  5. Full module execution
    After resolving all issues flagged by the PHC, set MDImoduleRun = TRUE and rerun liftoff.R to execute the full set of research modules. Iterations with the MDI staff may be needed for fixes and patches to the rocket and payload, until the final results are written to the dirOUTPUT directory.

  6. Export
    The files in dirOUTPUT need to be checked for disclosure. After disclosure checks are completed, the approved files from this directory can be uploaded to the MDI TEAMS cloud directory designated for the NSI staff.
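As an illustration, the site-specific parameters referenced in the steps above might be set in launchpad/countdown.R roughly as follows. This is a hedged sketch: MDImoduleRun and the dir* names appear in this manual, while the exact name of the country parameter and all values are placeholders.

```r
# Hypothetical sketch of a countdown.R parameter block; values are placeholders.
country      <- "PT"             # 2-letter country code (exact parameter name may differ)
dirINPUTDATA <- "D:/mdi/raw"     # raw NSI data files (see NSI_datafiles 'path')
dirTMPSAVE   <- "D:/mdi/tmp"     # PHC results and breaks_report.pdf are written here
dirOUTPUT    <- "D:/mdi/output"  # final module results awaiting disclosure checks
MDImoduleRun <- FALSE            # FALSE: harmonize + run PHC only; TRUE: run modules
```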

3.1.3 Additional Programs (Not Part of Launch Sequence)

The following programs are not part of the formal launch sequence but support metadata setup and interactive work:

  • prepare_NSI.R
    Used to generate and structure metadata at the NSI. Should be run before any launch steps if metadata is not yet available.

  • interactive_MDI.R
    Enables interactive work inside the MDI environment (e.g. for metadata exploration or manual testing). Not used during automated launches.

3.1.4 The structure of the MDI

These are the main directories in the MDI folder:

  • docs: Documentation of the MDI system, including the MDI Manual.
  • rocket: Code supporting and controlling MDI rocket launches, including NSI metadata and auxiliary data.
  • payload: Research modules, including metadata and NSI-specific NSI_MD concordances.
  • launchpad: NSI-specific information for controlling MDI code and rocket launches.

Files in the MDI folder

.
├── docs
├── launchpad
├── payload
├── rocket
└── site

directory launchpad (with files to launch MDI)

launchpad
├── README.md
├── countdown.R
├── interactive_MDI.R
├── liftoff.R
├── pre_launch_checker.R
├── prepare_NSI.R
└── report_file_changes.R

subdirectories of rocket (with code, and (meta)data to support MDI)

rocket
├── CompNet
├── MDIprogs
├── NSImetadata
├── Rtools
├── auxdata
└── control

subdirectories of payload (with analytical code and MD (meta)data)

payload
├── Launch_v2.0
├── Launch_v2.1
├── Launch_v2.2
├── Launch_v2.3
└── Launch_v2.mini

3.1.5 Importing to, and exporting from, the NSI site

Whenever an import or export operation is required from or to an environment, it is important to consider both the time it takes and whether the operation incurs any monetary cost.

Country Type Time Cost
AT Export 45-60 days 343.92 Euro
AT Import 2-3 days 124.00 Euro / hour
FI Export 2-3 days 0.00 Euro
FI Import 2-3 days 0.00 Euro
FR Export 1-2 days 31.80 Euro / 30 minutes
FR Import 1-2 days 0.00 Euro
NL Export 1-2 days 125.00 Euro - 266.00 Euro
NL Import 1-2 days 0.00 Euro
PT Export 2-3 days 0.00 Euro
PT Import 2-3 days 0.00 Euro
SI Export 1 day 0.00 Euro
SI Import 30 days 0.00 Euro

In general, files to be downloaded from MDI to an NSI are stored in a TEAMS directory accessible to the respective NSI team. These files are located in the download folder of a SharePoint directory, which can be synced to your local machine. For example: ../OneDrive-SharedLibraries-IWHEconomicStudiesLab/MDI Data Providers Forum - PT. Each NSI also has an MDI TEAMS ‘upload’ directory, used to upload output generated by the rocket.

3.2 Metadata

This section provides an overview of the structure of NSI and MD metadata, how to construct them and to establish the connection between them, ensuring that country-specific data sources are accurately mapped to the standardized MD panel structure.

3.2.1 Specifications for the NSI Metadata

This section summarizes the structure and content of the NSI metadata files. These files document, in both machine- and human-readable formats, the available data files, the unit of observation (i.e., the description of each row), the names and descriptions of the variables (i.e., columns) in each file, and the valid values for each variable, including their class and domain. The following paragraphs offer guidance on how to prepare country-specific metadata files accordingly.

Once created, the NSI metadata files must be uploaded to the designated TEAMS directory. After the NSI downloads the updated rocket, these metadata files will be located in the rocket/NSImetadata/*NSI*/ directory of the MDI infrastructure. The MDI program pre_launch_checker.R, which should be run whenever the MDI is updated, will identify inconsistencies and other issues in the metadata.

The main types of NSI metadata files prepared include:

  • datafiles: Lists all available NSI firm-level data files, including their names and years covered.

  • varnames: Documents the variables and their descriptions for each raw data file listed in datafiles.

  • codebook: Maps categorical variable values to their corresponding descriptions.

  • class: Describes classification variables in the datasets, such as industry or product codes.

  • classvart0_classvart1_conc: Details how classification variables evolve over time, providing concordance between versions.

  • keyID1_keyID2_conc: Maps a firm identifier (keyID1) to higher level identifier (keyID2) by year.

It is advised to construct the files in the same order as in the above list.

Together with the MDI team, the NSI prepares metadata to support the harmonization of NSI data to the MD specification. The MDI team supplies metadata, potentially specific to each launch, that describes the MD datasets and their variables. In addition, the MDI team and the NSI jointly provide concordances used to align NSI data files with the standardized MD format.

To facilitate this process, the MDI team also provides tools for creating the required metadata files. These tools can be found in the directory /rocket/MDIprogs/metadata_tools/.

In the filenames for the metadata, the acronym NSI is used. This should be substituted with the 2-letter country code for the country in question (using the ISO 3166-1 alpha-2 standard, e.g. NSI = PT). For the MDI metadata, the two letters MD are used.

3.2.1.1 List of NSI datafiles – NSI_datafiles.csv

This file contains the list of all available raw data files on a country’s environment. The file has the following columns:

NSI_datafile,NSI_dataset,yearvar,year_start,year_end,format,path,details,firm_unit,data_source,firm_sample,preprocessing 

where:

  • NSI_dataset is the ‘generic’ name of the NSI datafile.

  • NSI_datafile is the name of the file in the NSI environment.

  • yearvar gives the name of the year variable if NSI_datafile is a panel, and is empty otherwise.

  • year_start is the starting year of the data file.

  • year_end is the last year of the data file.

  • format is the file extension (csv, sas, stata, etc) of the file (i.e. also the storage format of the data).

  • path indicates path of the datafile relative to the NSI data directory (given by the parameter dirINPUTDATA in launchpad/countdown.R).

  • details contains additional notes on the file.

  • firm_unit indicates the type of firm observation unit. There can be four types of units. Below we provide a definition for each, taken from the Eurostat glossary, ordered hierarchically from smallest to largest:

    • plant: A single-location enterprise or part of one, primarily engaged in one main productive activity. Also often known as ‘establishment’. This corresponds to plantid in the MD data
    • legal_unit: Either legal persons recognized by law or natural persons conducting economic activity independently. This corresponds to firmid in the MD data
    • enterprise: An organizational unit producing goods or services, with decision-making autonomy, possibly spanning multiple activities, locations, or legal units. Hence, one enterprise might be constituted by more than one legal unit. This corresponds to entid in the MD data
    • enterprise_group: A set of legally or financially linked enterprises, controlled by a group head, forming an economic entity with shared or centralized decision-making. This corresponds to entgrp in the MD data
  • data_source: refers to the origin of the data. Three options are possible:

    • survey: If the data was collected through surveys
    • administrative_source: Information collected by public authorities from firms as part of legal or regulatory requirements, such as tax records, employment filings, or financial statements
    • mixed: If the data comes from surveys, administrative sources, or other collection methods.
  • firm_sample: Information about the population of firms present in the datafile (a short free-text description, kept as concise as possible)

  • preprocessing: Instructions on how to perform a data preprocessing operation on the raw datafile. For more details, check the dedicated box.

An example (for NSI=FI, 2018) of the metadata for the raw data files (the columns yearvar, year_end, path and details are omitted for viewing):

NSI_datafile NSI_dataset year_start format firm_unit data_source preprocessing
bd2018 bd 2018 csv legal_unit administrative_source NA
br2018 br 2018 csv legal_unit administrative_source NA
bs2018 bs 2018 csv legal_unit administrative_source NA
cis2018 cis 2018 csv legal_unit survey NA
ictec2018 ictec 2018 csv legal_unit survey NA
ifats2018 ifats 2018 csv legal_unit administrative_source NA
*Note: Only the first rows are displayed.

3.2.1.2 File-specific metadata – NSI_varnames.csv

This file contains the list of all variables in each raw datafile appearing in the column NSI_datafile of NSI_datafiles.csv. It has the following columns:

[1] NSI_datafile,NSI_varname,is_key,description,class,domain

where:

  • NSI_datafile is the name of the file in the NSI environment.

  • NSI_varname is the name (hopefully mnemonic) of the variable in the raw file.

  • is_key is a boolean stating whether variable belongs to the (possibly joint) unique keys of the datafile, e.g. firmid, or firmid,year are often the unique key(s).

  • description contains a description of the variable, if possible using Eurostat convention.

  • class is the type of value that the variable holds. The following data types can be encountered:

    • numeric: Numbers with or without decimals (e.g., 3, 4.5).
    • character: Text or string values (e.g., “apple”).
    • date: Calendar dates stored as Date objects (e.g., 2023-05-09).
    • logical / boolean: TRUE/FALSE values used in conditions and comparisons.
  • domain provides information on the values of the variable. See examples below:

    • classification: e.g. a list of industry, region, or product codes (value is a metadata filename, e.g. NSI_classname_class.csv, which provides a list of permissible values and descriptions)
    • file-specific codebook of categorical answers (value is a metadata filename, e.g. NSI_codebook.csv, containing permissible values such as ‘yes’, ‘no’, ‘maybe’, or ‘small’, ‘medium’, ‘large’)
    • For other values:
      • For monetary values, “1000” (for 1000 Euros)
      • For dates: “%m%d%Y” (R date-format for mmddyyyy). For ‘year’ variable, we use “%Y”
      • For real units, choose from: “ton” (weight, 1000kg), “m3” (volume), “GJ” (energy), “unit” (1 item).
3.2.1.2.1 Domain: Expenditures, Quantities, Dates
Measure Domain Entry Description
Expenditure 1000 … or 1 Euro; 10000000 Euro; etc.
Foreign currency 1*FXC … or 1000 etc.; Where FXC is an ISO 4217 3-letter currency code
Employment 1 1 here refers to 1 FTE; or 1000; … or 1 Emp if in persons
Numerical 1 1 here refers to 1 unit; … or 10; 100; where ‘unit’ gives unit in lowercase for the variable in the NSI data file
Date %Y-%m-%d Use the R date format that matches the values for the NSI date or year variable
Format Description Example
%a Abbreviated weekday Sun, Thu
%A Full weekday Sunday, Thursday
%b or %h Abbreviated month May, Jul
%B Full month May, July
%d Day of the month 01-31 27, 07
%j Day of the year 001-366 148, 188
%m Month 01-12 05, 07
%U Week 01-53, (start Sunday) 22, 27
%w Weekday 0-6 (Sunday= 0) 0, 4
%W Week 00-53 (start Monday) 21, 27
%x Date, locale-specific
%y Year 2-digit 00-99 (69-99 read as 19xx) 84, 05
%Y Year 4-digit 1984, 2005
%C Century 19, 20
%D Date formatted %m/%d/%y 05/27/84, 07/07/05
%u Weekday 1-7 (Monday=1) 7, 4
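The % codes above follow strptime conventions, so the behavior of a domain entry such as %Y-%m-%d can be checked in any strptime implementation. A minimal check, shown in Python purely for illustration (Python's datetime shares these % codes with R):

```python
from datetime import datetime

# The domain entry for a date variable is a strptime-style format string.
# "%Y-%m-%d" matches ISO dates such as 2018-05-27.
fmt = "%Y-%m-%d"
d = datetime.strptime("2018-05-27", fmt)
print(d.year, d.month, d.day)  # -> 2018 5 27
```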
3.2.1.2.2 Domain: Classification or Categorical (factor) variables
Variable Domain_Entry Description
Classification variable NSI_classname_class An (official) list, ie NL_nace
Categorical variable NSI_codebook Contains permissible values for categorical (factor) variables, e.g. ‘yes’, ‘no’,‘maybe’
3.2.1.2.3 Example: Netherlands (SBS, 2018): NL_varnames.csv
NSI_datafile NSI_varname is_key description class domain
sbs2018 ent_id 1 Enterprise ID (identification character
sbs2018 sbs_12110 0 Turnover numeric 1000
sbs2018 sbs_12150 0 Value added at factor cost numeric 1000
sbs2018 sbs_12170 0 Gross operating surplus numeric 1000
sbs2018 sbs_13110 0 Total purchases of goods and s numeric 1000
*Note: Only the first 5 rows are displayed.

3.2.1.3 Codebook for categorical variables – NSI_codebook.csv

This file contains the possible values of a categorical variable and the description belonging to each value. The rows give the possible values occurring in firm data for a particular NSI_datafile and NSI_varname. The name of the codebook should be given in the ‘domain’ column of NSI_varnames for the relevant categorical variable.

[1] NSI_dataset,NSI_varname,year,code,description

where:

  • NSI_dataset is the name of the generic dataset in the NSI environment.

  • NSI_varname is the name of the variable of that specific raw dataset.

  • year is the year for which codebook values hold. If empty, holds for all years of the NSI dataset.

  • code gives all the values of the categorical variable that occur for that NSI_varname in that NSI_datafile.

  • description gives the description explaining each code value.

Dealing with the year column

As mentioned, if a specific mapping holds for all available years of a specific NSI_dataset, then the year column for that mapping needs to be empty. For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112 being a categorical variable taking values ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ for all years. In this case, the NSI_codebook will contain only three rows for these three mappings, without any reference to the years. In practice:

NSI_dataset NSI_varname year code description
data_ictec var112 0 no
data_ictec var112 2 yes
data_ictec var112 999 not available

That said, if a mapping is not constant across all years of an NSI_dataset, then the year column needs to have a value for all mappings reported. In this context, there can be two cases:

  1. The codes differ by year for the same variable: This means that var112 takes values, say, ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ in 2006, while in 2007 the mapping changes to ‘0’, ‘1’, ‘9’ for ‘no’, ‘yes’, ‘not available’, respectively. Hence, we need to indicate year-specific mappings in the codebook table:
NSI_dataset NSI_varname year code description
data_ictec var112 2006 0 no
data_ictec var112 2006 2 yes
data_ictec var112 2006 999 not available
data_ictec var112 2007 0 no
data_ictec var112 2007 1 yes
data_ictec var112 2007 9 not available
  1. The NSI_varname isn’t available for all years: As an example, let var_112 be only available in 2005 and 2006 but it’s dropped or has a different name in the other years. Then, we need to include it for both 2005 and 2006, regardless of whether the codes are identical or not:
NSI_dataset NSI_varname year code description
data_ictec var112 2005 0 no
data_ictec var112 2005 2 yes
data_ictec var112 2005 999 not available
data_ictec var112 2006 0 no
data_ictec var112 2006 2 yes
data_ictec var112 2006 999 not available
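The year-fallback rule just described can be sketched as a small lookup: year-specific codebook rows take precedence, and rows with an empty year hold for all years. An illustrative sketch in Python (the MDI itself is R-based; the function and the sample rows are hypothetical):

```python
# Hedged sketch of the year-fallback rule for codebook lookups; names hypothetical.
# Rows with year=None hold for all years; year-specific rows take precedence.
codebook = [
    ("data_ictec", "var112", None, "0", "no"),
    ("data_ictec", "var112", None, "2", "yes"),
    ("data_ictec", "var112", None, "999", "not available"),
    ("data_ictec", "var113", 2007, "1", "yes"),   # a year-specific mapping
]

def decode(dataset, varname, year, code, rows=codebook):
    # Try the year-specific mapping first, then fall back to the all-years rows.
    for want_year in (year, None):
        for ds, vn, yr, cd, desc in rows:
            if (ds, vn, yr, cd) == (dataset, varname, want_year, code):
                return desc
    return None

print(decode("data_ictec", "var112", 2006, "2"))   # -> yes
print(decode("data_ictec", "var113", 2007, "1"))   # -> yes
```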

Below is an example of the values occurring for the unit of measurement in the SI PRODCOM data for 2012.

NSI_dataset year NSI_varname code description
MIKRO_INDL_razST NA ME 1000 SIT thousands of slovenian tolars
MIKRO_INDL_razST NA ME EUR euros
MIKRO_INDL_razST NA ME GJ Gigajoule - a unit of energy
MIKRO_INDL_razST NA ME MWh Megawatt-hour - a unit of energy
MIKRO_INDL_razST NA ME TJ terajoules
*Note: Only the first 5 rows are displayed.

3.2.1.4 Classification lists – NSI_classvar_class.csv

This file contains the unique list of codes per year of a specific classification variable in a country. Note that there should be a list for every classification variable in each dataset. The related table has the following columns:

[1] code,year,description

where:

  • code is the list values of the classification variable observed in the data.

  • year is the related year. If the mapping does not change across the years available for that NSI_varname, the year column is filled with NA.

  • description gives the description for each code value.

A sample of rows from the table of PRODCOM codes (in this case some randomly selected rows from the list of codes for Finland):

code year
27512630 2020
22214180 2019
19301352 2005
20142320 2021
26702490 2016
20165490 2013
24521090 2019
28133200 2010
15842150 2005
19303150 2006
13301380 2016
26518110 2021
20101032 2004
16292320 2012
26518550 2011
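The class list is what validation checks against: a raw code is valid if it occurs in the list for the matching year. A minimal sketch in Python for illustration (the (code, year) pairs are taken from the sample above; the function is hypothetical):

```python
# Sketch: check raw codes against the classification list for the matching year.
# (code, year) pairs taken from the sample rows above.
class_list = {("27512630", 2020), ("22214180", 2019), ("20142320", 2021)}

def code_is_valid(code, year):
    return (code, year) in class_list

print(code_is_valid("27512630", 2020))  # -> True
print(code_is_valid("27512630", 2019))  # -> False
```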

3.2.1.6 Key ID concordance – NSI_keyID1_keyID2_conc.csv

This file contains a concordance table between two firm unit codes by year.

[1] "keyID1" "keyID2" "year"  

where:

  • keyID1 is the first firm unit code
  • keyID2 is the second firm unit code
  • year is the reference year

For example, a concordance between units firmid and entgrp could look like:

firmid entgrp year
EoYncPX1QK ZDfQhvv 2019
Zn1yzAeYA4 oOUz7Ep 2021
B4iCIhzgPi oOUz7Ep 2013
9sJnqQM0lo ZDfQhvv 2002
FPSgiOjwA7 MEVq1XW 2007
0g0AFLyHCe MEVq1XW 2015
hlmrG4AyLu oOUz7Ep 2011
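Such a concordance is applied by joining on the (keyID1, year) pair to attach the higher-level unit to each firm-year observation. A small sketch in Python for illustration (the identifier pairs come from the sample table above; the value column is invented for the example):

```python
# Sketch: using the keyID1_keyID2 concordance to attach the higher-level unit.
# (firmid, year) -> entgrp pairs are taken from the sample concordance above.
conc = {
    ("EoYncPX1QK", 2019): "ZDfQhvv",
    ("Zn1yzAeYA4", 2021): "oOUz7Ep",
    ("B4iCIhzgPi", 2013): "oOUz7Ep",
}

# Hypothetical firm-year observations: (firmid, year, some value)
firm_rows = [("EoYncPX1QK", 2019, 12.5), ("B4iCIhzgPi", 2013, 4.0)]

# Attach entgrp to each firm-year observation via the concordance.
linked = [(conc[(fid, yr)], yr, val) for fid, yr, val in firm_rows]
print(linked)  # -> [('ZDfQhvv', 2019, 12.5), ('oOUz7Ep', 2013, 4.0)]
```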

3.2.2 Specifications for the MD Metadata

  • Work in this section is a collaboration between NSI and MDI staff.

  • In an iterative process, using NSI metadata for each country, and taking into account research needs of MD users, a specification is made of the MD panels and variables.

    1. MD_datafiles.csv describes the harmonized panel datasets generated in each launch.
    2. MD_varnames.csv describes the variables per dataset, with their description, class, and domain.
    3. MD classifications: versions of official classifications, such as EU NaceR2 activities or NUTS2 regions
    4. MD codebooks: valid values for categorical variables

3.2.2.1 List of micro-dataset (MD) panels – MD_datafiles.csv

This file contains the list of all harmonized firm-level micro data (MD) panels generated by the MDI code. Researchers can use these panels in code modules during an MDI launch, either individually or linked at the firm-year level. The file has the following columns:

[1] MD_dataset,description,details

where:

  • MD_dataset is the name of the MD panel (R data.table) at runtime of the launch.

  • description contains a description of the panel and its underlying source data.

  • details contains additional notes on the file.

Below is the list of currently available MD panels:

MD_dataset description details
BR Business Register see: https://ec.europa.eu/eurostat/
BS Balance Sheet Balance Sheet on Enterprise groups
BD Business Dynamics
SBS Structural Business Statistics
CIS Community Innovation Survey (only available in even numbered years)
ICTEC ICT Usage in Enterprises Survey https://ec.europa.eu/eurostat/cache
ITGS International Trade in Goods
ITS International Trade in Services
OFATS Outgoing Foreign Affiliates Statistics
IFATS Incoming Foreign Affiliates Statistics
ENER Energy Use at Firms in progress: harmonization across countries
PRODCOM Production Communautaire by firm and product https://ec.europa.eu/eurostat/web/p

3.2.2.2 Micro-dataset (MD) variables – MD_varnames.csv

This file contains the list of all variables available in the MD firm-level panel datasets that have been generated by the MDI code using the NSI datafiles, NSI metadata, and the NSI-MD concordances. The file has the following columns:

[1] MD_varname,MD_dataset,is_key,description,class,domain

where:

  • MD_dataset is the name of the MD firm-level panel dataset, ie BR, SBS, etc.

  • MD_varname is the name of the variable in the virtual firm-level dataset.

  • is_key is a boolean stating whether variable belongs to the (possibly joint) unique keys of the dataset, e.g. firmid, or firmid,year are often the unique key(s).

Note

Because an MD dataset can have a different unique identifier (plant, legal unit, enterprise, or enterprise group; see the firm_unit column in the NSI_datafiles section), depending on the raw data it is based on, is_key takes the value 1 for each of the four possible unit types.

That said, in the harmonized MD dataset researchers will work on, only one of the four units will be available, allowing module writers to aggregate or disaggregate across units (when possible) using the tool mdi_key_id_switch() from the mdi package. The tool applies different aggregation or disaggregation methods depending on the MD varname, as indicated in the file MD_aggr_disaggr_methods.csv.

  • description contains a description of the variable, if possible using Eurostat convention.

  • class is the type of value that the variable holds (e.g. integer, character, boolean etc.).

  • domain

    • classification: e.g. a list of industry, region, or product codes (value is a metadata filename, e.g. MD_filename_varname_list.csv, which provides a list of permissible values and descriptions)
    • MD-specific codebook of categorical answers (value is a metadata filename, e.g. MD_codebookname_codes.csv, containing permissible values such as ‘yes’, ‘no’, ‘maybe’)
    • For other values:
      • For monetary values, “1000” (for 1000 Euros)
      • For dates: “%m%d%Y” (R date-format for mmddyyyy). For ‘year’ variable, we use “%Y”
      • For real units, choose from: “ton” (weight, 1000kg), “m3” (volume), “GJ” (energy), “unit” (1 item).
3.2.2.2.1 Domain: Expenditures, Quantities, Dates
Measure Domain_Entry Description
Expenditure 1000 Euro
Employment 1 FTE … or 1 Emp if in persons
Numerical 1 ‘unit’ ‘Unit’ gives unit used in NSI data file, or is left blank if just a count.
Date %Y For now, we use an R format for the 4-digit year as the date variable
Weight 1 kg
Volume 1 m3
Area 1 m2
Length 1 m
Energy 1 GJ GigaJoule
3.2.2.2.2 Domain: Classification or Categorical (factor) variables
Variable Domain_Entry Description
Classification variable NSI_classname_class An (official) list, ie NL_nace
Categorical variable NSI_codebook Contains permissible values for categorical (factor) variables, e.g. ‘yes’, ‘no’,‘maybe’

Below is a sample of 5 rows of the file MD_varnames with harmonized MD variables

MD_dataset MD_varname description domain
CIS inpssu Introduced onto the marke
ICTEC RBTS Use service robots
ICTEC CRMSTR share of information with
BD merger Enterprise merged with an
CIS year Year %Y

3.2.2.3 Classification lists – MD_classvar_class.csv

This file contains the unique list of codes of a specific classification variable from the MD panels. Note that there should be a list for every classification variable in each MD dataset. The related table has the following columns:

[1] code,description

where:

  • code is the list values of the classification variable observed in the data.

  • description gives the description for each code value.

An example of the table for NACE codes (in this case the official EU NaceR2 classification):

code description
C17.2.2 ____Manufacture of household and sanitary goods and of toilet requisites
C17.2.3 ____Manufacture of paper stationery
C17.2.4 ____Manufacture of wallpaper
C17.2.9 ____Manufacture of other articles of paper and paperboard
C18 __Printing and reproduction of recorded media
C18.1 ___Printing and service activities related to printing
*Note: Only the first rows are displayed.

3.2.2.4 Hierarchy files for classifications – MD_classvar_hier.csv

This file contains a series of columns that refer to different nodes of the classification variable in question. With this file, the user can easily aggregate or disaggregate the data based on the different nodes of the classification variable.

The columns of the file are labelled h_X, where X is a number from 0 to N, with h_0 the most detailed node and h_N the most aggregated node of the classification variable.

An example of a hierarchy table for NACE codes (in this case the official EU NaceR2 classification):

h_0 h_1 h_2 h_3 h_4
6491 649 64 K TOT
2222 222 22 C TOT
9810 981 98 T TOT
2331 233 23 C TOT
4743 474 47 G TOT
*Note: Only 5 rows are displayed.
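With the h_X columns, aggregating values up the hierarchy is a simple group-by on the target column. An illustrative sketch in Python (the h_0 to h_3 pairs come from the sample table above; the firm-level values are invented for the example):

```python
# Sketch: aggregating values up the NACE hierarchy using the h_X columns.
# h_0 -> h_3 pairs are taken from the sample hierarchy table above.
h0_to_h3 = {"6491": "K", "2222": "C", "2331": "C", "4743": "G"}
values_by_h0 = {"6491": 10.0, "2222": 5.0, "2331": 7.0}   # hypothetical values

totals = {}
for code, v in values_by_h0.items():
    section = h0_to_h3[code]                     # climb from h_0 to h_3
    totals[section] = totals.get(section, 0.0) + v

print(totals)  # -> {'K': 10.0, 'C': 12.0}
```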

3.2.2.5 Codebook for categorical variables – MD_codebook.csv

This file contains the possible values of a categorical variable and the description belonging to each value. Note that a particular codebook is sometimes ‘re-used’ for multiple variables. The name of the codebook should be given in the ‘domain’ column of the metadata for the file containing the categorical variable.

[1] MD_dataset,MD_varname,code,description

where:

  • MD_dataset is the name of the MD firm-level panel dataset, ie BR, SBS, etc.

  • MD_varname is the name of the variable of that specific MD dataset.

  • code gives the valid values of the categorical variable.

  • description gives the description for each code value.

Below is an example of the values given for some variables in the MD BD and BR datasets:

MD_dataset MD_varname code description
BD status 1 born in reference year
BD status 2 active entire reference year
BD status 3 dead in reference year
BD status 4 born and dead in reference year
BR demo 0 No demographic relation in ref. year
BR demo 1 Receiving employment from other enterprise in ref. year
*Note: Only the first rows are displayed.

3.2.2.6 Key ID overview – MD_idInfo.csv

As we need to coordinate our data work across multiple countries, there are differences in the key identifiers of the different MD datasets. The table below illustrates the situation for the countries to which we currently have access.

MD_dataset AT DE FI FR NL PTx PT SI GB
BR firmid firmid firmid firmid entid entid firmid plantid NA
BD firmid firmid entid firmid NA
BS firmid firmid entgrp entid firmid firmid NA
CIS firmid firmid firmid entid firmid NA
ENER plantid firmid plantid entid firmid plantid NA
ICTEC firmid firmid firmid entid firmid entid NA
IFATS firmid firmid entid firmid NA
ITGS firmid firmid firmid firmid entid entid firmid firmid NA
ITS firmid NA
OFATS firmid firmid firmid NA
PRODCOM plantid firmid firmid firmid entid entid firmid plantid NA
SBS firmid firmid firmid firmid entid entid firmid firmid NA

3.2.2.7 MD_aggr_disaggr_methods.csv

This table contains instructions on how a specific MD_varname is aggregated or disaggregated to a higher or lower firm unit. It is used only by the mdi_keyID_switch.R file, if a module writer wants to perform such an operation on a given harmonized MD dataset.

[1] MD_dataset,MD_varname,NSI_dataset,NSI_varname,method,detail,year

  • MD_dataset is the dataset name that the variable belongs to (e.g. SBS, BS, PRODCOM, BR). This determines the source of the variable within the integrated microdata framework.

  • MD_varname is the standardized variable identifier, harmonized across datasets (e.g. emp, rev, pay, assets). Used to link equivalent variables across datasets.

  • class is the variable type (numeric, categorical, boolean, date). It defines what operations are logically and statistically valid for the variable.

  • aggregation_method is the rule for aggregating data from a lower level to a higher level (e.g. plant \(\rightarrow\) firm, firm \(\rightarrow\) group). Specifies how observations are collapsed across identifiers during aggregation.

  • disaggregation_method is the rule for splitting or allocating data from a higher level to a lower level (e.g. firm \(\rightarrow\) plant). Indicates which weighting logic or fallback hierarchy is used to distribute values.


Aggregation Methods

  • sum
    Adds up all values in the group.
    Used for additive variables such as employment, turnover, pay, or total assets.

  • mean
    Calculates the simple arithmetic mean.
    Used for ratio or intensity variables (e.g. productivity, profitability ratios).

  • weighted_avg:<var1>|<var2>|...
    Computes a weighted average using one or more candidate weighting variables.
    The first available candidate is used.
    Example: weighted_avg:emp|rev \(\rightarrow\) weights by emp if available, otherwise by rev.

  • mode
    Returns the most frequent category (the statistical mode).
    Used for qualitative variables like ownership type or legal form.

  • weighted_mode:<var1>|<var2>|...
    Returns the category that maximizes the weighted frequency count.
    Example: weighted_mode:rev|emp gives greater weight to categories from larger firms.

  • any
    Logical aggregation returning TRUE if any record in the group is TRUE.
    Used for indicators such as export participation.

  • all
    Logical aggregation returning TRUE only if all records in the group are TRUE.
    Useful for group-level flags (e.g. all plants meet environmental certification).

  • min
    Returns the smallest value or earliest date in the group.
    Useful for start dates or minimum rates.

  • max
    Returns the largest value or latest date in the group.
    Useful for end dates or maximum thresholds.


Disaggregation Methods

  • equal
    Splits the higher-level total equally across all lower-level entities.
    Example: 100 employees across 4 plants \(\rightarrow\) each gets 25.

  • replicate
    Copies the same value across all sub-entities.
    Used for categorical variables like region, legal form, or activity code.

  • weighted_alloc:<dataset.var1>|<dataset.var2>|...|equal
    Allocates a higher-level value proportionally using variables from other datasets that exist at the disaggregated level.
    The listed candidates are checked in order, and the first available is used.
    Example:
    weighted_alloc:PRODCOM.rev|ITGS.ntrade|SBS.emp|equal
    \(\rightarrow\) uses product-level revenue, if unavailable uses trade value, then employment, then equal split.

  • proportional_alloc (optional)
    Variant of weighted_alloc where weights are normalized within each group.
    Usually equivalent to weighted_alloc in implementation.

This design ensures that numerical variables preserve total consistency, while categorical and boolean fields retain logical coherence during aggregation and disaggregation.
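As an illustration of how such method strings can be interpreted, the sketch below parses a weighted_avg:&lt;var1&gt;|&lt;var2&gt; specification and applies the fallback logic. This is a minimal, hypothetical sketch in base R, not the actual mdi_keyID_switch.R implementation; the function name apply_weighted_avg and the plants data are invented for the example.

```r
# Hypothetical sketch of the weighted_avg:<var1>|<var2> fallback logic;
# NOT the actual mdi_keyID_switch.R implementation.
apply_weighted_avg <- function(dt, value, spec) {
  # spec looks like "weighted_avg:emp|rev"
  candidates <- strsplit(sub("^weighted_avg:", "", spec), "|", fixed = TRUE)[[1]]
  # the first candidate weight that exists and has non-NA values is used
  w <- NULL
  for (cand in candidates) {
    if (cand %in% names(dt) && any(!is.na(dt[[cand]]))) { w <- dt[[cand]]; break }
  }
  if (is.null(w)) return(mean(dt[[value]], na.rm = TRUE))  # fallback: simple mean
  weighted.mean(dt[[value]], w, na.rm = TRUE)
}

plants <- data.frame(prod = c(2, 4), emp = c(10, 30), rev = c(100, 200))
apply_weighted_avg(plants, "prod", "weighted_avg:emp|rev")
# (2*10 + 4*30) / 40 = 3.5
```

If the emp column were absent or entirely NA, the same call would fall through to rev as the weight, mirroring the candidate-order rule described above.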

Below is a sample of five rows from the table:

MD_varname MD_dataset aggregation_method disaggregation_method
rdemp SBS sum weighted_alloc:BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal
start_nace BR mode replicate
distr_heat_noenerg ENER sum weighted_alloc:SBS.nv|SBS.emp|BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal
inpdgd CIS any replicate
fte SBS sum weighted_alloc:BS.total_assets|PRODCOM.rev|ITGS.ntrade|BR.persons_br|equal
*Note: Only 5 rows are displayed.

3.2.3 Specifications for metadata needed for the NSI to MD harmonization

  • Harmonization of MD panels entails harmonization of units of observation, variable definitions, and variable values.

  • The key to harmonization is NSI metadata, MD metadata, and NSI to MDI concordances.

  • The MD standard metadata is developed ‘iteratively’ and can evolve as countries join and as new MDI research users and MDI launches bring different data requirements.

    • The MD metadata and NSI to MDI concordances allow live updates of the MDI data documentation.
  • Mapping units of ‘firms’, enterprises, legal units requires knowledge of NSI source data: registers, (weighted) sampling, sample designs.

  • Harmonizing variable definitions and nomenclature is done through renaming, revaluing or combining NSI variables.

    • In the *NSI*\_MD\_conc.csv file, information is available to show how an MD variable (from a particular MD dataset) is generated from NSI variables, through the harmonization operations remap, revalue, or redefine.
  • Harmonizing values of classification variables is done by reclassifying values over time to MD standard.

    • A concordance for each NSI classification version to the MDI standard is needed. Each observed value of the classification code in rawdata needs to be mapped to the MD classification, otherwise the raw data observations are lost. This is done using the concordance file *NSI*_*classname*_MD_*classname*_classconc.csv.
  • Harmonizing categorical variables is done by recoding between conforming values from codebooks.

    • To harmonize data values for categorical variables, a concordance file *NSI*_MD_codeconc.csv is used.
  • To concord other data values (currency units, date values), R functions are used to revalue.

    • E.g. if the domain of the variable is 1000 in the NSI data and 1 in the MD data, the NSI value is multiplied by 1000. If the NSI value is an R date value, say %d%m%Y, an R date function is used to convert it to the required R date value.
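Concretely, the two revalue conversions just described reduce to one-line R expressions (illustration only, with hypothetical values):

```r
# Unit conversion: NSI domain in thousands, MD domain in units
x <- c(1.2, 0.5)   # hypothetical values in thousands
x * 1000           # detail: x*1000 -> 1200 500

# Date conversion: NSI value stored as %d%m%Y, MD requires the year
format(as.Date("31122005", "%d%m%Y"), "%Y")   # -> "2005"
```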

Storing an MD_varname in an MD panel

Only remap and redefine rows store columns in the final MD panel. Hence, for an MD variable to be present in the output data, one of these two methods must be used.

The reason is that revalue, recode and reclass only change the content of the NSI_varname, since the NSI_varname on which the operation has been applied could be used in multiple mappings, be it a remap or a redefine, within the same concordance year.

Therefore, if you would like to store an MD_varname after a revalue, recode or reclass operation, make sure you add a row for the same NSI_varname-MD_varname mapping with method=remap.

3.2.3.1 Concordance file – NSI_MD_conc.csv

This file lists all the variables in a particular MD panel, with information on how to map the NSI variables from one or more raw datafiles (often one per year) to the MD variable. The related table has the following column names:

[1] MD_dataset,MD_varname,NSI_dataset,NSI_varname,method,detail,year

where:

  • NSI_dataset is the generic name of the data that, together with year, specifies the NSI datafile hosting the variable to be used in concording. If year is empty, the concordance does not change over the years.

  • year is the year for which the concordance holds. If the mapping involves an NSI_datafile which is a panel, the column needs to be filled just with the first year available. If the NSI_datafile is a cross-section, the column needs to be filled with the year it is referenced to (in other words, there has to be one set of mapping rows per NSI_datafile).

  • NSI_varname is the name of the variable in the NSI datafile.

  • MD_dataset is the name of the MD firm-level panel dataset, i.e. BR, SBS, etc.

  • MD_varname is the name of the variable in the MD data source to be generated. Make sure the year variable is not included in the concordance table, since it is harmonized separately by the infrastructure.

  • method is the method used to harmonize the data. The value of this categorical variable determines how the harmonized variable MD_varname is generated.

    • revalue The values of the variable are changed using an R function and parameters in the column detail, and possibly the class and domain variables from the relevant _varnames files.
    • recode The values of the variable are changed using a codebook concordance, whose name is given in the detail column, e.g. ‘NSI_filename_MD_dataset_codeconc.csv’. Only values that need to be changed require a row in the _codeconc.
    • reclass The values of the variable are changed using a classification concordance, whose name is given in the detail column, ‘NSI_classname_MD_classname_classconc.csv’. This is used to reclassify e.g. industry or region classifications.
    • remap The name of the variable is changed, in a one-to-one mapping from NSI_varname to MD_varname.
    • redefine The MD variable is generated as a linear combination of the NSI variables. The column detail specifies the linear combination (i.e. ‘+’ or ‘-’) in the many-to-one NSI_varname to MD_varname mapping.
  • detail contains the function for revalue, the concordance filename for recode (codebook) and reclass (classification), and the linear operations for redefine. For revalue, any valid operation on the NSI_varname (referred to as x) is allowed. If the domain of the variable is 1000 in the NSI data and 1 in the MD data, the NSI value is multiplied by 1000, so detail = x*1000. If the NSI value is an R date value, say %d%m%Y, an R date function is used to convert it to the required R date value: format(as.Date(x, "%d%m%Y"), "%Y")

  • NSI_datafile is the name of the raw dataset from which the NSI_varname is taken

  • year is the reference year for that specific row, which will be used to construct the MD cross-section for that year

Storing the year MD_varname in the data

The year variable for each MD_dataset is automatically mapped to the harmonized data based on the metadata and the year value assigned for the corresponding concordance table rows. Hence, please do not add any row in the concordance table where MD_varname = 'year'.

3.2.3.1.1 Data preprocessing

Given that some raw datafiles require specific preprocessing, in very special cases some NSI_varnames might end up being different from those appearing in the file NSI_varnames. Hence, if you would like to concord variables from a dataset for which preprocessing is needed – which you can verify by looking at the preprocessing column of that datafile in NSI_datafiles – please keep this in mind. For more information, check the box on datafile preprocessing or get in touch with the MDI team.

3.2.3.1.2 Examples by harmonization method

As noted above, a revalue row in the concordance table simply transforms the content of the raw data’s column, without changing the column name to the desired MD_varname. For instance, say you want to remove all dots from a string in raw variable var1 from NSI_dataset data_firm for year 2010. To do that, we add a row to the concordance table which looks as follows:

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
data_firm 2010 var1 BR nace revalue gsub("\\.", "", x)

In practical terms, this operation will transform the raw datafile from

var1
10.40
20.59
01.45
32.10

to

var1
1040
2059
0145
3210
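The same transformation can be reproduced in plain R; note that the dot must be escaped in the regular expression:

```r
var1 <- c("10.40", "20.59", "01.45", "32.10")
gsub("\\.", "", var1)
# -> "1040" "2059" "0145" "3210"
```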

A reclass/recode row in the concordance table maps the values of a categorical variable (be it a class or codebook variable) to some specified objective values, as indicated in the corresponding class/codeconc table. For instance, say that you want to change the mapping of categorical variable var2 from NSI_dataset survey_firm for year 2012. The raw variable can take values 1, 2 and 9, which link to ‘yes’, ‘no’, ‘not available’. To do that, we assign recode (given that this is a codebook variable; we would indicate reclass in case of a class variable) in the method column, as follows

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
survey_firm 2012 var2 CIS inpdsv recode NSI_MD_codeconc

The harmonization tool will open the codeconc file and transform the values as shown below

1 → 0

2 → 1

9 → 9

In practical terms, this operation will transform the raw datafile from

var2
1
9
2
1

to

var2
0
9
1
0
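One minimal way to express such a codeconc mapping in R is a named lookup vector; this is an illustration only, not the harmonization tool's actual code:

```r
# codeconc mapping: left -> right (1 -> 0, 2 -> 1, 9 -> 9)
codeconc <- c("1" = "0", "2" = "1", "9" = "9")
var2 <- c("1", "9", "2", "1")
unname(codeconc[var2])
# -> "0" "9" "1" "0"
```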

A redefine row in the concordance table aggregates two or more NSI_varnames to create an objective MD_varname. As mentioned, the aggregation function is not restricted to a specific form: it can be a sum or subtraction of all non-NA values of the raw variables (detail = + or -) or a custom function (detail = fn('content of the function in R syntax')).

For example, if we want to sum the values of var3, var4 and var5 from datafile 2005_bs to create MD_varname nv, we add the following rows to the table

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
2005_bs 2005 var3 BS nv redefine +
2005_bs 2005 var4 BS nv redefine +
2005_bs 2005 var5 BS nv redefine +

This operation will transform the raw datafile from

var3 var4 var5
12 4 15
NA 2 16
9 32 19
8 14 NA

to the following (the original raw variables are also removed):

nv
31
18
60
22
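The additive redefine above behaves like a row-wise sum over the non-NA raw values, as this base-R sketch shows:

```r
raw <- data.frame(var3 = c(12, NA, 9, 8),
                  var4 = c(4, 2, 32, 14),
                  var5 = c(15, 16, 19, NA))
rowSums(raw, na.rm = TRUE)   # row totals: 31 18 60 22
```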

On the other hand, if we want to sum var3 and var4 and divide the result by var5, we need to build a custom function, as in the concordance rows below:

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
2005_bs 2005 var3 BS nv redefine fn((var3+var4)/var5)
2005_bs 2005 var4 BS nv redefine fn((var3+var4)/var5)
2005_bs 2005 var5 BS nv redefine fn((var3+var4)/var5)

This operation will transform the raw datafile from

var3 var4 var5
12 4 15
NA 2 16
9 32 19
8 14 NA

to the following (the original raw variables are also removed):

nv
1.067
NA
2.158
NA
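In contrast to the additive case, the custom function evaluates the expression as written, so NA values propagate rather than being dropped:

```r
raw <- data.frame(var3 = c(12, NA, 9, 8),
                  var4 = c(4, 2, 32, 14),
                  var5 = c(15, 16, 19, NA))
round(with(raw, (var3 + var4) / var5), 3)
# -> 1.067 NA 2.158 NA
```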

A remap row in the concordance table assigns the name of a given MD_varname to an NSI_varname without changing the values of the variable itself. This method is typically used to store variables in the objective MD panel either unchanged or after they were subject to a revalue or recode/reclass operation.

For example, if we want to store var6 from datafile ener_2001 as MD_varname firmid, we add the following row to the table:

NSI_dataset year NSI_varname MD_dataset MD_varname method detail
ener_2001 2001 var6 ENER firmid remap

This operation will transform the raw datafile from

var6
nwejn
aios2
cjnje
29hbd

to

firmid
nwejn
aios2
cjnje
29hbd
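In R terms, a remap is just a column rename; a base-R sketch:

```r
df <- data.frame(var6 = c("nwejn", "aios2", "cjnje", "29hbd"))
names(df)[names(df) == "var6"] <- "firmid"
names(df)
# -> "firmid"
```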

An example of the table for a few variables needed for the Slovenian harmonized MD BR for year 2007 (column year is omitted):

MD_dataset MD_varname NSI_dataset NSI_varname method detail
BR firmid MIKRO_PRS_razST MS10_razST remap
BR entgrp MIKRO_PRS_razST MS10_IZP_MS7_razST remap
BR birthyr MIKRO_PRS_razST Datum_prv_vnosa revalue as.Date(as.character(x), "%d.%m.%Y")
BR exityr MIKRO_PRS_razST Datum_izbrisa revalue as.Date(as.character(x), "%d.%m.%Y")
BR nace MIKRO_PRS_razST Skd remap
BR soe MIKRO_PRS_razST Vrsta_lastnine recode SI_MD_codeconc
BR birthyr MIKRO_PRS_razST Datum_prv_vnosa remap
BR exityr MIKRO_PRS_razST Datum_izbrisa remap
BR soe MIKRO_PRS_razST Vrsta_lastnine remap
BR nace MIKRO_PRS_razST Skd revalue sub("^(\\d{2})\\.(\\d{2})\\d$", "\\1\\2", x)

3.2.3.2 NSI_MD_codeconc.csv

[1] NSI_dataset,year,NSI_varname,MD_dataset,MD_varname,left,right

where:

  • NSI_dataset is the generic name of the data that, together with year, specifies the NSI datafile hosting the variable to be used in concording. If year is empty, the concordance does not change over the years.

  • year is the year for which the concordance holds. If empty, the same concordance rows are used for all NSI datafiles associated with the generic NSI_dataset.

  • NSI_varname is the name of the variable of the specific NSI datafile associated with dataset and year.

  • MD_varname is the name of the corresponding MD variable.

  • left gives the valid values of the categorical variable in the raw NSI dataset.

  • right gives the corresponding MDI dataset values to map.

Note

If the mapping of a categorical variable already corresponds to that of the objective MD_varname, then there’s no need to add the related row in the codeconc. For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112, a categorical variable taking the value ‘0’ to mean ‘no’. Let the variable be mapped to MD_varname IACC from MD_dataset ICTEC. Given that, as indicated in the MD metadata, code ‘0’ is linked to ‘no’ for this MD_varname, we don’t need to add any row for this specific mapping.

However, if the other mappings don’t correspond, the rows in the codeconc file need to be present for those!

Dealing with the year column

As mentioned, if a specific mapping holds for all years available of a specific NSI_dataset, then the year column for that mapping needs to be empty.

For instance, say that we have raw NSI_dataset data_ictec with NSI_varname var112, a categorical variable taking values ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ for all years. This variable will be harmonized to IACC of the MD_dataset ICTEC. In this case, the NSI_MD_codeconc will contain the rows for these mappings without any reference to the years. In practice:

NSI_dataset NSI_varname year MD_dataset MD_varname left right
data_ictec var112 ICTEC IACC 2 1
data_ictec var112 ICTEC IACC 999 NA

Note that the mapping for ‘0’ - ‘no’ is missing given that it already corresponds to the objective MD mapping.

That said, if a mapping is not constant across all years of an NSI_dataset, then the year column needs to have a value for all mappings reported. In this context, there can be two cases:

  1. The codes differ by year for the same variable: var112 takes values, say, ‘0’, ‘2’, ‘999’, referring to ‘no’, ‘yes’, ‘not available’ in 2006, and in 2007 the mapping changes to ‘1’, ‘2’, ‘9’ for ‘no’, ‘yes’, ‘not available’, respectively. Hence, we need to indicate year-specific mappings in the codebook table:
NSI_dataset NSI_varname year MD_dataset MD_varname left right
data_ictec var112 2006 ICTEC IACC 2 1
data_ictec var112 2006 ICTEC IACC 999 NA
data_ictec var112 2007 ICTEC IACC 1 0
data_ictec var112 2007 ICTEC IACC 2 1
data_ictec var112 2007 ICTEC IACC 999 NA
  2. The NSI_varname isn’t available for all years: As an example, let var112 be available only in 2005 and 2006, while it is dropped or has a different name in the other years. Then, we need to include it for both 2005 and 2006, regardless of whether the codes are identical:
NSI_dataset NSI_varname year MD_dataset MD_varname left right
data_ictec var112 2005 ICTEC IACC 2 1
data_ictec var112 2005 ICTEC IACC 999 NA
data_ictec var112 2006 ICTEC IACC 2 1
data_ictec var112 2006 ICTEC IACC 999 NA
Missing codebook entry in MD_codebook

As the right column of an NSI_MD_codeconc file needs to have entries that are present in the MD_codebook, there can be cases in which no corresponding value can be found between a categorical value in a country’s dataset and the MD_codebook. For example, say that a very specific value for variable unit of MD dataset PRODCOM is available in the data of a country and no corresponding value can be found in the MD codebook. In that case, please reach out to the MDI team, as we might consider adding that value to the MD_codebook.

Below is an example of the codebook concordance table for Portugal:

NSI_dataset NSI_varname year MD_dataset MD_varname left right
ifats imputeifats NA IFATS imputed 1 0
ifats imputeifats NA IFATS imputed 2 1
itgs exim NA ITGS exim 1 0
itgs exim NA ITGS exim 2 1
itgs imputeitgs NA ITGS imputed 1 0
*Note: Only 5 rows are displayed.
3.2.3.2.1 NSI_classname_MD_classname_classconc.csv
[1] year,left,right

where:

  • left is the code in the current NSI classification code list for variable NSI\_classname.

  • right is the corresponding code in the MD classification code list the user wants to concord to.

  • year is the year for which the concordance holds.

Below is a sample of the concordance from the raw data’s common nomenclature code list (left) to the harmonized one (right).

year left right
2005 61019090 61019090D
2004 72249014 72249014
2005 09104090 09104090D
2004 85407200 85407200D
2005 84145910 84145910D
*Note: Only 5 rows are displayed.

3.2.4 Data documentation and MDI implementation process in phases

Constructing the necessary metadata for the raw data and the concordance tables needed to produce the harmonized MD datasets is a particularly long and tedious process.

A necessary requirement for installing the MDI in a country’s remote environment is sufficiently large RAM on the server. Naturally, the amount of RAM needed depends on the size of the data. An indicative measure is the ratio of RAM to the number of observations in the BR, which should be approximately 2 or larger.

To make the process more manageable and to give it more structure, we developed the following timeline, divided into eight phases:

Metadata/concordance table construction phases and deliverables
Phase Completed files
I
  • raw files cleanup (paths, list of files needed, etc.)

  • NSI_datafiles

  • firm unit analysis

  • unique keys

  • detailed information on disclosure rules of the NSI

II
  • Phase I

  • NSI_varnames

III
  • Phase II

  • NSI_codebook (just for BR, BS (if available), SBS)

  • NSI_class (just for BR, BS (if available), SBS)

IV
  • Phase III

  • skeleton NSI_MD_conc (just for BR, BS (if available), SBS)

V
  • Phase IV
  • import script that harmonizes variables
  • harmonize BR, BS (if available), SBS
  • import
    • MDI CN module and execute it on BR, BS (if available) and SBS in R
    • Stata packages
    • questionnaire for the CompNet variables
    • country file
    • the CompNet Stata files and run them on Stata for a subset of indicators
  • extract the tables
VI
  • Phase V

  • NSI_codebook (for the remaining datasets)

  • NSI_class (for the remaining datasets)

  • NSI_MD_codeconc for non-surveys

  • NSI_MD_classconc for non-surveys

VII
  • Phase VI

  • NSI_MD_codeconc for surveys (CIS, ICTEC, …), sequentially (from easier to more complex)

  • NSI_MD_classconc for surveys (CIS, ICTEC, …), sequentially (from easier to more complex)

  • NSI_MD_conc (excluding ITGS and PRODCOM)

VIII
  • Phase VII

  • NSI_MD_conc for ITGS and PRODCOM

  • timeconc

  • firm ID conc

  • various cleaning/leftovers

Each phase refers to the construction of a specific file, as described in the above sections. There are a few elements that haven’t been explicitly mentioned yet. A brief explanation is provided below:

  • Raw files cleanup: making sure the raw files directory is tidy and usable

  • Firm unit analysis: most granular firm identifier (plant, legal unit, enterprise, enterprise group) of each raw file (see the NSI_datafiles section to check the list of possible units)

  • Unique keys: the uniquely identifying columns of each raw file (see the NSI_varnames section)

  • Disclosure rules: detailed description of the disclosure routines in place in the NSI

  • import script that harmonizes variables: after Phase IV, given that the NSI_MD_conc file mappings for BR, BS (if available) and SBS are ready, those files can be harmonized. To do so, we don’t import the whole infrastructure (yet); instead, we provide you with a script that reads your metadata files and the relevant raw data and produces the MD panels.

  • upload MDI CN module, execute it on BR, BS and SBS, and import the CompNet-related files (Phase V): after harmonizing BR, BS (if available) and SBS, we ask you to import a few more files.

    • CN module: This script produces some files under a specific directory

    • questionnaire: The Questionnaire is an Excel file that contains:

      1. paths
      2. variable names
      3. confidentiality routine settings
      These fields need to be filled in; in particular, the variable names must reflect the country-specific variable mapping.
    • country file: A Stata .dta file at country-year-industry2d-sizeclass level, which contains

      1. population firm numbers from Eurostat
      2. industry-level deflators from Eurostat/EU KLEMS/AMECO
      3. some additional measures from public sources (e.g. 10year government bond yields from Eurostat)
      4. one predefined measure from us (i.e. not from public sources; if this is an issue we could leave it out)
    • Stata files for CompNet: The Stata (.do) files take as input the output of the CN module, the questionnaire and the country file, and produce a limited first version of some CompNet indicators

Important

Executing the CompNet .do files requires that

  1. Stata can be used in the same remote environment as the MDI
  2. The NSI agrees to export the CompNet indicators and have them published to third parties

The timeline is based on our past experience. It is meant first and foremost as a help when creating metadata and concordance tables from scratch.

3.2.5 Constructing NSI metadata

This section is a guide on how to build the NSI metadata files. It references scripts that can assist users in creating the files from scratch. Naturally, using these scripts alone is not enough, as many fields of the tables need to be specified manually.

1. Constructing NSI_datafiles.csv

This script scans the raw data directory in the protected NSI environment and builds the metadata table required by the CompNet rocket. It creates the *_datafiles.csv file according to the specification mentioned above.

The script can be found in this dropdown menu:

# NSI_datafiles builder (spec §3.2.1.1)
# author: AM-MM, date: 2025-09-29

library(data.table)
library(stringr)

# ---- Inputs you must define upstream ----
# dirINPUTDATA: the NSI data root (manual's reference base dir)
# dirROCKET:    rocket root to use for storing NSI metadata
# CountryCode:  2-letter ISO code (e.g., "IT")
file_path <- dirINPUTDATA  # use the actual base for relative paths

# ---- List files (absolute) ----
abs_files <- list.files(file_path, recursive = TRUE, full.names = TRUE)

# keep only files (exclude dirs)
abs_files <- abs_files[file.info(abs_files)$isdir == FALSE]

# ---- Build table ----
DT <- data.table(abs_path = abs_files)

# relative path to dirINPUTDATA (allow trailing slash in file_path; escape regex)
file_path_norm <- normalizePath(file_path, winslash = "/", mustWork = FALSE)
file_path_esc  <- gsub("([\\^$.|?*+()\\[\\]{}\\\\])", "\\\\\\1", file_path_norm)
DT[, rel := sub(paste0("^", file_path_esc, "/?"), "", normalizePath(abs_path, winslash="/"))]

# split dir / filename / extension
DT[, filename      := basename(rel)]
DT[, path          := dirname(rel)]
DT[, format        := tools::file_ext(filename)]
DT[, NSI_datafile  := tools::file_path_sans_ext(filename)]   # spec name

# ---- Derive NSI_dataset (generic) by stripping all 4-digit years and separators ----
DT[, NSI_dataset := NSI_datafile |>
     str_remove_all("\\d{4}") |>
     str_replace_all("[-_.]+", "_") |>
     str_replace_all("^_|_$", "") |>
     str_to_lower()
]

# ---- Years: extract all 4-digit tokens from the filename.
# Take min/max if present; if none found, leave NA.
# IMPORTANT: always double-check that year_start / year_end are correct,
# especially for files named like "bd2018" (single year) or with unusual patterns.
extract_years <- function(x) as.integer(str_extract_all(x, "\\d{4}")[[1]])
yrs <- lapply(DT$NSI_datafile, extract_years)
DT[, year_start := vapply(yrs, function(v) if (length(v)) min(v) else NA_integer_, integer(1))]
DT[, year_end   := vapply(yrs, function(v) if (length(v)) max(v) else NA_integer_, integer(1))]

# ---- Fields to be filled manually or via later tools ----
DT[, yearvar      := NA_character_]   # name of year column if panel; else ""/NA
DT[, details      := NA_character_]
DT[, firm_unit    := NA_character_]   # one of: plant, legal_unit, enterprise, enterprise_group
DT[, data_source  := NA_character_]   # one of: survey, administrative_source, mixed
DT[, firm_sample  := NA_character_]
DT[, preprocessing:= NA_character_]   # instruction string per §3.2.4–3.2.9

# ---- Final spec order EXACTLY as in the manual ----
out_cols <- c(
  "NSI_datafile","NSI_dataset","yearvar","year_start","year_end",
  "format","path","details","firm_unit","data_source","firm_sample","preprocessing"
)
NSI_datafiles <- DT[, ..out_cols]

# ---- Write CSV ----
outdir <- file.path(dirROCKET, "NSImetadata")
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
fwrite(NSI_datafiles, file.path(outdir, paste0(CountryCode, "_datafiles.csv")))

Steps:

  1. Set inputs at the top of the script
    • dirINPUTDATA: the base directory containing the raw data.
    • dirROCKET: the base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO country code (e.g. "IT").
  2. Run the script
    The script will automatically collect:
    • NSI_datafile (filename without extension)
    • NSI_dataset (generic dataset name, stripped of years and underscores)
    • year_start / year_end (from 4-digit tokens in the filename)
    • format (file extension)
    • path (relative path to dirINPUTDATA)
  3. Fields requiring manual completion
    The following columns are created but left empty (NA). They must be filled manually by the NSI team:
    • yearvar: name of the year column if the file is a panel (leave empty otherwise).
    • details: clarifications on dataset coverage or specific notes.
    • firm_unit: one of {plant, legal_unit, enterprise, enterprise_group}.
    • data_source: one of {survey, administrative_source, mixed}.
    • firm_sample: information on the sampling scheme.
    • preprocessing: description of preprocessing steps, if any (see section below).
  4. Double-check the automatic fields
    • Years: the script extracts all 4-digit numbers from the filename and assigns the minimum to year_start and the maximum to year_end.
      ⚠️ Always verify that these values match the actual time coverage of the dataset. For example, bd2018 will yield both year_start=2018 and year_end=2018, which may or may not be correct.
    • NSI_dataset: confirm that the generic dataset name is harmonised and consistent across files.
  5. Export
    The script writes the CSV to &lt;dirROCKET&gt;/NSImetadata/&lt;CountryCode&gt;_datafiles.csv

Key reminders

  • The script is a first pass only: it automates extraction of filenames, formats, and candidate years.
  • Most metadata fields must be filled manually by the NSI staff who know the data.
  • Always double-check the final file before uploading to the rocket.
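The year-extraction heuristic behind step 4 can be checked in isolation. The function below is a base-R equivalent of the script’s stringr-based extract_years and illustrates why files like bd2018 need manual verification:

```r
# Base-R equivalent of the script's extract_years helper:
# pull every 4-digit token out of a filename stem
extract_years <- function(x) {
  as.integer(regmatches(x, gregexpr("\\d{4}", x))[[1]])
}

extract_years("sbs_2005_2019")  # -> 2005 2019  (year_start=2005, year_end=2019)
extract_years("bd2018")         # -> 2018       (single token: start = end = 2018)
extract_years("bs_data")        # -> integer(0) (year_start/year_end stay NA)
```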

2. Constructing NSI_varnames.csv

This script produces the *_varlist.csv files, one for each dataset listed in NSI_datafiles.csv.
It should always be run after the NSI_datafiles script.

The script can be found in this dropdown menu:

# Script to generate NSI_varnames metadata (per §3.2.1.2 of manual)
# author EB-MH-MM-AM, rev 2025-09-29

library(data.table)
library(dplyr)
library(readxl)
library(tools)

# ---- Inputs you must define upstream ----
# dirINPUTDATA: the NSI data root (manual's reference base dir)
# dirROCKET:    rocket root to use for storing NSI metadata
# CountryCode:  2-letter ISO code (e.g., "IT")

# ---- Inputs ----
md_folder   <- file.path(dirROCKET, "NSImetadata")
NSI_datafiles <- fread(file.path(md_folder, paste0(CountryCode, "_datafiles.csv")))
file_path  <- dirINPUTDATA  # base folder for raw data

# Helper: test if a set of columns uniquely identifies rows
is_key_id <- function(data, cols) {
  n_distinct <- data %>% select(all_of(cols)) %>% distinct() %>% nrow()
  return(n_distinct == nrow(data))
}

# Process per dataset
for (DS in unique(NSI_datafiles$NSI_dataset)) {
  
  NSI_datafiles_filtered <- unique(NSI_datafiles[NSI_dataset == DS,])
  
  # Build file paths
  abs_paths <- file.path(file_path,
                         NSI_datafiles_filtered$path,
                         paste0(NSI_datafiles_filtered$NSI_datafile, ".", NSI_datafiles_filtered$format))
  
  # Load files
  file_list <- lapply(abs_paths, function(f) {
    import_data(dir = dirname(f), file = basename(f), format = file_ext(f))
  })
  
  var_names_list <- lapply(file_list, function(df) data.table(NSI_varname = names(df)))
  
  for (i in seq_along(var_names_list)) {
    rawdata <- file_list[[i]]
    
    # Load variable descriptions
    desc_file <- file.path(file_path,
                           NSI_datafiles_filtered$path[i],
                           paste0(NSI_datafiles_filtered$NSI_dataset[i], "_descr.csv"))
    
    if (!file.exists(desc_file)) {
      stop(paste("Description file not found:", desc_file,
                 "Please create it as required by the manual."))
    }
    description <- fread(desc_file)
    
    if (!"NSI_varname" %in% colnames(description)) {
      stop("Description file must contain a column named 'NSI_varname'")
    }
    
    # Add class, domain, NSI_datafile
    var_names_list[[i]]$class <- sapply(rawdata, function(x) paste(class(x), collapse=","))
    var_names_list[[i]]$domain <- NA_character_  # manual input required
    var_names_list[[i]]$NSI_datafile <- file_path_sans_ext(basename(abs_paths[i]))
    
    # Merge with descriptions
    var_names_list[[i]] <- merge(var_names_list[[i]], description,
                                 by = "NSI_varname", all.x = TRUE)
    
    # Report missing variables in description
    missing_vars <- setdiff(names(rawdata), description$NSI_varname)
    if (length(missing_vars) > 0) {
      message("Missing variable descriptions for ", var_names_list[[i]]$NSI_datafile[1], ": ",
              paste(missing_vars, collapse=", "))
    }
    
    # Identify key variables
    colnms <- colnames(rawdata)
    max_cols <- min(4, length(colnms))
    found <- NULL
    for (k in 1:max_cols) {
      for (comb in combn(colnms, k, simplify = FALSE)) {
        if (is_key_id(rawdata, comb)) {
          found <- comb
          break
        }
      }
      if (!is.null(found)) break
    }
    var_names_list[[i]]$is_key <- var_names_list[[i]]$NSI_varname %in% found
    
    message("++ ", var_names_list[[i]]$NSI_datafile[1], " processed.")
  }
  
  stacked_df <- bind_rows(var_names_list)
  
  # Final column order per manual
  stacked_df <- stacked_df[, c("NSI_datafile","NSI_varname","description","is_key","class","domain")]
  
  # Export
  fwrite(stacked_df, file.path(md_folder, paste0(CountryCode, "_", DS, "_varlist.csv")))
  message("List for dataset ", DS, " exported.")
}

Steps:

  1. Inputs required
    • dirROCKET: base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO code (e.g. "IT").
    • dirINPUTDATA: main folder containing the raw NSI data.
  2. Run the script
    For each dataset in NSI_datafiles.csv, the script will:
    • Load the raw files listed for that dataset.

    • Extract the variable names (NSI_varname).

    • Read the corresponding description file <dataset>_descr.csv (must be provided by the NSI).

    • Record the variable class (data type).

    • Attempt to infer which variables form a key (is_key).

    • Create an empty domain column to be filled manually.

    • Export the compiled metadata to:

      <dirROCKET>/NSImetadata/<CountryCode>_<dataset>_varlist.csv
  3. Fields requiring manual completion
    • description: ensure that the description file is complete and correctly labelled.
    • domain: must always be filled manually (see manual §3.2.1.2 for details).
    • is_key: the automatic detection may fail or give false positives. Double-check and adjust manually.
  4. Double-check the automatic fields
    • Verify that all variables in the raw data are listed in the description file.
      Missing variables are reported in the console when running the script.
    • Confirm that the class column is meaningful and consistent.

Key reminders

  • A description file <dataset>_descr.csv is mandatory. If missing, the script stops with an error.
  • The is_key detection is heuristic. Always verify manually which variables uniquely identify records.
  • The domain classification cannot be automated. It must be completed by the NSI staff.
  • Always inspect the final *_varlist.csv files before uploading them to the rocket.
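
The is_key heuristic can be double-checked by hand on any raw file. A minimal sketch (the data and column names below are hypothetical):

```r
library(data.table)

# Hypothetical raw file: does (firmid, year) uniquely identify rows?
rawdata <- data.table(firmid = c("A", "A", "B"),
                      year   = c(2019, 2020, 2020),
                      sales  = c(10, 12, 7))

# A column set is a valid key if the number of distinct
# combinations equals the number of rows
is_key <- uniqueN(rawdata, by = c("firmid", "year")) == nrow(rawdata)
is_key  # TRUE: every (firmid, year) pair appears exactly once
```

Run the same check with the key columns reported in the *_varlist.csv before accepting the automatic result.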

3. Constructing NSI_class.csv

This script produces the classification metadata files *_class.csv required by the rocket.
It should always be run after the NSI_datafiles script.

The full script is shown below:

# Script to generate NSI_class metadata (per §3.2.1.3 of manual)
# Run only after NSI_datafiles.R

library(readr)
library(dplyr)
library(stringr)

# ---- Inputs ----
CountryCode <- "IT"        # set your 2-letter code
dirROCKET   <- "your_dir"  # base folder for rocket
dirINPUTDATA <- "your_data_folder"  # main raw data folder

# Load metadata from NSI_datafiles
data_files <- read_csv(file.path(dirROCKET, "NSImetadata", paste0(CountryCode, "_datafiles.csv")),
                       show_col_types = FALSE)

# ---- Specify dataset and classification variable ----
class_dataset <- "your_dataset"         # must match NSI_dataset in datafiles
class_name    <- "name_class_variable"  # e.g. "NACE2"

NSI_datafiles_filtered <- filter(data_files, NSI_dataset == class_dataset)

if (nrow(NSI_datafiles_filtered) == 0) {
  stop("Dataset not found in NSI_datafiles: ", class_dataset)
}

# ---- Function to read classification data ----
extract_columns <- function(file_path) {
  data <- read_csv(file_path, show_col_types = FALSE)
  
  required_columns <- c(class_name, "year", "description")
  if (!all(required_columns %in% names(data))) {
    stop("Missing one or more required columns in: ", file_path,
         ". Expected: ", paste(required_columns, collapse=", "))
  }
  
  out <- select(data, all_of(required_columns))
  # Rename classification variable to generic name 'classvar'
  colnames(out)[1] <- "classvar"
  return(out)
}

# ---- Process files ----
results <- list()

for (i in seq_len(nrow(NSI_datafiles_filtered))) {
  f <- file.path(dirINPUTDATA,
                 NSI_datafiles_filtered$path[i],
                 paste0(NSI_datafiles_filtered$NSI_datafile[i], ".",
                        NSI_datafiles_filtered$format[i]))
  
  if (!file.exists(f)) {
    message("File not found: ", f)
    next
  }
  
  results[[i]] <- extract_columns(f)
}

# ---- Combine and export once per dataset ----
# Writing inside the loop would overwrite the output whenever a dataset
# is split across several raw files, so stack the pieces first
extracted <- distinct(bind_rows(results))

# enforce column order
extracted <- extracted[, c("classvar", "year", "description")]

output_file <- file.path(dirROCKET, "NSImetadata",
                         paste0(CountryCode, "_", tolower(class_dataset), "_class.csv"))

write_csv(extracted, output_file)
message("Exported classification metadata to ", output_file)

Steps

  1. Inputs required
    • dirROCKET: base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO code (e.g. "IT").
    • dirINPUTDATA: main folder containing the raw NSI data.
    • class_dataset: the dataset where the classification variable is found (must match an NSI_dataset in NSI_datafiles.csv).
    • class_name: the name of the classification variable (e.g. "nace").
  2. Run the script
    For the specified dataset, the script will:
    • Load the raw files linked to the dataset.

    • Extract three required fields:

      • classvar (the classification variable, renamed from the raw variable class_name),
      • year (validity year),
      • description (text label of the classification code).
    • Export the results into:

      <dirROCKET>/NSImetadata/<CountryCode>_<dataset>_class.csv
  3. Fields requiring manual completion / verification
    • Ensure that the classification variable chosen (class_name) matches the raw file.
    • Check that year and description columns exist and are correctly populated in the raw data.
    • Confirm that the classvar column has been renamed properly and contains only the classification codes.
  4. Double-check the automatic fields
    • The script will stop if any of the required columns (class_name, year, description) are missing.
    • Even if the file is created, NSIs must review the exported *_class.csv carefully to verify that:
      • year corresponds to the reference period of the classification.
      • description correctly describes each classification code.
      • No codes are missing or duplicated.

Key reminders

  • Each dataset that includes a classification variable must have a corresponding *_class.csv file.
  • Column order in the final CSV must be exactly: classvar, year, description
  • Always inspect the final file manually before uploading it to the rocket.
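
One of the manual checks, duplicated codes within a validity year, can be automated. A sketch, assuming the file follows the required column order (classvar, year, description) and using invented codes:

```r
library(data.table)

# Illustrative *_class.csv content (hypothetical codes)
class_dt <- data.table(classvar    = c("01", "02", "02"),
                       year        = c(2015, 2015, 2015),
                       description = c("Crop production", "Forestry", "Forestry (dup)"))

# Flag classification codes that appear more than once in the same year
dups <- class_dt[, .N, by = .(classvar, year)][N > 1]
if (nrow(dups) > 0) {
  message("Duplicated codes found: ", paste(dups$classvar, collapse = ", "))
}
```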

4. Constructing NSI_codebook.csv

This script produces the *_codebook.csv files required by the rocket.
It should always be run after the NSI_datafiles script.

The full script is shown below:

# Script to generate NSI_codebook metadata (per §3.2.1.4 of manual)
# Produces a single consolidated <CountryCode>_codebook.csv
# Run only after NSI_datafiles.R

library(data.table)
library(tools)

# ---- Inputs ----
CountryCode  <- "IT"              # two-letter code
dirROCKET    <- "your_dir"        # rocket base folder
dirINPUTDATA <- "your_data_folder"

# ---- Import function ----
import_data <- function(file_path) {
  fread(file_path, stringsAsFactors = FALSE)
}

# ---- Helper: detect large digit variation ----
has_large_digits_variation <- function(values, threshold = 1) {
  values <- na.omit(values)
  digits <- nchar(as.character(values))
  digit_diff <- abs(digits - min(digits, na.rm = TRUE)) > 3
  sum(digit_diff) > threshold
}

# ---- Create codebook for one dataset ----
create_codebook <- function(df, dataset_name,
                            max_unique_values = 50,
                            digit_variation_threshold = 1) {
  codebook <- data.table(NSI_dataset = character(),
                         NSI_varname = character(),
                         code = character(),
                         year = character(),
                         description = character())
  
  for (var_name in names(df)) {
    unique_values <- unique(df[[var_name]])
    
    # Skip high-cardinality vars or numerics with wide digit variation
    if (length(unique_values) > max_unique_values ||
        (is.numeric(df[[var_name]]) &&
         has_large_digits_variation(unique_values, digit_variation_threshold))) {
      next
    }
    
    temp_dt <- data.table(
      NSI_dataset = dataset_name,
      NSI_varname = var_name,
      code = as.character(unique_values),
      year = "",                         # ++++ to be reviewed manually ++++
      description = NA_character_        # ++++ to be filled manually ++++
    )
    codebook <- rbind(codebook, temp_dt, fill = TRUE)
  }
  
  return(codebook)
}

# ---- Create single consolidated codebook for all datasets ----
create_codebook_all <- function(rd_folder, md_folder, CountryCode,
                                max_unique_values = 50, digit_variation_threshold = 1) {
  csv_files <- list.files(path = rd_folder, pattern = "\\.csv$", full.names = TRUE)
  all_codebooks <- list()
  
  for (file_path in csv_files) {
    dataset_name <- tools::file_path_sans_ext(basename(file_path))
    dataset <- import_data(file_path)
    cb <- create_codebook(dataset, dataset_name,
                          max_unique_values, digit_variation_threshold)
    all_codebooks[[dataset_name]] <- cb
    message("Processed dataset: ", dataset_name)
  }
  
  # Stack all datasets together
  codebook_all <- rbindlist(all_codebooks, fill = TRUE)
  
  # Enforce manual's column order: NSI_dataset, NSI_varname, code, year, description
  codebook_all <- codebook_all[, c("NSI_dataset","NSI_varname","code","year","description")]
  
  # Export single consolidated file
  output_file <- file.path(md_folder, paste0(CountryCode, "_codebook.csv"))
  fwrite(codebook_all, output_file, quote = FALSE)
  message("Exported consolidated codebook: ", output_file)
}

# ---- Execute ----
md_folder <- file.path(dirROCKET, "NSImetadata")
dir.create(md_folder, showWarnings = FALSE, recursive = TRUE)
create_codebook_all(dirINPUTDATA, md_folder, CountryCode)

Steps

  1. Inputs required
    • dirROCKET: base directory containing the NSImetadata/ folder.
    • CountryCode: the two-letter ISO code (e.g. "IT").
    • dirINPUTDATA: main folder containing the raw NSI data.
  2. Run the script
    For each raw dataset (CSV) in the folder, the script will:
    • Extract variable names (NSI_varname).

    • Collect their observed values (code).

    • Create empty year and description fields.

    • Stack all rows and record which NSI_dataset each row belongs to.

    • Export the result to:

      <dirROCKET>/NSImetadata/<CountryCode>_codebook.csv
  3. Fields requiring manual completion
    • year: must be reviewed and filled manually where relevant.
    • description: must always be filled manually (label for each code).
  4. Double-check the automatic fields
    • The script excludes variables with too many unique values or with large numeric variation.
    • Ensure that important categorical variables were not skipped.
    • Verify that codes are consistent across years.

Key reminders

  • Column order in the final CSV must be exactly: NSI_dataset, NSI_varname, code, year, description
  • This script only provides a first draft. Most of the meaningful content (year, description) must be added manually by NSI staff.
  • Always inspect the final file carefully before uploading to the rocket.
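
To verify that no important categorical variables were skipped, the codebook can be compared against the variable list. A sketch with hypothetical variable names:

```r
library(data.table)

# Hypothetical inputs: the dataset's varlist and the consolidated codebook
varlist  <- data.table(NSI_varname = c("nace", "legal_form", "sales"))
codebook <- data.table(NSI_varname = c("nace", "legal_form"))

# Variables present in the raw data but absent from the codebook
# (high-cardinality, numeric, or skipped by mistake)
skipped <- setdiff(varlist$NSI_varname, codebook$NSI_varname)
skipped  # review manually: "sales" is numeric, so skipping it is expected
```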

5. Constructing the timeconc table

The timeconc table is part of the metadata required by the rocket.
Unlike the other metadata files, it cannot be generated from the raw microdata.

Key points

  • The timeconc table provides official information on the time coverage of the data.
  • It must be obtained directly from an authoritative NSI or official source.
  • Once collected, the table should be stored and maintained in the NSImetadata folder with the naming convention: <classvar>t0_<classvar>t1_conc (e.g. NSI_pcc8t0_pcc8t1_conc.csv).

Responsibilities

  • The NSI staff must provide the timeconc table using official sources (e.g. methodological notes, published documentation, internal validation).
  • The role of the CompNet rocket is only to read and integrate this file; it does not generate it.

Key reminder

⚠️ Always ensure that the timeconc table comes from an officially validated source and is kept up to date. This file underpins the correct interpretation of the temporal dimension of the datasets and cannot be replaced by automated extraction.

6. Constructing NSI_keyID1_keyID2_conc.csv

The firm ID concordance table establishes the link between two firm identifiers (among plantid, firmid, entid, entgrp) used in different datasets.
It is essential for ensuring consistent longitudinal tracking of firms and dataset merging.

The full script is shown below:

# Pseudo-code: Building the firm ID concordance table (NSI_keyID1_keyID2_conc.csv)

library(data.table)

# ---- Inputs ----
CountryCode  <- "IT"                  # two-letter code
dirROCKET    <- "your_dir"            # rocket base
dirINPUTDATA <- "your_data_folder"    # raw data

# Step 1: Identify dataset(s) that contain both ID variables
# Example: suppose "id_old" and "id_new" are two firm identifiers
candidate_datasets <- c("dataset_with_ids")

# Step 2: For each dataset, load and stack across years if not a panel
firmid_list <- list()

# Read NSI_datafiles
datafiles <- fread(file.path(dirROCKET, "NSImetadata",
                             paste0(CountryCode, "_datafiles.csv")))

for (ds in candidate_datasets) {
  # Build file path(s) from NSI_datafiles.csv (folder + file name + extension)
  ds_rows <- datafiles[NSI_dataset == ds, ]
  files <- file.path(dirINPUTDATA, ds_rows$path,
                     paste0(ds_rows$NSI_datafile, ".", ds_rows$format))
  
  # If multiple cross-sections: bind them into a long panel (add year column!)
  if (length(files) > 1) {
    raw <- rbindlist(lapply(files, fread), fill = TRUE) # Works only if datafiles have the same column names!
  } else {
    raw <- fread(files)
  }
  
  year_var <- '...' # Define the variable name for the year variable
  
  # Step 3: Extract only the two ID columns + year column ---> manually fix the id column names
  firmid_sub <- unique(raw[, c("id_old", "id_new", year_var), with = FALSE])
  
  # Step 4: Standardise column names (pick the keyID names from plantid, firmid, entid, entgrp)
  setnames(firmid_sub, old = c("id_old", "id_new", year_var),
           new = c("keyID1", "keyID2", "year"))
  
  firmid_list[[ds]] <- firmid_sub
}

# Step 5: Combine all datasets (if more than one provides concordance)
firmid_all <- rbindlist(firmid_list, fill = TRUE)

# Step 6: Export to NSImetadata
fwrite(firmid_all, file.path(dirROCKET, "NSImetadata",
                             paste0(CountryCode, "_keyID1_keyID2_conc.csv"))) # Substitute the keyIDs with their proper names!

Key principles

  • The table can only be created if at least one dataset contains both identifiers in the same file.
  • If the dataset is not a panel but a set of yearly cross-sections, it must be stacked into a long format before extracting IDs.
  • The table must always contain unique triples: keyID1, keyID2, year, where the ID names need to be picked from plantid, firmid, entid, entgrp.

Steps

  1. Identify dataset(s)
  • Review NSI_datafiles.csv and raw data.
  • Find which dataset(s) include the two firm ID variables (e.g. id_old and id_new).
  2. Stack data if needed
  • If the dataset is stored as separate cross-sections by year, stack them and add a year column.
  • If the dataset is a panel, the year is already present.
  3. Extract unique concordance
  • Keep only the two ID columns and the year column.
  • Deduplicate (unique) to avoid duplicates across files.
  4. Rename columns
  • Use the standard names:
    • keyID1
    • keyID2
    • year
  5. Export
  • Save as:

    <dirROCKET>/NSImetadata/<CountryCode>_<keyID1>_<keyID2>_conc.csv

Key reminders

  • This file is not always available — it depends on the data structure in the NSI.
  • The NSI staff must verify that the mapping is correct and covers the relevant years.
  • Always check that:
  • Both ID variables are properly harmonised.
  • No spurious duplicates or mismatches exist.
  • Cross-sections have been stacked correctly.
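
The uniqueness requirement on the exported triples can be asserted before uploading. A sketch with invented identifiers:

```r
library(data.table)

# Hypothetical concordance between two firm identifiers
conc <- data.table(keyID1 = c("F001", "F001", "F002"),
                   keyID2 = c("E900", "E900", "E901"),
                   year   = c(2018, 2019, 2018))

# The exported table must contain unique (keyID1, keyID2, year) triples
conc_unique <- unique(conc, by = c("keyID1", "keyID2", "year"))
stopifnot(nrow(conc_unique) == nrow(conc))  # fails if spurious duplicates remain
```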

3.2.6 Data Pre-Processing

In the Netherlands and France, NSIs have already harmonized their raw data files to resemble MD datasets, resulting in minimal harmonization work being required from the Launcher.

However, there is a strategic intention to shift the boundary between the responsibilities of NSIs and the MDI infrastructure. Two approaches are under consideration:

  • NSIs document their raw files, and the Launcher—guided by this metadata—performs the harmonization and constructs the MD panels.

  • NSIs carry out the full harmonization to MD standards, and the Launcher simply reads the pre-harmonized files into R.

Some raw datasets require specific preprocessing. The infrastructure handles this with the preprocessing tool (rocket/MDIprogs/datafile_preprocessing_tool.R), which runs right after the launcher imports a raw datafile and before that file is harmonized.

The tool is a general-purpose function designed to apply one or more preprocessing transformations to raw datasets (stored as data.table objects). It enables modular, rule-based data cleaning and transformation by interpreting a structured string called preprocessing_string.

How it works

  1. Instruction string (preprocessing_string) encodes all preprocessing steps.
  2. The string is split into separate operations using ||.
  3. Each operation is parsed and executed in sequence.
  4. The data is modified in-place step by step, and the final data.table is returned.

Syntax rules

  • Operations are separated by ||
  • Parameters within each operation are separated by |
  • Multiple elements in a parameter (e.g., multiple column names) are separated by a hash symbol #
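
These separators can be illustrated by splitting an instruction string by hand (a sketch; the tool's internal parser may differ):

```r
# Hypothetical instruction string with two operations
preprocessing_string <- "filter|year|gt|2010||trimchars|vat_id#tax_id|2"

# Split into operations on "||", then into parameters on "|"
ops <- strsplit(preprocessing_string, "||", fixed = TRUE)[[1]]
parsed <- lapply(ops, function(op) strsplit(op, "|", fixed = TRUE)[[1]])

parsed[[1]]  # "filter" "year" "gt" "2010"
# Multi-element parameters are split on "#"
strsplit(parsed[[2]][2], "#", fixed = TRUE)[[1]]  # "vat_id" "tax_id"
```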

Supported Operations

  • dedup
    Format: dedup|id_col1#id_col2|method|[optional:dedup_col]
    Removes duplicates by ID(s). Methods: na, min, max, meanmode, random.

  • filter
    Format: filter|column|operator|value
    Filters rows based on logical conditions. Operators: eq, neq, gt, gte, lt, lte, in, nin.

  • agg
    Format: agg|group_col1#group_col2|var1:func#var2:func
    Aggregates rows over groups using functions: sum, mean, min, max, median, sd, mode, pickmaxby-refcol.

  • restruct
    Format: restruct|column_to_remove|col1#col2#col3
    Drops a column and deduplicates rows based on the remaining selected columns.

  • reshape
    Format: reshape|id_col1#id_col2|names_from|values_from1#values_from2
    Reshapes data from long to wide format using dcast().

  • derive
    Format: derive|new_col|condmap|cond1:val1#cond2:val2#...|default:<default_val>
    Creates a new column from conditional logic. Conditions use standard R syntax; values can be column names or literals. A default must be specified.

  • scaleif
    Format: scaleif|condition_col|val1:factor1#val2:factor2#...|col1#col2#...
    Conditionally multiplies one or more columns by a factor based on a categorical column's value.

  • trimchars
    Format: trimchars|col1#col2#...|n
    Trims the last n characters from each specified character column. Useful to normalize identifiers or string variables.

  • mergefrom
    Format: mergefrom|datafile_name|join_key1#join_key2#...|col1#col2#...|[join_type]
    Imports one or more raw files belonging to datafile_name (as defined in column NSI_datafile of file NSI_datafiles), stacks them if multiple, and merges the specified columns into the current datafile using the provided join keys. By default a left join is performed. If join_type is set to outer, a full outer join is applied.

Special Feature in agg: pickmaxby-refcol

You can specify that a categorical column should take the value from the row with the highest value in another column.

Syntax:
my_categorical_col:pickmaxby-SCORE_col

This selects the value of my_categorical_col from the row that has the highest SCORE_col within each group.
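
In plain data.table terms, pickmaxby corresponds roughly to the following (illustrative data; not the tool's actual implementation):

```r
library(data.table)

dt <- data.table(firm_id = c("A", "A", "B"),
                 sales   = c(5, 9, 3),
                 country = c("IT", "DE", "FR"))

# agg|firm_id|sales:sum#country:pickmaxby-sales, roughly:
agg <- dt[, .(sales   = sum(sales),
              country = country[which.max(sales)]),  # value from the max-sales row
          by = firm_id]
agg  # firm A gets country "DE" (the row with sales = 9)
```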


Example

Code
preprocessing_string <- 
  "dedup|firm_id#year|random||
   filter|year|gt|2010||
   agg|firm_id|sales:sum#country:pickmaxby-sales||
   trimchars|vat_id|2||
   mergefrom|employment_survey_08|FIRM_ID#year|industry_code#size_class|left"

In case you need clarifications regarding the tool, please reach out to the MDI team.

The advantage of having the Launcher perform the harmonization is a reduction in maintenance costs for NSIs, particularly for recurring annual updates. It also improves the codification and reproducibility of the conceptual work done by NSI staff. However, this approach entails higher initial costs, as it requires NSIs to adopt a more rigid system of metadata documentation and to coordinate more closely with the MDI team.

3.3 Run MDI: Harmonization & Modules

This section describes the main steps to configure and run the MDI system. It covers how to prepare the countdown.R script, perform metadata checks, and execute harmonization and analysis modules. Each step is explained in detail below. An overview of all the steps can be found in the previous section.

3.3.1 Countdown

The countdown (countdown.R) is the starting point for all use of the MDI. It requires the user to make a number of adjustments to ensure that the MDI can be executed successfully for the selected purpose. The following parameters must be reviewed and set by the user.

  • MDI Installation Directory (dirMDI)
    Set the full path to the directory where all MDI files are installed. Make sure the path ends with a “/”, e.g. dirMDI = "my/dir/".

  • Country code (CountryCode)
    Specify a 2-letter country code following the ISO 3166-1 alpha-2 standard.

  • NSI data directory (dirINPUTDATA)
    Set the full path to the directory containing the NSI firm-level data files (or mock data files). These are the raw input files provided by the National Statistical Institute (NSI).

  • Output directory (dirOUTPUT)
    Define the directory to which all generated output files will be exported. This directory must have read/write permissions and will contain results, module outputs, and other exported files.

  • Temporary storage directory (dirTMPSAVE)
    Set the directory used for temporary storage of MDI virtual longitudinal datasets.

Note

This directory is used to store intermediate datasets and allows reuse of processed data without re-importing raw NSI files.

  • Optional – flags for temporary files
    These flags control how the MDI process handles raw data imports and temporary files.

    • MDIimportFlag: Set to TRUE to import raw NSI data files. If FALSE, existing virtual datasets stored in dirTMPSAVE are used.
    • MDIcleanTMP: If TRUE, the temporary directory dirTMPSAVE is cleaned before execution.
  • Optional – mock data flag
    This flag controls whether the MDI is executed using mock data.

    • IsMOCK: Set to TRUE to run the MDI in a test scenario using mock data. When TRUE, temporary files are stored in a country-specific subdirectory to avoid overwriting files when switching countries.
  • Optional – execution control flags
    These flags influence how the MDI scripts run and how much output is produced.

    • MDImoduleRUN: Set to TRUE only after post-harmonization checks have been completed and research modules are ready to run. It should be FALSE during the first execution.

    • MDIdebug: Set to TRUE to display logs, warnings, and errors. Use FALSE for a quieter run.

    • MDIimputeFlag: Reserved for potential data imputation routines (currently not in active use).

    • filteredHarmonization: Set to TRUE if harmonization should be restricted to variables listed in the current MDnames_select file.

Click here to see the entire countdown.R script
# This file is used to start MDI
# fill in all the parameters and save this file: countdown.R
# run the program in R and then choose to execute:
# 1. run pre_launch_checker.R  to run after an update of MDI at NSI, to check and fix metadata
# 2. run liftoff.R to run rocket: execute MD harmonizer and run payload modules
# 3. run prepare_NSI.R to run things to aid in getting metadata in good shape
# 4. run interactive_MDI.R to initialize environment to test/debug/explore/write module code.


rm(list = ls())

MDI_launch_version <- "v2.3"


########################################
# Compulsory steps
########################################


########################################
# 10. Set the full path to the directory where you install the MDI files
########################################

dirMDI <- "/files/MDI/"


########################################
# 9. Give 2 letter country code for your site ("ISO 3166-1 alpha-2" standard)
########################################

CountryCode <- "PTx"


########################################
# 8. Set the full path to the  directory with NSI firm-level data files (or mockdata files)
########################################

dirINPUTDATA <- "/files/NSIdatafiles/"


########################################
# 7. Set the full path to the directory to which generated files are exported (dirOUTPUT)
########################################

dirOUTPUT <- "/files/output/"


########################################
# 6. Set the full path to the directory for temporary storage of MDI virtual longitudinal datasets (dirTMPSAVE)
########################################

dirTMPSAVE <- "/files/TMP/"


########################################
# Optional steps (steps 5, 4, 3 & 2)
########################################

#####################################
# 5. Flag for temporary MDI files   #
#####################################

# set MDIimportFlag=TRUE if you want to import raw NSI data files (if FALSE: reads MDI virtual data from dirTMPSAVE)

MDIimportFlag <- TRUE

################################################
# 4. Flag for cleaning the temporary folder    #
################################################

# set MDIcleanTMP=TRUE if you would like to clean dirTMPSAVE before running

MDIcleanTMP <- FALSE

#####################################
# 3. Flag for mock data use         #
#####################################

IsMOCK <- TRUE
# If IsMOCK, temporary files are stored in a CountryCode subfolder,
# so that files aren't overwritten when switching country
if (IsMOCK) {
  dirTMPSAVE <- paste0(dirTMPSAVE, CountryCode, "/")
  if (!dir.exists(dirTMPSAVE)) {
    dir.create(dirTMPSAVE)
  }
}

#####################################
# 2. Flags to control execution     #
#####################################

# Set MDImoduleRUN = TRUE if the post_harmonization script has been run and checked and modules are ready to be run.
# Should be set to FALSE when running the launch for the first time.

MDImoduleRUN <- FALSE

# set MDIdebug = TRUE if you don't want to suppress logs, warnings and errors

MDIdebug <- TRUE

## NOTE EB: Nothing done at the moment with the imputeflag (was called inputeflag in early versions)

MDIimputeFlag <- FALSE

# If you want the harmonization to be done only for the variables included in the current
# launch's MDnames_select file
filteredHarmonization <- FALSE

##############################
# 1. Liftoff                 #
##############################

# save the program countdown.R to your work directory.
# Run the file to choose which program/feature to execute:

# Now, pick the program to be executed
# 1. run pre_launch_checker.R  to run after an update of MDI at NSI, to check and fix metadata
# 2. run liftoff.R to run rocket: execute MD harmonizer and run payload modules
# 3. run prepare_NSI.R to run things to aid in getting metadata in good shape
# 4. run interactive_MDI.R to initialize environment to test/debug/explore/write module code.
# ---> Choose below with number of the selected program

# Check if the session is interactive (works both in RStudio and console)

if (interactive()) {
  # Use select.list() for interactive selection

  user_input <- select.list(c("pre_launch_checker.R", "liftoff.R", "prepare_NSI.R", "interactive_MDI.R"), title = "Choose a program to run:")


  if (user_input != "") {
    # Source the corresponding script

    source(paste0(dirMDI, "launchpad/", user_input))
  } else {
    cat("No selection made. Exiting.\n")
  }
} else {
  # If not in an interactive session, use readline()

  user_input <- as.integer(readline(prompt = "Please enter an integer (1 for pre_launch_checker.R, 2 for liftoff.R, 3 prepare_NSI.R, 4 interactive_MDI.R): "))
  if (!is.na(user_input) && user_input %in% 1:4) {
    # Map the user input to the corresponding script name
    scripts <- c("pre_launch_checker.R", "liftoff.R", "prepare_NSI.R", "interactive_MDI.R")
    # Source the corresponding script

    source(paste0(dirMDI, "launchpad/", scripts[user_input]))
  } else {
    cat("Invalid input. Please enter a valid integer (1, 2, 3 or 4).\n")
  }
}

3.3.2 Pre-Launch-Checker

The program pre_launch_checker.R (run countdown.R and choose this program) needs to be run before anything else. It performs various checks on the NSI metadata to avoid errors later on. The results of the checks can be found in the file pre_launch_checker_results.txt in the output directory. It lists possible errors that should be fixed in the NSI metadata. Additionally, two concordance files (NSI_pcc8t0_pcc8t1_conc.csv and NSI_MD_nace_conc.csv) are created using existing concordance files and updating them with the data at the NSI. These concordance tables might contain empty values if no value was previously defined. Missing values need to be filled in manually. When the concordance files are ready to be used, they need to be moved to the directory indicated in pre_launch_checker_results.txt.

3.3.3 Post-Harmonization Quality Checks

After harmonizing your country’s microdata to the MD format, the Post-Harmonization Checker (PHC) script is automatically executed in the rocket to ensure that the harmonized datasets meet essential quality and consistency standards. This diagnostic process validates whether the resulting data is clean, correctly structured, and ready for module execution.

The script performs the following checks on the harmonized datasets:

  1. Duplicate Check: Identifies rows where the key ID variable (e.g., firmid) is duplicated.
  2. Variable Class Check: Verifies that each variable matches its expected R data type (e.g., numeric, character, date).
  3. Date Format Check: Ensures date variables are correctly formatted and parseable (e.g., %Y, %d%m%Y).
  4. Date Range Check: Extracts the minimum and maximum detected dates per variable to check that the date range matches expectations.
  5. Break Detection: Identifies structural breaks in aggregate-level distributions over time (jumps of more than 10%).

Each of these checks outputs either a summary table (.txt) or a visual diagnostic (.pdf) to help identify problems.

3.3.3.1 PHC Output Files Generated

After the script runs successfully, you will find the following two files:

  1. <CountryCode>_phc_results.txt
    Location: dirTMPSAVE
    Contents: Duplicate summary, class and format mismatches, detected date ranges.

Duplicate Check Table

  • dataset: MD dataset (e.g., BR, SBS, ICTEC)
  • id_var: The country-specific ID variable used to identify unique records, taken from MD_idInfo
  • has_duplicates: TRUE if duplicated rows are found based on id_var, FALSE otherwise
  • num_duplicated_rows: Total number of rows that are duplicates (may include multiple per key)
  • num_unique_duplicated_keys: Number of unique key values (id_var) that are duplicated
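As a minimal illustration of how such a summary row can be computed (toy data; the real PHC reads the harmonized .RDS files and takes id_var from MD_idInfo):

```r
# Toy dataset with one duplicated key; the real check runs per MD dataset.
dat <- data.frame(
  firmid = c("A1", "A2", "A2", "B7"),
  year   = c(2020, 2020, 2020, 2020)
)
id_var <- "firmid"

# Mark every row whose key appears more than once
dup_rows <- duplicated(dat[[id_var]]) | duplicated(dat[[id_var]], fromLast = TRUE)

dup_summary <- data.frame(
  id_var                     = id_var,
  has_duplicates             = any(dup_rows),
  num_duplicated_rows        = sum(dup_rows),
  num_unique_duplicated_keys = length(unique(dat[[id_var]][dup_rows]))
)
dup_summary
# has_duplicates = TRUE, num_duplicated_rows = 2, num_unique_duplicated_keys = 1
```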

Variable Class Check Table

  • dataset: MD dataset
  • variable: Variable name being checked
  • expected_class: Class assigned to this variable in the metadata (MD_varnames)
  • actual_class: Actual class detected in the harmonized .RDS file
  • class_match: TRUE if expected and actual class match, FALSE otherwise

Date Format & Class Check Table

  • dataset: MD dataset
  • variable: Variable name being checked
  • expected_class: Expected class (usually "date")
  • actual_class: Class detected in the file
  • expected_format: Date format expected (e.g., %Y, %d%m%Y)
  • actual_format: Detected format based on sample values
  • format_valid: TRUE if values can be parsed using expected_format, FALSE otherwise
  • class_match: Whether the variable is stored as a Date object

Date Range Check Table

  • dataset: MD dataset
  • variable: Date variable being checked
  • actual_format: Detected format used to parse the variable
  • actual_min_date: Earliest parsed date in the variable
  • actual_max_date: Latest parsed date in the variable
  • expected_range: Expected range of years (as specified in MD_catalogue)

  2. breaks_report.pdf
    Location: dirTMPSAVE
    Contents: Plots showing time-series breaks for each numeric variable by dataset. Red dots mark a structural break in the time series, defined as a jump of at least 10%.

Break Summary Table (PDF)

  • dataset: Dataset name
  • variable: Numeric variable being assessed for breaks
  • stat: Statistic showing the break (e.g., mean, p50, sd)
  • year: Year in which a structural break was detected
  • growth: Relative change from the previous year (e.g., +0.25 = 25% increase, -1.0 = 100% drop)
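The break rule described above (a jump of more than 10% in a statistic relative to the previous year) can be sketched in R as follows. This is a simplified illustration; the real script computes this per dataset, variable, and statistic.

```r
# Flag years where a summary statistic changes by more than 10% year-on-year.
detect_breaks <- function(years, stat_values, threshold = 0.10) {
  # Relative change from the previous year
  growth <- diff(stat_values) / head(stat_values, -1)
  flagged <- abs(growth) > threshold
  data.frame(
    year   = years[-1][flagged],
    growth = growth[flagged]
  )
}

detect_breaks(2015:2019, c(100, 103, 140, 142, 70))
# flags 2017 (growth ~ +0.36) and 2019 (growth ~ -0.51)
```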

3.3.3.2 Instructions for Country Leaders for Reviewing and Fixing PHC Errors

  1. Duplicate Check
  • Check: Whether any rows share the same key (e.g., firmid) more than once.
  • Look for: has_duplicates == TRUE and high values in num_duplicated_rows or num_unique_duplicated_keys.
  • Fix: Review your harmonization step and ensure that each firm-year observation is uniquely identified. If intentional (e.g., due to panel structure), document it clearly.
  2. Variable Class Check
  • Check: Compares expected vs. actual data types.
  • Look for: class_match == FALSE
  • Fix: In your country metadata, ensure each variable is explicitly cast to the correct type using functions like as.numeric(), as.character(), or as.Date() in the revalue method.
  3. Date Format Check
  • Check: Whether date variables match expected formats (e.g., %d%m%Y).
  • Look for: format_valid == FALSE or actual_format == "unknown"
  • Fix: Recheck how date strings are parsed in your harmonization script and in the metadata. Use as.Date() with the proper format string.
  4. Date Range Check
  • Check: Compares detected date range with expected year coverage.
  • Look for: Min or max dates far outside expected range (e.g., year 1001 or 9122).
  • Fix: Likely due to incorrect parsing. Verify input formats and metadata.
  5. Break Detection
  • Check: Identifies abrupt jumps/drops in:
    • p25, p50, p75
    • Mean
    • Standard deviation
  • Look for: Large positive/negative growth values in the break summary and red dots in the plots.
  • Fix: Review input consistency across years (e.g., variable definitions, missing categories). Cross-check with national data providers to see if breaks are expected due to methodology changes.
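The class and date fixes described above can be sketched as follows. The variable names and formats in this toy example are hypothetical; the real expected types and formats come from your country metadata (MD_varnames).

```r
# Toy harmonized data with two common PHC failures (names are illustrative).
dat <- data.frame(
  empl   = c("12", "305"),             # stored as character, expected numeric
  birthd = c("31121999", "01012005"),  # date strings in %d%m%Y
  stringsAsFactors = FALSE
)

# Variable class mismatch: cast explicitly to the expected type
dat$empl <- as.numeric(dat$empl)

# Date format problem: parse with the format declared in the metadata
dat$birthd <- as.Date(dat$birthd, format = "%d%m%Y")

sapply(dat, class)  # empl: "numeric", birthd: "Date"
```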
Note

You are ready to proceed with running the MDI modules (e.g., setting MDImoduleRun = TRUE) only after:

  • All critical issues (e.g., duplicate rows, format mismatches, corrupted dates) are resolved.
  • You have documented any justified exceptions (e.g., expected breaks).
  • You have shared updates or escalated open issues to the MDI team.

Please keep backup copies of your harmonized .RDS files before making changes.

3.4 Developing & Testing

3.4.1 Nuvolos Developer Space

This space is intended for code development, module creation, and script testing on mock data. It is designed for MDI team members and module writers who are familiar with the MDI infrastructure and have access to the MDI GitHub repository. (More information on Nuvolos: where MDI users develop and test their code)

Each user must connect their GitHub account to enable pushing and pulling changes. Note the following:

  • Each user works in their own isolated space—your changes remain private until you explicitly push them to GitHub.

  • It is your responsibility to ensure you are working on the latest version of the MDI codebase by pulling updates from GitHub when you start your RStudio session (how to work with Git).

  • The workspace includes both NSI mock data and harmonized MD mock data, which you can use for testing and developing your modules. You can find the NSI mockdata in space_mounts/mockdata/NSIdata/ and the harmonized MD mockdata in space_mounts/mockdata/TMP/. (Note: These folders are only accessible via RStudio and won’t show up in the Files section)

If you’re using the Nuvolos Developer Space for the first time you need to connect your GitHub account. Follow the steps below:

  • Open Nuvolos, navigate to “applications” (left menu) and open RStudio

  • Go to the terminal

    • Generate a public/private key pair by executing this command: ssh-keygen -t ed25519 (no need to change suggested location or create a password > press enter 3 times)
    • Navigate to the folder where both keys are saved. You can do that in the Files section on the right side; you may have to click “/” to see all directories. The folder .ssh is hidden, so click the gear symbol and select “Show Hidden Files”
  • Open id_ed25519.pub and copy its contents

  • Open GitHub in the browser

    • In GitHub: go to Settings > SSH and GPG keys
    • Click “New SSH key”
    • Paste the copied public key into the key field, add a title, e.g. “Nuvolos MDI test environment”, then save
  • Back to Nuvolos, in RStudio, Terminal, clone the branch using this command (from within /files folder, which is default):

    • git clone --branch pre_Launch_v2.2_backup --single-branch git@github.com:Secretariat-CompNet/MDI.git
    • The MDI with all files will show up in the files section on the right
    • Then go to home/datahub/ in the Files section and open the .gitconfig file. The file will look like this:
    [user]
          email = 12345678+Name@users.noreply.github.com
          name = YourName
    [credential]
          helper = cache --timeout 64800
  • Make sure that the email address is the one from your GitHub account, not e.g. your IWH email address. To check which one is correct, go to your GitHub account > Settings > Emails. Copy the email address ending in @users.noreply.github.com into the .gitconfig file as shown above. Save the file.

  • The MDI is now correctly set up. You can run your code with mock data, edit it, and pull/push changes to GitHub.

  • To test code: execute countdown and select interactive_MDI (option 4) before running your own script.

3.4.2 General Workflow with Git

If you’re in the right branch and your repository is up-to-date, this is the normal workflow:

  1. You change a file, add a module, or add metadata.
  2. You save your edited file.
  3. (Best practice: Check status (git status) and make sure there are no recent updates on the branch)
  4. You add your file(s) to a commit (git add file_name)
  5. You create a commit with a commit message (git commit -m "this is the commit message")
  6. You push your commit to GitHub (git push)
  7. You can verify your commit with git log, which lists all recent commits (with the one you just made on top).

If you haven’t worked with the MDI in a while, the repository might be outdated or you might be in an old branch. Below are the steps to make sure you’re working in the right branch and have the latest updates.

  1. Navigate into your MDI repository

    cd /files/MDI

  2. Verify the status of your MDI version

    git status

    This tells you which branch you’re on, e.g.: On branch branch_name

    and if you’re up-to-date with the latest changes. There are three possible options:

    1. Your branch is up to date with 'origin/branch_name'. You have all the latest changes of that branch. No need to do anything else

    2. Your branch is ahead of 'origin/branch_name' by x commits. You have changes that you didn’t push to GitHub yet.

    3. Your branch is behind 'origin/branch_name' by x commits. There are updates that you haven’t pulled yet.

  3. If you want to change the branch: git switch new_branch_name

  4. If you want to pull changes: git pull

  5. If you want to push your changes:

    1. To add updated files to a commit use: git add name_of_your_changed_file (use that command for each file individually, or use git add . to add all changed files)

    2. To create a commit use git commit -m "Add your commit message here" (make sure your commit message describes your updates well)

    3. Push your commit to GitHub: git push

  6. Check the status again to verify that you’re in the latest version of your desired branch: git status

    This should now give:

    On branch branch_name
    Your branch is up to date with 'origin/branch_name'.
    nothing to commit, working tree clean

3.4.3 How to add your research module to the MDI infrastructure

If you want to add your module to the MDI infrastructure via the Nuvolos Developer Space, you need access to the MDI GitHub repository and a GitHub account set up in Nuvolos (see: first-time users). Then open RStudio in the Developer Space and follow these steps to add your module:

  1. Add a module folder

    In the Files section on the right, navigate to the folder MDI/payload/Launch_vX.X/Rmodules/ (replace X.X with the current launch version). In that folder, create a new folder with a two-character name abbreviating your research module, e.g. “XY”.

  2. Add your MDnames_select file

    Inside your module folder, add the MDnames_select file. This file contains a list of the variables that your module uses (more information here) and needs to be named (res_group)_MDnames_select.csv, where res_group is your module abbreviation, e.g. XY_MDnames_select.csv.

  3. Add your main script

    Inside your module folder, add your main script. It must be named Launch_X.X_(res_group).R, where X.X is the current launch version and res_group is your module abbreviation, e.g. Launch_2.3_XY.R. This script is executed when your module is run. That does not mean all your code needs to live in that one script: you can add as many scripts as you like and call them using source("path/to/your/script.R").

  4. Add any other scripts, files, or folders

    If you need any additional scripts or files place them in your module folder.

This is an example of a module folder: The folder has the module abbreviation CN and contains the main script Launch_2.3_CN.R, the CN_MDnames_select.csv file and two additional files EU_countries.csv and Questionnaire.xlsx.
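A hypothetical skeleton of such a main script might look as follows. The dirModule path and the helper script name are illustrative assumptions; in a real launch, the countdown sets up the actual paths and environment.

```r
res_group <- "XY"                              # module abbreviation (example)
dirModule <- file.path("Rmodules", res_group)  # hypothetical module path

# The variable list this module declares in its MDnames_select file
select_file <- file.path(dirModule, paste0(res_group, "_MDnames_select.csv"))

# Additional code can live in separate scripts and be sourced from here:
# source(file.path(dirModule, "prepare_data.R"))

select_file
# "Rmodules/XY/XY_MDnames_select.csv"
```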

3.4.4 How to develop and test your module using (mock) data

To develop your module you first need to adjust and run the countdown. This will import all libraries and variables you might want to use in your module. To do so, navigate to launchpad/countdown.R.
The countdown script functions as a configuration file that, for example, sets up paths and flags for the MDI execution. You need to adjust the parameters in the script to fit your needs. For example, set the flag isMOCK to TRUE if you’re working with mock data, or set dirTMP to the directory where the harmonized mock data is stored (space_mounts/mockdata/TMP/). You can find all parameters and flags in the countdown section of this manual.

After adjusting the parameters, run the countdown script and select option 4 “Interactive MDI”. This will set up the environment for you to develop and test your module, but it will not run any MDI module. You can then run your module script (e.g. Launch_2.3_XY.R) to test your module using the mock data.

If you want to mimic a launch as it would happen at an NSI, set the flag MDIimportFlag to FALSE and MDImoduleRUN to TRUE. Then run the countdown and select option 2 “liftoff”. This will set up the environment and run all modules with the selected mock data.
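For example, a mock launch as described above would use settings along these lines in countdown.R (flag names as used in this manual; exact spelling and defaults are defined in the countdown section):

```r
isMOCK        <- TRUE                          # work with mock data
dirTMP        <- "space_mounts/mockdata/TMP/"  # harmonized mock data location
MDIimportFlag <- FALSE                         # skip import: data already harmonized
MDImoduleRUN  <- TRUE                          # run all selected modules ("liftoff")
```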

3.4.5 Mockdata

To ensure robustness, consistency, and functionality across the MDI infrastructure, the development and use of mock data is essential.

  • Specifications for mock data.

    • All files from NSI_datafiles.csv are covered, with all years as given by NSI_varnames.
    • All variables from all NSI files must be represented with the correct format and domain, including classifications, codebooks, and value labels.
  • Underlying ‘firm’ datasets.

    • SBS/BS Cobb-Douglas model: deterministic framework with stochastic draws; firm size is used to infer capital (k), materials (m), and output (y) based on productivity shocks and capital-labor moments.
    • SBS/BS forward-looking Hopenhayn model: includes stochastic productivity draws and shocks to productivity and demand, allowing for endogenous firm exit.
    • SBS/BS Aglio–Bartelsman-type firms: based on parameter draws for A/g, η, and ρ.
    • Firm dynamics with innovation: models firms’ extensive choices in innovative activities (e.g., ICTEC, R&D).
    • Firms with trade behavior: captures extensive and intensive trade choices across modules such as ITGS, ITS, OFATS, and IFATS.
  • BLOCK0: Prepare Auxiliary Files

    • Define the country, the sample periods, the datasets and read country-specific NSI metadata (datafile, varname, codebook).
    • Create a table specifying the hierarchical structure among variables.
    • Develop a table defining the concordance between fundamental model variables and NSI variables.
    • Compile a file detailing auxiliary regressions for predicting numerical, logical, and categorical variables.
  • BLOCK0: Obtain Data Moments from the Data or by Simulation

    • Calculate the sample mean and variance of employment for each NACE 2-digit sector.
    • Determine the average exit rate for each NACE 2-digit sector.
    • Extract regression coefficients for auxiliary regressions.
    • Gather information on sample sizes of surveys.
    • Compute key economic ratios and rates: capital-labor ratio, capital rental rate (interest rate), wage rate, and capital depreciation rate.
  • BLOCK1: Simulate an Unbalanced Panel Dataset

    • Generate an unbalanced panel dataset for firms over time, incorporating firm entry and exit dynamics based on Hopenhayn (1992).
    • Estimate model parameters: \(\alpha\) (output elasticity of labor), \(\sigma\) (standard deviation of TFP process), and z_exit (exit threshold for firms’ productivity) by targeting the sample mean and variance of the firm size (employment) distribution and the exit probability.
    • The simulated panel data includes firm ID, year, productivity, labor, capital, depreciation, and EBITDA.
  • BLOCK2: Predict BR and BS Variables

    • Use concordance tables between model variables and NSI variables, as well as auxiliary regressions and regression coefficients to predict BR and BS variables.
  • BLOCK2: Sample from the ‘Universe’ of Firms

    • For each survey table, sample from the firm universe and predict NSI variables using the auxiliary regressions and regression coefficients.
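The steps above can be illustrated with a stylized sketch of BLOCK1: an unbalanced firm panel with AR(1) log productivity and an exit threshold, in the spirit of Hopenhayn (1992). Parameter values here are illustrative, not the calibrated MDI values, and this is not the actual mockdata generator.

```r
# Stylized unbalanced panel: firms exit when productivity falls below z_exit.
set.seed(1)
n_firms <- 100; n_years <- 10
rho <- 0.9; sigma <- 0.2; z_exit <- -0.5  # persistence, shock sd, exit cutoff

panel <- do.call(rbind, lapply(seq_len(n_firms), function(i) {
  z <- 0
  rows <- list()
  for (t in seq_len(n_years)) {
    z <- rho * z + rnorm(1, sd = sigma)  # AR(1) log productivity process
    if (z < z_exit) break                # endogenous firm exit
    rows[[t]] <- data.frame(firmid = i, year = 2010 + t, z = z)
  }
  do.call(rbind, rows)
}))

head(panel)  # firm-year observations; exiting firms drop out of the panel
```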

4 Acknowledgement

We gratefully acknowledge the support of the European Union, whose funding made this project possible. We also thank all National Statistical Institutes (NSIs), National Statistical Systems, National Productivity Boards (NPBs), and other collaborators for their valuable contributions to the development of the MDI project and this manual.