Step 2: Prepare Dataset
Sections
1. Data Types
1.1. Data type definition
Mirage auto-detects data types, but allows the following changes:
Categorical → ID
Numerical → ID, Categorical
DateTime → ID, Categorical
ID
Unique identifier column with 100% distinct values. Used as row labels, not for analysis.
NRIC
UEN
Categorical
Contains distinct categories (only values seen in real data are valid). Categories may be ordered or unordered.
Unordered text: "Single", "Married"
Ordered text: "Very unsatisfied" > "Very satisfied"
Unordered numbers: Group 1-10
Ordered numbers: Satisfaction scale 1-5
Numerical
Contains numbers with an inherent order. Supports mathematical operations (sum, avg, etc.).
Temperature
Weight or height
Household income
Datetime
Represents a point in time, stored as string. Granularity can range from day to nanosecond.
Date (
2025-08-18)Time (
14:35:00)Datetime(
2025-08-18 14:35:00)
Resources:
SDV data types: Categorical, Numerical and Datetime data type definitions
2. Synthesise columns
Selecting the right columns to synthesise improves both accuracy and usefulness of your synthetic dataset. The more columns you add, the harder it becomes for the model to learn meaningful patterns and reduces the quality of generated data.
2.1 Focus on what matters
Ask yourself:
Do I need this column to retain its statistical attributes?
Do I need this column’s relationships with other columns?
If the answer is no, exclude the column.
E.g., If your task is to predict a patient’s insurance charges, you likely need columns such as age, BMI, smoking status, and number of dependents, since these directly affect costs and their relationships matter. But you don’t need the patient’s home address.
2.2 Use Mock Data where possible
If a column doesn't need to preserve its statistical patterns, only its general structure, you can , generate it separately using mock data generation and merge it back later.
E.g., For the task to predict a patient's insurance charges, you can generate home addresses using our Mock Data Generation module.
2.3 Turn off synthesis
If you decide not to include a column, simply turn off the “Synthesise” button for that column in the interface. This ensures the model ignores it during training.

3. Treatment of ID Columns
An ID column is the first column in your dataset with 100% unique values, often serving as a primary or foreign key. Its purpose is to uniquely identify each row and do does not carry statistical meaning. Because ID columns don’t contribute to patterns or distributions, they are not synthesised like regular data columns.
The user determines how they are handled through 3 approaches:

3.1 Drop
Definition: The column is completely removed from the synthetic dataset.
When to use:
The ID column is not required in the synthetic data.
The ID contains sensitive information that should not appear, even in transformed form (e.g., employee IDs, NRIC numbers).
3.2 Regular Expression
Definition: IDs are generated based on a user-specified regular expression pattern.
When to use: Your application expects IDs in a specific format (e.g., national IDs, license plates, customer codes).
Examples:
Car license plate:
S[A-HJ-NP-Z]{2} \d{1,4}[A-Z]$Starts with “S”
Two random letters (excluding I/O)
Up to four numbers (1–9999)
Ends with a random letter
Tip: Use regex generators such as regex101.com or pair.gov.sg.
Mirage already provides patterns for NRIC, UEN and Vehicle Number.

3.3 Incremental Transformation
Definition: Creates a new sequence of IDs by incrementing numbers (e.g., 1, 2, 3…), with optional formatting.
When to use:
IDs are purely numeric and don’t contain embedded meaning.
You need to preserve ordering or sequence across rows (e.g., row numbers, invoice IDs).
3.3.1 Methods for Incremental Transformation
Basic
Simple sequential numbers, no formatting.
Example:
1, 2, 3
Leading Zeros
Sequential numbers with zeros padded to the front.
Example:
01, 02, 03(number of leading zeros configurable).
Prefix/Suffix
Sequential numbers combined with text before (prefix) and/or after (suffix), joined by a character.
Rules:
Prefix and suffix cannot both be empty.
Joining character (e.g.,
.,-,/) connects number with prefix/suffix.
Examples:
With prefix & suffix:
Prefix.1.Suffix,Prefix.2.SuffixPrefix only:
Prefix-1,Prefix-2Suffix only:
1/Suffix,2/Suffix

4. Treatment of Missing Values
4.1 Auto-drop for high % of missing values
Columns with a high percentage of missing values (>60%) are automatically removed before synthesis. With too much missing data, the model cannot reliably learn patterns, so including them greatly reduces quality. This threshold aligns with industry guidance (e.g., Gretel AI’s recommendation to drop columns with >60% missing values).

4.2 Fill missing values
Missing values are common in real-world datasets, and how you handle them directly affects the quality and usefulness of your synthetic data. Mirage offers three strategies — Retain, Remove Rows, and Fill — each with trade-offs. The right choice depends on your use case and how much data is missing.

Missing values in a column can be handled in one of the following ways:
Retain:
Leave missing values as blank; no cleaning or imputation applied.
Creating test datasets (e.g., dashboards, software validation) where missing values need to be present.
Column has only a small % of missing values, and affected rows don’t have many other gaps.
You want a dataset with no blanks.
The column is a target variable for ML tasks.
Keeps realism and structure, but limits usability for ML or analytics that require complete data.
Remove Rows: Drops all rows where the selected column has missing values (across the dataset).
Rows missing in this column also have many other missing values → dropping improves overall dataset quality.
Very few rows are missing, so dataset size won’t shrink much.
Multiple columns are set to remove rows, especially with high missingness
Most rows could be dropped (Mirage will show an error if all rows are lost).
The missing column is isolated but other columns in the same rows contain valuable data.
Improves cleanliness and reduces noise, but risks shrinking dataset or discarding useful information.
Fill: Replaces missing values with a specified value (e.g., mean, median, mode, or placeholder).
You need a complete dataset for ML tasks.
Column has many missing values but is still important.
A suitable fill value can reasonably represent the data.
Missingness itself is meaningful (e.g., absence of a medical test result).
Filling could distort skewed or categorical data.
Ensures completeness, but may introduce bias or artificial patterns.
4.2.1 Fill method options
Most frequent value
Fills the column with the value that appears the most frequently in the column.
When the data is a categorical column, and the most frequent value is a good representation of the data.
Categorical, Datetime, Numerical
Constant value
Fills the missing values with a user-specified constant value.
When you want to specify your own value to impute. This should NOT include values indicating that the values are left blank (e.g. NA, nan). If you want the values to be blank, select the "Retain" method instead.
Categorical, Datetime, Numerical
Smart Imputation (Machine Learning-based)
Fills the missing values with the MissForest imputation strategy.
When you want to fill in the missing values based on not only the selected column, but based on the entire dataset.
Categorical, Datetime, Numerical
Mean
Fills the missing values with the mean of the column data.
When the data closely follows a normal distribution. This is not recommended when the dataset is skewed, as outliers can have a large impact on the mean value of a dataset.
Numerical
Median
Fills the missing values with the median of the column data.
When the data in the column is skewed (does not closely follow a normal distribution).
Numerical
Optimal
Automatically chooses mean or median based on how skewed the column data is.
When you are unsure whether to select mean or median.
Numerical
Resources:
Analytics Yogi: Explains the definitions of mean and median, and their suitability for replacing missing values in detail.
MissForest imputation strategy
MissForest—non-parametric missing value imputation for mixed-type data: Research paper that introduces MissForest.
Medium: An illustrated article that explains MissForest.
Last updated