Step 2: Prepare Dataset

Sections

1. Data Types

1.1. Data type definition

Mirage auto-detects data types, but allows the following changes:

  • Categorical → ID

  • Numerical → ID, Categorical

  • DateTime → ID, Categorical

Name
Description
Example

ID

Unique identifier column with 100% distinct values. Used as row labels, not for analysis.

  • NRIC

  • UEN

Categorical

Contains distinct categories (only values seen in real data are valid). Categories may be ordered or unordered.

  • Unordered text: "Single", "Married"

  • Ordered text: "Very unsatisfied" > "Very satisfied"

  • Unordered numbers: Group 1-10

  • Ordered numbers: Satisfaction scale 1-5

Numerical

Contains numbers with an inherent order. Supports mathematical operations (sum, avg, etc.).

  • Temperature

  • Weight or height

  • Household income

Datetime

Represents a point in time, stored as string. Granularity can range from day to nanosecond.

  • Date (2025-08-18)

  • Time (14:35:00)

  • Datetime(2025-08-18 14:35:00)

Resources:

  • SDV data types: Categorical, Numerical and Datetime data type definitions

2. Synthesise columns

Selecting the right columns to synthesise improves both accuracy and usefulness of your synthetic dataset. The more columns you add, the harder it becomes for the model to learn meaningful patterns and reduces the quality of generated data.

2.1 Focus on what matters

Ask yourself:

  • Do I need this column to retain its statistical attributes?

  • Do I need this column’s relationships with other columns?

If the answer is no, exclude the column.

E.g., If your task is to predict a patient’s insurance charges, you likely need columns such as age, BMI, smoking status, and number of dependents, since these directly affect costs and their relationships matter. But you don’t need the patient’s home address.

2.2 Use Mock Data where possible

If a column doesn't need to preserve its statistical patterns, only its general structure, you can , generate it separately using mock data generation and merge it back later.

E.g., For the task to predict a patient's insurance charges, you can generate home addresses using our Mock Data Generation module.

2.3 Turn off synthesis

If you decide not to include a column, simply turn off the “Synthesise” button for that column in the interface. This ensures the model ignores it during training.

Columns can be excluded by turning off the "Synthesise" button

3. Treatment of ID Columns

An ID column is the first column in your dataset with 100% unique values, often serving as a primary or foreign key. Its purpose is to uniquely identify each row and do does not carry statistical meaning. Because ID columns don’t contribute to patterns or distributions, they are not synthesised like regular data columns.

The user determines how they are handled through 3 approaches:

A user can apply "Drop", "Regular Expression" or "Incremental" on an ID column

3.1 Drop

  • Definition: The column is completely removed from the synthetic dataset.

  • When to use:

    • The ID column is not required in the synthetic data.

    • The ID contains sensitive information that should not appear, even in transformed form (e.g., employee IDs, NRIC numbers).

3.2 Regular Expression

  • Definition: IDs are generated based on a user-specified regular expression pattern.

  • When to use: Your application expects IDs in a specific format (e.g., national IDs, license plates, customer codes).

  • Examples:

    • Car license plate: S[A-HJ-NP-Z]{2} \d{1,4}[A-Z]$

      • Starts with “S”

      • Two random letters (excluding I/O)

      • Up to four numbers (1–9999)

      • Ends with a random letter

  • Tip: Use regex generators such as regex101.com or pair.gov.sg.

  • Mirage already provides patterns for NRIC, UEN and Vehicle Number.

The user can specify a regex pattern or choose NRIC/ UEN/ Vehicle Registration

3.3 Incremental Transformation

  • Definition: Creates a new sequence of IDs by incrementing numbers (e.g., 1, 2, 3…), with optional formatting.

  • When to use:

    • IDs are purely numeric and don’t contain embedded meaning.

    • You need to preserve ordering or sequence across rows (e.g., row numbers, invoice IDs).

3.3.1 Methods for Incremental Transformation

  • Basic

    • Simple sequential numbers, no formatting.

    • Example: 1, 2, 3

  • Leading Zeros

    • Sequential numbers with zeros padded to the front.

    • Example: 01, 02, 03 (number of leading zeros configurable).

  • Prefix/Suffix

    • Sequential numbers combined with text before (prefix) and/or after (suffix), joined by a character.

    • Rules:

      • Prefix and suffix cannot both be empty.

      • Joining character (e.g., ., -, /) connects number with prefix/suffix.

    • Examples:

      • With prefix & suffix: Prefix.1.Suffix, Prefix.2.Suffix

      • Prefix only: Prefix-1, Prefix-2

      • Suffix only: 1/Suffix, 2/Suffix

User can append prefix and suffixes to sequential numbers

4. Treatment of Missing Values

4.1 Auto-drop for high % of missing values

Columns with a high percentage of missing values (>60%) are automatically removed before synthesis. With too much missing data, the model cannot reliably learn patterns, so including them greatly reduces quality. This threshold aligns with industry guidance (e.g., Gretel AI’s recommendation to drop columns with >60% missing values).

"dept" column has >60% missing values and is automatically dropped

4.2 Fill missing values

Missing values are common in real-world datasets, and how you handle them directly affects the quality and usefulness of your synthetic data. Mirage offers three strategies — Retain, Remove Rows, and Fill — each with trade-offs. The right choice depends on your use case and how much data is missing.

Columns with missing values can be handled through multiple methods

Missing values in a column can be handled in one of the following ways:

Strategy
When to use
When to avoid
Trade-offs

Retain:

Leave missing values as blank; no cleaning or imputation applied.

  • Creating test datasets (e.g., dashboards, software validation) where missing values need to be present.

  • Column has only a small % of missing values, and affected rows don’t have many other gaps.

  • You want a dataset with no blanks.

  • The column is a target variable for ML tasks.

Keeps realism and structure, but limits usability for ML or analytics that require complete data.

Remove Rows: Drops all rows where the selected column has missing values (across the dataset).

  • Rows missing in this column also have many other missing values → dropping improves overall dataset quality.

  • Very few rows are missing, so dataset size won’t shrink much.

  • Multiple columns are set to remove rows, especially with high missingness

  • Most rows could be dropped (Mirage will show an error if all rows are lost).

  • The missing column is isolated but other columns in the same rows contain valuable data.

Improves cleanliness and reduces noise, but risks shrinking dataset or discarding useful information.

Fill: Replaces missing values with a specified value (e.g., mean, median, mode, or placeholder).

  • You need a complete dataset for ML tasks.

  • Column has many missing values but is still important.

  • A suitable fill value can reasonably represent the data.

  • Missingness itself is meaningful (e.g., absence of a medical test result).

  • Filling could distort skewed or categorical data.

Ensures completeness, but may introduce bias or artificial patterns.

4.2.1 Fill method options

Method
Description
Use when
Eligible data types

Most frequent value

Fills the column with the value that appears the most frequently in the column.

When the data is a categorical column, and the most frequent value is a good representation of the data.

Categorical, Datetime, Numerical

Constant value

Fills the missing values with a user-specified constant value.

When you want to specify your own value to impute. This should NOT include values indicating that the values are left blank (e.g. NA, nan). If you want the values to be blank, select the "Retain" method instead.

Categorical, Datetime, Numerical

Smart Imputation (Machine Learning-based)

Fills the missing values with the MissForest imputation strategy.

When you want to fill in the missing values based on not only the selected column, but based on the entire dataset.

Categorical, Datetime, Numerical

Mean

Fills the missing values with the mean of the column data.

When the data closely follows a normal distribution. This is not recommended when the dataset is skewed, as outliers can have a large impact on the mean value of a dataset.

Numerical

Median

Fills the missing values with the median of the column data.

When the data in the column is skewed (does not closely follow a normal distribution).

Numerical

Optimal

Automatically chooses mean or median based on how skewed the column data is.

When you are unsure whether to select mean or median.

Numerical

Resources:

Last updated