Step 1: Upload Data

Sections

  1. Data size: How to determine if your data is suitable for Synthetic Data Generation and how to handle data size issues.

  2. Data security and classification: Eligible Government data and data security measures.

  3. Data cleaning: How data should be cleaned before uploading to Mirage.

1. Data size

Synthetic data generation requires sufficient input data to train the model. An adequate number of rows relative to columns is needed for the model to reliably capture statistical attributes. Insufficient data may render the dataset unsuitable for synthetic generation. For low-data cases, contact us at [email protected] and [email protected].

1.1. Summary

Note: These guidelines are for general reference. The amount of data needed for good results may vary depending on your dataset’s characteristics (e.g., high cardinality, missing values, data imbalance).

Insufficient

  • Criteria: <20 rows

  • Recommendation: Add more data - not usable yet

Small

  • Criteria: 20-1000 rows OR ratio of rows to columns < 200:1

  • Recommendation: Generation is possible, but quality can be greatly improved - add more rows if possible

Usable (More = Better)

  • Criteria: Ratio of rows to columns > 200:1

  • Recommendation: Generation is possible, but quality can be improved - add more rows if possible

Excess

  • Criteria: The dataset exceeds any of Mirage's thresholds:

    • Maximum dataset file size: 500 MB

    • Maximum number of rows: 1 million rows

    • Maximum number of columns: 50 columns

  • Recommendation: Beyond 1M rows, you have more than sufficient data for training. Reduce the size of your dataset.
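The categories above can be checked before uploading. The following is a minimal sketch, not part of Mirage: the function name `classify_dataset` and its parameters are hypothetical, and it assumes that a ratio of exactly 200:1 counts as Usable, which the guidelines do not specify.

```python
# Thresholds taken from the summary above.
MAX_FILE_MB = 500
MAX_ROWS = 1_000_000
MAX_COLUMNS = 50


def classify_dataset(n_rows, n_columns, file_size_mb=0.0):
    """Return the size category for a dataset with the given shape."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    if n_rows < 20:
        return "Insufficient"
    if n_rows > MAX_ROWS or n_columns > MAX_COLUMNS or file_size_mb > MAX_FILE_MB:
        return "Excess"
    # Assumption: a ratio of exactly 200:1 is treated as Usable.
    if n_rows <= 1000 or n_rows / n_columns < 200:
        return "Small"
    return "Usable"


print(classify_dataset(5000, 40))   # Small (ratio is 125:1)
print(classify_dataset(10000, 20))  # Usable (ratio is 500:1)
```

Because these are general guidelines, a "Usable" result does not guarantee good quality for datasets with high cardinality, many missing values, or strong imbalance.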

1.1.1 Ensuring adequate data

Increasing dataset size

If your dataset is small, consider increasing the number of rows (e.g., by including more records if your dataset is a subset of a larger dataset).

Reducing number of columns

Choosing the right columns to include is a key step in achieving high-quality synthetic data. A more intentional selection of columns allows the model to learn patterns more effectively, leading to better results and faster training times.

  1. Drop columns not needed for pattern learning

If a column isn’t important for pattern preservation (e.g., not used in analytics or ML tasks), consider excluding it from Synthetic Data Generation (SDG) and generating it using Mock Data Generation instead.

E.g., For a task like predicting patient insurance charges:

  • 👍 Include: Diagnosis, Age, Number of hospital visits

  • 👎 Consider dropping: Date of admission, Residential address

These dropped columns can be added later using Mirage's Mock Data Generation feature.

  2. Remove columns that can be derived from others

E.g., If you have these columns: Date, Year, Month, Day

  • 👍 Include: Date

  • 👎 Consider dropping: Year, Month, Day (can be extracted from Date)

  3. Consider dropping high-cardinality columns

Columns with too many unique values (e.g., long free-text notes, UUIDs) may add noise, slow model training and reduce data utility.

Recommendation:

  • Replace with simpler, semantically meaningful features (e.g., replace unstructured "ClinicalNotes" with a structured feature such as "DiagnosisCode").

  • Use Mirage's Mock Data Generation feature.
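To help spot high-cardinality columns before uploading, the sketch below flags text columns in which almost every value is unique. It is an illustration only: the function name `flag_high_cardinality`, the 0.9 threshold, and the sample data are assumptions, not Mirage settings. Columns whose values are all numeric are skipped, since continuous numbers are naturally distinct.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def flag_high_cardinality(csv_text, threshold=0.9):
    """List non-numeric columns whose share of distinct values is >= threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            value = (row.get(name) or "").strip()
            if value:  # ignore blank (missing) cells
                columns[name].append(value)
    flagged = []
    for name, values in columns.items():
        if not values or all(_is_number(v) for v in values):
            continue  # empty or purely numeric column
        if len(set(values)) / len(values) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical sample data for illustration.
sample = """PatientID,Diagnosis,Age
a1f3,Flu,34
b2c4,Flu,51
c3d5,Asthma,34
d4e6,Flu,47
"""
print(flag_high_cardinality(sample))  # ['PatientID']
```

A flagged column is only a candidate for review. Whether to drop it, replace it with a simpler feature, or regenerate it with Mock Data Generation remains a judgment call.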

1.1.2 Dimension limits

Thresholds

Mirage is unable to process your dataset if it exceeds any of the following limits:

  1. Maximum dataset file size: 500 MB

  2. Maximum number of rows: 1 million rows

  3. Maximum number of columns: 50 columns

Recommendations

If your dataset is too large for Mirage to process, consider:

  1. Reducing the number of rows

  2. Reducing the number of columns (see 1.1.1)
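One possible way to reduce the number of rows is uniform random sampling, sketched below. This is an assumption about how you might reduce rows, not a method prescribed by Mirage, and the function name `downsample_rows` is hypothetical. Random sampling can thin out rare categories, so check that important groups are still represented afterwards.

```python
import random

MAX_ROWS = 1_000_000  # row limit from the thresholds above


def downsample_rows(rows, max_rows=MAX_ROWS, seed=0):
    """Return at most max_rows rows, sampled uniformly, in their original order."""
    rows = list(rows)
    if len(rows) <= max_rows:
        return rows
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in keep]


subset = downsample_rows(range(10), max_rows=3)
print(len(subset))  # 3
```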

2. Data security and classification

For Government data, Mirage has met IM8 requirements around Application Development Security and Risk Management for use of Government data classified up to Confidential Cloud-Eligible (CCE) and Sensitive-High (SH).

A non-comprehensive summary of Mirage’s key risk mitigation measures is listed below, and users are welcome to contact the team for more details or issues not covered here.

  1. Mirage does not retain data: raw and processed data are purged within 72 hours or upon user request.

  2. All data is encrypted in transit and at rest.

  3. The following tests are conducted on Mirage:

    1. Quarterly Vulnerability Assessment scan

    2. Yearly Penetration Testing

3. Data cleaning

3.1. Data standardisation

3.1.1. Missing values

  • Leave missing values blank when uploading. Do not use placeholders (e.g., "NA", "missing"). This allows you to handle missing values with Mirage during Step 2.

  • Numerical columns: Do not use placeholders in a column with numerical values, as this forces the column into a categorical type. The model can then only reproduce existing values rather than interpolate new ones (e.g., it won’t generate 3 between 2 and 4 if "NA" is present), and it cannot fill in missing values in a way suited to numerical columns.

  • Categorical columns: Placeholders are treated as an extra category (e.g. Mirage may predict "NA" as a value).
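The guidance above can be applied with a small pre-upload step that blanks out placeholder strings. This is a minimal sketch: the function name `blank_placeholders` is hypothetical, and the placeholder list is an assumption (only "NA" and "missing" come from the guidelines). Adjust the list so that it does not contain any value that is a genuine category in your data.

```python
# Assumed placeholder spellings, compared case-insensitively.
PLACEHOLDERS = {"na", "n/a", "null", "missing"}


def blank_placeholders(rows, placeholders=PLACEHOLDERS):
    """Replace placeholder strings with "" in a list of row dicts."""
    cleaned = []
    for row in rows:
        cleaned.append({
            key: "" if value is None or value.strip().lower() in placeholders else value
            for key, value in row.items()
        })
    return cleaned


rows = [
    {"Age": "NA", "Status": "Active"},
    {"Age": "42", "Status": "missing"},
]
print(blank_placeholders(rows))
# [{'Age': '', 'Status': 'Active'}, {'Age': '42', 'Status': ''}]
```

The rows are expected to hold strings, as produced by `csv.DictReader`. Writing the cleaned rows back with `csv.DictWriter` leaves the blanked cells empty in the output file.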

3.1.2. Data format consistency

Data formatting consistency guidelines are segmented by data type.

  • Categorical data: Use consistent spelling and capitalization ("Active", "active", "ACTIVE" are treated as different). For iterables (multiple values in one cell), make sure they are stored as text and not data structures. When you export from Excel or pandas, this usually happens automatically.

    • ✅ Correct: "['Red', 'Blue']" (saved as text in one cell)

    • ❌ Incorrect: ["Red", "Blue"] (actual Python list object)

  • Numerical data:

    • Use . for decimals (1.50, not 1,50).

    • Avoid separators in large numbers (10000, not 10,000).

    • Export scientific notation as numeric values (1000, not 10^3).

  • Datetime data: Keep a single format across the column (e.g. YYYY-MM-DD HH:MM:SS). Do not mix formats like 03/15/2024 2:30 PM and 15-Mar-2024 14:30.
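Two of the formatting guidelines above, consistent capitalization and a single datetime format, can be sketched as small helper functions. The helper names and the list of input formats are assumptions for illustration. Extend `KNOWN_FORMATS` to match the formats that actually occur in your data. The `%b` and `%p` codes rely on an English locale.

```python
from datetime import datetime

TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS
KNOWN_FORMATS = [
    "%Y-%m-%d %H:%M:%S",  # already in the target format
    "%m/%d/%Y %I:%M %p",  # e.g. 03/15/2024 2:30 PM
    "%d-%b-%Y %H:%M",     # e.g. 15-Mar-2024 14:30
]


def normalise_datetime(text, formats=KNOWN_FORMATS):
    """Convert a datetime string in any known format to TARGET_FORMAT."""
    text = text.strip()
    if not text:
        return ""  # keep missing values blank
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised datetime format: {text!r}")


def normalise_category(text, canonical):
    """Map any capitalization of a known label to its canonical spelling."""
    lookup = {label.lower(): label for label in canonical}
    return lookup.get(text.strip().lower(), text.strip())


print(normalise_datetime("03/15/2024 2:30 PM"))            # 2024-03-15 14:30:00
print(normalise_datetime("15-Mar-2024 14:30"))             # 2024-03-15 14:30:00
print(normalise_category("ACTIVE", ["Active", "Inactive"]))  # Active
```

Numeric separators are deliberately not converted automatically here, because a value such as 1,500 is ambiguous between a decimal comma and a thousands separator.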
