Step 1: Upload Data
Sections
Data size: How to determine if your data is suitable for Synthetic Data Generation and how to handle data size issues.
Data security and classification: Eligible Government data and data security measures.
Data cleaning: How data should be cleaned before uploading to Mirage.
1. Data size
1.1. Summary
Note: These guidelines are for general reference. The amount of data needed for good results may vary depending on your dataset’s characteristics (e.g., high cardinality, missing values, data imbalance).
Insufficient: fewer than 20 rows. Add more data - not usable yet.
Small: 20-1000 rows, or a row-to-column ratio below 200:1. Generation is possible, but quality can be greatly improved - add more rows if possible.
Usable (More = Better): row-to-column ratio above 200:1. Generation is possible, but quality can be improved - add more rows if possible.
Excess: dataset exceeds Mirage's thresholds (maximum file size of 500 MB, maximum of 1 million rows, or maximum of 50 columns). Beyond 1 million rows you have more than sufficient data for training - reduce the size of your dataset.
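If you want to see where your dataset falls before uploading, the sketch below applies the guidance above, assuming a CSV file loaded with pandas; the file name is a hypothetical placeholder.

```python
import pandas as pd

# Minimal pre-upload check against the summary guidance above.
# "my_dataset.csv" is a hypothetical file name - replace it with your own.
df = pd.read_csv("my_dataset.csv")

n_rows, n_cols = df.shape
ratio = n_rows / n_cols

if n_rows < 20:
    print("Insufficient: add more data - not usable yet")
elif n_rows <= 1000 or ratio < 200:
    print("Small: generation is possible, but add more rows if you can")
elif n_rows > 1_000_000 or n_cols > 50:
    print("Excess: reduce the dataset before uploading")
else:
    print(f"Usable: {n_rows} rows x {n_cols} columns (ratio {ratio:.0f}:1)")
```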
1.1.1 Ensuring adequate data
Increasing dataset size
If your dataset is small, consider increasing the number of rows (e.g., if your dataset is a subset of a larger dataset).
Reducing number of columns
Drop columns not needed for pattern learning
If a column isn’t important for pattern preservation (e.g., not used in analytics or ML tasks), consider excluding it from SDG and generating it using Mock Data Generation instead.
E.g., For a task like predicting patient insurance charges:
👍 Include: Diagnosis, Age, Number of hospital visits
👎 Consider dropping: Date of admission, Residential address
These dropped columns can be added later using Mirage's Mock Data Generation feature.
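As a rough illustration of the example above, the sketch below keeps only the columns needed for pattern learning and drops the rest before upload; the file and column names are illustrative assumptions, not Mirage requirements.

```python
import pandas as pd

# Keep the columns needed for the insurance-charge prediction task and
# drop the rest before uploading for SDG.
df = pd.read_csv("patients.csv")  # hypothetical file name

sdg_input = df.drop(columns=["Date of admission", "Residential address"])
sdg_input.to_csv("patients_for_sdg.csv", index=False)
```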
Remove columns that can be derived from others.
E.g., If you have these columns: Date, Year, Month, Day
👍 Include: Date
👎 Consider dropping: Year, Month, Day (can be extracted from Date)
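A minimal sketch of this clean-up, assuming illustrative file and column names: the derived columns are dropped before upload and can be rebuilt from Date after generation.

```python
import pandas as pd

# Drop derived columns before upload; they can be re-created from "Date".
df = pd.read_csv("admissions.csv", parse_dates=["Date"])  # hypothetical file
df = df.drop(columns=["Year", "Month", "Day"])

# After generation, re-derive them from the Date column:
# df["Year"] = df["Date"].dt.year
# df["Month"] = df["Date"].dt.month
# df["Day"] = df["Date"].dt.day
```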
Consider dropping High Cardinality Columns
Columns with too many unique values (e.g., long free-text notes, UUIDs) may add noise, slow model training and reduce data utility.
Recommendation:
Replace them with simpler, semantically meaningful features (e.g., replace an unstructured "ClinicalNotes" column with its most meaningful features, such as "DiagnosisCode").
Use Mirage's Mock Data Generation feature.
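One simple way to spot high-cardinality columns is to compare each column's unique-value count to the row count, as in the sketch below; the file name and the 0.9 threshold are illustrative assumptions, not Mirage rules.

```python
import pandas as pd

# Flag columns where almost every value is unique, which often indicates
# free text or identifiers (e.g., UUIDs).
df = pd.read_csv("my_dataset.csv")  # hypothetical file name

high_cardinality = [
    col for col in df.columns
    if df[col].nunique(dropna=True) / len(df) > 0.9
]
print("Consider dropping or simplifying:", high_cardinality)
```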
1.1.2 Dimension limits
Thresholds
Mirage cannot process your dataset if it exceeds any of the following limits:
Maximum dataset file size: 500 MB
Maximum number of rows: 1 million rows
Maximum number of columns: 50 columns
Recommendations
If your dataset is too large for Mirage to process, consider:
Reducing the number of rows
Reducing the number of columns (see 1.1.1)
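The sketch below checks a file against these hard limits and trims rows by random sampling; the file name is hypothetical, and sampling is just one simple way to reduce row count.

```python
import os
import pandas as pd

# Pre-upload check against Mirage's hard limits, plus a simple row reduction.
path = "my_dataset.csv"  # hypothetical file name
df = pd.read_csv(path)

assert os.path.getsize(path) <= 500 * 1024**2, "File exceeds 500 MB"
assert df.shape[1] <= 50, "More than 50 columns - drop some (see 1.1.1)"

# If there are more than 1 million rows, a random sample keeps the upload
# within limits while preserving the overall distribution.
if len(df) > 1_000_000:
    df = df.sample(n=1_000_000, random_state=42)
    df.to_csv("my_dataset_reduced.csv", index=False)
```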
2. Data security and classification
For Government data, Mirage has met IM8 requirements around Application Development Security and Risk Management for use of Government data classified up to Confidential Cloud-Eligible (CCE), Sensitive-High (SH).
A non-exhaustive summary of Mirage's key risk mitigation measures is listed below; users are welcome to contact the team for more details or for issues not covered here.
Mirage does not retain data: raw and processed data are purged within 72 hours or upon user request.
All data is encrypted in transit and at rest.
The following tests are conducted on Mirage:
Quarterly Vulnerability Assessment scan
Yearly Penetration Testing
3. Data cleaning
3.1. Data standardisation
3.1.1. Missing values
Leave missing values blank when uploading. Do not use placeholders (e.g., "NA", "missing"). This allows you to handle missing values with Mirage during Step 2.
Numerical columns: Do not use placeholders in a column with numerical values, as this forces the column into a categorical type. The model can then only reproduce existing values, not interpolate new ones (e.g., it won't generate 3 between 2 and 4 if "NA" is present) or fill in missing values suitable for numerical columns.
Categorical columns: Placeholders are treated as an extra category (e.g., Mirage may predict "NA" as a value).
3.1.2. Data format consistency
Data formatting consistency guidelines are segmented by data type.
Categorical data: Use consistent spelling and capitalization ("Active", "active", and "ACTIVE" are treated as different values). For iterables (multiple values in one cell), make sure they are stored as text and not data structures. When you export from Excel or pandas, this usually happens automatically.
✅ Correct: "['Red', 'Blue']" (saved as text in one cell)
❌ Incorrect: ["Red", "Blue"] (actual Python list object)
Numerical data:
Use . for decimals (1.50, not 1,50).
Avoid separators in large numbers (10000, not 10,000).
Export scientific notation as numeric values (1000, not 10^3).
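A minimal pandas sketch of these clean-ups, with illustrative values; check which convention each column actually uses before applying either step.

```python
import pandas as pd

decimals = pd.Series(["1,50", "2,75"])        # comma decimals
amounts = pd.Series(["10,000", "1,250,000"])  # thousands separators

# Comma decimals -> dot decimals; thousands separators removed.
decimals = decimals.str.replace(",", ".", regex=False).astype(float)  # 1.50, 2.75
amounts = amounts.str.replace(",", "", regex=False).astype(float)     # 10000.0, 1250000.0
```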
Datetime data: Keep a single format across the column (e.g., YYYY-MM-DD HH:MM:SS). Do not mix formats like 03/15/2024 2:30 PM and 15-Mar-2024 14:30.
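A minimal pandas sketch for converting mixed formats into one consistent format; format="mixed" requires pandas 2.0 or later, and the example values are the ones from the guideline above.

```python
import pandas as pd

dates = pd.Series(["03/15/2024 2:30 PM", "15-Mar-2024 14:30"])

# Parse each value individually, then rewrite in a single format.
parsed = pd.to_datetime(dates, format="mixed")
consistent = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")
print(consistent.tolist())  # ['2024-03-15 14:30:00', '2024-03-15 14:30:00']
```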