Quickstart
Sections
Create your project: Create a synthetic data generation project.
Create your project

On the Mirage homepage, click "Enter" in the "Synthetic Data Generation" section to view your synthetic data generation projects.

On the Projects page, click "Create Project" to create a new synthetic data generation project.
Step 1: Upload Dataset
Select the dataset that you want to use for your new synthetic data project.

Select your use case by filling in the following fields:
Use case entity: Which entity do you plan to share this data with?
Use case scenario: How will you be using this synthetic data?
Select your dataset format. Mirage currently supports only standard tabular datasets. We are exploring generation for other data types (e.g., sequential tabular data, free-text data). Let us know at [email protected] if this is something you need.


Upload or select a dataset for your project.
If you have a dataset that you would like to use, click "Drop your file here or click to upload" to upload your dataset (CSV file).
Select the delimiter used in your CSV file if it is not a comma (,).
Uncheck the checkbox if the first row of your dataset does NOT contain headers.
If you do not currently have a dataset that you would like to use, click "I don't have a dataset right now" and select from one of our sample datasets.

Check that your dataset is correct in the "Dataset preview" section. If you are unable to view all of your columns, scroll the table to the right to view the rest of your table.
[OPTIONAL] If you want to use only some columns of your uploaded dataset, select the columns that you would like to keep in the "Select columns" section. The screenshot shows an example of selecting all columns except the "Medisave Usage" column.
Click "Next: Prepare Dataset" at the bottom of the page to upload your dataset and move on to the next step.
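Before uploading a CSV file in Step 1, it can help to confirm locally which delimiter the file uses and whether its first row looks like a header, since both settings are chosen on the upload page. The following is a minimal sketch using only Python's standard library. The function name inspect_csv and the sample data are illustrative assumptions, not part of Mirage, and csv.Sniffer relies on heuristics, so treat its answer as a hint rather than a guarantee.

```python
import csv
import io


def inspect_csv(text: str, candidates: str = ",;\t|") -> dict:
    """Guess the delimiter and whether the first row is a header.

    Hypothetical helper for a local pre-upload check; not a Mirage API.
    """
    sniffer = csv.Sniffer()
    # Restrict the guess to a few common delimiters.
    dialect = sniffer.sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return {
        "delimiter": dialect.delimiter,
        # Heuristic: compares the first row against the rows below it.
        "has_header": sniffer.has_header(text),
        "n_columns": len(rows[0]) if rows else 0,
    }


# Made-up sample that uses a semicolon instead of a comma.
sample = (
    "age;income;city\n"
    "34;5200;Singapore\n"
    "51;7100;Jurong\n"
    "29;4300;Tampines\n"
)
info = inspect_csv(sample)
print(info)
```

If the reported delimiter is not a comma, select the matching delimiter on the upload page. If the first row does not look like a header, consider unchecking the header checkbox.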
Step 2: Prepare Dataset
Pre-process your data before generating new synthetic data.


Data previously not selected in the "Select columns" section (Step 1, Part 5) will not be shown in the list of columns on the right.
By default, columns are dropped and cannot be selected when:
Columns contain information identified as Personally Identifying Information (PII). These include columns like NRIC and Name.
Columns have a high percentage of missing data (>60%).
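To anticipate which columns might be dropped for missing data, you can estimate the fraction of missing values per column before uploading. The sketch below assumes that an empty cell counts as missing. How Mirage itself detects missing values is not stated in this guide, so the helper name missing_fractions, the sample data, and the empty-cell rule are all assumptions. Only the 60% threshold comes from the text above.

```python
import csv
import io

# From the guide: columns with more than 60% missing data are dropped.
MISSING_THRESHOLD = 0.60


def missing_fractions(text: str, delimiter: str = ",") -> dict:
    """Return the fraction of empty cells for each column of a CSV string.

    Assumption: a cell is "missing" when it is empty or whitespace only.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    fractions = {}
    for name in reader.fieldnames or []:
        missing = sum(1 for row in rows if not (row[name] or "").strip())
        fractions[name] = missing / len(rows) if rows else 0.0
    return fractions


# Made-up sample: "referral_code" is empty in 4 of 5 rows.
sample = (
    "age,income,referral_code\n"
    "34,5200,\n"
    "51,,\n"
    "29,4300,\n"
    "42,6100,X1\n"
    "38,,\n"
)
fractions = missing_fractions(sample)
likely_dropped = [col for col, frac in fractions.items() if frac > MISSING_THRESHOLD]
print(fractions)
print(likely_dropped)
```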

Columns with missing data can be handled in the following ways:
Retain (Default): Leave the missing values blank.
Remove rows: Remove the corresponding row in the entire dataset if the value in the column is missing.
Fill: Fill in missing values with placeholder values.


Columns with missing data can be filled in the following ways:
Categorical, datetime type columns: Most frequent value, Constant value, Smart Imputation
Numerical datetime type columns: Mean, Median, Most frequent value, Constant value, Optimal, Smart Imputation
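The simpler handling options can be illustrated with a short sketch. It shows what "Remove rows" and the "Mean", "Median", "Most frequent value", and "Constant value" fills do to a column, with None standing in for a missing value. This is an illustration of the general techniques, not Mirage's implementation. The function names are hypothetical, and "Optimal" and "Smart Imputation" are left out because this guide does not describe how they work.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy, constant=None):
    """Fill None entries in a list using a simple strategy (illustrative only)."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = mode(present)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def remove_rows(rows, column):
    """Drop every row (a dict) whose value in `column` is missing."""
    return [row for row in rows if row[column] is not None]


# Made-up numerical column with two missing values.
incomes = [5200, None, 4300, 6100, None, 4300]
filled_mean = fill_missing(incomes, "mean")
filled_median = fill_missing(incomes, "median")
filled_mode = fill_missing(incomes, "most_frequent")
filled_const = fill_missing(incomes, "constant", constant=0)

rows = [{"age": 34, "income": 5200}, {"age": 51, "income": None}]
kept = remove_rows(rows, "income")
```

Mean and median fills only make sense for numerical columns. Most frequent value and constant value also work for categorical columns, which matches the options listed above.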


[OPTIONAL] Click "View real data" in the top left corner of the page if you need to look at your data again before deciding on data processing methods.
Click "Configure Data" to confirm your selected data processing methods and move on to the next step.
Step 3: Configure Training

Select a training goal by clicking on the respective box:
Accuracy: Mirage automatically selects a model better suited for your data, resulting in higher quality generated synthetic data that more closely resembles your original data.
Speed: Mirage automatically selects a model better suited for your data, resulting in faster training times.
Custom: You can customise the model used and its hyperparameters.

If "Custom" is selected, select the following in the "Custom Selections" section:
Select the "Model for Training": Select a model from our four available models.
[OPTIONAL] Adjust model parameters. We suggest starting a project with the default model parameters first, then starting another project with adjusted model parameters to compare model performance.
Specify "Number of Rows to Generate For Synthetic Dataset" in the "Data Generation Settings" section.


[OPTIONAL] Expand the "Advanced Settings" section if you want to adjust the "Subsample Data" or "Machine Learning Task" values.
If your dataset is large, change the value of "Subsample Data" to use only a subset of your data as training data.
If you are using the synthetic dataset for a machine learning task and would like to evaluate its performance on that task, select one of the following tasks: Classification, Regression. Change the value of "Test Set Size" if you would like to change the percentage of the dataset used as a test set for this evaluation.
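To make the effect of these two settings concrete, the sketch below subsamples a dataset and then holds out part of it as a test set. It rests on several assumptions: that "Subsample Data" behaves like a fraction of rows, that "Test Set Size" is a fraction of the subsampled data, and that sampling is random. This guide does not confirm any of these, and the function name is hypothetical.

```python
import random


def subsample_and_split(rows, subsample_fraction, test_size, seed=0):
    """Randomly keep a fraction of rows, then split them into train and test.

    Illustrative only; the order and method Mirage uses are not documented here.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(rows) * subsample_fraction))
    kept = rng.sample(rows, n_keep)  # random order, without replacement
    n_test = round(len(kept) * test_size)
    test, train = kept[:n_test], kept[n_test:]
    return train, test


# Made-up dataset of 1000 row identifiers.
rows = list(range(1000))
train, test = subsample_and_split(rows, subsample_fraction=0.5, test_size=0.2)
print(len(train), len(test))
```

With these example values, half of the rows are kept, and one fifth of the kept rows are reserved for testing.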

[OPTIONAL] If you do not want to receive an email notification when model training is complete and the project is ready to generate additional synthetic datasets, uncheck the checkbox "Email me when the project is ready for data generation".
Click the "Train" button to start model training.

After training is completed successfully, you can move on to Step 4 by clicking "Next: Generate Data".
Step 4: Generate Data & Report
After training is completed, you can do the following:
Download the generated synthetic dataset
Generate additional synthetic datasets
Review the performance summary
4.1. Download the generated synthetic dataset

Generated synthetic datasets can be found in the "Generation History" section.
Click the "Download dataset" button in the "Download" column to download the synthetic dataset.

A preview of the synthetic dataset is available in the "Preview Synthetic Data" section below, so that you can get an overview of the synthetic dataset before downloading it.
4.2. Generate additional synthetic datasets

Click "Generate" above the "Generation History" section to generate an additional synthetic dataset.
To generate a dataset with a different number of rows, change the value in "Number of rows" before you click "Generate".
4.3. Review the performance summary
The performance metrics shown compare your synthetic dataset against the processed and subsampled version of your original dataset ("real dataset"), to determine the following:
Column shapes: Column shapes metrics help you evaluate how similar each synthetic column is to the corresponding real column. You can assess the columns using summary statistics and distribution similarity scores, and by verifying that the data respects the original column's range and categories.
Column pair trends: Column pair trends metrics help you evaluate the extent to which relationships among columns are retained in the synthetic dataset compared to the real dataset. This evaluation uses correlation and contingency table scores.
Machine learning efficacy: This section is only available if a machine learning task is selected in Step 3. Machine learning efficacy helps you evaluate how well a synthetic dataset performs in prediction tasks. Based on the task and the prediction column you selected in Step 3, specific machine learning algorithms are trained on the training dataset, and their performance is tested on the test dataset.
Privacy: Privacy metrics help you evaluate the risks that adversaries might exploit to gain insights from the synthetic dataset by assessing leakage of exact real records, the similarity between real and synthetic datasets, and the presence of outliers. While synthetic data inherently offers some privacy due to its randomness and the absence of a one-to-one mapping with the real dataset, these metrics can help you assess the subjective privacy risk based on your context of data sharing and mitigate potential privacy threats.
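To give a rough intuition for two of these ideas, the sketch below computes a simple distribution similarity score for a categorical column and the share of synthetic rows that exactly duplicate a real row. Both are common, generic measures chosen here for illustration. This guide does not specify the formulas Mirage uses, so the function names and the choice of total variation distance are assumptions.

```python
from collections import Counter


def category_similarity(real, synthetic):
    """Return 1 minus the total variation distance between category frequencies.

    1.0 means identical frequencies; 0.0 means no overlap. Illustrative only.
    """
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


def exact_match_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are identical to some real row."""
    real_set = set(real_rows)
    matches = sum(1 for row in synthetic_rows if row in real_set)
    return matches / len(synthetic_rows)


# Made-up categorical column from a real and a synthetic dataset.
real_col = ["A", "A", "B", "C"]
synth_col = ["A", "B", "B", "C"]
similarity = category_similarity(real_col, synth_col)

# Made-up rows as tuples; one synthetic row copies a real row exactly.
real_rows = [(34, "A"), (51, "B"), (29, "C")]
synth_rows = [(34, "A"), (40, "B"), (30, "C"), (29, "A")]
leak_rate = exact_match_rate(real_rows, synth_rows)
print(similarity, leak_rate)
```

A higher similarity suggests the synthetic column follows the real column's categories more closely, while a higher exact match rate suggests more real records may have leaked into the synthetic dataset.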
Contact
If you would like to organise a Mirage demo session for your team, please contact [email protected] and [email protected]