The Data Blog
There are two fundamental approaches to generating synthetic data. The first starts from a production database and modifies or masks it, either manually or with Extract, Transform, Load (ETL) tools, or analyzes it with AI to generate a facsimile that mirrors some of its attributes. The second approach does not use production databases at all and generates fully synthetic data. The fidelity of this data can vary wildly, from random alphanumeric characters to high-fidelity synthetic data that is correlated at the field level across systems of systems, with business logic and workflow rules, correct statistical distributions, correlation over the time axis, high use case coverage, engineered errors, and system response files.

A method based on a production data source is typically best if policy decisions are being made with the data. A fully synthetic approach is generally better suited to most other use cases: you do not need access to a confidential database, which might not even exist for a future-state system; there are no privacy restrictions on how you can use the data; it is less expensive and faster; and, with a known ground truth and expected system response files, you can measure and improve system error rates.
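To make the last point concrete, here is a minimal sketch of how fully synthetic data with engineered errors and a known ground truth can be used to measure a system's error rates. Everything in it is hypothetical and chosen only for illustration: the `Record` fields, the five-digit `zip_code` rule, the `make_records` generator, and the `system_under_test` stand-in are assumptions, not part of any particular product or dataset.

```python
import random
import string
from dataclasses import dataclass


@dataclass
class Record:
    record_id: int
    zip_code: str   # hypothetical field, valid when it is exactly 5 digits
    is_error: bool  # ground truth: was an error engineered into this record?


def make_records(n, error_rate, seed=0):
    """Generate n fully synthetic records; no production data is involved."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    records = []
    for i in range(n):
        zip_code = "".join(rng.choice(string.digits) for _ in range(5))
        is_error = rng.random() < error_rate
        if is_error:
            # Engineered error: overwrite one digit with an uppercase letter.
            pos = rng.randrange(5)
            bad_char = rng.choice(string.ascii_uppercase)
            zip_code = zip_code[:pos] + bad_char + zip_code[pos + 1:]
        records.append(Record(i, zip_code, is_error))
    return records


def system_under_test(record):
    """Stand-in for a real system: returns True if it flags the record."""
    return not (len(record.zip_code) == 5 and record.zip_code.isdigit())


def measure_error_rates(records, flag):
    """Compare the system's flags against the known ground truth."""
    clean = [r for r in records if not r.is_error]
    bad = [r for r in records if r.is_error]
    false_pos = sum(1 for r in clean if flag(r))
    false_neg = sum(1 for r in bad if not flag(r))
    return {
        "false_positive_rate": false_pos / len(clean) if clean else 0.0,
        "false_negative_rate": false_neg / len(bad) if bad else 0.0,
    }


if __name__ == "__main__":
    data = make_records(1000, error_rate=0.1, seed=42)
    print(measure_error_rates(data, system_under_test))
```

Because every record carries its own ground-truth label, the same data set can be replayed after each change to the system to see whether the false positive and false negative rates improve. A real high-fidelity generator would add many more fields, cross-field correlations, and workflow rules; this sketch covers only the measurement idea.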