Privacy by Design
Privacy by Design Using Synthetic Test Data

Kenneth E. Washington, PhD, CIPP
Vice President and Chief Privacy Leader
Lockheed Martin Corporation
John Dawson
President
ExactData, LLC
September 15th, 2009
Many IT applications operate in environments with legally or contractually mandated levels of privacy and confidentiality. Such applications require high-quality private test data, both to validate the software and to train operators so that errors are avoided in production. Errors can be especially costly in systems that process personally identifiable information, because they can affect people's health, safety, and security.
Why Testing Applications That Process Private Data is Hard
Small subsets of real (or production) data are sometimes used for testing. This is undesirable because doing so could reveal personally identifiable information. In addition, extracts from production data sets might not be representative of the larger population, may be of uncertain provenance, and may lack the known “truth” needed to test or train against. De-identified data records are also often used, but by definition they lack the fundamental demographic information needed to analyze attributes such as family history or geographic location.
The alternative is to generate synthetic data sets that are both realistic and consistent. Historically, generating large private data sets that exhibit the required consistency and realism has proven extremely difficult, largely because of the complexity of creating a data record that is representative, statistically valid, and contextually correct throughout its multiple layers. Manually creating data records, which typically requires the high-level skills of both a computer scientist and a subject matter expert, is currently the preferred method for creating synthetic test data. Such manual methods are constrained by cost, require long production times, and are limited in the data complexity they can represent.
Solution
We present a novel approach to automating the generation of large sets of fully synthetic test data that meet the consistency and realism requirements of application testing. Multi-layered, complex data sets are set up using a graphical user interface that applies contextual rules, statistics, and progressively applied field and group dependencies. Every generated element can establish a new context for both subsequent and previous processing in a data set. Structured and unstructured data sets are supported, and essentially any defined schema can be used as the output format. The methodology also supports semantic interoperability requirements, so that different words that mean the same thing are appropriately reflected. This is especially important in areas with multiple or emerging standards, such as healthcare applications. Finally, the output formats support both test data for code validation and training files for application users.
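The idea of progressively applied field dependencies can be illustrated with a minimal sketch. This is not the authors' tool, which is configured through a graphical interface; the field names, lookup tables, and probabilities below are hypothetical placeholders, and the sketch covers only forward dependencies, where each new field is drawn in the context of fields already generated.

```python
import random

# Hypothetical lookup tables; the values are placeholders, not real statistics.
FIRST_NAMES = {"F": ["Alice", "Maria", "Mei"], "M": ["Robert", "Ahmed", "Luis"]}
CONDITION_RATES = [
    # (min_age, max_age, {condition: probability})
    (0, 17, {"asthma": 0.10}),
    (18, 64, {"asthma": 0.05, "hypertension": 0.20}),
    (65, 95, {"hypertension": 0.55, "arthritis": 0.40}),
]


def generate_record(rng):
    """Build one record field by field; later fields read earlier ones."""
    record = {"sex": rng.choice(["F", "M"]), "age": rng.randint(0, 95)}
    # Field dependency: the name pool is selected by the sex already drawn.
    record["first_name"] = rng.choice(FIRST_NAMES[record["sex"]])
    # Statistical rule: condition probabilities depend on the age band.
    record["conditions"] = []
    for low, high, rates in CONDITION_RATES:
        if low <= record["age"] <= high:
            record["conditions"] = [c for c, p in rates.items() if rng.random() < p]
    # Contextual rule: only populated when sex and age make it plausible.
    eligible = record["sex"] == "F" and 15 <= record["age"] <= 50
    record["pregnant"] = eligible and rng.random() < 0.05
    return record


def generate_records(n, seed=0):
    """Generate n records; a fixed seed makes the data set reproducible."""
    rng = random.Random(seed)
    return [generate_record(rng) for _ in range(n)]
```

Because every field is drawn after the fields it depends on, no record can combine values that contradict each other, which is the consistency property the text asks of synthetic data.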
The Value of Privacy by Design
The methodology will be demonstrated on a hypothetical healthcare application. In our test case, realistic synthetic private test data are generated so that privacy is built into the testing procedures by design. A test is designed to allow an operator-provided hypothesis to be tested, e.g., “exposure to drug XYZ increases risk of condition A”. Our solution allows a large data set (the haystack) to be created in which this relationship is buried (the needle) and marked (with event tags).
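The needle-in-a-haystack test can be sketched as follows. This is an illustrative assumption about how such a data set might be built, not the authors' implementation: the function names, the field names (`drug_xyz`, `condition_a`, `event_tag`), and all rates are hypothetical.

```python
import random


def build_haystack(n, seed=0, exposure_rate=0.3, baseline_risk=0.05, added_risk=0.10):
    """Create n synthetic records with a planted drug/condition relationship."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        exposed = rng.random() < exposure_rate
        # Background cases occur at the same rate regardless of exposure.
        has_condition = rng.random() < baseline_risk
        planted = False
        # The needle: extra cases are injected only among exposed records.
        if exposed and not has_condition and rng.random() < added_risk:
            has_condition = True
            planted = True
        records.append({
            "id": i,
            "drug_xyz": exposed,
            "condition_a": has_condition,
            # Event tag: marks exactly the cases created by the planted rule.
            "event_tag": "needle" if planted else None,
        })
    return records


def risk_ratio(records):
    """Risk of condition A among exposed records divided by risk among unexposed.

    Assumes both groups are non-empty and the unexposed group has at least one case.
    """
    def risk(group):
        return sum(r["condition_a"] for r in group) / len(group)

    exposed = [r for r in records if r["drug_xyz"]]
    unexposed = [r for r in records if not r["drug_xyz"]]
    return risk(exposed) / risk(unexposed)
```

An application under test should recover an elevated risk ratio from the haystack, and the event tags identify which individual records carry the planted signal, so the expected answer is known in advance.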
This approach to test data provisioning lowers costs, provides enhanced capabilities, and avoids the privacy issues that arise from using real production data.
In summary, our methodology provides:
- Assured privacy protections by avoiding the use of real production data
- Large realistic data record sets that better simulate system response
- Large realistic data record sets to re-create real life variability
- Large realistic data record sets with introduced anomalies that are flagged, creating the ability to test or train with a known response
- Flexibility to easily change and create new test or training case scenarios
- Data that enables deep levels of analysis to reveal subtle problems without creating privacy or confidentiality issues
- A cost-effective approach that scales to meet growing customer needs and application requirements
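Flagged anomalies give every test a known expected response, so a system under test can be scored automatically. The sketch below assumes a hypothetical `anomaly_flag` field on each record and a hypothetical detector function; neither name comes from the authors' tooling.

```python
def score_detector(records, detector):
    """Compare a detector's output with the anomaly flags in the test data.

    Returns (precision, recall); either is 0.0 when its denominator is zero.
    """
    tp = fp = fn = 0
    for record in records:
        predicted = detector(record)
        if predicted and record["anomaly_flag"]:
            tp += 1
        elif predicted:
            fp += 1
        elif record["anomaly_flag"]:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Tiny hand-made data set: two clean records and two flagged anomalies.
SAMPLE = [
    {"age": 34, "anomaly_flag": False},
    {"age": 212, "anomaly_flag": True},   # implausibly old
    {"age": 71, "anomaly_flag": False},
    {"age": -3, "anomaly_flag": True},    # negative age
]


def too_old(record):
    """Hypothetical validation rule under test: it only checks the upper bound."""
    return record["age"] > 120


print(score_detector(SAMPLE, too_old))  # prints (1.0, 0.5)
```

The recall of 0.5 exposes the rule's failure to catch the negative age, a gap that would be hard to measure against production data where the true anomalies are unknown.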
