The Data Blog
There is No Longer Any Excuse for Not Using Fully Synthetic Data for Developing and Training Your AI
Gartner predicts that by 2024, 60% of the data used for AI will be synthetic, generated to simulate reality, model future scenarios, and de-risk AI, up from just 1% in 2021. Investment in AI will continue to accelerate, both from organizations implementing solutions and from industries looking to grow through AI technologies and AI-based businesses. The beauty of synthesizing data on a computer is that it can be procured on demand, customized to your exact specifications, and produced in nearly limitless quantities. Training a billion-parameter foundation model takes time and money; replacing even a fraction of real-world training data with synthetic data can make it faster and cheaper to train and deploy AI models of all sizes. Collecting samples of every potential scenario, including rare, so-called edge cases, ranges from impractical to impossible, and synthetic data makes it possible to create customized data to fill those gaps. Large models also almost always contain hidden biases picked up from the articles and images they have ingested; synthetic data lets you test for these biases so they can be found and corrected.
Driven by privacy laws and restrictions, the synthetic data generation market is evolving from a large base of companies that generate test data by modifying an existing database with legacy Extract, Transform, Load (ETL) technologies toward fully synthetic generation, which requires no production database at all. Fully synthetic technologies either use algorithms to generate the data from scratch or use AI/ML to analyze a production database and reproduce a facsimile of it. The complexity of fully synthetic data, and its fitness for the system under test, varies widely: at the low end, free tools produce nonsensical, randomly generated data; at the high end, premium solutions model highly complex systems-of-systems databases and can generate statistically significant data for building confusion matrices and measuring system error rates. The market is migrating toward this higher-complexity end, driven by high-revenue enterprise sales and by clear benefits: better test objects that reduce system error rates while dramatically shortening software development cycles at lower cost than traditional methods.
A recent internet search revealed 43 companies participating in the test data generation market. The majority rely on traditional ETL methods, though there has been impressive growth in new companies generating fully synthetic data. Many of these newcomers combine AI/ML techniques with traditional ETL or lower-complexity algorithmic solutions. One example is Tonic, which appeared in the market within the last few years and has raised an impressive $35M in Series B venture funding. ExactData appears to remain the only company participating in the premium fully synthetic data generation market.
There are two fundamental approaches to generating synthetic data. The first starts from a production database, which is either modified or masked, manually or with Extract, Transform, Load (ETL) technologies, or analyzed with AI to generate a facsimile that mirrors some of its attributes. The second approach uses no production database at all and generates fully synthetic data. The fidelity of fully synthetic data can vary wildly, from random alphanumeric characters to high-fidelity data correlated at the field level across systems of systems, with business logic and workflow rules, correct statistical distributions, correlation over the time axis, high use-case coverage, engineered errors, and system response files. A method based on a production data source is typically best when policy decisions will be made with the data. For most other use cases, a fully synthetic approach is generally better suited: you do not need access to a confidential database (which may not even exist for a future-state system), there are no privacy restrictions on how you can use the data, it is less expensive and faster to produce, and because the ground truth and expected system responses are known, you can measure and improve system error rates.
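To make the contrast concrete, here is a minimal sketch of the fully synthetic approach: records are generated entirely from declared rules and distributions, and no production database is ever read. All field names, distributions, and business rules below are invented for illustration, not drawn from any real product.

```python
import random
import datetime

random.seed(42)  # fixed seed: the generated "ground truth" is reproducible

# Hypothetical field-level distribution for a customer attribute.
REGIONS = {"Northeast": 0.3, "South": 0.4, "West": 0.3}

def pick_weighted(weights):
    """Draw one key from a dict of {value: probability}."""
    r = random.random()
    cum = 0.0
    for value, w in weights.items():
        cum += w
        if r < cum:
            return value
    return value  # guard against floating-point rounding at the tail

def make_record(customer_id):
    # Business-logic rule: the account open date must precede the
    # first transaction date, so the rule is enforced by construction.
    opened = datetime.date(2020, 1, 1) + datetime.timedelta(days=random.randrange(365))
    first_txn = opened + datetime.timedelta(days=random.randrange(1, 90))
    return {
        "id": customer_id,
        "region": pick_weighted(REGIONS),
        "opened": opened.isoformat(),
        "first_txn": first_txn.isoformat(),
    }

dataset = [make_record(i) for i in range(1000)]
```

Because every record is built from explicit rules, the workflow constraint (open date before first transaction) holds for all 1,000 rows by construction, which is the known-ground-truth property the premium approaches exploit at far greater scale.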
According to VentureBeat, AI is facing several critical challenges. Not only does it need huge amounts of data to deliver accurate results, but it also needs to be able to ensure that data isn’t biased, and it needs to comply with increasingly restrictive data privacy regulations.
We have seen several solutions proposed over the last couple of years to address these challenges, including various tools designed to identify and reduce bias, tools that anonymize user data, and programs to ensure that data is only collected with user consent. But each of these solutions is facing challenges of its own.
Now we’re seeing a new industry emerge that promises to be a saving grace: synthetic data. Synthetic data is artificial, computer-generated data that can stand in for data obtained from the real world. A synthetic dataset must have the same mathematical and statistical properties as the real-world dataset it is replacing but does not explicitly represent real individuals. Think of it as a digital mirror of real-world data that is statistically reflective of that world. This enables training AI systems in a completely virtual realm, and it can be readily customized for a variety of use cases ranging from healthcare to retail, finance, transportation, and agriculture.
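As a rough illustration of what "same statistical properties" means in the simplest case, the sketch below fits summary statistics from a stand-in "real" column and then samples a synthetic column from the fitted distribution. The "real" values here are themselves simulated for the example; production tools fit far richer models than a single Gaussian.

```python
import random
import statistics

random.seed(0)

# Stand-in for a real-world numeric column (e.g. purchase amounts).
real = [random.gauss(50, 12) for _ in range(10_000)]

# Fit simple summary statistics from the real column...
mu, sigma = statistics.mean(real), statistics.stdev(real)

# ...then draw a synthetic column from the fitted distribution.
# No individual real value is ever copied into the synthetic set,
# yet aggregate analyses on either column give similar answers.
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]
```

The synthetic column preserves the mean and spread of the original while containing none of the original observations, which is the core privacy trade the article describes.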
Over the last few years, there has been increasing concern about how inherent biases in datasets can unwittingly lead to AI algorithms that perpetuate systemic discrimination. In fact, Gartner predicts that through 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them.
One alternative often used to offset privacy concerns is anonymization. Personal data, for example, can be anonymized by masking or eliminating identifying characteristics such as removing names and credit card numbers from ecommerce transactions or removing identifying content from healthcare records. But there is growing evidence that even if data has been anonymized from one source, it can be correlated with consumer datasets exposed from security breaches. In fact, by combining data from multiple sources, it is possible to form a surprisingly clear picture of our identities even if there has been a degree of anonymization. In some instances, this can even be done by correlating data from public sources, without a nefarious security hack.
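A toy example of that linkage problem: the records below are entirely invented, but they show the mechanics of how a table with names stripped can still be re-identified by joining it against a public roster on quasi-identifiers (ZIP code, birth year, sex) that survive anonymization.

```python
# "Anonymized" health table: names removed, quasi-identifiers kept.
# All rows are fabricated for illustration.
anonymized = [
    {"zip": "14623", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "14623", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]

# A separate, public dataset (e.g. a leaked or published roster).
public_roster = [
    {"name": "A. Smith", "zip": "14623", "birth_year": 1980, "sex": "F"},
    {"name": "B. Jones", "zip": "14450", "birth_year": 1990, "sex": "M"},
]

def link(anon_rows, roster):
    """Join the two tables on the shared quasi-identifiers."""
    key = lambda r: (r["zip"], r["birth_year"], r["sex"])
    names = {key(p): p["name"] for p in roster}
    return [(names[key(a)], a["diagnosis"]) for a in anon_rows if key(a) in names]

print(link(anonymized, public_roster))  # [('A. Smith', 'asthma')]
```

One of the two "anonymized" diagnoses is re-identified with a three-column join; fully synthetic records avoid this entirely because there is no real individual behind any row to link back to.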
Synthetic data promises to deliver the advantages of AI without the downsides. Not only does it take our real personal data out of the equation, but a general goal for synthetic data is to perform better than real-world data by correcting bias that is often ingrained in the real world.
It's no secret that technology is always evolving, and with Facebook's smart glasses recently announced, alongside the already existing Google Glass, the world comes closer and closer to seeing what we once thought was possible only in movies. Glasses that can take pictures could turn into glasses that project holograms within a few years. Virtual reality goggles and helmets, which already exist, can transport people into their own universe; it's only a matter of time until these technologies are combined and a whole new realm of opportunities opens up.
Markets will flood with what will no doubt be lucrative eyewear that can project videos in real time to you and your friends without the need for a computer screen. Perhaps iPhones will become 'Eye Phones' and you'll be able to project and shuffle through apps by voice command. Facebook's glasses already use voice commands to take pictures and videos; how long until they can scroll through your Facebook feed, post statuses, play games, or message friends?
Of course, one concern with any emerging technology these days is privacy: not only will private companies have access to what you're watching or projecting in real time, but so will everyone within a 10-foot radius, right? What happens if sensitive data is displayed for people to see? What happens if projections overlap and the technology starts to malfunction? There are a lot of questions for programmers and business executives to consider if they go this route.
The biggest question is, no doubt, whether this kind of technology will disrupt existing markets. Will smartphones be obsolete in 10 years if you can just wear glasses that do the same thing and more? Will we need televisions if we can stream whatever show we want, for an audience, using our very eyes?
As always, there's a lot to think about with these types of emerging technology. While it's likely years away from hitting the market, those years will pass by sooner than we think, and consumers and businesses alike will have to prepare.
Synthetic data consistently fills the gaps where real-world data can't quite hit the mark. Whether it's for the advancement of artificial intelligence or for building robust simulations, synthetic data has one thing that real-world data never will: controlled variation.
Because synthetic data is created artificially, it gives us a major advantage: we can control the test conditions and the variation within the data. Instead of relying on real-world data to satisfy every test condition you can think of, synthetic data fills each of those gaps with ease, allowing not only for progression but for automation as well. Soon, artificial intelligence will be able to improve itself by synthesizing its own simulated data and automating its own evolution.
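As a small sketch of controlled variation, the snippet below enumerates every combination of test conditions, guaranteeing coverage of boundary values and a rare status that real traffic might almost never produce. The dimensions (amounts, currencies, statuses) are hypothetical examples, not tied to any particular system.

```python
import itertools

# Controlled variation: declare the dimensions you care about,
# including deliberate edge cases, then generate every combination.
amounts = [0.00, 0.01, 999_999.99]            # boundary values
currencies = ["USD", "EUR", "JPY"]
statuses = ["new", "refunded", "chargeback"]  # "chargeback" is a rare edge case

test_cases = [
    {"amount": a, "currency": c, "status": s}
    for a, c, s in itertools.product(amounts, currencies, statuses)
]

print(len(test_cases))  # 27 combinations: full coverage, no waiting on real data
```

With real-world data you would have to wait (perhaps forever) for a zero-amount JPY chargeback to occur; here it exists the moment the generator runs, which is what makes automated testing against every condition practical.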
Think of it: if artificial intelligence can automate its own testing and training and improve itself to completion, there won't be a need for real-world data anymore. AI would just need to create its own data to train against, which, let's face it, would cover more ground a lot more quickly than any non-synthetic counterpart.
For example, self-driving cars that could calculate the quickest route to any destination on the fly, adjusting for upcoming traffic, accidents that may have occurred, or any other predicted trouble on the road, would innovate the automobile industry to no end.
This also begs the question, if everyone is using synthetic data for automation, who will do it best? Will AI compete with each other to automate itself best? Only time will tell.
Jefferson Health recently reported that its cloud-based database was hacked and that data belonging to 1,769 patients treated at the Sidney Kimmel Cancer Center was compromised as a result. The attack occurred back in April of 2021 but was reported publicly, and to the federal government, for the first time on Thursday, July 22nd, at the end of the 60-day legal window for reporting cyber attacks.
Cyber attacks in general have been on the rise since the beginning of the COVID-19 pandemic; ransomware attacks and hackings against health facilities in the United States, however, have soared 153% from the year prior, and these are just the ones that have been reported.
Additionally, Jefferson Health was not the only healthcare facility breached; reports suggest Yale New Haven Health System and many other healthcare organizations affiliated with Elekta were breached as well, with the intent seemingly to steal data related to cancer patients.
With cyber attacks on the rise across all industries, especially healthcare, it's easy to see that nobody is safe from malicious ransomware attacks. Companies worldwide are in constant need of cybersecurity expertise, but the supply doesn't seem to be getting any larger.
Synthetic data generation, however, offers an alternative solution, ensuring the safety of data belonging to clients while keeping the benefits of using real-world data. Now more than ever, synthetic data is imperative and serves as a great defense against hackers and cyberterrorists out to steal customer data.
Learn more at https://www.exactdata.net/
Earlier this week, the hacker gang behind an international crime spree claimed to have locked over a million individual devices and demanded $70 million in bitcoin to unlock them. REvil, a Russia-connected cyberterrorist group, previously hacked JBS's operations and has since compromised Kaseya and Coop, two international giants, as well as claiming to have attacked 1,000 individual small businesses.
Global ransomware attacks have been increasing steadily over the last few years, and while cyber defenses are continuing to improve, there's no telling who will be targeted next, and what it will cost your company if you are hacked. Money, assets, customers, and all kinds of personal data are at risk every day, and as the July 4th weekend proved, the threat is imminent.
Learn more at https://www.exactdata.net/
According to Erica Davis, Managing Director and Cyber Center of Excellence Leader for North America at Guy Carpenter, global cybercrime will cost $6T in 2021, against only $6B in 2021 cyber insurance gross written premiums. The Ponemon Institute indicates that 60% of cybercrime costs are due to third-party breaches. Fully synthetic data generation technologies eliminate the cost and risk of third-party breaches, so the potential global financial impact is enormous: a reduction in cybercrime costs of as much as $3.6T annually. Through broad adoption of synthetic data generation technologies, the insurance industry would also close a risk exposure gap of trillions of dollars.
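The $3.6T figure follows directly from the two numbers cited: the Ponemon estimate attributes 60% of the $6T annual cybercrime cost to third-party breaches.

```python
# Figures cited above: $6T global cybercrime cost (2021) and the
# Ponemon estimate that 60% of that cost stems from third-party breaches.
global_cost = 6.0e12        # $6T
third_party_share = 0.60    # 60%

potential_reduction = global_cost * third_party_share
print(f"${potential_reduction / 1e12:.1f}T per year")  # $3.6T per year
```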
Just recently, McDonald's suffered a data breach in which the personal data of customers in Taiwan and South Korea was exposed. This comes right after JBS admitted to paying $11 million in ransom to hackers who broke into their computer systems last month.
With more and more companies being targeted, it's hard to say who will be safe from looming threats.