The Data Blog
A thought exercise on the System perspective of dev and test, as enabled by ExactData Synthetic Data.
Let’s consider the development of an application that scours incoming data for fraudulent activity… How would that test and analysis look with production data, de-identified production data, hand crafted data, and ExD synthetic data?
Let’s also consider that the application will classify all transactions/events as either bad or good. The perfect application would classify every transaction correctly resulting in 100% Precision (everything classified as bad was actually bad), 100% capture rate (classified every actual bad as bad), 0% escape rate (no bads classified as good), and 0% False Positive rate (no goods classified as bad). The application needs to be developed, tested, and analyzed from a System perspective. For example, the application could classify every transaction as bad and achieve 100% capture rate, and 0% escape rate, but would also result in poor Precision and a huge False Positive rate – thus requiring significant labor support to adjudicate the classifications. On the other extreme, the application could classify everything good, be mostly right, and not catch any bads. Both of these boundary conditions are absurd but illustrate the point of the importance of System.
One method of System analysis is the Confusion Matrix, noted below.
With production data, you don’t know where the bads are, so you can’t complete the confusion matrix.
With de-identified production data, you don’t know where the bads are, so you can’t complete the confusion matrix.
With hand-crafted data, you might have the “truth” needed to complete the confusion matrix, but you would not have the complexity or volume to truly test for the “needle” in the haystack: fraudulent behavior hidden within a mass of good behavior.
With ExD synthetic data, you know where every bad is (you have the ground truth), so you CAN complete all 4 quadrants of the confusion matrix. Then, and only then, can you conduct a system analysis, driving the application toward the real goal of maximizing Precision (TP/(TP+FP)) and Capture rate (TP/(TP+FN)) while minimizing Escape rate (FN/(TP+FN)) and False Positive rate (FP/(FP+TN)). Within a particular setup of an application version, these are typically threshold trade-offs, but with next iteration development, there is the opportunity to improve on all scores.
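As a minimal sketch of the system analysis described above, the Python below counts the four quadrants from a known ground truth and computes the four scores. The data set (2 bads among 98 goods) and all function names are hypothetical, and the false positive rate is taken here as FP/(FP+TN), the share of goods classified as bad. It runs the two absurd boundary cases from the thought exercise: classify everything as bad, and classify everything as good.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # actual bad, classified bad
    fp: int  # actual good, classified bad
    fn: int  # actual bad, classified good (an escape)
    tn: int  # actual good, classified good


def confusion(truth, predicted):
    """Count the four quadrants; True means 'bad' in both lists."""
    pairs = list(zip(truth, predicted))
    return Confusion(
        tp=sum(1 for t, p in pairs if t and p),
        fp=sum(1 for t, p in pairs if not t and p),
        fn=sum(1 for t, p in pairs if t and not p),
        tn=sum(1 for t, p in pairs if not t and not p),
    )


def ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def metrics(c):
    return {
        "precision": ratio(c.tp, c.tp + c.fp),
        "capture_rate": ratio(c.tp, c.tp + c.fn),
        "escape_rate": ratio(c.fn, c.tp + c.fn),
        "false_positive_rate": ratio(c.fp, c.fp + c.tn),
    }


# Hypothetical ground truth: 2 bads hidden among 98 goods.
truth = [True] * 2 + [False] * 98

all_bad = metrics(confusion(truth, [True] * 100))    # flag everything
all_good = metrics(confusion(truth, [False] * 100))  # flag nothing

print("classify all as bad: ", all_bad)
print("classify all as good:", all_good)
```

Flagging everything gives a perfect capture rate with very poor precision, while flagging nothing gives a zero false positive rate with every bad escaping. Precision is undefined (NaN) in the second case because nothing was classified as bad.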
With every new year come exciting new updates and trends in the technological world around us! We at ExactData are excited about many trends and future advancements, but here are five that we're particularly excited about!
1) Advancement of AI and Mobile Intelligence
It's no secret that AI and mobile intelligence are evolving every day. We see no end to the growth in both areas, where things like facial recognition and fingerprint, voice, and eye scans are all becoming a more reliable reality! This is seen in many of the innovations that Apple, Samsung, and Google have brought to the table, but also in other fields of data science as well!
2) Automation and Innovation
When one thinks of automation and innovation, jobs and mundane tasks are often the first things that come to mind. How is data being innovated or automated, you may ask? Well, being able to derive data with faster response times, and being able to generate, switch, and use data for test purposes on the fly for exact results, seems innovative to us! This innovation can be traced to artificial intelligence as well, through pattern recognition, GPS sensors, self-driving cars, and more!
3) Cloud Computing and Cyber Security
Cloud computing is becoming more distributed, meaning a cloud provider can deliver services to other locations while still operating fully from one area. Server updates, latency checks, and bandwidth fixes are becoming quicker every year, which not only affects the cloud and its functions but can also be used to stop breaches, glitches, and hackers in their tracks as soon as they get into the system.
4) Financial Patterns and Recognition
Recognizing patterns in financial data has historically been tricky because of the immense analytical prowess and observational skill it can require. AI and statistical learning models, however, can be trained to pick up these patterns more quickly than ever before, and with less error too. Financial analytics and trend recognition will certainly see upgrades in the upcoming year, especially with more variables such as cryptocurrency coming into play.
5) Accessibility and Privacy
Accessibility and privacy for data files go hand in hand; by making something more accessible you also have the means to make it more restricted. Added levels of security for data can come in many different forms: test data, artificial data, cloud computing, advanced machine learning, more advanced security protocols, and more. The rule of thumb is to keep private everything that you may need later so that nobody else can take or modify it.
While there are many trends we believe to be up and coming in the world of data, these are just a few that we believe to be relevant to both the industry and the general public.
The terms "database" and "database management system" are often used interchangeably despite the fact that the two mean completely separate things. Both are important terms that those in the technology industry should know how to distinguish, but it seems many people either don't or can't. Below are brief definitions of the two terms.
A database is a logically modeled cluster of information [data] that is typically stored on a computer or other type of hardware that is easily accessible in various ways.
A database management system is a computer program or other piece of software that allows one to access, interact with, and manipulate a database.
There are many types of database management systems in use today. Historically, relational database management systems (RDBMSs) have been the most popular approach for managing data due to their accessibility and performance. Examples include Oracle and MySQL, as well as managed services such as Amazon RDS, all of which use Structured Query Language (SQL) to manipulate the databases they interact with. RDBMSs are generally designed to be ACID compliant and are typically used for OLTP workloads.
To address the limitations of relational database management systems, NoSQL databases have become more popular over the years. The term "NoSQL" was coined by Carlo Strozzi in 1998 as the name of his database, which did not use SQL for managing data, hence the label "NoSQL." Popular types of NoSQL databases include key-value databases, document databases, graph databases, and columnar databases. They share the same broad concept but differ in design, and each has advantages and disadvantages in different scenarios.
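To make the distinction concrete, here is a small sketch using SQLite through Python's built-in sqlite3 module. SQLite is not one of the systems named above; it is assumed here only because it ships with Python. The engine plays the role of the database management system, and the accounts table it manages (a hypothetical example) is the database. The transfer function also illustrates the atomicity part of ACID: both updates apply, or neither does.

```python
import sqlite3

# The DBMS is the SQLite engine, reached through the sqlite3 module.
# The database is the data it manages, held in memory for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically; return False if the transfer is rejected."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was rolled back too.
        return False


ok = transfer(conn, 1, 2, 30)         # valid transfer
rejected = transfer(conn, 1, 2, 500)  # would overdraw account 1

balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
print(ok, rejected, balances)
```

The second transfer credits account 2 before the debit fails, yet the final balances show no trace of that credit, which is the all-or-nothing behavior an OLTP system relies on.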
As we continue to move forward in the technology world, we constantly search for the most optimal solution for all of our data needs. These optimal solutions begin with which database management system or systems we choose to utilize to solve our data-related problems. Some database management systems are more equipped for certain scenarios than others, and figuring out which type works best for you is essential when working with big data.
Most scientists agree that no one really knows how the most advanced algorithms do what they do, nor how well they are doing it. That could be a problem. Advances in synthetic data generation technologies can help. These algorithms generate data with a known ground truth, sufficient volumes and with statistically relevant true and false positives (TP, FP) and true and false negatives (TN, FN) for the nature of the test. AI algorithms can now be measured for precision, c, as the fraction of the predicted matches that are true positive matches, or c = TP/(TP + FP).
With the Equifax breach back in the limelight due to the cancellation of the $125 check the FTC promised to those impacted, we want to take a look at how the breach might have been prevented in the first place, or at least how the damage could have been minimized.
A very interesting application of high-fidelity synthetic data generation techniques is reducing credit card fraud. By 2025, global losses to credit card fraud are expected to reach almost $50 billion. Detecting fraudulent transactions in a large data set is difficult because they are such a small percentage of the overall transactions. Banks and financial institutions need a solution that can correctly identify both fraudulent and non-fraudulent transactions and measure true and false positives and negatives, enabling the creation of receiver operating characteristic (ROC) curves and the tuning of the system to balance the cost of correcting a fraudulent payment against the cost of the payment itself. High-fidelity synthetic data solves this dilemma by generating volumes of non-fraudulent transactions while interweaving complex fraud patterns into a very small subset of the overall transactions. The fraud patterns are known, enabling the credit card fraud detection system to be optimized.
Most application testing, both for performance and in development environments, is done today using production data that has been extracted through an ETL (Extract, Transform, Load) process and then manually modified to create specific use cases. For cyber applications, for example, most testing is done by replaying network traffic. Because this process is so labor intensive, use case coverage is generally very low and most of the business logic and workflow rules go untested. This is where the concept of sufficiently complex data comes in. Test data should be large enough to cover peak processing volumes and complex enough to cover almost all of the business logic and workflow rules. Using large amounts of sufficiently complex test data will exercise algorithms at peak processing volumes to expose failures before moving to the production environment, and will enable precision error measurement for ambiguous, true, and false errors. Systems can then be optimized for the cost of errors versus the cost to correct.
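The tuning step can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the fraud rate of about 1%, the beta-distributed detector scores, and the two cost figures are hypothetical, not ExactData's method or real loss data. Because the fraud labels are known, each threshold yields a point on the ROC curve (capture rate against false positive rate) and a total cost that can be minimized.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical labeled set: (score, is_fraud) pairs with roughly 1% fraud.
# Scores stand in for a detector's output in [0, 1]; fraud skews higher.
transactions = []
for _ in range(10_000):
    is_fraud = rng.random() < 0.01
    score = rng.betavariate(5, 2) if is_fraud else rng.betavariate(2, 5)
    transactions.append((score, is_fraud))

COST_REVIEW = 5.0    # assumed cost to adjudicate one flagged transaction
COST_ESCAPE = 500.0  # assumed loss from one fraud that is not flagged


def evaluate(threshold):
    """Flag every score at or above the threshold and tally the outcome."""
    tp = fp = fn = tn = 0
    for score, is_fraud in transactions:
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            fp += 1
        elif is_fraud:
            fn += 1
        else:
            tn += 1
    return {
        "threshold": threshold,
        "tpr": tp / (tp + fn) if tp + fn else 0.0,  # capture rate
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # false positive rate
        "cost": COST_REVIEW * (tp + fp) + COST_ESCAPE * fn,
    }


# One ROC point per threshold, from 0.00 to 1.00 in steps of 0.05.
roc = [evaluate(step / 20) for step in range(21)]
best = min(roc, key=lambda point: point["cost"])

for point in roc:
    print(point)
print("lowest-cost threshold:", best["threshold"])
```

Changing the two cost constants moves the lowest-cost threshold, which is the trade-off between the cost of errors and the cost to correct them.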
What is ExactData? What do we do? Why is it important, and how can we help you? These are some of the many questions we would like to answer to give a little more insight about how we operate.
ExactData is based in Rochester, New York, and we specialize in automating the generation of large, fully artificial, engineered test data for enhanced performance and quicker results. Our data eliminates security and privacy risks and uses no personal information whatsoever, making it completely safe to use as well as unique and optimized for each situation. The creation and advancement of simulated data is a new and growing field, and we strive to improve our product every day.