The Data Blog
In data science, training and fitting a model via machine learning is one of those subjects that never has quite the same answer twice. Every model is different; each has its own data, response, and predictors unique to it. Yes, there is a "right way" and a "wrong way" to train and fit a model, but what counts as right or wrong is largely subjective. If the model works, does that mean you trained it correctly every step of the way? If the model's predictions are inconclusive, or the opposite of your hypothesis, did you train it wrong?
The first step in proper model training is always asking yourself what the goal of the model is. What is its purpose? What are you trying to predict? If you can't summarize your aim in a sentence or two, you need to reevaluate your goals and the concept behind the model. Every model should have a clear purpose that can be easily explained to anyone willing to listen.
"I'm testing to see if the amount of sleep one gets is a reliable predictor to whether or not they will drink coffee the next morning."
Great, right? Wrong. While the description above certainly seems valid, plenty of questions already arise from that one sentence. Does the person have to be a regular coffee drinker to count in the analysis? What do you mean by "amount of sleep": a little, a lot, just enough? When does the morning end? What defines "reliable" in this context?
To be honest, we can nitpick even the greatest models, but at the very least, a great model's problem, objective, solution, and features should be clearly identifiable from its summary.
"My model predicts whether or not the average caffeine consumer will drink coffee within four hours of waking up if they got less than 8 hours of sleep the previous night."
Once you can summarize your model and clearly lay out your intent, you'll need the right data to back it up. A few questions to keep in mind: How will you get the data? How do you know the data is accurate? Will you filter out outliers, or take a random sample of the observations? What if some of your data is incomplete or missing? Which fields matter more than others: which are predictors, which are responses, and which are unnecessary? Is there multicollinearity, or any correlation at all for that matter, within your data?
There are many questions that need to be addressed with the data, so many that they can't all be listed here. However, the most important part is that you're able to validate your data and the model that depends on it, so you can move forward with training your model.
We may train the model in a number of ways, one of the most popular being a linear regression fit aimed at answering a specific question: what is the relationship between our predictors and our outcome? While training on our data, we hope to isolate the predictors that have a significant impact on the outcome and keep them, while removing those that don't contribute in the long run. Constraining the model to the variables and effect sizes that genuinely matter is part of the process known as regularization. Regularization is important because it determines how well your model performs on data other than your training set. If your model doesn't generalize to new data meant for it, it may be underfit or overfit, and you'll have to step back and rethink your best predictors.
Thinking back to our coffee example, our response variable is whether or not a consumer drank coffee. The most obvious predictor is the amount of sleep the previous night, but we should include other candidates as well, such as age, sex, weight, and access to coffee. We'd then aim to trim any variable that doesn't turn out to be a valid predictor of drinking coffee the next morning. Once we believe we have the best predictors, we'd test the model on other datasets and continue training it until we're satisfied.
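To make the regularization idea concrete, here is a minimal sketch in plain Python (no libraries) of a regression fit for the binary drink/no-drink outcome, using a logistic fit with an L2 penalty on the sleep coefficient; the toy data, function names, and penalty strength are all invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, l2=0.1, lr=0.05, epochs=3000):
    """Gradient descent for P(y=1) = sigmoid(w*x + b), with an L2
    penalty on w -- the regularization step described above."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * (gw / n + l2 * w)   # the penalty shrinks w toward 0
        b -= lr * (gb / n)
    return w, b

# Invented toy data: hours of sleep vs. coffee next morning (1 = drank).
sleep  = [5, 6, 6.5, 7, 7.5, 8.5, 9, 9.5, 10, 10.5]
coffee = [1, 1, 1,   1, 1,   0,   0, 0,   0,  0]
w, b = train_logistic(sleep, coffee)
print(w < 0)  # True: less sleep predicts drinking coffee
```

The penalty keeps the sleep coefficient from ballooning to fit the training points perfectly, which is exactly what protects the model on data outside the training set.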
Learn more at www.exactdata.net/
Data collection has to happen at some point, and what better way to learn about someone's interests, plans, and aspirations than by updating databases and collecting new data when people are at their most giving? Businesses typically see a staggering 73% increase in data collection over the holiday season via new email-list recipients, personal information from buyers' clubs, one-time purchasers, account holders, and more! Holiday data collection also flows through social media: advertising campaigns, cross-promotions, forms, and tracking cookies reveal who likes what before that insight is passed on to companies deciding which demographics to target. Those companies could be selling any kind of product, like gifts you might be interested in, or could simply be collecting data on what you bought and where you spent the holidays so they know where to reach you.
The truth is, so much data collection occurs during the holiday season, both online and off, that it's almost impossible to track all of it. Through social media advertisements, online email lists, in-store and e-commerce purchases, and more, businesses collect your personal data, and most of the time you won't remember signing up for it.
With more and more people noticing their personal data being offered up on a silver platter most of the year, let alone during the holiday season, it doesn't take a private investigator to see that something has to change. Why keep collecting data like this when there are new and improved techniques to generate your own data, synthetically and ethically? The more synthetic data generation there is on the market, the less we'll have to rely on gathering public data, having our own data sold, or fighting budget and time constraints to obtain the data we need.
Learn more at ExactData.net
Last weekend, a team of private citizens composed of expert codebreakers (computer programmers and mathematicians) solved what's known as the "340 Cipher," a jumbled series of numbers, letters, and symbols created by the infamous Zodiac Killer and sent in letters to taunt police about the crimes he had committed.
By using codebreaking software to run 650,000 different simulations, the team was able to produce output that identified the correct sequence of characters. This begs the question: how much further must AI advance before it can efficiently analyze and propose solutions for ciphers and other types of cryptograms? While it would have to be trained to recognize the different kinds of cryptograms and pick the best candidate solution based on the parameters and context of the cipher, it isn't farfetched to think AI will be able to produce these kinds of outputs before long.
Computer programs that can solve certain types of cryptograms and ciphers already exist online; these tools are limited, however, and need a certain amount of help to actually solve the puzzles they're fed. Likewise, as the software used on the 340 Cipher shows, codebreaking technology already exists; it's just a matter of refining and training it to become more efficient.
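As a toy illustration of the "generate many candidates, score them, keep the best" idea behind these tools, here is a sketch that brute-forces a simple Caesar cipher (vastly simpler than the 340 Cipher) and scores each shift by how many common English words it produces; the message and word list are invented for the example:

```python
def caesar_shift(text, k):
    """Shift each alphabetic character by k positions (a Caesar cipher)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def crack_caesar(ciphertext):
    """Try all 26 shifts and score candidates by common-word hits."""
    common = {"the", "and", "to", "of", "a", "in"}
    def score(k):
        return sum(w in common for w in caesar_shift(ciphertext, k).lower().split())
    best = max(range(26), key=score)
    return best, caesar_shift(ciphertext, best)

message = "the police received the taunting letters"
ciphertext = caesar_shift(message, 3)      # "wkh srolfh ..."
shift, recovered = crack_caesar(ciphertext)
print(recovered)  # "the police received the taunting letters"
```

Real codebreaking software replaces the word-list score with far richer language models and searches a much larger key space, but the candidate-and-score loop is the same shape.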
Only time will tell how advanced we can become with our codebreaking and sleuthing technologies, but the more advanced our AI becomes, the better our odds of solving mysteries which were previously thought to be unsolvable.
COVID-19 is one of the most dangerous problems we as a society struggle with today, and to make matters worse, the disease is highly contagious and spreading rapidly around the world. Since many people are unaware of their health status and don't find it necessary to get tested, and there aren't enough test kits readily available for every single person, it's essential that we use our resources and historical data to track the virus so we can begin to stop it in its tracks.
By preparing travel, social, and contact networks, we may be able to track, to a certain degree, where the virus is, isn't, and may potentially be. A travel network specifies a single trip, a series of trips, or a pattern of travel by a node [individual] or group of nodes [group of individuals], by any mode, to any location. A social network is a network of known social interactions between family, friends, co-workers, and others you are relatively familiar with. A contact network, meanwhile, tracks the time and proximity one node has to another at any given moment, and isn't limited to people the individual knows; contact networks include interactions with a cashier when buying a coffee, or passing someone nearby on local transportation. By combining the three types of networks, we can effectively understand each node's travel, social, and contact patterns and compare them to COVID-19's own pattern of spread, a process we can denote as contact tracing.
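As a rough sketch of how the three layers might combine, the toy code below merges invented travel, social, and contact edge lists into one graph and walks outward from an infected node to flag potentially exposed nodes; all names, edges, and the two-hop cutoff are assumptions for illustration:

```python
from collections import deque

# Hypothetical edge lists for the three network layers.
travel  = [("ann", "bob")]                    # shared a flight
social  = [("bob", "cara"), ("ann", "dan")]   # known acquaintances
contact = [("cara", "ed")]                    # brief proximity, e.g. a cashier

def build_graph(*layers):
    """Merge edge lists into one undirected adjacency map."""
    g = {}
    for layer in layers:
        for a, b in layer:
            g.setdefault(a, set()).add(b)
            g.setdefault(b, set()).add(a)
    return g

def exposed(graph, source, max_hops=2):
    """Breadth-first search: nodes within max_hops of an infected source."""
    seen, queue, out = {source}, deque([(source, 0)]), set()
    while queue:
        node, d = queue.popleft()
        if d == max_hops:
            continue
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                out.add(nb)
                queue.append((nb, d + 1))
    return out

g = build_graph(travel, social, contact)
print(sorted(exposed(g, "ann")))  # ['bob', 'cara', 'dan']
```

Note how "cara" is flagged through a chain of two different layers (travel, then social), which is exactly the value of combining the networks rather than analyzing each in isolation.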
Using the data collected from the COVID-19 outbreak, as well as from those who have been tested for exposure, we have the opportunity to track the whereabouts of the pandemic and fight it before its next wave, or a future pandemic, begins. The first of our two key assumptions for this methodology is that we have enough readily available data to track where COVID-19 has been and currently is, so we can also predict where it is likely to go. The second is that we can find a way to track those we don't have data on, since the contact network includes interactions with unknown nodes as well as known ones. Nevertheless, this is a rare opportunity to fight back against COVID-19 and future pandemics, and we should take every advantage we can to prepare.
Happy New Year from us at ExactData! With each new year the aspiration to evolve technology even further grows exponentially, and the once thought to be improbable becomes possible right before our very eyes through both sustainable and disruptive innovations.
2020 promises to bring massive changes to the tech world, including but not limited to:
The terms "database" and "database management system" are typically used interchangeably despite the fact that the two mean completely different things. Both are important terms that those in the technology industry should know how to distinguish between, but it seems many people either don't or can't. Below are quick definitions of the two.
A database is a logically modeled collection of information [data], typically stored on a computer or other hardware, that is easily accessible in various ways.
A database management system is a computer program or other piece of software that allows one to access, interact with, and manipulate a database.
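The distinction can be seen concretely with Python's built-in sqlite3 module: SQLite is the database management system, while the in-memory store it creates and manages is the database; the table and rows below are invented for illustration:

```python
import sqlite3

# The DBMS is SQLite (reached through Python's sqlite3 module);
# the database is the in-memory store it creates and manages.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE drinkers (name TEXT, hours_slept REAL, drank_coffee INTEGER)"
)
conn.executemany(
    "INSERT INTO drinkers VALUES (?, ?, ?)",
    [("ann", 6.5, 1), ("bob", 9.0, 0)],
)
# We never touch the stored bytes directly; all access goes through the DBMS.
rows = conn.execute(
    "SELECT name FROM drinkers WHERE drank_coffee = 1"
).fetchall()
print(rows)  # [('ann',)]
conn.close()
```

Swap SQLite for Oracle or MySQL and the database contents could stay the same while the management system changes, which is the whole point of keeping the two terms separate.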
Additionally, there are many types of database management systems in use today. Historically, relational database management systems (RDBMSs) have been the most popular approach to managing data, thanks to their accessibility and performance. Examples of RDBMSs include Amazon RDS, Oracle, and MySQL, all of which use Structured Query Language (SQL) to manipulate the databases they interact with. RDBMSs are typically ACID compliant and often implement an OLTP system.
To address the limitations of relational database management systems, NoSQL databases have grown more popular over the years. The term "NoSQL" was coined by Carlo Strozzi in 1998 as the name of his database, which didn't use SQL to manage data, hence the label. Popular types of NoSQL databases include key-value databases, document databases, graph databases, and columnar databases, all of which are similar in concept but differ in implementation, each with advantages and disadvantages in different scenarios.
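As a toy sketch of the key-value model only (not how a production NoSQL store is built), a plain Python dictionary captures the core idea: direct lookup by key, no fixed schema, no joins; the keys and values below are invented:

```python
# Toy key-value "database": values can be any shape, keyed by a string.
kv_store = {}

def put(key, value):
    kv_store[key] = value

def get(key, default=None):
    # Constant-time lookup by key; there is no query language.
    return kv_store.get(key, default)

put("user:1001", {"name": "ann", "plan": "pro"})
put("user:1002", {"name": "bob", "plan": "free", "referrer": "ann"})

print(get("user:1001")["name"])  # ann
```

Notice the two values don't share a schema, which is a flexibility relational tables don't offer, and also why key-value stores give up SQL-style cross-row queries.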
As we continue to move forward in the technology world, we constantly search for the optimal solution to our data needs. That search begins with which database management system or systems we choose for our data-related problems. Some database management systems are better equipped for certain scenarios than others, and figuring out which type works best for you is essential when working with big data.
On Thursday, November 7th, the Institute for Robotic Process Automation & Artificial Intelligence (IRPA AI) New York Chapter Launch Party will take place in Manhattan, New York, at 25 West 39th Street, Floor 14. The launch party will include a pre-launch networking event for the new chapter beginning at 5:15pm, which will serve as an opportunity to plan and discuss future programs for the chapter. The launch party will also inform guests on how they can get involved with the chapter. Drinks will be served during the pre-launch networking happy hour!
For more information or to RSVP to the event, please follow the link here.
For questions or general inquiries about IRPA AI or the NY Chapter Launch Party, please contact Molly Alexander at Molly.Alexander@irpanetwork.com.
You can also learn more about IRPA AI on their website or their LinkedIn!