The Data Blog
In data science, training and fitting a model with machine learning is one of those subjects that never has quite the same answer twice. Every model is different; each has its own data, response, and predictors. Yes, there is a "right way" and a "wrong way" to train and fit a model, but what counts as right or wrong is often subjective. If the model works, does that mean you trained it correctly every step of the way? If its predictions are inconclusive or contradict your hypothesis, did you train it wrong?
The first step in proper model training is always asking yourself what the goal of the model is. What is its purpose? What are you trying to predict? If you can't summarize your aim in a sentence or two, you need to reevaluate your goals and the concept behind the model. Every model should have a clear purpose that can be easily explained to anyone willing to listen.
"I'm testing to see if the amount of sleep one gets is a reliable predictor of whether or not they will drink coffee the next morning."
Great, right? Wrong. While the above description certainly seems valid, many questions already arise from that one sentence. Does a person have to be a regular coffee drinker to be included in the analysis? What do you mean by the "amount of sleep": a little, a lot, just enough? When does the morning end? What defines "reliable" in this context?
To be honest, we can nitpick even the greatest models, but at the very least, a great model's problem, objective, solution, and features should be clearly identifiable from its summary.
"My model predicts whether or not the average caffeine consumer will drink coffee within four hours of waking up if they got less than 8 hours of sleep the previous night."
Once you can summarize your model and lay out your intent clearly, you'll need the right data to back it up. A few questions to keep in mind: How will you get the data? How do you know the data is accurate? Will you filter out outliers or take a random sample of the observations? What if some of your data is incomplete or missing? Which fields matter most: which are predictors, which are responses, and which are unnecessary? Is there multicollinearity among your predictors, or any correlation at all within your data?
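A few of these checks can be sketched in plain Python. This is only a minimal illustration, assuming a small hand-made list of sleep durations; the helper names (count_missing, iqr_bounds, pearson) and the sample values are made up for the example, and the 1.5 x IQR rule is just one common convention for flagging outliers.

```python
import math
from statistics import mean, quantiles


def count_missing(values):
    """Count entries that are None or NaN."""
    return sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits; values outside them are flagged as outliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up hours of sleep, with one missing value and one implausible entry.
hours_slept = [7.5, 6.0, None, 8.0, 5.5, 7.0, 19.0, 6.5]

n_missing = count_missing(hours_slept)
complete = [v for v in hours_slept if v is not None]
low, high = iqr_bounds(complete)
filtered = [v for v in complete if low <= v <= high]

print("missing:", n_missing)
print("kept after outlier filter:", filtered)
```

Running pearson on each pair of candidate predictors is a quick first look at multicollinearity: pairs with a correlation close to 1 or -1 carry largely the same information.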
There are more questions to ask of the data than can be listed here. The most important point is that you can validate your data, and the model that depends on it, before you move forward with training.
We may train our model in a number of ways. One of the most popular is a linear regression fit aimed at answering a specific question: what is the relationship between our predictors and our outcome? While training the model, we hope to isolate the predictors that have a significant impact on the outcome and keep using them, while removing those that have no meaningful impact in the long run. Trimming predictors this way is feature selection; a closely related technique, regularization, penalizes large coefficients so that weak predictors are shrunk toward zero. Both matter because they affect how well your model performs on data other than your training set. If your model does much worse on new data of the kind it was built for, it may be underfit or overfit, and you'll have to back up a bit and reconsider your predictors.
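To make the idea of regularization and held-out data concrete, here is a minimal sketch of a one-predictor linear fit with a ridge-style penalty on the slope. Everything in it is an assumption for illustration: the function names (fit_ridge_1d, mse), the penalty values, and the tiny made-up dataset of hours slept versus cups of coffee. With penalty=0 it reduces to ordinary least squares.

```python
from statistics import mean


def fit_ridge_1d(xs, ys, penalty=0.0):
    """Fit y = intercept + slope * x, adding penalty * slope**2 to the squared error.

    The intercept is not penalized. penalty=0 gives ordinary least squares.
    """
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    intercept = my - slope * mx
    return intercept, slope


def mse(model, xs, ys):
    """Mean squared error of a fitted (intercept, slope) pair on some data."""
    intercept, slope = model
    return mean((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data: hours slept vs. cups of coffee the next day.
train_x = [5.0, 6.0, 7.0, 8.0, 9.0]
train_y = [3.0, 3.5, 2.0, 1.5, 1.0]
test_x = [5.5, 6.5, 7.5, 8.5]
test_y = [3.0, 2.5, 2.0, 1.0]

# Compare error on the training data with error on data the fit never saw.
for penalty in (0.0, 1.0, 10.0):
    model = fit_ridge_1d(train_x, train_y, penalty)
    print(
        f"penalty={penalty:>4}: slope={model[1]:.3f} "
        f"train_mse={mse(model, train_x, train_y):.3f} "
        f"test_mse={mse(model, test_x, test_y):.3f}"
    )
```

The comparison to watch is between the training error and the held-out error: a large gap suggests overfitting, while poor error on both suggests underfitting or a penalty that is too strong.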
Thinking back to our coffee example, our response variable is whether or not a consumer drank coffee. The most obvious predictor is the amount of sleep the previous night, but the model should include other candidate predictors as well, such as age, sex, weight, and access to coffee. We'd then aim to trim any variable that turns out not to be a useful predictor of drinking coffee the next morning. Once we believe we have the best predictors for our model, we'd test it on other datasets and continue training it until we're satisfied.
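Here is one way the coffee example might look in code. Because the response is yes or no, this sketch swaps in logistic regression (fit with plain gradient descent) rather than a linear fit. The data is synthetic and generated so that less sleep makes coffee more likely while age has no effect; the function names, learning rate, and sample sizes are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, row):
    """Probability of drinking coffee; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
    return sigmoid(z)


def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            err = predict_proba(weights, row) - y
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


def accuracy(weights, rows, labels):
    """Share of rows where the 0.5-threshold prediction matches the label."""
    hits = sum(
        (predict_proba(weights, r) >= 0.5) == bool(y)
        for r, y in zip(rows, labels)
    )
    return hits / len(rows)


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible


def make_rows(n):
    """Synthetic people: [centered hours slept, scaled age] and a coffee label."""
    rows, labels = [], []
    for _ in range(n):
        sleep_c = rng.uniform(4, 10) - 7          # hours slept, centered at 7
        age_c = (rng.uniform(18, 65) - 40) / 20   # age, roughly scaled to [-1, 1]
        p_coffee = sigmoid(-1.5 * sleep_c)        # assumption: age plays no role
        rows.append([sleep_c, age_c])
        labels.append(1 if rng.random() < p_coffee else 0)
    return rows, labels


train_rows, train_labels = make_rows(200)
test_rows, test_labels = make_rows(100)

weights = fit_logistic(train_rows, train_labels)
print("intercept, sleep weight, age weight:", [round(w, 3) for w in weights])
print("held-out accuracy:", accuracy(weights, test_rows, test_labels))
```

Comparing the fitted weights is a rough guide to which predictors are worth keeping, and the held-out accuracy plays the role of testing the model on other datasets.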
Learn more at www.exactdata.net/