Have you ever had this happen to you? You created a new social media company. All you wanted to do was use advertising to pay the bills. But you optimized a few AI models to give users what they want so they keep using your product longer. They get things they like, and you get more ad revenue from them being on the site longer. Win-win, right?
Then the AI model finds out people stay online longer if they are emotionally engaged, and nothing engages more than controversy. Now everyone hates each other because all the AI model serves is the most controversial material. This isn't how things were supposed to work… right?
Let's make sure that doesn't happen. In today's episode we will cover how to use the right metrics for your product to create a good experience for the user.
This podcast is called Design for AI
It is here to help define the space where machine learning intersects with UX, where we talk to experts and discuss topics around designing a better AI.
music is by Rolemusic
I'm your host, Mark Bailey
Let's get started
I know this is probably going to be the most technical episode I've done so far, so I'll apologize in advance, but I do think it is important. One of the most important areas that gets overlooked for good design is UX input into the creation of the model. But UX can't give useful feedback unless they know what is going on, and that means some technical learning. I'll make sure to cover the meaning of all the terms. The way I've learned a lot of machine learning is to go over the same info twice, so the second time around I know the terms and it sinks in better. So if this episode doesn't make sense, maybe give it a listen again. If it still doesn't make sense, let me know what your questions are.
I've split it up into three groups of metrics: metrics that developers, PMs, and UX should care about. Let's start with developer metrics.
For developer metrics, I've split it up by the main model types that use differing metrics: classification, regression, and ranking models.
Performance Metrics for Classification Problems
Accuracy = number of correct predictions / total number of predictions. It is the basic metric used for almost all models. As the UX person, hopefully the developer you are working with isn't using this as the only metric, since it can give you a false sense of security. For example, if you are looking for an anomaly that only happens 2% of the time, a model that never predicts the anomaly will still have an accuracy rate of 98%.
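To make that pitfall concrete, here is a minimal sketch in Python. The data is made up for illustration: 100 cases, 2% anomalies, and a "model" that never predicts an anomaly.

```python
# Hypothetical dataset: 100 cases, 2 of which are anomalies (label 1).
actuals = [1] * 2 + [0] * 98
# A useless "model" that never predicts the anomaly.
predictions = [0] * 100

correct = sum(1 for a, p in zip(actuals, predictions) if a == p)
accuracy = correct / len(actuals)
print(accuracy)  # 0.98 -- looks great, yet the model never catches a single anomaly
```

That 98% number is exactly why accuracy alone can mislead you.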
A better way to understand accuracy is with a Confusion Matrix. Think of this as four boxes:
Top left is True Positives: the cases in which we predicted YES and the actual output was also YES. This is correctly detecting something that was there.
Top right is True Negatives: the cases in which we predicted NO and the actual output was NO. This is correctly detecting that something was not there.
Bottom left is False Positives: the cases in which we predicted YES and the actual output was NO. This is also called a type I error. It is like a doctor telling a man he is pregnant.
Bottom right is False Negatives: the cases in which we predicted NO and the actual output was YES. This is also called a type II error. It is like telling a woman in the middle of having a baby that she isn't pregnant.
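Counting the four boxes from a list of labels and predictions is simple. A quick sketch, with made-up data (1 = YES, 0 = NO):

```python
# Tally the four confusion-matrix boxes from labels and predictions.
actuals     = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(1 for a, p in zip(actuals, predictions) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actuals, predictions) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actuals, predictions) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actuals, predictions) if a == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 3 1 1
```

Every metric in the rest of this section is built out of these four counts.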
Sensitivity matters more when classifying the positive cases correctly is more important than classifying the negative cases. An example of this is detecting cancer. You don't want any malignant case to be classified as 'benign'. So it is better to tell a few people they have cancer who don't, than to let people who do have cancer slip through the cracks. The consequence of those mistakes is that some people have a few bad days and need to get retested. That is seen as a better result than people dying because they were told they were fine when they had cancer.
Sensitivity measures the proportion of actual positives that are correctly identified as such. If you want to measure how sensitive a model is, you will use the True Positive Rate. This is defined as TP / (TP + FN). The True Positive Rate corresponds to the proportion of positive data points that are correctly considered positive, with respect to all positive data points.
Specificity matters more when classifying the negative cases correctly is more important than classifying the positives. Maximizing specificity is more relevant in cases like spam detection, where you strictly don't want genuine messages (the negative case) to end up in the spam box (the positive case). It is better for someone to read a few spam messages than to miss important messages.
Specificity is the proportion of actual negatives that are correctly identified as such. For example, the percentage of healthy people who are correctly identified as not having the condition. Specificity is defined as TN / (TN + FP), which is the same as 1 minus the False Positive Rate, FP / (FP + TN). The False Positive Rate corresponds to the proportion of negative data points that are mistakenly considered positive, with respect to all negative data points.
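Here is how both come out of the confusion-matrix counts; the counts below are hypothetical.

```python
# Sensitivity (True Positive Rate) and specificity from confusion-matrix counts.
tp, fn = 45, 5    # 50 actual positives
tn, fp = 90, 10   # 100 actual negatives

sensitivity = tp / (tp + fn)          # share of positives caught
false_positive_rate = fp / (fp + tn)  # share of negatives wrongly flagged
specificity = tn / (tn + fp)          # equals 1 - false_positive_rate

print(sensitivity, specificity)  # 0.9 0.9
```

A cancer-screening model would tune for high sensitivity; a spam filter for high specificity.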
An improvement on the simplified measurement of accuracy is decreasing the logarithmic loss (log loss). It works by penalizing false classifications, and the more confident the wrong prediction, the bigger the penalty. It starts from the probability the model assigns to each class for each data point. Minimizing log loss generally gives greater accuracy for a classification model.
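A minimal sketch of binary log loss; the labels and predicted probabilities are made up for illustration.

```python
import math

# Binary log loss: confident wrong predictions are penalized heavily.
labels = [1, 1, 0, 0]
probs  = [0.9, 0.6, 0.2, 0.1]  # model's predicted probability of class 1

def log_loss(labels, probs, eps=1e-15):
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

print(round(log_loss(labels, probs), 3))
```

Notice the second prediction (label 1, probability 0.6) contributes far more loss than the first (probability 0.9), even though both would be counted as simply "correct" by accuracy.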
Receiver operating characteristic (ROC)
A Receiver Operating Characteristic (ROC) curve is a curve built from the true positive rate vs. the false positive rate at different classification thresholds. Basically, it maps out the line between the model getting it right and getting it wrong. ROC curves are probably the most commonly used measure for evaluating the predictive performance of scoring classifiers, so this metric is best used for evaluating classification models.
The thing you need to think about when this is used as a metric is that the way you move along the ROC curve is by changing the classification threshold. Lowering the threshold classifies more items as positive, increasing both False Positives and True Positives. So it is a way to move back and forth as needed, depending on which is more important.
Area under Curve (AUC)
The AUC is a hard thing to wrap your mind around. The technical definition for AUC is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
An easier way to look at it: if you map out an ROC curve on a graph, the area under that curve measures how good the predictions are. A perfect model has an AUC of 1; as the model gets worse, the AUC drops toward a fraction of that, with 0.5 being no better than random guessing. AUC provides an aggregate measure of performance across all possible classification thresholds, so it is good for getting the big-picture information on the model. Because it is big-picture info, things like the scale of the scores and the classification threshold don't matter.
One problem with using AUC as a metric is that the scale invariance is not always desirable. For example, sometimes we really do need well-calibrated probability outputs, and AUC won't tell us about that.
Classification-threshold invariance is not always desirable either. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase in false negatives). AUC isn't a useful metric for this type of optimization.
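The "randomly chosen positive vs. randomly chosen negative" definition from above can be computed directly; a sketch with made-up scores:

```python
# AUC via its probabilistic definition: the chance that a random positive
# example gets a higher score than a random negative example.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]

positives = [s for s, l in zip(scores, labels) if l == 1]
negatives = [s for s, l in zip(scores, labels) if l == 0]

wins = ties = 0
for p in positives:
    for n in negatives:
        if p > n:
            wins += 1
        elif p == n:
            ties += 1

auc = (wins + 0.5 * ties) / (len(positives) * len(negatives))
print(auc)  # 5 of 6 pairs ranked correctly -> about 0.833
```

Note that only the ordering of scores matters here, which is exactly the scale invariance described above.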
F1 Score
OK, to describe an F1 score we need to cover some new vocabulary. We have already covered sensitivity and specificity; now we will add precision and recall. As a refresher, specificity is: given a negative example, what is the probability of a negative test result? The easiest one is that sensitivity = recall. I don't know why they needed two words for the same thing, but either word means: given a positive example, what is the probability of a positive test result? Precision flips that around: given a positive test result, what is the probability that the example really is positive?
F1 Score is used to measure a test's accuracy. It is the average (the harmonic mean, to be specific) of precision and recall, and its range is between 0 and 1. It tells you how precise your classifier is (how many of its positive calls are correct), as well as how robust it is (it does not miss a significant number of instances).
High precision but lower recall gives you a classifier that is extremely accurate when it does flag something, but misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of the model.
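Putting precision, recall, and F1 together from the confusion-matrix counts; the counts are hypothetical.

```python
# Precision, recall, and their harmonic mean (F1).
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # of everything flagged positive, how much was right
recall    = tp / (tp + fn)  # of all actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.75 0.6 0.667
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are decent, which is the whole point of using it over a plain average.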
Performance Metrics for Regression Problems
Mean Absolute Error
Mean Absolute Error (MAE) is the average of the absolute difference between the original values and the predicted values. It gives us a measure of how far the predictions were from the actual output. However, it doesn't give us any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data.
Mean Squared Error
Mean Squared Error (MSE) is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the square of the difference between the original values and the predicted values. One advantage of MSE is that its gradient is easier to work with, whereas the absolute value in MAE makes its gradient discontinuous at zero. Since we take the square of the error, the effect of larger errors becomes more pronounced than smaller errors, so the model can focus more on the larger errors.
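A side-by-side sketch of both metrics on the same made-up predictions, showing how MSE amplifies the one large miss:

```python
# MAE vs. MSE on the same errors.
actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 7.5]

errors = [a - p for a, p in zip(actual, predicted)]  # 0.5, 0.0, -2.0, -0.5
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e * e for e in errors) / len(errors)

print(mae, mse)  # 0.75 1.125
```

The single 2.0-unit miss contributes 2.0 out of 3.0 to the MAE sum, but 4.0 out of 4.5 to the MSE sum, which is why MSE-trained models chase down the big errors first.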
Metrics for Ranking models
Best Predicted vs Human, BPH:
The most relevant item is taken from an algorithm-generated ranking and then compared to a human-generated ranking. The result is a comparison that shows the difference between the algorithm's and a human's estimations.
The problem with using this method is that when teams need human rankings, a lot of the time they use Mechanical Turk or internal employees. I can pretty much guarantee that the people working on Mechanical Turk are not your target audience, and neither are your co-workers. So if you are the UX person, make sure it is people from the target audience ranking their choices, and not a developer who is already familiar with the possible answers.
Kendall's Tau Coefficient
If you are comparing the whole ranked list instead of just the top item, then Kendall's tau coefficient shows the correlation between the two lists of ranked items, based on the number of similar and dissimilar pairs in a pairwise comparison. In each case we have two ranks (machine and human prediction). First, the ranked items are turned into pairwise comparisons between every two items. A concordant pair means the algorithm's rank order agrees with the human's rank order; otherwise it is a discordant, or dissimilar, pair.
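The concordant/discordant counting can be sketched directly; the machine and human rankings below are made up.

```python
# Kendall's tau from concordant vs. discordant pairs between two rankings.
machine = [1, 2, 3, 4, 5]  # rank the algorithm gave each item
human   = [1, 3, 2, 4, 5]  # rank a human gave the same items

n = len(machine)
concordant = discordant = 0
for i in range(n):
    for j in range(i + 1, n):
        # Same sign on both rank differences means the pair is concordant.
        agree = (machine[i] - machine[j]) * (human[i] - human[j])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1

tau = (concordant - discordant) / (n * (n - 1) / 2)
print(tau)  # 9 concordant, 1 discordant out of 10 pairs -> 0.8
```

A tau of 1 means the two rankings agree exactly, -1 means they are exact opposites, and values near 0 mean the machine ranking tells you little about the human one.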
Let's shift gears to business metrics. These are the things the PM needs to worry about. Since most models are built in a research setting, the developers have the most say in the metrics used to judge whether a model is "good". As machine learning matures and more processes are put in place, the business metrics will mature too. Until then, here is what I have seen used by the PMs I have worked with at different companies. Feel free to let me know if there are others you have used.
The metrics I will be covering are:
Ability to productize the model
Faster iteration times
Smaller, more efficient models
Ability to productize the model
Once a model is created there is a good chance it can be used for more than just its original purpose. I've previously talked about the data ladder. Basically, a better model leads to new info being able to be collected, which leads to a better model. That is the ladder. It should be part of your plan for which models are needed to get new data sources.
This metric ranks the model structure on how adaptable it is: not just to the current step in the ladder, but how useful the model will also be for future rungs. Basically, will it improve with added data streams, or is this model a one-off that only works for the current rung? Favoring models that are adaptable will help you climb the ladder faster and get ahead of the competition.
Faster iteration times
How fast can you put out new minor versions of a model? How long does a major version take? Models that can be turned around faster can be improved faster. Faster improvement means more accuracy, more frequent user testing, and better alignment with what your customers want.
Smaller, more efficient models
It is amazing how accurate models can get given enough data and hardware power. The problem is that cruft can start to build up, just like in any other software. Smaller models require less hardware for training and serving. Since machine learning hardware is pretty much all cutting edge, it is expensive compared to other cloud services. This is a cost-saving metric, but reducing complexity also enables faster iteration and creates a better user experience.
Since machine learning is still mainly research, the majority of models created never see the light of day. This is another reason to avoid building up cruft in the model. Optimizing models for production helps keep the development cycle lean.
Now on to UX metrics. The warning for creating a good user experience is to think long term. Be extra careful with the metrics you implement; the law of unintended consequences can be harsh with AI models. Most developers will optimize their models for some form of accuracy or precision. To lower the chances of unintended consequences, add UX metrics. The ones I use are:
Customer happiness
Fairness
Model regression tests
Faster iteration times
Customer happiness
Customer happiness helps you keep long-term customers. Choosing between long-term and short-term customer satisfaction depends on two numbers for your company: compare the customer acquisition cost (CAC) and the customer lifetime value (CLV). The higher both of these numbers are, the more important it is to focus on long-term happiness. That also means smaller perceived changes for the user.
Fairness
There are a few common ways to measure fairness.
"Disparate impact" is the ratio of the fraction of positive predictions between the two groups. Members of each group should be selected at the same rate.
The "performance difference/ratio" is calculating all the standard performance metrics, like false-positive rate, accuracy, precision, or recall (equal opportunity), for the privileged and unprivileged groups separately, and looking at the difference or ratio of those values.
An "entropy-based metric" is the generalized entropy for each group, calculated separately and then compared. This method can be used to measure fairness not only at the group level but also at the individual level. The most commonly used flavor of generalized entropy is the Theil index, originally used to measure income inequality.
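The disparate impact ratio mentioned above is the simplest of these to sketch; the predictions per group below are hypothetical.

```python
# Disparate impact: ratio of positive-prediction rates between two groups.
group_a_preds = [1, 1, 0, 1, 0]  # e.g. the privileged group
group_b_preds = [1, 0, 0, 0, 1]  # e.g. the unprivileged group

rate_a = sum(group_a_preds) / len(group_a_preds)  # 0.6
rate_b = sum(group_b_preds) / len(group_b_preds)  # 0.4
disparate_impact = rate_b / rate_a

print(round(disparate_impact, 3))  # about 0.667 -- group B selected less often
```

A ratio of 1.0 means both groups are selected at the same rate; a common rule of thumb (the "four-fifths rule" from US employment law) flags ratios below 0.8 as a potential disparity.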
So which one should you choose? Be empathetic toward your users, think about how they would measure fairness, and find a metric that reflects that.
Model regression tests
With so much of model building being research based, the code base usually doesn't last long. The problem is that once it is time to move on to the next version, it is just as likely the team starts over with a new code base to try out new ideas. Problems that were previously solved can easily creep back in, all in the search for new, better accuracy.
This metric helps the team constantly keep track of previous problems, how they were solved, and to merge that code into new ideas so that old problems don't need to be solved again. Otherwise you are trying to provide a good user experience on constantly shifting ground, because you are never sure what was optimized for in this version of the model.
Faster iteration times
I covered this one in the business metrics, but it is also a big deal for better UX. The reality is that the faster a model can iterate, the more it can be tested with users. More tests with users means more feedback on what the users want.
Models that can be iterated quickly can also test new ideas faster. Rapid prototyping helps in testing more ideas. Anyone doing user tests knows how easy it is to get completely unexpected answers. The ability to quickly pivot helps you align with what users want.
And on that note: what metrics have caused unintended consequences for the models you have helped to build? Have you found any metrics that have helped create better models?
That's all we have for this episode, but I would love to hear back from you on which metrics you use for your models. Use your phone to record a voice memo and send it in.
That is also an awesome way to let me know what you like and would like to hear more of, or if you have questions or comments, record a message for those too.
If you would like to see what I am up to, you can find me on Twitter at @DesignForAI
Thank you again
and remember, with how powerful AI is,
let's design it to be usable for everyone