The extra UX headaches for AI in production

It used to be that UX would work with the developers to create a good experience, dev would build it, and QA would verify it. After that the product shipped and the team moved on to new features. This still works for non-AI products. The current process of developing machine learning throws a wrench into that when things go into production (known as inference for ML models). ML is still close enough to the cutting edge that most developers are on the research side of creating models, meaning they care more about getting model quality up than about speed, so the models that do get created work well but are slow. There are a lot of ways to get the speed up for production, but each comes with trade-offs.


The most obvious solution is to optimize the code for the current model. A lot of companies have a separate engineer who specializes in productionizing models, which is a different skill than building them. They can strip out inefficient parts of the model to simplify it while affecting quality as little as possible, and the result can be verified before production.


This is where the slow, large, accurate model is used to train a faster, simpler one, a technique often called distillation. A smaller model is created without all the specialized code, and a much larger training set can be used since it does not need to be hand labeled: the large, slow model tags unlabeled training data instead. The larger training set hopefully allows the smaller model to hit a similar quality measure. This is easy to catch if QA tracks model versions the same way as code revisions.
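As a rough sketch of the idea (everything here is a toy stand-in, not a real model pipeline): an expensive "teacher" pseudo-labels a large pool of unlabeled data, a much simpler "student" is trained on those labels, and QA compares the two on held-out data.

```python
import random

# Hypothetical stand-ins: a slow, accurate "teacher" and a fast "student".
def slow_teacher(x):
    # Imagine an expensive ensemble; here, a simple exact rule.
    return 1 if x > 0.5 else 0

# Step 1: the teacher labels a large pool of cheap, unlabeled data.
random.seed(0)
unlabeled = [random.random() for _ in range(10_000)]
pseudo_labeled = [(x, slow_teacher(x)) for x in unlabeled]

# Step 2: train a much simpler "student" (a single threshold) on those labels.
def train_student(data):
    best_t, best_acc = 0.0, 0.0
    for t in [i / 100 for i in range(101)]:
        acc = sum((x > t) == bool(y) for x, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

threshold = train_student(pseudo_labeled)

# Step 3: QA compares student vs teacher on held-out data,
# the same way a code revision would be verified.
holdout = [random.random() for _ in range(1_000)]
agreement = sum((x > threshold) == bool(slow_teacher(x)) for x in holdout) / len(holdout)
```

The number QA cares about is the agreement on held-out data: that is the quality measure to track per model version.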


This one is tricky. Up to this point, one of the general assumptions of software is that once it is compiled it will run the same no matter what hardware you put it on. This is not the case for ML models: the hardware can affect the accuracy of the model. More expensive hardware means more tensor cores (in the case of GPUs), which means more math can be done, which means more accurate answers. In a large enough deployment, the number of GPUs (or TPUs, if you are using Google for serving) can affect the quality of answers, so the servers all need to be live with the software on the serving hardware before you can be sure the experience you tested is the experience users will actually get.


One of the shortcuts that cuts down on the math needed is called quantization. The idea is to take big numbers that take a lot of space (for example, floating points like 3.12345) and shorten them (to 3.1, or even just 3). It speeds things up, but again the experience needs to be verified after the change. Also, as a heads up: quantizing models makes quality even more finicky depending on the hardware, and the hardware will determine what kind of quantization can be done and how much.
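A toy sketch of the idea in plain Python (real toolchains do this per layer with calibration data, so treat this as illustration only): map floats onto 8-bit integers and measure the round-trip error.

```python
# Toy quantization: map floats onto 8-bit integers and back,
# trading precision for smaller, faster math.
def quantize(values, bits=8):
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0
    q = [round((v - lo) / scale) for v in values]   # stored as small ints
    return q, scale, lo

def dequantize(q, scale, lo):
    return [i * scale + lo for i in q]

weights = [3.12345, -1.98765, 0.5, 2.71828]
q, scale, lo = quantize(weights)
restored = dequantize(q, scale, lo)

# The round trip is close but not exact: this is the quality hit
# that QA must re-verify after quantizing.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Fewer bits mean a larger `scale`, and therefore a larger worst-case error, which is why the amount of quantization the hardware supports feeds directly back into the user experience.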

3 steps to ensure AI doesn’t make your product worse

As AI gets better it has started to become a differentiator between the products offered by different companies, and there is a push to put it into everything. The idea is so pervasive it has led to a whole genre of memes using AI as the solution to anything. So where can it be used to help the software that you are working on?

Does the user change behavior?

A good place to start is making sure we are not making the experience worse. While it is hard to generalize to every product in this article, there is a good chance you are collecting some analytics to build up profiles of your users and their expected behavior. That makes it easy to check whether users change their behavior when AI is added to do things for them. This is a good opportunity for an A/B test.

A good example of this is when voice-to-text first came out. The user needed to over-enunciate to get the computer to recognize the words. While there were a lot of early adopters happy to change their speech patterns, it is not something that caught on with the general population. If you can detect those kinds of changes in behavior, and the level at which they are happening during user testing, it can help predict the level of friction a new feature on the product will face on the way to adoption.

Does the user know where they are?

The next thing to check, to make sure AI isn’t making the situation worse, is situation awareness. The person using your product should know where they are in the navigation at all times. A big fallacy of integrating AI into a product is having it silently start trying to do things for the user. An example of this was an experiment by Microsoft to change the UI based on the user’s detected mood (a simpler mobile UI when they are on the move and distracted). It didn’t work until they communicated the mode change.

The AI customizing things for the user can be helpful but only if the user knows what is being changed. Automatically changing settings or moving the location of the user within the navigation without explicitly telling them what is going on will create distrust of the product and confusion for the user.

Remind the user of the good job you did

The third thing may seem counter-intuitive: when everything is running smoothly, there need to be reminders of the good things happening. This stems from human nature. Since the primitive areas of the brain habitually focus on the bad things that need fixing, there need to be reminders of the mundane background stuff going right, especially at the end of the usage experience.

Forrester research found that reminding a person at the end of a flight that everything went smoothly resulted in a higher satisfaction rating. AI integration can make a whole lot of things just work so that people don’t need to think about them, and they won’t. The two big things to accomplish here are: first, set expectations at the beginning, so the fledgling AI systems in use today don’t oversell their abilities; then, at the end of a session with your product, a little message to say thank you and give a quick synopsis of what was done for the user is all that is needed.

AI specific issues while user testing

Machine learning is one of the strategies many companies are using as they try to improve and differentiate their product by augmenting the user’s experience. Building artificial intelligence into apps can improve what can be done for customers and users, but can also affect how customers and users interact with the machine. In this article I cover specific issues for you, as the designer, to look out for when testing user scenarios, preparing for user testing, and running user testing. I’ve previously covered how choosing an algorithm affects the user’s experience, including the research that is needed before design begins. The information from that article will help when doing user testing too.


Even before doing user testing you want to make sure all of the algorithms are working. If you can hand off UX requirements to the QA testing group, that’s helpful. But I’ve found that a lot of the time the QA group is also adjusting to the new reality of working with AI, and a lot of the time, as long as they can verify that the app gives a reasonable response, it passes. However, AI can give many responses that make sense but do not help the user achieve their goals.

So is the AI helping? I’ve found the easiest way to find out is an A/B test of AI vs. no AI. The problem is that once the software has been written it is hard to disable the AI and have the software still work. A good way around this is to have the developers set up an alternative algorithm that just takes all the data and returns the mean. If the AI cannot perform better against the stated goals than averaged data, it is back to the drawing board before user testing.
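A minimal sketch of that sanity check (the data and the "model" here are hypothetical stand-ins): compare the model’s mean squared error against a baseline that always predicts the mean of the training targets.

```python
# A dumb baseline that always predicts the mean of the training targets.
def mean_baseline(train_targets):
    mean = sum(train_targets) / len(train_targets)
    return lambda _x: mean

# Mean squared error of a predictor over (x, y) pairs.
def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data: y is roughly 2*x with alternating +/-1 noise baked in.
train = [(x, 2 * x + (-1) ** x) for x in range(100)]
test = [(x, 2 * x + (-1) ** x) for x in range(100, 120)]

baseline = mean_baseline([y for _, y in train])
model = lambda x: 2 * x          # stand-in for the real ML model

assert mse(model, test) < mse(baseline, test), "back to the drawing board"
```

If the final assertion fails for your real model on your real data, the AI is not yet earning its keep.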


As soon as there is anything usable it is best to test with users to verify the AI is giving them what is important to them. Even after choosing a specific algorithm, developers can modify it using human-imposed variables called hyperparameters. These settings tweak how the machine learning algorithm performs: things like how many layers of neurons there are, or the number of neurons in a layer. The important thing to remember is that higher accuracy is usually associated with higher processor requirements.

So what problems can you discover? Let’s start with an example: during user testing, all the users are getting recommendations from the AI that are too similar, or the AI is always making guesses that are too close to each other. This could be caused by underfitting. The best way to explain underfitting is with an X-Y graph. In this example we are only using two lists of measurements/data, called features. Of course what you are working on will have more features, but the idea is the same. Figure 1 shows a hypothetical AI giving a perfect recommendation line, ignoring all noise.

Figure 1. Best case scenario

Underfitting is ignoring or over-generalizing the data during training. It treats everything as noise.

Figure 2. Underfitting

The solutions are to train on more data (or data closer to what the users are using), add more features (measure more things), or create a more complex model (add more neurons or layers). Developers will have to know which hyperparameters make the most sense to change in your situation, and there is a good chance they will change all of them to see which ones create the best results. Your part is knowing this can be a problem, so you can tell them 1) where the data needs to be sampled, and 2) where the answers need to be right.
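A toy numeric version of the symptom (hypothetical data, pure Python): a one-parameter model has no capacity for the trend, so every answer comes out the same, while a model with enough capacity fits it exactly.

```python
data = [(x, 3 * x + 1) for x in range(20)]  # a clear linear trend

# Underfit model: one parameter, the mean. Every answer is identical,
# i.e. the "too similar recommendations" symptom.
mean_y = sum(y for _, y in data) / len(data)
underfit = lambda x: mean_y

# A model with enough capacity: fit slope and intercept by least squares.
n = len(data)
sx = sum(x for x, _ in data); sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data); sxy = sum(x * y for x, y in data)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
fit = lambda x: slope * x + intercept

# Mean squared error over the data: large for the underfit model, zero for the fit.
def mse(model):
    return sum((model(x) - y) ** 2 for x, y in data) / n
```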

The opposite problem is overfitting: trying to fit every point on the graph. The algorithm works great when all the points are part of the training data, but the AI can’t generalize when it sees something new. This can happen when no cross-validation is done (splitting out some data to test the AI on instead of using it all for training). Sometimes cross-validation can’t be done, but sometimes it is simply forgotten on the list of things to do.
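For illustration, the hold-out split itself is only a few lines (a sketch; real projects would use a library helper): shuffle first so the split is random, then carve off a test fraction the model never trains on.

```python
import random

# A minimal hold-out split, the simplest form of cross-validation: keep
# some data the model never trains on, so overfitting shows up as a gap
# between training accuracy and test accuracy.
def train_test_split(data, test_fraction=0.2, seed=0):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)   # random split, not "newest data last"
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
```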

Figure 3. Overfitting

Overfitting shows up during user testing as the edge cases being wildly off, so there is actually a better chance of catching it when validating the prototype. There are the basic remedies of removing layers or neurons, but another solution is something called regularization.

Regularization, as the term is used loosely here, is a fix for the problem machine learning has when some features are small numbers and some are large: the algorithm tends to forget the small numbers. (Developers often call this rescaling normalization or feature scaling.) For example, in a home-buying AI with number of rooms and price as its features, since the price is a much larger number, the number of rooms won’t matter. The fix is to make all the data ranges look similar, for example pricing the homes in units of $100k (e.g. 2.1 instead of $210,000). For natural language processing, the analogous step strips off the verb endings so working, worked, and work all count as the same word.
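A sketch of the rescaling described above, often called standardization or feature scaling (the numbers are made up): after it, rooms and price live on comparable ranges, so neither swamps the other.

```python
# Standardize a column of numbers to mean 0 and standard deviation 1.
def standardize(column):
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = var ** 0.5 or 1.0
    return [(v - mean) / std for v in column]

rooms = [2, 3, 4, 5]
prices = [210_000, 340_000, 455_000, 600_000]

# After scaling, both features have comparable magnitudes,
# so the rooms feature is no longer drowned out by the price.
scaled_rooms = standardize(rooms)
scaled_prices = standardize(prices)
```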

Usually this works without your intervention, unless the thing the user is interested in relates to something that was regularized away. While this can pop up in user testing, it is easier to catch by talking with the developers to see what data they have regularized and making sure it does not conflict with the users’ goals.


Bias is a big enough problem that it needs its own section. As smart as machine learning gets, when a bias is detected it can remind you just how dumb AI still is. First, I will cover statistical bias. This is when the “bias” is part of the error term: the model is systematically off from the true model. This is something that should be designed out; hopefully, with your help, the developers never see this problem. Using the user stories and goals, take into account the variance and bias from reality. Variance is how far from the average the numbers can get. Think of a shotgun pattern on a wall: the further back you stand when shooting, the more the dots spread out, which is higher variance. The tolerance for a high or low variance depends on what the user goals are.

Also know that variance interacts with bias. If variance is how spread out the answers are, bias is how far off the center of the target is. So, based on the user goals (say consistency is more important than accuracy), you might actually want a high bias and a low variance: all the answers close together, a predictable distance from reality.

The second type of bias is biased training data, which is when there is a problem with how the data was gathered. The effect was first noticed in 1964 (in a story that might be apocryphal) when researchers tried to use an AI to detect tanks in images. Testing on the training data went well, but it could not handle new photos. It turned out all of the tank photos were taken on sunny days and the non-tank photos on cloudy days, so the AI had gotten good at detecting the weather in the photo instead of finding tanks.

Another example is using an AI to decide who to keep in the hospital with extra care and who to send home. The AI kept recommending that people with asthma be sent home, which went against obvious medical knowledge. It turned out people with asthma were getting extra attention from the doctors because of the extra risk; that extra attention was not factored into the training data, leading the AI to a bias that did not match reality.

Detecting biased training data is something you will need to look for on both ends. During the design phase, sit down with the developers to make sure the training data parallels the user goals, actions, and stories. Also make sure the training and testing data are split randomly. For example, you don’t want stock market data trained on everything from last year and tested on the last two weeks.

Once the algorithm is ready make sure you are covering all of your persona types when doing your user testing. Also, this is a good place to find the subject matter experts (SMEs) and go through heuristic evaluations with them. Like the doctors in the previous example they can say when recommendations do not match up with reality.

A subcategory of biased training data is data normalization. As you know, users will put in fake data if collection is forced on the way to their goal. A good example is known as the Schenectady problem: the zip code 12345 belongs to the GE factory in Schenectady, New York, so in unnormalized data the number of people who appear to live inside a factory is unusually high. Other anomalies include the number of people sharing the birthday January 1, 1900, and phone numbers that start with 555. For user surveys it is easiest to design in catch questions. But if you are working with already-collected data, the algorithms used to normalize it differ by domain, so the main things to verify are 1) that the data is normalized before AI training, and 2) that if your user groups and personas do have “out of the norm” peculiarities, they are not normalized away out of the data.
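As a sketch (the sentinel values and record fields here are illustrative, not a complete production list), a pre-training pass can flag the classic fakes called out above:

```python
# Illustrative sentinel values for fake-data detection; a real list would
# be built from your own data's anomalies.
SENTINELS = {
    "zip": {"12345", "00000", "99999"},
    "birthday": {"1900-01-01"},
}

def looks_fake(record):
    if record.get("zip") in SENTINELS["zip"]:
        return True
    if record.get("birthday") in SENTINELS["birthday"]:
        return True
    phone = record.get("phone", "")
    return phone[3:6] == "555"   # e.g. a 555-prefixed exchange

records = [
    {"zip": "90210", "birthday": "1984-06-02", "phone": "3105551234"},
    {"zip": "12345", "birthday": "1979-01-15", "phone": "5182221234"},
    {"zip": "60614", "birthday": "1900-01-01", "phone": "7732221234"},
    {"zip": "60601", "birthday": "1992-03-09", "phone": "3122221234"},
]
clean = [r for r in records if not looks_fake(r)]
```

The flagged records should be reviewed, not silently dropped, so that genuinely unusual (but real) users are not normalized away.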

The last type of bias to cover is social bias. This is when you know the data goes against the company values. All data collected is from the past, and from people acting on what they learned in the past, so things like racism and bigotry will show up in real data sets and need to be adjusted for. The biggest problem is remembering there is a problem, since most development is done within one bubble or another. As the designer it is something to check for: 1) during design, 2) when verifying the training and testing data, 3) by making sure the personas cover race, culture, personal and group identity, and gender, and 4) by testing against those persona groups.

Social bias gets split up into two groups (and I’m quoting directly from Kate Crawford’s NIPS 2017 keynote). Harms of allocation covers discrimination in the product or service, for example approving a mortgage, granting parole, or deciding insurance rates. The second is harms of representation, which covers the social inequalities and stereotypes we don’t want to perpetuate.

There are plenty of examples of companies getting embarrassed by this. Google, being a leader in machine learning, has run into this problem a lot. There was the time their image recognition app was tagging photos of black people as gorillas. The problem was that there was not enough racial diversity among the developers working on the project, so when they were testing with their own pictures they never saw the problem (biased training data again).

Google also had a problem with their search recommendations. Since the recommendation lists are built up from what other people typed in, racist recommendations used to pop up when the search words involved minorities. They fixed this by being aware of the problem, detecting the racist searches, and moving them to a separate list from the training data: that way, when a racist query shows up in the test data it can still be acknowledged, but it does not count against the accuracy of the algorithm and does not show up as a suggestion.

Other examples include a profiling algorithm being less likely to recommend high-paying jobs to women (story by Prachi Patel), and searches for historically black names being more likely to show ads for prison background checks (research by L. Sweeney). Or, when doing language translation, translating ungendered sentences about a doctor and a nurse will make the doctor male and the nurse female. Even the tools used by developers, like word2vec (a tool that categorizes words by the other words they are most likely to be used with), are more likely to associate male-associated words with brilliant, genius, commit, and firepower, while female-associated words land near babe, sassy, sewing, and homemaker.


This is just scratching the surface of the problems you can run into. There are so many different areas of AI that I can only cover some of them, not to mention that everyone is working on a customized version of an algorithm for their own specific needs. If you come across a design problem or solution with AI, feel free to contact me and let me know, so I can help or get the word out about good solutions.

How algorithm choice affects UX

Machine learning is one of the strategies many companies are using as they try to improve and differentiate their product by augmenting the user’s experience. Building artificial intelligence into apps can improve what can be done for customers and users, but can also affect how customers and users interact with the machine. In this article I cover how you, as the designer, can help improve the decision process, even when it is the engineers who decide which machine learning algorithm to use.


Like every other product, you must first know who your competition is. A competitive analysis will tell you what other companies are also working in this space. Since machine learning is on the bleeding edge, you will also want to check research papers being published as competition: what is a research paper or conference topic right now turns into a product very quickly. The people presenting these papers are being hired by companies directly, a cost-saving alternative to buying up the startup that person would otherwise found upon graduation.

Knowing the competitive offerings allows you to push back if the development team is pushing for an uncompetitive path. I have known a billion-dollar retailer to go with an older, less accurate rule-based language processor instead of the deep learning language processors that have become the standard over the last few years. The developers recommended what they were familiar with, and there was no push back from the product manager or UX because they were not familiar with the space. Not surprisingly, the product is stumbling and having a hard time competing.

The next area to look at is the users. The first question to ask is: is AI necessary? If you are still reading, I am guessing it is. Still, make sure you have some base set of users tested without an AI, so that in the off chance the AI tests as taking longer or lowering user satisfaction, you can show the difference and the AI can be improved or scrapped.

Also, because this area is changing so fast, a lot of the time the users being interviewed and tested will be part of the early adopter group. They are happy to put up with a lot of annoyances that the general population you are trying to expand to will not tolerate. For new interfaces, like AI, there are not a lot of best practices and heuristics to fall back on, so one of the most basic heuristics applies:

“The less the user needs to change their behavior and understanding, the better”.

Even early adopters are happier when they don’t need to learn something new, so to minimize the changes required for the user to adapt to the software, you can’t overstate the importance of building up accurate personas and journey maps. Once you know what the user’s goal is and what is influencing their decisions to reach that goal then you can help the engineers with the algorithm choice.


Now that you know your competitors and users, how does that affect the algorithm choice? Algorithms for machine learning are split into two main groups: ‘deep learning’ and ‘shallow algorithms’. Deep learning works from tons of data, so even after it is optimized it can be much slower than any number of specialized algorithms. If it must be smart, then slow might be OK. However, if it is just an improvement of one step in a process being done by millions of customers a day, then speed might be more important and a specialized shallow learning algorithm might work better. That being said, if the algorithm is spitting out useless info, you might need some videos of users cursing at their screens to convince product managers it is worth spending the extra money to allow for deep learning.

Within the shallow learning algorithms there are three groups: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is what is used most of the time. The training dataset is one you can split into different groups, and there needs to be some set of data you already know the answers to, so the different algorithms can be judged on how accurate they are at guessing the right answer. Unsupervised learning is for when you don’t know what is in the dataset; most of the time it just groups the dataset into sets of similar items. Reinforcement learning tries to maximize or minimize something.

Deep learning is similar to reinforcement learning in that a goal is trying to be reached. The difference is that, with the extra data, deep learning can make predictions about the outcome. Within deep learning there are two groups. Rule-based learning is what it sounds like: rules are input into the machine before it is trained on the data. This is the style DeepMind combined with deep learning when AlphaGo beat the Go champion, starting from the rules of the game. Reactionary learning is closer to instinct, relying more on the machine to come up with its own reactions; Google depends more on this style.

Speaking of how accurate the algorithm is: does it need to be mostly accurate all of the time? Or can it be very accurate most of the time, but when it is wrong, really wrong? The algorithms can be tuned differently for those two different circumstances.

One thing to make sure of is that there is no rush to use a specific learning algorithm. Even for specialized tasks, like making recommendations, there are multiple algorithms that work better in different situations. So make sure to document the situation and influences of the user journey as much as possible.

When choosing an algorithm, there are two things that need to be measured. The first is precision: when the algorithm makes a guess, how often is it right? The second is recall (also called sensitivity): out of the total items you want found, how many did it find? Depending on your user’s goals, either one can be more important than the other. For example, if you are detecting spam it is better to have high precision and lower recall, since it is better that a few spam messages end up in the mailbox than that a real email ends up in the spam folder. But if you are detecting a disease from a medical scan, it is better to have lower precision and high recall, because it is better that a few people need to be retested than that someone with the disease slips through. The F-beta score gives a single number that is basically a slider between precision and recall (F2, for example, weights recall more heavily). This is what you will want to define for the algorithms to achieve.
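The three measurements are simple to compute from raw confusion counts; the counts below are hypothetical. The F-beta score generalizes F1: beta greater than 1 slides toward recall (the medical-scan case), beta less than 1 toward precision (the spam case).

```python
# Precision, recall, and F-beta from raw confusion counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta):
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Hypothetical spam filter: 90 true spam caught, 10 real mails flagged,
# 30 spam messages missed.
p = precision(tp=90, fp=10)           # 0.9
r = recall(tp=90, fn=30)              # 0.75
balanced = f_beta(p, r, beta=1)       # plain F1
recall_heavy = f_beta(p, r, beta=2)   # recall weighted 4x as much
```

Because this filter's recall is worse than its precision, the recall-heavy F2 score comes out lower than F1: the score you choose to optimize encodes which mistake your users can live with.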


There is a good chance the developers will come up with a few algorithms that kind of work. A lot of AI is still at the point of throwing stuff at the wall and seeing what sticks. They might come to you with multiple algorithms to compare.

A normal machine learning process has three steps. The algorithm takes some data (with known answers) and predicts an outcome (without access to those answers). It then compares the prediction to the answers and measures the error/loss. From that measurement it ‘learns’ by adapting itself, and then goes through the whole process over and over. It would seem the easiest way to choose an algorithm is to compare the grades they give themselves. The problem is that the grading process inside each algorithm is customized, so the numbers are not comparable.
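The three-step loop can be sketched in a few lines of Python as bare-bones gradient descent on a hypothetical one-parameter model y = w * x (where the true w is 2):

```python
# Made-up training data where the true relationship is y = 2 * x.
data = [(x, 2.0 * x) for x in range(1, 6)]
w = 0.0
learning_rate = 0.01

for _ in range(200):
    # 1. predict (w * x), 2. measure the error, 3. adapt w
    # using the gradient of the mean squared error.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad

# After training, the loss the model "grades itself" with is near zero.
loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
```

The final `loss` is this algorithm's internal grade; a different algorithm would compute its grade differently, which is exactly why such grades cannot be compared across algorithms.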

The best way to test algorithms against each other is a ‘confusion matrix’. Think of it as the final exam at the end of the term for the algorithms. (It does not need to be big; usually the data is split 60% training, 20% validation, and 20% testing.) The normal data sets are split between training and validating. But just as students can learn to study for the test instead of learning the material, algorithms can learn to pass only the validation data set. Using a final set of data the algorithms never see during training makes sure this does not happen. It also lets you compare which of the algorithms best achieves your goals, since they are all tested on the same data and you can specify the requirements to measure.
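A sketch of the 60/20/20 split plus a confusion matrix computed only on the final test set (binary labels, made-up data):

```python
import random

# Split data 60% training, 20% validation, 20% final testing.
def split_60_20_20(data, seed=0):
    d = data[:]
    random.Random(seed).shuffle(d)
    n = len(d)
    return d[: int(n * 0.6)], d[int(n * 0.6): int(n * 0.8)], d[int(n * 0.8):]

# Confusion matrix over (actual, predicted) pairs for binary labels:
# returns counts keyed by (actual, predicted).
def confusion_matrix(pairs):
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for actual, predicted in pairs:
        counts[(actual, predicted)] += 1
    return counts

# Hypothetical (actual, predicted) pairs; here the model happens to be perfect.
data = [(i % 2, i % 2) for i in range(100)]
train, validate, test = split_60_20_20(data)
cm = confusion_matrix(test)   # graded on held-out data only
```

Because every algorithm is graded on the same never-seen test set, the matrices are directly comparable, unlike the algorithms' internal loss numbers.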

Remember, this is the testing the developers will be (or should be) doing. Their goal is to get as close as possible to a theoretical point. That point is what you will be pushing them toward, and it should be the goals you discovered the users are trying to reach.


This is just scratching the surface of the problems you can run into. There are so many different areas of AI that I can only cover some of them, not to mention that everyone is working on a customized version of an algorithm for their own specific needs. If you come across a design problem or solution with AI, feel free to contact me and let me know, so I can help or get the word out about good solutions.
