Here is the scenario for this episode:
The boss gives you access to the company’s data and asks you to come up with a model that uses it. With all this data, it’s got to be good for building something the users will use, right?
You buckle down, work with data scientists, make a lot of tweaks to the data, and come up with something, but no matter how much you advertise it, no one wants to use it. Back to the drawing board.
This time you find out what the users actually want, make more tweaks to the data, and get a model that is accurate. People love it, tons of users flood in and flood the server. The servers crash because the model is too large. The IT folks say they can fix it and bring in a bunch of new hardware. It all seems to be going fine until you notice every review of your app laughs at how inaccurate it is. This can’t be, it’s the same model, just running on different hardware, right?
Let’s make sure this doesn’t ever happen.
Today we are covering the development cycle for AI
This podcast is called design for AI
It is here to help define the space where Machine learning intersects with UX. Where we talk to experts and discuss topics around designing a better AI.
music is by Rolemusic
I’m your host, Mark Bailey
Let’s get started
Machine learning up to this point has been more on the research side.
So much so that it really doesn’t fit into the normal software development cycle.
There are all these gotchas that keep it from fitting into the normal cyclic agile sprints most people are used to.
This affects getting good design in. A big part of keeping UX design from slowing down the software development cycle is having a regular process, so UX can run in parallel to development. That is still possible with machine learning development; the cycle just looks a little different.
The normal software development process is building a machine. It’s a really complicated machine, but in development terms it is still deterministic, so development is done by writing to the test cases.
For the updated process, instead of a machine, think of it like you are hiring an employee. There are six stages to hiring an employee.
Step 1: “Plan”
Let’s start with the plan. Before even thinking about machine learning, collect data. Not just analytics data, user data. This is normal UX research. Is machine learning even necessary? Remember, AI is not a fortune teller. Aim for problems that are possible now but would take many hours for many people to solve. If a person can’t perform the task, then neither can an AI.
For the people side of UX research, visit users where they are, whether in the office, the car, or at lunch, to watch real tasks. Bring artifacts if they can’t be visited. Do not talk down to users; ask them to explain things. Write down quotes instead of opinions, take pictures, ask open-ended questions. Do not ask them to design. Do not ask them to predict the future; people are bad at that. Do not write down solutions or bug fixes, and do not teach, no matter how much you think you can help. Instead ask: Can you tell me more? Can you explain X to me? Do they have questions for you?
All of these are important to learn the user journeys and to find the user’s true goal. You’ll use these as part of the data design.
As part of the UX, this is also the data to use to build the personas and map out the user journeys.
Step 2: “Job Posting”
Purpose & Design of the model
Set your goal
Take your users’ journeys and goals and work with the data scientists to line them up with the data points you have available. What data do you have? Don’t look at the data you have and then design a product around it; that leads to a product that management wants instead of what the users want. Design for what is needed; then find the data sources. There is a pretty good chance you will need to merge different data points to get to the data point you really want to know.
Information quality matters. Determine what the algorithm needs to know. Use representative and complete data. Design in enough measurement points across the entire user journey. Make sure the data has enough touch points through the process to give the model better visibility into whether it is doing what the user wants.
So what are the things you need to pay attention to while designing?
This is going to sound weird, but it’s OK to remind the user of the good job you did. A lot of the time you are doing things automatically for the user, and it is normal human perception to take those things for granted. If your server is busy doing something for the user, let them know what you are doing; it helps with transparency too. Perception is key, and the last memory is the important one, so make it a good one.
A good example of this is flying on an airline. Everything can go right, you can even arrive early, but if it takes too long to get your luggage, the trip is ruined, even if you leave the airport before you were expecting to.
AI-specific problems include the user getting lost when you do things automatically for them. Too often AI tries to change state for the user. If you are creating a world for the user, you need to state its boundaries, since this isn’t AGI. If you want to dive deeper into this, listen to Episode 6 on AI personality.
When designing, ask: Does the user know where they are? Does the user know everything they can do? If you are updating a process with AI, remember the process was originally automated using the technology of the time. Don’t streamline a process that needs to be replaced. Look at the info the users are getting, what they use it for, and what info they really want.
Don’t forget accessibility. Machine learning averages toward the general case, not the exceptions. AI generalizes to the bulk of the data, so don’t forget the edge cases. It’s easier not to get sued, and you don’t want to throw away 15% of your market.
Like I said previously, transparency is important. Machine learning is already viewed suspiciously, and transparency usually isn’t possible inside the algorithms themselves. Instead, build transparency around them: what data is collected, and what the system is doing for the user.
When you are designing, be wary of groupthink. Being in machine learning makes you feel like you can solve any problem; you are solving new problems no one has solved before. Just remember, every solution is a hypothesis that needs to be tested. Everyone can come up with competing ideas. Use user testing, even in the design stage, to test the ideas on real users.
Something more controversial I might say is that the designer should have a seat at the table when deciding which algorithm you want to hire. I’ll cover in a future episode what different algorithms will get you and what they are good for. For now, just know that which one you choose will affect the UX for the customer. An example of this: I have known a billion-dollar retailer to go with an older, less accurate rule-based language processor instead of the deep learning language processors that have become the standard over the last few years. The developers recommended what they were familiar with, and there was no push back from the product manager or UX because they were not familiar with the space. Not surprisingly, the product is stumbling and having a hard time competing.
Step 3: “Hire”
Build On Expertise
As the ‘boss’ of an AI app, our worst nightmare isn’t that it is too smart; it is that it will be like us: dumb. It is even worse if it inherits our biases.
I’ll be doing an episode just on all the different kinds of bias, because there are a lot of ways to run afoul of them. For this episode I’ll just leave it at this: a lot of people think that if it is a computer program it isn’t biased, because computers aren’t biased. This of course isn’t true, since every application is built by people and uses data from people’s actions. So any bias people have can find its way into an AI model.
As the model is getting built, it is important for dev and UX to work together. A big part of building a model is trying to get the accuracy up, and the accuracy decisions should align with what you found out in the research. Part of this is the UX testing. Normal systems are deterministic: to create normal software, dev teams write test-driven processes where the outcome is expected, and it has to pass for the software to ship. AI models can give different answers every time.
Since you don’t know how the answers come about, you instead need to know what the acceptance criteria are. What metric do you need to get your number above? How will you measure it? These can be small numbers. For safety-critical systems you want to move in small measurements. For noisy systems like recommendation systems, a 2% lift in purchases above the noise might be what you are looking for. It will depend on your industry and what you found in research.
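To make that concrete, here is a minimal sketch of turning an acceptance criterion like “a 2% lift above the baseline” into a pass/fail check. The `meets_acceptance` helper and the purchase-rate numbers are my own illustration, not from any specific product:

```python
def meets_acceptance(metric, baseline, min_relative_lift=0.02):
    """Pass only if the model beats the baseline by at least the required relative lift."""
    lift = (metric - baseline) / baseline
    return lift >= min_relative_lift

# Hypothetical numbers: baseline purchase rate 2.0%, model-driven rate 2.6%.
print(meets_acceptance(metric=0.026, baseline=0.020))  # True: a 30% relative lift
```

Reducing the criterion to a single boolean like this is what lets QA treat a probabilistic model the same way they treat a deterministic test suite.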
Another problem to watch for is the ability to do an A/B test. Once software has been written, it is hard to disable the AI and have the software still work. A good way around this is to have the developers set up an alternative algorithm that just takes all the data and predicts the mean. If nothing else, this will tell you whether the AI is better than no AI.
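That “no AI” arm can be sketched in a few lines, assuming a simple numeric prediction task. The class and metric names here are illustrative stand-ins:

```python
class MeanBaseline:
    """The 'no AI' arm of an A/B test: predicts the training mean for every input."""
    def fit(self, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, _features):
        return self.mean_  # same answer regardless of input

def mean_absolute_error(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Toy ratings data; a real model has to beat this baseline to justify the AI.
ratings = [4, 5, 3, 4, 2, 5, 4]
baseline = MeanBaseline().fit(ratings)
predictions = [baseline.predict(r) for r in ratings]
print(mean_absolute_error(predictions, ratings))
```

If your trained model can’t outscore a constant prediction, the extra hardware and complexity of the AI aren’t buying the users anything.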
Step 4: “Train”
After the model is created with all the training data, it is time to open it up a little to a beta test. How you conduct the beta test will be specific to your industry. In the previous episode I spoke about chatbot models being tested using Facebook or Kik chatbots, or even as a bot on Reddit. The point is to start getting real data directly to the model for it to respond to. The model won’t be live yet, but its answers can be compared to what the subject matter expert (SME) says the answers should be.
A lot of the time, if the company is small enough, you as the UX researcher will be speaking as the SME, since you are the one that talked to the users. Testing the accuracy of models as a non-data scientist might sound difficult, but you can’t shy away from the math. It will take measuring analytics, since there could be a lot of noise. Don’t worry though; no one starts out good at math. It is just a matter of practice.
The SME needs to be directly active during user tests. For every question that comes up, the answer the SME gives can be checked against what the model would have given in the same situation. Right or wrong, the new data can be used to better train the model on edge cases, because the SME knows the problem domain. They know what a right or wrong answer is for the model to give. Exploratory testing and boundary testing can be done, because you need to know where the limits are.
A heads up that at a lot of companies, the QA group is also adjusting to the new reality of working with AI. As long as they can verify that the app gives a reasonable response, it passes. But AI can give many responses that make sense yet do not help the user achieve their goals. Make sure the metric being tested, whether quantitative or qualitative, has been reduced down enough that anyone can tell if it is a pass or a fail.
Answer quality isn’t the only thing you want to make sure is part of a good user experience. Before release verify:
Availability of serving hardware. A lot of delays can creep in when one server is depending on another’s answer.
Response time for the model to give an answer or interaction. Make sure it can be scaled up.
How fast the critical mass of users can be built up. If you don’t get enough users in the beta test, it won’t train the model to give better answers. If the pickup of customers isn’t there, why? Did you not advertise the beta enough or is part of the interaction not what is expected or wanted?
The answers the users are giving. I’m referring to the Schenectady problem. The company Meetup was showing a lot of users in New York state, way more than what was representative. When they looked up the zip code, it was for a single GE factory, but it was showing tens of thousands of users living there. That zip code was 12345. Just be on the lookout that you may need to clean your data coming in. This is why you are doing the beta.
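A rough sketch of the kind of sanity check that catches this sort of thing. The `suspicious_zips` helper and its threshold are my own illustration, not Meetup’s actual fix:

```python
from collections import Counter

def suspicious_zips(zip_codes, max_share=0.05):
    """Flag any zip code holding an implausibly large share of all users."""
    counts = Counter(zip_codes)
    total = len(zip_codes)
    return {z: n for z, n in counts.items() if n / total > max_share}

# Toy data: '12345' (the Schenectady GE plant) dominates the signups.
zips = ['12345'] * 30 + [str(10000 + i) for i in range(70)]
print(suspicious_zips(zips))  # {'12345': 30}
```

Anything flagged gets a human look before it feeds the model; defaults like 12345 are exactly the entries that skew training data.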
Step 5: “Shadow”
It is now the AI’s job to build trust
It is important to build up users to increase accuracy. As the first users start to use your app, the accuracy will be low because of the lack of data. This difference between expectation and reality has been labeled the “gulf of disappointment”. Time spent in the gulf of disappointment comes from bad design or bad accuracy.
If users spend too much time disappointed they will stop using the app.
Bad design should be covered by the UX work that was hopefully done, so it won’t be the cause. Bad accuracy is a reality when starting out because of the lack of data. As the number of users increases and more data is collected, it becomes time to walk the tightrope between using more hardware to increase accuracy and simplifying the model to increase the number of users being served.
It used to be that UX would work with the developers to create a good experience, dev would build it, and then QA would verify it. After that the product would be released and the team moved on to new features. This still works for non-AI products. The current process of developing machine learning throws a wrench into that when things go into production (known as inference for ML models). ML is still enough on the cutting edge that most developers are on the research side of creating ML models, meaning they care more about getting the model quality up than about speed, so the models that do get created work well but are slow.
There are a lot of ways to get the speed up for production but there are trade-offs.
Unfortunately some tradeoffs can negatively affect the model accuracy, so make sure to test before and after for each model to make sure it still makes for an acceptable user experience.
The most obvious solution is to just optimize the code for the current model. A lot of times companies have a separate engineer who specializes in productionizing models; this is a different skill from building them. They can strip out inefficient parts of the model to simplify it while affecting quality as little as possible. This can be verified before production.
The next solution is distillation. This is where the slow, large, accurate model trains a faster, simpler model. The idea is that a smaller model is created without all the specialized code. A much larger training set can be used, since it does not need to be hand labeled; instead, the large slow model is used to tag unlabeled training data. The larger training set hopefully allows for a similar quality measure. This is easy to catch if QA tracks model versions the same as code revisions.
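The labeling step of distillation is roughly this, sketched in Python. The teacher here is a stand-in callable; a real one would be the trained production model:

```python
def distill(teacher, unlabeled_inputs):
    """Have the slow, accurate teacher label data the student will train on."""
    return [(x, teacher(x)) for x in unlabeled_inputs]

# Stand-in teacher: any callable mapping an input to a label or score.
teacher = lambda x: 1 if x > 0.5 else 0

# Unlabeled data is cheap, so this pseudo-labeled set can grow far larger
# than anything you could hand-label.
pseudo_labeled = distill(teacher, [0.1, 0.7, 0.9, 0.3])
print(pseudo_labeled)  # [(0.1, 0), (0.7, 1), (0.9, 1), (0.3, 0)]
```

The smaller student model then trains on `pseudo_labeled`, trading a little of the teacher’s accuracy for much faster serving.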
The third solution is changing the serving hardware. This one is tricky. Up to this point, one of the general ideas of code is that once software is compiled, it will run the same no matter what hardware you put it on. This is not the case for ML models: the hardware can affect the accuracy of the model. More expensive hardware means more tensor cores, in the case of GPUs, which means more math can be done, which means more accurate answers. If it is a large enough deployment, the number of GPUs (or TPUs if you are using Google for serving) can affect the quality of answers, so the servers all need to be live with the software on the serving hardware before you can be sure the user experience you tested will be the experience users actually get.
The last solution I will cover is quantization. It is a shortcut that cuts down on the math needed. The idea is to take big numbers that take a lot of space (for example, floating point numbers like 3.12345) and shorten them (to 3.1 or even just 3, depending on whether you are using floating point or integer quantization). It speeds things up, but again the experience needs to be verified after the changes are made. Also, just as a heads up, quantizing models makes quality levels even more finicky depending on the hardware they run on, and the hardware will determine what kind of quantization, and how much, can be done.
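The rounding trick at the heart of quantization can be shown in a few lines. This is a toy illustration of the idea, not a real quantization library:

```python
def quantize(x, scale=10):
    """Store x as a small integer; return the integer and the recovered approximation."""
    q = round(x * scale)   # 3.12345 -> 31
    return q, q / scale    # recovered value: 3.1

q, approx = quantize(3.12345)
print(q, approx, abs(3.12345 - approx))  # 31, 3.1, and an error of about 0.02345
```

The integer 31 is cheaper to store and compute with than the full float, and that small per-number error is exactly the accuracy loss you need to re-verify before shipping.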
Step 6: “Lead”
This is a successfully working model, but it’s too early to pat ourselves on the back. Successful models are built in small steps. Building a successful model brings in more data that was not available before. Does this new data give you a way to improve the model? Does it give you the data needed to fulfill another feature requested by the users?
You probably were not able to achieve the ultimate goal of what the model should do based on the data available to you at the beginning. It is good to have a model expansion plan: building model A will allow you to gather data X; data X will allow you to build model B, which will allow gathering data Y; and so on until you reach what you envisioned for the users.
This also lets you start the cycle again. What was learned? What can be improved for next time?
And on that note, what can I improve, for this episode and for the podcast in general?
That’s all we have for this episode, but I would love to hear back from you what you like and would like to hear more of.
If you have questions or comments, use your phone to record a voice memo.
If you would like to see what I am up to, you can find me on Twitter at @DesignForAI
Thank you again
and remember, with how powerful AI is,
let’s design it to be usable for everyone