
It used to be that UX would work with the developers to design a good experience. Dev would build it, and then QA would verify it. After that, the product would be released and the team would move on to new features. This still works for non-AI products.

The current process of developing machine learning throws a wrench into that once models go into production (serving predictions from a trained model is known as inference). ML is still close enough to the cutting edge that most developers come from the research side of creating models, meaning they care more about getting model quality up than about speed. The models that get created work well but are slow. There are a lot of ways to get the speed up for production, but each one comes with trade-offs.


The most obvious solution is to optimize the code for the current model. Companies often have a separate engineer who specializes in productionizing models, which is a different skill from building them. That engineer can strip out inefficient parts of the model to simplify it while affecting quality as little as possible. Because the simplified model exists before release, its quality can be verified before production.
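As a rough illustration of verifying a simplified model before release, here is a minimal sketch. It assumes a toy linear classifier; the weights, the pruning threshold, and the `MIN_AGREEMENT` release threshold are all made up for the example, and a real check would use a labeled validation set and the team's own quality metric.

```python
import random

def predict(weights, bias, features):
    """Linear classifier: 1 if the weighted sum is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def prune(weights, threshold):
    """Zero out weights too small to matter much (a simple form of pruning)."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def agreement(weights_a, weights_b, bias, inputs):
    """Fraction of inputs on which the two models give the same answer."""
    same = sum(
        predict(weights_a, bias, x) == predict(weights_b, bias, x) for x in inputs
    )
    return same / len(inputs)

random.seed(0)  # fixed seed so the check is repeatable

# Hypothetical weights standing in for the research model.
original_weights = [2.0, -1.5, 0.01, -0.02, 0.005]
bias = 0.1
simplified_weights = prune(original_weights, threshold=0.05)

# Hypothetical held-out inputs, generated at random for the example.
inputs = [[random.uniform(-1, 1) for _ in original_weights] for _ in range(1000)]

rate = agreement(original_weights, simplified_weights, bias, inputs)
MIN_AGREEMENT = 0.95  # assumed release threshold, not a recommended value
print(f"agreement: {rate:.3f}, ship: {rate >= MIN_AGREEMENT}")
```

The point is less the pruning itself than the gate at the end: the simplified model is compared against the original on the same inputs before anyone ships it.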


This is where the slow, large, accurate model is used to train a faster, simpler model, an approach often called knowledge distillation or teacher-student training. The idea is that a smaller model is created without all the specialized code. A much larger training set can be used because it does not need to be hand labeled. Instead, the large slow model is used to tag unlabeled training data. The larger training set hopefully allows the smaller model to reach a similar quality measure. A model swap like this is easy to catch if QA tracks model versions the same way it tracks code revisions.
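The idea of a large model labeling data for a small one can be sketched in a few lines. Everything here is an assumption for illustration: the `teacher` function is a hypothetical stand-in for the slow model, the student is a plain perceptron, and the data is random. A real setup would use actual models and usually trains on the teacher's probabilities rather than hard labels.

```python
import random

def teacher(features):
    """Stand-in for the large, slow model (hypothetical rule for illustration)."""
    x0, x1 = features
    return 1 if 3.0 * x0 - 2.0 * x1 + 0.5 > 0 else 0

def student_predict(weights, bias, features):
    """Small, fast model: a linear classifier."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_student(labeled, epochs=20, learning_rate=0.1):
    """Train a perceptron on (features, label) pairs."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in labeled:
            error = label - student_predict(weights, bias, features)
            if error != 0:
                weights = [
                    w + learning_rate * error * x for w, x in zip(weights, features)
                ]
                bias += learning_rate * error
    return weights, bias

def random_point():
    return [random.uniform(-1, 1), random.uniform(-1, 1)]

random.seed(0)  # fixed seed so the run is repeatable

# No hand labeling: the teacher tags the unlabeled data.
unlabeled = [random_point() for _ in range(2000)]
pseudo_labeled = [(x, teacher(x)) for x in unlabeled]

weights, bias = train_student(pseudo_labeled)

# Compare student and teacher on inputs neither saw during training.
fresh = [random_point() for _ in range(500)]
agreement = sum(student_predict(weights, bias, x) == teacher(x) for x in fresh) / len(fresh)
print(f"student agrees with teacher on {agreement:.1%} of fresh inputs")
```

The agreement number at the end is the kind of figure QA could record next to each model version.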


This one is tricky: the serving hardware itself. Up to this point, one of the general assumptions about code has been that once software is compiled, it behaves the same no matter what hardware you run it on. This is not the case for ML models. The hardware can affect the accuracy of the model. More expensive GPUs have more tensor cores, which means more math can be done in the same amount of time, so a larger or higher-precision model can be served without getting slower. Hardware also differs in which numeric precisions it supports and in the order it carries out floating-point operations, and either can shift results slightly. In a large enough deployment, the number of GPUs (or TPUs if you are using Google for serving) can affect the quality of answers too. The software needs to be live on the actual serving hardware before you can be sure that the experience you tested is the experience users will actually get.
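One plausible mechanism behind hardware-dependent results is that floating-point addition is not associative, so splitting the same sum across a different number of devices can change the answer. The sketch below is plain Python, not GPU code, and the values are deliberately extreme to make the effect visible; `sharded_sum` is a hypothetical helper that mimics per-device partial sums.

```python
def sequential_sum(values):
    """Add values one at a time, left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

def sharded_sum(values, num_shards):
    """Sum values as per-shard partial sums, then combine the partials."""
    shard_size = (len(values) + num_shards - 1) // num_shards
    partials = [
        sequential_sum(values[i:i + shard_size])
        for i in range(0, len(values), shard_size)
    ]
    return sequential_sum(partials)

# Same numbers, same code; only the number of shards changes.
# The mathematically exact total is 2.0.
values = [1e16, 1.0, -1e16, 1.0]
one_device = sharded_sum(values, num_shards=1)   # 1.0
two_devices = sharded_sum(values, num_shards=2)  # 0.0
print(one_device, two_devices)

# A milder everyday case: grouping alone changes the last digit.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

Real models rarely see differences this large, but small shifts repeated across millions of operations are one reason results should be checked on the hardware that will actually serve them.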


One of the shortcuts that can cut down on the math needed is called quantization. The idea is to take numbers that take up a lot of space (for example, floating-point values like 3.12345) and shorten them (to 3.1 or even just 3). It speeds things up, but again the experience needs to be verified after the changes are made. Also, as a heads-up, quantizing a model makes its quality even more sensitive to the hardware it runs on, and that hardware determines what kind of quantization, and how much, can be done.
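To make the trade-off concrete, here is a minimal sketch of one common scheme, symmetric quantization with a single shared scale. The weights are made-up values, and production toolkits do considerably more (per-channel scales, calibration data, hardware-specific integer kernels), so treat this as an illustration of the rounding error only.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

def max_error(values, num_bits):
    """Largest gap between an original value and its quantized version."""
    quantized, scale = quantize(values, num_bits)
    restored = dequantize(quantized, scale)
    return max(abs(v - r) for v, r in zip(values, restored))

# Hypothetical model weights, chosen only for the example.
weights = [3.12345, -1.5, 0.25, 0.001]

error_8bit = max_error(weights, num_bits=8)
error_4bit = max_error(weights, num_bits=4)
print(f"8-bit max error: {error_8bit:.5f}")
print(f"4-bit max error: {error_4bit:.5f}")
```

Fewer bits means less math and less memory but a larger error per weight, which is why the experience has to be rechecked after every change to the quantization settings.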
