Productionising machine learning through the lens of semiconductor fabrication
Current machine learning processes leave much to be desired
A persistent criticism of machine learning is that it lacks the consistent and reliable processes of conventional software engineering, and that a lack of software engineering knowledge in the machine learning community is the reason for slow deployment processes and a lack of testing.
Much has been written on these issues and on methods to tackle them. Sculley et al. (2015) collect a list of technical debts that commonly accumulate in machine learning codebases and suggest ways to mitigate them. Meanwhile, Breck et al. (2017) detail a framework for evaluating the production-readiness of ML systems, consisting of 28 tests spread across data, model, infrastructure and monitoring. More importantly, both papers describe in detail the degree to which machine learning differs from traditional software. This prompts us to seek solutions beyond software engineering.
Many techniques and phenomena in machine learning have multiple interpretations and can be seen from different (e.g. Bayesian) perspectives, and the same is true of ML productionisation. In this post we compare it to semiconductor manufacturing. Better analogies can certainly be found, but we use this one as a starting point, a different perspective from which new ideas could emerge. We don’t claim to be experts in microprocessor fabrication and its processes.
There are some dissimilarities between ML and software engineering that can’t be easily reconciled
There is often a strong urge to treat software engineering standards as the bar against which all machine learning development cycles are judged. In this section, we focus on a few discrepancies between the development cycles of software engineering and those of machine learning.
Features and components within traditional software can often be decomposed, so that the behaviour of a system or feature can be traced to specific parts of the code. Machine learning adds further layers of complexity: input features and model parameters are heavily entangled, and intermediate outputs, even when they are sometimes interpretable, do not carry a consistent meaning across different initialisations.
Pieces of software can often be tested individually and quickly as components, and the bulk of the work and time cost lies in writing the code rather than in compiling it. With machine learning, new ideas and models often require comparatively small code changes, but the equivalent of “code compilation”, i.e. training, takes orders of magnitude longer than for most traditional software. Unless an error breaks the run on initialisation, many issues can only be identified once training has finished.
Testing code in traditional software is more transparent: common errors or failures can often be traced back to specific lines of code, which can then be debugged. For ML, and especially for neural net models, more thorough and complex evaluations and tests are needed to detect errors and failures. Should an ML system work incorrectly, identifying the source of the error could require expertise spanning data processing, model code, model design and ML theory. Every change that requires retraining also risks introducing new errors into the system. As a result, despite the work done on the deployment of ML models, machine learning production and iteration cycles can take weeks or months to complete.
These problems in ML have analogies in semiconductor fabrication
Microprocessors are some of the most complex products in the world. Improvements in CPU performance can be split into those made through new manufacturing technologies (i.e. smaller transistors, along the lines of Moore’s law) and those made through microarchitecture optimisations. Changes in manufacturing technology lead to significant improvements in processor performance, but they require investment in completely new technologies, which is becoming more and more difficult as transistors shrink to scales where quantum effects matter. Meanwhile, the optimisation and development of microarchitectures lead to more efficient processors, bringing performance improvements as well as new features.
The huge complexity of microprocessors means that they’re incredibly difficult and expensive to test and evaluate. Issues and vulnerabilities can also be incredibly difficult to fix; the Meltdown (Lipp et al., 2018) and Spectre (Kocher et al., 2018) vulnerabilities, which affected Intel CPUs among others, took years to discover, and the Spectre vulnerability could not be fully addressed until a later generation of Intel chips.
The path from research to production for CPU technologies is also slow; going from the start of production to shipment takes several months for a single chip, and it can take years for a chip design to go from research to a packaged product.
Some ideas from semiconductor fabrication can be borrowed to tackle the same problems in ML
Much work has been done on trying to mitigate and erase the differences between machine learning and traditional software engineering. However, if these characteristics of ML were embraced instead, new processes and standards could be defined. Ideas from semiconductor and microprocessor fabrication can help make sense of these characteristics.
The Intel tick-tock model
Between 2006 and 2015, Intel successfully used a production model called “tick-tock”. Each “tick” corresponded to a change in manufacturing technology, shrinking transistor dimensions by roughly a factor of √2 (and so roughly halving the area each transistor occupies); these are riskier improvements that lead to large step changes. Each “tock” corresponded to a change in the processor architecture, implementing many optimisations and features such as hardware-accelerated video transcoding. The adoption and successful implementation of the tick-tock model corresponds to a decade in which Intel products dominated the CPU market, with relentless improvements in processor performance.
One balancing act that many ML startups face is the choice between iterating on and refining an existing model or framework towards productionisation, and conducting more experimental research that takes longer to reap rewards. In practice, this has meant that the more adventurous research has been seen only in large established organisations and in a small fraction of startups that commit themselves to the deep-tech, moonshot path. The tick-tock model suggests that something similar could be implemented at a high level for ML research and development, with development defined in cycles that alternate between experimental research and incremental refinement.
In semiconductor production, long production processes and testing procedures are handled with production lines: each stage of production involves specialists with automated tools, with a constant stream of chips flowing through the pipeline.
A production line approach to the productionisation of ML models could mean that processes such as data processing, model retraining or model evaluation are treated as stages in their own right, each with dedicated engineers and expertise. In such a framework, the goal could be to maximise the throughput of the ML production pipeline rather than the research output or the time to production.
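To make the throughput framing concrete, here is a minimal sketch in Python. The stage names come from the paragraph above, but the durations are made-up illustrative numbers, and `Stage`, `latency_hours` and `throughput_per_week` are hypothetical helpers. It assumes each stage works on one model at a time while the other stages are busy with different models.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str               # stage label (illustrative)
    hours_per_model: float  # time one model spends in this stage (made-up number)


def latency_hours(stages):
    """Time for a single model to pass through every stage in sequence."""
    return sum(s.hours_per_model for s in stages)


def throughput_per_week(stages, hours_per_week=168.0):
    """Steady-state models completed per week when all stages run in parallel
    on different models; the slowest stage sets the pace."""
    bottleneck = max(s.hours_per_model for s in stages)
    return hours_per_week / bottleneck


# Illustrative numbers only, not measurements.
PIPELINE = [
    Stage("data processing", 12.0),
    Stage("model retraining", 48.0),
    Stage("model evaluation", 8.0),
]

if __name__ == "__main__":
    print("latency (hours):", latency_hours(PIPELINE))
    print("throughput (models/week):", throughput_per_week(PIPELINE))
```

Under these assumptions, speeding up data processing or evaluation shortens the time to production for a single model but leaves throughput unchanged; only work on the slowest stage (here, retraining) raises the number of models the line can ship.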
Design for Testability
Design for Testability is a concept in microprocessor design that builds testing and evaluation features into the processor itself. One example is the scan chain, which allows every flip-flop within an integrated circuit to be set and monitored externally, so that the individual components of a processor can be tested easily.
Similarly, for machine learning models, especially neural nets, one could imagine designing frameworks in which inputs can be set at any layer, and any intermediate outputs or gradients can be read out after running the model.
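As a rough illustration of what such a “scan chain” for a model might look like, here is a minimal pure-Python sketch. The `ProbedNetwork` class and the toy layers are hypothetical and written for this post; they are not the API of any existing framework. The sketch covers only setting inputs at an arbitrary layer and reading intermediate outputs, and leaves gradients out.

```python
class ProbedNetwork:
    """A stack of layers whose intermediate values can be read or injected."""

    def __init__(self, layers):
        # Each layer is a callable mapping a list of floats to a list of floats.
        self.layers = list(layers)
        # Layer index -> output recorded during the most recent run.
        self.activations = {}

    def forward(self, x, start_layer=0):
        """Run layers[start_layer:] on x, recording every intermediate output.

        With start_layer > 0, x is injected directly as the input of that
        layer and all earlier layers are skipped.
        """
        self.activations = {}
        for i in range(start_layer, len(self.layers)):
            x = self.layers[i](x)
            self.activations[i] = x
        return x


# Toy layers, for illustration only.
def scale(factor):
    return lambda xs: [factor * v for v in xs]


def relu(xs):
    return [max(0.0, v) for v in xs]


def total(xs):
    return [sum(xs)]


if __name__ == "__main__":
    net = ProbedNetwork([scale(2.0), relu, total])
    print(net.forward([1.0, -3.0]))                 # full forward pass
    print(net.activations)                          # every intermediate output
    print(net.forward([-1.0, 4.0], start_layer=1))  # inject at the ReLU layer
```

Existing frameworks already offer part of this; PyTorch, for example, lets you attach forward hooks to modules to record their outputs. The point of the sketch is that the probes are part of the model’s design rather than something bolted on during debugging.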
The fabrication of electronic circuits onto wafers is not always a reliable process; dust and other impurities can break circuits and cause chips to fail. Because of this, the concept of yield is used to describe the proportion of chips or microprocessors in a production batch that operate without error.
While there are no easy parallels to draw here, the stochastic nature of ML models means that “yield” could be a helpful concept for various methods of evaluating model robustness, whether it be the yield of valid model outputs under different initialisations, over the distribution of model inputs, or over the distribution of training datasets.
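One way to read “yield under different initialisations” is as the fraction of training runs, each with a different random seed, whose validation score clears a fixed bar. The sketch below makes that definition explicit. `train_and_score` is a hypothetical stand-in that draws a random score instead of training anything, and the 0.85 threshold is an arbitrary assumption.

```python
import random


def train_and_score(seed):
    """Hypothetical stand-in for a full training run.

    A real version would train a model with this seed and return its
    validation score; here we only draw a noisy number for illustration.
    """
    rng = random.Random(seed)
    return 0.9 + rng.gauss(0.0, 0.03)


def model_yield(scores, threshold):
    """Fraction of runs whose score is at or above the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one run to compute a yield")
    passed = sum(1 for s in scores if s >= threshold)
    return passed / len(scores)


if __name__ == "__main__":
    seeds = range(20)
    scores = [train_and_score(seed) for seed in seeds]
    print("yield over seeds:", model_yield(scores, threshold=0.85))
```

The same `model_yield` function could be applied to scores gathered over slices of the input distribution or over resampled training sets, which are the other two notions of yield mentioned above.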
These proposed solutions are nice but ML still has a long way to go
These analogies are in no way perfect, and there are certainly other fields from which analogies can be drawn. Perhaps running a successful ML pipeline, with the alchemic nature of neural net models, can be compared to running a successful Michelin-starred restaurant, or maybe productionising SoTA ML models year after year can be compared to the ruthless competition of building championship-winning Formula One race cars. Nonetheless, there is a lot more to be done to establish standards and practices for machine learning, which is both a daunting and an exciting prospect; chances are the best frameworks and processes for machine learning are yet to be created.
Many thanks to Lorenzo, Alexandra and Taras Iakymchuk for helpful comments!
Sculley, Holt, Golovin, Davydov, Phillips, Ebner, Chaudhary, Young, Crespo, Dennison (2015). Hidden Technical Debt in Machine Learning Systems.
Breck, Cai, Nielsen, Salib, Sculley (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.
Lipp, Schwarz, Gruss, Prescher, Haas, Fogh, Horn, Mangard, Kocher, Genkin, Yarom, Hamburg (2018). Meltdown: Reading Kernel Memory from User Space.
Kocher, Horn, Fogh, Genkin, Gruss, Haas, Hamburg, Lipp, Mangard, Prescher, Schwarz, Yarom (2018). Spectre Attacks: Exploiting Speculative Execution.