What I’ve learned about MLOps from speaking with 100+ ML practitioners

Vesi Staneva
May 17, 2021

Over the past months, I’ve been banging my head against the wall trying to understand what lies behind the overly hyped statement “90/87/85% of the machine learning models never get to production” and what it has to do with MLOps. Heading the product and customer development effort of a brand new Machine Learning deployment platform called TeachableHub left me with no other choice but to dive deep into those “dark matters”.

So I have made it my mission to uncover the truth beneath all the marketing hype and to pass on what I have learned, with gratitude to all those who so generously shared their amazing experience with me and my team. To achieve this goal, we’ve spoken with 100+ Machine Learning practitioners, ranging from Full-stack Data Scientists in fast-growing startups to Software Engineers at well-established enterprises to experienced MLOps engineers laying the foundations for operationalizing Machine Learning. A huge “Thank you” to all those professionals for their time and willingness to share!

Without further ado, here are some of the key takeaways about the everyday MLOps challenges my team discovered on this fantastic journey.

“I was a happy data scientist until we decided it was time for deploying our models.”

This was how Ale Solano, a Data Scientist solving problems with NLP and Computer Vision, introduced himself to the MLOps community back in November 2020. At the time, his company was just starting to deploy its models to production with only a couple of Data Scientists. His team had a passion for solving problems, but so far they had done it only the scientific way. No engineering background. No MLOps know-how.

Photo by Science in HD on Unsplash

And throughout the conversations with other Data Scientists, it became apparent that many companies are attempting to introduce Machine Learning the same way: with only a handful of Data Scientists. It makes a lot of sense to do a proof of concept (PoC) first, before investing too much effort and resources into building a full-fledged Machine Learning product. Still, training a model and reaching a certain level of accuracy doesn’t really prove that the model could and should be implemented in a live environment. And if the solution is not tested in production, it neither solves the business problem nor returns the investment. Naturally, Data Scientists are then given another task: finalize the PoC and deploy their model to production.

How hard can this be? Deploying models successfully and serving them at scale requires skills and knowledge more commonly found in software engineering and DevOps teams. Think of infrastructure, hardware orchestration, container management, scaling, load balancing, redundancy, availability, performance, security, authentication, authorization, integration, versioning, reproducibility, monitoring, model re-training… All this and more should be considered in advance and set up in order to establish a CI/CD pipeline that gets a model to production in a safe, cost-efficient, and maintainable manner. The short answer we’ve heard from one too many Data Scientists is simply “Very, very hard”, because they soon realize that deployment is not a task, it’s a process.
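To make the “process, not a task” point a bit more tangible, here is a minimal Python sketch of a release gate: a candidate model only moves forward if every check in a list passes. None of this comes from the interviews. The `Candidate` fields, the check names, and the thresholds (0.90 accuracy, 200 ms latency) are hypothetical placeholders, and a real pipeline would cover far more of the concerns listed above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    """Hypothetical summary of a model that is about to be released."""
    name: str
    version: str
    accuracy: float
    p95_latency_ms: float
    has_rollback_plan: bool


# Each check is a (name, predicate) pair. Thresholds are illustrative only.
Check = Tuple[str, Callable[[Candidate], bool]]

CHECKS: List[Check] = [
    ("versioned", lambda c: bool(c.version)),
    ("accuracy", lambda c: c.accuracy >= 0.90),
    ("latency", lambda c: c.p95_latency_ms <= 200.0),
    ("rollback", lambda c: c.has_rollback_plan),
]


def run_release_gate(candidate: Candidate, checks: List[Check] = CHECKS) -> List[str]:
    """Return the names of the failed checks; an empty list means 'go'."""
    return [name for name, check in checks if not check(candidate)]


if __name__ == "__main__":
    model = Candidate(
        name="sentiment-classifier",  # made-up example model
        version="1.4.0",
        accuracy=0.95,
        p95_latency_ms=350.0,
        has_rollback_plan=True,
    )
    failed = run_release_gate(model)
    print("Blocked by:", failed if failed else "nothing, ready to deploy")
```

Even in this toy form, a model with good accuracy can still be blocked by something that has nothing to do with data science, which is exactly where many of the teams we spoke with got stuck.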

Photo by Ariel Biller

In conclusion, this approach to introducing Machine Learning into an organization is somewhat flawed from the start, as Data Scientists are usually not the people with extensive knowledge of the areas mentioned above, and acquiring such diverse know-how across multiple disciplines takes a lot of time and practice. In this setting, the chances of building a reliable deployment solution are slim and the risk of failure is high.

The build vs. buy dilemma

Obviously, since TeachableHub is entering the market of ML deployment and serving solutions, it’s vital to understand what exactly is out there, who uses it, and whether it solves their problems well enough.

It’s no surprise that most of the companies we found buying managed solutions are using AWS, GCP, or Azure. Most organizations are looking for a “safe bet” when it comes to such important decisions, and they cite the tools’ level of adoption as the main reason for their choice. The most adopted tools are usually well documented and easier to get approved at C-level. That is common reasoning amongst bigger companies and enterprises.

The top choice was Sagemaker. What did surprise us is that, even though it was the first choice, we heard quite a few negative statements about it:

“I use Sagemaker, but I don’t really like it. The only reason I chose to use that now is because I have experience with AWS and currently I’m the only one dealing with deployments. Still, in the future we’ll have to think of an alternative as Sagemaker’s workflow is not simplified, support is bad, and it’s not intuitive for Data Scientists. Vendor lock-in is also a concern.”

“If I let my DS team use Sagemaker, things will get too messy. We need a more simplified solution.”

“I started with AWS as I was the most familiar with the technology and there is a lot of documentation, still there are quite a few things that are not very intuitive and the solution is not very cost-efficient for me.”

“I use Sagemaker, but it is too complicated. It’s too expensive!”

“Sagemaker has a huge learning curve and it is frustrating and annoying to work with it.”

“Deployment time with Sagemaker is better than a DIY solution, but if I could do it differently I would have pre-built an internal pipeline I can use for all my models using Kubernetes.”

When it comes to company size, the tendency we’ve noticed is that the bigger and more inert the company (for example, because it carries too many legacy technologies), the more likely it is to turn to a ready-to-use or cloud solution. The banking and finance industry is a great example.

On the other hand, smaller companies and startups usually choose the “build it on your own” approach for more flexibility. Such DIY solutions are usually built on top of Kubernetes and Docker containers. In the early ML days of those companies, it makes a lot of sense not to commit to an expensive off-the-shelf platform. The key question here is how scalable, reliable, and maintainable such a solution is. Usually, when such companies reach a certain point in their business growth, the solution becomes unmanageable or too expensive to take care of, so they start looking for alternatives.

Bonus: Here is a very authentic post under the topic “You build it, you run it” from the MLOps Community that discusses some of the issues around ownership and responsibilities within ML teams that build their own ML solutions.

And when it comes to all the new MLOps tools coming onto the market, the proliferation is huge and most practitioners are quite conservative. Most of them are waiting to see “what will stick around and survive for the long run” before they commit to using any of those tools. Another thing is that most of those deployment, serving, and monitoring tools target mainly enterprises, so they come off as too complex and pricey. Getting started with such a tool also takes quite a while.

Another common concern with such tooling is vendor lock-in, so most companies, regardless of size, usually opt for a combination of best-of-breed solutions rather than an end-to-end platform. The reasoning is that smaller tools usually solve the problems they are designed for really well and are a better fit for the organization’s needs. By contrast, a single tool that claims to do it all usually doesn’t do a good job with specific tasks, is way too complex, and is oftentimes much more expensive.

And lastly, a couple of curious statements that popped up during the meetings about the marketing messages the new players in the MLOps field use:

“It seems to me ML tool providers are not good with their Marketing and don’t explain well what they offer. When I visit most of their websites I’m thinking ‘What the hell is this?’”

“The ready-to-use MLOps platforms I explored were either too complicated or I didn’t really get what to do with the tools. Their websites and documentations are quite messy.”

Photo by Ashley Batz on Unsplash

To sum up, when it comes to choosing deployment and serving solutions, things are still very messy. There is quite a lot of room for improvement in the market and, luckily for my team, a lot of value TeachableHub can add by making the process simpler and more affordable.

“We don’t have a technology problem, we have a people problem.”

This statement was one of the toughest truths we’ve learned… Even the best tooling and the most automated solution cannot help teams that have not adopted the proper mindset when it comes to introducing MLOps to an organization. Tools aim to bridge the gap, but sometimes what is needed is a common space, a common mindset, and a common language.

“#1 management challenge is the communication in the ML teams + clear roles and responsibilities + separation of the concerns.” — Laszlo Sragner, Founder at Hypergolic

Ultimately, any bigger change to an organization’s established workflow causes friction between its different levels. The most common scenario is a “top-down” approach, where someone at C-level decides it’s time for the company’s ML operations to move to the next level as the business is growing.

Naturally, someone needs to take the lead on this bold move. Management usually assigns the task to the most experienced DS/MLE on the team, but it is also not uncommon for an external hire with special expertise to be brought on board to execute it. It’s great that such an important change has support from above but, like any change, it’s not always welcomed at the lower levels.

The “lucky” person who gets to introduce MLOps principles realizes not only that there are no best practices that can easily be applied to every use case, but also that not every Data Scientist on the team will want to go full-stack. Shifting focus from passionately solving science problems to struggling with learning Docker and Kubernetes often causes a lot of frustration in the team and can significantly reduce the quality of the Machine Learning models the team is creating.

Such leaders need to be very flexible, empathetic and most of all patient as their task is not just to find the best MLOps tool and adopt it. They need to grow a mindset, keep the team’s spirits up, and lead the way to MLOps maturity within their organization.

Photo by Jehyun Sung on Unsplash

I would like to finish this section with a quote from Demetrios Brinkmann’s awesome post “MLOps is maturing, and here’s the evidence”:

“Those of us at the vanguard of ML will have to map the potential of MLOps, share the success stories, codify best practices, and show organisations how ML can help them realise practical business goals.”

I share the belief that such leaders are obliged to transfer knowledge and experience to the communities outside of their organizations too, as this is the only true path to attaining MLOps maturity. Let’s embrace the great honor of being the shapers of this bright future!

Fin

It’s been a lot of fun meeting so many amazing people, and all of the above is just a small fraction of the precious insights we’ve gathered along the way. Most of those conversations confirmed that companies face quite a few challenges when introducing MLOps principles and operationalizing models. Still, most companies are slowly getting to production one way or another. It’s also true, though, that many more companies haven’t even entered the ML market. I would conclude that the many obstacles to operationalization, plus the fact that the global Machine Learning market is projected to grow from $7.3B in 2020 to $30.6B in 2024, are amongst the main reasons behind the hype around MLOps. What do you think?

If what I shared above intrigued you, I’m confident that hearing about the rest of my team’s experience will be worth your time. So don’t be shy and drop me a note on LinkedIn, let’s have a cup of coffee with MLOps flavor! We’ll be releasing a series of posts with more precious insights over the next months, so your feedback is more than welcome. ;)

Update: You can find the second round of MLOps ‘signposts’ our team gathered while searching for the secret formula for MLOps success. Feel free to share your thoughts and comment under the article!

Vesi Staneva

Co-founder & Head of Product, SashiDo.io, GPTboost.io, and ChatShare