What to Consider When Selecting MLOps Tooling / Infrastructure

Guidance on things to consider as your company sets up its MLOps ecosystem


One of the hottest recent topics in the AI/ML community has been the adoption of MLOps as a best practice. The challenge is that MLOps can be pretty difficult to enable. As a machine learning engineer at a large company, I’ve been privy to many different options for enabling MLOps; it feels like at least one new vendor reaches out to me each week asking if I’d be interested in test driving their new MLOps enablement tool. To date, I haven’t seen a general consensus in the AI/ML community that any one particular tool is “the tool” for MLOps. Part of the reason is that different companies have different needs, and some tooling I’ve seen works better for some companies than for others.

The goal of this post is to help you think through your own company’s needs and how you might approach MLOps tooling and infrastructure at a high level. Because different companies have different needs, you won’t find a hardline answer in this post, and you won’t see me recommend any one particular tool or platform. These considerations are broad enough that it doesn’t particularly matter which tool or platform we’re talking about.

To keep things simple and structured, we’re going to analyze a set of four combinations based on two variables: MLOps tooling and MLOps infrastructure. By MLOps tooling, I’m referring to the actual tool you will use to enable MLOps, and by MLOps infrastructure, I’m referring essentially to the computing platform on which that MLOps tooling will run. To get to the four combinations, I’m splitting the tooling and infrastructure into two options each (enumerated in the small sketch after this list):

  • Tooling #1: Vendor Product Solution: This refers to choosing an external vendor product to help enable your company’s MLOps needs. As vendor products typically do, these seek to abstract away many of the challenges, which should hopefully streamline the process of MLOps enablement.
  • Tooling #2: Open Source or Custom Solution: The opposite of a vendor product solution is opting for open source or a custom, homegrown solution. Contrasted with a vendor product solution, this can potentially save your company a lot of money, but it also requires that your MLOps engineers are technically capable of building and maintaining something like this.
  • Infrastructure #1: Cloud Provider: These days, it is very common for companies big and small to host their IT infrastructure with a cloud provider. These cloud providers also tend to offer services targeted at particular domains, and machine learning is definitely no exception. All three big cloud providers (AWS, Google Cloud Platform, and Microsoft Azure) have robust options for machine learning enablement. (But not specifically MLOps enablement, at least not yet.)
  • Infrastructure #2: Self-Managed Environment: Given that cloud computing is still relatively new, many large organizations still maintain their own server farms for their compute needs. It is not uncommon for those companies to leverage these environments for hosting MLOps solutions.
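
To make the two-by-two framing concrete, here is a minimal Python sketch that simply enumerates the four combinations analyzed in the rest of this post. The labels are just the option names from the list above, nothing more:

```python
from itertools import product

# The two axes discussed above, each with two options.
tooling = ["vendor product solution", "open source / custom solution"]
infrastructure = ["cloud provider", "self-managed environment"]

# Enumerate the four combinations covered in the sections below.
for i, (tool, infra) in enumerate(product(tooling, infrastructure), start=1):
    print(f"Combination #{i}: {tool} + {infra}")
```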

Now that we’ve outlined our options, let’s jump into the benefits and challenges of each combination and who might be best served by each.

Combination #1: Vendor Product Solution on a Cloud Provider

Benefits: Presuming that everything works well, this combination should offer the most efficient way to enable MLOps for your company. Because there is essentially nothing homegrown here, you should presumably be able to cut back on the number of people your company has to keep on the payroll; support for the product and the cloud infrastructure would come from those respective companies. In a perfect world, this should be the best option for anybody…

Challenges: Okay, so as you might be able to tell, I actually don’t have much faith in the benefits mentioned above. Setting aside the fact that this can be the most costly option if you’re not careful, you’re essentially taking puzzle pieces out of two different boxes and hoping they fit together. Sometimes they do, but oftentimes I have seen that vendor product solutions don’t really jibe well with cloud providers. And if you consider that cloud providers also seek to offer their own machine learning services, that feels a bit intentional. (That is pure speculation, by the way.) Moreover, if your company is already established in a niche way, you’re essentially trying to fit a round peg through both a square and a triangle hole at the same time!

Who is this for: Of all four combinations you’ll read about in this post, I transparently struggled the most with who is best served by this one. The closest fit I could think of is a startup, especially if the product the startup offers is not MLOps itself. For example, if I’m a data scientist with a great idea for using AI to predict the weather in a new, cutting-edge way, I probably don’t want to deal with all the work of enabling MLOps myself. Given that a startup doesn’t have the “baggage” of a more established company, startups have much more flexibility in ensuring that a vendor product / cloud provider combo works for their needs. But this can also become a costly solution, so while I think a startup could still find value in this combination, they’d have to be pretty careful about setting things up correctly.

Combination #2: Vendor Product Solution in a Self-Managed Environment

Benefits: As mentioned before, many companies still manage their own server farms for compute needs, but many of these same companies are also new to AI/ML. Moreover, AI/ML might be more of a “side benefit” than the core value the product itself provides. (This will be clearer with an example in the “Who is this for” section.) A company like this might be interested in hiring a handful of data scientists while minimizing the talent actually required to enable MLOps. Given these factors, a vendor product might be the best solution to bridge the gap between using your own compute infrastructure and not having to maintain your own MLOps enablement team.

Challenges: With any vendor product comes the challenge of making sure your company’s systems can integrate properly with the tool. The challenge here isn’t as great as with the “vendor product on a cloud provider” option, but it still exists. The other challenge with any vendor product is that it might not do 100% of what you’re hoping it will do, and when it comes to MLOps in particular, I personally haven’t seen a vendor product that does 100% of what I would expect. As a result, you might have to purchase multiple vendor products to get everything you need, or you might have to build some smaller custom solution around the vendor product to make it work, as sketched below.
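
To illustrate that last point, here is a minimal sketch of the kind of “glue” code you might end up writing around a vendor product. The `vendor_mlops` module and its `register_model` call are hypothetical stand-ins for whatever SDK your vendor actually ships, and the custom audit record represents a gap the vendor tool is assumed (for illustration) not to cover:

```python
import hashlib
import json
from pathlib import Path

# import vendor_mlops  # hypothetical vendor SDK, not a real package


def register_with_audit_trail(model_path: str, model_name: str, audit_dir: str = "audits") -> str:
    """Register a model through the vendor tool, then write a custom audit
    record the vendor is assumed not to provide out of the box."""
    artifact = Path(model_path).read_bytes()
    checksum = hashlib.sha256(artifact).hexdigest()

    # version = vendor_mlops.Client().register_model(model_name, model_path)  # hypothetical call
    version = "v1"  # placeholder standing in for the hypothetical vendor response

    # The "smaller custom solution" part: our own audit trail alongside the vendor registration.
    Path(audit_dir).mkdir(exist_ok=True)
    record = {"model": model_name, "version": version, "sha256": checksum}
    Path(audit_dir, f"{model_name}-{version}.json").write_text(json.dumps(record, indent=2))
    return version
```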

Who is this for: The group that would likely benefit most from an option like this is a large company whose core product isn’t AI/ML but that already maintains its own compute platform. I’m imagining a large fast food corporation like McDonald’s here. (I have no insider knowledge of McDonald’s inner workings, so the following is pure speculation.) I would imagine that McDonald’s maintains its own server farm to support things like cash register operations across all its restaurants. At the end of the day, McDonald’s core product is food, and while AI/ML could help in some ways, I can’t imagine McDonald’s would want to hire a robust group of expert AI/ML specialists to maintain MLOps for them. (Again, I know nothing about McDonald’s in particular, so I could be very wrong about that.)

Combination #3: Open Source or Custom Solution on a Cloud Provider

Benefits: Let’s say you already have a great solution for MLOps on your own internal server platform, but your company has decided it would prefer to reduce its physical footprint going forward. Cloud providers offer great options to help you migrate your custom MLOps solution with relative ease. Moreover, the cloud providers themselves offer a number of machine learning services that a seasoned MLOps engineer can readily take advantage of.
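
As one tiny example of what a migration step might look like, here is a hedged sketch that copies locally stored model artifacts into AWS S3 with boto3. This assumes AWS as the target; the local path and bucket name are placeholders, and a real migration would of course involve far more than artifact storage:

```python
import boto3
from pathlib import Path


def migrate_artifacts(local_dir: str, bucket: str, prefix: str = "mlops-artifacts") -> None:
    """Copy locally stored model artifacts into S3 as one small step in a
    larger move from self-managed storage to a cloud provider."""
    s3 = boto3.client("s3")
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(local_dir)}"
            s3.upload_file(str(path), bucket, key)
            print(f"Uploaded {path} -> s3://{bucket}/{key}")


# Example usage (placeholder names, not real resources):
# migrate_artifacts("/mnt/mlops/models", "my-company-mlops-artifacts")
```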

Challenges: Just as it is difficult to integrate a vendor product onto self-managed infrastructure, it can be equally difficult to port a custom MLOps solution onto a cloud provider. An MLOps engineer may have had the flexibility to get very specific things enabled by the infrastructure engineers maintaining their own platform, whereas cloud providers simply aren’t as flexible to work with. This doesn’t mean cloud providers are totally rigid; it’s just not as easy as working with something that is totally malleable.

Who is this for: This option is definitely best for companies that have strong MLOps engineers but want to reduce their physical compute infrastructure. While maintaining your own infrastructure offers many benefits, there are also a lot of challenges associated with it. Setting aside the more obvious challenges of maintaining the servers themselves, there is also a more subtle challenge when it comes to pure accounting. Just as companies depreciate things like buildings or vehicles, many companies also depreciate their IT infrastructure components. This accounting structure alone can be a huge challenge: if you discover that your MLOps tooling requires more GPUs than your company has on hand, the accounting structure might keep you from obtaining them in a timely manner.

Combination #4: Open Source or Custom Solution in a Self-Managed Environment

Benefits: In the previous combinations, we touched on the challenges of working with rigid structures across vendor products and cloud providers. Because this option gives you full control over your MLOps systems, you have the freedom and flexibility to do whatever you want. In a world where your company has a strong MLOps engineering team and its own compute platform, this flexibility can be a dream.
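
To give a flavor of what “build whatever you want” can look like in practice, here is a minimal sketch of a homegrown, filesystem-backed model registry. Everything here is illustrative; a production version would obviously need things like authentication, concurrency handling, and durable storage:

```python
import json
import shutil
import time
from pathlib import Path


class FileModelRegistry:
    """A toy filesystem-backed model registry: the sort of small, fully
    custom component a strong MLOps team can shape however it likes."""

    def __init__(self, root: str = "registry"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def register(self, model_name: str, model_path: str, metadata: dict) -> str:
        """Copy a model artifact into the registry under a new version."""
        version = str(int(time.time()))  # timestamp versioning, for simplicity
        target = self.root / model_name / version
        target.mkdir(parents=True)
        shutil.copy(model_path, target / Path(model_path).name)
        (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
        return version

    def latest(self, model_name: str) -> Path:
        """Return the directory of the most recently registered version."""
        return max((self.root / model_name).iterdir())
```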

Challenges: Of course, this is sort of the “pie in the sky” combination. While this combination offers a lot of flexibility, it also puts all the responsibility on your own folks. This means you have to have a strong group of MLOps engineers and an equally strong compute infrastructure. From a human resources standpoint alone, this naturally means you will need to hire the strongest talent, which can obviously get very expensive.

Who is this for: Naturally, most companies aren’t going to be able to take advantage of this option, but there is one niche group that can (and does) definitely benefit from it: AI powerhouses. I’m specifically referring to companies like Apple, Google, or Amazon. These AI powerhouses are already hiring the best and brightest, and because they maintain their own server ecosystems, it naturally makes the most sense to have their smart engineers build a custom solution on that self-managed infrastructure.
