Production ML: 6 Key Challenges & Insights—an MLOps Roundtable Discussion
Navigating the journey from a promising ML concept to a robust, production-ready application is filled with challenges. Teams need to establish efficient data pipelines, understand and attribute their costs, and design organizational processes that support their ability to execute quickly. Whether streamlining existing processes or embarking on new ML projects, every team will run into these challenges as they move to deploy ML models into production.
To dive into these issues, Demetrios Brinkmann (Founder of MLOps Community) recently hosted a roundtable discussion with ML experts from Tecton, including Kevin Stumpf (CTO), Isaac Cameron (Consulting Architect), Eddie Esquivel (Solutions Architect), and Derek Salama (Product Manager).
In this post, we’ll summarize 6 key takeaways from the roundtable discussion. Want to watch the video instead? You can watch the full discussion here.
1. Productionizing ML models still poses a major challenge for organizations
Bringing ML models into production requires coordinating data, tools, and teams. First, you need to collect the right data in the right format, make it available to the right people, and give them the right tools to model it.
However, even after this data is in place, collaborating across teams can be challenging. There’s often a disconnect between data scientists and ML engineers—and the wider this gap, the harder it is for engineers to productionize models or features developed by the data science team. Complex scenarios, like those involving real-time processing requirements or multiple upstream dependencies, can introduce even more obstacles.
2. Without good cost attribution, calculating ROI is almost impossible
Demonstrating the return on investment of ML is crucial to ensure that leadership continues to invest in ML initiatives. Although the ROI of some results (like user delight) may be inherently hard to quantify, many ML projects are more easily broken down into costs and benefits.
Demystifying the cost side of the ROI equation is a common stumbling block for teams. Hidden infrastructure costs surrounding databases, ETL processes, and other compute resources make it hard to get a comprehensive picture of the total cost for a given feature. The architecture of your system can also affect your ability to break down costs; for instance, if one pipeline is responsible for running multiple features, it can be harder to get a granular view of costs for each individual feature.
To begin improving your cost attribution, consider a solution like AWS Cost Explorer or Tecton. Some organizations have even built lightweight pipelines to unify their cost attribution data, making it easier to understand and report on ROI.
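As a rough illustration of the shared-pipeline problem described above, the sketch below splits each pipeline's bill across the features it computes, in proportion to the compute seconds each feature used. The function name, the pipeline and feature names, the dollar amounts, and the choice of compute seconds as the allocation key are all assumptions made for this example; they do not come from the roundtable or from any particular tool.

```python
from collections import defaultdict


def attribute_costs(pipeline_costs, usage):
    """Split each pipeline's cost across features by share of compute seconds.

    pipeline_costs: {pipeline_name: total_cost_in_dollars}
    usage: list of (pipeline_name, feature_name, compute_seconds) tuples
    """
    # Total compute seconds per pipeline: the denominator for each share.
    totals = defaultdict(float)
    for pipeline, _feature, seconds in usage:
        totals[pipeline] += seconds

    per_feature = defaultdict(float)
    for pipeline, feature, seconds in usage:
        if totals[pipeline] == 0:
            continue  # no recorded usage, so there is nothing to split on
        share = seconds / totals[pipeline]
        per_feature[feature] += pipeline_costs[pipeline] * share
    return dict(per_feature)


# Hypothetical numbers, for illustration only.
pipeline_costs = {"nightly_batch": 300.0, "clickstream": 120.0}
usage = [
    ("nightly_batch", "user_30d_spend", 600.0),
    ("nightly_batch", "merchant_risk_score", 1800.0),
    ("clickstream", "user_30d_spend", 100.0),
    ("clickstream", "session_click_count", 300.0),
]
costs = attribute_costs(pipeline_costs, usage)
```

Proportional allocation is only one possible policy. Teams that lack per-feature usage metrics may need to start with a coarser key, such as rows processed or an even split.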
3. As orgs grow larger, their challenges deploying ML into production grow, too
Smaller teams are generally more nimble and are able to launch ML use cases into production faster. Larger teams have many more requirements and considerations, like uptime, privacy, and security. As more requirements are added, system complexity increases and can extend the amount of time it takes to deploy a use case into production.
Additionally, larger organizations often have a broad and diverse set of use cases crammed into one system. Although ML models for recommendations, fraud, and pricing are distinct use cases built by different teams, many large organizations want to unify them in one place, which can become unwieldy to build and maintain.
4. Set your data science team up for success
Before you can launch ML into production, you need reliable and accessible training data. The first step is ensuring a clean, time-stamped record of data in a data warehouse or data lake. This historical data needs to be accessible to data scientists as they train their models, and it’s generally best practice to keep a separate copy of historical data rather than having data scientists query the online system or database directly. If your team doesn’t currently have this separation, consider using something like Fivetran to snapshot your online database into your data lake or data warehouse.
Once this data is available and accessible to your data science team, ensure they have an environment (using tools like Jupyter, Hex, or Deepnote) where they can get started experimenting and training their models.
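One reason the time stamps on historical data matter is that they let you build training rows using only the feature values that were known at the time of each labeled event. The sketch below shows that idea with a simple "latest snapshot at or before the event" lookup. It is a minimal illustration under stated assumptions: the feature name, the snapshot values, the labels, and the helper function are all hypothetical, and a real warehouse would typically express this as a join rather than in-memory Python.

```python
from bisect import bisect_right
from datetime import date


def point_in_time_value(snapshots, as_of):
    """Return the latest snapshot value taken at or before `as_of`, else None.

    snapshots: list of (snapshot_date, value) tuples sorted by snapshot_date.
    """
    dates = [d for d, _ in snapshots]
    idx = bisect_right(dates, as_of)  # number of snapshots dated <= as_of
    return snapshots[idx - 1][1] if idx > 0 else None


# Hypothetical warehouse snapshots of one feature, keyed by user_id.
txn_count_30d = {
    1: [(date(2024, 1, 1), 3), (date(2024, 1, 15), 9)],
    2: [(date(2024, 1, 1), 5), (date(2024, 1, 12), 7)],
}

# Hypothetical labeled events: (user_id, event_date, is_fraud).
labels = [
    (1, date(2024, 1, 5), 0),
    (2, date(2024, 1, 10), 0),
    (1, date(2024, 1, 20), 1),
]

# Each training row only sees the snapshot that existed when the event happened.
training_rows = [
    {
        "user_id": user_id,
        "event_date": event_date,
        "txn_count_30d": point_in_time_value(
            txn_count_30d.get(user_id, []), event_date
        ),
        "is_fraud": is_fraud,
    }
    for user_id, event_date, is_fraud in labels
]
```

Without time-stamped snapshots, the only value available would be the current one, which can leak information from after the event into the training set.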
5. Juggling different data processing strategies is tricky
Batch processing—generally the most straightforward data processing method to implement—considers large chunks of data at a time. While batch processing is great for some needs (such as lead scoring for sales or churn prediction), it falls short for use cases where real-time ML output is needed.
For instance, if your team is building real-time fraud detection or recommendation systems (situations where the model inputs are only known at inference time), batch processing won’t cut it, and you may need to use streaming infrastructure like Confluent/Kafka or Amazon Kinesis. Pairing these tools with Tecton can help abstract away some of the complexity of managing this infrastructure, especially if your system uses both batch and streaming data processes together.
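To make the batch versus streaming distinction concrete, the sketch below computes the same feature, a per-user event count over the last 60 seconds, in both styles. It uses plain Python as a stand-in for real infrastructure and does not use the Kafka, Kinesis, or Tecton APIs. The window size, event data, and class and function names are assumptions for illustration, and the streaming version assumes events arrive in timestamp order.

```python
from collections import deque

WINDOW_SECONDS = 60


def batch_counts(events, as_of):
    """Batch style: scan the full history once and count recent events per user."""
    counts = {}
    for user_id, ts in events:
        if as_of - WINDOW_SECONDS < ts <= as_of:
            counts[user_id] = counts.get(user_id, 0) + 1
    return counts


class StreamingCounter:
    """Streaming style: update state per event so the value is fresh on arrival."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._events = {}  # user_id -> deque of timestamps still inside the window

    def update(self, user_id, ts):
        """Record one event and return the user's current windowed count."""
        window = self._events.setdefault(user_id, deque())
        window.append(ts)
        # Evict timestamps that have fallen out of the window.
        while window and window[0] <= ts - self.window_seconds:
            window.popleft()
        return len(window)


# Hypothetical (user_id, unix_seconds) events, in arrival order.
events = [("a", 0), ("a", 30), ("b", 40), ("a", 70), ("a", 80)]

counter = StreamingCounter()
streamed = [counter.update(user_id, ts) for user_id, ts in events]
batched = batch_counts(events, as_of=80)
```

Both paths agree on the final value for each user, but only the streaming path has an up-to-date answer at the moment each event arrives. Keeping two implementations of one feature consistent is part of what makes mixed batch and streaming systems tricky.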
6. Future-proof your systems & processes to mitigate the pain of scaling ML
Successfully launching your first ML pipeline is just the beginning. The next step is creating effective processes to scale your team’s ability to deliver new pipelines.
The most common pitfall is not spending sufficient time in the design and requirements-gathering phase. To properly design data pipelines and models, teams should look beyond just the first use case and make educated guesses about future needs. By understanding how the system might evolve, teams are able to make better, more future-proof design decisions.
Creating overly complex systems with sprawling architecture diagrams is another common trap. Every new component can interact with each component already in place, so complexity grows faster than the component count. Teams should strive for simplicity and weigh the tradeoffs of introducing additional moving parts.
When designing your technical systems and organizational processes, another key consideration is feature and pipeline sharing. Ideally, features are reused for similar use cases to save development time and maintenance cost, but often, different teams end up recreating features or pipelines for their own use. Implementing strategies like feature governance or appointing a “data steward” for ML can promote sharing. Remember that building trust across teams is critical for encouraging sharing; if teams know that a feature owned by someone else won’t be changed without notice, they’re more likely to feel secure using shared features rather than building their own redundant version.
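One way to picture the "won't be changed without notice" guarantee is a registry in which each feature has a named owner and every change is published as a new version, so consumers can pin the version they depend on. The sketch below is a hypothetical in-memory illustration of that idea, not the API of Tecton or any other product; the class names, fields, and team names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    owner: str
    version: int
    description: str


class FeatureRegistry:
    """Hypothetical in-memory registry with owners and append-only versions."""

    def __init__(self):
        self._features = {}  # name -> list of FeatureDefinition, oldest first

    def register(self, definition):
        """Add a new feature, or the next version of an existing one."""
        versions = self._features.setdefault(definition.name, [])
        if versions:
            latest = versions[-1]
            # Only the current owner may publish changes to a shared feature.
            if definition.owner != latest.owner:
                raise PermissionError(
                    f"{definition.name} is owned by {latest.owner}"
                )
            # Existing versions are never overwritten; changes get a new number.
            if definition.version != latest.version + 1:
                raise ValueError("changes must be published as the next version")
        versions.append(definition)

    def get(self, name, version=None):
        """Return the pinned version if given, otherwise the latest one."""
        versions = self._features[name]
        if version is None:
            return versions[-1]
        return next(v for v in versions if v.version == version)


registry = FeatureRegistry()
registry.register(
    FeatureDefinition("user_30d_spend", "payments-team", 1, "Sum of spend over 30 days")
)
registry.register(
    FeatureDefinition("user_30d_spend", "payments-team", 2, "Same, excluding refunds")
)
pinned = registry.get("user_30d_spend", version=1)
```

A team that pinned version 1 keeps getting the original definition even after the owner publishes version 2, which is the kind of predictability that makes reuse feel safe.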
Key takeaways
Deploying ML to production can be difficult, but being aware of the best practices and potential pitfalls helps ensure teams are well-equipped to tackle the challenge. By understanding the nuances of cost attribution, data processing strategies, and the concerns that come with scale, organizations can navigate these complexities more effectively. Tecton can help your organization with these challenges by allowing you to easily deploy pipelines, manage features, and wrangle costs. To get started, schedule a demo or sign up for free.