Why Feature Stores Should Extend, Not Replace, Existing Data Infrastructure
In February, Tecton hosted apply(meetup), a virtual meetup focused on all things production ML engineering. We heard from many smart folks working on cool ML engineering projects across the industry, and there’s no doubt we’ll be talking about the great ideas from these talks for months to come. If you’re interested in listening firsthand, all of the talks are recorded and freely available; you can check them out here. And if you want to join us for free at our main event of the year, apply(conf) is happening on May 18-19 with a lineup of incredible speakers. You can register for that here.
During apply(meetup), Ben Wilson, from Databricks, gave a lightning talk that was particularly interesting on a few levels. On the surface, this was a talk about the basic principles and value of feature stores, but Ben also made a much deeper point about specialization vs centralization that I believe gets at the bigger system design tradeoffs that ML teams should be thinking about today.
Overall narrative of the talk
It went like this:
1. ML projects today tend to be built by data scientists in isolated stacks, on custom infrastructure.
2. In reality, lots of dependencies cross the boundaries of that isolation:
    a. ML projects consume data from the rest of the business.
    b. The rest of the business always ends up needing data that was created in the ML project for some reason (e.g. to investigate the cost or impact of an ML pipeline).
    c. ML teams often want to reuse parts of pipelines across projects.
3. It’s really hard to do 2a, 2b, and 2c when ML applications live in completely isolated software stacks (“islands”).
4. Feature stores solve this because they:
    a. Provide APIs to conveniently author, register, and reuse feature logic, and make it easy to generate ML datasets for training or production serving (see the sketch after this list).
    b. Provide a central point through which other systems (e.g. explainability tools, BI tools) can interact with ML data.
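To make 4a concrete, here’s a minimal sketch of those two consumption paths using Feast’s open-source Python SDK (Feast comes up again below). The feature view name (user_features), feature name (purchase_count_7d), and entity key (user_id) are hypothetical, and the snippet assumes a feature repo has already been defined and applied at repo_path.

```python
import pandas as pd
from feast import FeatureStore

# Assumes a Feast feature repo (feature_store.yaml plus feature
# definitions) already exists at repo_path; all names are hypothetical.
store = FeatureStore(repo_path=".")

# Training: point-in-time-correct join of features onto labeled events.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2022-05-01", "2022-05-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:purchase_count_7d"],
).to_df()

# Serving: low-latency lookup of the same feature from the online store.
online_features = store.get_online_features(
    features=["user_features:purchase_count_7d"],
    entity_rows=[{"user_id": 1001}],
).to_dict()
```

The same feature reference drives both paths, which is what keeps training and serving consistent.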
That’s kind of the standard “why you need a feature store” conversation. However, I want to emphasize the point Ben made about centralization: feature stores can most effectively centralize ML data and make it available (4b above) when they integrate deeply with (i.e. are “built on top of”) the underlying data platforms. I 100% agree.
Augmenting existing infrastructure
A ton of tools are built for data platforms whose scope extends beyond ML, and we shouldn’t be rebuilding ML-specific versions of every single tool in the data ecosystem. When ML applications are built on completely isolated “island” stacks, those projects are unintentionally doomed to remain forever incompatible with the broader, flourishing data ecosystem. That’s a terrible place to be.
The counterargument is typically: “But ML has special requirements! It’s special!” Yes, ML has some unique requirements, but it also has plenty of non-unique ones. The optimal ML data infrastructure provides:
- Excellent ML-specific workflows/APIs to make data scientists super efficient and meet ML-specific needs
- Deep integrations and connections with the rest of the data ecosystem to make data for ML just as useful and accessible as any other data in the business
These goals inform lots of system design trade-offs: where to store the data, how to store it, what types of transformations to support, what editing workflows to support, what types of consumption APIs to support, etc.
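To make that trade-off concrete, here’s a deliberately simplified sketch of the “extend, don’t replace” approach: a feature is registered as a named SQL transformation and executed where the data already lives, so the result stays reachable by any tool that can already query that platform. This is not any particular product’s API; sqlite3 stands in for the warehouse, and all table and feature names are illustrative.

```python
import sqlite3

# sqlite3 stands in for the existing warehouse; every name here is
# illustrative, not a real feature store API.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE purchases (user_id INTEGER, amount REAL, ts TEXT);
    INSERT INTO purchases VALUES
        (1, 19.99, '2022-05-01'), (1, 5.00, '2022-05-03'),
        (2, 42.00, '2022-05-02');
""")

# The "feature view" is a named SQL transformation over warehouse
# tables, not a copy of the data into an ML-only silo.
FEATURE_VIEWS = {
    "purchase_count_7d": """
        SELECT user_id, COUNT(*) AS purchase_count_7d
        FROM purchases
        WHERE ts >= date('2022-05-04', '-7 day')
        GROUP BY user_id
    """,
}

def get_feature(view: str, user_id: int):
    """Resolve a feature by pushing its SQL down to where the data lives."""
    row = warehouse.execute(
        f"SELECT {view} FROM ({FEATURE_VIEWS[view]}) WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else None

print(get_feature("purchase_count_7d", 1))  # -> 2
```

Because the feature is just SQL on the shared platform, a BI tool or a pipeline cost analysis can query the exact same definition, which is the centralization point from 4b.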
This idea of centralization is a core design principle for Tecton and Feast. We want to help you maximize the use of your existing platform and infrastructure while supporting the best workflows for AI/ML developers. We’ve already made good progress here, including our recently launched integrations with Snowflake and Redis, and we have some huge plans for the rest of this year.
We’re hosting apply(conf) on May 18-19, our main event of the year. If you’re building applied machine learning applications, I guarantee you’ll learn from the fascinating work the speakers are doing in the ML data engineering space. You can see the full speaker lineup and the agenda here, and sign up to attend for free here.