Why You Don’t Want to Use Your Data Warehouse as a Feature Store
At Tecton, we help ML teams deal with the data challenges of production machine learning. Specifically, we’ve seen many teams struggle to deploy the infrastructure and manage the data pipelines needed to produce and serve up-to-date, reliable model inputs (or features) with the right throughput and latency for real-time decisioning use cases such as fraud, risk, and personalization.
To help deal with these challenges, we’ve seen teams try to build feature stores in-house or buy a solution. However, we’ve also come across a common misconception: that feature stores and feature platforms are simply databases of features for ML. As a result, many teams start their feature store journey by attempting to centralize all of their ML features in a set of very wide relational tables in a data warehouse.
While this solution might seem appealing for the first few ML models, teams often run into limitations as they look to drive broader adoption across the enterprise. If you are embarking on your feature store journey, and are wondering where you should start and what additional limitations you might run into if using a data warehouse as a feature store, this post is for you.
Limitations of a data warehouse
1. Breadth of supported ML use cases
When building ML tooling, you want to design a future-proof solution that is flexible enough to accommodate different types of use cases. This ensures you’ll avoid building bespoke one-off systems per use case (and avoid increasing tech debt). One of the common limitations of using a data warehouse as a feature store is the inability to support an entire category of use cases: real-time ML. In contrast, feature platforms are meant to be reusable across all use cases.
ML use cases typically fall into two categories: batch and real-time. In batch models, batches of predictions are delivered on a schedule, while in real-time models, individual predictions are delivered to an application within tens or hundreds of milliseconds of a user action. Real-time ML is seeing rapid adoption, and if you are not doing real-time ML today, chances are you will need to sooner rather than later.
2. Real-time feature serving for ML
Most data warehouses are designed to run low-concurrency, high-throughput analytical workloads that are not latency sensitive. In real-time machine learning, inputs (features) to the models need to be looked up from storage and fed to a model within a very constrained latency budget (usually in the low millisecond range). Adding to the complexity, because most real-time ML models power customer-facing applications like recommender systems or fraud detection, the volume of concurrent queries to the model is usually high, often in the thousands of queries per second (QPS) or more.
This means that your ML feature storage and serving layer needs to serve features at low latency under high concurrency. As mentioned above, data warehouses don’t have good support for these workloads. Right out of the gate, your data warehouse-centric feature store is limited, and you’ll either need to compromise on latency or add an online store on top of your warehouse. Congratulations, you’ve unlocked a new level of complexity! You now need to maintain a low-latency key/value store and manage dozens of ingestion pipelines that copy the most up-to-date feature values into it.
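To make the serving side concrete, here is a minimal sketch of what the online lookup path can look like once a key/value store sits in front of the warehouse. It assumes Redis (via redis-py) as the online store and a hypothetical features:<user_id> hash layout; the key scheme, feature names, and store choice are illustrative, not a prescribed design.

```python
# Minimal sketch: serving precomputed features from an online key/value store at request time.
# Assumes Redis as the online store and a hypothetical "features:<user_id>" hash per user.
import json
import redis

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_online_features(user_id: str) -> dict:
    """Look up the latest feature values for one user inside the request path."""
    raw = client.hgetall(f"features:{user_id}")
    # Values are stored as JSON strings; parse them back into numbers for the model.
    return {name: json.loads(value) for name, value in raw.items()}

# At inference time the application fetches features and hands them to the model:
# features = get_online_features("user_123")
# prediction = model.predict([[features["txn_count_10m"], features["avg_amount_7d"]]])
```

Keeping this store populated is the hard part: every feature table in the warehouse now needs a pipeline that continuously copies its freshest values into the key/value store.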
3. Real-time feature engineering for ML
In addition to making predictions at low latency, many applications of real-time ML models rely on input features that are updated in near real time (within a few hundred milliseconds) or calculated in real time.
For example, a great feature for a real-time fraud detection ML model might be the number of transactions by the user in the past 10 minutes. In order to produce this feature in production, you’ll need to continuously compute and refresh feature values based on incoming events. This usually involves reading from a streaming data source like Kafka or AWS Kinesis.
Most data warehouses support streaming ingestion but not streaming data pipelines, which forces you to build your own streaming feature engineering infrastructure (e.g., on Spark or Flink) and push computed values to the warehouse, along the lines of the sketch below. This added complexity often means your data scientists end up trading off model performance for ease of implementation, and your business misses out on significant model performance, as these streaming features typically hold a lot of predictive power.
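For illustration, the sketch below shows the kind of streaming job you would have to own yourself: a Spark Structured Streaming pipeline that maintains a per-user transaction count over a sliding 10-minute window from a Kafka topic. The topic name, event schema, and console sink are assumptions made for the example, not a recommended setup.

```python
# Rough sketch of a self-managed streaming feature pipeline with Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("txn_count_10m").getOrCreate()

# Read raw transaction events from Kafka (topic name and schema are assumptions).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", "user_id STRING, amount DOUBLE, ts TIMESTAMP").alias("e"))
    .select("e.*")
)

# Count each user's transactions over a sliding 10-minute window, updated as events arrive.
txn_count_10m = (
    events.withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "10 minutes", "1 minute"), "user_id")
    .agg(F.count("*").alias("txn_count_10m"))
)

# Each micro-batch then has to be written to the warehouse and/or the online store;
# a console sink stands in for that step here.
query = txn_count_10m.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```

You now own this cluster, its checkpointing, and its failure modes, on top of the warehouse itself.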
A good feature platform, on the other hand, should have a built-in online store with automatic synchronization of feature values from the offline store to the online store, as well as a managed streaming feature engineering engine like Tecton’s Stream Ingest API. These capabilities ensure that your features are fresh and can be served within the right latency for real-time ML.
Optimizing the end user experience
The value of a feature platform usually scales with its adoption and increased usage by its primary consumers: data science and ML teams. The more curated features you have in your feature platform, the more features can be reused instead of re-developed and the faster ML models can make it to production.
With that in mind, you should pay particular attention to making the feature authoring, registration, and consumption experience as seamless as possible for end users of the solution.
1. Flexible feature authoring
Feature platforms offer flexible, declarative feature engineering frameworks that let you define features in a variety of languages and compute engines (Python, pandas, in-warehouse SQL, PySpark, Spark SQL).
In contrast, most in-house solutions built on a data warehouse only support warehouse-native feature engineering code in SQL (or Python APIs in some cases), with tools like dbt and Airflow for data management and orchestration.
While a good number of feature transformations, like window aggregations, can be defined in SQL, more advanced logic like embedding creation is poorly supported by data warehouses. This means that feature authoring will most likely be reserved for data engineers who are familiar with your ETL stack.
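As an example of logic that does not fit neatly into warehouse SQL, here is a hedged sketch of an embedding feature written in plain Python. It assumes product descriptions arrive in a pandas DataFrame and uses the sentence-transformers library; the column names and model choice are illustrative.

```python
# Sketch: a feature transformation that is awkward to express in SQL alone.
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def product_embedding_features(products: pd.DataFrame) -> pd.DataFrame:
    """Map each product_id to a dense embedding of its free-text description."""
    vectors = model.encode(products["description"].tolist())
    return pd.DataFrame(
        {
            "product_id": products["product_id"],
            "description_embedding": list(vectors),
        }
    )
```

A feature platform that accepts arbitrary Python (or PySpark) transformations lets data scientists ship features like this without waiting on a data engineer to port them into the ETL stack.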
2. Time travel & backfills
One of the key requirements of any feature platform is its ability to do time travel for feature values. This is especially important when training ML models on historical training observations where you’ll need to compute feature values based on the state of the world at the time of each observation. In order for this to be possible, any new feature that is pushed to the feature store should be backfilled with historical data.
In a feature platform like Tecton, this comes out of the box and is fully automated: the feature author defines a feature start time, and the platform automatically orchestrates the jobs that backfill the feature store.
In a custom solution, the end user would have to set up their own backfill job, which often means building a separate Airflow DAG and running a very large job in the data warehouse.
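As a rough illustration of that custom path (not how Tecton does it), a hand-rolled backfill might look like the Airflow DAG below. The table names and SQL are hypothetical, and the warehouse call is stubbed out.

```python
# Sketch: a one-off Airflow DAG that backfills a feature table in the warehouse.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

BACKFILL_SQL = """
INSERT INTO feature_store.user_txn_counts
SELECT user_id,
       DATE_TRUNC('day', event_ts) AS feature_date,
       COUNT(*)                    AS txn_count_1d
FROM raw.transactions
WHERE event_ts >= '{start}'        -- feature start time chosen by the author
GROUP BY 1, 2
"""

def run_backfill(**context):
    # A real pipeline would submit this query to the warehouse via its connector
    # or an Airflow provider hook; printing it stands in for that here.
    print(BACKFILL_SQL.format(start="2020-01-01"))

with DAG(
    dag_id="user_txn_counts_backfill",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually, once per new feature
    catchup=False,
) as dag:
    PythonOperator(task_id="backfill_feature_table", python_callable=run_backfill)
```

Multiply this by every new feature, and the orchestration burden adds up quickly.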
3. Read APIs & point-in-time joins
Once features are defined and pushed to your feature store, reading feature data into a notebook to train a model should be made dead simple for your data scientists.
Imagine a situation where you have dozens of feature tables in your data warehouse and a data scientist wants to use them to generate a training dataset based on historical labeled data they curated. Each table might have a different set of join keys and a different update cadence. Your data scientist will most likely have to write a very complex SQL query to join all the feature tables to their labeled data in a point-in-time accurate way. This is not only extremely time-consuming but also very error-prone, which means there is a good chance you’ll introduce data leakage or bias into your ML model.
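To make the point-in-time requirement concrete, here is a small sketch of the logic that has to be applied, per feature table, when doing it by hand. It uses pandas merge_asof and hypothetical column names (user_id, event_ts, feature_ts).

```python
# Sketch: a point-in-time join of one feature table onto labeled events with pandas.
import pandas as pd

def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """For each labeled event, take the most recent feature value computed before it."""
    labels = labels.sort_values("event_ts")
    features = features.sort_values("feature_ts")
    return pd.merge_asof(
        labels,
        features,
        left_on="event_ts",
        right_on="feature_ts",
        by="user_id",
        direction="backward",  # never look at feature values from after the label
    )
```

Repeating this correctly across dozens of feature tables, each with its own keys and update cadence, is exactly where hand-written SQL tends to go wrong.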
These are all table stakes for feature platforms. A good feature platform should come with easy-to-use, robust Python SDKs that allow data scientists to generate point-in-time accurate training datasets with a single line of code, in a compute-efficient way. (See requesting training data in Tecton.)
As you can see, when compared to using your data warehouse as a feature store, feature platforms abstract the complexity away from the end users and reduce the friction for ML teams to author, deploy, and read features. This ensures wider adoption and quicker iterations on ML models.
Maintainability & engineering overhead
The above is not an exhaustive list of the differences between data warehouses and feature platforms. But it should be clear that while a data warehouse can serve as a foundation for your feature platform (e.g., as a data source, compute engine, or offline feature store), significant engineering work and additional tooling must be stitched together around it to deliver the outcomes an enterprise feature platform provides out of the box.
Assembling all the building blocks (feature engineering, serving, cataloging, drift monitoring, governance, etc.) yourself means engineers writing custom glue code to make disparate systems communicate, creating technical debt and a behemoth that takes significant resources to maintain. We’ve seen this happen with many teams before they decide to evaluate and buy a third-party feature store or feature platform.
Effectively, this engineering overhead takes focus and resources away from the core mission of your ML team: building good ML features and accurate ML models. For example, Atlassian, which moved from an in-house feature platform to an enterprise feature platform, was able to free up 3 FTEs who were previously dedicated to maintaining the in-house platform.
Key takeaways
Feature platforms are a great way to speed up the development and deployment of quality ML features for production applications. While they are commonly compared to data warehouses, the nature and requirements of the workloads and use cases they support make them fundamentally different and much more effective.
Teams that want to use their data warehouse as a feature store often spend months (if not years) building additional components to work around the limitations of data warehouses for production ML. They usually end up with a lot of technical debt and very little value generated. When they move to end-to-end managed feature platforms like Tecton, they experience:
- Lowered complexity of real-time ML infrastructure
- Shorter time to value on new features
- Fewer FTEs required to maintain the platform
- Less reliance on data engineers to productionize features
- Better governance