Real-Time Aggregation Features for Machine Learning (Part 1)
Introduction
Machine Learning features are derived from an organization’s raw data and provide a signal to an ML model. A very common type of feature transformation is a rolling time window aggregation. For example, you may use the rolling 30-minute transaction count of a credit card to predict the likelihood that a given transaction is fraudulent.
It’s easy enough to calculate rolling time window aggregations offline using window functions in a SQL query against your favorite data warehouse. However, serving this type of feature for real-time predictions in production poses a difficult problem: How can you efficiently serve a feature that aggregates thousands of raw events or more, at high scale (thousands of QPS), at low serving latency (well under 100 ms), at high freshness (well under 1 s), and with high feature accuracy (e.g. a guaranteed rather than approximate time window length)?
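For concreteness, the offline version of the 30-minute transaction count might look roughly like the following PySpark sketch. The transactions table and its card_id and event_ts columns are hypothetical:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table of raw card transactions with card_id, event_ts, and amount columns.
transactions = spark.table("transactions")

# Rolling 30-minute window per card, ending at each transaction's own timestamp.
window_30m = (
    Window.partitionBy("card_id")
    .orderBy(F.col("event_ts").cast("long"))  # seconds since epoch
    .rangeBetween(-30 * 60, 0)                # the preceding 30 minutes
)

features = transactions.withColumn("txn_count_30m", F.count("*").over(window_30m))
```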
These types of features are incredibly important for countless real-time ML applications whose predictive power is very sensitive to a user’s behavior up until the very moment of the prediction. Common examples where every event counts are real-time recommendation, fraud detection, spam detection, personalization, and pricing predictions.
In this two-part blog, we’ll discuss the most common technical challenges of this feature type, and a battle-tested architecture employed by both Airbnb and Tecton to solve them. (Footnote: If any of the feature characteristics and guarantees mentioned above can be relaxed, a different solution may be a better fit.)
Technical Challenges
A naive approach to the problem posed above is to simply query a transactional database (like MySQL) in production every time a real-time prediction is made:
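For illustration, here is a minimal sketch of that lookup, assuming a mysql-connector-python client and a hypothetical transactions table (connection details and column names are made up):

```python
import mysql.connector

# Hypothetical connection details.
conn = mysql.connector.connect(
    host="db.internal", database="payments", user="svc_fraud", password="***"
)

def txn_count_30m(card_id: str) -> int:
    """Compute the card's rolling 30-minute transaction count at request time."""
    query = """
        SELECT COUNT(*)
        FROM transactions
        WHERE card_id = %s
          AND event_ts >= NOW() - INTERVAL 30 MINUTE
    """
    cursor = conn.cursor()
    cursor.execute(query, (card_id,))
    (count,) = cursor.fetchone()
    cursor.close()
    return count
```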
This works fine at a small scale and is a good way to start if you already have the raw events in your database. However, at high scale, or once your aggregation depends on a large number of raw events, serving latencies will start to spike and your MySQL database will eventually keel over.
A common next step to scale the architecture above is to precompute aggregations in real time as new raw data becomes available, and to make the features available in a scalable KV-store that’s optimized for low-latency serving (like Dynamo or Redis):
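As one hedged sketch of what this could look like, the example below maintains an exact rolling 30-minute count per card in a Redis sorted set: the stream consumer records each event’s timestamp, and the serving path trims expired entries before counting. Key names and the event schema are made up, and this is just one of several possible designs:

```python
import time
import redis

r = redis.Redis(host="features.internal", port=6379)

WINDOW_SECONDS = 30 * 60

def on_transaction(card_id: str, event_id: str, event_ts: float) -> None:
    """Write path: called by the stream consumer for every new transaction event."""
    key = f"txn_ts:{card_id}"
    # Store the event id in a sorted set, scored by the event timestamp.
    r.zadd(key, {event_id: event_ts})
    # Let the whole key expire once the card has been idle for longer than the window.
    r.expire(key, WINDOW_SECONDS)

def txn_count_30m(card_id: str) -> int:
    """Read path: called on the serving path at prediction time."""
    key = f"txn_ts:{card_id}"
    # Drop entries that have fallen out of the 30-minute window, then count the rest.
    r.zremrangebyscore(key, 0, time.time() - WINDOW_SECONDS)
    return r.zcard(key)
```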
This comes with a whole set of technical challenges, some of which we’ll explore in the next few sections.
Memory Constraints of Long-Running Time Window Aggregations
Frequently, teams employ Apache Spark or Flink to run streaming time window aggregations. The memory requirement of your Spark or Flink job is a function of the time window size as well as the event density of your stream.
Imagine a credit card processor’s fraud detection application that uses a user’s 4-week transaction count as an important feature. The stream processing job now needs to fit all transactions that happen within a 4-week period into memory.
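A back-of-the-envelope calculation shows how quickly that state grows; the event rate and per-event size below are made-up numbers for illustration:

```python
# Hypothetical workload: 2,000 transactions per second, ~1 KB of state kept per event.
events_per_second = 2_000
bytes_per_event = 1_000
window_seconds = 4 * 7 * 24 * 60 * 60  # 4 weeks

state_bytes = events_per_second * bytes_per_event * window_seconds
print(f"~{state_bytes / 1e12:.1f} TB of window state")  # roughly 4.8 TB
```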
For large data volumes, this will quickly lead to out-of-memory errors if the job isn’t configured properly. As a fix, teams often resort to reducing the time window, reducing the event body size, partitioning the streaming job, or using a state store like RocksDB that can flush state to disk or cold storage.
Backfilling Challenges
Once you’ve fine-tuned your streaming job to run reliably, you may realize that your streaming service of choice doesn’t retain enough historical data to backfill a new feature. For instance, AWS Kinesis holds only a maximum of 2 weeks’ worth of data. If you now want a 6-week time window aggregation, your streaming aggregation will serve inaccurate feature values until it has been running for 4 weeks.
In these instances, companies typically resort to limiting their streaming time windows to the retention period of their streaming infrastructure, often leaving important signals on the table. Others hand-engineer a batch pipeline that precomputes the feature from a historical batch source and try to combine it with values that are forward-filled by a streaming pipeline. Yet others switch to streaming infrastructure like Kafka, whose retention period is more configurable, and accept that processing large volumes of historical data from Kafka takes much longer than running a batch job against a data store that’s optimized for large-scale batch processing (e.g. a data warehouse or data lake).
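To make that hybrid approach concrete, the stitching logic for a 6-week count usually boils down to something like the sketch below, where a batch-precomputed count covers the part of the window that predates the stream’s retention and the streaming count covers the rest. Both helper functions are stand-ins, not real APIs:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(weeks=6)
STREAM_START = datetime(2021, 1, 1)  # hypothetical: earliest timestamp the streaming job covers

def batch_count(card_id: str, start: datetime, end: datetime) -> int:
    """Stand-in for a lookup against a batch-precomputed table (e.g. in the warehouse)."""
    return 0  # placeholder

def stream_count(card_id: str, start: datetime, end: datetime) -> int:
    """Stand-in for a lookup against the state maintained by the streaming job."""
    return 0  # placeholder

def txn_count_6w(card_id: str, as_of: datetime) -> int:
    window_start = as_of - WINDOW
    if window_start >= STREAM_START:
        # The whole window fits inside the stream's retention; no batch portion needed.
        return stream_count(card_id, window_start, as_of)
    # Otherwise stitch the batch-covered prefix and the stream-covered suffix together.
    return (
        batch_count(card_id, window_start, STREAM_START)
        + stream_count(card_id, STREAM_START, as_of)
    )
```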
Maintain High Feature Freshness
Let’s assume you’ve solved the backfill and memory challenges of your streaming time window aggregation. Another common challenge is that many ML applications depend on very fresh features (< 1 s). (Footnote: This isn’t always true; for some ML applications, fresher features quickly hit diminishing or no returns.) Ideally, as you fetch feature values at prediction time, you want to take into account every single event that has happened right up until the moment of the prediction. However, accomplishing real-time freshness is challenging and can get very expensive.
If you implement these time window aggregations using a stream processor like Apache Spark, a common way to control feature freshness is a sliding time window aggregation. With a sliding time window aggregation, you choose a specific slide interval, which determines how frequently your streaming job emits updated feature values (and therefore bounds how stale your features can get). In our experience, the minimum slide interval that stream processors can support at high data volumes is typically in the range of minutes, capping your maximum feature freshness.
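For example, a sliding window count in Spark Structured Streaming might look like the sketch below, with a 30-minute window that slides every minute (the Kafka source, schema, and sink are hypothetical). The one-minute slide is what bounds how stale the emitted feature values can get:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical Kafka source of transaction events with card_id and event_ts fields.
transactions = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.internal:9092")
    .option("subscribe", "transactions")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", "card_id STRING, event_ts TIMESTAMP").alias("e"))
    .select("e.*")
)

# 30-minute window sliding every 1 minute: updated feature values are emitted
# at most once per slide interval, which caps feature freshness.
counts = (
    transactions
    .withWatermark("event_ts", "10 minutes")
    .groupBy(F.window("event_ts", "30 minutes", "1 minute"), "card_id")
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")  # in practice: write to the online key-value store
    .start()
)
```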
Other stream processors, like Apache Flink, allow you to run aggregations without a slide interval: they can emit feature updates whenever a new event arrives on the stream. Now your features are as fresh as the time at which the most recent event was processed. Unfortunately, this also means that your feature staleness is unbounded: if no new events get processed, the feature value you serve in production will not get updated. As a result, the rolling time window aggregation doesn’t aggregate over the events that happened in a specific time window before the prediction takes place, but over the events that happened in the window preceding the last event observed on the stream.
Generate Training Datasets
Let’s assume you’ve mastered all of the challenges laid out above. You’re able to serve those features to your model running in production!
However, you still need to train a model. This means you need to maintain a historical log of your feature values so you know what the world looked like at various points in the past. You will likely not store and serve those historical feature values from your production key-value store, which is optimized for low-latency, high-scale requests. Instead, you will likely choose an offline store like a data warehouse or data lake, which is optimized for the occasional processing of large amounts of data. Your architecture now needs to evolve to something like this:
You have to ensure that the offline feature values you use to train your models are calculated consistently with your production path, and that your offline store provides historically accurate feature values without introducing data leakage – otherwise you introduce the dreaded and hard-to-debug train-serve skew. (Footnote: The two common approaches are logging the feature values served in production, and backfilling them from an offline data source using a separate compute path.)
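One common way to sanity-check the point-in-time logic offline is an as-of join between your labeled events and the historical feature log, so that each training row only sees feature values computed at or before its own timestamp. Here is a minimal sketch with pandas, using made-up data:

```python
import pandas as pd

# Hypothetical labeled events and a historical log of precomputed feature values.
labels = pd.DataFrame({
    "card_id": ["a", "a", "b"],
    "event_ts": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 12:00", "2021-01-01 11:00"]),
    "is_fraud": [0, 1, 0],
})
feature_log = pd.DataFrame({
    "card_id": ["a", "a", "b"],
    "feature_ts": pd.to_datetime(["2021-01-01 09:55", "2021-01-01 11:58", "2021-01-01 10:30"]),
    "txn_count_30m": [3, 7, 1],
})

# For each label, take the most recent feature value at or before the label's timestamp,
# never a later one; that is what prevents data leakage.
training_set = pd.merge_asof(
    labels.sort_values("event_ts"),
    feature_log.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="card_id",
    direction="backward",
)
```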
Conclusion (Part 1)
In this post, we’ve covered the technical challenges that make it hard to serve fresh rolling time window aggregations in production at high scale. Part 2 will discuss a solution to these challenges that is employed by both Tecton and Airbnb.