Featured Features: Ratio Features
Welcome to the first installment of Tecton’s Featured Features, an ongoing series in which we will be taking a deeper dive into some of the more popular features we see developers building on our feature platform for production machine learning. At Tecton, we’ve seen a number of organizations have success in using features that fall into similar categories that we’d like to present in more detail. These features improve the performance of their ML models by giving them more or better context. In this post, we will take a look at ratio features: what they are, how and when they are used by ML models, and how to build ratio features in Tecton.
Ratio features in machine learning
Given two features, Feature A and Feature B, a ratio feature could be created by combining the two features via a ratio: Feature A/Feature B. Ratio features help give context to features that have large variances. A few examples of ratio features could be:
- The ratio of late payments to total payments, which could be used in credit assessment models
- The proportion of product orders to product views, which could be used for an item ranking model
- The percentage of viewers that liked a video, which could be used for a feed recommendation model
Let’s dive deeper into the first example. An organization running a payments platform may track the number of late payments each user makes and send that information to models as a feature. Let’s say you want to determine whether 5 late payments in one year is a lot for one user. If that user has only made 5 total payments this year, it would seem so, but if they have only had 5 late payments over a total of 5,000 transactions, an ML model (and the organization using the model) may consider that less of an issue.
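As a quick illustration of the arithmetic above, here is a minimal sketch that compares the same raw count of late payments against the two payment totals from the example. The variable names are made up for illustration.

```python
# Same number of late payments, two very different payment histories.
late_payments = 5
payment_totals = (5, 5000)

# Ratio of late payments to total payments for each history.
ratios = {total: late_payments / total for total in payment_totals}

for total, ratio in ratios.items():
    print(f"{late_payments} late out of {total} payments -> ratio {ratio}")
```

The raw count is identical in both cases, but the ratio separates a user who is always late from one who almost never is.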
This isn’t the only ratio that could bring more context about a user’s late payments to ML models. For instance, each user’s late payments this month over their total late payments, or late payments this month over total payments this month, could impart important seasonality information to a model. With these examples, you can see how an organization can turn one of its features (number of late payments) into a collection of many derived features using ratios.
We discussed three ratio features (late payments over total payments, this month’s late payments over total late payments, and this month’s late payments over this month’s total payments), but many more could be created, each with the possibility of bringing more context to a model and improving its ability to make predictions. A feature platform can simplify scaling out these features by automating the creation and management of the many ratio features derived from a single feature.
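The three ratios discussed here can be sketched in plain Python, starting from four raw counts per user. This is only an illustration: `PaymentCounts`, `safe_ratio`, `derive_ratio_features`, and the feature names are hypothetical, and returning `None` for a zero denominator is one possible convention among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaymentCounts:
    """Raw counts for one user (hypothetical container for this sketch)."""
    late_total: int
    payments_total: int
    late_this_month: int
    payments_this_month: int


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derive_ratio_features(counts: PaymentCounts) -> dict:
    """Turn the raw counts into the three ratio features from the text."""
    return {
        "late_to_total_payments": safe_ratio(
            counts.late_total, counts.payments_total),
        "late_this_month_to_late_total": safe_ratio(
            counts.late_this_month, counts.late_total),
        "late_this_month_to_payments_this_month": safe_ratio(
            counts.late_this_month, counts.payments_this_month),
    }


# Made-up counts: 4 late payments out of 40 overall, 2 late out of 5 this month.
example = derive_ratio_features(PaymentCounts(4, 40, 2, 5))
print(example)
```

With these made-up counts, the user looks reliable over their lifetime but has a noticeably worse month, which is the kind of seasonality signal the single raw count would hide.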
Building a ratio feature in Tecton
Fortunately for Tecton users, implementing ratio features is straightforward. Tecton’s On-Demand Feature Views (ODFVs) provide a framework for combining features at request time instead of precomputing them and storing them in the offline and online stores. ODFVs can be used to build a wide variety of feature types, and they suit ratio features particularly well: precomputing every ratio for every entity would fill the online store with many values that are rarely read, which makes the store inefficient and expensive.
Many ratio features, like the ones noted in the previous section, are built by running additional transformations on lifetime aggregations. We can take a quick look at an example that transforms columns in a batch data set into lifetime aggregations, then compares those aggregations on-demand as a ratio (you can follow along with all these steps with this example Databricks notebook).
Creating an example dataset
The sample notebook creates a Spark DataFrame and registers it as a BatchSource in Tecton. This dataset has three timestamps: when each payment was made, when each payment was due, and when the record was last updated. In the example dataset, there are 2 users who have each made 3 payments. One user made 1 late payment out of 3, and the other made all 3 payments late. With these examples, we can build features tracking the total number of payments and late payments each user has made, and later calculate a ratio from those totals.
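For readers without the notebook at hand, the shape of that dataset can be sketched in plain Python. The user IDs and dates below are made up for illustration (they are not the notebook's actual rows), and the update timestamp column is omitted for brevity.

```python
from datetime import date

# Hypothetical rows: (user_id, payment_id, payment_date, payment_due_date).
rows = [
    ("user_1", 1, date(2023, 7, 3), date(2023, 7, 5)),    # on time
    ("user_1", 2, date(2023, 7, 12), date(2023, 7, 10)),  # late
    ("user_1", 3, date(2023, 7, 20), date(2023, 7, 25)),  # on time
    ("user_2", 4, date(2023, 7, 8), date(2023, 7, 5)),    # late
    ("user_2", 5, date(2023, 7, 14), date(2023, 7, 10)),  # late
    ("user_2", 6, date(2023, 7, 30), date(2023, 7, 25)),  # late
]

# Count total and late payments per user; a payment is late when it was
# made after its due date.
totals = {}
for user_id, _payment_id, payment_date, payment_due_date in rows:
    counts = totals.setdefault(user_id, {"payments": 0, "late_payments": 0})
    counts["payments"] += 1
    counts["late_payments"] += int(payment_date > payment_due_date)

print(totals)
```

These per-user totals (1 late out of 3, and 3 late out of 3) are the quantities the Feature View keeps as running aggregates.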
Defining lifetime aggregation Feature Views
In Tecton, a Batch Feature View will provide the definitions that transform data from previously defined data sources into features. The Feature View has 2 aggregations, and for each user, Tecton will keep a running count of their total number of payments and total number of late payments.
from datetime import datetime, timedelta

from tecton import Aggregation, batch_feature_view

# late_payments_source (the BatchSource) and user (the entity) are assumed
# to be defined earlier in the notebook.
@batch_feature_view(
    sources=[late_payments_source],
    entities=[user],
    mode='spark_sql',
    batch_schedule=timedelta(days=1),
    aggregation_interval=timedelta(days=1),
    aggregations=[
        # Ten-year windows stand in for "lifetime" totals.
        Aggregation(column='late_payment', function='sum',
                    time_window=timedelta(days=365*10)),
        Aggregation(column='payment_id', function='count',
                    time_window=timedelta(days=365*10)),
    ],
    feature_start_time=datetime(2023,7,1),
    timestamp_field="update_timestamp"
)
def payment_aggregates(late_payments_source):
    # Flag a payment as late when it was made after its due date.
    return f'''
        SELECT
            user_id,
            payment_id,
            CASE WHEN datediff(payment_date, payment_due_date) > 0
                THEN 1 ELSE 0 END as late_payment,
            update_timestamp
        FROM
            {late_payments_source}
        '''
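To make the two aggregations concrete, here is a rough plain-Python sketch of a windowed sum and count over already-flagged rows. It is not how Tecton computes aggregations internally; `windowed_aggregates`, the sample rows, and the window boundary rule (start exclusive, end inclusive) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def windowed_aggregates(rows, as_of, window=timedelta(days=365 * 10)):
    """Sum late flags and count payments per user for rows inside the window.

    Assumption: a row is included when as_of - window < update_timestamp <= as_of.
    """
    start = as_of - window
    result = {}
    for user_id, late_payment, update_timestamp in rows:
        if not (start < update_timestamp <= as_of):
            continue  # outside the ten-year window
        agg = result.setdefault(
            user_id, {"late_payment_sum": 0, "payment_id_count": 0})
        agg["late_payment_sum"] += late_payment
        agg["payment_id_count"] += 1
    return result


# Hypothetical rows: (user_id, late_payment flag, update_timestamp).
rows = [
    ("user_1", 0, datetime(2023, 7, 3)),
    ("user_1", 1, datetime(2023, 7, 12)),
    ("user_1", 0, datetime(2023, 7, 20)),
    ("user_2", 1, datetime(2010, 1, 1)),   # older than ten years, excluded
    ("user_2", 1, datetime(2023, 7, 14)),
]

aggregates = windowed_aggregates(rows, as_of=datetime(2023, 8, 1))
print(aggregates)
```

The excluded 2010 row shows why a ten-year window is only an approximation of a true lifetime total: anything older than the window drops out of both the sum and the count.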
On-demand ratio features
If a new payment is made between batches, we can still assess the ratio of each user’s late payments to total payments with an On-Demand Feature View. The definition provided below takes the precomputed lifetime aggregates, adds one to the total number of payments, and adds one to the total number of late payments only if the new payment occurred after its due date.
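from tecton import RequestSource, on_demand_feature_view
from tecton.types import Field, Float64, String

request_schema = [
    Field("payment_timestamp", String),
    Field("payment_due_timestamp", String)
]
transaction_request = RequestSource(schema=request_schema)
output_schema = [Field("late_payment_ratio", Float64)]

@on_demand_feature_view(
    sources=[transaction_request, payment_aggregates],
    mode='python',
    schema=output_schema
)
def late_ratio(transaction_request, payment_aggregates):
    # Imported inside the function body so it is available at request time.
    from datetime import datetime

    # Precomputed lifetime aggregates; default to 0 for users with no history.
    late_payments = payment_aggregates.get('late_payment_sum_3650d_1d', 0)
    # Assumes the count feature follows the same naming pattern as the sum.
    payments = payment_aggregates.get('payment_id_count_3650d_1d', 0)
    # The incoming payment is late if it was made after its due timestamp.
    is_current_payment_late = (
        datetime.strptime(transaction_request['payment_timestamp'],
                          '%Y-%m-%dT%H:%M:%S.000+0000') >
        datetime.strptime(transaction_request['payment_due_timestamp'],
                          '%Y-%m-%dT%H:%M:%S.000+0000')
    )
    # Add the incoming payment to both totals before taking the ratio.
    late_ratio_value = (late_payments + is_current_payment_late) / (1 + payments)
    return {'late_payment_ratio': late_ratio_value}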
The notebook ends by sending mock examples of payments to the On-Demand Feature View with the run() command. For the user who has made 3 out of 3 late payments, the late payment to total payment ratio with an additional late payment would remain at 1.0. The other user in this scenario would now have made 2 late payments out of 4, for a ratio of 0.5.
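The same arithmetic can be checked outside of Tecton with a standalone sketch. `late_ratio_sketch` is a hypothetical plain-Python stand-in for the On-Demand Feature View, the request timestamps are made up, and the count feature name `payment_id_count_3650d_1d` is assumed to follow the same pattern as the sum feature.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.000+0000"


def late_ratio_sketch(request: dict, aggregates: dict) -> dict:
    """Fold the incoming payment into the stored totals and return the ratio."""
    late_payments = aggregates.get("late_payment_sum_3650d_1d", 0)
    payments = aggregates.get("payment_id_count_3650d_1d", 0)  # assumed name
    paid_at = datetime.strptime(request["payment_timestamp"], TIMESTAMP_FORMAT)
    due_at = datetime.strptime(request["payment_due_timestamp"], TIMESTAMP_FORMAT)
    is_late = paid_at > due_at  # True counts as 1 in the sum below
    return {"late_payment_ratio": (late_payments + is_late) / (1 + payments)}


# A mock late payment: made four days after its due date.
late_request = {
    "payment_timestamp": "2023-08-05T00:00:00.000+0000",
    "payment_due_timestamp": "2023-08-01T00:00:00.000+0000",
}

always_late = late_ratio_sketch(
    late_request,
    {"late_payment_sum_3650d_1d": 3, "payment_id_count_3650d_1d": 3})
sometimes_late = late_ratio_sketch(
    late_request,
    {"late_payment_sum_3650d_1d": 1, "payment_id_count_3650d_1d": 3})

print(always_late)     # {'late_payment_ratio': 1.0}
print(sometimes_late)  # {'late_payment_ratio': 0.5}
```

Because the denominator always includes the incoming payment, the sketch never divides by zero, even for a user with no stored history.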
Sending this and other ratios to an ML model can give the model a more complete understanding of the context it is making predictions in. Developing these ratios as On-Demand Feature Views provides additional efficiencies to ML pipelines that we’ve seen organizations use to pre-approve credit lines and determine risk, among many other use cases. You can learn more about some of these organizations’ journeys in applied ML with feature platforms at Tecton’s apply(ops) conference on November 14. In the meantime, stay tuned for the next installment of Featured Features!