Introducing Tecton SDK 0.5
We are very excited to be launching Tecton 0.5! With improvements in flexibility, speed, costs, and control, 0.5 is our biggest release yet. The list includes some of our most highly requested features such as the Feature Materialization API, Feature View Output Streams, and Spark-less Feature Retrieval.
Check out the full list of functionality below and tell us what you think. We look forward to seeing what you build!
See the release notes for more information.
Feature Materialization API
Tecton’s new materialization API makes it easy to trigger feature materialization jobs programmatically, allowing your upstream data pipelines that run outside of Tecton to kick off feature processing as soon as new raw data is ready. The API can also be used to monitor feature materialization job completion statuses in order to kick off training or inference when new feature data is ready. The Tecton Airflow provider makes leveraging this API in Airflow DAGs quick and easy!
Now you can easily manage your entire ML pipeline, from feature materialization, through ML model training, all the way to making ML predictions, in any pipeline orchestration tool of your choice (Airflow, Kubeflow, Dagster, Prefect, etc.).
tecton.get_workspace('prod') \
    .get_feature_view('my_feature_view') \
    .trigger_materialization_job(
        start_time=datetime(2022, 10, 2),
        end_time=datetime(2022, 10, 3),
        offline=True,
        online=True,
        overwrite=True  # Rerun a past job
    )
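In an orchestrator, a common pattern is to poll the triggered job until it reaches a terminal state before kicking off training or inference. Here is a minimal, framework-agnostic polling sketch; the get_status callable and the status strings are illustrative stand-ins, not Tecton's job-status API:

```python
import time

def wait_for_completion(get_status, poll_interval=0.0, timeout=60.0,
                        terminal=("SUCCESS", "FAILURE")):
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a terminal state in time")

# Illustrative stand-in for a job-status API: succeeds on the third poll.
_polls = iter(["PENDING", "RUNNING", "SUCCESS"])
print(wait_for_completion(lambda: next(_polls)))  # SUCCESS
```

In a real DAG, get_status would wrap a call to the materialization API (or you would use the Tecton Airflow provider, which handles the waiting for you).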
Feature View Output Streams
Feature View Output Streams enable event-driven applications that react to new feature updates in Tecton. For example, you may want to refresh “watch next” recommendations in the background after a user clicks on a new title.
Once you configure the output stream for a feature view, Tecton will write records to that stream for every new value processed. Both Kafka and Kinesis are supported. Here’s an example of configuring a Feature View with Kinesis.
@stream_feature_view(
    sources=[transactions_stream],
    entities=[user],
    ...
    output_stream=KinesisOutputStream(
        stream_name='feature-stream-name',
        region='us-west-2',
        include_features=True,
    )
)
...
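For illustration, here is what a downstream consumer might do with one record from the output stream. The payload shape below is an assumed example, not Tecton's documented wire format; consult the docs for the actual record schema:

```python
import json

# Hypothetical output-stream payload: entity join keys, event timestamp,
# and the freshly computed feature values (shape is an assumption).
raw = json.dumps({
    "join_keys": {"user_id": "u_123"},
    "timestamp": "2022-10-02T00:00:00Z",
    "features": {"clicks_last_1h": 7, "avg_session_minutes": 12.5},
})

record = json.loads(raw)
if record["features"]["clicks_last_1h"] > 5:
    # e.g. enqueue a background refresh of "watch next" recommendations
    print(f"refresh recommendations for {record['join_keys']['user_id']}")
```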
Check out the documentation for more information on using Feature View Output Streams.
AWS Athena-Based Feature Retrieval. No Spark required!
Tecton’s SDK can now leverage AWS Athena compute to generate training data sets from materialized features. This enables fast offline feature retrieval without the need for Spark. This is particularly useful if you want to generate training data sets using Tecton as part of a DAG in Airflow, Kubeflow, Dagster, or another orchestrator.
Check out the documentation to try it out!
Spark Data Source Functions for Unlimited Flexibility in Connecting to Data Sources
Data sources for both batch and streaming Spark features can now be defined using functions, allowing for unlimited flexibility in data source types, authentication mechanisms, schema registry integrations, partition filtering logic, and more. Simply write any PySpark function that returns a DataFrame!
Always wanted to read from an Iceberg table?
Care to join streams?
Want to skip a set of directories on S3?
Whatever you can do in an interactive Spark notebook, you can now do in Tecton.
from tecton import BatchSource, spark_batch_config

@spark_batch_config()
def csv_data_source_function(spark):
    df = spark.read.csv(csv_uri, header=True)
    ...
    return df

csv_batch_source = BatchSource(
    name="csv_batch_source",
    batch_config=csv_data_source_function
)
More details and examples can be found here.
Batch Feature View Skew Reduction for Better Models
Tecton’s time-travel queries now consider more information such as scheduling details in order to select historically accurate feature values and reduce online / offline skew. Ensuring that offline feature data reflects the values that would have been available in the online store at a given time is critical for achieving good model quality. For more information, see the documentation.
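The core idea behind a point-in-time correct lookup can be sketched in a few lines of plain Python: for each training event, select the latest feature value that had already been materialized at that event's timestamp. This is a simplified illustration of the concept, not Tecton's implementation:

```python
from datetime import datetime

# Materialized feature values, paired with the time each value
# became available in the online store (sorted ascending).
feature_log = [
    (datetime(2022, 10, 1), 0.10),
    (datetime(2022, 10, 2), 0.25),
    (datetime(2022, 10, 3), 0.40),
]

def value_as_of(ts):
    """Return the most recent feature value available at time ts."""
    candidates = [v for t, v in feature_log if t <= ts]
    return candidates[-1] if candidates else None

# A training event at Oct 2, 12:00 must see 0.25, not the later 0.40;
# using 0.40 would leak information the model never had online.
print(value_as_of(datetime(2022, 10, 2, 12)))  # 0.25
```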
get_historical_features() Performance Improvements on Spark
Tecton feature retrieval has been optimized, including a more stable and performant implementation of our point-in-time join. Spine-based retrieval is now significantly faster for non-aggregate and custom aggregate feature views (and for feature services that contain these feature views).
Suppress Object Recreation to Optimize Costs
Tecton’s CLI now offers greater control over evolving feature pipelines. By default, Tecton automatically rematerializes feature data when changes are made to a feature’s transformation logic. This keeps historical feature data accurate. However, you may want to avoid rematerialization costs if the changes do not affect feature semantics (e.g. commenting code, extending a data source schema, changing to a mirror data source). In 0.5, Tecton admins can use the --suppress-recreates flag with tecton apply in order to suppress the recreation of objects and avoid unnecessary materialization costs.
$ tecton apply --suppress-recreates
Learn more about this functionality here.
Struct-Type On-Demand Features
On-Demand Feature Views now support structs as feature types. Now you can return deeply nested structures from your ODFV without resorting to workarounds that lose type safety (such as returning a JSON string)!
from tecton import on_demand_feature_view, RequestSource
from tecton.types import Array, Field, Float64, String, Struct

request_source = RequestSource([Field("input_float", Float64)])
output_schema = [
    Field("output_struct", Struct([
        Field("string_field", String),
        Field("float64_field", Float64)
    ]))
]

@on_demand_feature_view(
    mode="python",
    sources=[request_source],
    schema=output_schema,
    description="Output a struct with two fields."
)
def simple_struct_example_odfv(request):
    input_float = request["input_float"]
    return {
        "output_struct": {
            "string_field": str(input_float * 2),
            "float64_field": input_float * 2
        }
    }
Check out the documentation for using struct types here.
Programmatic Metadata Access via Python SDK
All SDK methods returning a table now return a Displayable object with a to_dict method, making it easy to programmatically access metadata via the Python SDK.
my_feature_service = tecton.get_workspace('prod').get_feature_service('fraud_detection_feature_service')
print(my_feature_service.summary().to_dict()['Features'])