5. Transform¶
5.1. Introduction¶
Once data has been extracted from a DatabaseSource
or CallableSource
, it can be cleaned and expanded upon by ‘expansions’.
Conceptually, pylytics is a pipeline from extract to load.
Each row which comes from the source is passed through this pipeline.
The pipeline consists of a number of expansions, which transform the data. At the end of the pipeline, the data is ready to the loaded.
The expansions are simple, testable functions. For example:
# transform.py
def convert_datetime_to_date(data):
""" We're only interested in the created date, and not the time.
"""
data['created_date'] = data['created_datetime'].date()
del data['created_datetime']
def convert_str_to_int(data):
""" The source returns integers as strings - convert them.
"""
data['sales'] = int(data['sales'])
The data
argument is a dictionary representing a single row.
5.2. Adding expansions to facts and dimensions¶
For example:
# load.py
from transform import convert_datetime_to_date, convert_str_to_int
class Sales(Fact):
__source__ = DatabaseSource.define(
database="sales",
query="SELECT * FROM sales_table",
expansions=[convert_datetime_to_date, convert_str_to_int]
)
...
Both DatabaseSource and CallableSource accept the expansions
argument.
The expansions are processed in the order they appear in the list.