The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements.
SCD stages support both SCD Type 1 and SCD Type 2 processing:
1.
SCD Type 1 Overwrites an
attribute in a dimension table.
- SCD
Type 2 Adds a new row to a dimension
table.
Each SCD stage processes a single
dimension and performs lookups by using an equality matching technique. If the
dimension is a database table, the stage reads the database to build a lookup
table in memory. If a match is found, the SCD stage updates rows in the
dimension table to reflect the changed data. If a match is not found, the stage
creates a new row in the dimension table. All of the columns that are needed to
create a new dimension row must be present in the source data.
Purpose codes in a Slowly Changing Dimension stage
Purpose codes are an attribute of dimension columns in
SCD stages. Purpose codes are used to build the lookup table, to detect
dimension changes, and to update the dimension table.
Building the lookup table The SCD stage uses purpose codes to determine
how to build the lookup table for the dimension lookup. If a dimension has only
Type 1 columns, the stage builds the lookup table by using all dimension rows.
If any Type 2 columns exist, the stage builds the lookup table by using only
the current rows. If a dimension has a Current Indicator column, the stage uses
the derivation value of this column on the Dim Update tab to identify the
current rows of the dimension table. If a dimension does not have a Current
Indicator column, then the stage uses the Expiration Date column and its
derivation value to identify the current rows. Any dimension columns that are
not needed are not used. This technique minimizes the amount of memory that is
required by the lookup table.
Detecting dimension changes Purpose codes are also used to detect
dimension changes. The SCD stage compares Type 1 and Type 2 column values to
source column values to determine whether to update an existing row, insert a
new row, or expire a row in the dimension table.
Updating the dimension table Purpose codes are part of the column
metadata that the SCD stage propagates to the dimension update link. You can
send this column metadata to a database stage in the same job, or you can save
the metadata on the Columns tab and load it into a database stage in a
different job. When the database stage uses the auto-generated SQL option to
perform inserts and updates, it uses the purpose codes to generate the correct
SQL statements.
Selecting purpose codes
Purpose codes specify how the SCD stage
should process dimension data. Purpose codes apply to columns on the dimension
reference link and on the dimension update link. Select purpose codes according
to the type of columns in a dimension:
1.
If a dimension contains a Type 2 column,
you must select a Current Indicator column, an Expiration Date column, or both.
An Effective Date column is optional. You cannot assign Type 2 and Current
Indicator to the same column.
2.
If a dimension contains
only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or
SK Chain columns are allowed.
Purpose code
definitions
The
SCD stage provides nine purpose codes to support dimension processing.
1.
(blank) The column has no
SCD purpose. This purpose code is the default.
2.
Surrogate Key The column is a
surrogate key that is used to identify dimension records.
3.
Business Key The column is a
business key that is typically used in the lookup condition.
4.
Type 1 The column is an
SCD Type 1 field. SCD Type 1 column values are always current. When changes
occur, the SCD stage overwrites existing values in the dimension table.
5. Type 2 The
column is an SCD Type 2 field. SCD Type 2 column values represent a point in
time. When changes occur, the SCD stage creates a new dimension row.
6. Current®
Indicator (Type 2) The column is the current record indicator
for SCD Type 2 processing. Only one Current Indicator column is allowed.
7. Effective
Date (Type 2) The column is the effective date for SCD
Type 2 processing. Only one Effective Date column is allowed.
8. Expiration
Date (Type 2) The column is the expiration date for SCD
Type 2 processing. An Expiration Date column is required if there is no Current
Indicator column, otherwise it is optional.
- SK
Chain The column is used to link a
record to the previous record or the next record by using the value of the
Surrogate Key column. Only one Surrogate Key column can exist if you have
an SK Chain column.
Surrogate
keys in a Slowly Changing Dimension stage
Surrogate keys are used to join a
dimension table to a fact table in a star schema database.
When the SCD stage performs a dimension lookup, it retrieves the value of the
existing surrogate key if a matching record is found. If a match is not found,
the stage obtains a new surrogate key value by using the derivation of the
Surrogate Key column on the Dim Update tab. If you want the SCD stage to
generate new surrogate keys by using a key source that you created with a
Surrogate Key Generator stage, you must use the NextSurrogateKey function to
derive the Surrogate Key column. If you want to use your own method to handle
surrogate keys, you should derive the Surrogate Key column from a source
column.
You can replace the dimension information in the source data stream with the
surrogate key value by mapping the Surrogate Key column to the output link.
Specifying
information about a key source
If you created a key source with a
Surrogate Key Generator stage, you must specify how the SCD stage should use
the source to generate surrogate keys.
The key source can be a flat file or a database sequence. The key source
must exist before the job runs. If the key source is a flat file, the file
must be accessible from all nodes that run the SCD stage.
To use the key source:
1. On the Input page, select the reference link in the
Input name field.
2. Click the Surrogate Key tab.
3. In the Source type field, select the source type.
4. In the Source name field, type the name of the key source, or click
the arrow button to browse for a file or to insert a job parameter. If the
source is a flat file, type the name and fully qualified path of the state
file, such as C:/SKG/ProdDim.
If the source is a database sequence, type the name of the sequence, such as
PRODUCT_KEY_SEQ. 5. Provide additional information about the key source
according to the type:
If the source is a flat file, specify information in the Flat File area.
If the source is a database sequence, specify information in the DB sequence
area.
Calls to the key source are made by the
NextSurrogateKey function. On the Dim Update tab, create a derivation that uses
the NextSurrogateKey function for the column that has a purpose code of
Surrogate Key. The NextSurrogateKey function returns the value of the next
surrogate key when the SCD stage creates a new dimension row.