Datastage - Slowly Changing Dimension (SCD) stage

 The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements.

SCD stages support both SCD Type 1 and SCD Type 2 processing:

1.       SCD Type 1 Overwrites an attribute in a dimension table.

  1. SCD Type 2 Adds a new row to a dimension table.

Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table. All of the columns that are needed to create a new dimension row must be present in the source data.

Purpose codes in a Slowly Changing Dimension stage
Purpose codes are an attribute of dimension columns in SCD stages. Purpose codes are used to build the lookup table, to detect dimension changes, and to update the dimension table.

Building the lookup table The SCD stage uses purpose codes to determine how to build the lookup table for the dimension lookup. If a dimension has only Type 1 columns, the stage builds the lookup table by using all dimension rows. If any Type 2 columns exist, the stage builds the lookup table by using only the current rows. If a dimension has a Current Indicator column, the stage uses the derivation value of this column on the Dim Update tab to identify the current rows of the dimension table. If a dimension does not have a Current Indicator column, then the stage uses the Expiration Date column and its derivation value to identify the current rows. Any dimension columns that are not needed are not used. This technique minimizes the amount of memory that is required by the lookup table.

Detecting dimension changes Purpose codes are also used to detect dimension changes. The SCD stage compares Type 1 and Type 2 column values to source column values to determine whether to update an existing row, insert a new row, or expire a row in the dimension table.

Updating the dimension table Purpose codes are part of the column metadata that the SCD stage propagates to the dimension update link. You can send this column metadata to a database stage in the same job, or you can save the metadata on the Columns tab and load it into a database stage in a different job. When the database stage uses the auto-generated SQL option to perform inserts and updates, it uses the purpose codes to generate the correct SQL statements.

Selecting purpose codes
Purpose codes specify how the SCD stage should process dimension data. Purpose codes apply to columns on the dimension reference link and on the dimension update link. Select purpose codes according to the type of columns in a dimension:

1.       If a dimension contains a Type 2 column, you must select a Current Indicator column, an Expiration Date column, or both. An Effective Date column is optional. You cannot assign Type 2 and Current Indicator to the same column.

2.       If a dimension contains only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or SK Chain columns are allowed.

Purpose code definitions

The SCD stage provides nine purpose codes to support dimension processing.

 

1.       (blank) The column has no SCD purpose. This purpose code is the default.

2.       Surrogate Key The column is a surrogate key that is used to identify dimension records.

3.       Business Key The column is a business key that is typically used in the lookup condition.

4.       Type 1 The column is an SCD Type 1 field. SCD Type 1 column values are always current. When changes occur, the SCD stage overwrites existing values in the dimension table.

5.       Type 2 The column is an SCD Type 2 field. SCD Type 2 column values represent a point in time. When changes occur, the SCD stage creates a new dimension row.

6.       Current® Indicator (Type 2) The column is the current record indicator for SCD Type 2 processing. Only one Current Indicator column is allowed.

7.       Effective Date (Type 2) The column is the effective date for SCD Type 2 processing. Only one Effective Date column is allowed.

8.       Expiration Date (Type 2) The column is the expiration date for SCD Type 2 processing. An Expiration Date column is required if there is no Current Indicator column, otherwise it is optional.

  1. SK Chain The column is used to link a record to the previous record or the next record by using the value of the Surrogate Key column. Only one Surrogate Key column can exist if you have an SK Chain column.

Surrogate keys in a Slowly Changing Dimension stage
Surrogate keys are used to join a dimension table to a fact table in a star schema database.

When the SCD stage performs a dimension lookup, it retrieves the value of the existing surrogate key if a matching record is found. If a match is not found, the stage obtains a new surrogate key value by using the derivation of the Surrogate Key column on the Dim Update tab. If you want the SCD stage to generate new surrogate keys by using a key source that you created with a Surrogate Key Generator stage, you must use the NextSurrogateKey function to derive the Surrogate Key column. If you want to use your own method to handle surrogate keys, you should derive the Surrogate Key column from a source column.

You can replace the dimension information in the source data stream with the surrogate key value by mapping the Surrogate Key column to the output link.

Specifying information about a key source
If you created a key source with a Surrogate Key Generator stage, you must specify how the SCD stage should use the source to generate surrogate keys.

The key source can be a flat file or a database sequence. The key source must exist before the job runs. If the key source is a flat file, the file must be accessible from all nodes that run the SCD stage.


To use the key source:

1. On the Input page, select the reference link in the Input name field.

2. Click the Surrogate Key tab.
3. In the Source type field, select the source type.
4. In the Source name field, type the name of the key source, or click the arrow button to browse for a file or to insert a job parameter. If the source is a flat file, type the name and fully qualified path of the state file, such as
C:/SKG/ProdDim. If the source is a database sequence, type the name of the sequence, such as PRODUCT_KEY_SEQ. 5. Provide additional information about the key source according to the type:
If the source is a flat file, specify information in the Flat File area.
If the source is a database sequence, specify information in the DB sequence area.

Calls to the key source are made by the NextSurrogateKey function. On the Dim Update tab, create a derivation that uses the NextSurrogateKey function for the column that has a purpose code of Surrogate Key. The NextSurrogateKey function returns the value of the next surrogate key when the SCD stage creates a new dimension row.

Post a Comment

Previous Post Next Post

Contact Form