Presenter: Brian Knight
bknight@pragmaticworks.com
Owner, Pragmatic Works
In this session, you’ll learn how to load a typical data warehouse in SSIS efficiently. You’ll start by seeing some of the strengths and weaknesses of the Slowly Changing Dimension (SCD) Wizard in SSIS and how you can get around some of the weaknesses including your own home-brewed solution. You’ll then see how to load a fact table using SSIS and how to make the common components scale.
Dimensional Modeling (See Dimensional Modeling 101 Notes)
- Data Separated into fact and dimension tables
- Dimension tables answer the pivot or where clause
- Make as wide and descriptive as possible
- Surrogate keys operate as unique ID for each row
- Keep surrogate keys as small as possible
- Fact tables answer the what questions or select statement
- Intersect all dimension tables
- Surrogate keys from each dimension in this table
- Measures are the “what” like Price, Quantity, Duration
Discusses SCD (Slowly Changing Data) Dimension types (Already here)
Don’t make everything Type two or your DB will bloat significantly. Also your reports would be more difficult to write.
Probably want to fix NULL values to be something. Use a Derived Column transform. Makes Nulls to be something like 0, Unknown, etc…
SCD Wizard Strengths
- SSIS transform that creates many other transforms conditionally
- Reduces design time of SCD load by 80%-90% to minutes per dimension
- Can be customized easily
- Compares differences between source and destination to find changes and new records
- Outputs:
- Type 0,1,2 update
- Inferred members
- New rows
- Duplicate rows
Historical Attribute Options – How do you want to set the expiration of historical records.
Problem with SCD Wizard is that any time you go back and change the configuration, all the output logic below it gets re-written. You lose what you created.
SCD Wizard Weaknesses
- Scalability -Generally up to about 50,000 records into the transform but varies based on number of updates
- Maintainability -After you customize, rerunning the wizard recreates all the transforms
- Uses OLE DB Command transforms for updates is row-level. Creates scalability issue here if lots of updates.
Making Your Own SCD Wizard
- Can use a Merge Join or Lookup Transform
- If no match found, it is an insert (Ignore Errors)
- Lookup Transform will scale better than Merge Join but lacks parameterization
- Add a Conditional Split transform after Lookup to direct to insert, duplicate or update path
Additional Scalability
- Watch your Lookup Transformation for scalability issues (don’t cache too much!)
- Potentially cache only the last 1 years worth of data with Partial Caching
- Only cache columns needed
- Additional scalability can be reached by landing updates into a staging table
- Then set-based update with an Execute SQL task.
- Checksum Transform can be used to detect changes across many columns
- Or HASHBYTES T-SQL statement
Inferred Members
- Created during the fact load
- A new Dim record is created using the value of “unknown” or NULL as a placeholder
- The record is flagged as an inferred member
Slowly Changing Dimension Wizard
- SSIS transform that creates many other transforms conditionally
- Handles:
- Type 0 (fixed attribute)
- Type 1 (changing attribute)
- Type 2 (historical attribute)
- Inferred members
- Typically can address 80% of the business scenarios
SCD Wizard Strengths
- SSIS transform that creates many other transforms conditionally
- Reduces design time of SCD load by 80%-90% to minutes per dimension
- Can be customized easily
- Compares differences between source and destination to find changes and new records
- Outputs:
- Type 0,1,2 update
- Inferred members
- New rows
- Duplicate rows
SCD Wizard Weaknesses
- Scalability -Generally up to about 50,000 records into the transform but varies based on number of updates
- Maintainability -After you customize, rerunning the wizard recreates all the transforms
- Uses OLE DB Command transforms for updates is row-level. Creates scalability issue here if lots of updates.
Lookup Transform – Lookup source against target dimension table. Select ALL available lookup fields and alias at TARGET_ so you can match against them. Link by primary key. Ignore the failure of matches.
Then use a conditional split – If target PK is NULL it is a new record, else Update.
“Pretty much a pain in the butt to write.” – Brian Knight in reference to the else UPDATE piece from the conditional split.
Ugh…no transcript and this info is not in the slides. ARGH! Can’t remember it all.
Brian uses the Checksum source vs. Checksum destination to deal with the UPDATE referred to above, but not guaranteed to be unique.
Hashbytes on the other hand is more unique if you wish. But it does not work on numeric fields. Need to cast them.
Fact Table Loads
- Series of Lookup Transforms
- In Type 2 Dimensions add WHERE EndDate IS NOT NULL
- Measures created Derived Column Transforms
- Aggregate transform to roll up the grain
- Lookup failure would create an inferred member or set to unknown
Popularity: 8% [?]