bradtodd

Dedicated to design and performance of databases and audio systems.

Dimensional:
  Star versus Snowflake

A recent theme I see among application groups is the need to create a Date Dimension table for Business Intelligence (BI) styles of reporting. Cognos works best with a dimensional model semantic layer underneath it.

For those who don't know what a dimensional model is, it is a data perspective where a main fact table--containing metrics like counts and dollars--is surrounded by dimensions that describe specific records within the fact. Some examples of dimensions include Time, Organization, Location, Product, etc. For example, we could have a fact table representing sales and their associated dollars, where each sale has a corresponding set of dimensions: sales time, store location, product, etc. In this sense, all concepts beyond the sales counts and sales dollars are dimensions.
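
To make that concrete, here is a minimal sketch of what such a sales fact might look like. The table and column names (sales_f, sale_dte, str_loc_id, prod_id, sale_cnt, sale_amt) are hypothetical and only illustrate the metrics-plus-dimension-keys shape:

-- hypothetical fact table: Sales (illustration only)
create multiset table sales_f, no fallback (
    sale_dte   date          not null, -- key to the Date dimension
    str_loc_id integer       not null, -- key to a Location dimension
    prod_id    integer       not null, -- key to a Product dimension
    sale_cnt   integer       not null, -- metric: sales count
    sale_amt   decimal(18,2) not null) -- metric: sales dollars
primary index (sale_dte, str_loc_id, prod_id)
;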

There are two primary styles of dimensions: star and snowflake. A star takes all of the related data and flattens it into a single object (i.e. table or view), whereas a snowflake allows the dimensional data to remain in its true hierarchy, in a more normalized state. If we think of the Time dimension, the need may be to have Date, Year/Month, Month, Quarter, and Year. One usage requirement may be that the second quarter of last year is compared with the second quarter of this year. Now obviously, a date belongs to a month within a year (e.g. 1/21/2014 is part of January 2014); a month is the same month across years (e.g. 1/21/2013 and 1/21/2014 are both part of January); a month belongs to a quarter (e.g. January is part of Quarter 1); and a date is part of a year (e.g. 1/21/2014 is part of 2014). Flattening Date, Year/Month, Month, Quarter, and Year onto one dimension is a star; allowing each entity (i.e. Date, Year/Month, Month, Quarter, Year) to be a separate, but related, hierarchical entity is a snowflake. In the snowflake, the Month relates to the Date, which relates to the fact table.
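
As a rough sketch, purely to illustrate the two shapes: the star form of this Time dimension would flatten every level into one object, like the hypothetical table below, whereas the snowflake form keeps each level in its own related table, which is exactly what the cookbook later in this post builds (with the star presented on top of it through a view).

-- hypothetical flattened (star) Time dimension: every level in one table
create multiset table dte_dim_star_t, no fallback (
    cal_dte       date     not null,
    cal_yr_mo_num integer  not null,
    cal_mo_num    byteint  not null,
    cal_qtr_num   byteint  not null,
    cal_yr_num    smallint not null)
unique primary index (cal_dte)
;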

We use Teradata, which does not require a dimensional model to perform the way other RDBMSs do. But BI tools typically want to see a star schema. The challenge is how to optimize our physical data implementation in Teradata while still providing a star-type perspective for the accessing BI tools. The simple answer is via views: the transformation from a Teradata-optimized model to a BI-friendly one occurs in the view layer. However, there are some considerations to be made.

For this we will use the Date dimension as our example. I have seen multiple application groups build customized Date dimension tables, flattening many levels of hierarchy into a single table. Another option I have seen is the use of the System Calendar, which is also a flattened Date perspective, albeit via three levels of views on top of a table that contains solely a date field--all other fields are derived. Neither option is optimal. Why? Because, given proper statistics and foreign key constraints, Teradata's optimizer is extremely good at understanding cardinality across tables, and its physical architecture is extremely good at replicating small data across its AMPs in memory. So what does that have to do with our Date dimension?

Simple: we need to store our data in a Teradata optimized fashion, but deliver it to our BI tools in a star-schema style of dimensional perspective. We know we will accomplish this via views. However, it is always better to illustrate it and provide a cookbook example. Here's mine.

First, we create our lowest level grain table: Date

-- table definition: Date
create multiset table dte_t, no fallback (
    cal_dte       date    not null,
    cal_yr_mo_num integer not null)
unique primary index (cal_dte)
;

Two fields: Date (e.g. 1/1/2014, 1/2/2014, etc.) and the Year/Month (e.g. 201401).

Define our statistics:

-- statistics definition: Date
collect statistics on dte_t index  (cal_dte);
collect statistics on dte_t column (partition);
collect statistics on dte_t column (cal_yr_mo_num);

Remember to always collect statistics on the keyword PARTITION regardless of whether the table is partitioned or not. NPPI tables are considered, in their entirety, to be logical partition zero. :-)

Next, load the data from the System Calendar:

-- data load: Date
insert into dte_t
select
    calendar_date as cal_dte,
    cal_dte / 100 + 190000 as cal_yr_mo_num -- e.g. 201401

from sys_calendar.calendar

-- adjust to the appropriate requirement
where calendar_date between date '1980-01-01' and current_date - 1
;
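
The derivation above leans on Teradata's internal DATE encoding of (year - 1900) * 10000 + month * 100 + day. As a worked example, 2014-01-21 is stored internally as 1140121, so 1140121 / 100 = 11401 and 11401 + 190000 = 201401. A quick, hypothetical sanity-check query against the System Calendar:

-- sanity check of the Year/Month arithmetic for a single date
-- (assumes the internal DATE encoding described above)
select
    calendar_date                as cal_dte,       -- 2014-01-21, internally 1140121
    calendar_date / 100 + 190000 as cal_yr_mo_num  -- 11401 + 190000 = 201401
from sys_calendar.calendar
where calendar_date = date '2014-01-21'
;

If you would rather not rely on that encoding, extract(year from calendar_date) * 100 + extract(month from calendar_date) yields the same 201401.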

And, now we collect our statistics:

collect statistics on dte_t;

Then create our base view (alias fields as appropriate):

replace view dte_v as
lock table dte_t for access
select    
    cal_dte,
    cal_yr_mo_num
from dte_t
;

One level done. Now we need to manufacture the Year/Month level. Here we always roll up from our lowest level to ensure complete referential integrity within our hierarchy.

The same five steps as above: table definition, statistics definition, data load, statistics collection, and base view creation.

-- table definition: Year/Month
create multiset table yr_mo_t, no fallback (
    cal_yr_mo_num integer  not null,
    cal_yr_num    smallint not null,
    cal_mo_num    byteint  not null)
unique primary index (cal_yr_mo_num)
;

-- statistics definition: Year/Month
collect statistics on yr_mo_t index  (cal_yr_mo_num);
collect statistics on yr_mo_t column (partition);
collect statistics on yr_mo_t column (cal_yr_num);
collect statistics on yr_mo_t column (cal_mo_num);

-- data load: Year/Month
lock table dte_t for access
insert into yr_mo_t
select
    cal_dte / 100   +   190000 as cal_yr_mo_num,
    cal_dte / 10000 +   1900   as cal_yr_num,
    cal_dte / 100   mod 100    as cal_mo_num
from dte_t
group by 1,2,3
;

-- statistics collection: Year/Month
collect statistics on yr_mo_t;

-- view definition: Year/Month
replace view yr_mo_v as
lock table yr_mo_t for access
select
    cal_yr_mo_num,
    cal_yr_num,
    cal_mo_num
from yr_mo_t
;

Next up, Month and Quarter:

-- table definition: Month & Quarter
create multiset table mo_qtr_t, no fallback (
    cal_mo_num  byteint not null,
    cal_qtr_num byteint not null)
unique primary index (cal_mo_num)
;

-- statistics definition: Month & Quarter
collect statistics on mo_qtr_t index  (cal_mo_num);
collect statistics on mo_qtr_t column (partition);
collect statistics on mo_qtr_t column (cal_qtr_num);

-- data load: Month & Quarter
lock table yr_mo_t for access
insert into mo_qtr_t
select
    cal_mo_num,
    (cal_mo_num + 2) / 3 as cal_qtr_num
from yr_mo_t
group by 1
;

-- statistics collection: Month & Quarter
collect statistics on mo_qtr_t;

-- view definition: Month & Quarter
replace view mo_qtr_v as
lock table mo_qtr_t for access
select
    cal_mo_num,
    cal_qtr_num
from mo_qtr_t
;

Then for Year:

-- table definition: Year
create multiset table yr_t, no fallback (
    cal_yr_num smallint not null)
unique primary index (cal_yr_num)
;

-- statistics definition: Year
collect statistics on yr_t index  (cal_yr_num);
collect statistics on yr_t column (partition);

-- data load: Year
lock table yr_mo_t for access
insert into yr_t
select cal_yr_num
from yr_mo_t
group by 1;

-- statistics collection: Year
collect statistics on yr_t;

-- view definition: Year
replace view yr_v as
lock table yr_t for access
select cal_yr_num
from yr_t
;

Okay. Now our tables are built. One question you may have is, why did we build a Year table when the Year field already exists as a foreign key on the Year/Month table?
Two reasons:

  1. This is a starter set. You may want additional fields associated to a Year;
  2. We want to give the optimizer as much information about the cardinality of the table relationships in the hierarchy as possible.

Note the use of a UNIQUE PRIMARY INDEX on each table.

We have captured our statistics. Remember, statistics provide information regarding intra-table data demographics, and the same is true for the uniqueness constraint we applied via the UPI. However, the latter also sets us up for Foreign Key constraints. FKs let the optimizer understand the relationship between our tables (inter-table) and extrapolate the statistics across tables, and therefore the cardinality of the joins. This is the secret-sauce part: it glues it all together.

Let's apply our Foreign Keys:

-- Foreign Key definitions: Date Dimension
alter table dte_t add foreign key (cal_yr_mo_num)
    references with no check option yr_mo_t (cal_yr_mo_num);

alter table yr_mo_t add foreign key (cal_yr_num)
    references with no check option yr_t (cal_yr_num);

alter table yr_mo_t add foreign key (cal_mo_num)
    references with no check option mo_qtr_t (cal_mo_num);

Now that we have glued our tables together via FKs, we can create that dimensional star perspective with a view that brings together all of our components into one seamless, flattened presentation for our BI tool.

-- view definition: Date Dimension
replace view dte_dim_v as
select
    dte.cal_dte,
    dte.cal_yr_mo_num,
    mo.cal_yr_num,
    mo.cal_mo_num,
    qtr.cal_qtr_num
    -- any additional Year-level attributes would be selected from yr here

from dte_v dte -- Date

  -- Year / Month
  inner join yr_mo_v mo on mo.cal_yr_mo_num = dte.cal_yr_mo_num

  -- Month / Quarter
  inner join mo_qtr_v qtr on qtr.cal_mo_num = mo.cal_mo_num

  -- Year
  inner join yr_v yr on yr.cal_yr_num = mo.cal_yr_num
;

One thing to also remember is that Foreign Key constraints, in support of Soft RI, enable Join Elimination. This means that if, based on the fields selected, the optimizer can omit a join from the view without compromising the results, then the join will be skipped.
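
For example, a hypothetical query like the one below selects only Date-level fields from dte_dim_v. With the Soft RI constraints and unique primary indexes in place, the optimizer can drop the joins that contribute no selected columns from the plan:

-- hypothetical query: only Date-level fields are selected, so the
-- unused dimension joins are candidates for Join Elimination
select
    cal_dte,
    cal_yr_mo_num
from dte_dim_v
where cal_dte between date '2014-01-01' and date '2014-03-31'
;

An EXPLAIN of the query will show whether the joins were actually eliminated.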

Overall, there are other factors to consider:

  • Should everyone create their own version of a Date dimension, or should it be conformed once and centralized? Obviously, the optimal would be the latter, but there may be exception cases for the former.
  • Shouldn't base views contain their aliased English-language equivalents? Yes, but in the interest of keeping the example as small and simple as possible, they were omitted.
  • It should always be a practice to FK the fact table's, or any related table's, date fields to the date table here, and those tables should always have statistics captured on their date fields (a sketch follows this list).
  • Never create INTEGER versions of look-alike date fields on a table (e.g. 20140121). Always use a DATE field when the field truly is a date (e.g. 2014-01-21). A DATE field is truly an INTEGER internally, but the optimizer allows us to perform date functions on it.
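
As a sketch of the fact-table point above, assuming a hypothetical fact table sales_f with a sale_dte date field, the FK and statistics would look something like this:

-- hypothetical fact table: Soft RI to the Date table, plus statistics
alter table sales_f add foreign key (sale_dte)
    references with no check option dte_t (cal_dte);

collect statistics on sales_f column (sale_dte);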

I hope this provides some insight into the Teradata dimensional discussion and acts as a baseline cookbook recipe that helps you with your future processing.

Cheers!
-Brad