A recent theme I see among application groups is the need to create a Date Dimension table for Business Intelligence (BI) styles of reporting. Cognos works best with a dimensional model semantic layer underneath it.
For those who don't know what a dimensional model is, it is a data perspective where a main fact table, containing metrics like counts and dollars, is surrounded by dimensions that describe specific records within the fact. Some examples of dimensions include Time, Organization, Location, Product, etc. For instance, we could have a fact table representing sales and their associated dollars. Each sale has a corresponding set of dimensions: sales time, store location, product, etc. All concepts beyond the sales counts and sales dollars, in this sense, are dimensions.
There are two primary styles of dimensions: star and snowflake. A star takes all of the related data and flattens it into a single object (i.e. table or view), whereas a snowflake allows the dimensional data to remain in its true hierarchy, in a more normalized state. If we think of the Time dimension, the need may be to have Date, Year/Month, Month, Quarter, and Year. One usage requirement may be that the second quarter of last year is compared with the second quarter of this year. Now obviously, a date belongs to a month within a year (e.g. 1/21/2014 is part of January 2014); a month is the same month across years (e.g. 1/21/2013 and 1/21/2014 are both part of January); a month belongs to a quarter (e.g. January is part of Quarter 1); and a date is part of a year (e.g. 1/21/2014 is part of 2014). Therefore, flattening Date, Year/Month, Month, Quarter, and Year onto one dimension is a star; allowing each entity (i.e. Date, Year/Month, Month, Quarter, Year) to be a separate but related hierarchical entity is a snowflake. In the snowflake, the Month relates to the Date, which relates to the fact table.
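For the concrete Time example above, the two styles can be sketched in a few lines of Python. The column names anticipate the cookbook later in this post; the row values themselves are only illustrative:

```python
# A minimal sketch of the same Time hierarchy in both styles,
# using the 1/21/2014 example from the text.

# Snowflake: each level is its own normalized entity, related by keys.
month_row = {"cal_mo_num": 1, "cal_qtr_num": 1}                  # Month -> Quarter
yr_mo_row = {"cal_yr_mo_num": 201401, "cal_yr_num": 2014, "cal_mo_num": 1}
date_row  = {"cal_dte": "2014-01-21", "cal_yr_mo_num": 201401}   # Date -> Year/Month

# Star: the same hierarchy flattened into a single dimension row.
star_row = {
    "cal_dte":       date_row["cal_dte"],
    "cal_yr_mo_num": yr_mo_row["cal_yr_mo_num"],
    "cal_yr_num":    yr_mo_row["cal_yr_num"],
    "cal_mo_num":    month_row["cal_mo_num"],
    "cal_qtr_num":   month_row["cal_qtr_num"],
}
```

Either way the BI tool sees the same attributes; the difference is whether they live in one object or several related ones.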
We use Teradata, which does not require a dimensional model to perform the way other RDBMSs do. But BI tools typically want to see a star schema. The challenge is how to optimize our physical data implementation in Teradata while still providing a star-type perspective for the accessing BI tools. The simple answer: views. The transformation from a Teradata-optimized model to a BI-friendly one occurs in the view layer. However, there are some considerations to be made.
For this we will use the Date dimension as our example. I have seen multiple application groups build customized Date dimension tables, flattening many levels of hierarchy into a single table. Another option I have seen is the use of the System Calendar; it, too, is a flattened Date perspective, albeit via three levels of views on top of a table that contains solely a date field (all other fields are derived). Neither one is optimal. Why? Because, given proper statistics and foreign key constraints, Teradata's optimizer is extremely good at understanding cardinality across tables, and its physical architecture is extremely good at replicating small tables across its AMPs in memory. So what does that have to do with our Date dimension?
Simple: we need to store our data in a Teradata optimized fashion, but deliver it to our BI tools in a star-schema style of dimensional perspective. We know we will accomplish this via views. However, it is always better to illustrate it and provide a cookbook example. Here's mine.
First, we create our lowest level grain table: Date
-- table definition: Date
create multiset table dte_t, no fallback
( cal_dte       date    not null
, cal_yr_mo_num integer not null
)
unique primary index (cal_dte)
;
Two fields: Date (e.g. 1/1/2014, 1/2/2014, etc.) and the Year/Month (e.g. 201401).
Define our statistics:
-- statistics definition: Date
collect statistics on dte_t index (cal_dte);
collect statistics on dte_t column (partition);
collect statistics on dte_t column (cal_yr_mo_num);
Remember to always collect statistics on the keyword PARTITION, regardless of whether the table is partitioned or not. NPPI tables are considered, in their entirety, a logical partition of zero. :-)
Next, load the data from the System Calendar:
-- data load: Date
insert into dte_t
select calendar_date             as cal_dte
     , cal_dte / 100 + 190000    as cal_yr_mo_num -- e.g. 201401
from sys_calendar.calendar
-- adjust to the appropriate requirement
where calendar_date between date '1980-01-01' and current_date - 1
;
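The cal_yr_mo_num expression relies on Teradata's internal encoding of a DATE as the integer (year - 1900) * 10000 + month * 100 + day. A quick Python sketch of the arithmetic (a sanity check of the expression, not Teradata code):

```python
def td_internal_date(year: int, month: int, day: int) -> int:
    """Teradata's internal integer form of a DATE: (year-1900)*10000 + month*100 + day."""
    return (year - 1900) * 10000 + month * 100 + day

def yr_mo_num(internal: int) -> int:
    """Mirror of the SQL expression: cal_dte / 100 + 190000 (integer division)."""
    return internal // 100 + 190000

d = td_internal_date(2014, 1, 21)  # 1140121
print(yr_mo_num(d))                # 201401
```

Dropping the day with integer division and re-basing from 1900 to a four-digit year yields exactly the YYYYMM number we want.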
And, now we collect our statistics:
-- statistics collection: Date
collect statistics on dte_t;
Then create our base view (alias fields as appropriate):
-- view definition: Date
replace view dte_v as
lock table dte_t for access
select cal_dte
     , cal_yr_mo_num
from dte_t
;
One level done. Now we need to manufacture the Year/Month level. Here we always roll up from our lowest level to ensure complete referential integrity within our hierarchy.
The same five steps as above: table definition, statistics definition, data load, statistics collection, and base view creation.
-- table definition: Year/Month
create multiset table yr_mo_t, no fallback
( cal_yr_mo_num integer  not null
, cal_yr_num    smallint not null
, cal_mo_num    byteint  not null
)
unique primary index (cal_yr_mo_num)
;
-- statistics definition: Year/Month
collect statistics on yr_mo_t index (cal_yr_mo_num);
collect statistics on yr_mo_t column (partition);
collect statistics on yr_mo_t column (cal_yr_num);
collect statistics on yr_mo_t column (cal_mo_num);
-- data load: Year/Month
lock table dte_t for access
insert into yr_mo_t
select cal_dte / 100 + 190000   as cal_yr_mo_num
     , cal_dte / 10000 + 1900   as cal_yr_num
     , cal_dte / 100 mod 100    as cal_mo_num
from dte_t
group by 1,2,3
;
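The year and month come out of the same internal DATE encoding, (year - 1900) * 10000 + month * 100 + day, with one integer division each. Again as a Python sanity check of the two SQL expressions:

```python
def yr_num(internal: int) -> int:
    # Mirror of the SQL expression: cal_dte / 10000 + 1900 (integer division).
    return internal // 10000 + 1900

def mo_num(internal: int) -> int:
    # Mirror of the SQL expression: cal_dte / 100 mod 100.
    return internal // 100 % 100

internal = 1140121  # Teradata's internal form of 2014-01-21
print(yr_num(internal), mo_num(internal))  # 2014 1
```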
-- statistics collection: Year/Month
collect statistics on yr_mo_t;
-- view definition: Year/Month
replace view yr_mo_v as
lock table yr_mo_t for access
select cal_yr_mo_num
     , cal_yr_num
     , cal_mo_num
from yr_mo_t
;
Next up, Month and Quarter:
-- table definition: Month & Quarter
create multiset table mo_qtr_t, no fallback
( cal_mo_num  byteint not null
, cal_qtr_num byteint not null
)
unique primary index (cal_mo_num)
;
-- statistics definition: Month & Quarter
collect statistics on mo_qtr_t index (cal_mo_num);
collect statistics on mo_qtr_t column (partition);
collect statistics on mo_qtr_t column (cal_qtr_num);
-- data load: Month & Quarter
lock table yr_mo_t for access
insert into mo_qtr_t
select cal_mo_num
     , (cal_mo_num + 2) / 3 as cal_qtr_num
from yr_mo_t
group by 1
;
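The expression (cal_mo_num + 2) / 3 uses integer division to map months 1-3 to quarter 1, 4-6 to quarter 2, and so on. In Python terms:

```python
def quarter(month: int) -> int:
    # Mirror of the SQL expression: (cal_mo_num + 2) / 3 with integer division.
    return (month + 2) // 3

print([quarter(m) for m in range(1, 13)])
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
```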
-- statistics collection: Month & Quarter
collect statistics on mo_qtr_t;
-- view definition: Month & Quarter
replace view mo_qtr_v as
lock table mo_qtr_t for access
select cal_mo_num
     , cal_qtr_num
from mo_qtr_t
;
Then for Year:
-- table definition: Year
create multiset table yr_t, no fallback
( cal_yr_num smallint not null
)
unique primary index (cal_yr_num)
;
-- statistics definition: Year
collect statistics on yr_t index (cal_yr_num);
collect statistics on yr_t column (partition);
-- data load: Year
lock table yr_mo_t for access
insert into yr_t
select cal_yr_num
from yr_mo_t
group by 1
;
-- statistics collection: Year
collect statistics on yr_t;
-- view definition: Year
replace view yr_v as
lock table yr_t for access
select cal_yr_num
from yr_t
;
Okay. Now our tables are built. One question you may have is: why did we build a Year table when the Year field already exists as a foreign key on the Year/Month table? Because a Foreign Key must reference a unique column, and we guarantee that uniqueness with the UNIQUE PRIMARY INDEX on each table.
We have captured our statistics. Remember, statistics provide information about intra-table data demographics. The same is true for the uniqueness constraint we applied via the UPI. However, the UPI also sets us up for Foreign Key constraints. FKs let the optimizer understand the relationships between our tables (inter-table) and extrapolate the statistics, and therefore the cardinality, across them. This is the secret-sauce part. It glues it all together.
Let's apply our Foreign Keys:
-- Foreign Key definitions: Date Dimension
alter table dte_t add foreign key (cal_yr_mo_num)
  references with no check option yr_mo_t (cal_yr_mo_num);
alter table yr_mo_t add foreign key (cal_yr_num)
  references with no check option yr_t (cal_yr_num);
alter table yr_mo_t add foreign key (cal_mo_num)
  references with no check option mo_qtr_t (cal_mo_num);
Now that we have glued our tables together via FK's we can create that dimensional star perspective with a view that brings together all of our components into one seamless flattened presentation for our BI tool.
-- view definition: Date Dimension
replace view dte_dim_v as
select dte.cal_dte
     , dte.cal_yr_mo_num
     , mo.cal_yr_num
     , mo.cal_mo_num
     , qtr.cal_qtr_num
from dte_v dte         -- Date
-- Year / Month
inner join yr_mo_v mo
  on mo.cal_yr_mo_num = dte.cal_yr_mo_num
-- Month / Quarter
inner join mo_qtr_v qtr
  on qtr.cal_mo_num = mo.cal_mo_num
-- Year (add year-level attributes here as your Year table grows)
inner join yr_v yr
  on yr.cal_yr_num = mo.cal_yr_num
;
One more thing to remember: Foreign Key constraints, in support of soft RI, enable Join Elimination. This means that if the optimizer can omit a join from the view, based on the fields selected, without compromising the results, then the join will be skipped.
Overall, there are other factors to consider. Avoid INTEGER versions of look-alike date fields on a table (e.g. 20140121). Always use a DATE field when the field truly is a date (e.g. 2014-01-21). A DATE data typed field is truly an INTEGER internally, but the optimizer allows us to perform date functions on a DATE.
I hope this provides some insight into the Teradata dimensional discussion and acts as a baseline cookbook recipe that helps you with your future processing.