b¬äd†ødd

Dedicated to design and performance of databases and audio systems.

Null Surrogate Considerations

Whenever I see dimensional models physically implemented in Teradata, there are specific patterns that are embedded. One of these is the definition of key/foreign key fields as NOT NULL, whether explicitly or implicitly. The reasoning for this is that every join between a dimension and fact should be an INNER JOIN. BI tools expect it, users understand the concept better, and it is more efficient to execute.

The issue is that when values are non-existent, a null indication/value would exist in reality. However, nulls do not join. A null value is neither equal to a null value nor is a null value not not-equal to a null value. To resolve this obstacle, thereby enabling INNER JOINs, surrogate values are used to replace the null. Depending on the field's data type, the surrogate values are typically zero or -1 for integers, space(s) for characters, and some date value that is typically way out of bounds relative to the data for date fields. I have seen 1900-01-01 and 9999-12-31 as some of the more popular offerings.

Teradata is a MPP architecture. Each record resides on one, and only one, AMP. In order for Teradata to join records they must reside on the same AMP. To achieve this, records from one table, or both, involved in a join, unless AMP-local, are rehashed and redistributed in spool, to set up the join. The issue with this is that the values null, zero, and space, are all distributed to the null AMP--the last AMP. For example, a 100-AMP system contains AMPs numbered 0 through 99. In this case, AMP 99 is the null AMP.

Any join that relates integer and character null-surrogate values (i.e. zeroes and spaces), does so on the null AMP. For example, our fact table contains an ID field which is a foreign key to a dimension table. 20% of the fact records do not have a true, real-world value in this field, so they are defaulted to zero. The dimension table has a zero record allowing the join to occur between the dimension and the fact on an INNER JOIN basis. It may be populated with some description like N/A or the equivalent.

If the dimension table is small, Teradata will copy it into memory and replicate it across all of the AMPs. This will allow AMP-local joins of the fact and dimension to occur without having to rehash and redistribute the fact table. The most expensive processing is the redistribution of data to another AMP; therefore, the replication of small tables, across the AMPs, in memory, is very beneficial.

The problem is when the dimension table is too large to fit into memory. The join between the dimension and fact table will most likely require both tables to be rehashed and redistributed into spool. The dimension table will be flat across the AMPs, assuming a unique key value being joined on. However, remember that the fact table had 20% of its records' foreign key populated with the null surrogate value of zero. Where do these null-surrogate records redistribute to? The null AMP. This is skew--I/O skew, CPU skew, Impact I/O, Impact CPU.

Null values do not join to null values. In the MPP world, skew is the Achilles heel. Null is present for a reason. It is removed from the tables before the join occurs.

To be continued...

I hope this helps with your future Teradata designs.

Cheers
-Brad