Dedicated to design and performance of databases and audio systems.
In Teradata the
PRIMARY INDEX determines on which of the parallel AMPs a particular record will be placed. This occurs by the optimizer hashing the Primary Index field(s) through its proprietary algorithm. Therefore, a query constrained by a Primary Index value will retrieve the appropriate record directly since the optimizer can address the
HASHROW value, map that to the
HASHBUCKET value, and then in turn map that to the
HASHAMP. Additionally, joining data within the same AMP is far faster than having to rehash and redistribute data across the AMPs during a query.
The two considerations for a
PRIMARY INDEX are
In that order. We may override the desire of a ruler-flat data distribution for a significant performance gain. But, it is a conscious decision. The selection of a
PRIMARY INDEX should never be random or without purpose.
Teradata recommends a data skew threshold of no more than 5% across the AMPs. It is a rule of thumb, but not absolute. Additionally, small tables will skew by their nature; understand the appropriate size threshold for your system and do not be overly concerned with the small.
The method I use to measure skew of a prospective
PRIMARY INDEX candidate is as follows. Note, the data need to be present, but the field being tested does not need to currently be the table's
Here the System Calendar is used as an example that anyone can perform, even in a development environment.
/* Skew approximator based on record counts (instead of Perm space) which allows for the testing of Primary Index candidate field(s) without having to load/reload a table to determine. note: all records will have the same "Teradata variance %" and "Teradata threshold (5%)" values as they apply to the entire table result, not at each AMP-level. */ select hashamp(hashbucket(hashrow(calendar_date))) as "AMP", count(*) as "row count", "row count" * 100.00 / sum("row count") over() as "% of total rows", cast((max("row count") over() - min("row count") over()) * 100.00 / nullif(min("row count") over(),0) as decimal(9,2)) as "Teradata variance %", case when "Teradata variance %" < 5 then 'pass' else 'fail' end as "Teradata threshold (5%)" from sys_calendar.calendar group by 1 order by 1 ;
I hope this helps with your engineering.