Dedicated to design and performance of databases and audio systems.

Primary Index Skew

In Teradata the PRIMARY INDEX determines on which of the parallel AMPs a particular record will be placed. This occurs by the optimizer hashing the Primary Index field(s) through its proprietary algorithm. Therefore, a query constrained by a Primary Index value will retrieve the appropriate record directly since the optimizer can address the HASHROW value, map that to the HASHBUCKET value, and then in turn map that to the HASHAMP. Additionally, joining data within the same AMP is far faster than having to rehash and redistribute data across the AMPs during a query.

The two considerations for a PRIMARY INDEX are

  • performance (i.e. one-AMP operations or AMP-local joins)
  • data distribution

In that order. We may override the desire of a ruler-flat data distribution for a significant performance gain. But, it is a conscious decision. The selection of a PRIMARY INDEX should never be random or without purpose.

Teradata recommends a data skew threshold of no more than 5% across the AMPs. It is a rule of thumb, but not absolute. Additionally, small tables will skew by their nature; understand the appropriate size threshold for your system and do not be overly concerned with the small.

The method I use to measure skew of a prospective PRIMARY INDEX candidate is as follows. Note, the data need to be present, but the field being tested does not need to currently be the table's PRIMARY INDEX.

Here the System Calendar is used as an example that anyone can perform, even in a development environment.

    Skew approximator based on record counts (instead of Perm space)
    which allows for the testing of Primary Index candidate field(s)
    without having to load/reload a table to determine.

    note: all records will have the same "Teradata variance %" and
    "Teradata threshold (5%)" values as they apply to the entire
    table result, not at each AMP-level.

    hashamp(hashbucket(hashrow(calendar_date)))    as "AMP",
    count(*)                                       as "row count",
    "row count" * 100.00 / sum("row count") over() as "% of total rows",

    cast((max("row count") over() -
          min("row count") over()) * 100.00 /
   nullif(min("row count") over(),0)
         as decimal(9,2)) as "Teradata variance %",

        when "Teradata variance %" < 5
        then 'pass'
        else 'fail'
    end as "Teradata threshold (5%)"

from sys_calendar.calendar

group by 1
order by 1

I hope this helps with your engineering.