b¬äd†ødd

Dedicated to design and performance of databases and audio systems.

Top 10 Core Best Practices

During Teradata code review sessions I often find a few of the following fundamentals overlooked. In that spirit, these are my top 10 core best practices.

  1. SET versus MULTISET tables:

     Tables should be defined as MULTISET with NO FALLBACK unless there is a very specific reason to have a SET and/or FALLBACK protected definition. MULTISET omits the duplicate row check, saving overhead, while NO FALLBACK ensures there are not four copies of the table stored on disk: Teradata is RAID 1/mirrored internally by default, so a redundant copy already exists in case of drive failure.
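
     A minimal DDL sketch of this pattern (database, table, and column names are purely illustrative):

        CREATE MULTISET TABLE sales_db.daily_sales ,NO FALLBACK
        (
            sale_id   INTEGER NOT NULL,
            store_id  INTEGER NOT NULL,
            sale_dt   DATE NOT NULL,
            sale_amt  DECIMAL(9,2)
        )
        PRIMARY INDEX ( sale_id );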

  2. PRIMARY INDEX selection:

     The choice of a table's PRIMARY INDEX is the single most important physical design decision. The PRIMARY INDEX is not necessarily the PRIMARY KEY. If it is unique, it should be defined as a UNIQUE PRIMARY INDEX. If the table does not have any good candidate field(s) for the PRIMARY INDEX, then defining the table as NO PRIMARY INDEX is acceptable (e.g. some staging tables). The two important considerations for the PRIMARY INDEX are
    • data distribution across the MPP architecture (i.e. the AMPs); and
    • join access, which is the more important of the two.
    Therefore, reasonable skew of data at rest is acceptable when it buys processing efficiency gains. A DDL sketch of both options follows.
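
     For example, using hypothetical tables: a unique key becomes a UNIQUE PRIMARY INDEX, while a staging table with no good candidate can be created with NO PRIMARY INDEX:

        CREATE MULTISET TABLE sales_db.store ,NO FALLBACK
        (
            store_id  INTEGER NOT NULL,
            store_nm  VARCHAR(100) NOT NULL
        )
        UNIQUE PRIMARY INDEX ( store_id );

        CREATE MULTISET TABLE stage_db.store_load ,NO FALLBACK
        (
            store_id  INTEGER,
            store_nm  VARCHAR(100)
        )
        NO PRIMARY INDEX;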

  3. Efficient data typing:

     Fields should use efficient data types. If they always contain values, they should be defined as NOT NULL; again, further efficiencies are realized. Examples of efficient data typing include defining code fields (e.g. USPS State Code: MA, NH) and other fixed-length character fields as CHARACTER and not VARCHAR. Today we MVC-compress low-cardinality fixed-length fields (this was the case prior to the introduction of VLC, with which variable-length fields can also be compressed). Additionally, date fields are defined as DATE, not TIMESTAMP or INTEGER. TIMESTAMP is internally a concatenation of Year, Month, Day, Hour, Minute, and Second fields, and is not nearly as efficient as DATE. Likewise, turning date values into a look-alike INTEGER saves nothing in storage but removes the date math capabilities inherent in the DATE data type. Also, avoid storing numerical values in fields much larger than required (e.g. if DECIMAL(9,2) suffices then don't use DECIMAL(18,2)), and don't exceed DECIMAL(18,x) without a specific reason to do so.
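
     A sketch of these choices on a hypothetical detail table (names and sizes are illustrative):

        CREATE MULTISET TABLE sales_db.sale_detail ,NO FALLBACK
        (
            sale_id   INTEGER NOT NULL,
            state_cd  CHARACTER(2) NOT NULL,   -- fixed-length code, not VARCHAR
            sale_dt   DATE NOT NULL,           -- DATE, not TIMESTAMP or INTEGER
            sale_amt  DECIMAL(9,2) NOT NULL    -- right-sized, not DECIMAL(18,2)
        )
        PRIMARY INDEX ( sale_id );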

  4. Multi-Value Compression:

     Now that our tables and fields are defined, we need to specify our Multi-Value Compression. MVC offers reduced disk space consumption and increased access performance, typically saving about 30% of required disk space on average. Thinner records equate to more records read per I/O, and less I/O leads to faster-executing queries. The laws of physics.
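
     A sketch of a COMPRESS specification on low-cardinality columns of the same hypothetical detail table (the value lists are purely illustrative):

        CREATE MULTISET TABLE sales_db.sale_detail ,NO FALLBACK
        (
            sale_id    INTEGER NOT NULL,
            state_cd   CHARACTER(2) NOT NULL
                COMPRESS ('MA','NH','VT','ME','RI','CT'),
            status_cd  CHARACTER(1) NOT NULL
                COMPRESS ('A','I','P'),
            sale_amt   DECIMAL(9,2) NOT NULL
        )
        PRIMARY INDEX ( sale_id );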

  5. Statistics:

     Database statistics are core and fundamental to Teradata; the cost-based optimizer relies on them heavily. Approximately 90% of all performance problems are due to missing or stale/bad statistics. Teradata's rule of thumb is to refresh statistics when the data changes by 10% or more--solely a guideline. I would be far more risk averse and place it at 5% or less. Ultimately, I consider fixed-period collection to work well (e.g. daily, weekly). Additionally, I believe single-column statistics tend to be less confusing to the optimizer, given the overlap of multi-column statistics with their single-column versions; therefore, the core is single-column. At a minimum, the following fields should contain statistics (a collection sketch follows the list):
    • the PRIMARY INDEX
    • the pseudo field PARTITION
    • foreign key fields
    • predicates: fields used as criteria in an ON or WHERE clause
    • secondary indexes (multi-column statistics if the secondary index is multi-column, and the collection is defined as INDEX)
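
     A collection sketch against the hypothetical sales_db.daily_sales table, assuming a secondary index exists on (store_id, sale_dt); the syntax shown is one common form:

        COLLECT STATISTICS ON sales_db.daily_sales COLUMN ( sale_id );
        COLLECT STATISTICS ON sales_db.daily_sales COLUMN ( PARTITION );
        COLLECT STATISTICS ON sales_db.daily_sales COLUMN ( store_id );
        COLLECT STATISTICS ON sales_db.daily_sales INDEX ( store_id, sale_dt );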

  6. Foreign Keys:

     While statistics give the optimizer information about a table's demographics, they are intra-table. We rarely use just one table in a query; joins to other tables are far more common. Defining FOREIGN KEYs between tables therefore lets the optimizer understand the intended relationships, giving it inter-table demographics from which to extrapolate cardinality. FOREIGN KEY constraints defined on a SoftRI basis (i.e. without database referential integrity checks) also allow Join Elimination to be leveraged: the optimizer may omit joins from a query or view if it deems that the join will not affect the integrity of the query's outcome and the join processing would otherwise be superfluous. Efficiency gained.
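
     A SoftRI sketch between the hypothetical tables (constraint and column names are illustrative):

        ALTER TABLE sales_db.daily_sales
            ADD CONSTRAINT fk_daily_sales_store
            FOREIGN KEY ( store_id )
            REFERENCES WITH NO CHECK OPTION sales_db.store ( store_id );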

  7. Locking modifiers:

     Table-accessing queries should go through a view. The base view over the table should contain an access lock: LOCKING TABLE table_name FOR ACCESS, or the row-level version at a minimum. Access locks remove another level of overhead and mitigate contention with write locks. Another efficiency gain.
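
     A minimal sketch of the row-level version over the hypothetical table (database and view names are illustrative):

        REPLACE VIEW sales_vw.daily_sales AS
        LOCKING ROW FOR ACCESS
        SELECT *
        FROM sales_db.daily_sales;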

  8. ANSI SQL:

     Queries and views should be written in ANSI SQL whenever there is an equivalent function for the need. ANSI SQL tends to perform equally well or better and is typically more efficient for accessing queries and views.
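
     For instance, against the hypothetical table, preferring the ANSI COALESCE over the Teradata-specific ZEROIFNULL (and SELECT over the SEL abbreviation):

        -- Teradata-specific
        SEL store_id, ZEROIFNULL(sale_amt)
        FROM sales_db.daily_sales;

        -- ANSI equivalent
        SELECT store_id, COALESCE(sale_amt, 0)
        FROM sales_db.daily_sales;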

  9. OAFs:

     Ordered Analytical Functions can, though not always, offer increased efficiency by reducing the number of database objects brought into a query. This is especially true when doing some form of chronological sequencing of the data. Prior to OAFs, these types of processes required self-joins: a copy of the table had to be spun up into spool for the join to occur. An OAF reduces the number of objects/tables involved and in many cases requires only a single pass of the specific tables; less I/O is required to scan a table once versus many times. Examples include the need to look back or forward in time, across records within a key, for the prior/next chronological value of a field. OAFs can be very powerful for specific tasks.
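
     A sketch of the look-back pattern on the hypothetical table: the prior chronological value within a key is pulled in a single pass instead of a self-join through spool:

        SELECT store_id,
               sale_dt,
               sale_amt,
               MAX(sale_amt) OVER (PARTITION BY store_id
                                   ORDER BY sale_dt
                                   ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prior_sale_amt
        FROM sales_db.daily_sales;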

  10. EXPLAIN plan:

     After the tables are created, the data loaded, statistics collected, FKs defined, and the views and queries built, we need to validate it all, not only from a results perspective but from a processing perspective as well. The EXPLAIN plan defines the exact steps and their sequence. It can provide valuable information about where a process may deviate from what was originally anticipated and help with debugging. It is a powerful tool that should be used to ensure efficiency and quality.
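
     Prefixing any query with EXPLAIN returns the plan without executing it; for example, against the hypothetical table:

        EXPLAIN
        SELECT store_id, SUM(sale_amt)
        FROM sales_db.daily_sales
        WHERE sale_dt = DATE '2015-01-01'
        GROUP BY store_id;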

I hope this is helpful and serves as a guideline for your present and future Teradata development.

Cheers
-Brad