Dedicated to design and performance of databases and audio systems.

Reducing I/O

During the Teradata Partners Conference in late October, Stephen Brobst, Teradata's CTO, presented a session on the evolution of data volume, CPU power, and HDD density. To summarize:

  • data is being created so quickly that the volume produced in the last three years exceeds that of the prior 40,000 years;
  • CPUs are five million times more powerful than thirty years ago;
  • while HDD density has doubled every eighteen to twenty-four months in recent years, mechanical speed has only increased about fivefold over the last thirty years.

As disk drives grow in capacity through density gains, access performance can decline: requests stack up and queue lengths rise. The cost per TB of storage may be falling, but the same volume of data now sits on fewer spindles. It's like taking a two-story parking garage and adding two more stories: capacity has doubled, but unless more entrances and exits are added, throughput slows.

With Teradata being an MPP architecture, are there I/O concerns with higher-capacity HDDs? Yes. However, if we consider data to have a temperature, then our less-used cold data can be moved to slower, highly dense, lower-cost-per-TB HDDs, while our frequently used hot data is moved to faster storage: memory, SSD, or smaller, faster HDDs. The result is a lower cost per I/O. But we also need to work smarter. Being able to query our data on faster storage media is not the end of the story; we should also find better methods to reduce I/O from the outset. Combine reduced I/O with faster media and the gains compound.

Teradata will tell you that they provide four methods to reduce I/O:

  • horizontal partitioning via PPI -- omit the partitions of the table you do not require;
  • synchronized scans -- if another query is already reading the table your query requires, piggy-back on the first query's scan and use that subset to help satisfy your request;
  • vertical partitioning -- the Columnar database capability within Teradata 14;
  • Advanced Indexing -- typically some form of a Join Index: Sparse, Aggregate, Single-table, etc.
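
As a sketch of the first technique, the DDL below shows a month-partitioned table and a query that benefits from partition elimination. The table, columns, and date ranges are hypothetical examples of my own, not anything from Teradata's session:

```sql
-- Hypothetical sales table, row-partitioned by month on sale_date.
CREATE TABLE sales
( sale_id   INTEGER NOT NULL
, store_id  INTEGER NOT NULL
, sale_date DATE    NOT NULL
, amount    DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N(sale_date BETWEEN DATE '2010-01-01'
                               AND     DATE '2012-12-31'
                               EACH INTERVAL '1' MONTH);

-- The date predicate lets the optimizer read only the three partitions
-- covering Q1 2012 instead of scanning the whole table.
SELECT store_id, SUM(amount) AS q1_sales
FROM   sales
WHERE  sale_date BETWEEN DATE '2012-01-01' AND DATE '2012-03-31'
GROUP BY store_id;
```

Running EXPLAIN on the SELECT should confirm that only a few partitions of the table are read rather than all of it.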

However, I feel that they have omitted three other candidates:

  • Primary Index alignment: co-location of data means joins are more efficient by eliminating the I/O related to rehashing/redistribution in spool required to set up the join;
  • Join Elimination via Soft RI and Foreign Key constraints: any superfluous process that is eliminated removes I/O with it;
  • Ordered Analytical Functions: a single pass through a table that eliminates the need for a self-join can be orders of magnitude faster due to the overall I/O reduction.
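
To make two of these candidates concrete, here is a sketch in Teradata-style SQL; again, every table and column name is a hypothetical stand-in of my own:

```sql
-- Join elimination via soft RI: WITH NO CHECK OPTION declares the
-- relationship without enforcing it, so when a view joins sales to
-- stores but a query selects no stores columns, the optimizer can
-- skip the join -- and its I/O -- entirely.
ALTER TABLE sales ADD FOREIGN KEY (store_id)
  REFERENCES WITH NO CHECK OPTION stores (store_id);

-- Ordered analytical function replacing a self-join: the day-over-day
-- balance change needs only one pass over the table plus a sort in
-- spool, instead of two reads plus a join.
SELECT account_id
     , bal_date
     , balance
       - MIN(balance) OVER (PARTITION BY account_id
                            ORDER BY bal_date
                            ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
       AS daily_change
FROM   account_balance;
```

The windowed MIN over exactly the one preceding row stands in for a "previous row" lookup; the self-join it replaces would read account_balance twice and redistribute both copies in spool to set up the join.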

MPP systems are already fast at I/O. It is not a question of system capacity; rather, it's how we continually strive to raise efficiency instead of just throwing hardware at the problem. MPP is brute-force capable by nature, but adding the finesse of I/O-reducing techniques improves speed and efficiency dramatically.

I hope this provides some ideas for consideration in your environment.