bäddødd

Dedicated to design and performance of databases and audio systems.

Transformation in the View Layer

The other day I was reviewing some SQL code for a view. One field in the view was derived from a CASE statement that transformed a numeric value into a character: the field was an indicator, so zero and one became N and Y, respectively. This is simple enough, and certainly an option. The question it raises, however, is whether the derived field will be used as criteria, as part of a join, or solely as output within a SELECT statement. If it is used as a criterion or a join condition, there is an efficiency problem.
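The pattern looked roughly like this. (The table and column names below are hypothetical, sketched only to illustrate the shape of the view under review.)

-- hypothetical view mimicking the one under review
replace view v_account as
select
  acct_id,

  case active_i       -- 0/1 numeric indicator stored in the base table
    when 0 then 'N'
    else 'Y'
  end as active_chr_i -- N/Y character indicator derived in the view

from account_t
;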

Statistics are collected on a table field basis. You cannot collect statistics on a view. Therefore, if we collect statistics on our numeric indicator field, the optimizer knows how many zeroes and ones reside in that field. It does not know how many perceived Y's and N's there are.

When the optimizer does not know the demographics of a table's field, it can perform Random AMP Sampling (RAS) to estimate them. With a derived field there is no RAS. Therefore, the optimizer will typically resort to brute force and simply full-table scan the tables in question. For a small query, or one with few objects, this is not an issue. However, as queries grow larger and more sophisticated, more joins are typically involved. More joins mean more query steps, and the number of possible query execution paths grows combinatorially. Because confidence in a step cascades from the steps before it, a lack of optimizer confidence in one step can mean confidence is lost altogether, and inefficient queries result. Inefficient queries can run many times longer than normal.

Most steps in a query plan are one of two types: RETRIEVE or JOIN. A JOIN involves two, and only two, objects. As a query progresses, these steps access intermediate work tables, known as SPOOL, created by previous steps. If there is no confidence in the step that created a SPOOL table, then the step consuming that SPOOL table will likewise have no confidence.

So, how do we solve this problem? Move the 0/1-to-N/Y transformation into the ETL process and populate the table with N/Y instead of 0/1 in the first place? Yes, that would be the best option. Another option, however, is to transform 0/1 to N/Y via a JOIN to a small decode table of two records: 0/N and 1/Y. Collecting statistics on this table is critical, and adding a foreign key constraint helps enormously. This way the optimizer can extrapolate cardinality between the tables.
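As a sketch, a view decoding the indicator via such a join might look like this. (The table and column names are hypothetical, for illustration only; the decode table is the small, statistics-bearing table described above.)

-- hypothetical view rewritten to decode via a join to a statistics-bearing table
replace view v_account as
select
  acct.acct_id,
  ind.chr_i as active_chr_i

from account_t acct

  inner join i_decode_t ind on ind.num_i = acct.active_i
;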

Here is the example:

-- table definition
create multiset table bt_fact_t, no fallback (
  id  integer not null,
  i   byteint not null,
  txt varchar(255)) -- to add width and weight to the record
primary index (id)
;
 
-- statistics definition
collect statistics on bt_fact_t index  (id);
collect statistics on bt_fact_t column (partition);
collect statistics on bt_fact_t column (i);
 
-- data load: 7,341,400 records
insert into bt_fact_t
select
  cal.day_of_calendar * 100 + prd.doc,
 
  case cal.day_of_week
      when 1
      then 1
      else 0
  end as num_i,
 
  substring('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.' from 1 for random(100,255))
 
from sys_calendar.calendar cal -- System Calendar
 
  -- product join multiplier * 100
  cross join (
    select day_of_calendar - 1 as doc
    from sys_calendar.calendar -- System Calendar
    where day_of_calendar <= 100) prd
;

-- statistics collection
collect statistics on bt_fact_t;
 
-- test 1: character indicator = 'Y' via view
explain
lock table bt_fact_t for access
select
  id,
  txt
 
from ( -- view mimic
  select
    id,
    i as num_i,

    case i -- transformation of numeric indicator value to character indicator value
      when 0
      then 'N'
      else 'Y'
    end as chr_i,

    txt

  from bt_fact_t) foo
 
where chr_i = 'Y'
;

Produces:

  1) First, we lock bt_fact_t for access.
  2) Next, we do an all-AMPs RETRIEVE step from bt_fact_t
 	by way of an all-rows scan with a condition of ("(( CASE WHEN
 	(bt_fact_t.i = 0) THEN ('N') ELSE ('Y') END ))= 'Y'")
 	into Spool 1 (group_amps), which is built locally on the AMPs.
 	The size of Spool 1 is estimated with no confidence to be 734,140
 	rows (82,223,680 bytes).  The estimated time for this step is 0.29
 	seconds.
  3) Finally, we send out an END TRANSACTION step to all AMPs involved
 	in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
 	statement 1.  The total estimated time is 0.29 seconds.

Next, add the decode/translation table:

-- table definition: add a decode/translation table
create multiset table bt_i_t, no fallback (
  num_i byteint      not null,
  chr_i character(1) not null)
unique primary index (chr_i)
unique         index (num_i)
;
 
-- statistics definition
collect statistics on bt_i_t index  (num_i);
collect statistics on bt_i_t column (partition);
collect statistics on bt_i_t index  (chr_i);
 
-- data load
insert into bt_i_t (0,'N');
insert into bt_i_t (1,'Y');
 
-- statistics collection
collect statistics on bt_i_t;
 
-- foreign key definition
alter table bt_fact_t 
    add foreign key (i) 
    references with no check option bt_i_t (num_i)
;

-- test 2: character indicator = 'Y' via join
explain
lock table bt_fact_t for access
lock table bt_i_t    for access
 
select
  id,
  txt
 
from (
    select
      fct.id,
      ind.chr_i,
      fct.txt

    from bt_fact_t fct

      inner join bt_i_t ind on ind.num_i = fct.i) foo
 
where chr_i = 'Y'
;

Now we have:

  1) First, we lock bt_i_t for access, and we lock
 	bt_fact_t for access.
  2) Next, we do a single-AMP RETRIEVE step from ind by
 	way of the unique primary index "ind.chr_i = 'Y'"
 	with no residual conditions into Spool 2 (all_amps), which is
     duplicated on all AMPs.  The size of Spool 2 is estimated with
     high confidence to be 125 rows (1,750 bytes).  The estimated time
 	for this step is 0.01 seconds.
  3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
 	all-rows scan, which is joined to fct by way of an
 	all-rows scan with no residual conditions.  Spool 2 and
 	fct are joined using a product join, with a join
 	condition of ("num_i = fct.i").  The result goes into
 	Spool 1 (group_amps), which is built locally on the AMPs.  The
 	size of Spool 1 is estimated with low confidence to be 6,292,700
 	rows (704,782,400 bytes).  The estimated time for this step is
 	0.68 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
 	in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
 	statement 1.  The total estimated time is 0.69 seconds.

High confidence is gained on the RETRIEVE step, and low confidence is present on the JOIN. Low confidence is typically the norm for a JOIN, and the optimizer treats low confidence much the same as high confidence.

This was an overly simplistic example; both queries here will run very fast. However, if more objects were involved, the latter approach would be far more beneficial, carrying confidence forward to later query steps that consume the Spool 1 created in step 3.

From time to time I hear concerns that performing a JOIN is more costly than applying straight criteria via a WHERE clause. It depends. Here, the decode/translation table is accessed and replicated across the AMPs in 0.01 seconds. Then each AMP uses that value as its criterion when accessing the large table. Confidence is kept.

I hope this helps with your future transformation needs.

Cheers
-Brad