The other day I was reviewing some SQL code for a view. One field in the view was based on a CASE statement that transformed a numeric value to a character value. The field was an indicator, so zero and one became N and Y, respectively. This is simple enough, and it is certainly an option. The question it poses, however, is whether the derived field will be used as a criterion, as part of a join, or solely as output in a SELECT statement. If it is used as a criterion or join condition, there is an efficiency problem.
Statistics are collected on a table field basis. You cannot collect statistics on a view. Therefore, if we collect statistics on our numeric indicator field, the optimizer knows how many zeroes and ones reside in that field. It does not know how many perceived Y's and N's there are.
When the optimizer does not know the demographics of a table's field, it can perform random AMP sampling (RAS) to attempt to estimate them. With a derived field there is no RAS. Therefore, the optimizer will typically fall back to brute force and full-table scan the tables in question. For a small query, or one with few objects, this is not an issue. However, as queries get larger and more sophisticated, more joins are typically involved. More joins mean more query steps, and that is where the number of possible query execution paths really gets large. Since confidence in a given step cascades from previous steps, losing confidence early can mean losing it altogether, and inefficient queries result. An inefficient query can run many times longer than it should.
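If you want to see which column statistics the optimizer actually has to work with on a table, HELP STATISTICS lists what has been collected (the table name below is just a placeholder):
-- illustrative only: list the collected statistics on a table
help statistics my_db.my_fact_t;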
The majority of steps in a query plan are one of two types: RETRIEVE or JOIN. A JOIN step involves two, and only two, objects. As a query progresses, later steps access derived work tables, known as SPOOL, that were created by previous steps. If there is no confidence in the step that created a SPOOL table, then the step consuming that SPOOL table will likewise have no confidence.
So, how do we solve this problem? Move the 0/1 to N/Y transformation into the ETL process and populate the table with N/Y instead of 0/1 in the first place? Yes, that would be the best option. Another option, however, is to transform 0/1 to N/Y via a JOIN to a small decode table with two records: 0/N and 1/Y. Collecting statistics on this table is critical, and adding a foreign key constraint helps enormously, because it lets the optimizer extrapolate cardinality between the tables.
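For the first option, decoding during the ETL rather than at query time, a minimal sketch might look like this; the staging and target names are purely illustrative:
-- illustrative only: decode 0/1 to N/Y once, during the load (names made up)
insert into tgt_db.fact_t (id, chr_i, txt)
select
src.id,
case src.i
when 0
then 'N'
else 'Y'
end,
src.txt
from stg_db.fact_stg src
;
collect statistics on tgt_db.fact_t column (chr_i);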
Here is the full example for the second, decode-table approach:
-- table definition
create multiset table bt_fact_t, no fallback (
id integer not null,
i byteint not null,
txt varchar(255)) -- to add width and weight to the record
primary index (id)
;
-- statistics definition
collect statistics on bt_fact_t index (id);
collect statistics on bt_fact_t column (partition);
collect statistics on bt_fact_t column (i);
-- data load: 7,341,400 records
insert into bt_fact_t
select
cal.day_of_calendar * 100 + prd.doc,
case cal.day_of_week
when 1
then 1
else 0
end as num_i,
substring('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.' from 1 for random(100,255))
from sys_calendar.calendar cal -- System Calendar
-- product join multiplier * 100
cross join (
select day_of_calendar - 1 as doc
from sys_calendar.calendar -- System Calendar
where day_of_calendar <= 100) prd
;
-- statistics collection
collect statistics on bt_fact_t;
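As a quick sanity check on the demographics: the load above sets the indicator to 1 when day_of_week is 1 and to 0 otherwise, so roughly one row in seven should carry a 1. The following verification query is my addition, not part of the original test:
-- optional check: how many zeroes and ones did we actually load?
select
i,
count(*) as row_cnt
from bt_fact_t
group by i
order by i
;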
-- test 1: character indicator = 'Y' via view
explain
lock table bt_fact_t for access
select
id,
txt
from ( -- view mimic
select
id,
i as num_i,
case i -- transformation of numeric indicator value to character indicator value
when 0
then 'N'
else 'Y'
end as chr_i,
txt
from bt_fact_t) foo
where chr_i = 'Y'
;
Produces
1) First, we lock bt_fact_t for access.
2) Next, we do an all-AMPs RETRIEVE step from bt_fact_t
by way of an all-rows scan with a condition of ("(( CASE WHEN
(bt_fact_t.i = 0) THEN ('N') ELSE ('Y') END ))= 'Y'")
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with no confidence to be 734,140
rows (82,223,680 bytes). The estimated time for this step is 0.29
seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.29 seconds.
Next, add the decode/translation table:
-- table definition: add a decode/translation table
create multiset table bt_i_t, no fallback (
num_i byteint not null,
chr_i character(1) not null)
unique primary index (chr_i)
unique index (num_i)
;
-- statistics definition
collect statistics on bt_i_t index (num_i);
collect statistics on bt_i_t column (partition);
collect statistics on bt_i_t index (chr_i);
-- data load
insert into bt_i_t (0,'N');
insert into bt_i_t (1,'Y');
-- statistics collection
collect statistics on bt_i_t;
-- foreign key definition
alter table bt_fact_t
add foreign key (i)
references with no check option bt_i_t (num_i)
;
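Because WITH NO CHECK OPTION creates soft referential integrity, nothing actually validates the fact rows against the decode table, so it is worth checking for orphan indicator values yourself before relying on the constraint. This check is an optional addition of mine:
-- optional check: any indicator values the decode table cannot resolve?
select distinct fct.i
from bt_fact_t fct
left join bt_i_t ind on ind.num_i = fct.i
where ind.num_i is null
;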
-- test 2: character indicator = 'Y' via join
explain
lock table bt_fact_t for access
lock table bt_i_t for access
select
id,
txt
from (
select
fct.id,
ind.chr_i,
fct.txt
from bt_fact_t fct
inner join bt_i_t ind on ind.num_i = fct.i) foo
where chr_i = 'Y'
;
Now we have
1) First, we lock bt_i_t for access, and we lock
bt_fact_t for access.
2) Next, we do a single-AMP RETRIEVE step from ind by
way of the unique primary index "ind.chr_i = 'Y'"
with no residual conditions into Spool 2 (all_amps), which is
duplicated on all AMPs. The size of Spool 2 is estimated with
high confidence to be 125 rows (1,750 bytes). The estimated time
for this step is 0.01 seconds.
3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
all-rows scan, which is joined to fct by way of an
all-rows scan with no residual conditions. Spool 2 and
fct are joined using a product join, with a join
condition of ("num_i = fct.i"). The result goes into
Spool 1 (group_amps), which is built locally on the AMPs. The
size of Spool 1 is estimated with low confidence to be 6,292,700
rows (704,782,400 bytes). The estimated time for this step is
0.68 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.69 seconds.
High confidence is gained on the RETRIEVE step, and low confidence is present on the JOIN step. Low confidence is typically the norm on a JOIN, and the optimizer treats low confidence much the same as high confidence.
This was an overly simplistic example. Both queries here will run very fast. However, if there were more objects involved, the latter approach would be far more beneficial, because it carries confidence forward to the later query steps that consume the Spool 1 created in step 3.
From time to time I hear concerns that performing a JOIN is more costly than applying straight criteria via a WHERE clause. It depends. Here, the decode/translation table is accessed and replicated across the AMPs in an estimated 0.01 seconds. Each AMP then uses that value as its criterion when accessing the large table, and confidence is kept.
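If you take the decode-table route, the object you actually publish to users would typically be a view over that join. A minimal sketch (the view name is illustrative) might be:
-- illustrative only: a view exposing the character indicator via the decode table
create view bt_fact_v as
select
fct.id,
ind.chr_i,
fct.txt
from bt_fact_t fct
inner join bt_i_t ind on ind.num_i = fct.i
;
A query against this view with chr_i = 'Y' in the WHERE clause should produce essentially the plan shown in test 2.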
I hope this helps with your future transformation needs.
Cheers
-Brad