b¬äd†ødd

Dedicated to design and performance of databases and audio systems.

ANSI Joins and Criteria

Over time, I have come across numerous queries that contained multiple objects (i.e. tables, views) joined together in an ANSI style, via the ON clause; however, the criteria imposed were solely imposed within the WHERE clause. This is not a best practice. Typically only criteria applicable to the FROM object should be contained in the WHERE clause. There are exceptions which I have noted below.

Here I break down some of the core components that should be understood by anyone coding Teradata queries involving joins.

To begin with, Teradata supports two styles of joins:

  • Teradata
  • ANSI
The former joins all objects within the FROM clause and all join conditions are listed within the WHERE clause. For example,

from table_1, table_2
where table_1.id = table_2.id

Do not use the Teradata syntax. Instead use the ANSI syntax. For example,

from table_1
inner join table_2 on table_2.id = table_1.id

Perform all join conditions within the ON clause, never in the WHERE clause.

Next, understanding the Teradata order of operations is critical: the ON clause is logically executed before the WHERE clause. This is very important. I say logically because the optimizer may deem the WHERE clause's criteria can be executed sooner in the query process if it would not alter the query's end results.

Third, Teradata processes joins with adherence to strict algebraic principles. This can, and most likely will, differ from other databases. Breaking it down helps us understand why results sometimes may not be what we initially expected.

Left: A

Right: B

A∩B

 
 
 
 
 
 
 
 

If we imagine the Venn diagram from high school math class. Two circles, A and B; A on the left, B on the right; they slightly overlap. We were told in class that the overlapping area of circles A and B was referred to as A∩B (A-intersect-B).

An INNER JOIN is purely A∩B. Only bring back results where A and B match on my join condition.

To illustrate we create some tables, data, and queries to prove it out:

-- left table
create multiset volatile table left_t, no log (
    id       integer      not null,
    left_cde character(1) not null)
unique primary index (id)
on commit preserve rows
;

-- right table
create multiset volatile table right_t, no log (
    id        integer      not null,
    right_cde character(1) not null)
unique primary index (id)
on commit preserve rows
;

-- left data
insert into left_t (1,'A');
insert into left_t (2,'B');
insert into left_t (3,'C');

-- right data
insert into right_t (2,'X');
insert into right_t (3,'Y');
insert into right_t (4,'Z');

-- A intersect B
select
    coalesce(l.id,r.id) as id,
    l.left_cde,
    r.right_cde

from left_t l

  inner join right_t on r.id = l.id
;

When we execute this query we return the set of

id left_cde right_cde
-- -------- ---------
 2 B        X
 3 C        Y
which is the A∩B result. Very simple.

If we added a WHERE clause of l.left_cde = 'B' or an ON clause that contained l.left_cde = 'B' then our results would be

id left_cde right_cde
-- -------- ---------
 2 B        X
because we limited A/Left to records which had a code value of B. Still simple.

With outer joins the complexity increases. A LEFT OUTER JOIN is logically

  1. an inner join: A∩B, plus
  2. an outer join: A - A∩B
Or, A∩B ∪ (A - A∩B). Confusing? It takes a little getting used to. It's that strict algebraic principle thing. So, why does this matter to us? We go back to our earlier statement regarding ON before WHERE.

First, let's illustrate our LEFT OUTER JOIN:

-- left outer join
select
    coalesce(l.id,r.id) as id,
    l.left_cde,
    r.right_cde

from left_t l

  left outer join right_t r on r.id = l.id
;
Here left table is our LEFT table. Running this would net us
id left_cde right_cde
-- -------- ---------
 1 A        null
 2 B        X
 3 C        Y

Now, if we wanted to find A/Left records without a corresponding B/Right record then we would add to the query syntax

where r.id is null

Resulting in

id left_cde right_cde
-- -------- ---------
 1 A        null

So, the join occurs first via the ON clause and then afterwards the WHERE removes all records joined that have a non-null B/Right ID value.

Now, what if we wanted to join our B/Right table to our A/Left table, but only for Left table C values? We still want every record in the A/Left table, but only values from the Right table when the Left Code = C.

select
    coalesce(l.id,r.id) as id,
    l.left_cde,
    r.right_cde

from left_t l

  left outer join right_t r on r.id = l.id and l.left_cde = 'C'
;

Results in

id left_cde right_cde
-- -------- ---------
 1 A        null
 2 B        null
 3 C        Y

However, if we instead placed the Left Code criterion in the WHERE clause the results would be different.

select
    coalesce(l.id,r.id) as id,
    l.left_cde,
    r.right_cde

from left_t l

  left outer join right_t r on r.id = l.id

where l.left_cde = 'C'
;

Results in a very different answer set.

id left_cde right_cde
-- -------- ---------
 3 C        Y

The WHERE removed every record without a Left Code value of C after the join occurred.

A FULL OUTER JOIN is yet even another level up in complexity. Both tables are outer tables. There is no inner table concept. It is thought of a A∪B (A-union-B), but in reality it is A∩B ∪ (A - A∩B) ∪ (B - A∩B). So, what is the difference? It's how the criteria is applied.

A case:

select
    coalesce(l.id,r.id) as id,
    l.left_cde,
    r.right_cde

from left_t l

  full outer join right_t r on r.id = l.id
;

Results in

id left_cde right_cde
-- -------- ---------
 1 A        null
 2 B        X
 3 C        Y
 4 null     Z

Which is what we would expect. The assumption is A∪B. But now let's add some criteria to an ON clause and see what happens.

Query:

select
    coalesce(l.id,r.id) as id,
    l.left_cde,
    r.right_cde

from left_t l

  full outer join right_t r on r.id = l.id and l.left_cde = 'B'
;

What do you think will be returned?

Results in

id left_cde right_cde
-- -------- ---------
 1 A        null
 2 B        X
 3 C        null
 3 null     Y
 4 null     Z

I have to tell you, the first time it happened to me I thought it was a bug. Why is 3 listed twice? Once with C/null and again with null/Y.

The answer is because of the anatomy of an OUTER JOIN. Remember, it is A∩B first! The criteria specified in the ON clause only affects the logical inner/intersection portion of the join. This is critical to understand. We break it down like this:

  1. A∩B with criteria of Left Code = B: 2 B X
  2. A - (A∩B): All Left records not intersected with Right: 1 A and 3 C
  3. B - (A∩B): All Right records not intersected with Left: 3 Y and 4 Z

Note how ID 3 is in both steps 2 and 3 of the above process. That is how the query produces two ID 3 records.Remember three key points:

  1. ON is logically processed before WHERE;
  2. Criteria applied in an ON clause is only applicable to the intersection portion of a join; and
  3. An OUTER JOIN is logically an INNER JOIN first plus an OUTER JOIN process.

I hope this helps clear up any confusion regarding the ON clause versus the WHERE clause, and serves as a reference for now and in the future.

Cheers!
-Brad