**Trying
to Demystify Functional Dependency and Normalisation**

What follows is a top-of-the-head attempt by me to demystify those aspects of functional dependency and normalisation that I perceive, from the comments and questions I have been getting from some of you, as causing the most trouble. Please let me know of anything in this posting that works for you, and also anything that fails miserably. I assume you have read the course text and possibly slides 1-31 of my Warwick University lecture, but are still struggling.

Q.
Why is a functional dependency so called?

A. Because it represents what mathematicians call a function.

Let
me elaborate. Consider the relation
**DayAfter**,
consisting of the following tuples (written as ordered pairs for convenience):

<Mon,
Tue>

<Tue, Wed>

<Wed, Thu>

<Thu, Fri>

<Fri, Sat>

<Sat, Sun>

<Sun, Mon>

Let
the two attributes be named **Today** and **Tomorrow** (so the predicate
for this relation is "*Tomorrow* is the day after *Today*").

Because
there is exactly one tuple for each possible value of **Today**, we say that
this relation is in fact a function. To
be precise, it is a function of days onto days, and moreover it is a *complete*
function. Each element (2-tuple) of
the function maps a day (**Today**) to exactly one other day (**Tomorrow**).
The function is complete because it maps every day there is.
Completeness, however, is not important in the study of FDs and
normalisation--I mention it here only in passing.
It is the "exactly one" property of the mapping that makes the
relation a function. (Note that the
relation "is the square root of" is not a function, because some
numbers--all nonzero positive ones in fact--have two square roots.)

The
functional property of the **DayAfter** relation can be expressed as the
following FD:

**Today
→
Tomorrow**

pronounced "today determines tomorrow".

So,
**Today** is the determinant of some nontrivial FD that holds in **DayAfter**.

If
every attribute of a relation depends on some determinant in the way that **Tomorrow**
depends on **Today**, then that determinant is a *superkey*
of the relation (M359 calls this the *uniqueness* rule).
If, furthermore, no proper subset of the determinant is a superkey, then
that determinant is a *candidate key* (CK) (M359 calls this the *minimality*
rule, though Date prefers *irreducibility,* and so do I).

It
is easy to see, then that **Today** is a CK of **DayAfter**. In additional to the nontrivial FD already noted, the FD **Today****
→
****Today
**holds trivially, because everything determines itself.
The only proper subset of {**Today**} is the empty set and there is no
nontrivial FD that has the empty set as determinant and also holds in **NextDay**.
*(What's all this about the empty set as a possible
determinant?)*

Observe
that the following FD also holds in **NextDay**:

**Tomorrow
****→
Today**

This represents what mathematicians call an inverse function. You might say that "yesterday" is the inverse function of "tomorrow". Not all functions have an inverse. For example, there is no inverse of the function "is the square of" precisely because, as I have already noted, "is the square root of" is not a function.

It
follows that **Tomorrow** is another CK of **DayAfter**.

Obviously
there are no other nontrivial FDs holding in **DayAfter**, so we have two
distinct determinants of nontrivial FDs. As
both of these determinants are CKs, we can declare that **DayAfter**is in
BCNF.

If
2NF and 3NF had not been defined back in the early 1970s when
people were first beginning to explore relational theory as a basis for database
design and manipulation, we wouldn't be trying to teach them to you now.
BCNF would have been the whole story (unless you want to go beyond slide
31 of my lecture on the subject and therefore beyond the scope of this and many
other university courses). But we meanly teach you 2NF and 3NF because those definitions
have never quite managed to find their ways out of the textbooks (though Chris
Date's latest textbook,* Database in Depth*, defines the terms and then
dismisses them).

Q.
Why are 2NF and 3NF now deprecated?

A. First, many of the definitions to be found in the literature define these in
terms of a "primary" key. That's a mistake, because all keys are
equal from a logical point of view. More importantly, they simply aren't
needed. Just look at M359's definition of "transitive
dependency", on which its definition of 3NF depends. Notice also that
the definition of 3NF depends also on the definition of 2NF. BCNF is not
only stronger, in that it covers cases of redudancy not covered by 3NF, but it
is also *simpler*.

Of course it is sometimes the case that a relation has exactly one CK. And it is true that if a relation has exactly one CK and is in 3NF, then it is in BCNF.

Now
consider the following relation, which combines **DayAfter **with the
calendar. It has three attributes, **Date,
Today, **and** Tomorrow,** and contains 3-tuples such as the following:

<30/8/05,
Tue, Wed>

<31/8/05, Wed, Thu>

<1/9/05, Thu, Fri>

<8/9/05, Thu, Fri>

13/9/05, Tue, Wed>

and
a host of others, of course. The
predicate might be "On *Date* the weekday is *Today* and the
weekday of the day after is *Tomorrow*".
Let me call this relation **Calendar_ND** (calendar with next day).

Notice
that the FDs noted as holding in **DayAfter** hold in **Calendar_ND** too:

**Today
****→
Tomorrow
Tomorrow →
Today**

and
yet neither **Today** nor **Tomorrow** is a CK, because the following FDs
do *not* hold in **Calendar_ND**:

**Today
****→
Date
Tomorrow →
Date**

Therefore
**Calendar_ND** is not in BCNF. What's
more you can easily see the redundancy that it suffers from as a result.
For every date that falls on a Monday, we have to note that the next day
will be a Tuesday, even though we know very well that Tuesday is always the day
after a Monday, regardless of the date.

Notice
that we have determined that **Calendar_ND** is not in BCNF without even
finding out what its candidate keys are, let alone appointing some CK to be its
primary key. To find out if **Calendar_ND**
is in 2NF and 3NF we do need to do that exercise.
And to discover the CKs we need to note all the nontrivial FDs that hold
in it. In addition to the ones
already noted, we have:

**Date****
→
Today
Date →
Tomorrow**

Therefore
every attribute of **Calendar_ND** is determined by **Date**, so **Date**
is a superkey of **Calendar_ND**. Moreover,
no proper subset of {**Date**} is a superkey.
Therefore **Date** is a CK. For
the purpose of 2NF and 3NF, we will appoint this CK to be the PK (and in fact
there are clearly no other choices anyway).

Q.
Is **Calendar_ND** in 2NF?

A. Yes. There is no nontrivial FD
that (a) holds in **Calendar_ND** and (b) has as its determinant a set of
attributes that is a proper subset of the PK, {**Date**}.

Q.
Is **Calendar_ND** in 3NF?

A. No. For example, the following
FDs have both been noted as holding in **Calendar_ND**:

**Date
****→
Today
Today →
Tomorrow**

Now,
we already know that the FD **Date ****→
****Tomorrow
**also holds, but even if we didn't know that we could conclude it from the
two FDs just noted, under the FD theorem known as "transitivity".
Because Today appears on the right-hand side of the first FD and
constitutes the entire left-hand side of the second, it follows that **Date ****→
****Tomorrow**
holds.

3NF is violated whenever a nontrivial FD holds whose determinant is not a subset of the PK and whose right-hand side is also not a subset of the PK. (3NF is also violated whenever 2NF is violated, by definition.)

You might already be familiar with the mathematical notion of transitivity. It applies to many relations as well as FDs. For example, it applies to "less than" and "is the ancestor of". If A is less than B and B is less than C, then A is less than C. If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C. It does not quite apply to "is the brother of", because I am my brother's brother and yet I am not my own brother (under our normal understanding of that term). More clearly, it does not apply to "is the parent of".

To show why 3NF is best dispensed with, I consider a more business-like relation,

EMP(employees), with attributesE#, Name, HireDate, NI#. ClearlyE#andNI#are both CKs, as they are both single attributes that satisfy the uniqueness rule for CKs. As it is possible for several employees to have the same name, and for several to be hired on the same date, there are no other CKs.

It
is easy to see that **EMP** satisfies BCNF.
Every nontrivial determinant (all two of them, in fact) is a CK.
It is also easy to see that **EMP** suffers from no redundancy, and so
represents a satisfactory part of the database design under the criterion of
redundancy avoidance.

And
yet to show that **EMP** is in 3NF, regardless of which CK we appoint to be the
PK, we have to reason much more laboriously.
Say we choose **E#** to be the PK Then we have the following FDs:

**E#****
→
****NI#**

**NI#****
→
HireDate**

and
so **E#****
→
****HireDate**
appears to be a transitive dependency under the definition of that term for the
purpose of defining 3NF and we have to suspect a violation of 3NF.

What
makes **EMP** okay in spite of the apparent violation of 3NF is the
"reverse dependency" **NI#****
→
****E#**.
When an FD holds whose determinant is not a subset of the PK and whose
right-hand side is the PK itself, that determinant is an *alternate key*
and no redundancy arises from it. The
"reverse dependency" is in fact the inverse function of the function
represented by **E# ****→
****NI#**.
(Note that the function represented by **E#****
→
****HireDate**
has no inverse--the FD **HireDate ****→
****E#**
does not hold in **EMP**.)

Similarly,
if instead we make **NI#** the PK, we have:

**NI#****
→
E#
E# →
Name**

and
so **NI# ****→
****Name**
again appears at first glance to be a transitive dependency and again we have
to note the "reverse" dependency that springs to our rescue.

*Any
questions?*

Yes:

**1.
***What's
all this about the empty set as a possible determinant?*

Sorry if that remark caught you unawares. I know that M359 doesn't mention the idea of the empty determinant and its consequence, the empty CK. The history of computer science is littered with negligence of degenerate cases, and this is such a case, even though what I am about to explain is an inexorably logical consequence of our understanding of FDs.

Suppose
we extend **DayAfter** with an additional attribute named **FirstDay **to
yield a new relation called** DayAfterX.**, The predicate for **DayAfterX**
is "*Tomorrow*
is the day after *Today *and the first day of every week is* FirstDay.*"
Let us agree that weeks always start on Monday (and therefore end on
Sunday). Then **DayAfterX** consists of the following tuples (written
as ordered 3-tuples):

<Mon,
Tue, Mon>

<Tue, Wed, Mon>

<Wed, Thu, Mon>

<Thu, Fri, Mon>

<Fri, Sat, Mon>

<Sat, Sun, Mon>

<Sun, Mon, Mon>

It
is easy to see that **DayAfterX**
suffers from redundancy. We have to repeat the value 'Mon' for **FirstDay**
in every tuple. It is not so easy to see that **DayAfterX**
violates BCNF! If it violates BCNF, there must be some nontrivial FD that
holds in **DayAfterX**
but whose determinant is not a CK of **DayAfterX**.

The
following FDs certainly hold in **DayAfterX:**

**Today
****→
FirstDay
Tomorrow →
Firstday**

but
**Today** and **Tomorrow** are both CKs, so neither of them is the one
that causes **DayAfterX** to violate BCNF and suffer from redundancy.
Think about those two FDs, though. Do we need to know what today is, or
what tomorrow is, to work out what the first day of the week is? No.
The first day of the week is just a given. It is determined by
nothing! Well, if FirstDay is determined by nothing--i.e., the empty set,
denoted by { }--then we can clearly write

{
} **→
Firstday**

and
we need to determine whether { } is a CK of **DayAfterX**. Most
importantly, note that **FirstDay** has the same value in every tuple of **DayAfterX**.
That, in fact, is exactly what having { } as a determinant means. Bearing
that consequence in mind, it is very easy indeed to see that neither of the
following FDs holds in **DayAfterX**:

**{
} ****→
Today
**

for
**Today** does not have the same value in every tuple, and nor does **Tomorrow**.
Therefore { } is not a CK. Therefore **DayAfterX** is not in BCNF.

What do we do about that? Easy--just decompose in the usual way, to yield:

Our
original **DayAfter**,
with CKs **Today** and **Tomorrow**, consisting of the following tuples

<Mon,
Tue>

<Tue, Wed>

<Wed, Thu>

<Thu, Fri>

<Fri, Sat>

<Sat, Sun>

<Sun, Mon>

and
the new relation **DayOne**, with a single attribute named **FirstDay**,
and { } as its only CK. It consists of just the following 1-tuple:

<Mon>

Note
that **FirstDay** is not a CK of **DayOne**, for if it were, then **DayOne**
would be permitted to contain more than one tuple. There can't be more
than one first day of the week!

Many years ago my attempt to get the international standard for SQL to allow the empty set to be specified as a candidate key was rejected as a result of vehement opposition from the national body (USA) that represents most of the purveyors of SQL products (e.g., IBM, Oracle, Microsoft and co.). It seems that they thought I was joking and could see no useful commercial purpose for the concept. Can you think of any realistic examples? How about a one-row table that gives the name of the company, the url of its home page on the WWW, and its current share price?

*Any
more questions?*

Yes:

Let
*R* be a relation and let *H* (for "heading") be its set of
attributes. A superkey *SK* is any subset of *H* such that no
two tuples in *R* have the same combined value for the attributes of *SK.
*In other words it is a subset of the heading that has the so-called
uniqueness property, though it does not necessarily also have the so-called
minimality property that would make it a CK. The term *superkey* is
motivated by the observation that every superkey is a superset (not necessarily
a proper superset) of some candidate key.