Trying to Demystify Functional Dependency and Normalisation

Trying to Demystify Functional Dependency and Normalisation

What follows is a top-of-the-head attempt by me to demystify those aspects of functional dependency and normalisation that I perceive, from the comments and questions I have been getting from some of you, as causing the most trouble. Please let me know of anything in this posting that works for you, and also anything that fails miserably. I assume you have read the course text and possibly slides 1-31 of my Warwick University lecture, but are still struggling.

Q. Why is a functional dependency so called?
A. Because it represents what mathematicians call a function.

Let me elaborate. Consider the relation DayAfter, consisting of the following tuples (written as ordered pairs for convenience):

<Mon, Tue>
<Tue, Wed>
<Wed, Thu>
<Thu, Fri>
<Fri, Sat>
<Sat, Sun>
<Sun, Mon>

Let the two attributes be named Today and Tomorrow (so the predicate for this relation is "Tomorrow is the day after Today").

Because there is exactly one tuple for each possible value of Today, we say that this relation is in fact a function. To be precise, it is a function of days onto days, and moreover it is a complete function. Each element (2-tuple) of the function maps a day (Today) to exactly one other day (Tomorrow). The function is complete because it maps every day there is. Completeness, however, is not important in the study of FDs and normalisation--I mention it here only in passing. It is the "exactly one" property of the mapping that makes the relation a function. (Note that the relation "is the square root of" is not a function, because some numbers--all nonzero positive ones in fact--have two square roots.)

The functional property of the DayAfter relation can be expressed as the following FD:

Today → Tomorrow

pronounced "today determines tomorrow".

So, Today is the determinant of some nontrivial FD that holds in DayAfter.

If every attribute of a relation depends on some determinant in the way that Tomorrow depends on Today, then that determinant is a superkey of the relation (M359 calls this the uniqueness rule). If, furthermore, no proper subset of the determinant is a superkey, then that determinant is a candidate key (CK) (M359 calls this the minimality rule, though Date prefers irreducibility, and so do I).

It is easy to see, then that Today is a CK of DayAfter. In additional to the nontrivial FD already noted, the FD Today → Today holds trivially, because everything determines itself. The only proper subset of {Today} is the empty set and there is no nontrivial FD that has the empty set as determinant and also holds in NextDay. (What's all this about the empty set as a possible determinant?)

Observe that the following FD also holds in NextDay:

Tomorrow → Today

This represents what mathematicians call an inverse function. You might say that "yesterday" is the inverse function of "tomorrow". Not all functions have an inverse. For example, there is no inverse of the function "is the square of" precisely because, as I have already noted, "is the square root of" is not a function.

It follows that Tomorrow is another CK of DayAfter.

Obviously there are no other nontrivial FDs holding in DayAfter, so we have two distinct determinants of nontrivial FDs. As both of these determinants are CKs, we can declare that DayAfteris in BCNF.

If 2NF and 3NF had not been defined back in the early 1970s when people were first beginning to explore relational theory as a basis for database design and manipulation, we wouldn't be trying to teach them to you now. BCNF would have been the whole story (unless you want to go beyond slide 31 of my lecture on the subject and therefore beyond the scope of this and many other university courses). But we meanly teach you 2NF and 3NF because those definitions have never quite managed to find their ways out of the textbooks (though Chris Date's latest textbook, Database in Depth, defines the terms and then dismisses them).

Q. Why are 2NF and 3NF now deprecated?
A. First, many of the definitions to be found in the literature define these in terms of a "primary" key. That's a mistake, because all keys are equal from a logical point of view. More importantly, they simply aren't needed. Just look at M359's definition of "transitive dependency", on which its definition of 3NF depends. Notice also that the definition of 3NF depends also on the definition of 2NF. BCNF is not only stronger, in that it covers cases of redudancy not covered by 3NF, but it is also simpler.

Of course it is sometimes the case that a relation has exactly one CK. And it is true that if a relation has exactly one CK and is in 3NF, then it is in BCNF.

Now consider the following relation, which combines DayAfter with the calendar. It has three attributes, Date, Today, and Tomorrow, and contains 3-tuples such as the following:

<30/8/05, Tue, Wed>
<31/8/05, Wed, Thu>
<1/9/05, Thu, Fri>
<8/9/05, Thu, Fri>
13/9/05, Tue, Wed>

and a host of others, of course. The predicate might be "On Date the weekday is Today and the weekday of the day after is Tomorrow". Let me call this relation Calendar_ND (calendar with next day).

Notice that the FDs noted as holding in DayAfter hold in Calendar_ND too:

Today → Tomorrow
Tomorrow → Today

and yet neither Today nor Tomorrow is a CK, because the following FDs do not hold in Calendar_ND:

Today → Date
Tomorrow → Date

Therefore Calendar_ND is not in BCNF. What's more you can easily see the redundancy that it suffers from as a result. For every date that falls on a Monday, we have to note that the next day will be a Tuesday, even though we know very well that Tuesday is always the day after a Monday, regardless of the date.

Notice that we have determined that Calendar_ND is not in BCNF without even finding out what its candidate keys are, let alone appointing some CK to be its primary key. To find out if Calendar_ND is in 2NF and 3NF we do need to do that exercise. And to discover the CKs we need to note all the nontrivial FDs that hold in it. In addition to the ones already noted, we have:

Date → Today
Date → Tomorrow

Therefore every attribute of Calendar_ND is determined by Date, so Date is a superkey of Calendar_ND. Moreover, no proper subset of {Date} is a superkey. Therefore Date is a CK. For the purpose of 2NF and 3NF, we will appoint this CK to be the PK (and in fact there are clearly no other choices anyway).

Q. Is Calendar_ND in 2NF?
A. Yes. There is no nontrivial FD that (a) holds in Calendar_ND and (b) has as its determinant a set of attributes that is a proper subset of the PK, {Date}.

Q. Is Calendar_ND in 3NF?
A. No. For example, the following FDs have both been noted as holding in Calendar_ND:

Date → Today
Today → Tomorrow

Now, we already know that the FD Date → Tomorrow also holds, but even if we didn't know that we could conclude it from the two FDs just noted, under the FD theorem known as "transitivity". Because Today appears on the right-hand side of the first FD and constitutes the entire left-hand side of the second, it follows that Date → Tomorrow holds.

3NF is violated whenever a nontrivial FD holds whose determinant is not a subset of the PK and whose right-hand side is also not a subset of the PK. (3NF is also violated whenever 2NF is violated, by definition.)

You might already be familiar with the mathematical notion of transitivity. It applies to many relations as well as FDs. For example, it applies to "less than" and "is the ancestor of". If A is less than B and B is less than C, then A is less than C. If A is an ancestor of B and B is an ancestor of C, then A is an ancestor of C. It does not quite apply to "is the brother of", because I am my brother's brother and yet I am not my own brother (under our normal understanding of that term). More clearly, it does not apply to "is the parent of".

To show why 3NF is best dispensed with, I consider a more business-like relation, EMP (employees), with attributes E#, Name, HireDate, NI#. Clearly E# and NI# are both CKs, as they are both single attributes that satisfy the uniqueness rule for CKs. As it is possible for several employees to have the same name, and for several to be hired on the same date, there are no other CKs.

It is easy to see that EMP satisfies BCNF. Every nontrivial determinant (all two of them, in fact) is a CK. It is also easy to see that EMP suffers from no redundancy, and so represents a satisfactory part of the database design under the criterion of redundancy avoidance.

And yet to show that EMP is in 3NF, regardless of which CK we appoint to be the PK, we have to reason much more laboriously. Say we choose E# to be the PK Then we have the following FDs:

E# → NI#
NI# → HireDate

and so E# → HireDate appears to be a transitive dependency under the definition of that term for the purpose of defining 3NF and we have to suspect a violation of 3NF.

What makes EMP okay in spite of the apparent violation of 3NF is the "reverse dependency" NI# → E#. When an FD holds whose determinant is not a subset of the PK and whose right-hand side is the PK itself, that determinant is an alternate key and no redundancy arises from it. The "reverse dependency" is in fact the inverse function of the function represented by E# → NI#. (Note that the function represented by E# → HireDate has no inverse--the FD HireDate → E# does not hold in EMP.)

Similarly, if instead we make NI# the PK, we have:

NI# → E#
E# → Name

and so NI# → Name again appears at first glance to be a transitive dependency and again we have to note the "reverse" dependency that springs to our rescue.

Any questions?

Yes:

1. What's all this about the empty set as a possible determinant?

Sorry if that remark caught you unawares. I know that M359 doesn't mention the idea of the empty determinant and its consequence, the empty CK. The history of computer science is littered with negligence of degenerate cases, and this is such a case, even though what I am about to explain is an inexorably logical consequence of our understanding of FDs.

Suppose we extend DayAfter with an additional attribute named FirstDay to yield a new relation called DayAfterX., The predicate for DayAfterX is "Tomorrow is the day after Today and the first day of every week is FirstDay." Let us agree that weeks always start on Monday (and therefore end on Sunday). Then DayAfterX consists of the following tuples (written as ordered 3-tuples):

<Mon, Tue, Mon>
<Tue, Wed, Mon>
<Wed, Thu, Mon>
<Thu, Fri, Mon>
<Fri, Sat, Mon>
<Sat, Sun, Mon>
<Sun, Mon, Mon>

It is easy to see that DayAfterX suffers from redundancy. We have to repeat the value 'Mon' for FirstDay in every tuple. It is not so easy to see that DayAfterX violates BCNF! If it violates BCNF, there must be some nontrivial FD that holds in DayAfterX but whose determinant is not a CK of DayAfterX.

The following FDs certainly hold in DayAfterX:

Today → FirstDay
Tomorrow → Firstday

but Today and Tomorrow are both CKs, so neither of them is the one that causes DayAfterX to violate BCNF and suffer from redundancy. Think about those two FDs, though. Do we need to know what today is, or what tomorrow is, to work out what the first day of the week is? No. The first day of the week is just a given. It is determined by nothing! Well, if FirstDay is determined by nothing--i.e., the empty set, denoted by { }--then we can clearly write

{ } → Firstday

and we need to determine whether { } is a CK of DayAfterX. Most importantly, note that FirstDay has the same value in every tuple of DayAfterX. That, in fact, is exactly what having { } as a determinant means. Bearing that consequence in mind, it is very easy indeed to see that neither of the following FDs holds in DayAfterX:

{ } → Today
{ } → Tomorrow

for Today does not have the same value in every tuple, and nor does Tomorrow. Therefore { } is not a CK. Therefore DayAfterX is not in BCNF.

What do we do about that? Easy--just decompose in the usual way, to yield:

Our original DayAfter, with CKs Today and Tomorrow, consisting of the following tuples

<Mon, Tue>
<Tue, Wed>
<Wed, Thu>
<Thu, Fri>
<Fri, Sat>
<Sat, Sun>
<Sun, Mon>

and the new relation DayOne, with a single attribute named FirstDay, and { } as its only CK. It consists of just the following 1-tuple:

<Mon>

Note that FirstDay is not a CK of DayOne, for if it were, then DayOne would be permitted to contain more than one tuple. There can't be more than one first day of the week!

Many years ago my attempt to get the international standard for SQL to allow the empty set to be specified as a candidate key was rejected as a result of vehement opposition from the national body (USA) that represents most of the purveyors of SQL products (e.g., IBM, Oracle, Microsoft and co.). It seems that they thought I was joking and could see no useful commercial purpose for the concept. Can you think of any realistic examples? How about a one-row table that gives the name of the company, the url of its home page on the WWW, and its current share price?

Any more questions?

Yes:

2. What's a superkey?

Let R be a relation and let H (for "heading") be its set of attributes. A superkey SK is any subset of H such that no two tuples in R have the same combined value for the attributes of SK. In other words it is a subset of the heading that has the so-called uniqueness property, though it does not necessarily also have the so-called minimality property that would make it a CK. The term superkey is motivated by the observation that every superkey is a superset (not necessarily a proper superset) of some candidate key.