Implementers' Reflections

To coincide with the publication of Database Explorations (2010), we solicited contributions from people who have developed software products based on or inspired by TTM. We were very pleased to receive the responses shown here.

Contents: CsiDB Dataphor Dee Duro Ingres D MighTyD RAQUEL Rel SIRA_PRISE TclRAL

CsiDB: A C++ Library for Relational Database Access

CsiDb is a C++ library that was developed for internal use at Thomson Reuters in one of its Creative Solutions line of desktop accounting applications. It is part of a software layer that mediates access to an underlying SQL DBMS. It augments an older in-house object-relational mapping subsystem. The latter system was considered capable for database updates, providing reasonable ease of use to the application developers, as well as a layer that could enforce complex business rules that could not be readily expressed directly by the DBMS. However, the object-relational subsystem was a performance bottleneck when reading data, in part because it made it cumbersome to tailor data views to specific application needs. CsiDb was an answer to this problem, providing application developers with a library interface that allowed them to build custom queries more easily and dynamically (the form of queries depending e.g. on user input), without themselves having to learn an antiquated and anemic dialect of SQL, and often without waiting for the few available database experts on the development team to design views and add them to the schema.

?Foundation for Future Database Systems? and Tutorial D were used as sources of inspiration in designing the library?s APIs. We expected that the relational algebra approach would be easy for programmers to grasp. We also wanted an interface that followed common C++ idioms, so that its use would feel natural to C++ programmers.

This development project had numerous constraints and explicit non-goals that prevent it from being a full database management system. There was no need to provide DDL functionality, since we were working with an existing schema and framework for managing same, and we were not planning on replacing the underlying DBMS in the short term. As mentioned earlier, there was an existing system that was adequate for modifying data in the database, which was already integrated with our user interface framework, so there was no compelling need to implement concepts such as relvar assignment. The focus was very much on querying data efficiently. Some of these priorities might have changed over time, and we tried to anticipate this where possible. For instance, we did wish to eventually replace the underlying DBMS, but in fact active development of this particular product was scaled back before that replacement was ever seriously planned.

Let?s take a look at an example of C++ code that uses the library to retrieve data from the ubiquitous supplier-parts database, and then discuss how it compares to Tutorial D. We?ll start with something simple; creating an application relation which contains the result of projecting P over CITY.

// These three definitions are assumed to be globally

// visible, and are part of the mapping of schema

// elements to CsiDb constructs.

CSupplierParts db;

CExpr P = relvar(?P?);

CAttrRefExpr CITY = attr(?CITY?);

// In application code, one might see this sort

// of thing …

CLocalRelation result (P.DropAllBut(CITY), key_list << CITY);

This C++ code defines a variable, result, of type CTupleList. A tuple list can be thought of as simply a container of tuples. The subexpression key_list << CITY creates a new key list and adds CITY to it. This declaration of result is similar to the following Tutorial D declaration:

VAR result PRIVATE INIT ( P { CITY } ) KEY { CITY };

One can see a fairly a simple correspondence between concepts here, which holds well for most relational operations. DropAllBut is a member function of the class CExpr; the latter is the most general type of a query expression. CExpr defines many member functions to represent relational operators, the most important being these:

– Add (analogous to EXTEND)

– DropAllBut (projection)

– Join (natural join)

– LeftJoin, RightJoin (?respectable? versions of outer join)

– Minus

– Rename (to rename an attribute)

– Restrict (aka WHERE)

– SummarizePer, SummarizeBy

– Quota

– Union

Many scalar operators are defined, also. The most common ones take advantage of C++?s operator overloading, providing programmers with convenient infix notation for e.g. Boolean logic operators and arithmetic operators.

A difference between the above code snippets which may not be immediately obvious is that Tutorial D?s variable result has a specific relation type, which embodies all the details of the heading, and that semantic information is available at compile-time to infer the types of subsequent expressions that use result. In the C++ code, result is simply of type CLocalRelation. The heading is available at run-time, but, sadly, not at compile-time. The main disadvantage of this difference is that attribute type mismatches and attribute name reference errors cannot be detected until run-time. How problematic one judges this difference to be depends on one?s stance on static (compile-time) versus dynamic (run-time) type checking in general, which is one of the ever-ongoing debates in programming language design. Our experience was that the delay in error detection was annoying but tolerable. A compile-time-typed approach would in theory be possible to implement using C++ template features, but would have been fairly difficult to develop and maintain; nor do we know of any other programming language with sufficient compile-time extensibility to make this approach tractable, except perhaps the Lisp family of languages, where compile-time type checking is perhaps possible but not idiomatic.

Early versions of CsiDb used native C++ primitive types for attribute values. Since the schema we were working with already existed and was built solely on the simple types available from early SQL systems, mapping between values stored in the database and their C++ representations was fairly straightforward. In later versions, we introduced data types that, although defined as classes in C++, were intended to enforce TTM?s rules for scalar types. This was part of an effort to introduce a wider variety of user-defined types to the database, with support for selectors, possreps, and GET_ operators; a system for this was successfully developed but ultimately not used within the lifetime of the product.

The library conformed to all of the RM and OO Proscriptions, insofar as they were relevant to querying the database (as opposed to modifying it). The lack of update support for relvars did allow us to ignore some of the more difficult requirements of TTM. Also, it took us a little while to purge the use of nulls from our existing schema.

The library also conformed to the relevant RM Prescriptions, although, as mentioned above, the effort to fully support some aspects of scalar types was wasted for our purposes. In retrospect, some of these requirements do not seem truly essential to supporting the relational model itself, but seem more like programming language design choices.

On the Implementation of Dataphor

Reflections on the process of building a TRDBMS based on The Third Manifesto

Introduction

We have been meaning to write this for a long time, so when we were approached by Hugh Darwen to provide a short write-up about the implementation of Dataphor for an upcoming book dealing with issues relating to The Third Manifesto, we naturally jumped at the chance. Not only does it give us an excuse to finally write down our thoughts and ideas about the experience of building a TTM implementation, but it gives us a chance to give something, however small, back to Chris Date and Hugh Darwen who have helped us so much over the years with their tireless efforts at explaining and defending the Relational Model.

What It Is

The simplest description of Dataphor is that it is an Application Development Platform. The idea was to build a framework that automated as much as possible of the development of an application based solely on the description of that application. We recognized that this was not a new idea, but a fairly comprehensive survey of the available technologies led us to the conclusion that, though there were some frameworks with the same goals, they failed to achieve the level of automation that we felt was possible.

Initially, we began implementation of this idea as a set of client-side technologies that were an offshoot of previous tools layers that we had developed as part of our experiences building various types of database applications. We were making progress towards things like view updatability and constraint enforcement (though we didn?t recognize them as such at the time), but we felt that our solutions were ad-hoc and lacked any fundamental organizing principles. This was around the year 2000, which happily coincided with the publication of the 2^nd edition of The Third Manifesto, a book we stumbled upon during one of our visits to the computer section of the local bookstore. As we read through this fascinating work, we realized at least two things: 1) Most of the problems we were facing in our everyday jobs (and trying to solve with our framework) had already been solved quite nicely, they just weren?t present in the DBMSs of the day for various reasons, and 2) If we wanted to achieve the levels of automation we were looking for, we were going to need to build a DBMS. Luckily, the book was also a high-level schematic for just that.

What Went Right

So we set about defining the language that would be the basis for the system (D4), and we laid out the overall architecture, broadly following the ideas in The Third Manifesto, but focusing them towards our end goal of automating application development. As a result, the system naturally separated into two main areas, the Dataphor Server, which is basically a Federated DBMS, and the Frontend, which provides the development automation by deriving user-interfaces from schema definitions contained in the server. In this discussion we will focus more on the Dataphor Server itself, as that is the portion of the system that was most influenced by the ideas in The Third Manifesto. The following sections discuss the aspects of The Third Manifesto that we feel contributed most to achieving the levels of automation reached in Dataphor.

Data Independence

The emphasis throughout TTM on various logical distinctions such as that between model and implementation provided a guiding principle for the implementation of Dataphor that resulted in a high degree of data independence, both logical and physical.

In terms of logical data independence, the ability to treat tables and views as interchangeable allows the system to provide the same level of user-interface automation for a view as it would for a base table. We knew this was something we wanted for the project, as we had for years suffered the consequences of the poor logical data independence provided by SQL-based systems. By providing complete view updatability as specified in TTM, as well as allowing various logical constructs such as constraints and event handlers to be defined for views, Dataphor achieves a very high degree of logical data independence.

In terms of physical data independence, Dataphor was designed as a Federated DBMS, so the separation between the physical storage of the data (in a SQL-based DBMS or other data storage system) and the logical definition of that data was present from the beginning. Dataphor provides the ability to specify which storage DBMS ultimately houses the data for a particular base table, resulting in a naturally federated system with a high degree of physical data independence.

Constraint Enforcement

As a direct consequence of following the many TTM prescriptions regarding constraint enforcement, Dataphor is, we believe, an unparalleled business rules engine. The system supports type, row, key, table, reference, and database constraints, as well as transition constraints that can be used to expressively implement business rules that are either extremely difficult, or in cases downright impossible, to implement in any other system of which we are aware.

Nullology

Due to the TTM proscription on nullological mistakes, we made a concerted effort throughout the implementation to ensure that the language, and the system in general, were as free from special cases as possible. This aspect was particularly relevant for us, coming from a background in T-SQL, a language with the perhaps unique distinction of having more exceptions than rules.

Row Types

Although it was not apparent at the time, one of the most useful features of the D4 language turned out to be the fact that it fully supported row (or tuple) types, including a type generator, selector, and row-valued operators. We simply included it because it was part of the schematic and made a good deal of sense, but it is now painfully obvious that many of the most cumbersome aspects of SQL (and even many procedural and object-oriented languages) are directly attributable to a lack of proper support for this simple construct. For example, a row type allows operators to easily return multiple values without the need to define a specialized type or use output parameters.

Possible Representations

One of the most important distinctions made clear for us by TTM was the separation of a value from its possible representations. We feel we have a faithful implementation of possible representations, and this has given the D4 language a strong foundation for building applications. Unlike traditional DBMSs, Dataphor has first-class user-defined type support, not only for types within the language itself, but for the representation of values of those types within the physical, logical, and external layers.

Declarative Development

The combination of all these enabling technologies into a single system results in a platform that affords an extraordinary degree of declarative development. Many applications can be built entirely from the application schema, and every application thus far has benefited greatly from having so many of the mundane development tasks completely automated. In a recent port of a large-scale enterprise application with thousands of user-interfaces and almost a half-a-million lines of T-SQL, the savings in lines of code was around 75%, with a significant increase in functionality due to the many features provided out-of-the-box by a Dataphor client application. It is our belief that this level of automation and declarative development is largely attributable to the foundation provided by TTM.

What Went Wrong

When we initially decided to adopt The Third Manifesto as a schematic for Dataphor, we set out to build as faithful an implementation as possible. However, there were some aspects with which we found we could not comply, either because we did not feel that strict adherence achieved the desired outcome, or because we in principle disagreed with some aspect. The following sections discuss the most important of these areas of non-compliance.

Nulls

Of course I had to bring up nulls, what discussion of The Third Manifesto would be complete without them? We like to think we knew a little bit about nulls before we became acquainted with TTM, but we certainly weren?t prepared for the amount of discussions that have taken place over the issue. It seems like such a small thing on the surface, but it really has caused a lot of problems, and, we would add, solutions.

The crux of the matter is that the Dataphor toolset was built to automate database application development, and something that one does a lot of in database applications is data entry. There is in fact no way around it, in a data entry context, the idea of a null is extremely useful. There is a point at which the user is filling out the data that is to be entered into the database, and during that time there are simply not yet values for all the data. It can be argued (and has been) that allowing nulls in a data entry context is not proscribed by TTM, and further that such a feature in no way requires the support of the DBMS to implement. However, as we tried to build a practical solution to the problem without introducing nulls into the DBMS, we ran into significant problems.

The first version of the Dataphor server was TTM compliant with respect to nulls. However, we found that we needed a way to talk about the data as it was being entered in order to enforce constraints that were placed on the data as it was being entered. It can be argued that these types of constraints should be enforced by the client side, but it is our contention that that is one of the main causes of the impedance mismatch present in database applications; one of the main problems we were trying to solve in the first place.

As a result, we tried to introduce another concept that would allow us to deal with nulls in Frontend contexts, but still preserve strict relational requirements in the actual tables. We called these new beasts presentations, and they did in fact turn out to be beasts. They were supposed to be the same as tables, except that they would allow nulls. The problem was that as soon as the language had to deal with nulls in the context of presentations, it had to deal with them everywhere. We tried several different approaches to rectify the problems that arose, but in the end we were forced to concede that we were already dealing with presentations almost exclusively anyway, so why have separate concepts?

As an aside, we even tried to remedy the situation by defining the null as part of the type system, so that every type would have all the normal values of the type, plus one ?value? (apologies to Chris) that would signify null. This approach caused more problems than it addressed however, and was abandoned on the whiteboard.

Several other approaches were tried, including extreme normalization and default values, but in the end, the simplest practical solution to the problem was the adoption of nulls pretty much as they exist in any given SQL-based DBMS today. We note that the D4 compiler supports the inference of nullability as part of the metadata inferred for any expression, and that this information is used to provide helpful warnings and hints to the developer, often mitigating some of the usual risks associated with null support. This is not to say that we feel that SQL-style nulls are the ultimate solution to this problem, only that we do not feel the issue is settled, and we have so far not been convinced that any pragmatically better solutions have been presented.

Value-Based Inheritance

One of the most attractive aspects of TTM is, of course, the possibility of an object-relational rapprochement, and the first version of the D4 language included, a faithful (if somewhat na�ve) implementation of the type system as prescribed by TTM. However, when we actually started using the type system features to build applications, we found that it did not provide the types of reuse we had come to expect from traditional object-oriented approaches.

In particular, we found that the association of the object-oriented features of the language with ?domains? prevented exactly the kind of reuse we were hoping would be achieved. This is not to say that we can find anything technically wrong with the type system prescribed by TTM. On the contrary, it is an elegant theory that solves some problems very nicely (for example, the integer types can be beautifully expressed, as well as some interval type hierarchies). However, the most common type of reuse afforded by traditional object-oriented languages is achieved precisely because the type hierarchy does not strictly follow the concept of values and subsets of values.

Another of the most common usages for type inheritance in early incarnations of D4 was the inheritance of metadata associated with the type. For example, it would be nice to be able to define an ID type, and then have several related subtypes that all inherited the same behavior as far as defaults, constraints, and other metadata. However, this usage conflated several concepts into one and the resulting expressions in D4 were cumbersome.

As a result of these issues, we eventually abandoned value-based type inheritance in favor of transitive implicit conversion. We realize there are type-safety issues with this approach, but we feel that those concerns are outweighed by the practical benefits achieved by allowing types to be declared as ?like? another type, inheriting their behavior and allowing types to be defined and used simply within the language.

As with nulls, however, we do not feel that this is the end of the story for object-relational rapprochement. We believe the benefits of achieving the types of reuse typically found in object-oriented languages would be a significant step towards an even more declarative development paradigm, and we are still actively engaged in research and development in this direction.

View Updatability

This topic could just as easily be discussed in the What Went Right section, as we believe that by incorporating the view updatability mechanisms described in TTM, Dataphor has hands down the best view updatability of any DBMS in the industry today. However, we also feel that this is an area that could stand some significant improvement. The biggest unresolved issue is how to provide the logic for the update in such a way that the system can incorporate it and use it in larger contexts. The end goal would be for updates to be as flexible and powerful as expressions; a sort of closure over updates, if you will. We feel that TTM provided an excellent starting point for this, but that more research and development is necessary.

Multiple Assignment

The idea of multiple assignment honestly was not something we gave much thought to in the early years of Dataphor development. Later on, however, we became interested in it as we began to refine the transaction processing aspects of the system. After considering the ideas about multiple assignment in the latest edition of TTM, we have come to the conclusion that it undermines one of the fundamental abstractions of any procedural language, the procedure, or abstraction over the statement.

The problem as we see it is that because the constraint enforcement boundaries are tied to the syntax of the language, there is no readily apparent way to have constraint enforcement span procedure calls. In other words, we believe that the control of constraint enforcement boundaries is a fundamental aspect of the system that should be available to the developer.

We feel compelled to point out that we realize that this is precisely the problem that is being addressed by multiple assignment, namely the principle that no transaction ever be allowed to read data in an invalid state. However, we remain unconvinced that application logic can be separated to that degree. We feel that there are cases where the constraint enforcement boundary would necessarily span procedures, and must therefore be an explicitly available control mechanism in the language. For this reason we maintain in Dataphor the traditional transaction processing approach of combining the constraint enforcement boundary with transaction committal.

What Comes Next

As discussed in the sections above, we feel there are still many areas of DBMS technology that can benefit from further research. There are many areas of Dataphor that need improvement, and many new features that we would like to add. Temporal support, for example, is something we have already taken significant design steps towards, but have not yet had the time to work those changes into the system. In general, we are constantly searching for new and better approaches to the problems of application development, and have already benefited greatly from the contributions of TTM. To further the development of Dataphor, we have made the project open-source under a liberal license. For information on the Dataphor project, visit www.dataphor.org. We also sponsor a free support forum for users of Dataphor (www.databaseconsultinggroup.com\forums), as well as provide distribution and support services for commercial endeavors at www.alphora.com.

Dee

from Greg Gaughan

I read Date and Darwen's "A Guide to the SQL Standard" in 1998 and its precise, concise description of an SQL with features far beyond those implemented by any existing product inspired me to develop ThinkSQL (http://www.thinksql.co.uk) - an SQL server with powerful constructs from the latest standard. The criticisms of the language in the guide hinted at a better way, and reading Foundation for Future Database Systems (The Third Manifesto) (Second edition). Addison-Wesley (2000) (TTM) certainly provided the details. It took several readings to fully understand the scope of the shift.

Python

I'm familiar with a number of programming languages, but Python provides the best combination of readability and power. So building an implementation of a D in Python seemed like a useful thing to do. Python is an interpreted language which provides the ability to override the built-in operators. The initial implementation of Dee comprises a number of Python classes, e.g. Relation and Tuple, with their associated operators.

I implemented the Tuple and Relation constructors using Python classes. To construct a Tuple:

Tuple (SNO='S1', PNO='P1', QTY=300)

And to create a Relation:

Relation ( [ 'SNO', 'PNO', 'QTY' ],

[

Tuple (SNO='S1', PNO='P1', QTY=300),

Tuple (SNO='S1', PNO='P2', QTY=200),

] )

or using the positional shorthand:

Relation ( [ 'SNO', 'PNO', 'QTY' ],

[

('S1', 'P1', 300),

('S1', 'P2', 200),

] )

And there are a number of methods to load Relations and Tuples from, and unload to, Python lists.

Algebra

With the Relation class available, I could then implement four fundamental operators from the A algebra: ◄AND►, ◄OR► (limited to union for practical reasons), ◄MINUS► (i.e. practical ◄NOT►) and ◄REMOVE►, with the idea of building the more complex operations from these, following the approach taken by TTM's 'A New Relational Algebra' (i.e., A). All the other operators were built on the fundamental ones, e.g. COMPOSE was built from ◄REMOVE► and ◄AND►:

def COMPOSE ( r1, r2 ) :

″ ″ ″ AND and then REMOVE common attributes (macro) ″ ″ ″

A = list (r1.heading () .intersection (r2.heading ()))

return REMOVE (AND (r1, r2), A)

Also, following an idea in TTM - 'Treating operators as relations', I implemented RESTRICT and EXTEND in terms of AND. This was made easy because of Python's ability to pass functions around as first-class objects. To pass a Python expression to functions, such as RESTRICT, I make use of Python's lambda keyword. This allows anonymous functions to be built. The RESTRICT function is available as a method (where) on the Relation class, so for example:

r1.where (lambda t: t.a1 == 5)

would pass the expression 't.a1 == 5' through to the RESTRICT function on a relation r1, using t as variable that ranges over each tuple in the relation. The expression can use all the features of Python and returns either True or False (Python's built-in Boolean values) for each tuple.

Operators

The operators associated with the Relation class are overridden to give a natural Python-like (Pythonic) syntax, e.g.

r1 == r2 #equality

r1 <= r2 #subset

r1 < r2 #proper subset

r1 >= r2 #superset

r1 > r2 #proper superset

r1 in r2 #tuple membership

r1 ( [ 'a1, 'a2'] ) #projection

r1 & r2 #AND, i.e. intersection or natural join or Cartesian join, depending on how many attributes are shared

r1 | r2 #OR, i.e. union

The power of Python really shines through when building functions such as WRAP and UNWRAP, which in most languages would be lengthy and bug-prone, in just a line or two of code. Building on the Relation's extend, project and remove methods, we have:

def WRAP (r, Hr, wrapname) :

″ ″ ″ Wrapping (macro)″ ″ ″

return r.extend ( [wrapname], lambda t:{wrapname:t.project(Hr)}).remove (Hr)

Adding user-defined relational functions is straightforward: the function simply needs to accept one or more relations and return a relation. An example of a recursive relational function is transitive closure, TCLOSE (details are here: http://www.quicksort.co.uk/DeeDoc.html#user-definable-operators).

Methods

Many other functions (upper-case names) are available as methods (lower-case names) on the Relation class and these can be conveniently chained together to offer both an alternative syntax and the opportunity to be executed more efficiently using pipelining rather than fully evaluating and storing each step. For example:

(r1 & r2) .where (lambda t: t.a1 == 5 and t.a2 < 100 ) .remove ( [ 'a3', 'a4'] ) .group ( [ 'a6'], 'grouped')

would join r1 and r2, then filter the result, then remove attributes a3 and a4, and then add a grouping. To use the functions directly, the following syntax would give the same result:

tr0 = AND(r1, r2)

tr1 = RESTRICT(tr0, lambda t: t.a1 == 5 and t.a2 < 100)

tr2 = REMOVE(tr1, ['a3', 'a4'] )

GROUP (tr2, ['a6'], 'grouped' )

Virtual Relvars

Virtual relvars can be created using lambda to evaluate relational expressions. For example:

r3 = r1 & r2

immediately evaluates the expression on the right hand side and assigns the result to r3. Whereas:

r3 = Relation ( [ 'r1a1', 'r2a1' ], lambda: r1 & r2 )

assigns the expression r1 & r2 to the virtual relvar r3 which will then be re-evaluated each time r3 is used in an expression.

Updates

Insert and delete methods are available for the Relation class and have been set to override Python's |= and -= update in-place operators, so the following could be used to insert or delete relations (or tuples):

r1 |= r2 #insert (union update in-place)

r1 -= r2 #delete (minus update in-place)

Also an update method is available which takes two lambda expressions: one to select which tuples to update and one to say how to update those tuples, e.g.

r1.update ( lambda t: t.a1 == 5, [ 'a1' ], lambda u: { 'a1':0 } )

would update each tuple in r1 that had a1 == 5 and set a1 to 0. Of course, the update is just a shorthand for a delete followed by an insertion.

Access to the original tuple is also provided via the _OLD_ prefix, e.g.

r1.update ( lambda t: t.a1 < 5, [ 'a1' ], lambda u: { 'a1':u._OLD_a1∗10 } )

would update each tuple in r1 that had a1 < 5 and set a1 to its original value multiplied by 10.

Operators as Relations

I tried to use the idea of treating operators as relations to create more powerful relational expressions:the details are here. As an example, I defined an operator, 'plusfn', and wrapped it in a virtual relvar, 'plus', which could then be 'composed' with arguments, e.g. what's 3 + 4?:

COMPOSE ( GENERATE ( { 'x':3, 'y':4} ), plus) .toTuple () .z

which would return the value 7. The result of the composition (before the 7 is extracted) is a relation, meaning multiple results could be returned from such 'relational functions'. I think there are many areas where such expressions using relations would be advantageous, e.g. matrix operations.

Other implementation details

Constraints can be passed to the Relation constructor. These are again based on lambda functions, where a True result satisfies the constraint, and with shorthands for key and foreign-key constraints.

To improve the speed of joins, every column value is hashed so that hash-joins can be performed when using the AND function (i.e. we take the intersection of the sets of rows with matching values).

Rudimentary transactions and persistence are provided by versioning the database and pickling to disk (Python's serialisation mechanism).

Some important aspects of TTM still to be implemented are:

• system keys

• updates to virtual relvars - would need to parse Python expressions to discover the component relvars

• multiple updates - I have plans to move all relvars to a database namespace to make this, and assignment-time checks, possible

I have no plans to implement the inheritance model proposed by TTM because it overlaps so much with Python's type system (although I do find the ideas of Generalisation and Specialisation by Constraint enticing). I think it would require a separate interpreter and virtual machine to implement it.

Some aspects of TTM which were difficult to implement were:

• persisting arbitrary constraints and view definitions (there are some security and Python versioning implications with saving and loading such executable code)

• general division - mainly because of a lack of examples

• system catalog - supporting nested relations means the catalog now needs to allow recursive definitions

Dee is open-source and available here: http://www.quicksort.co.uk.

Duro ? a relational database management system based on The Third Manifesto

from Ren� Hartmann

Duro is an attempt to implement a DBMS conformant to TTM.

The goals of the duro Duro project are:

- Compliance with RM and OO Pre- and Proscriptions

- A layered architecture, where relational functionality is implemented on top of a lower-level database engine.

- An industrial strength, not just educational implementation (at least a long-term goal)

- An open-source license

Implementation on top of a SQL DBMS was rejected. This would not be a viable long-term solution, and Duro is not meant to be a throw-away implementation.

Instead, the database library Berkeley DB was chosen as a basis for the implementation, since it provides all the necessary functionality, such as B-trees, hash tables, and nested transactions. It also has industrial strength properties and is available under an open-source license.

The initial goal was the design and implementation of a library that provides all TTM functionality that a library can provide. This library was supposed to serve as a basis for a higher-level layer which provides the remaining functionality (functionality only a language can provide).

The latter fuctionality is now available through an interpreter called Duro D/T, which supports a language based on Tutorial D. (The T in Duro D/T stands for Tutorial D).

The Duro Library

The Library provides a C API. The library can be used in two ways:

- By application programs written in C (or another language which can use a C interface)

- By a higher level layer which provides more fuctionality, e.g. by a TTM-conformant language implementation like the interpreter mentioned above.

The Duro library supports all RM and OO Pre- and Proscriptions, but as an API, it cannot support them as a language, only as function calls. This means, for example, that the TUPLE and RELATION (RM Prescriptions 6 and 7) are not supported as language constructs but in the form of API calls (RDB_create_tuple_type() and RDB_create_relation_type()¾note that these do not create types, they just create C structures for the types).

The interpreter Duro D/T

Duro D/T is an interactive program which reads Tutorial D statements and executes them. Currently it is not a complete implementation of Tutorial D, but implements most of it. It is still in a state of continuous improvement.

Other components

Duro contains a Tcl interface which supports Tutorial D expressions (but not statements) and a graphical administration tool.

Duro and TTM

As noted before, Duro is designed to conform to all RM and OO Pre- and Proscriptions

In my opinion it can be debated if RM Proscription 7 should should be interpreted in a way which prohibits cursors. Instead, creating a cursor could be seen as implicitly creating an array, which then is iterated over. Nonetheless in Duro I adopted a scrict approach which requires the explicit creation of an array in order to access the individual tuples in a relvar.

Multiple Assignment and Constraints

The Duro library implements multiple assignment and constraints in a way that provides reasonable performance for two important cases: "Tuple constraints" and foreign key constraints.

The central entry point for multiple assignments (and assignments in general) is the call RDB_multi_assign(). This call performs the following steps:

1. Transform assignments of virtual relvars into assignment of real relvars.

2. Check the constraints by performing the following steps:

- For all constraints which contain one of the relvars at the left-hand side of an assignment, generate an expression in which the relvars are replaced by the respective right-hand side.

- Transform the expression, exploiting the fact that the constraint expression is true before the assignment.

- Evaluate the expression. If the result is false, stop processing with a constraint violation error.

3. Execute the assignments.

Duro tries to optimize constraint checking, that is, to avoid evaluating the constraint expression during the execution of every statement which could violate the expression.

This constraint checking optimization works only for constraints of the form IS_EMPTY(). Duro attempts to optimize these constraints exploiting the fact that the condition is true before the execution of a statement.

Consider a foreign key constraint given in the following form:

IS_EMPTY (T1 {A} MINUS T2 {A} )

If the tuple T is to be inserted into T1, the following expression must evaluate to true for the insert to be permitted:

IS_EMPTY ((T1 UNION RELATION{T}) {A} MINUS T2 {A} )

Duro transforms this to:

IS_EMPTY(RELATION{T} {A} MINUS T2 {A})

To check this, it has only to be checked if T {A} is an element of T2 {A}, which is a fast operation if there is an index for T2 on attribute A.

Similarily, the "tuple constraint" IS_EMPTY(T1 WHERE C), if the tuple T is to be inserted into T1, is transformed to IS_EMPTY(RELATION{T} WHERE C), which is inexpensive to evaluate.

Duro supports all truth-valued expressions as contraints, but if the transformation described above cannot be performed the constraint checking can make assignments quite slow for large relvars. It remains a challenge for the future to extend the constraint checking optimization to support more cases efficiently.

Inheritance Model and Very Strong Suggestions

The IM Prescriptions and Very Strong Suggestions were not rejected, but their implementation has lower priority than the RM Pre- and Proscriptions.

The interitance model is not supported, but support may be added in the future. It would be interesting to know to what extend the constraint-based approach to type inheritance turns out to be used for real applications. This can only be known if such a system is made available for end users and application developers. At least node types could turn out very useful in practice, having some similarity to what is called interfaces in some OO languages (e. g. Java). There seems to be a trend in Java programming to avoid subclassing in favor of inheriting from interfaces. This might benefit the adoption of TTM-style type inheritance as an alternative to the usual subclassing approach.

As for the Very Strong Suggestions, I have no objections to any of them and consider most of them important, but their implementation is postponed to the future.

Ingres D

from Adrian Hudnott

Ingres D is the first implementation of The Third Manifesto by an already successful commercial DBMS vendor. The eventual aim of Project D (as the process of developing Ingres D is known) is to implement everything described in The Third Manifesto and Temporal Data and The Relational Model, with one exception ? RM Very Strong Suggestion 8, that SQL is implementable in D. Ingres, like MighTyD, accepts programs written in either Tutorial D or SQL and programs written in one can freely access and modify databases created and predominantly manipulated using the other. This avoids the need for Very Strong Suggestion 8, but does raise similar problems, such as when an SQL user creates a nullable column then how will that appear when accessed from a Tutorial D program? A null written from SQL won?t behave like a null in Tutorial D, but how missing information will be represented is an open question.

Ingres D development is still in the early stages. A parser has been constructed at Technische Universit�t Ilmenau that can read Tutorial D code and convert select-project-join queries into a language neutral abstract syntax tree format that it shares with SQL and QUEL (another language that Ingres supports that predates the popularity of SQL). Once a query is converted into this form it can be processed by the existing Ingres optimizer and query execution engine, although supporting more complex parts of Tutorial D will require additions to the back-end DBMS as well. My role is to tackle the areas of the Manifesto that require original research to produce a high performance implementation that meets both the usability requirements of a proper DBMS based upon the Manifesto and the workload requirements of a modern data center. The focus of my work is on efficient execution of multiple simultaneous assignment and efficient enforcement of complex data integrity constraints (generally involving multiple relvars) with only a very low impact on performance for most constraints.

Ingres D was first proposed when Ingres engineers Jeremy Peel and Ray Fan started putting together a set of requirements for a radical new version of Ingres that could differentiate itself from the larger SQL DBMS products with useful features for the future and they soon realized that they were effectively proposing that Ingres comply with The Third Manifesto. Although Jeremy and Ingres have since parted, Project D has been approved at a high level within Ingres and they are kindly able to fund my PhD studies. However, most of the non-research work for Project D is intended to be completed within the open source community so we need you to get involved and contribute to the development effort. You can find out more about Ingres open source projects at http://community.ingres.com/wiki/Engineer

MighTyD

MighTyD began in 2005 as a project created by me and four colleagues¹ for our Masters thesis at the University of Warwick. Hugh Darwen had suggested producing a Tutorial D implementation as a project idea to one of our lecturers some time previously but by the time that we were beginning our project Dave Voorhis had already implemented Rel. Inspired by Hugh?s invited talk on temporal databases we approached him with the idea that we could differentiate our efforts from Rel by focusing on the topics covered in the Temporal Data and The Relational Model book. MighTyD supports PACK and UNPACK on a single attribute and restriction, projection, join, rename, difference, semi-difference, union and extend each with and without the temporal extensions. MighTyD supports tuple and relation literals, including all of the tuple operators. In addition it can access data contained in PostgreSQL databases. Another student subsequently added the ability to create new base relvars from MighTyD, insert and delete data and declare and enforce temporal database constraints. MighTyD supports the basic Tutorial D scalar types as well as intervals of integers and the WEEKDAY example from the book, along with the accompanying Allen?s operators.

There are large parts of the Manifesto not implemented by MighTyD, such as user-defined types and user-defined operators. Although nothing intentional in the design of MighTyD prevents adding these features it was beyond the scope of the initial project to implement them. We found two difficulties in implementing MighTyD. Firstly, MighTyD is implemented in Java and we found that some features of the language helped us, such as the strong typing, good tool support and the rich collection of data structures available in the standard library, while others worked against us. For instance, ?A relation is a set of tuples?, implies an object model where a relation object possesses pointers to tuple objects, but then, ?A tuple has a heading? makes for a lot of redundantly stored copies of that heading!

Secondly, a lot of knowledge is required to implement a successful DBMS. There?s DBMS specific knowledge, such as knowing how to implement an efficient join, although there were no specific performance requirements for MighTyD and the only area that where we could contribute unique optimizations is the implementation of the PACK operator. DBMS implementation is a fantastic application area for developing interesting data structures and algorithms, but unfortunately it is largely omitted from undergraduate courses. However, I have the good fortune to be studying it for my PhD (see Ingres D below). What we had not anticipated though was the level of competence in compiler writing required. The Manifesto model of user-defined types and inheritance (neither of which we implemented) only add to that difficulty. Although several team members had studied compilers it is a complicated area where there was no substitute for practical experience. That said, the fact that a D has the structure of a regular programming language does at least permit study and implementation using classic compiler design techniques, as opposed to the complicated bespoke patterns that you?re more likely to find in an older DBMS. We had a really fun time implementing MighTyD though, and we hope that you find it useful, either for teaching, to learn about DBMS implementation, or to build on if you think that you can improve it. Full technical documentation is available.

_______________________________________________

[1] Rachel Bowers, Sergey Petrov, Thomas Pick, Issam Souilah

RAQUEL

from David Livingstone

Background

An early version of RAQUEL (= Relational Algebra Query, Update and Executive Language) was developed in 1991 at Northumbria University?s School of Computing for teaching relational concepts. As RAQUEL was well received by students, it was further developed over the years and utilised successfully as a teaching tool within the School.

In 2002, the Open Database Project Group was founded at Northumbria University with the aim of building an Open Source DBMS that would use RAQUEL to qualify as a valid D in an implementation of The Third Manifesto (= TTM). In September 2008 the ?Proof of Concept? fund of NorthStar Equity Investors Ltd decided to financially support the development of this work with the objective of building a prototype DBMS by the summer of 2009, with the possibility of a more extensive version in 2010.

The RAQUEL DBMS Compared to the Third Manifesto

The 3^rd edition of TTM (= TTM3) has ?two key features that make it stand out : its strong focus on the relational model, and its thorough treatment of type theory? (quote from page 3). The RAQUEL DBMS is intended to fully support both features. However since the type theory cannot be applied in the absence of an underlying relational DBMS, it is anticipated that type inheritance mechanisms will not be provided until the functionality of the relational model has been implemented on a significant scale. Scalar types are implemented orthogonally to relations so as to provide a sound basis for type inheritance.

Furthermore it is considered that the DBMS should be able to handle scalar data types of a wide variety of natures, for example scalar types ranging from pictures to geographical data to music, in addition to the traditional character-based data types like numbers and text. While TTM3 does not explicitly demand this, it is implied by the statement on page 5 that ?DBMSs needed to evolve to deal with much more sophisticated kinds of data?, together with the example illustrated by figure 1.1. The RAQUEL DBMS implementation will give a high priority to supporting such a range of scalar types despite the fact that it poses significant challenges.

The Application of ?Essentiality?

TTM3 asserts that ?the solid foundation .. must be firmly rooted in the relational model of data, first presented to the world in 1969 by E. F. Codd? (quote from page 8). A highly significant paper in demonstrating the superiority of the relational approach over the network approach to databases was Codd?s paper entitled ?Interactive Support for Programmers : the Relational and Network Approaches?. As C. J. Date points out in chapter 10 of [1], exploiting the concept of ?Essentiality? was what enabled Codd to achieve this. A concept is inessential in a model if its loss does not result in any loss of functionality. The relational model disposes of all inessential concepts without disposing of any functionality. Thus it maximises its ease of use because it maximises the ratio of functionality to conceptual complexity. ?this ratio of function to conceptual complexity is the ultimate test of systems design? [2].

RAQUEL and the Open DB Project have always considered ease of use to be the prime objective. Consequently maximising the ratio of functionality to conceptual complexity is the top priority. To achieve this, ?Essentiality is considered essential?; unnecessary concepts are dispensed with, and those that are used should be such as to provide ?Conceptual Integrity? - see [2]. This philosophy has been applied to the design of RAQUEL. It is underpinned empirically by :

- the very positive feedback received from students when it has been used for teaching at Northumbria University;

- copying the means of applying essentiality in several other computer languages that were very successful in their application areas.

Hence RAQUEL is used to implement TTM rather than Tutorial D. Since the concepts of a language come first and then the syntax to express them, the differences between RAQUEL and Tutorial D can be summarised as follows :

- RAQUEL only handles relational variables and relational values. A tuple is considered an integral part of the concept of a relation, in the same way that an array element is part of the concept of an array. RAQUEL excludes tuple types, tuple variables, and tuple operators as separate and independent concepts. (It is noticeable that TTM3 itself is lukewarm in its support for tuples. Page 148 includes the following : ?The main reason the Manifesto talks about tuples at all is because it obviously has to talk about both relations and scalars, and tuples are the logical middle ground?; ?we do not expect tuple values and variables to play much of a part in any real application of D except in the ?logical middle ground? kind of way?). Since scalar variables are not stored in a relational database, they are omitted as well (although not of course scalar values). These omissions do not lose functionality and do not prevent RAQUEL from being computationally complete.

- The relational concepts are generalised so that the syntax rules can likewise be generalised. All processes follow standard syntax rules, and are categorised as either ?operators? or ?assignments?. All operators are purely read-only. Assignments are generalised such that update operators, declarations, etc can all be handled simply and straightforwardly as particular instances of assignment¹. For both operators and assignments, relational and non-relational parameters are syntactically distinguished to provide clarity. Analogous conceptual and syntax rules apply at both relational and scalar levels; this is particularly important to facilitate nested relations. By contrast, page 122 of TTM3 states ?You may have noticed that the syntax of operator invocations in Tutorial D is not very consistent? and ?the possibility of adopting a more uniform style seems worth exploring?. RAQUEL has achieved such a style.

Open Architecture

The Open DB Project Group?s approach to designing the RAQUEL DBMS is founded on an ?Open Architecture?. This has two aspects :

1. The now current approach of facilitating plug-in modules to allow the DBMS to incorporate additional specialised functionality as required. For RAQUEL these plug-ins fall into three areas :

a. physical data storage mechanisms;

b. support for scalar data types by providing implementations of physical representations of values, plus implementations of operators and/or assignments applicable to such values - this support might also require additional new physical data storage mechanism(s);

c. communication facilities to permit the DBMS to interact with a variety of facilities external to it.

2. A building block approach to the construction of the core DBMS. At each of several levels of abstraction, the DBMS architecture consists of a number of modules, all of whose functionality and interfaces are precisely defined. This allows individual modules to be replaced by functionally equivalent modules and collections of modules to be replaced by equivalent collections of modules. It also allows modules to be omitted when they are unnecessary (e.g. a global optimiser for a standalone system) or added when they are necessary (e.g. for some specialised purpose). This strategy contrasts with the strategy of building ?cut-down? DBMSs in order to solve the problem of bloated DBMSs. The latter strategy assumes and specifies some limited functionality and incorporates it into a monolithic DBMS. The RAQUEL ?building block? strategy allows the user to determine the functionality required and build the DBMS of their choice². This strategy supports the application of relational concepts in areas where currently these is no means of application, e.g. as a component of computer games. In the long run the strategy will need supporting facilities to help users compose their DBMS.

The open architecture is not a requirement of TTM per se, but does support the application of TTM and the development of DBMS modules via the Open Source route (which is the Open DB Project Group?s strategy for implementation).

Exploiting Relational Concepts

In the longer run, there are several areas where the Open DB Project Group would like to exploit relational concepts :

1. The provision of a new form of logical data independence that allows the database user to manipulate sequences and bags of tuples as well as sets of tuples, i.e. relations. Sequences and bags would be mapped to standard relations for storage. The DBMS would provide operators and assignments that function at the sequence and bag level, thereby avoiding the need for the user to translate their sequence/bag processes to the relational level in order to express them. There are certain components of the manipulation of a purely relational database that already conceptually call for sequences and bags anyway; e.g. the sorting of query output, the need for the ?Summarize? operator to be able to deal with bags of attribute values. Sequences and bags of tuples could thus be seen as a major generalisation of these facets.

2. The provision of a ?Stored Relvar Schema? for every database³. Under the ?Logical Schema?, which holds all the real relvars in the database, the Stored Relvar Schema holds all those relvars whose data is held in physical storage. A real relvar defaults to a stored relvar unless the Database Administrator explicitly maps it to a set of stored relvars via a relational algebra expression. In effect the mechanism which provides virtual relations in terms of real relations is also used to provide real relations in terms of stored relations. The purpose of this is to extend the range of storage possibilities available, and incorporate some physical storage optimisation into the logical optimisation process so as to improve performance. For example it would be possible to merge the data of several real relvars into one physically stored relvar, thereby improving the performance of join operations since the data would be stored in its joined configuration. The feasibility of this strategy depends on solving the ?View Updating Problem?, although since stored relations are never visible to the database user, not all of the ramifications of the problem may be significant.

3. The provision of a scalar data type that holds RAQUEL statements. In the short run this provides a means of (say) storing integrity constraints expressed in RAQUEL in the meta database. In the longer run, it provides the potential for a way of having a relation that corresponds to a spreadsheet and could be used as such. More work needs to be done to investigate this possibility.

4. The physical data independence provided by a relational DBMS should include the provision of a wide range of storage mechanisms. The range should be extended beyond the tradition of a portfolio of storage file mechanisms to include any means of storing data. For example it could include in-RAM storage (e.g. for high performance real-time databases), relational toolkits (e.g. Berkeley DB, db++), and complete external SQL DBMSs (which could provide a means of combining a RAQUEL DBMS with SQL DBMSs).

Concerns Regarding Scalar Types

On page 56, TTM3 states that scalar types are either system-defined (i.e. built-in) or user-defined types. It adds that this terminology is not very good, particularly as the whole point is that types of both kinds should behave in the same way, and RM Prescription 4 asserts this specifically. In practice this categorisation is even more unhelpful in that it is possible to buy additional sets of scalar types to plug into many brands of DBMS. Are plug-in types built-in or user-defined ?

TTM3 further distinguishes between system-defined scalar types that have a declared possrep and system-defined scalar types that do not have a declared possrep. On page 69, the difference between the two kinds of type is stated to be solely a matter of whether a possrep is declared or not.

By contrast, RM Prescription 4 states that all user-defined types must have at least one declared possrep.

It is suggested that a more helpful categorisation of scalar data types is between those that are ?primitive? and thus have an ?innate? representation, e.g. numbers and text, and those that are non-primitive and use a ?selector? representation, e.g. a date type whose values are expressed by a date selector operator such as DATE( y, m, d ). This categorisation provides a clear and unequivocal conceptual difference between the two kinds of type. It conforms to the way that database users will inevitably perceive scalar types through their writing of scalar values - are they written with or without a selector operator ? It also conforms to the way that the DBMS has to implement scalar types. The DBMS must provide specific facilities for the innate representation of every primitive type, whereas for all non-primitive types a single generic facility can be used that exploits selector operators through their foundation (possibly recursively) on primitive types.

The RAQUEL DBMS employs this categorisation for scalar data types. It does not conflict with TTM3 type theory but rather provides a sound practical foundation for it, which currently is omitted from TTM3.

The categorisation also provides a clear-cut means of determining whether a scalar type is primitive or nonprimitive for the DBMS. For example, geometric shapes (e.g. triangles, squares) may be conceptually a primitive type because they are unlike any other kind of data value. However, whether they form a primitive type to the DBMS or not depends on how such shapes are represented, because the software can only manipulate data bits not understand them. If geometric values are always represented in some way that is entirely different from any other type handled by the DBMS, then they will indeed form a primitive type. On the other hand if geometric values are always represented by numeric values (of a pre-existing number type) that are used to specify their vertices as points, then they will form a non-primitive type because they can be recognised by means of an appropriate selector operator.

Bibliography

[1] C. J. Date. The Database Relational Model : A Retrospective Review and Analysis. [Addison-Wesley, 2001].

[2] Frederick P. Brooks. The Mythical Man-Month : Essays on Software Engineering. Chapter 4. [20th anniversary edition. Addison-Wesley, 1995].

[3]The Lowell Database Research Self Assessment. Report of 30 authors of the Lowell Workshop, 4-6 May 2003. See here

__________________________________________

[1] This exploits research previously carried out at Northumbria University.

[2] It also allows academics such as the ?Lovell Group? members to pursue and contribute their research; see [3].

[3] The RAQUEL DBMS will support databases organised via the ANSI/SPARC standard and its derivatives.

Rel

from Dave Voorhis

Rel is an implementation of Tutorial D, intended primarily for database students or anyone studying the relational model. It is a fully self-contained, multi-user database management system that runs on Microsoft Windows, Apple OS X, Linux and any other operating system that supports the Java platform.

The Rel distribution includes a simple user interface for evaluating Tutorial D queries and writing and executing Tutorial D scripts. It can connect to multiple local ?desktop? databases and remote Rel servers over the Internet. Its output can be exported as HTML; this makes it easy to include Tutorial D code and query results in Web pages, overhead slides, and word processor documents. This makes Rel ideal for preparing teaching and other presentation materials.

Rel is open source and released under the Apache License, so users are free to examine, modify, use and extend its internal workings. The source code has been organised to facilitate using it in teaching contexts. The lexer/parser, interpreter and virtual machine have all been used on their own to teach related computer science concepts, while Rel as a whole helps emphasise the fact that ?relational database language? and ?SQL? are not necessarily synonyms!

As of this writing (March 2009), support for user-defined types via Tutorial D syntax is incomplete (though new built-in types can be defined in Rel by including Java code), as are a few other relatively minor features. By the time you read this, these facilities may well be implemented. Support for user-defined types is an active work-in-progress. Over time, Rel will become ? and will remain ? a complete and faithful implementation of The Third Manifesto and Tutorial D. However, some extensions have been added, and there are some minor deviations from the Tutorial D specification ? mostly made for technical reasons ? which are identified on the Rel Web site (see below).

As it stands (again, as of this writing), Rel is entirely suitable for teaching relational database concepts and may be suitable for some practical database purposes. The main obstacles to using it as a ?production? or ?industrial? DBMS are the lack of security facilities, minimal query optimisation, and absence of either a standards-based database access mechanism or a set of native user-interface libraries. A standards-based database access mechanism such as ODBC or JDBC would allow it to be used as a conventional stand-alone DBMS, accessed via client-side programming languages. Native user-interface libraries would allow it to be used as a general-purpose, database-oriented programming language. In the long term, both approaches will be supported, which will allow Rel to be used (perhaps simultaneously!) as both a stand-alone DBMS and a general-purpose programming language.

For further information, or to download Rel, please see http://dbappbuilder.sourceforge.net/Rel.html

SIRA_PRISE

from Erwin Smout

It was XML that did it. The company I worked for had taken the decision that it should move on to what it considered to be "modern technologies". That was management lingo for "OO for programming and XML for data". Finding out a few things about this XML thing, I couldn't believe that anyone would seriously consider using this as a technology for operating on data. Googling for "XML criticism" led me to Fabian Pascal's www.dbdebunk.com, which in turn led me to this book's authors' series of books and articles on the relational model. Their observation that "the relational model has never been implemented properly", combined with the fact that I wasn't exactly pleased with the things I observed in my professional environment, made me decide to give it a try, "implement the relational model properly" [1].

SIRA_PRISE is the result of that try. The name stands for "Straightforward Implementation of a Relational Algebra - Prototype of a Relational Information Storage Engine. While obviously mainly inspired by TTM, it is certainly also true that in some respects, SIRA_PRISE differs seriously from the material presented therein.

For one, TTM presents a computationally complete programming language. SIRA_PRISE deliberately does not. Instead, it uses the data sublanguage approach. The consequences thereof are obviously numerous. To name but one, SIRA_PRISE does not "know" the concept of a program, and hence cannot support the concept of "program-local relvars" either, whereas D is required to support such a concept.

Second, the Tutorial D language presented in TTM, is explicitly intended to be only didactic in nature. SIRA_PRISE aims to offer a usable, fully operational DBMS. So nearly all of the features that TTM mentions only as "intentionally left out", are indeed present in SIRA_PRISE : recovery and concurrency controls, security and authorization controls, administration controls, controls for the physical design of the database, sessions and connections, communication with the outside world (that is, facilities for application programs to interact with SIRA_PRISE), exception handling and feedback information, ?

Furthermore, allthough this has little to do with TTM per se, it is worth mentioning explicitly that SIRA_PRISE has also been inspired by the ideas of "Temporal Data and the Relational Model". That is, SIRA_PRISE offers full support for management of temporal data, and in fact just any kind of interval-typed data.

The remaining sections provide a more detailed item-by-item explanation of any differences between SIRA_PRISE and the material in "Databases, Types and the Relational Model". It might be observed that some of these "differences" actually relate more to Tutorial D than to TTM per se, and may therefore be not so important to a system that "deliberately does not try to provide a programming language". I choose to mention them here nevertheless, because even if SIRA_PRISE doesn't concern itself with a programming language per se, part of SIRA_PRISE's goal is that (an implementation for) such a programming language can "trivially be built on top of it".

Aspects of TTM I disagree with (anywhere between slightly and completely).

You can take this to mean, aspects that I have not implemented in SIRA_PRISE, nor have any intentions to do so.

Direct relational assignment

TTM requires support for direct relational assignment for all relvars. I think that it is a bad idea to allow such assignments to persistent database relvars which are "shared" between several programs running concurrently. And all the relvars that are managed by SIRA_PRISE fall into that category almost by definition.

Suppose some program is allowed to execute the assignment CUSTOMER := RELATION {?}. That means that the system should delete all the tuples in CUSTOMER which do not appear in the stated relation value selector. The bad thing about this is that there is no reasonable guarantee that that program/user knows which tuples those are, or put otherwise, it is perfectly possible for that program/user to assert the falseness (that's what deletion is all about) of a whole bunch of propositions contained in the database, without that program/user even knowing which those propositions are ! In particular, it might imply that some tuple is about to be deleted that might have been inserted just a few milliseconds ago, very deliberately and intentionally, by some other concurrently running program (implying that the operation of that program is quite likely to be disturbed). I think it is a good idea if a system takes measures to try and prevent users from being so careless.

Please note that this does not prevent a SIRA_PRISE user from achieving the same effect as a direct assignment. Thanks to SIRA_PRISE's support for multiple assignment, the command 'CMD(ADD R,MINUS(r,R))CMD(DELETE R,INTERSECT(R,r))' has exactly the same effect as D's 'R := r'. SIRA_PRISE only forces the user to be explicit about the delete portion that remains implicit in the assignment.

Nested Rename & Extend

SIRA_PRISE deviated from the "old" ("nested") semantics of RENAME and EXTEND, and already implements the "revised" semantics of these operators that are explained in detail elsewhere in this book.

UNION h {}, INTERSECT h {} and JOIN {}

These operator invocations are not supported as such by SIRA_PRISE. But "bypasses" are available to the user which are, imo, much clearer to any unsuspecting maintenance programmer than the expressions as mentioned. 'UNION h {}' (D syntax( can be replaced with 'RELATION(HEADING(h)BODY())' (SIRA_PRISE syntax), 'INTERSECT h {}' can be replaced with 'RELATION(HEADING(h)BODY(TUPLE(?)?))', and 'JOIN {}' with 'TABLE_DEE'. Only the fact of having to specify a possibly very lengthy relation body in the case of intersect might be inconvenient, but I don't think the need for some other solution is big enough to warrant the effort on my part to provide such other solution.

SUM(), AVG() and the likes in SUMMARIZE

Contrary to >Tutorial D, SIRA_PRISE does not require distinct operator names for the aggregation operators that are invoked in summarizations (an aggregation operator is any operator, scalar or nonscalar, that is both associative and commutative - e.g. integer addition, boolean conjunction and disjunction, relational union, ? The fact that an operator has these properties is declared by the operator implementation code). Summing is done by invoking '+(?)', not 'SUM(?)'. This slightly alleviates the work for someone writing the implementation of an associative operator that must be available for aggregation, and slightly reduces the number of operator names that a user using summarize must memorize. Admittedly, this approach does present some difficulties to the user in cases such as AVG() and MEDIAN() [2].

If the operator implementation does not provide an identity value, SIRA_PRISE always raises an exception (if and when it needs that identity value), so there is no room for "undefined results", as suggested by OO prescription 6.

Possrep names

In SIRA_PRISE's language, possreps have no name of their own. Selector operators corresponding to distinct possreps of the same type can be disambiguated by looking at any component name, which must be unique across all possreps of the type. A possrep with no components must also be unique within its type, and can thus be uniquely identified within its type. Absent a possrepname, value selector invocations are thus always introduced by the typename:

POINT(X(RATIONAL(1.0))Y(RATIONAL(1.0))) or POINT(X(1.0)Y(1.0)) invokes the 'cartesian' value selector, POINT(R(RATIONAL(1.0))THETA(ANGLE(1.0))) or POINT(R(1.0)THETA(1.0)) invokes the 'polar' value selector. Tutorial D uses positional parameters in its value selectors, and uses the possrepnames instead of the typenames to introduce the expression. This has the perhaps not-so-slight disadvantage that in Tutorial D, it is not immediately clear which type of value the expression denotes (there is no mention of 'POINT' in 'POLAR(1.0, 1.0)').

ALL BUT ? attributelist specifications

are not supported. I dislike the implicitness that comes about with such constructs rather seriously. To pursue my point : If at some time t1, the heading of some relvar includes the attributes A and B, then a projection on "ALL BUT B" of that relvar has a heading that includes only A. Now, some people would argue that such a specification is perfectly acceptable if you "know" that "ALL BUT B" really is what you need. That means in fact that at time t1, such a developer is saying "No matter what attributes are added at some later time t2 to the relvar, I will always want those additional attributes to appear in this projection.". How can anyone actually know that upfront ? My answer is : it's impossible, because it's the future. So, imo, and perhaps contrary to common and popular belief, using a construct such as "ALL BUT ?" actually relieves no one from doing at least some kind of "impact analysis" in the event of a relvar being changed, and is actually of far less benefit than it appears to be at first sight.

Aspects of TTM that are not yet implemented.

The inheritance model

TTM's inheritance model is currently completely unimplemented in SIRA_PRISE. There is only one tiny exception in that SIRA_PRISE allows to define "constrained types" : types that are defined by declaring an existing type as the "base type", and by providing a boolean expression that determines whether or not any particular value of the base type is also a value of the constrained type. Features such as multiple inheritance (multiple base types), automatic operator inheritance, declaring a constrained type with another constrained type as its base type, .. are all still unsupported. I may one day decide to "have a go at the IM", but this is unlikely to occur soon.

SUMMARIZE ? PER ? and DIVIDE ? PER ? BY ? PER ?

Both of these operators are not supported by SIRA_PRISE. But please note that their simpler versions (SUMMARIZEBY and the small divide), are. The reason for not supporting the PER ? construct is twofold : there is rumour of these operators going to be replaced by an extension to EXTEND, and the additional complexities brought about by PER ? currently scare me off "attacking the problem".

UPDATE (the "substitute" read-only operator, not the assignment operator, of course), WRAP, UNWRAP and COMPOSE aren't supported either. As a consequence of leaving WRAP unsupported, the specification for TCLOSE had to be slightly adapted, too, to cater for relations of degree >2. Such relations are needed when one wants to manage bill-of-material structures between items that are identified in the database by a composite key. The version of the TCLOSE operator presented in "Databases, Types and the Relational Model" would require the user to first WRAP the respective attributes in a TUPLE value to make the relation binary.

Virtual relvars

Virtual Relvars are supported, but they are not updatable. The common sense in Darwen's objections still bothers me too much to "bite the bullet" and (try and) implement Date's ideas. Like Darwen, I dislike the ambiguity of insert-through-union and delete-through-join, and the way Date wants to deal with it. I dislike Date's dogma that "being a variable implies being assignable to". Like Date, I dislike Darwen's idea that some relvars might be "insertable-to, but not deletable-from" (or vice-versa). I certainly dislike the consequences that such a position might have for UPDATE, which is nothing more than a combination of delete-and-insert. With things standing as they do, to me the most elegant way out seems to be to just plain accept the fact that read-only variables can be, and indeed are, a fact of life.

WITH constructs.

SIRA_PRISE currently does not support the use of WITH constructs in expressions.

THE_ pseudovariables

Assignments to THE_ pseudovariables (in the expressions in an UPDATE assignment) are currently not supported. THE_ constructs are read-only operators exclusively, in SIRA_PRISE.

Usage of relation-typed values and relational operators.

The usage of relation-typed values is limited. Relation-typed possrep components of scalar values are unsupported, relation-typed arguments to scalar operators are unsupported, and relation-typed attributes in a base relvar are unsupported too. All this is simply a consequence of SIRA_PRISE's primary focus on being a Storage Engine, of course.

This means in particular that operators that take relations as any argument, but return some scalar, such as <relation> SUBSETOF <relation>, <tuple> IN <relation>, ISEMPTY (<relation>), COUNT (<relation>), ? are unsupported. Bypasses can be found in all cases, of course, but they do involve client-side programming, which is likely to be experienced as annoying.

Recursion

Recursion as such is not supported, but TCLOSE is supported as a built-in relational operator. Other than that, and given that SIRA_PRISE deliberately does not deal with programming per se (such as when providing the implementation for some user-defined operator), I think this is not really a big issue.

Database constraints

must all be defined as an (implicit) invocation of ISEMPTY (<relational expression>). That is, the user specifies only the <relational expression>, and SIRA_PRISE treats this expression as one whose value should be equal to the empty relation after every update. Database constraints cannot be defined using expressions like, e.g., COUNT(relvar) = COUNT(relvar {key attributes}).

Multiple Assignment

SIRA_PRISE has supported multiple assignment since day 1, albeit with a number of caveats. These caveats have been gradually eliminated from the product, with only one problem left to be solved before achieving full TTM compliance. This final problem is when within a multiple assignment, two (or more) assignments are done to the same relvar, and the second (or third, or ...) has a reference to that very same target relvar in its source (right-hand) side.

RM Very Strong Suggestions 1 - 8 and OO Very Strong Suggestions 1 - 3

It should not come as a surprise that the primary focus in developing SIRA_PRISE is on the prescriptions of TTM, and that therefore none of the features listed under the title "Very Strong Suggestion" are supported by SIRA_PRISE, with the notable exception of RM VSS 3, which says that a system should be able to apply rules of key inference to determine what the keys are for any relational expression. Such a feature is beneficial because it allows an implementation to decide that the duplicate elimination step, when computing projections of such expressions, can be skipped entirely.

It must also be noted that any system that takes on the data sublanguage approach, as SIRA_PRISE does, violates OO VSS 3 almost by definition.

Unimplemented TTM aspects (or TTM aspects implemented differently) that relate to SIRA_PRISE's data sublanguage approach.

User-definable types and operators

Given that SIRA_PRISE does not attempt to provide a computationally complete language, there is no way to let the user define new operators using only SIRA_PRISE syntax. User-definable operators are supported, but the user must provide the implementation of the operator by writing a Java class.

Tuple types

Tuple types as such are not supported. Nor are tuple-typed variables and/or tuple-typed operators. Only within the context of relation value selectors does SIRA_PRISE support a construct that is more or less similar to the tuple value selector construct required by TTM. Given that SIRA_PRISE produces only relations in response to queries, and takes only relations or relational expressions as its input when processing assignment commands, this shouldn't pose any problem.

Typical programming language constructs

such as arrays, scalar variables, scalar update operators, ? are not supported. There simply isn't any need for them, given that SIRA_PRISE deals exclusively with relations.

Implemented TTM aspects that have given rise to significant implementation difficulties

Database Constraints

SIRA_PRISE supports the definition of database constraints of arbitrary complexity (with the exception of only a few very exotic cases), with the constraint checking algorithm not becoming prohibitive to using that feature because of performance reasons. I think it will not come as a surprise if I say that this part of the code was particularly tricky to get right (even to such extent that I'm not very anxious to elaborate and implement a possible improvement I've identified meanwhile).

The most important class of unsupported "exotic cases" are database constraints that involve a RESTRICT, EXTEND or SUMMARIZE operator invocation that contains a reference to any database relvar anywhere in its boolean, extend or summary expressions.

Imagine, e.g., a database with a CUSTOMER relvar that has a redundant "number of outstanding orders" attribute, which must at all times be equal to the number of outstanding orders effectively present for this customer in the ORDERS relvar. Such constraints can indeed be enforced with SIRA_PRISE, but the most obvious and natural way to write them down (CUSTOMERS WHERE outstandingordercount <> ?) cannot be used to enforce such constraints, because then ORDERS must be referenced inside a WHERE clause. In SIRA_PRISE, the expression must be rewritten along the following lines :

((SUMMARIZE ORDERS ? ) JOIN CUSTOMERS) WHERE outstandingordercount <> ?

You might be tempted to think that such cases are in fact nowhere near "exotic", but note that the only cases that are truly unsupported by SIRA_PRISE are those where such a rewrite is impossible (or infeasible in practice). I believe that such situations will indeed be very rare.

Multiple Assignment

As already indicated, earlier versions of SIRA_PRISE have all had some kind of support for multiple assignment, with the level of compliance gradually increasing, but the ultimate goal of full compliance is still not quite yet achieved. Which should suffice as an indication that implementing multiple assignment has indeed been reasonably troublesome.

Additional goodies not in TTM

TRANSFORM Operator

SIRA_PRISE supports a TRANSFORM operator, using which it becomes possible to actually bundle the functionality of both EXTEND and PROJECT in one operator invocation. One 'sequence' of invocations that gets used quite heavily with the relational operators as described in TTM, is to extend a relation doing some computation(s) on some attributes of that relation, and "storing" the results in the "new" attributes, after which the "original" attributes are projected away. SIRA_PRISE's TRANSFORM operator allows to do both of these in one single invocation, thus providing a shorthand that seems to me to be more than just "nice-to-have".

Assignment constraints

SIRA_PRISE offers support for a family of constraints that are not explicitly mentioned as such in TTM, and which I have baptised "assignment constraints". The aspect in which these differ from "regular" database constraints is that database constraints constrain the database value itself, i.e. defines rules that must be satisfied by the database value, whereas assignment constraints constrain the actual operations by which a database transitions from one (database) value to another.

Some may think this is the very same concept as what the literature often refers to as "transition constraints", but I'm not certain that the ideas covered/implied by that term are 100% exactly the same as what is implemented in SIRA_PRISE, so I like to stick by my own term here. My understanding of "transition constraints" is that they involve exactly two database states, the pre-update one and the post-update one (and nothing more than that). Transition constraints can then be seen as a generalization of the concept of database constraints, in that database constraints are just transition constraints in which the pre-update database state does not matter, does not influence the outcome.

However, if one goes with the strict interpretation that transition constraints depend exclusively on said two distinct database states, and nothing else, then there are still some classes of business rules left that are not covered by them. One class are the security rules, which require knowledge about the identity of the user owning the transaction/statement that is attempting to update the database. That information is not part of the database itself, and thus cannot be referenced in expressions that are limited to database states only. Another class of business rules that are not expressible under this restriction is business rules that involve references to the current system date. Allthough this position might be a bit more debatable, the current system date/time is not usually considered as being an integral part of the database itself.

At any rate, SIRA_PRISE's concept of assignment constraints allows the database designer to define, declaratively (i.e. without any programming), rules of just any of the various kinds of constraint mentioned:

- Users of department X cannot update relvar Y.

- Credit above amount X can only be granted by user Y.

- Relvar X cannot be updated before/after date Y.

- No staff member of the HR department can update his own personnel records.

- Relvar X can only be updated if the UPDATABLE relvar has value TABLE_DEE.

- etc. etc.

Explicit naming of the target relvar in the DELETE statement

D's DELETE statement as described in TTM, is essentially 'DELETE R [WHERE ?]', identifying at the same time the relvar in which the deletion is to be done, as well as the tuples to be deleted. But the only way D allows to specify those tuples is via a restriction on the target relvar. SIRA_PRISE syntax explicitly separates the two, allowing to specify the tuples to be deleted by form of, e.g., a relation value selector. This might be helpful when doing delete operations that incorporate the concept of optimistic locking.

To conclude

From doing this project, I have learnt a couple of things that I would like to present as a conclusion to this small chapter :

- Contrary to common and popular belief, the relational model of data is indeed implementable. In full, without perversions or ugly compromises, and without any additions needed to suit users' data management needs.

- This holds in particular for the definition of database constraints of arbitrary complexity, which will allow DBMS users to effectively delegate to the DBMS the enforcement of every single business rule they have, without having to write down one single byte of (imperative) program code to achieve such enforcement.

- Management of temporal data (aka "historical data") is doable, and this without running into the "ugliness" of typical SQL-based systems trying to do exactly that (the most devastating characteristic of such solutions probably being the sheer textual length of the query expressions needed to get something out of the database, as well as of the constraint expressions viz. program code needed to prevent nonsense from getting into the database).

- Vertical decomposition in the logical design of the database is perfectly doable, provided your DBMS offers sufficient options in the area of the physical design of the database to design away (at the physical level where it belongs) any possible devastating effects on the performance of the system. It should not come as a surprise that the second and third item in this list identify precisely the areas of data management where I believe SIRA_PRISE truly excels, and where the great potential lies for this system, or any other one that resembles it.

_____________________________________________

[1] If you think I must be quite some arrogant bastard to think to do better than what entire armies of programmers at the best-paying companies have been able to achieve, I can only respond with two observations : (a) those armies started developing more than 30 years ago, and have missed on all the additional knowledge and insight that has surfaced during the last decades (or they were not at freedom to implement new ideas because they were bound by backward compatibility issues and such), and (b) there is some truth too in the dictum that "one man can do in one day what two men can do in two" (and the count doesn't stop there).

[2] SIRA_PRISE requires aggregation operators to be associative. But AVG() is not an associative operator on any ordinary number type ? Instead, it is an associative operator on some type that includes both a rational number plus an "observation count" in its values. The details are left as an exercise for the reader to investigate. Even more unfortunately, cases such as MEDIAN() pose an optimization problem when they need to be fitted within SIRA_PRISE's approach for aggregations.

TclRAL and TTM

from Andrew Mangogna

TclRAL is an extension to the Tcl programming language. Like most conventional programming languages, Tcl does not provide for any transparent persistence of its variables beyond the run time of the program. TclRAL is not a DBMS, however it borrows heavily from the ideas of TTM and strives to incorporate the principles of TTM in a form convenient for the Tcl language.

From a programmer's point of view, Tcl is a command language and values in Tcl are strings. Higher order data types appear as specially formatted strings. The everything is a string mantra is fundamental to Tcl and great strides are taken to preserve that view to the programmer. Internally, things are quite a bit more complicated as strings are held in other for ms that are more efficient for computation. Type is evaluated only in the context of a command and if, for example, an integer is required in a given command context, then the particular command argument must be a string that can be interpreted as an integer.

TclRAL extends Tcl by providing two new types, Tuple and Relation, and a large number of commands to manipulate values of those types. Like all Tcl values there is a string representation for Tuples and Relations and they may be stored in any ordinary Tcl variable. Relation and Tuple valued attributes are supported and that support posed considerable difficulty in Tcl as its internal typing system does not support any parameterization that would have been convenient for type generators.

TclRAL provides commands for all the algebraic operators defined in Tutorial D and commands that make interfacing with other Tcl types more convenient. The implementation of most of the algebraic operators follows directly from their definitions, but some operators such as transitive closure use algorithms that are well known if not obvious from the definition of the operation. The closure property of relational algebraic operators fits neatly with the command substitution and nesting capabilities of Tcl, allowing arbitrarily complex expressions to be built. Hashing techniques are used extensively in the implementation of TclRAL to improve the efficiency of many of the operator commands. In keeping with TTM, there is no inherent ordering of tuples or attributes at the programming level. There is no NULL value in Tcl nor does TclRAL provide one. Despite several requests and some discussion, TclRAL holds with TTM that NULL values are simply wrong on so many levels.

Relvars in TclRAL

The distinction between values and variables is very important in TTM. TclRAL supports a separate, shadow variable system for relvars. Relation values may be stored in ordinary Tcl variables. In this usage, ordinary Tcl variables act much as named relations in a with clause. Relation variables supported by TclRAL have the same naming conventions as ordinary Tcl variables, but store only relation values. There are distinct commands that operate on relvars making clear those commands that read their arguments and generate relation values as their result and those that operate in situ on relation values stored in a relvar. Relvars form the basis of how referential integrity constraints are enforced and provide a primitive means of serializing and restoring a data set to persistent storage.

Constraints

TclRAL supports type, identification, and referential integrity constraints in a declarative manner. TclRAL insists that the strings used as values for relation and tuple attributes be coercible to the declared type of the attribute. Attr ibutes are declared to be of known Tcl types. No user defined types are supported. Each relation must have at least one candidate key and that key must be a minimal one (TclRAL documentation refers to candidate keys as identifiers ).

Referential integrity constraints are the most complicated part of the TclRAL implementation. TclRAL supports three types of declarative referential integrity constraints in addition to procedural constraints that are analogous to stored procedures (in TclRAL terms, relvar tracing is used execute arbitrary code associated with operations on a particular relvar. Variable tracing is a long supported idiom of the language). The declarative constraints provide a means of encoding certain semantic relationships between relvars. The associative constraint specifies attributes in one relvar that refer to candidate keys in another relvar along with the cardinality of such references. This is the conventional foreign key referential integrity constraint. The correlation constraint specifies a third relvar that holds referring attributes to two other relvars (which are not necessarily distinct). This type of constraint is required to model a many-to-many relationship or when a relationship itself has additional descriptive attributes. The partition constraint enforces a complete, disjoint set partitioning of one relvar called the super-type into a set of relvars called the sub-types. In a partition constraint, attributes in a sub-type relvar refer to a candidate key in exactly one tuple in the super-type relvar and that each tuple in the super-type relvar is referenced by exactly one tuple from among all the sub-type relvars.

The vast majority of the semantic constraints on relvars can be implemented with these three types of declarative constraints. TclRAL borrows the specifics of these ideas less from TTM than from a formal software development methodology known as Executable UML (previously known as the Shlaer-Mellor method).

TclRAL provides for relvar transactions with the ability to execute recursively an arbitrary script with the integrity constraints evaluated at the end. This is in keeping with Tcl conventions of recursive script evaluation. The transactions may be nested to an arbitrary depth. Violation of the referential integrity constraints results in returning the relvar values to that which existed before the transaction start. TclRAL does provide a relvar assignment operation, but does not have any concept of multiple assignment. The transaction capability does allow for achieving the desired result of making a set of assignments that taken individually would result in violating the integrity constraints.

Unimplemented Aspects of TTM

TclRAL does not provide any transparent persistence for relvars. It does provide for serialization and restoring of relvars from external persistent storage. This is acceptable for many problems, but is wholly inadequate as a DBMS.

TclRAL does not support the notion of virtual relvars. An earlier version of TclRAL did support a very limited read only for m of virtual relvars, but this was removed until virtual relvar updating was better understood by the author and virtual relvars could be implemented in a for m closer to that prescribed by TTM. Virtual relvars remain an outstanding need of TclRAL.

TclRAL does not support any notion of type inheritance. One attempt to provide more conventional object-oriented programming ideas that are relationally based is another Tcl extension based on TclRAL called, raloo.

In summary, TclRAL brings many of the ideas of TTM into a convention programming language environment. Several years of experience demonstrate that a well-for med relational algebra provides a very powerful way to organize data and express computations.