Modelling data with relations

An example relation-based system: Prolog

Perhaps the earliest successful attempt to build a fully relation-based system was Prolog, which dates back to around 1972. The notation I've been using to write relation-based examples so far is very close to Prolog; I have only really changed it to make it slightly more "obvious" for somebody to read without training. Prolog uses a few more symbols like :- and , where I've written words like if and and, and marks variables in rules by giving them a capital first letter, as opposed to object IDs that have a lower-case first letter. IDs in Prolog are names, much as in my examples, although where I have tried to give my examples meaningless IDs such as id12, to distinguish the spelling of a thing's ID from an actual "name" attached to it with a name relation, Prolog tradition is to give things meaningful names as symbols. Some of our examples from above might look like this in Prolog:

type(xmas_day, appointment_type).

name(xmas_day, "Christmas day!!!").

timestamp(xmas_day, date(Year,12,25)).

ancestor(A, P) :- parent(A, P).

ancestor(A, P) :- parent(I, P), ancestor(A, I).

However, Prolog has a precise specification of how queries are answered, by searching for statements matching a query, and if a rule is found then recursively querying all the parts of the body of a rule in top-down order. This means that it's possible to write Prolog rules that will repeat forever; if we had written ancestor like so:

ancestor(A, P) :- ancestor(A, I), parent(I, P).

ancestor(A, P) :- parent(A, P).

...then an attempt to find the ancestors of a person by asking the system for:

ancestor(X, alaric).

...will first match the first ancestor rule, the first part of the body of which will become:

ancestor(X, I)

...which will match the first ancestor rule again, causing the system to repeat the same series of steps.
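The safe ordering can be sketched in ordinary code. Here's a minimal Python version of the ancestor query, using a hypothetical child-to-parents mapping as stand-in data; like the terminating rule order, it consults the parent facts first and only then recurses, so it "follows the data" and stops when the facts run out:

```python
# Hypothetical parent facts: child -> set of parents.
PARENTS = {
    "alaric": {"jean", "sarah"},
    "jean": {"mary"},
}

def ancestors(person):
    """Find all ancestors of a person by consulting the parent facts
    first and then recursing on each parent found, mirroring the
    terminating rule order."""
    found = set()
    for parent in PARENTS.get(person, set()):
        found.add(parent)           # ancestor(A, P) :- parent(A, P).
        found |= ancestors(parent)  # ancestor(A, P) :- parent(I, P), ancestor(A, I).
    return found
```

Had we instead recursed before consulting any parent fact, as the left-recursive rule ordering does, the function would call itself forever with nothing consumed, just like the trace above.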

The fact that Prolog defines an explicit order of steps means it's possible to also use it to write programs in the normal sense, that "do things". The Prolog language specifies mechanisms for file I/O, and extensions to it allow for all the usual stuff you'd expect in an imperative programming language, like creating graphical user interfaces and network servers and so on, by providing relations that have a "side-effect" when queried and allowing the programmer to write a rule that queries those relations in some order. It also allows programmers to affect the performance of their rules - even when infinite loops aren't a danger, the programmer can order rules to make the system check simpler, or more likely to succeed, ways to find an answer before approaches with a worse predicted cost/benefit ratio.

This has a few implications. It makes the implementation simpler; it's up to the programmer to order rules for performance and to avoid infinite loops, so the implementation just has to do as it's told. However, it also creates an obligation on the programmer to be aware of how the system will execute their queries, and to bear the burden of being responsible for ordering them correctly and optimally. It makes it possible to express sequences of side-effecting operations to use Prolog to do things other than just querying relation-based data, but I'd argue that other languages are better suited to that, and indeed later in this article I will propose a model for co-existence between a relation-based query language and other, more conventional, programming languages.

The fact that Prolog is a full-featured programming language in its own right means you can't just go running untrusted Prolog code. It might deliberately loop forever to waste your resources and keep you waiting for an answer that's never coming; it might delete all your files. I think Prolog is fantastic, and it's been the basis of so much research that we have a lot to thank it for in our understanding of relation-based models, but I propose that, going forward, a language focused on querying relation-based databases, with support for reordering steps automatically for performance and to avoid infinite loops (or to exploit parallelism on multi-processor systems!) is the future.

An example (...the canonical) thing-based system: SQL databases

SQL is almost ubiquitous as the standard model for accessing a thing-based database, and because thing-based databases are almost ubiquitous as the standard model for a database, SQL is almost ubiquitous as the standard model for accessing ANY database.

So let's take a look at it, for comparison!

A SQL database consists of a bunch of named tables. Each table must be explicitly created, and the creation process names a bunch of fields in the table, and gives them a type of data they contain, and optionally also indicates which ones are "primary keys": fields whose values must be unique, in that each value occurs at most once in that column. These values can be used as IDs for those rows. You can also specify "foreign keys", which are fields whose values must match a value in a specific field of a specific other table, although these are often left unspecified.

Here's an example SQL program to create two tables:

CREATE TABLE people ( id INTEGER AUTO_INCREMENT PRIMARY KEY, name TEXT, email TEXT, dob DATE);

CREATE TABLE pedigrees ( father_id INTEGER, mother_id INTEGER, child_id INTEGER PRIMARY KEY );

This creates a table of people, with an integer field called id that is a primary key and has the AUTO_INCREMENT flag, something not present in standard SQL but implemented by the MySQL product (and others inheriting from or emulating it), which means that if a value is not provided for id when a person record is created, the system should assign a previously unused one. Person records also have name and email fields (which are text) and a dob field which is a date.

It also creates a "pedigrees" table, indicating the parental relationship between people. A pedigree record contains fields called father_id and mother_id which are integers, and child_id which is an integer and is a primary key, because no person may be the child in multiple pedigree records; you only get one mother and father (in this oversimplified model of human parenting, at least). The author, in this case, has chosen not to make it explicit that the IDs in the pedigrees table refer to the id column in the people table, because there's often not much benefit in making it known to the database, and the syntax for it is rather verbose:

CREATE TABLE pedigrees ( father_id INTEGER, mother_id INTEGER, child_id INTEGER PRIMARY KEY, CONSTRAINT father_is_a_person FOREIGN KEY (father_id) REFERENCES people(id), CONSTRAINT mother_is_a_person FOREIGN KEY (mother_id) REFERENCES people(id), CONSTRAINT child_is_a_person FOREIGN KEY (child_id) REFERENCES people(id) );
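Declaring the foreign keys does buy you something, though: the database will reject dangling references. Here's a small sketch of that using Python's built-in sqlite3 module as a stand-in engine (note that SQLite leaves foreign-key enforcement off by default, and its inline REFERENCES syntax is a shorthand for the verbose CONSTRAINT form above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite has FK checks off by default
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE pedigrees (
    father_id INTEGER REFERENCES people(id),
    mother_id INTEGER REFERENCES people(id),
    child_id  INTEGER PRIMARY KEY REFERENCES people(id))""")
conn.execute("INSERT INTO people (id, name) VALUES (1, 'Alaric')")

# A pedigree naming person IDs that don't exist is now rejected,
# rather than silently stored as dangling references.
rejected = False
try:
    conn.execute("INSERT INTO pedigrees VALUES (98, 99, 1)")
except sqlite3.IntegrityError:
    rejected = True
```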

Given that, one could create some person records:

INSERT INTO people (name,email,dob) VALUES ('Alaric','...','...');

INSERT INTO people (name,email,dob) VALUES ('Sarah','...','...');

INSERT INTO people (name,email,dob) VALUES ('Jean','...','...');

As you do that, the SQL engine will report back on AUTO_INCREMENT values it assigned, so we will be told the numeric IDs of the three people created, and therefore we could create a pedigree record linking them by ID.
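This round trip can be sketched with Python's sqlite3 module, where INTEGER PRIMARY KEY auto-assigns an unused row ID when none is supplied, playing the role MySQL's AUTO_INCREMENT plays above, and the cursor reports back the ID it assigned (the family linkage here is, of course, hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY in SQLite assigns a fresh row ID when no value
# is supplied, standing in for MySQL's AUTO_INCREMENT.
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT, dob TEXT)")
conn.execute(
    "CREATE TABLE pedigrees (father_id INTEGER, mother_id INTEGER,"
    " child_id INTEGER PRIMARY KEY)")

ids = {}
for name in ("Alaric", "Sarah", "Jean"):
    cur = conn.execute(
        "INSERT INTO people (name, email, dob) VALUES (?, '...', '...')", (name,))
    ids[name] = cur.lastrowid  # the engine reports back the ID it assigned

# ...and the reported IDs let us create a pedigree record linking them:
conn.execute("INSERT INTO pedigrees VALUES (?, ?, ?)",
             (ids["Alaric"], ids["Sarah"], ids["Jean"]))
```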

Now, there are myriad SQL databases, so they're easily available and widely understood; and they are the product of extensive engineering, so are full of useful features and very capable. But, as I mentioned briefly above, they have some annoying downsides compared to the relation-based model. Let's briefly see how some of those pan out in practice.

Firstly, the IDs. The system would probably have assigned AUTO_INCREMENT values for the id field of the people table such as 1 for Alaric, 2 for Sarah and 3 for Jean. Those values alone are just numbers - meaningless unless you also know they correspond to the id column of the people table. Pedigrees don't quite have an ID - we made the child ID be the primary key because we wanted the uniqueness constraint, so the ID of a child is also inherently the ID of their pedigree. As a consequence, any person can have either zero or one pedigree record. But it's possible, and sometimes desirable, to create records with no primary key at all, and therefore no way to identify particular records in the table other than searching for them. This often comes up with tables that really encode a relation between other things, such as one saying that somebody likes somebody else:

CREATE TABLE likes ( liker_id INTEGER, liked_id INTEGER);

This leads onto another point. If you had multiple tables in the database of things a person might like - ice cream flavours, other people, cheeses - how would you represent that in SQL, given that liked_id means nothing unless you know which table to look up the ID in, as the same value could refer to different things in different tables? There are two main options:

CREATE TABLE likes_people ( liker_id INTEGER, liked_person_id INTEGER);

CREATE TABLE likes_ice_cream_flavours ( liker_id INTEGER, liked_flavour_id INTEGER);

CREATE TABLE likes_cheeses ( liker_id INTEGER, liked_cheese_id INTEGER);

Or:

CREATE TABLE likes ( liker_id INTEGER, liked_thing_type ENUM('person','ice cream flavour','cheese'), liked_id INTEGER);

Both have their downsides, perhaps the most fatal of which is requiring that all the kinds of things that can be liked be listed in one place; while in a relation-based model with unique IDs for things, you can just say likes(x,?THING) and be able to look up what type ?THING has, or just query whether it has a name without bothering to look up the type, as you wish.
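To make the second option concrete, here's a sketch in Python's sqlite3 module; SQLite has no ENUM type, so a CHECK constraint stands in for the discriminator column, with the same fatal property that every likeable kind of thing must be listed up front:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A CHECK constraint plays the role of the ENUM discriminator above.
conn.execute("""CREATE TABLE likes (
    liker_id INTEGER,
    liked_thing_type TEXT CHECK (liked_thing_type IN
        ('person', 'ice cream flavour', 'cheese')),
    liked_id INTEGER)""")
conn.execute("INSERT INTO likes VALUES (1, 'cheese', 7)")  # fine

rejected = False
try:
    # Liking any kind of thing not listed up front is unrepresentable:
    conn.execute("INSERT INTO likes VALUES (1, 'planet', 3)")
except sqlite3.IntegrityError:
    rejected = True
```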

A third problem is schema migration. Say we want to extend our system to record the fact that a person can have multiple email addresses. We might change it to:

CREATE TABLE people ( id INTEGER AUTO_INCREMENT PRIMARY KEY, name TEXT, dob DATE); CREATE TABLE email_addresses ( person_id INTEGER, email TEXT);

...and then need to change all our code that expects to read an email field from a person record to go and look up their email addresses in the other table instead, and deal with the fact there may be zero, one, or any other number of them.

In practice, of course, we'd want to change our existing database rather than starting again from scratch, so we'd actually do something like this:

CREATE TABLE email_addresses ( person_id INTEGER, email TEXT) AS SELECT id AS person_id, email FROM people; ALTER TABLE people DROP COLUMN email;
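The same migration can be walked through with Python's sqlite3 module; SQLite's CREATE TABLE ... AS SELECT doesn't accept explicit column definitions, so this sketch creates the table and copies the rows in two steps (and DROP COLUMN only arrived in SQLite 3.35, hence the version guard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO people (name, email, dob) VALUES (?, ?, '...')",
    [("Alaric", "alaric@example.org"), ("Sarah", "sarah@example.org")])

# Migrate: copy the email column out into its own table...
conn.execute("CREATE TABLE email_addresses (person_id INTEGER, email TEXT)")
conn.execute("INSERT INTO email_addresses SELECT id, email FROM people")
# ...then drop it from people (ALTER TABLE ... DROP COLUMN needs SQLite >= 3.35):
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE people DROP COLUMN email")

rows = conn.execute(
    "SELECT person_id, email FROM email_addresses ORDER BY person_id").fetchall()
```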

In the relation-based model, you'd probably start with email being a separate relation between a person and an email address, because that's easier to type than a big all-details-of-a-person relation like the people table in SQL. While in SQL, it's easier to add an email field to the people table than to create a separate email_addresses table, so you tend to start with an email field and then have to change it later.

Also, in the relation-based model, if you DID have a wide person-details relation that relates a person to all their details (one part of which is an email address) and decided to change later, you could put in rules to make the old data available in the new format:

email(?PERSON,?ADDRESS) if person-details(?PERSON,..., ?ADDRESS, ...)

And that rule could coexist alongside separate statements of people's email addresses added since the change. (SQL experts: I know you can do something similar with SQL views, but you can't automatically combine a view exposing old data with a table of new data, unless you build it by hand with a UNION ALL).
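For the SQL experts following along, here's what that hand-built UNION ALL view looks like, sketched with Python's sqlite3 module and hypothetical table names: a view stitching a wide old-format table together with a new-format email table, so both generations of data answer the same query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Old-format data: a wide people table still carrying an email column.
conn.execute(
    "CREATE TABLE people_old (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO people_old VALUES (1, 'Alaric', 'old@example.org')")
# New-format data: emails recorded in their own table since the change.
conn.execute("CREATE TABLE email_addresses (person_id INTEGER, email TEXT)")
conn.execute("INSERT INTO email_addresses VALUES (2, 'new@example.org')")

# The hand-built view exposing both formats as one relation:
conn.execute("""CREATE VIEW email AS
    SELECT id AS person_id, email FROM people_old
    UNION ALL
    SELECT person_id, email FROM email_addresses""")
rows = conn.execute(
    "SELECT person_id, email FROM email ORDER BY person_id").fetchall()
```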

An example relation-based system: Datalog

Datalog is a subset of Prolog that's focused on querying relation-based databases, giving control of the ordering of steps back to the implementation. Fantastic! Am I happy now? No, I am still not happy, because Datalog goes slightly too far. Datalog notionally separates the world into the "database", which is a bunch of statements with no variables or rules, and the "query", which is where all the rules go. It also places some limits that disallow infinite loops, effectively stating that rules like our ancestor rule are only allowed if they "follow the data" in a way that must end when it eventually runs out of data. Therefore, our ancestor rule is legal no matter what order it's written in; the Datalog system is clever enough to not get sucked into any infinite loops. But an example of a rule that would be forbidden is one that says X is a lucky number if X+1 is a lucky number - you can add one to any number to get another number so that would loop forever; while ancestor repeats itself by following links in the data of parent relationships, and at some point it will have examined every parent relationship in the database and have nothing else to try. Datalog also has some restrictions compared to Prolog regarding negation (in dialects that support negation, anyway) but I won't explain that as I've not even discussed negation in Prolog, or the relation-based model, at all, and this article is long enough already!
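The way a Datalog-style system sidesteps the rule-ordering trap can be sketched as a naive bottom-up evaluation in Python: start from the parent facts and keep applying the rules until a pass derives nothing new. Note that this happily uses the "dangerous" left-recursive rule ordering from earlier; because each pass only combines facts already derived from the finite parent relation, it must reach a fixpoint and stop. (The facts are hypothetical sample data.)

```python
# Hypothetical parent facts, as (parent, child) pairs.
PARENT = {("jean", "alaric"), ("mary", "jean")}

def ancestors_fixpoint(parent_facts):
    """Naive bottom-up evaluation: derive all ancestor facts by
    repeated application of the rules until nothing new appears."""
    ancestor = set(parent_facts)  # ancestor(A, P) :- parent(A, P).
    while True:
        # ancestor(A, P) :- ancestor(A, I), parent(I, P).
        derived = {(a, p)
                   for (a, i) in ancestor
                   for (j, p) in parent_facts
                   if i == j}
        new = derived - ancestor
        if not new:
            return ancestor  # fixpoint: a full pass added nothing, so stop
        ancestor |= new
```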

But, combined with the lack of side-effects, this means that you could quite happily run arbitrary Datalog queries that an untrusted person on the Internet sends you against your database, and even send them the results back if your database is public.

So, why aren't I happy with that? Well, I think that banning variables and rules from the database itself means you can't do many of the fun things I've described above. Rules aren't just for querying - they can be part of the data itself. I presume that distinction arose from a background in Datalog being developed as a query language for existing thing-based relational databases, to replace SQL, and those databases of course don't allow variables and rules; but I think contemporary Datalog implementations with their own internal fact stores have abandoned that distinction. So, yeah, I'm happy about those, and I think Datalog is a superior alternative to SQL in any context where it's available!

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales