Modelling data with relations (by alaric)
An example relation-based system: Prolog
Perhaps the earliest successful attempt to build a fully relation-based system
was Prolog, which dates back to around 1972. The notation I've been using to
write relation-based examples so far is very close to Prolog; I have only really
changed it to make it slightly more "obvious" for somebody to read without
training. Prolog uses a few more symbols like :- and , where I've written
words like if and and, and marks variables in rules by giving them a capital
first letter, as opposed to object IDs, which have a lower-case first letter. IDs
in Prolog are names, much as in my examples. However, while I have tried to give
my examples deliberately meaningless IDs such as id12, to distinguish the
spelling of an object's ID from any actual "name" attached to it with a name
relation, Prolog tradition is to give things meaningful names as symbols. Some
of our examples from above might look like this in Prolog:
type(xmas_day, appointment_type).
name(xmas_day, "Christmas day!!!").
timestamp(xmas_day, date(Year,12,25)).
ancestor(A, P) :- parent(A, P).
ancestor(A, P) :-
    parent(I, P),
    ancestor(A, I).
However, Prolog has a precise specification of how queries are answered: it
searches for statements matching the query, and if a rule is found, it
recursively queries all the parts of the rule's body in top-down order. This
means that it's possible to write Prolog rules that will repeat forever; if we
had written ancestor like so:
ancestor(A, P) :-
    ancestor(A, I),
    parent(I, P).
ancestor(A, P) :- parent(A, P).
...then an attempt to find the ancestors of a person by asking the system for:
ancestor(X, alaric).
...will first match the first ancestor rule, the first part of the body of which will become:
ancestor(X, I)
...which will match the first ancestor rule again, causing the system to repeat the same series of steps.
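The looping behaviour is easy to reproduce outside Prolog. Here's a toy Python sketch of the top-down, depth-first strategy, specialised to this one example with made-up parent facts (a real Prolog engine does unification and backtracking, which this skips): the version that consults parent first terminates, while the left-recursive version expands the same goal forever.

```python
# Toy model of Prolog's top-down, depth-first query strategy, specialised
# to the parent/ancestor example. Hypothetical data; not a real Prolog engine.
PARENTS = {("jean", "alaric"), ("alaric", "bob")}  # (parent, child) pairs

def ancestors_safe(person):
    """ancestor(A,P) :- parent(A,P).  ancestor(A,P) :- parent(I,P), ancestor(A,I).
    Consulting parent/2 first 'follows the data', so the recursion bottoms out."""
    found = set()
    for (a, p) in PARENTS:
        if p == person:
            found.add(a)                 # first rule: a direct parent is an ancestor
            found |= ancestors_safe(a)   # second rule: so are the parent's ancestors
    return found

def ancestors_looping(person, depth=0):
    """ancestor(A,P) :- ancestor(A,I), parent(I,P).
    The first body goal is the recursive one, so a depth-first engine expands
    the same goal forever; we cut it off with a depth limit to show the loop."""
    if depth > 50:
        raise RecursionError("left recursion: same goal expanded forever")
    return ancestors_looping(person, depth + 1)  # recurse before consuming any data

print(sorted(ancestors_safe("bob")))  # ['alaric', 'jean']
try:
    ancestors_looping("bob")
except RecursionError as exc:
    print("looping version:", exc)
```

The only difference between the two functions is which goal gets expanded first, which is exactly the difference between the two orderings of the ancestor rule above.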
The fact that Prolog defines an explicit order of steps means it's also possible to use it to write programs in the normal sense, programs that "do things". The Prolog language specifies mechanisms for file I/O, and extensions to it allow for all the usual stuff you'd expect in an imperative programming language, like creating graphical user interfaces and network servers, by providing relations that have a "side-effect" when queried and allowing the programmer to write a rule that queries those relations in some order. It also allows programmers to affect the performance of their rules: even when infinite loops aren't a danger, the programmer can order rules so that the system tries simpler, or more likely to succeed, ways of finding an answer before approaches with a worse predicted cost/benefit ratio.
This has a few implications. It makes the implementation simpler: it's up to the programmer to order rules for performance and to avoid infinite loops, so the implementation just has to do as it's told. However, it also obliges the programmer to be aware of how the system will execute their queries, and to shoulder the responsibility of ordering them correctly and efficiently. It makes it possible to express sequences of side-effecting operations, so Prolog can be used for things other than just querying relation-based data; but I'd argue that other languages are better suited to that, and indeed later in this article I will propose a model for co-existence between a relation-based query language and other, more conventional, programming languages.
The fact that Prolog is a full-featured programming language in its own right means you can't just go running untrusted Prolog code. It might deliberately loop forever to waste your resources and keep you waiting for an answer that's never coming; it might delete all your files. I think Prolog is fantastic, and it's been the basis of so much research that we have a lot to thank it for in our understanding of relation-based models. But I propose that, going forward, the future is a language focused on querying relation-based databases, with support for reordering steps automatically for performance, to avoid infinite loops, or to exploit parallelism on multi-processor systems!
An example (...the canonical) thing-based system: SQL databases
SQL is almost ubiquitous as the standard model for accessing a thing-based database and, because thing-based databases are almost ubiquitous as the standard model for a database, SQL is almost ubiquitous as the standard model for accessing ANY database.
So let's take a look at it, for comparison!
A SQL database consists of a bunch of named tables. Each table must be explicitly created, and the creation process names the fields in the table, declares what type of data each contains, and optionally indicates which ones are "primary keys": fields whose values must be unique, in that each value occurs at most once in that column, so those values can be used as IDs for the rows. You can also specify "foreign keys": fields whose values must match a value in a specific field of a specific other table, although these are often left unspecified.
Here's an example SQL program to create two tables:
CREATE TABLE people (
    id INTEGER AUTO_INCREMENT PRIMARY KEY,
    name TEXT,
    email TEXT,
    dob DATE);
CREATE TABLE pedigrees (
    father_id INTEGER,
    mother_id INTEGER,
    child_id INTEGER PRIMARY KEY
);
This creates a table of people, with an integer field called id that is a
primary key and has the AUTO_INCREMENT flag, something not present in standard
SQL but implemented by the MySQL product (and others inheriting from or
emulating it), which means that if a value is not provided for id when a
person record is created, the system should assign a previously unused
one. Person records also have name and email fields (which are text) and a dob
field which is a date.
It also creates a "pedigrees" table, indicating the parental relationship
between people. A pedigree record contains fields called father_id and
mother_id which are integers, and child_id which is an integer and is a
primary key, because no person may be the child in multiple pedigree records;
you only get one mother and father (in this oversimplified model of human
parenting, at least). The author, in this case, has chosen not to make it
explicit that the IDs in the pedigrees table refer to the id column in the
people table, because there's often not much benefit in making it known to the
database, and the syntax for it is rather verbose:
CREATE TABLE pedigrees (
    father_id INTEGER,
    mother_id INTEGER,
    child_id INTEGER PRIMARY KEY,
    CONSTRAINT father_is_a_person FOREIGN KEY (father_id) REFERENCES people(id),
    CONSTRAINT mother_is_a_person FOREIGN KEY (mother_id) REFERENCES people(id),
    CONSTRAINT child_is_a_person FOREIGN KEY (child_id) REFERENCES people(id)
);
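If you do spell the constraints out, though, the engine will refuse to store a pedigree that points at a nonexistent person. Here's a minimal sketch of that using Python's built-in SQLite (which needs foreign-key enforcement switched on explicitly, and uses a shorter REFERENCES syntax); the data is hypothetical.

```python
import sqlite3

# Demonstrate what declared foreign keys buy you: the engine rejects a
# pedigree row whose IDs don't match any person. Hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE pedigrees (
        father_id INTEGER REFERENCES people(id),
        mother_id INTEGER REFERENCES people(id),
        child_id INTEGER PRIMARY KEY REFERENCES people(id));
    INSERT INTO people (name) VALUES ('Alaric'), ('Sarah'), ('Jean');
""")
conn.execute("INSERT INTO pedigrees VALUES (1, 2, 3)")  # fine: all three people exist
try:
    conn.execute("INSERT INTO pedigrees VALUES (1, 2, 99)")  # 99 is nobody
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```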
Given that, one could create some person records:
INSERT INTO people (name,email,dob) VALUES ('Alaric','...','...');
INSERT INTO people (name,email,dob) VALUES ('Sarah','...','...');
INSERT INTO people (name,email,dob) VALUES ('Jean','...','...');
As you do that, the SQL engine will report back on AUTO_INCREMENT values it
assigned, so we will be told the numeric IDs of the three people created, and
therefore we could create a pedigree record linking them by ID.
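That round trip might look like this in Python's built-in SQLite, where an INTEGER PRIMARY KEY column stands in for MySQL's AUTO_INCREMENT; emails and dates are omitted for brevity, and the choice of who parents whom is purely illustrative.

```python
import sqlite3

# Sketch of the ID round trip: insert people, collect the IDs the engine
# assigned, then use them to link the people with a pedigree record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT, dob DATE)")
conn.execute("CREATE TABLE pedigrees (father_id INTEGER, mother_id INTEGER,"
             " child_id INTEGER PRIMARY KEY)")

ids = {}
for name in ("Alaric", "Sarah", "Jean"):
    cur = conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
    ids[name] = cur.lastrowid  # the engine reports back the ID it assigned

# With the assigned IDs in hand, link the three people (hypothetical pedigree).
conn.execute("INSERT INTO pedigrees VALUES (?, ?, ?)",
             (ids["Alaric"], ids["Sarah"], ids["Jean"]))

print(ids)  # {'Alaric': 1, 'Sarah': 2, 'Jean': 3}
```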
Now, there are myriad SQL databases, so they're easily available and widely understood; and they are the product of extensive engineering, so they're full of useful features and very capable. But, as I touched on above, they have some annoying downsides compared to the relation-based model. Let's briefly see how some of those pan out in practice.
Firstly, the IDs. The system would probably have assigned AUTO_INCREMENT
values for the id field of the people table such as 1 for Alaric, 2 for Sarah
and 3 for Jean. Those values alone are just numbers - meaningless unless you
also know they correspond to the id column of the people table. Pedigrees
don't quite have an ID - we made the child ID be the primary key because we
wanted the uniqueness constraint, so the ID of a child is also inherently the ID
of their pedigree. As a consequence, any person can have either zero or one
pedigree record. But it's possible, and sometimes desirable, to create records
with no primary key at all, and therefore no way to identify particular records
in the table other than searching for them. This often comes up with tables that
really encode a relation between other things, such as one saying that somebody
likes somebody else:
CREATE TABLE likes (
    liker_id INTEGER,
    liked_id INTEGER);
This leads onto another point. If you had multiple tables in the database of
things a person might like - ice cream flavours, other people, cheeses - how
would you represent that in SQL, given that liked_id means nothing unless you
know which table to look up the ID in, as the same value could refer to
different things in different tables? There are two main options:
CREATE TABLE likes_people (
    liker_id INTEGER,
    liked_person_id INTEGER);
CREATE TABLE likes_ice_cream_flavours (
    liker_id INTEGER,
    liked_flavour_id INTEGER);
CREATE TABLE likes_cheeses (
    liker_id INTEGER,
    liked_cheese_id INTEGER);
Or:
CREATE TABLE likes (
    liker_id INTEGER,
    liked_thing_type ENUM('person','ice cream flavour','cheese'),
    liked_id INTEGER);
Both have their downsides, perhaps the most fatal of which is requiring that all
the kinds of things that can be liked be listed in one place. In a
relation-based model with unique IDs for things, by contrast, you can just say
likes(x,?THING) and then look up what type ?THING has, or just query
whether it has a name without bothering to look up the type, as you wish.
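To make that concrete, here's a toy fact store in Python with a single global ID namespace; the IDs and facts are invented for illustration.

```python
# A toy global fact store: every fact is a (relation, subject, object) triple
# and all IDs share one namespace, so a single likes relation can point at
# anything. Hypothetical IDs and data.
facts = {
    ("type", "id1", "person"),
    ("name", "id1", "Alaric"),
    ("type", "id2", "cheese"),
    ("name", "id2", "Wensleydale"),
    ("likes", "id1", "id2"),
}

def query(rel, subj=None, obj=None):
    """Return (subject, object) pairs of `rel` matching the given bindings."""
    return [(s, o) for (r, s, o) in facts
            if r == rel and subj in (None, s) and obj in (None, o)]

# likes(id1, ?THING): follow the ID, then look up the thing's type, or go
# straight to its name without caring what kind of thing it is.
for (_, thing) in query("likes", subj="id1"):
    [(_, kind)] = query("type", subj=thing)
    [(_, name)] = query("name", subj=thing)
    print(f"id1 likes {name}, which is a {kind}")
```

Because liked_id-style values here are globally unique, no enumeration of "likeable" types is ever needed.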
A third problem is schema migration. Say we want to extend our system to record the fact that a person can have multiple email addresses. We might change it to:
CREATE TABLE people (
    id INTEGER AUTO_INCREMENT PRIMARY KEY,
    name TEXT,
    dob DATE);
CREATE TABLE email_addresses (
    person_id INTEGER,
    email TEXT);
...and then need to change all our code that expects to read an email field from a person record to go and look up their email addresses in the other table instead, and deal with the fact there may be zero, one, or any other number of them.
In practice, of course, we'd want to change our existing database rather than starting again from scratch, so we'd actually do something like this:
CREATE TABLE email_addresses (
    person_id INTEGER,
    email TEXT)
AS SELECT id AS person_id, email FROM people;
ALTER TABLE people DROP COLUMN email;
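Here's that migration sketched against Python's built-in SQLite on hypothetical data; since older SQLite versions lack ALTER TABLE ... DROP COLUMN, the sketch drops the column with the portable create-copy-rename dance instead.

```python
import sqlite3

# Migration sketch: hoist emails into a one-to-many table, then rebuild
# people without the email column (create-copy-rename, since DROP COLUMN
# is a relatively recent SQLite addition). Hypothetical starting data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT, dob DATE);
    INSERT INTO people (name, email) VALUES ('Alaric', 'alaric@example.org');

    -- Step 1: copy existing emails out into the new one-to-many table.
    CREATE TABLE email_addresses AS
        SELECT id AS person_id, email FROM people WHERE email IS NOT NULL;

    -- Step 2: rebuild people without the email column.
    CREATE TABLE people_new (id INTEGER PRIMARY KEY, name TEXT, dob DATE);
    INSERT INTO people_new SELECT id, name, dob FROM people;
    DROP TABLE people;
    ALTER TABLE people_new RENAME TO people;

    -- A person may now have zero, one, or many addresses.
    INSERT INTO email_addresses VALUES (1, 'alaric@work.example.org');
""")
print(conn.execute(
    "SELECT COUNT(*) FROM email_addresses WHERE person_id = 1").fetchone()[0])  # 2
```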
In the relation-based model, you'd probably start with email being a separate
relation between a person and an email address, because that's easier to type
than a big all-details-of-a-person relation like the people table in
SQL. In SQL, by contrast, it's easier to add an email field to the people
table than to create a separate email_addresses table, so you tend to start
with an email field and then have to change it later.
Also, in the relation-based model, if you DID have a wide person-details
relation that relates a person to all their details (one part of which is an
email address) and decided to change later, you could put in rules to make the
old data available in the new format:
email(?PERSON,?ADDRESS) if person-details(?PERSON,..., ?ADDRESS, ...)
And that rule could coexist alongside separate statements of people's email
addresses added since the change. (SQL experts: I know you can do something
similar with SQL views, but you can't automatically combine a view exposing old
data with a table of new data, unless you build it by hand with a UNION ALL).
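For the curious, that hand-built UNION ALL view might look like this, sketched in Python's built-in SQLite with invented data.

```python
import sqlite3

# A view that UNION ALLs emails still held in the old wide table with rows
# added to the new table since the change, so both answer the same query.
# Hypothetical data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_people (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO old_people VALUES (1, 'Alaric', 'alaric@example.org');

    CREATE TABLE new_emails (person_id INTEGER, email TEXT);
    INSERT INTO new_emails VALUES (1, 'alaric@work.example.org');

    CREATE VIEW email AS
        SELECT id AS person_id, email FROM old_people WHERE email IS NOT NULL
        UNION ALL
        SELECT person_id, email FROM new_emails;
""")
for (addr,) in conn.execute("SELECT email FROM email WHERE person_id = 1 ORDER BY email"):
    print(addr)
```

The view plays the same role as the email(?PERSON,?ADDRESS) rule above, but the UNION ALL branch for new data had to be written in by hand.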
An example relation-based system: Datalog
Datalog is a subset of Prolog that's focused on querying relation-based
databases, giving control of the ordering of steps back to the
implementation. Fantastic! Am I happy now? No, I am still not happy, because
Datalog goes slightly too far. Datalog notionally separates the world into the
"database", which is a bunch of statements with no variables or rules, and the
"query", which is where all the rules go. It also places some limits that
disallow infinite loops, effectively stating that rules like our ancestor rule
are only allowed if they "follow the data" in a way that must end when it
eventually runs out of data. Therefore, our ancestor rule is legal no matter
what order it's written in; the Datalog system is clever enough not to get
sucked into any infinite loops. An example of a rule that would be forbidden,
though, is one that says X is a lucky number if X+1 is a lucky number: you can
add one to any number to get another number, so that would loop forever.
ancestor, by contrast, repeats itself by following links in the data of parent
relationships, so at some point it will have examined every parent relationship
in the database and have nothing left to try. Datalog also has some
restrictions compared to Prolog regarding negation (in dialects that support
negation, anyway), but I won't explain those as I've not even discussed
negation in Prolog, or the relation-based model, at all, and this article is
long enough already!
But, combined with the lack of side-effects, this means that you could quite happily take arbitrary Datalog queries sent to you by an untrusted person on the Internet, run them against your database, and even send them the results back if your database is public.
So, why aren't I happy with that? Well, I think that banning variables and rules from the database itself means you can't do many of the fun things I've described above. Rules aren't just for querying - they can be part of the data itself. I presume that distinction arose from Datalog's origins as a query language for existing thing-based relational databases, intended to replace SQL; those databases, of course, don't allow variables and rules. But I think contemporary Datalog implementations with their own internal fact stores have abandoned that distinction. So, yeah, I'm happy about those, and I think Datalog is a superior alternative to SQL in any context where it's available!
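As a final illustration, the bottom-up evaluation strategy that makes all this safe can be sketched in a few lines of Python: naive fixpoint iteration over invented parent facts (real Datalog engines use the smarter semi-naive variant, but the idea is the same).

```python
# A minimal bottom-up ("naive") Datalog evaluation of the ancestor rules.
# Instead of expanding goals top-down like Prolog, it applies the rules to
# the known facts over and over until nothing new appears, so clause order
# is irrelevant and termination follows from the data being finite.
parent = {("jean", "alaric"), ("alaric", "bob")}  # hypothetical (parent, child) facts

ancestor = set()
while True:
    new = set(parent)  # ancestor(A,P) :- parent(A,P).
    # ancestor(A,P) :- ancestor(A,I), parent(I,P).
    new |= {(a, p) for (a, i) in ancestor for (i2, p) in parent if i == i2}
    if new <= ancestor:
        break  # fixpoint reached: no rule can derive anything new
    ancestor |= new

print(sorted(ancestor))  # [('alaric', 'bob'), ('jean', 'alaric'), ('jean', 'bob')]
```

Note that the left-recursive ordering of the ancestor rule, which sent Prolog into an infinite loop earlier, is exactly the one evaluated here without any trouble.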
