Modelling data with relations (by alaric)
Cool things you can do with relation-based models
I've already spoken about how the relation-based model makes it easy to represent objects having multiple types, which maps well to how things work in the real world. Thing-based databases struggle with that; programmers have become so accustomed to splitting out different optional aspects of different kinds of objects by hand, and soaking up the complexity in their software. Tools called "object-relational mappers" attempt to automate the process (among other things), but they can only help so far, and they're infamous for creating their own complications.
Merging databases
A related property is that it's easier to merge databases in the relation-based
model. There are two intertwined problems to overcome, though - the two
databases need to use the same IDs for the same things, and there can't be
conflicts between the relations the two databases use (eg, if one uses
parent(father,mother,child) and the other uses parent(child,father,mother)
you can't just put both databases together or you won't know how to interpret
the parent relation; you'll need to translate the parent relation from one
into the parent relation of the other). I've got some ideas for how to handle
that, but we'll cover them later.
Rules
So far we have given examples of relation-based data as a list of relations between specific objects, but relation-based systems also allow the specification of general rules. This is done by allowing parts of a statement to be placeholders called "variables" instead of specific values, and by allowing the database to contain implication rules.
Imagine we allow our database to contain appointments for a calendar, which are
objects of type id_appointment_type:
type(id20, id_appointment_type)
name(id20, "Pick mum up")
timestamp(id20, date-time(2025,12,25,15,00,00))
We could represent a recurring event like Christmas with a timestamp with
a variable ?YEAR replacing the year:
type(id21, id_appointment_type)
name(id21, "Christmas day!!!")
timestamp(id21, date(?YEAR,12,25))
(in this notation, I'll write variables in the form ?NAME). ?YEAR can match
any value occurring in the place it's found - in this case, the first part of a
date, which is the year. So this rule says this event falls on any date that's
the 25th of December in any year.
How this manifests is interesting. When one wants data from a relation-based database, the programmer poses a question: a relational statement with some variables in. If the application wants to know what appointments fall on the 25th of December 2025, it could ask:
type(?EVENT, id_appointment_type)
and name(?EVENT, ?NAME)
and (timestamp(?EVENT, date(2025,12,25) or
timestamp(?EVENT, date-time(2025,12,25,?H,?M,?S))
The system tries to find sets of values for the variables that match the
pattern. This will match our christmas rule, because ?YEAR can match 2025; it
will also match any appointments explicitly created with a full date that
matches the 25th of December. Each answer returned will contain values for all
the variables, eg:
?EVENT=id21,?NAME="Christmas"
?EVENT=id20,?NAME="Pick mum up",?H=15,?M=00,?S=00
(With support for "constraint logic programming", the query engine can also support constraints like "the date is between these two dates", eg to find all appointments in the coming week).
If you want the date of a particular event, such as id21, you might query:
equal(?EVENT,id21)
and type(?EVENT, id_appointment_type)
and name(?EVENT, ?NAME)
and timestamp(?EVENT, ?WHEN)
the response in this case would be:
?EVENT=id21,?NAME="Christmas",?WHEN=date(?YEAR,12,25)
The presence of an unbound variable ?YEAR, represents the fact that this is a
generic rule rather than a specific date. This isn't something a thing-based
database can naturally do!
But where it gets really interesting is the use of implication rules. Say we extend our person schema to include an optional date of birth:
date-of-birth(id1,date(1977,3,5))
We could add a "rule" that creates an appointment for the birthday of anybody in the database:
type(dob(?PERSON), id_appointment_type) if
type(?PERSON, id_person_type) and
date-of-birth(?PERSON, ?DOB)
timestamp(dob(?PERSON), ?DOB) if
type(?PERSON, id_person_type) and
date-of-birth(?PERSON, ?DOB)
name(dob(?PERSON), ?APPOINTMENT_NAME) if
type(?PERSON, id_person_type) and
date-of-birth(?PERSON, ?DOB) and
name(?PERSON, ?PERSON_NAME) and
string_join(?APPOINTMENT_NAME, [?PERSON_NAME, "'s birthday"])
These rules are of the form "(some statement involving variables) if (some query
possibly involving the same variables)".
The first one says that there's an appointment object with ID dob(?PERSON) if
?PERSON is the ID of a person object that has a date of birth (we match the
variable ?DOB, but we don't care what its actual value is).
The second one states that the timestamp of a dob(?PERSON) object is the date
of birth of a person with id ?PERSON, as long as that exists.
The third, more complicated one, gives names to objects with an id of
dob(?PERSON), by finding a person with a birthdate whose id is ?PERSON,
finding their name, and making a string by combining their name with 's
birthday.
With those inference rules fed into the database, every person with a date of birth will automatically cause an appointment to exist on their birthday!
These appointments will, of course, exist in addition to explicitly created ones - the definitions above will not "conflict" with raw statements of fact, or other rules that create types, timestamps, and names of objects.
We can also use rules to express broad facts. If we had a likes relation
between a person and some other object, we might say likes(id_of_sam,
id_of_a_tomato) to say that Sam likes a particular tomato - but what if we want
to say that Sam likes all tomatoes? Well, we could say likes(id_of_sam,
?TOMATO) if type(?TOMATO, id_of_tomato_type).
But we can do something far more interesting with rules: we can create
recursive rules that invoke themselves. The classic example would be to use
our parent relation to define an ancestor relation that relates a person to
anybody who is their parent, or parent of their parent, and so on up the family
tree. We can do this easily. First, let's define a relation rule to check if
somebody is a parent of somebody else, without caring if they're the mother or
the father:
parent(?PARENT, ?PERSON) if parent(?PARENT, ?ANYBODY, ?PERSON)
parent(?PARENT, ?PERSON) if parent(?ANYBODY, ?PARENT, ?PERSON)
With that done, the ancestor relation rule is easy:
ancestor(?ANCESTOR, ?PERSON) if parent(?ANCESTOR, ?PERSON)
ancestor(?ANCESTOR, ?PARENT) if
parent(?INTERMEDIATE, ?PERSON) and
ancestor(?ANCESTOR, ?INTERMEDIATE)
The first rule says that X is an ancestor of Y if X is a parent of Y - the simplest case.
The second rule says that X is an ancestor of Y if somebody - let's call them Z - is a parent of Y, and X is an ancestor of Z.
Being able to include such rules, as well as "raw" data in the form of simple statements of relations, gives relation-based databases great and useful expressive power.
Solving data management problems
The relation-based model makes it easier to solve many of the interesting problems in data management that I mentioned at the start of this article. It doesn't directly solve them itself; but it provides, in some ways, a better framework on which to build solutions; the ability for objects to have multiple types, combined with using rules to translate old-format data to new, makes it easy to handle changing the structure of data while keeping all your existing data.
Many of the other problems I listed - ones to do with large datasets, high demand to read or write data, and needing to note lose data and keep the system running when parts fail - can be solved by "replicating" data, which means storing multiple copies in different places. This is done so you have spares if one fails, and if you have multiple copies you can distribute the read load across the copies - if one server can handle a million reads per second, ten servers with a copy each can handle ten million reads per second.
But the hardest aspect of implementing replication is the fact that changes or additions to the database need to be applied to all the replicas - a task that seems simple at first glance ("Just make the same change to every copy!"), but which is actually exceedingly complicated to do right. Relation-based models can be used to represent changes to a database by inserting new relations and rules, and updates to (or removals of) existing ones handled by new relations contradicting or overruling existing ones; these new relations can just be appended to the end of the database (if query evaluation is done in a way that allows later statements to override earlier ones when they commit), allowing for rapid incorporation of incoming changes with an append-only log (subject to the complications of ensuring that everyone ends up with the same things in the log in the same order despite unreliable infrastructure, but that's a well-studied problem in computer science). However, parts of the log can be "collapsed" once it is agreed on through the cluster, by merging contradicting statements together, effectively modifying the original statement to incorporate the conflicting information in the new statement.
Also, relation-based data models tend to split things into more smaller pieces - rather than having a large and complicated record containing lots of information about a thing, as is common in the thing-based model, you tend to have more, smaller, relations describing aspects of a thing. This means that updates to unrelated bits of information about a thing don't necessarily conflict, while two updates to unrelated fields of the same large record need to be carefully handled so that one doesn't entirely overwrite the other.
