Modelling data with relations (by alaric)
Traditional schemas: Thing-centered data models
All mainstream data management systems in use today revolve around defining "things", and listing what kinds of data are stored about each thing, and how those things tie together.
For instance, as mentioned above, we might say that a "person" record contains a
"name" (text) and a "date of birth" (date); and we might say that a "party"
record contains a "date" (date), "start time" (time), "location" (address), and
"attendees" (list of references to person records). The data management system
will dutifully decide how to store a bunch of "person" records and "party"
records, and probably give each an "id" of some kind so they can be identified
and referred to - often some kind of number, either a short one you might feel
comfortable reading out over the phone or, for various exciting technical
reasons, a long one such as a hexadecimal UUID that you'd only ever want to
copy and paste rather than typing out yourself (let alone trying to read it out
over the phone). You might have seen these things around -
f81d4fae-7dec-11d0-a765-00a0c91e6bf6 is an example! IDs aren't just used to
refer to objects externally, of course - often objects will refer to other
objects, like the attendees of a party in our example; this is generally
implemented by storing a list of IDs of attendees inside the party object. The
IDs need to be unique to each particular record of a given type - there can't be
two "person" records with the same ID, or we won't know which one we're talking
about - but it's entirely possible that the same IDs can be used for different
types of record.
This makes a lot of sense. It's simple to look at the "problem domain" - the actual description of the real-world problem we're trying to build a software system to handle - and think about how to make thing-centered schemas to describe the things in your problem domain.
Relation-based data models
However, there's another approach, based on formal logic. It's the data model used in the Prolog programming language, and a variant of it underpins RDF and the Semantic Web. In the relation-based model, nothing is stored "inside" the objects, and really all an object "is" is a unique identifier. All the information is stored in relationships between them - and other objects, such as strings of text or numbers.
The classical exmaple for explaining this is a family tree. In such a system, there might be a set of objects representing people, which are just unique IDs of some form. These IDs need to be completely unique; you can't reuse the same ID for a person and some other kind of thing.
To provide information about those people, you provide named relations. The "name" relation might connect a person ID to a text string that's the person's name, for example. And a "parent" relation might connect a person ID (the father) to a second person ID (the mother) and a third parent ID (the child).
Unlike in the thing-based model, there isn't automatically a notion of object type. In the thing-based model, you'd define a "person" and a given person ID could be used to identify a person from the list person records, and every record would have the expected things inside it, as declared in the schema.
Instead, there is a notion of relation type; the name relation would be defined to connect a person ID to a text string, and so on. So all the data is still subject to a schema of some kind, with the attendant benefits of being able to check the validity of data, help the user with data entry, and so on.
A piece of software that prints information about a person, given the person's ID, would search the database for "name" and "parent" and whatever other person-related relationships it knew how to handle.
And that's the crux of this model's flexibility: if a program understands the "name" and "parent" relations of people, then it won't break if our database also contains "email-address" relations between people IDs and email addresses (as text strings). On the other hand, our software for displaying family trees can also be applied to any object that has name and parent relations - such as a horse pedigree database. Relations like "name" will apply to most kinds of objects, so a piece of software that just needs to know the names of things can look for a "name" relation, and it'll work with anything that has a name.
"But that's just the same as looking for a 'name' column in an arbitrary table
of a SQL database", somebody who's familiar with traditional databases might
say. Well, sort of. For a start, while in the relation-based model object IDs
are globally unique, and the schema for the name relation states that it links
an object to a name string; to do the same in a SQL database you'll need to know
what table a given ID is to be looked up in. And you'll need to know what column
of that table is the ID column (most databases will let you query the table
schema to find the primary key, but how to do that isn't defined in the SQL
standard): somehow or other you'll need to fill in the blanks in the SQL query
SELECT name FROM [BLANK] WHERE [BLANK] = [id-you-want-to-know-the-name-of].
If this model allows for such flexibility, with relations being shared between types of objects, you might still want a notion of object type. An application for exploring arbitrary data might want to be able to work out that a given object is a person, for instance. Well, object types might not be inherent in the model, but we can add them - with a "type" relation linking an object ID with a second object ID, where that second object is a type object. For a start, it would have a name relationship with a name, which would probably be "type". This is starting to get a bit much to hold in your head, so let me write out an example:
name(id1, "Alaric")
type(id1, id_person_type)
name(id2, "Sarah")
type(id2, id_person_type)
name(id3, "Jean")
type(id3, id_person_type)
parent(id1,id2,id3)
name(id_person_type, "person")
type(id_person_type, id_type_type)
name(id_type_type, "type")
type(id_type_type, id_type_type)
Here, we've defined a few objects: id1 is me, id2 is Sarah, id3 is one of our children, Jean. These objects are given names, and types of the object with id_person_type; and a parent relationship joins them together.
Then we defined the object id_person_type, which is of type id_type_type, and is called "person" - it's the type of people.
And then we define the type id_type_type, which is the type of types, and is called "type", and has itself as a type.
So, we've defined a notion of object type, but we can also build a schema on top of that. For a start, we could define the schemas of the relations we've used so far:
relation(name,[object:id,name:string])
relation(parent,[father:id,mother:id,child:id])
relation(type,[object:id,type:id])
(Note that our definition of parent is insufficient for modelling real parenting relationships - we already have the technology to have multiple donors of genetic material to a baby, and a potentially separate person to gestate them, and a potentially separate set of people who actually raise the child, but let me run with this example, OK?)
And we could define schemas for the object types, by indicating what relations must, or might, exist for objects of a given type. Perhaps we might say that people (objects of type id_person_type) most appear as the object in a name relation, and might occur in the father, mother, or child place in a parent relation:
required-relation(id_person_type, name, object)
optional-relation(id_person_type, parent, father)
optional-relation(id_person_type, parent, mother)
optional-relation(id_person_type, parent, child)
So this model doesn't stop us having the full ability to define schemas as in the thing-based model; while also admitting the flexibility of an object having multiple types - one object can be a person AND an employee, for instance; the real world contains endless examples of overlapping categories of thing, but the traditional thing-based model doesn't make it particularly easy to handle this.
A subtlety: IDs versus strings, numbers, etc
You might be wondering about the real nature of the difference between a "person
object" that is identified by an ID, as opposed to strings like "Alaric", or
maybe numbers, or even symbols like name, parent, type, and so on. And you might
have noticed that in the definition of relation schemas, I included a list of
parts in a relation, written like [object:id,name:string], and that those parts
themselves were two parts with a colon between.
Well, in theory, they're all just objects, like any other. Rather than writing
"Alaric", you could instead give the ID of a string object, and then have a
relation you can use to state that the first letter of this object is A, and
the second letter is l, and so on - but it would be rather cumbersome. So as a
convenience, the written form I have show above lets you just write "Alaric";
and similar for lists of things, and colon-separated pairs of things.
Notionally, an object of inbuilt type like "string" or "number" has its value as its ID, because its value entirely describes it. Only things the system doesn't inherently understand as part of its most basic inbuilt model, like people, need to be assigned explicit IDs and described with other relations!
Not to be confused with the relational model!
Those of you familiar with databases might be a bit confused right now, because surely the "relational" model is what underlies normal thing-based databases?
This is an unfortunate accident of naming, although there is a good reason for it. The thing-based model is often represented by a relational database, because relations like our "name" or "parent" relations are, indeed, based on the same mathematical concepts used to handle a "person" table in a conventional database. The difference is how the concept of an "object" is represented; in the relational model, the objects you're representing get turned into records in a table, and given an ID of some kind, which need only be unique within that table. While in the relation-based model, objects are an explicit thing with globally unique IDs (or, for inbuilt things like strings, lists, and numbers, a value), and relations are used to store information by relating objects to each other.
In the computer science literature, what I'm calling the relation-based model tends to be called "logical", as in "logical databases" or "logic programming". This is because the way that queries are expressed and executed in the relation-based model is based on mathematical logic, but I'm trying to focus here more on how data is modelled and represented than how it's queried (although I'll talk about the implications of the data model on how it CAN be queried, as the relation-based model allows for more expressive forms of query than the thing-based model).
Relation-based models have also been referred to as encoding "knowledge" or being "semantic" rather than mere "data", "actual meaning" as opposed to just text and numbers that humans can understand but humans cannot. This notion stands on shaky ground; anything might just be meaningless data or meaningful knowledge depending on the being interpreting it - is a poem written in Chinese just "meaningless data" because I don't know Chinese? But there is a grain of truth in it; the relation-based model makes the identities of things explicit, whereas the foreign-key and primary-key relationships (you'll get to learn about them below!) that might be encoded in a thing-based schema merely somewhat imply object identities, if they are present. And that certainly opens more opportunities for computers to automate more processing of the data, in a way that might (if you squint hard enough) be considered a form of "greater understanding" on the part of the computer. Having said that, I am guilty of using the term "knowledge base" to describe a relation-based database, to distinguish it from a thing-based "data base", because it's convenient.
So, to conclude: the relation-based model is different to the relational database model, albeit sharing some common theoretical underpinnings.
