Modelling data with relations

An example relation-based system: The Semantic Web

Perhaps the most "mainstream" relation-based system was the Semantic Web, but sadly, it's pretty much died out now after an initial surge of hype in the early 2000s.

The idea of the Semantic Web was to publish relation-based information on the Internet, largely via HTTP, using URIs (the more general concept that URLs are one particular case of) as identifiers for things and relations. The underlying data model is called RDF, and a few different ways of encoding it were defined, such as RDF/XML (based on XML), RDFa (attributes you can embed in HTML), and bespoke languages such as N-Triples or N3.

I won't bore you with the representations; the interesting parts of the Semantic Web, however, are the use of URIs for things, the fact that it restricts relations to relating precisely two objects, and its extremely limited support for expressing rules.

URIs for IDs

The Semantic Web uses URIs to identify things. This has one immense upside - it's a hierarchical system, which is very useful for increasing your chances of being able to meaningfully merge relations from multiple sources.

You don't need to end up fighting over the use of an ID if the IDs you make are all based off a URI you control. The most widespread form of URI is the HTTP (or HTTPS) URL, and somebody publishing data on the Internet probably has HTTP URLs already that they're using to publish stuff, so they can make up their own URLs on their domain - and they won't clash with URLs anybody else makes on their own domains.

But ensuring different things get different IDs is only one half of the problem of combining relations from different publishers; the other part is ensuring the same things get the same IDs. Using URIs helps with that, too. If I make it known that a particularly popular thing has a particular URI for its ID, then word gets around. Although other people aren't supposed to make ID URIs from URIs that I notionally control, they are welcome to use those URIs to refer to the same things. So in the presence of a community all creating URI IDs for things and publishing them, although it is possible for somebody to make a new ID for a thing without realising somebody has already published one (and we'll talk about what happens to fix that later), there is nonetheless some pressure - and a simple mechanism - to gravitate towards using the same IDs for the same things.
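As a rough sketch of that in Python (the two publishers, their base URIs, the friendOf relation and the mint_id helper are all made up for illustration), each publisher only mints IDs beneath a base URI they control, but is free to refer to IDs somebody else has published:

```python
from urllib.parse import quote

def mint_id(base_uri, local_name):
    """Make an ID URI beneath a base URI that the publisher controls."""
    # Percent-encode the local name so it is safe to use as a path segment.
    return base_uri.rstrip("/") + "/" + quote(local_name, safe="")

# Two hypothetical publishers, each minting IDs on their own domain.
alice_cat = mint_id("https://alice.example/ids", "cat")
bob_cat = mint_id("https://bob.example/ids", "cat")

# Same local name, different publishers: the IDs cannot clash.
ids_are_distinct = alice_cat != bob_cat

# Bob may still refer to Alice's cat by reusing her published ID in his own
# statements; he just shouldn't mint new IDs under her base URI.
bob_statement = (bob_cat, "https://bob.example/ids/friendOf", alice_cat)
```

Nothing enforces any of this, of course; it's a convention that works because people generally only publish under domains they control.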

However, the Semantic Web made a bit of a mistake here, because URIs are also used for other things on the Internet. For instance, I have published an RDF file full of relation-based statements about myself, using various vocabularies of relation names published by various bodies, which is available from the URL https://www.snell-pym.org.uk/alaric/alaric-foaf.rdf - and in it I have used that URL as the ID of the RDF file itself, when making statements about it (such as the fact that I was its author and that it describes me). I have used a different URI urn:oid:1.2.826.0.1.4062548.2.1 to represent myself. That's not a URL, because I have taken the position that I'm not a Web page, I'm a person - so using a URL as an ID for me would be conflating me and whatever you get when you put that URL into a Web browser.

The Semantic Web lets you write statements about a Web page (or any resource with a URL), and in that case it makes sense to use the URL of that Web page to represent it. But it also lets you write statements about things that aren't Web pages, and makes you pick a URI for them anyway, and most people have used URLs because they're easy to make. So if you have the URL of something from some Semantic Web RDF, you don't really know if the information therein is about whatever's at that URL, or some abstract other thing that exists outside of the Internet.

As the names of relations are IDs, Semantic Web relations have names such as http://xmlns.com/foaf/0.1/firstName; but if you visit that URL you'll find a human-readable specification of the "FOAF vocabulary", a set of relations useful for describing humans and their relationships (FOAF standing for "Friend of a friend"). Embedded in that HTML are RDFa statements defining the FOAF relations and schema information about them. But there's nothing requiring the URI used to identify a relation to be a working URL at all, nor for anything particular to be there. There's no standard way to go from the URI naming a relation to any useful information about that relation; and in general there's no standard way to go from the URI used to represent an object to any information that object may have published about itself.

This isn't the end of the world; it just means you need some other way to find places to look for information, such as crawling the entire Web of objects. There is a relation called http://www.w3.org/2000/01/rdf-schema#seeAlso that relates any object to the URL of an RDF file containing statements about it; it is widely used in the Semantic Web to refer to other sources of information about a thing, which facilitates such crawling and indexing. But the fact that a URI is often a URL that might point to human-readable prose, or to machine-readable RDF, about the abstract "thing" the URI identifies, or might instead be the ID of the actual file you get when you visit that URL, has created endless confusion.
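To give a feel for seeAlso-driven crawling, here's a minimal sketch in Python. It doesn't touch the network: FAKE_WEB is a made-up stand-in mapping the URLs of RDF files to the triples in them, and all the example.org URLs and relation names are invented. Only the seeAlso URI itself is real.

```python
from collections import deque

SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
SAM = "https://example.org/id/sam"

# A stand-in for the Web: URL of an RDF file, mapped to the triples found in it.
FAKE_WEB = {
    "https://example.org/a.rdf": [
        (SAM, "https://example.org/vocab/likes", "https://example.org/id/tomato"),
        (SAM, SEE_ALSO, "https://example.org/b.rdf"),
    ],
    "https://example.org/b.rdf": [
        (SAM, "https://example.org/vocab/name", "Sam"),
        (SAM, SEE_ALSO, "https://example.org/a.rdf"),        # a cycle
        (SAM, SEE_ALSO, "https://example.org/missing.rdf"),  # a dead link
    ],
}

def crawl(start_url, fetch=FAKE_WEB.get):
    """Collect triples from start_url and every file reachable via seeAlso."""
    seen = {start_url}
    queue = deque([start_url])
    triples = []
    while queue:
        url = queue.popleft()
        for triple in fetch(url) or []:  # unreachable files contribute nothing
            triples.append(triple)
            _, relation, obj = triple
            if relation == SEE_ALSO and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return triples

all_triples = crawl("https://example.org/a.rdf")
```

The seen set is what stops the crawler going round in circles when two files point at each other, which in practice they often do.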

Objects with zero, or more than one, ID

The Semantic Web also supports anonymous objects, which aren't assigned a URI. They're given some temporary name, unique and only meaningful within the scope of the file describing them. They are considered to be objects for which the correct URI is unknown, but there are some mechanisms within the Semantic Web schema languages to provide ways for a system to identify that multiple things are the "same" thing and thus merge them, potentially finding one or more URIs to identify them. As far as I know, there are two mechanisms: a relation called http://www.w3.org/2002/07/owl#sameAs that states that two objects are the same object; and the ability to specify that a relation is an "inverse functional property", meaning "some other kind of unique ID", so that two objects with the same value for an inverse-functional-property relation can be considered the same object. The canonical example is the http://xmlns.com/foaf/0.1/mbox relation, which links a person to a mailbox (eg, an email address, as a mailto:... URL). Two person objects with the same email address can, therefore, be assumed to be the same person.

This means you can have useful objects with zero IDs - they're given local, temporary, IDs by the system that explicitly have no meaning outside the boundaries of that system, but can still be referred to within that system. But it also means you can have an object with multiple IDs, because different people have assigned different IDs to the same object. This is either manually fixed by using the sameAs relation to merge them later, or this duplication might be found automatically through examining some inverse functional property and finding a match.
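Here's a minimal sketch of how a system might do that merging, in Python. It's my own illustration rather than anything from a Semantic Web specification: it uses a union-find structure to group IDs, treats names starting with "_:" as anonymous objects, and the example IDs and email addresses are made up. Only the sameAs and mbox URIs are real.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"

def merge_identities(triples, inverse_functional=(FOAF_MBOX,)):
    """Group IDs that denote the same object; returns {id: representative}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Prefer a real URI over an anonymous "_:" name as representative,
            # then alphabetical order, purely so the result is deterministic.
            keep, drop = sorted((ra, rb), key=lambda i: (i.startswith("_:"), i))
            parent[drop] = keep

    first_owner = {}  # (relation, value) mapped to the first subject seen with it
    for subject, relation, obj in triples:
        find(subject)
        if relation == OWL_SAME_AS:
            union(subject, obj)
        elif relation in inverse_functional:
            owner = first_owner.setdefault((relation, obj), subject)
            union(owner, subject)
    return {x: find(x) for x in parent}

triples = [
    ("_:b1", FOAF_MBOX, "mailto:sam@example.org"),  # an anonymous object
    ("https://a.example/id/sam", FOAF_MBOX, "mailto:sam@example.org"),
    ("https://b.example/people/sam", OWL_SAME_AS, "https://a.example/id/sam"),
    ("https://a.example/id/alex", FOAF_MBOX, "mailto:alex@example.org"),
]
same = merge_identities(triples)
```

Here the anonymous object and both published IDs for Sam end up in one group, while Alex stays separate. In a real system you'd read the list of inverse functional properties from the schema rather than hard-coding it as I have.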

This isn't incredibly complicated, but in my experience it's not been explained very well in the Semantic Web literature, so it's a bit of a gamble as to how well any given system will correctly handle all this stuff!

Relations are all binary

Relations in the Semantic Web consist of three things: a subject (which is anything, so must be identified by a URI), a relation (which also must be identified by a URI), and an object, which may either be a URI representing an object or a literal value (an arbitrary string, optionally tagged with a datatype such as "integer", and optionally tagged with a language).

This isn't so bad, as many relations are binary in nature. But something like our original three-object parent relationship would need to be represented by creating a whole new object called, I dunno, a pedigree or something, which is defined (in a schema) to have a father relationship to a person, a mother relationship to another person, and a child relationship to a third person. This means you can express anything you could express in Prolog, you just need to create a bunch more objects to describe more complex relationships.
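As a sketch of that trick in Python (the reify helper, the short relation names and the people are all made up; real Semantic Web data would use URIs throughout), a three-way relation becomes a fresh anonymous object plus one binary relation per role:

```python
import itertools

_counter = itertools.count(1)

def reify(relation_type, **roles):
    """Turn an n-ary relation into binary triples about a fresh anonymous object."""
    node = f"_:{relation_type}{next(_counter)}"
    triples = [(node, "type", relation_type)]
    triples += [(node, role, value) for role, value in roles.items()]
    return node, triples

# The three-object parent relationship, expressed as a "pedigree" object.
node, triples = reify("pedigree", father="bob", mother="alice", child="carol")
```

That gives four triples: one saying the new object is a pedigree, and one each for its father, mother and child.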

There is also a weak convention that the subject is the "primary" thing in a relationship, where such a distinction exists, somewhat enforced by literals only being allowed as the object. This leads to the nice property that it's easy to search for "What do we know about X?": you can search for any relation with X as the subject, to find out facts about X, then search for any relation with X as the object, to find "things that refer to X". In Prolog, for instance, there are ways to find "all relations with X as the first object" or "all relations involving X in any form", but it's not very easy.
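A minimal sketch of why that's easy with triples, in Python (the TripleStore class and the example data are made up for illustration): you keep one index by subject and one by object, and "What do we know about X?" is just two lookups.

```python
from collections import defaultdict

class TripleStore:
    """Triples indexed by subject and by object for quick lookups."""

    def __init__(self, triples=()):
        self.by_subject = defaultdict(list)
        self.by_object = defaultdict(list)
        for subject, relation, obj in triples:
            self.add(subject, relation, obj)

    def add(self, subject, relation, obj):
        self.by_subject[subject].append((relation, obj))
        self.by_object[obj].append((subject, relation))

    def about(self, x):
        """Facts with x as the subject, plus things that refer to x."""
        return {
            "facts": list(self.by_subject.get(x, [])),
            "referred_to_by": list(self.by_object.get(x, [])),
        }

store = TripleStore([
    ("sam", "likes", "t1"),
    ("sam", "name", "Sam"),
    ("alex", "knows", "sam"),
])
known = store.about("sam")
```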

Rules and Queries

The formats for representing Semantic Web information, like the Datalog database, only provide facts: relations about objects. The "anonymous object" mechanism provides a limited way to make statements involving variables in the Prolog/Datalog sense, where the variable represents the unknown ID of the anonymous object, but this doesn't really let you express things like our "Sam likes anything that is a tomato" example; writing RDF that says Sam likes an anonymous thing that is a tomato merely states there is something Sam likes that's a tomato, not that Sam likes all tomatoes.

The Semantic Web takes the Datalog model: that the database is a sea of facts from untrusted sources, and all the fun rule stuff is part of a "query" that you choose to run. It does define a query language, SPARQL, that like Datalog doesn't have ways to cause side-effects, so it is theoretically safe to run arbitrary SPARQL queries somebody sends you; but there isn't a way to publish SPARQL queries as "rules" in your RDF files that will be interpreted alongside the plain facts contained therein.
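To make the "facts in the data, rules in the query" split concrete, here's a toy pattern matcher in Python. It is not SPARQL, just a sketch in the same spirit: patterns are triples in which names starting with "?" are variables, and the "Sam likes anything that is a tomato" rule lives in the sam_likes function run by whoever is querying, not in the published facts. All the names and data are made up.

```python
def match(pattern, triple, bindings):
    """Try to extend bindings so that pattern equals triple; None on failure."""
    result = dict(bindings)
    for p, value in zip(pattern, triple):
        if isinstance(p, str) and p.startswith("?"):
            if result.setdefault(p, value) != value:
                return None  # variable already bound to something else
        elif p != value:
            return None
    return result

def query(patterns, triples, bindings=None):
    """Yield every set of variable bindings satisfying all the patterns."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    for triple in triples:
        extended = match(patterns[0], triple, bindings)
        if extended is not None:
            yield from query(patterns[1:], triples, extended)

facts = [
    ("t1", "type", "tomato"),
    ("t2", "type", "tomato"),
    ("c1", "type", "cheese"),
    ("sam", "likes", "c1"),
]

def sam_likes(triples):
    # Plain facts: things the data directly says Sam likes.
    stated = {b["?x"] for b in query([("sam", "likes", "?x")], triples)}
    # The rule, applied at query time: Sam likes anything that is a tomato.
    derived = {b["?x"] for b in query([("?x", "type", "tomato")], triples)}
    return stated | derived

liked = sam_likes(facts)
```

The facts only say Sam likes one piece of cheese; the two tomatoes appear in the answer purely because the person running the query chose to apply the rule.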

However, there is limited support for some kinds of rule-like behaviour. I mentioned the sameAs and seeAlso relations earlier, and the concept of inverse functional property relations (which are declared by stating the relation is of type http://www.w3.org/2002/07/owl#InverseFunctionalProperty, in turn using the http://www.w3.org/1999/02/22-rdf-syntax-ns#type relation); using them, a data publisher can express information that can drive a query engine to go and look up other data sources (other objects, or other RDF files) and incorporate their contents into the database, in limited ways.

What went wrong?

The Semantic Web is quite cool; a standard format for publishing data online that, because it uses a relation-based model, URIs as IDs, and the sameAs/inverse functional property mechanisms to bring duplicate IDs together, offers a way to merge data from disparate sources together!

But, after a peak of interest in the first decade of the third millennium, the Semantic Web is largely considered a failure these days. As far as I'm aware, there is no software actually reading my FOAF file and doing anything interesting with it.

What went wrong?

Well, the causes of something failing are often complex and multi-faceted, but I think the core problem was that the Semantic Web peaked based partly on the celebrity of its original proponent (Tim Berners-Lee, the creator of the World Wide Web) and partly on its promise of providing a platform to enable useful automation of tasks that otherwise required human beings to go and read Web pages and understand the text on them. However, several things got in the way: the complications around actually finding Semantic Web data about a thing on the Internet (too many conflicting conventions about the relationship between the URI of a thing and the URI of a Semantic Web file describing it), the confusion about the meaning of the URI used to identify a thing, the complicated ways to try to establish the identity of a thing with zero or more than one URI, and the fact that query technology was a bit of an afterthought after the publishing technology. As a result, although many sites started publishing Semantic Web data (especially social media platforms publishing FOAF), applications doing anything useful or interesting with that information rather lagged behind. Issues such as "How to deal with contradicting information from different sources?" were rather hand-waved over: the progenitors of Semantic Web standards were more interested in making it easy to publish stuff than to actually process it, assuming that somebody else would figure all that out, just like they did with the original HTML/HTTP Web. So it's little surprise that lots of hopeful people (such as myself) published their data in RDF - because the publishing side was fleshed out - but few applications came along to use it.

So we had a lot of hype about a thing, a shortage of useful results for anybody who tried that thing, and the thing itself being somewhat half-baked in certain technical aspects; disillusion set in.

Meanwhile, commercial pressures were turning an Internet that had once been keen on enabling automation - by publishing APIs and data in standard formats - into one that inhibited it. Social networks transformed into walled gardens as they gained sufficient power to become monopolies over their particular corners of the community, and used that power to extract maximum value from their users and make it harder for those users to move their data to other places. Suddenly, there was no longer commercial value in for-profit services being part of a wider community. Most Semantic Web data had been FOAF social graphs, and that corner of the Semantic Web suddenly wanted to close down and keep its cards close to its chest.

I think there's a lot of good in Semantic Web technologies, that could be used for a lot of stuff today (ESPECIALLY with the Fediverse taking off!), but the rough edges need to be fixed; and it needs to escape from being tarred with the unfortunate brush of being an old, failed, technology.

An example relation-based system: Lojban

Let's take a sudden pivot away from the nerdy world of computers, towards the totally non-nerdy world of humans actually talking to each other. And then rapidly pivot back into nerdiness again, by talking about the constructed language Lojban.

Lojban is designed for humans to talk to each other, but the language is based around mathematical predicate logic, which the relation-based data model I describe above is a particular subset of.

Whereas in English, you might say:

"I like eating cheese"

In Lojban, you might say:

"mi nelci lo'e nu mi citka lo cirla"

Each of those five-letter words - "nelci", "citka", and "cirla" - is the name of a relation.

"nelci" means "likes", and relates a thing with opinions to a thing that the first thing likes; in this case, it relates "mi" (which refers to the speaker/writer of the sentence) to "lo'e nu mi citka lo cirla".

"cirla" is a relationship between some cheese, and the origin of the cheese (eg, sheep's milk, cow's milk, etc).

"lo" is a construct that turns a relation into an arbitrary object that would fit in the first place of that relation. So "lo cirla" refers to an arbitrary quantity of cheese, and is something like a noun in other languages.

"citka" is a relationship between a thing that eats and the thing that is eaten, so "mi citka lo cirla" means "I eat some cheese".

"nu" is a prefix that turns a statement of a relationship between some things - in this case, "mi citka lo cirla", "I eat some cheese" - into a relationship that expresses the fact that some object is an event of that happening. This is a little confusing at first, but "nu mi citka lo cirla" is a relationship meaning that something is an instance of me eating some cheese.

"lo'e" is a prefix that, similarly to "lo", turns a relationship into a typical thing that would fit in its first place - so "lo'e nu mi citka lo cirla" is "The typical instance of me eating some cheese".

Therefore, "mi nelci lo'e nu mi citka lo cirla" translates literally to "I like the typical instance of me eating some cheese"; it doesn't state you always like to eat anything that's cheese, merely the "typical" instance. It's making something of a statistical statement that you enjoy the majority of times you eat cheese.

But it's clear how this is all about relations between things. You might write it as something like this:

likes(me, ?TYPICAL-CHEESE-EVENT)
typical-instance-of(?TYPICAL-CHEESE-EVENT, ?TYPE-OF-CHEESE-EVENT)
event(?TYPE-OF-CHEESE-EVENT, eats(me, ?CHEESE))
type(?CHEESE, id-of-cheese-type)

Communicating in Lojban requires one to think about everything in terms of relations between objects, and demonstrates that everything can be broken down into that form - even complex English grammar, with all its controlled vagueness, has a representation in relations between objects. Vague concepts like "usually" or "typically" can be expressed, with relations like typical-instance-of expressing that something is a "typical" instance of some category, even though "typical" can't be precisely defined! (Lojban also has a word, "le'e", that works like "lo'e" but refers to a stereotypical instance of something rather than a typical one - directly expressing the notion that it might represent some inaccurate cultural expectation rather than actually being typical, too, to further complicate matters...).

An aside: Linda tuple spaces

Ok, so this is a bit of a tangent, but there's a technology I've had a bit of a nerd crush on for years, that's kind of related. So let's take a quick detour to look at it.

The Linda model is that a bunch of parallel processes working together on some shared problems - perhaps a bunch of parts of a program running on multiple CPUs inside one computer, or a room full of computers cooperating on some large problem - can coordinate activities and share information between themselves by using a shared "tuple space". This is a place where processes can place "tuples", which can be seen as like records in a SQL database or statements of relationship in a relation-based system, such as email(alaric, "..."). Processes can also execute queries against the tuple space, such as email(alaric, ?EMAIL). However, rather than finding all the email addresses alaric has, this query will either return a single result if a tuple such as email(alaric, "...") is found - deleting that tuple in the process, so no OTHER query can also return it - or make the querying process wait until one such tuple arrives.

Clearly, you can use such a tuple space as a way to distribute jobs. You could insert tuples of the form we-want-to-send-an-email("foo@example.com", "Dear Foo...") to indicate that we want to send an email to somebody, while a pool of email-sender processes sits waiting on a query for we-want-to-send-an-email(?ADDRESS,?BODY); when they get a result, they send the corresponding email, then repeat the query to try to find new work. If we add more email-senders we can send more emails per second; we could even monitor the number of pending tuples in the tuple space, firing up more email-senders when we need them, or closing some down if there are no we-want-to-send-an-email tuples waiting.

But you can also use the tuple space as more of a database, storing information of interest to the collective of processes; it's also possible to "non-destructively" query the database, not deleting the matching tuples. This could be the database part of a Datalog-like system with arbitrarily complex rules in the query, or you could build a hybrid system that allows rules in the tuple space (they'd only really make sense for non-destructive queries).
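The ideas above can be sketched as a toy, single-process tuple space in Python, with threads standing in for the cooperating processes. This is my own illustration, not any real Linda implementation: the put, take and read_all method names are made up (Linda traditionally calls the operations out, in and rd), None in a pattern plays the role of a ?VARIABLE, and the addresses and the email-server tuple are invented.

```python
import threading

class TupleSpace:
    """A toy, in-process tuple space. None in a pattern matches anything."""

    def __init__(self):
        self._tuples = []
        self._changed = threading.Condition()

    @staticmethod
    def _matches(pattern, tup):
        return len(pattern) == len(tup) and all(
            p is None or p == v for p, v in zip(pattern, tup)
        )

    def put(self, tup):
        """Place a tuple in the space and wake any waiting queries."""
        with self._changed:
            self._tuples.append(tup)
            self._changed.notify_all()

    def read_all(self, pattern):
        """Non-destructive query: matching tuples stay in the space."""
        with self._changed:
            return [t for t in self._tuples if self._matches(pattern, t)]

    def take(self, pattern, timeout=None):
        """Destructive query: remove and return one match, waiting for one
        to arrive if necessary. Returns None if the timeout expires."""
        def first_match():
            return next((t for t in self._tuples if self._matches(pattern, t)), None)

        with self._changed:
            if not self._changed.wait_for(lambda: first_match() is not None, timeout):
                return None
            tup = first_match()
            self._tuples.remove(tup)
            return tup

JOB = ("we-want-to-send-an-email", None, None)  # pattern: any address, any body

def email_sender(space, sent):
    # Keep taking jobs; give up once none has arrived for a short while.
    while (job := space.take(JOB, timeout=0.2)) is not None:
        _, address, body = job
        sent.append(address)  # a real worker would send the email here

space = TupleSpace()
sent = []
workers = [threading.Thread(target=email_sender, args=(space, sent)) for _ in range(3)]
for w in workers:
    w.start()
for n in range(5):
    space.put(("we-want-to-send-an-email", f"user{n}@example.com", "Dear user..."))
space.put(("email-server", "smtp.example.com"))  # shared information, not a job
for w in workers:
    w.join()
```

Because take removes the tuple while holding the lock, each job goes to exactly one worker, whereas the email-server tuple stays put for anybody to read_all as often as they like. The timeout is just a lazy way of letting the workers shut down in this example.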

I think there's scope for tuple spaces to exist as a means for transient communication of information between cooperating processes in a complex system, alongside relation-based databases for long-term storage of information, and they could share an underlying data model and representation rather than needing to each build their own.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales