Modelling data with relations (by alaric)
A proposal: CARBON
I've looked at a few implementations of tools for handling data in a relation-based manner, and compared them to SQL databases, the canonical tool for representing data in a thing-based manner. I hope I've demonstrated that the relation-based model is pretty cool, and largely an improvement for most tasks over the thing-based model.
I've focused on where the thing-based model is weaker than the relation-based model, because I'm trying to make a case that we've neglected the latter; but rest assured that the thing-based model does work well for a bunch of cases. If your data is purely regular in shape - lots of records that all follow the same pattern - the relation-based model doesn't add anything over the thing-based model. However, I struggle to think of areas where the thing-based model is better than the relation-based model, as opposed to areas where it's merely just as good.
But I've also pointed out weaknesses in the relation-based tools I've presented as examples. Prolog is too powerful, aiming to be a general purpose programming language that just happens to be based on a relation model, and puts extra burdens on the programmer to understand its execution model to get the best performance and avoid infinite loops. Datalog in its original definition is too restrictive, disallowing variables and rules in the database, although it's easy to imagine implementations not having that restriction. Neither has inherent support for using data federated from multiple publishers on the open Internet, unlike the Semantic Web, but the latter never really got around to properly defining a model for how to query all this data.
Needless to say, I can't resist trying to combine the best points of all of the above, and proposing an idea of my own...
ARGON
Of course, I did this in the context of my ARGON project, an attempt to redefine every bit of software infrastructure I can, because the current state of software is terrible, and fixing any one bit of it is hindered by needing to fit in with all the other broken things around. Without going into any more detail than necessary, the networking model of ARGON is that each system is organised into a bunch of "entities" which provide services over the network to other entities, on the same or other systems. A protocol called MERCURY handles this communication, much like how HTTP is used on the Internet today for users to communicate with the things that HTTP URLs identify. Entities are identified by Entity IDs (EIDs for short), which contain one or more network addresses for servers that can handle MERCURY communications for that entity, cryptographic keys to authenticate those servers, and other metadata. If you have an EID, MERCURY can let you talk to it (unless there's a network or server fault or something similar preventing it!).
How entities work internally is entirely up to the server hosting them, but the default tools for building entities that programmers are expected to use work by implementing an entity as a database to store its state, along with configuration specifying what services are provided by this entity via MERCURY, and specifying the source code of the software that will provide those services - by referencing software packages providing those implementations.
As I like the relation-based model, I propose that the data storage within an entity will be relation-based. An ARGON component called TUNGSTEN provides the persistent storage of relation-based data within entities, handling replication and distribution of state across a cluster of cooperating servers if available, so the software interface to TUNGSTEN needs to be "replication aware" from the ground up.
But, for consistency, this means that the information published by entities via MERCURY also needs to use the same relation-based model as TUNGSTEN; why have more data models than we need? Then we can export information from our TUNGSTEN storage via MERCURY, and store information gleaned via MERCURY in our internal TUNGSTEN store, without any translation. That data model I call CARBON.
Much as one can feed an HTTP URL into a browser, and it will perform an HTTP GET operation to ask the URL's server for the file that URL resolves to, and the browser will attempt to display it - one can, given an EID, ask MERCURY for that entity's data. The result will be a set of CARBON statements - relation-based facts and rules. This will be a big bundle of "everything this entity has to say (to you; it may authenticate you and provide different answers to different askers)", but it's also possible to send a specific relation-based query to the entity and just get answers to that query; and those queries may be answered from a larger set of CARBON statements than the entity publishes by default. Consider, for example, a massive database of information. Querying it for its data will just return top-level information about what the database is and what it provides, but if you send it a specific query, it can answer it from its many terabytes of data. The published data might contain a declaration that the query interface has access to more information than is in the published data, so it's worth asking it; it might even state what relations the query interface can answer questions about, so the user knows when to not bother.
Identifying objects
EIDs can of course be used as IDs in those relations, but other forms of ID exist. Randomly generated UUIDs can be used to represent objects that exist "outside the system", like people and abstract concepts like "love". Neither EIDs nor UUIDs are pleasant for humans to read, both basically being blobs of raw entropy, probably represented in hex if humans ever look at them; a more convenient form of ID for objects is the "symbol", a hierarchical name that can potentially be looked up using a process like DNS to find an EID or UUID. Let me explain how this works.
A symbol might look like /uk/gloucestershire/blagrave. To resolve that symbol,
the system would have a hardcoded EID of the namespace root, stored in its
configuration. It can use MERCURY to send that EID a query of the form:
carbon-tree-child(<ROOT EID>,"uk",?CHILD)
The root entity will respond with the EID of the entity responsible for the "/uk" namespace, which you then query with:
carbon-tree-child(</uk EID>,"gloucestershire",?CHILD)
...which responds with the EID of the entity responsible for "/uk/gloucestershire", which you then query with:
carbon-tree-child(</uk/gloucestershire EID>,"blagrave",?CHILD)
...which then replies with ?CHILD bound to the EID or UUID associated with
/uk/gloucestershire/blagrave.
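The resolution walk above can be sketched as a simple loop. This is purely illustrative: `resolve_symbol`, the `query` callback, and the EID strings are all hypothetical stand-ins for MERCURY machinery that doesn't exist yet.

```python
ROOT_EID = "eid:root"  # hardcoded namespace root EID from local configuration

def resolve_symbol(path, query):
    """Resolve a symbol like '/uk/gloucestershire/blagrave' by walking
    carbon-tree-child statements down from the namespace root.

    `query(eid, name)` stands in for sending the MERCURY query
    carbon-tree-child(<eid>, name, ?CHILD); it returns the bound ?CHILD
    (an EID or UUID) or None if the chain runs dry.

    Returns (id, True) on full resolution, or (last_entity, False) if the
    chain stops early - in which case the last entity reached may still
    answer questions about the unresolved symbol.
    """
    current = ROOT_EID
    for component in path.strip("/").split("/"):
        child = query(current, component)
        if child is None:
            return current, False  # chain ran dry; keep the last entity
        current = child
    return current, True
```

Note that the early-stop case directly models the point made later: a name that doesn't resolve all the way still leaves you with an entity you can ask about it.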
Much like the inverse functional properties of the Semantic Web, the rules of inference in CARBON state that any relations about a symbol are also relations about any EID or UUID that symbol resolves to. So, apart from cases where a symbolic name is not known for something, or as part of the mechanism used to resolve symbols into things, you generally just use symbols as IDs in statements rather than EIDs or UUIDs.
Note that, unlike with DNS where you need to run a DNS server to publish DNS
records, every entity can take part in the CARBON symbol system. If I have an
entity that has a symbol name, I can make that entity publish arbitrary
carbon-tree-child statements creating further names "under" that entity's
name. If /uk/gloucestershire/blagrave resolved to an EID, then that entity
could publish statements mapping /uk/gloucestershire/blagrave/quarry to any
EID or UUID, or making any other statements about that symbol, for instance.
An entity can create arbitrarily many EIDs if it needs to. It can either create new entities that get their own EIDs, which consumes storage space on the system hosting the entities; or it can dynamically create "personas" of itself, by adding extra metadata to its own EID, in a manner similar to adding query-string parameters to an HTTP URL. Whatever data it shoves in the "persona" of an EID it publishes somewhere gets sent back to it when that EID is used, so an entity could choose to create an illusion of infinitely many separate entities by issuing EIDs with different data in the persona field, and then using that data to present arbitrary different behaviour when they're accessed via MERCURY, including publishing different CARBON data. So a single actual entity, running on a server somewhere, can create the illusion of a nigh-infinite number of entities with their own symbolic names and behaviour. This could be used to create a "gateway" into some other information system, such as a database.
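The persona mechanism might be sketched like this, assuming a purely illustrative representation of EIDs as dictionaries - nothing here is a real MERCURY API:

```python
def make_persona(base_eid, persona_data):
    """Mint a variant of an entity's own EID carrying opaque persona data,
    much like appending a query string to an HTTP URL."""
    return {"eid": base_eid, "persona": persona_data}

def handle_request(eid_used, handlers):
    """Dispatch an incoming MERCURY request using whatever persona data the
    EID the caller used happens to carry, letting one real entity present
    arbitrarily many apparent entities with different behaviour."""
    persona = eid_used.get("persona")
    return handlers.get(persona, handlers[None])()
```

The key property is that the persona data is round-tripped through the caller: the entity hands it out inside an EID, and gets it back untouched whenever that EID is used, at no storage cost to itself.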
Additionally, a name doesn't have to resolve to an EID or UUID. The process of
name resolution might stop early due to the chain of carbon-tree-child
statements running dry - so you don't find an EID or a UUID but you can still
ask questions about that name of the last entity in the chain. For instance,
/uk/gloucestershire/blagrave/quarry might only resolve as far as the
/uk/gloucestershire/blagrave entity, but that entity could still provide
answers to questions about /uk/gloucestershire/blagrave/quarry.
Querying
So, ARGON software runs within the context of some entity somewhere, having access to that entity's CARBON data stored in TUNGSTEN, and able to communicate over the network using MERCURY to ask other entities for whatever CARBON data they'd like to publish to it in one big blob, or to send them specific CARBON queries and get back the results of those queries. And it might have transient sets of CARBON data it creates itself, or obtains via some other means, that it wants to query, but doesn't need to store persistently in TUNGSTEN.
How does one provide a framework for software to use those capabilities to do useful things?
For a start, there's clearly scope for some abstraction. The TUNGSTEN storage, an entity sent queries, and a bunch of CARBON data held in memory are all relation-based databases that can be queried, despite having vastly different implementations. So let there be a common "CARBON database" interface that they all provide, certainly allowing for querying and (for non-read-only implementations) also inserting new statements; statements that contradict existing ones override them, thereby providing the ability to arbitrarily modify a database if you have write access to it.
It would also be useful to be able to combine data from multiple sources. One might combine CARBON data published by an entity with the interface to send it queries (the latter having been given any metadata from the former about which queries are worth sending), to create a single queryable database combining whatever the entity publishes directly and anything it can only be asked via queries. If looking to buy a product, one might combine data from an e-commerce system's entity representing the product, published by the seller, with third-party data from a review database hosted by another entity; and one might choose to trust the latter over the former where they contradict. And one might also wish to augment this raw data with a set of CARBON rules that combine data from the different sources and transform it into a useful form, so your software can just send in queries involving those rules and have the CARBON engine perform logical inference over the combined data for you - by including a raw list of CARBON statements as another data source that's the most trusted of all, or referencing those statements from your own TUNGSTEN storage.
This implies there should be a way to create a "CARBON database" that has a list of other CARBON databases inside it, with a trust ordering between them; it sends queries to all its component databases and combines the result, using the trust order to resolve any conflicts.
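A minimal sketch of that common interface and the combining wrapper might look like this. Every name here is hypothetical, and statements are modelled as simple (relation, subject, value) tuples for illustration:

```python
class CarbonDatabase:
    """Anything queryable through the common interface: TUNGSTEN storage,
    a remote entity's query endpoint, or CARBON data held in memory."""
    def __init__(self, statements=()):
        self.statements = list(statements)

    def query(self, relation):
        """Return all statements of the given relation."""
        return [s for s in self.statements if s[0] == relation]

    def insert(self, statement):
        self.statements.append(statement)

class CombinedDatabase:
    """A database made of other databases, listed most-trusted first.
    Where two components contradict (same relation, same subject), the
    answer from the more trusted component wins."""
    def __init__(self, *components):
        self.components = components  # trust order: most trusted first

    def query(self, relation):
        seen, results = set(), []
        for db in self.components:
            for stmt in db.query(relation):
                subject = stmt[1]
                if subject not in seen:  # a more trusted db already answered
                    seen.add(subject)
                    results.append(stmt)
        return results
```

So the product-review example would be `CombinedDatabase(reviews, seller)`: the seller's claims fill in anything the review database is silent about, but lose any direct contradiction.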
I think that the CARBON data storage in TUNGSTEN should be divided into compartments. I'm not sure yet how to organise them, so provisionally, they'll just be named with a string. This enables the entity to keep different CARBON databases separated, and lets it choose which parts of its data to consult in different situations. This makes it safe to store information obtained from not-entirely-trusted third parties, without potentially polluting other data within the entity; and it also allows the entity to manage the data it publishes, perhaps by having a compartment for its published data and another compartment used only for answering specific queries (which will, of course, also consult the published data compartment). Of course, the entity is free to dynamically compute the results of requests for its published data or the answers to queries from scratch every time, but if it's just publishing the same stuff, it might as well just put it in TUNGSTEN and set up configuration metadata instructing MERCURY to answer CARBON requests direct from the TUNGSTEN compartments, without even running any entity code, thereby reducing latency and performance costs.
CARBON documents
I've mentioned that one can query an entity, given its EID, for its published data, and it returns a bunch of CARBON data. This bunch is in the form of a "CARBON document", a container for a set of CARBON statements. As well as the statements, it has various optional bits of metadata:
- Provenance information - the name or EID of the entity that provided this information. When MERCURY is asked to fetch an entity's published data, the result will be rejected if the provenance entity isn't the entity we're trying to query.
- The ID of an object considered "primary"; I'll explain this in a moment.
- Cryptographic signatures of the entire document, including all its metadata apart from the cryptographic signatures themselves. For an entity's published data, it should be signed by that entity.
The provenance and signature sections clearly help to provide attestation of the source of data, and ways for its source and other entities to vouch for it. The "primary object ID", however, is there to support use cases such as sending messages over MERCURY.
I've explained how MERCURY lets you query data from entities, but it also lets you send information, and one of the most important interfaces for that is "Object Push", which is intended to encompass all the cases where you just send something to an entity. You can push an object to an entity representing a printer and, subject to access control rules, it will attempt to print it for you. You can push an object to an entity representing a person, and it will treat it like an incoming email message to them, putting it in a queue for them to look at. But the "object" sent in an object push request is a CARBON document, and the primary object inside it is the object that is being sent. For instance, a document to be printed might be a document object with one or more page objects, each of which contains paragraph objects and image objects and so on - so the CARBON document containing it will mention a big pile of objects. Having the "primary object ID" pointing at the document object means the printer (or the document viewer app, if it's sent to a person) knows where to start, finding all the page and other objects by following links from that primary object.
Because this is part of the metadata of the document, it'll be cryptographically signed along with the document, so an attacker can't meddle with a message by changing its primary object ID.
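That container might be sketched as follows. The canonical serialisation and the actual signing scheme are unspecified in this proposal, so a plain hash stands in for a real cryptographic signature here; every name is illustrative.

```python
import hashlib

class CarbonDocument:
    """A container for CARBON statements plus optional metadata:
    provenance, a primary object ID, and a signature covering everything
    except the signature field itself."""
    def __init__(self, statements, provenance=None, primary=None):
        self.statements = list(statements)
        self.provenance = provenance  # name or EID of the providing entity
        self.primary = primary        # ID of the "primary" object, if any
        self.signature = None

    def _digest(self):
        # All metadata except the signature itself is covered, so tampering
        # with the primary object ID invalidates the signature.
        blob = repr((sorted(self.statements), self.provenance, self.primary))
        return hashlib.sha256(blob.encode()).hexdigest()

    def sign(self):
        self.signature = self._digest()

    def verify(self):
        return self.signature == self._digest()
```

Because the primary object ID sits inside the signed region, redirecting a pushed message to a different object inside the same document breaks verification.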
Untrusted data sources
I propose that CARBON rules should be interpreted more like Datalog than Prolog, and CARBON rules certainly cannot have side-effects like in Prolog, or access resources other than other information in the CARBON database being queried. So it's safe to execute rules you've received from untrusted sources, subject to the Datalog restrictions that prohibit infinite loops.
But in trusted cases, you might wish to use rules that could potentially infinitely loop, or explicitly control the order of execution for performance reasons. So I propose that any CARBON database be assigned a trust level. The default is Datalog-like (though, unlike original Datalog, allowing rules in the database), but it can be set higher to remove the Datalog restrictions; and at lower levels it might only allow the database to export statements that aren't rules, or even restrict it to statements with no variables, or even filter it down to only statements involving a certain relation, or some other arbitrary constraint.
When combining multiple databases into one, each component database can have its own trust level - as well as the trust priority order used to resolve conflicts between component databases. So one might restrict external data sources harshly, while giving one's own internal rules unlimited power.
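One way to picture the trust levels is as progressively stricter filters on what a database may export. The representation below is illustrative only: a statement is a dict, one with a "body" is a rule, and arguments starting with "?" are variables.

```python
# Trust levels, least to most trusted:
#   GROUND  - only ground facts (no variables, no rules)
#   FACTS   - statements but no rules
#   DATALOG - rules allowed, Datalog termination restrictions apply (default)
#   FULL    - unrestricted, Prolog-like power
GROUND, FACTS, DATALOG, FULL = range(4)

def has_variables(stmt):
    return any(str(arg).startswith("?") for arg in stmt["args"])

def allowed(stmt, level):
    """Decide whether a database at this trust level may export stmt."""
    if level >= FULL:
        return True                 # no restrictions at all
    if "body" in stmt:              # it's a rule
        return level >= DATALOG
    if has_variables(stmt):
        return level >= FACTS
    return True                     # ground facts always pass

def export(statements, level):
    return [s for s in statements if allowed(s, level)]
```

An untrusted third-party feed might be clamped to GROUND or FACTS, while your own TUNGSTEN compartments run at DATALOG or FULL.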
Gateways
Not all your persistent data should, or indeed can, actually be stored in CARBON databases in TUNGSTEN. Perhaps it's already in some external system you can access and it would be wasteful to make a copy. Perhaps it's so vast it has to be stored in some bespoke custom format, and the required parts extracted from it using custom algorithms when specific queries come in. Perhaps you're trying to express something like the rules of arithmetic, and typing out all the statements of the form:
+(1,1,2) /// 1+1=2
+(1,2,3) /// 1+2=3
+(1,3,4) /// 1+3=4
...
...is getting laborious.
It would seem useful to be able to augment a CARBON database with "gateway rules", rules whose bodies are actually callbacks into a "real programming language" that does something to answer the query.
For instance, we might implement our addition relation by adding some gateway
rules to our knowledge base. We would need to provide multiple gateways - one
that says if we're asked for +(?A,?B,?C) and we already know ?A and ?B we
can obtain ?C from ?A + ?B; one that says if we know ?A and ?C but
need ?B we can work it out from ?C - ?A; and one that says we can work out
?A from ?B and ?C via ?C - ?B. This isn't a complete set, but it doesn't
have to be; if the user queries the system with:
+(1, ?X, ?Y)
...it probably would be more helpful for the query to terminate with an error
than attempt to give an infinite result by exhaustively listing all numbers for
?X and reporting one plus each of them for ?Y.
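These addition gateways can be sketched as host-language callbacks keyed by which arguments are bound - a "mode", in logic-programming terms. The registration API here is invented for illustration:

```python
def add_gateways():
    """Gateway callbacks for +(A, B, C), i.e. A + B = C, indexed by mode.
    A mode is a tuple of booleans: True where that argument is bound.
    Deliberately incomplete: modes with two unknowns are absent, so such
    queries fail rather than enumerate an infinite answer set."""
    return {
        (True, True, False): lambda a, b, c: a + b,  # C from A + B
        (True, False, True): lambda a, b, c: c - a,  # B from C - A
        (False, True, True): lambda a, b, c: c - b,  # A from C - B
    }

def query_plus(a=None, b=None, c=None):
    """Answer a +(?A,?B,?C) query via whichever gateway matches the mode."""
    mode = (a is not None, b is not None, c is not None)
    if mode == (True, True, True):
        return a + b == c  # fully ground: just check the statement holds
    gateways = add_gateways()
    if mode not in gateways:
        raise ValueError("query would have infinitely many answers")
    return gateways[mode](a, b, c)
```

So `query_plus(a=1, b=2)` computes 3, while `query_plus(a=1)` - the +(1, ?X, ?Y) case from above - raises an error instead of enumerating all numbers.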
There shouldn't be a way to write these gateway rules in CARBON, in a way that can be sent across the network via MERCURY, for instance; they need to be added to a CARBON database from the code in the programming language that's using that database to do queries, because they need to contain references to functions in that programming language. Function pointers, if you're familiar with C; closures, if you're from a functional programming world. They can't be made by embedding source code in CARBON statements or anything like that.
However, you can still have shared libraries - in your programming language, not published like other CARBON data - that provide CARBON databases full of useful gateways. A library might provide CARBON arithmetic, for instance, implementing all the useful mathematical operations as relations between numbers.
But a second form of gateway CAN be provided via MERCURY. An entity providing an
interface to a real-time temperature sensor, or a gateway to some third-party
database, might state in its published data that it can be queried for
statements of the form temperature-at(?LOCATION, ?TEMPERATURE), returning a
temperature if it has a sensor at that location; or hostname-ip(?HOSTNAME,
?IP), returning one or more IP addresses if a valid hostname is provided. Querying
that will result in the query being sent to the entity via MERCURY, where it is
welcome to execute arbitrary code to find the answer, and send it back.
Software distribution
I mentioned that entities contain, within them (stored of course as a CARBON database in a particular TUNGSTEN compartment), some kind of information about how to handle incoming MERCURY requests by invoking software inside the entity.
Of course, it might not need to invoke any software - published data requests and CARBON queries can be satisfied directly from specified TUNGSTEN compartments. But in the general case, some software needs to run to figure out how to react to an incoming request from some remote entity, via MERCURY.
So, in the configuration compartment of the TUNGSTEN state of every entity is a declaration of its MERCURY endpoints and what code handles them. Of course, this is all in CARBON. The references to functions that handle events are the names of software library packages, and the names of functions exported by them. The name of a software package is a symbol, and that symbol can be looked up in the usual manner to find the entity publishing information about that symbol - which will state that it's a software library package, with all the metadata about versions and documentation and so on, all as published CARBON data; and the source code itself is also encoded in CARBON. After all, a library package in normal programming languages is a bunch of declarations: imports from other packages, functions and types and constants and classes and whatnot being declared, and some of those declarations are marked for export so other libraries can import them. Each of those things can be easily done with CARBON statements - but it stops there; the bodies of declarations are just plain source code!
Many languages have an "annotation" system that lets you attach extra information to declarations. But if we store source code declarations as CARBON databases, we can do that for free by just providing extra statements of other relations.
So a software library package might look something like:
import(/some/other/library/to/import)
defun((add x y), (+ x y))
doc(add, "Adds two numbers together and returns the result")
export(add)
I've used a shorthand here, whereby symbols written just as a bare name rather
than in /full/path/form are considered to be named "within" the entity
publishing data about them. Remember that a CARBON document includes the
provenance field; foo in a document with a provenance entity name of /a/b/c
refers to /a/b/c/foo.
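That shorthand is simple enough to state as code - a bare name is qualified against the provenance of its document. The function name is, as ever, made up for illustration:

```python
def qualify(name, provenance):
    """Expand a bare symbol (no leading '/') relative to the provenance
    entity name of the CARBON document it appears in; full paths pass
    through unchanged."""
    if name.startswith("/"):
        return name  # already a full /path/form symbol
    return provenance.rstrip("/") + "/" + name
```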
How is this different to the Semantic Web?
Clearly, the "CARBON via MERCURY" side of this aims to fill a similar niche to the Semantic Web: allowing publishing of relation-based data over the Internet. Let's look at the differences.
For a start, I'm defining the meaning of IDs much more tightly. If you publish a relation at a given symbolic name, then that symbolic name MUST point to an entity that publishes the definition and documentation for that relation. In general, if you have the name of anything, you can resolve that name to find an entity that will tell you what the creator of that name has to say about it.
Secondly, I've actually defined the process of querying, and how data from different sources might be combined with controllable levels of trust.
Thirdly, due to having defined a query model as part of the publication process, I've made it possible to publish rules.
But most importantly, CARBON is the means of publishing data in ARGON - not an optional adjunct. The Semantic Web was a new thing to be added to existing systems, alongside HTML interfaces and downloaded CSV files and so on; one option among many, and a new one with a limited installed base. So, in an ARGON system, the extra power of relation-based data modelling comes "for free" whenever data is published.
However, ARGON as a whole shares with the Semantic Web the fact that it's a whole new thing starting with zero installed base, and I don't even have the fame of Tim Berners-Lee and the clout of the W3C to help market it. So in the event that I get time and money to implement all this, I'm still not promising
