Modelling data with relations (by alaric)
So, I've been thinking about data. I'm going to explain my thoughts with enough background so that non-programmers should be able to follow along, but don't worry - I won't be going on huge tangents with basic tutorials, so if you already know everything about pointers in C and SQL databases, you can just skim over the early paragraphs!
Managing data is complicated
Data is one of the hard parts of programming. Generally, the way data is handled in computer systems is that data sits somewhere - maybe on persistent media such as disks that will survive reboots, maybe somewhere more transient - until some event causes the computer to decide to do something with it; perhaps a user clicks a button. And then a program or subprogram or some other unit of software will run, reading data from the source of the event (eg, the search terms a user is asking for) and will go looking at the data it has access to, and do some processing on it. Maybe this will involve changing the data, or maybe it will involve storing new data, and maybe it will involve sending some data back to the source of the event.
To give a concrete example: You book a ticket on a cinema's web site. When you press "book now", the web site runs a bit of software that checks in a database that the seat you want is still free, and if it is, and the payment goes through, updates the database to mark that seat as reserved and creates a record of your booking, so that when you show the QR code to the machine at the cinema, the correct ticket for your seat will come out. It then replies to tell you either that the booking worked, or that it didn't and you need to pick another seat.
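The booking flow above can be sketched in a few lines. This is a minimal illustration using SQLite, with invented table and column names; a real cinema system would also handle payments, timeouts, and concurrent bookings:

```python
import sqlite3

# A toy seats table: NULL in booked_by means the seat is still free.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seats (seat_id TEXT PRIMARY KEY, booked_by TEXT)")
db.execute("INSERT INTO seats VALUES ('A1', NULL), ('A2', NULL)")

def book_seat(db, seat_id, customer):
    """Reserve a seat if it is still free; return True on success."""
    with db:  # the check and the update run in one transaction
        row = db.execute(
            "SELECT booked_by FROM seats WHERE seat_id = ?", (seat_id,)
        ).fetchone()
        if row is None or row[0] is not None:
            return False  # unknown seat, or already taken
        db.execute(
            "UPDATE seats SET booked_by = ? WHERE seat_id = ?",
            (customer, seat_id),
        )
        return True

print(book_seat(db, "A1", "alice"))  # True: the seat was free
print(book_seat(db, "A1", "bob"))    # False: already reserved
```

Note that the "is it free?" check and the "mark it reserved" update happen inside one transaction; the next section's list of problems shows why that matters.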
(I'm glossing over other things software might do, like invoking other bits of software - creating new events to wake them up, and waiting for responses - or talking to other attached devices such as printers or 3D printer drive motors or something, because I'm focusing on data here).
This is a simple summary of how data is handled, but there's a tremendous amount of complicated details required to make all this stuff work properly. To name a few issues:
- Multiple bits of software fiddling with the same data can cause a whole lot of trouble, especially if some of those things are changing the data at the same time as other bits are looking at it, or (even worse) changing it themselves.
- Data might not be represented correctly - a name stored where a date of birth is expected, or a date written in Y/M/D form instead of D/M/Y, for instance. But the real question is: how did it get this way? Was there a bug in some software that wrote data wrongly? Was data from an external source put in without checking?
- The place data is stored might fail. Hard disks, SSDs, and indeed any bit of a computer will randomly stop working sometimes; and then the data stored in them is lost. You can overcome that by having a way to recreate that data from scratch (perhaps it's just something you downloaded, so it can be downloaded again), or periodically storing a copy somewhere you can copy it back from (in which case you might lose changes that were made since the last copy was taken), or keeping one or more "replicas": copies that you constantly update with changes as they happen, so if any one fails you can just keep using the others. But then you run into interesting issues, such as what happens if a replica is temporarily isolated from the others and then returns later, or if the system is split into two halves by a communications failure, with some replicas on each side, and activity still continues in both halves.
- The place you store data might not be able to handle the sheer bulk of it. You can get multi-terabyte hard disks these days, but what if your data is hundreds of terabytes? You need ways to split it up but still be able to access it all. If you split the data onto different disks, you need a way to know what bits are on what disks.
- The place you store data might not be able to handle the number of different things trying to access it at once. The latest posts of a widely-followed social media account might be being checked by millions of people worldwide, so thousands of phones could be trying to fetch their latest posts at any point in time. The "like" counts on their posts might be being updated hundreds of times a second as people smash that like button. Whatever means you use to store that data will need to be able to keep responding to those requests snappily!
- The place you store data might struggle with the sheer amount of data you need to look at in one operation. For instance, we might have a massive database of all the phonecalls that have been handled by a phone company, going back years, and we want to know the total number of minutes of phonecall we've handled. To do that, we need to look at every single record. To get the answer as quickly as possible, we might want to use a whole datacentre full of computers and get each to add up the total for a fraction of the database, and then combine all the sums together at the end. This is a complex process to manage!
- On the other hand, a lot of the time we only need to handle a tiny fraction of the data. Perhaps we need to look up the details of a single phonecall, to a particular person at a particular date and time. How do we quickly find that one entry in the database? If we have to look through the whole lot to find it, that just won't work in practice - as the database gets bigger, it'll get slower, and beyond a certain scale it just won't be fast enough to use.
- You might want to change what data you store - perhaps you want to start logging some new information, or your service offers a new feature so you need to store settings for it per user; or perhaps you want to rearrange how your data is stored to help solve one of the above problems. It's easy to just start storing new data differently, but what about all the existing data? Do you need to go and re-write it to fit the new layout, or do you need a way to support different records of the same "kind" having different structures, to support old and new data alongside? And what should happen if your software asks for the settings for this new feature, from an old user account created before that feature was added? What happens if your software tries to update the settings for that user?
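To make one of these problems concrete - the "finding one entry" problem - here is a toy sketch with invented data, contrasting a full scan with a simple index:

```python
# A toy phone-company database: 100,000 invented call records.
calls = [
    {"id": i, "callee": f"+44{i:09d}", "minutes": i % 60}
    for i in range(100_000)
]

# Without an index: examine every record until we find a match.
# Work grows with the size of the database.
def find_by_scan(calls, callee):
    for call in calls:
        if call["callee"] == callee:
            return call
    return None

# With an index: build a lookup table once, then answer each
# query in a single step, however big the database gets.
index = {call["callee"]: call for call in calls}

target = "+44000000007"
assert find_by_scan(calls, target) is index[target]
```

Real databases use structures like B-trees rather than an in-memory dictionary, but the trade-off is the same: a scan slows down as the data grows, while an indexed lookup stays fast.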
I'm not going to go into detail about how all those interesting problems can be solved; my point here is merely to demonstrate that handling the storage of data can be surprisingly complicated. I've largely built my career around solving these kinds of problems, working for several database and data management companies, and currently working for a data consultancy; which just goes to show how much work there is to be done in the area of data management.
Schemas, and physical and logical structure
Any given system for managing data will allow - and often, but not always, require - the programmer setting it up to specify the logical structure of the data it's going to store; this specification is often called a "schema". By "logical" structure, I mean a description of the shape of the data in some relatively abstract form, such as "There are 'person' records, and each of them contains a name (which is text) and a date of birth (which is a date)", rather than specifying exactly how that is represented as bytes of data in memory, or on disk, or in a message sent over a network.
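That 'person' example might be declared like this in SQL (shown here via SQLite; the exact dialect is just for illustration). We say *what* a record contains, and leave the bytes to the database:

```python
import sqlite3

# The logical schema: a person has a name (text) and a date of birth.
# SQLite has no dedicated date type, so the date is stored as ISO text.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE person (
        name          TEXT NOT NULL,
        date_of_birth TEXT NOT NULL
    )
""")
db.execute("INSERT INTO person VALUES ('Ada Lovelace', '1815-12-10')")
row = db.execute("SELECT name, date_of_birth FROM person").fetchone()
print(row)  # ('Ada Lovelace', '1815-12-10')
```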
The latter is the "physical" structure, and the choice of that is generally left to the data management system itself. The physical structure will often be very complicated to describe, involving concepts such as B-trees, copy-on-write data, log-structured merge trees, Bloom filters, [data compression](https://en.wikipedia.org/wiki/Data_compression), delta-encoded columnar storage, replication, error-correcting codes, Lamport timestamps, [vector clocks](https://en.wikipedia.org/wiki/Vector_clock), distributed consensus protocols, variable-length integer encodings, and so on; the design of physical structures for data is a large and deep section of computer science, and where many of the exciting problems in data management are addressed. Specialists in these topics spend a lot of time designing data management software so that users can feed in a relatively easily understood logical schema and have the software generate a complicated physical structure full of wizardry, without needing to worry about any of that.
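As a tiny taste of that physical-structure machinery, here is the last item on that list - a variable-length integer encoding, sketched in the LEB128 style used by formats such as Protocol Buffers. Small numbers take one byte on disk; bigger ones take more:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Reassemble the 7-bit groups back into an integer."""
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
    return n

assert len(encode_varint(7)) == 1    # a small value fits in one byte
assert len(encode_varint(300)) == 2  # a larger value needs two
assert decode_varint(encode_varint(300)) == 300
```

This is exactly the kind of detail a logical schema lets you not think about: you declare "an integer", and the storage engine decides how many bytes each one actually occupies.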
Formally defining a schema by configuring your database with it is often a good thing, as giving the computer more information about what you're about to expect of it enables it to do a better job; and it also allows it to enforce the schema by checking that data inserted into the database matches it, thereby simplifying processing data from the database, because data consumers can safely assume that all the data fits their expectations. But even if you don't express a schema in a way the computer can understand and just write it down as documentation, or even if you don't do that, the schema will still exist in the form of unwritten expectations; although some data management software describes itself as "schemaless", that just means it doesn't require you to provide one. The only truly schemaless database is one where nobody knows what's inside it and how it's represented, which isn't very useful.
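Schema enforcement at write time can be surprisingly simple. Here's a minimal hand-rolled sketch (an invented helper, not a real library) of what an enforcing database does for you:

```python
from datetime import date

# The schema: every person record must have exactly these fields,
# with these types. Anything else is rejected at insert time.
PERSON_SCHEMA = {"name": str, "date_of_birth": date}

def insert_person(table: list, record: dict) -> None:
    if set(record) != set(PERSON_SCHEMA):
        raise ValueError(f"wrong fields: {sorted(record)}")
    for field, expected in PERSON_SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)

people = []
insert_person(people, {"name": "Ada", "date_of_birth": date(1815, 12, 10)})
try:
    # A name-like string stored where a date of birth is expected -
    # caught here, not discovered later by a confused consumer.
    insert_person(people, {"name": "Bob", "date_of_birth": "yesterday"})
except ValueError as e:
    print(e)  # date_of_birth must be date
```

Because bad data is rejected at the door, every reader of `people` can safely assume well-formed records - which is exactly the simplification for data consumers described above.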
Regardless of whether it's written down or not, the schema for a given data store will be influenced by the tools that data store provides to structure and search data. The crux of the thesis I will set out in this article is that the way we traditionally define logical schemas isn't great; a second model, with many strengths compared to the traditional one, has existed in the shadows for years, but through accidents of history the first became mainstream and, thus, what everyone thinks of when they reach for a data model. I'm going to explain both models, then try to explore the fundamental differences between them!
