One Size Fits All

Another approach to providing a one-size-fits-all solution without making nasty compromises is to offer multiple different mechanisms - but to integrate them neatly.

For example, a big challenge in ARGON, which I have yet to completely resolve, is efficient, reliable, distributed storage within clusters. Each entity has an internal state, which will be accessed via an API of my choice. When a request comes in to a node, it will examine the request to see which entity should handle it, then load the appropriate handler code from that entity's state and execute it, with access to the contents of the request and the entity's state, via the API.
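To make that flow concrete, here's a rough sketch in Python. Everything in it - the Entity class, handle_request, the dictionary-shaped requests - is invented for illustration; it's not the actual ARGON API, just the shape of the dispatch loop described above.

```python
# Hypothetical sketch of the per-node dispatch loop; none of these
# names are the real ARGON API.

class Entity:
    """An entity: internal state, plus handler code kept in that state."""
    def __init__(self, state, handlers):
        self.state = state        # the entity's internal state
        self.handlers = handlers  # handler code, stored as part of the state

def handle_request(entities, request):
    # 1. Examine the request to see which entity should handle it.
    entity = entities[request["entity_id"]]
    # 2. Load the appropriate handler code from that entity's state.
    handler = entity.handlers[request["operation"]]
    # 3. Execute it, with access to the request and the entity's state.
    return handler(request, entity.state)

# Toy usage: a counter entity with a single handler.
def increment(request, state):
    state["count"] += request.get("by", 1)
    return state["count"]

counter = Entity(state={"count": 0}, handlers={"increment": increment})
print(handle_request({"counter": counter},
                     {"entity_id": "counter", "operation": "increment"}))  # 1
```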

I need to design a data model, an API, and a way of implementing this API.

There are a number of distributed data storage models already available; however, they almost always try too hard to hide the fact that the data is distributed. They add a lot of overhead to make sure every data item is replicated and that every client sees a consistent ordering of events.

And so you will hear people saying that certain bits of data don't need to be in the distributed database. Data that is read very frequently, but for which a little outdatedness is of no import, ends up on a master/slave replication system instead. Caches end up local to each node, using the local filesystem. And so each application may end up talking to a transactionally accurate, transparent, distributed database for critical data like account balances, a master/slave lazily replicated database for system configuration and other mainly-constant data, a local file system for caches, and perhaps a specialist distributed transactional queue manager or two, in order to store different types of data in appropriate ways.

And each of these storage managers is a separate daemon (or, more often, a set of daemons) to configure, monitor, and run, with a separate API for the application, and often a very different data model.

In this case, I am quite sure there is no general model that encompasses them all. I know that different pieces of data, even within the same entity, will need to be distributed in very different ways. So instead I am neatly integrating several different models into a common system.

After all, almost any distributed data storage system will have some core essentials, such as a way of specifying a list of nodes, a way for the nodes to communicate with each other, basic session management protocols for clients to connect and authenticate and request stuff, underlying physical storage management, monitoring of the other nodes to detect failures or network partitions, and so on. They might as well all share the same code for this.
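As a sketch of what that shared core might look like (the names ClusterCore and StorageManager are mine, invented for this example, and the method bodies are deliberately stubbed out):

```python
# A hypothetical shared substrate; the point is that every storage
# manager builds on the same cluster plumbing instead of its own.

class ClusterCore:
    """The essentials every distributed storage system needs anyway."""
    def __init__(self, nodes):
        self.nodes = nodes  # the configured list of nodes

    def send(self, node, message):
        """Inter-node communication."""
        raise NotImplementedError

    def open_session(self, credentials):
        """Client session management: connect, authenticate, request."""
        raise NotImplementedError

    def alive_nodes(self):
        """Monitor the other nodes to detect failures or partitions."""
        return [n for n in self.nodes if self._ping(n)]

    def _ping(self, node):
        raise NotImplementedError

class StorageManager:
    """Base class shared by every storage manager in the system."""
    def __init__(self, core):
        self.core = core  # queues, replicated stores, and caches all reuse this
```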

And they might as well share a common data model. Standardise on YAML or XDR or something else for cached objects, transparently replicated objects, master/slave replicated objects, and queue entries. Then it becomes a matter of creating 'containers' within each entity, each of which contains a number of objects under a specified storage management system, with its appropriate API. A shared infrastructure can manage the set of containers within the entity, perhaps giving them names.

Then a queue container can provide PUSH, POP, PEEK, and GET QUEUE SIZE operations, while a transparently or master/slave replicated object store can provide operations to get an object given an ID, insert a new object (getting its ID returned), update an object, or maybe find a list of object IDs that match some search specification (perhaps by consulting indices). The master/slave replication could also provide an extra operation, to get a guaranteed up-to-date copy of an object given an ID by going straight to the master server rather than a possibly outdated replica; and the cache storage manager can provide a similar API to a transparently replicated data store, except with the added caveats that objects written on one node may not be visible from another node, and that objects may randomly disappear anyway when space is needed for other things.
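The resulting family of container APIs might look something like this sketch (again, invented names, not a real implementation):

```python
# Hedged sketch of the per-container APIs described above.
from abc import ABC, abstractmethod

class Container(ABC):
    """A container inside an entity, backed by some storage manager."""

class QueueContainer(Container):
    @abstractmethod
    def push(self, obj): ...
    @abstractmethod
    def pop(self): ...
    @abstractmethod
    def peek(self): ...
    @abstractmethod
    def queue_size(self): ...

class ObjectStoreContainer(Container):
    """Transparently replicated or master/slave replicated object store."""
    @abstractmethod
    def get(self, object_id): ...
    @abstractmethod
    def insert(self, obj): ...           # returns the new object's ID
    @abstractmethod
    def update(self, object_id, obj): ...
    @abstractmethod
    def find(self, query): ...           # object IDs matching a search spec

class MasterSlaveContainer(ObjectStoreContainer):
    @abstractmethod
    def get_fresh(self, object_id): ...  # bypass replicas, ask the master

class CacheContainer(ObjectStoreContainer):
    """Same API, weaker guarantees: writes may not be visible from other
    nodes, and objects may be evicted whenever space is needed."""
```

Note the shape: the cache offers the same operations as the replicated store, so code written against the common API keeps working; only the guarantees differ.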

In practice, I'm going more for an object-oriented deductive database approach than the simple object store implied in the last paragraph, but that's by the by. What I'm still struggling with is the definition of the containers. My best idea so far is to subdivide the data within an entity into a named hierarchy; each node of the hierarchy will have a specified storage manager, plus access-control information for external access to the entity's data. Queries at any given node will, unless specified otherwise, also include data stored in children of that node, so queries at the root will cover the entire entity's state. Also, each node will have configuration parameters for its storage manager; a transparent replication manager might be tunable by giving an expected read/write ratio, which can be used to compute quorum sizes appropriately.
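A sketch of that hierarchy, under the same caveats (invented names, behaviour only roughly indicated):

```python
# Hypothetical model of an entity's named data hierarchy.

class DataNode:
    """One node in the hierarchy of containers within an entity."""
    def __init__(self, name, manager, config=None, acl=None):
        self.name = name
        self.manager = manager      # the storage manager for this subtree
        self.config = config or {}  # e.g. {"read_write_ratio": 100}
        self.acl = acl              # access control for external access
        self.children = {}          # child name -> DataNode

    def add_child(self, node):
        self.children[node.name] = node
        return node

    def query(self, predicate, recursive=True):
        # Unless specified otherwise, a query also covers this node's
        # children; so a query at the root covers the entire entity.
        results = list(self.manager.find(predicate))
        if recursive:
            for child in self.children.values():
                results.extend(child.query(predicate))
        return results
```

The read/write ratio in the config is the kind of per-node tuning knob I have in mind: a quorum-based replication manager could derive its read and write quorum sizes from it.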

The key here, when providing lots of parallel mechanisms that offer different services, is to share as much functionality as possible; this reduces implementation effort, administrative effort, and effort for the programmer using your now-simpler APIs, compared to providing the different mechanisms as completely isolated products.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales