Cloud Storage

Currently, you can go to various providers and buy online storage capacity (IMHO, rsync.net is the best; I did the research when finding an offsite backup host for work). It's more expensive than a hard disk in your computer, and miles slower, but it has one brilliant advantage: it's remote. So it's perfect for backups.

And that's the heart of a free market - storage is cheap to the cloud providers (they just buy disks, and in bulk at that), but their storage has more value to you than your own storage because of its remoteness. So they can rent it to you at a markup, and you get a benefit, and everyone is happy. Money flows, the economy grows, and one day we'll get to have affordable space tourism et cetera.

But large, centralised, cloud storage providers are attractive targets for people who want to steal data. They become centralised points of failure; if they go bankrupt, lots of people lose their backups. Therefore, it's smart to do your backups to more than one of them, just in case. But that means setting up your systems to talk to each one's interfaces, arranging payment and agreeing to terms and conditions with them all individually, and so on.

Surely this state of affairs can be improved? With ADVANCED TECHNOLOGY?

Well, I think it can, and here's how. Imagine a marketplace for cloud storage. This might be a centralised trading server, or it might be a peer-to-peer protocol... greater minds than I are working on decentralised P2P marketplaces, I hope. But however it's implemented, imagine that I can run a daemon on my server that measures my free disk space, subtracts some amount (10GiB?) for my short-term growth, and rents the rest out on the marketplace. By looking at the depth of market (how many unfulfilled bids for how much storage are out there, ordered by bidding price, highest first), it can choose the best price at which all of my available storage will be taken up. My offer will include a price to upload a block (base price + price per byte), a price to keep a block (base price + price per byte, and the billing period), and a price to download a block (base price + price per byte).
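
To make that concrete, here's a minimal sketch in Python of what such an offer might look like, and of picking a clearing price from the depth of market. The Offer and Bid structures, and pricing in satoshis, are my own illustrative assumptions, not a proposed standard:

    from dataclasses import dataclass

    @dataclass
    class Offer:
        """Rates quoted by the storage-for-hire daemon (prices in satoshis)."""
        upload_base: int         # base price to upload a block...
        upload_per_byte: int     # ...plus a price per byte
        storage_base: int        # base price to keep a block for one billing period...
        storage_per_byte: int    # ...plus a price per byte per period
        billing_period_days: int
        download_base: int       # base price to return a block...
        download_per_byte: int   # ...plus a price per byte

    @dataclass
    class Bid:
        """An unfulfilled bid sitting in the depth of market."""
        bytes_wanted: int
        price_per_byte: int  # what this buyer will pay per byte per period

    def best_clearing_price(bids, bytes_available):
        """Walk the bids from highest-paying downwards until the available
        space is used up; the last bid taken sets the price the whole
        offer can clear at."""
        taken, price = 0, None
        for bid in sorted(bids, key=lambda b: b.price_per_byte, reverse=True):
            if taken >= bytes_available:
                break
            taken += bid.bytes_wanted
            price = bid.price_per_byte
        return price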

It's an interesting question whether periodic storage fees, or just having a "successful download bounty", will win out. Charging storage fees encourages the buyers to notify you if they don't want a block any more, but just charging for successful downloads (and just deleting blocks that aren't referenced on an LRU basis to free up space) is beautifully simple.

The trust model is rather different from that of normal cloud providers. If a provider loses my data, I can't sue them; I just don't get to pay them the download bounty for getting my block back. So I'll have to store my data widely across several providers, prices will fall to take account of that, and I'll need to do trial downloads from time to time to check my blocks are still available; if they're not, I'll hire a new storage provider to take a fresh copy of the block from a surviving copy.
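
As a sketch of that audit-and-repair loop (download and hire are hypothetical hooks into the marketplace layer, and blocks are assumed here to be identified by the SHA-256 of their contents):

    import hashlib

    def audit_block(block_hash, providers, min_copies, download, hire):
        """Trial-download a block from every provider believed to hold it,
        then top the replica count back up from a surviving copy."""
        survivors, copy = [], None
        for p in providers:
            data = download(p, block_hash)
            # A provider that returns nothing, or garbage, has lost the block.
            if data is not None and hashlib.sha256(data).hexdigest() == block_hash:
                survivors.append(p)
                copy = data
        while copy is not None and len(survivors) < min_copies:
            fresh = hire(exclude=survivors)   # rent space from a new provider
            fresh.upload(block_hash, copy)    # hypothetical upload call
            survivors.append(fresh)
        return survivors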

But all of this can be done in software. A storage manager app would present a simple get/store block interface to, eg, Ugarit or Tahoe-LAFS, but behind the scenes, it would manage relationships with providers, checking blocks are available, ensuring there's a sufficient number of copies of each, shifting between providers when rates go up or if a provider's reliability score drops too low, etc.
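
The upward-facing interface could be as narrow as this (a sketch; neither Ugarit nor Tahoe-LAFS actually defines this Python API):

    class BlockStore:
        """The narrow waist between a backup tool and the marketplace
        machinery. Provider selection, auditing, re-replication, and
        billing all hide behind these two calls."""

        def put(self, name: str, data: bytes) -> str:
            """Store a block under a client-chosen name; return the server's hash."""
            raise NotImplementedError

        def get(self, name: str) -> bytes:
            """Fetch a block by name, verified against the remembered hash."""
            raise NotImplementedError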

But all of this depends on it being easy for computers to send money between themselves, which is where Bitcoin comes in. Storage providers and consumers can just run Bitcoin wallets and arrange transfers between themselves.

The end result? I can run a daemon to rent out spare storage space on my system, and money would slowly accrue in a Bitcoin wallet. The daemon would rent out all but a safety margin of my space, and as I used up my safety margin, it would shed blocks (notifying the owner) to make more room, and increase its offer price in the market to reduce demand so that the lower-paying blocks move willingly and can be replaced with higher-paying blocks.
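
A sketch of that shedding logic (the block fields, the notify hook, and the 10% price bump are all invented for illustration):

    def reclaim_space(free_bytes, safety_margin, stored_blocks, offer, notify_owner):
        """When local use eats into the safety margin, shed the lowest-paying
        blocks (notifying their owners so they can re-home them) and raise the
        quoted price so cheaper blocks migrate away of their own accord."""
        deficit = safety_margin - free_bytes
        if deficit <= 0:
            return  # still comfortably inside the margin
        for block in sorted(stored_blocks, key=lambda b: b.rate):  # cheapest first
            if deficit <= 0:
                break
            notify_owner(block)        # back channel: "please shift this block"
            deficit -= block.size
        offer.storage_per_byte = int(offer.storage_per_byte * 1.1)  # damp demand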

And I can run another daemon as part of my backup system, that would spend from the same bitcoin wallet to get backup space on other machines. When I have mostly empty filesystems, I will be spending little on backups, and earning lots on renting that space out, so money will accumulate... when I start to fill the filesystems up, the trickle will slowly reverse, and then perhaps I should spend my profits on a new hard disk before they all go and I have to top it up from my own Bitcoin wallet!

Details

The devil's in the details, as always. The marketplace will depend on being able to place bids in a standard format. Potential buyers will need to be able to introduce themselves, perhaps via an HTTP-based protocol served by the storage-for-hire daemon on my server: sign up for an account by registering a public key, then access upload/download/delete block interfaces. The daemon would quote a price in the market, but each block upload would have to be annotated with the rates the buyer is offering, to avoid race conditions when rates change during a transaction; blocks with unattractive rates can be rejected by the server. There would also need to be a back channel for the server to asynchronously notify buyers that it needs to get rid of a block. I'd hate to force buyers to have public IPs (many will be behind NAT) by requiring them to serve an HTTP endpoint, but perhaps a choice between that and polling the server for blocks that need to be shifted within a time limit would suffice. It would also be polite for the server to inform the buyer of any blocks it had to delete without notice, rather than leaving them to find out when a retrieval fails.
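
To pin down the shape I have in mind, here's a hedged sketch of such a buyer-side client in Python (using the requests library; every endpoint, header, and field name is invented for illustration, not a spec):

    import requests  # plain HTTPS transport assumed

    BASE = "https://provider.example/api"  # hypothetical storage-for-hire daemon

    def register(pubkey_pem):
        """Sign up for an account by registering a public key."""
        r = requests.post(BASE + "/accounts", data=pubkey_pem)
        r.raise_for_status()
        return r.json()["account_id"]

    def upload_block(account, name, data, rates):
        """Upload a block annotated with the rates the buyer is offering,
        so a quote change mid-transaction can't cause a race; the server
        is free to reject blocks whose annotated rates it no longer likes."""
        r = requests.put(
            BASE + "/blocks/" + name,
            data=data,
            headers={"X-Account": account, "X-Rates": rates},
        )
        r.raise_for_status()
        return r.json()["hash"]

    def blocks_to_shift(account):
        """Polling alternative to a push back-channel, for buyers behind NAT:
        ask which blocks the server wants rid of, and by when."""
        r = requests.get(BASE + "/evictions", headers={"X-Account": account})
        r.raise_for_status()
        return r.json()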

But how to address blocks? On the one hand, I want content-addressed storage, as it prevents cheating. There's no way a bad server can claim to have blocks it's deleted by sending back random junk and saying "But that's what you gave me! PROVE I'M LYING!" if they are identified by hashes. But on the other hand, existing systems have their own addressing schemes: Ugarit identifies blocks by a keyed hash of their uncompressed plaintext contents, so the hash doesn't give away the content, and it remains unchanged if the compression or encryption algorithms are upgraded; old blocks can still be read while new blocks are written with the new algorithms, and old blocks can be re-compressed and re-encrypted without breaking the references to them. So enforcing that blocks are identified by the SHA256 of their ciphertext would exclude various uses.

The best scheme I can think of is this: each block is identified by a client-supplied ID string combined with a hash based on an agreed algorithm. So the server would say "I support SHA1, SHA256, and Tiger", and the client would say "Ok, here's a block I want to call Boris, and I like SHA256", and the server would reply with "Ok, that block's called Boris:<256-bit hash>". The client should check the returned hash matches the hash it computed itself. A client that's happy with server-assigned IDs would give all their blocks the same name (the empty string), as the hash in the resulting identifier keeps it unique. The server will store the block by hash (deduplicating blocks with the same hash), but keep a per-customer table mapping names to hashes. If the client hasn't provided distinct names, then the LAST mapping for the name provided is kept.
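
A toy, in-memory rendering of that server-side logic (dicts standing in for real storage; the function names are mine):

    import hashlib

    HASHES = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}  # the agreed algorithms

    blocks = {}  # hash -> data: shared across customers, so identical blocks dedupe
    names = {}   # (customer, name) -> hash: the per-customer naming table

    def put_block(customer, name, algo, data):
        """Store a block and return its full identifier, "name:hash".
        Re-using a name simply overwrites its name->hash mapping
        (last write wins)."""
        h = HASHES[algo](data).hexdigest()
        blocks[h] = data
        names[(customer, name)] = h
        return name + ":" + h

    def get_block(customer, name=None, block_hash=None):
        """Retrieve by name or directly by hash; either way, the client
        should verify the hash of whatever comes back."""
        if block_hash is None:
            block_hash = names[(customer, name)]
        return blocks.get(block_hash)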

Meanwhile, on retrieval, a block can be requested by name, or by hash. The client should remember the hashes, even if it uses names, so that it can check that the server isn't sending it a garbage block.

As a Ugarit backend, this would work fine; the Ugarit keyed hash can be used as the name, and the server's hash stored locally for cross-checking on retrieval. If that local record is lost in a disaster, it could either be restored from another backup somehow, or the check could just be skipped, hoping the servers don't lie to us (the latter would be better than refusing to attempt a restore at all!). Ugarit tags (which are the roots of the hash tree) can be stored by using the tag name as a block name, relying on the fact that multiple uploads with the same block name just overwrite the name->hash mapping.

Needless to say, clients should encrypt ALL their data! You can't trust random providers.

Have I missed any other scams? Servers might try to accept lots of blocks, pocket the upload fees, and never actually store them. That provides an incentive for servers not to charge upload fees at all, and just hope to make money on download fees and/or storage. It'll be interesting to see how the market ends up structuring itself! Also, as it's a low risk to accept data from somebody but a high risk to send money, I think the protocol should be based around periodic billing at the end of the period, rather than per-operation micropayments (that makes more efficient use of Bitcoin's transaction fees and roughly hour-long confirmation latency, too). Billing periods could be anything from a day upwards.

But this is a real cloud, in a sense far beyond the current definition of cloud computing. Millions of tiny providers, all competing in a marketplace, with the clients automatically spreading their risk across them in a fine-grained way. I think that'd work for storage, as it's easy to define and commoditise; doing it for computation might be possible, but it'd require much more standardisation of execution models and sandboxes and the like...

(Thanks to the folks in #bitcoin on Freenode IRC for inspiration for all this!)

UPDATE: A friend suggests an improvement over periodic downloads to check the data is still there: a "check" operation where the client supplies a random key and a block name or hash, and the server has to hash the block together with the key and return the result. That lets the client verify the block is still held, provided it can lay hands on a local copy of the block to compute the expected answer; otherwise, it would still have to fall back on downloading the block and checking the hash matches.
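
Concretely, the check might work like this (a sketch; I've used keyed SHA-256 by simple concatenation, though a real design would likely prefer HMAC):

    import hashlib
    import os

    def make_challenge(local_copy):
        """Client side: needs a local copy of the block to precompute the answer."""
        key = os.urandom(16)
        expected = hashlib.sha256(key + local_copy).hexdigest()
        return key, expected

    def answer_challenge(key, stored_block):
        """Server side: prove possession by hashing the stored block with the key."""
        return hashlib.sha256(key + stored_block).hexdigest()

    # The client sends (key, block id) and compares the reply with `expected`;
    # a server that has quietly deleted the block can't fake the answer, and
    # the random key defeats replaying or precomputing responses.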

6 Comments

  • By y, Tue 24th May 2011 @ 11:38 am

    Wuala provides much of this, and you can even pay with Bitcoin there.

  • By Hassan Seth, Tue 24th May 2011 @ 11:49 am

    Great idea, but only public data backups can be stored on these services. Private data is way too important to store on distributed cloud storage.

  • By alaric, Tue 24th May 2011 @ 1:10 pm

    Hassan: I'm not worried about people prying at my data if it's sufficiently encrypted; I'm worried about losing my data due to everyone holding a copy of my block giving up at once... However, the kind of backup architecture I have in mind would be fairly resilient to that; even if it did happen, the next backup run from Ugarit would replace the data and all would be well again. I might lose archive copies of old data, though. Well, it's all down to your risk model... Paying 10p for storing some data with a 30% chance of getting it back sure beats paying £10 to store it with a 95% chance of getting it back: buy ten of the cheap copies and you've spent £1 for roughly a 97% chance that at least one survives!

    y: Wuala, eh? I'll look into that, thanks...

    ...Ok, it seems to be another online storage provider. Still centralised, though. The point here is to create an open market for this stuff, not Yet Another Provider.

  • By AJ, Tue 24th May 2011 @ 5:00 pm

    You might want to check out http://flud.org, which implemented (or at least architected) a lot of this stuff. flud is 100% dormant, but there were a few good ideas there about how to do at least some of what you are describing. A pretty comprehensive bibliography (at least as of 2007) of similar systems that pioneered some of these ideas is found at http://flud.org/wiki/RelatedPapers

  • By Zooko, Tue 31st May 2011 @ 10:41 pm

    Alaric: please see the description of immutable files in http://tahoe-lafs.org/~zooko/lafs.pdf . The interface that you outline here would not be sufficient for a Tahoe-LAFS storage client to use a Tahoe-LAFS storage server.

  • By aha, Thu 23rd Jun 2011 @ 11:58 pm

    It is a myth that Wuala supports BTC.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales