Building an online currency exchange

The biggest currency exchange in the Bitcoin world is MtGox. When it goes down, whether from a DDoS attack or from the sheer load of everyone panic-selling, people who hold bitcoins and care about their value in dollars get the panicky realisation that they can't easily sell them - which causes two things:

  1. A drop in the value of bitcoins; people care about their ability to turn them back into fiat money, and will continue to do so until lots of things can be bought directly with bitcoins.

  2. Widespread angst that MtGox is a central point of failure for the Bitcoin economy, along with complaints that it is vulnerable to DDoSes, suffers high trading lag under load, and so on.

So, as a high-performance systems developer, I thought I'd write some notes on how to build a more resilient exchange platform. Perhaps MtGox will do something like this; or, more ideally, one of their competitors will, thereby winning some of MtGox's market share and decentralising the exchange market somewhat.

High performance trading engine core

The core of an exchange platform holds an order book and processes a stream of orders in strict arrival-order sequence. Each incoming order "matches" zero or more orders from the book, partially or completely satisfying them (which modifies or deletes those orders and outputs records of the resulting trades); it may also create a new open order in the book, if it cannot be entirely satisfied from the existing open orders. The output stream is a list of trades, grouped into a log entry for each input order specifying which trades resulted from it and what open order, if any, was created as a result.
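
To make the shapes of those streams concrete, here's a minimal sketch of the records involved; the field names and types are illustrative assumptions of mine rather than any real exchange's schema:

    from dataclasses import dataclass
    from decimal import Decimal
    from typing import Optional

    @dataclass
    class Order:
        order_id: int
        user_id: int             # opaque token identifying the submitting user
        side: str                # "buy" or "sell"
        limit_price: Decimal     # maximum price for a buy, minimum for a sell
        amount_btc: Decimal      # BTC still to be traded

    @dataclass
    class Trade:
        buy_order_id: int
        sell_order_id: int
        price: Decimal
        amount_btc: Decimal

    @dataclass
    class LogEntry:
        input_order_id: int              # the input order this entry describes
        trades: list                     # trades it produced, oldest first
        open_order: Optional[Order]      # residual order left on the book, if any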

Now, this is hard to parallelize, as there is a big data dependency between orders in the input stream affecting each other through mutation of the order book. Some orders don't affect each other, but I don't think this can be detected without applying them to an order book in the first place, so I am going to assume we have to go through a single processing system to handle these.

Clearly, that single processing system needs to be as fast as possible on the available hardware, in order to provide as much capacity as possible.

The order book can be quite large, with lots of open orders floating about, so choosing the best data structure for it is clearly crucial. It's split into two parts: open buy orders, from people waiting to buy BTC, and open sell orders, from people waiting to sell BTC. An incoming buy order is matched against sell orders, and may leave an open buy order; and vice versa. For now, let's just examine the case of incoming buy orders being matched against sells; we can then handle the other case in the same way, just swapping everything over.

So, when an order to buy some bitcoin comes in, it will have a maximum price per bitcoin that will be paid, and some limit on how much to buy (be it a limit on the fiat spent or the bitcoins bought). What we want is to find appropriate matching open sell orders, each of which has a minimum price per bitcoin it will sell at. We want sell orders whose minimum sell price is less than or equal to the maximum buy price of the incoming order. The data structure that comes to mind for each half of the order book, then, is a priority heap, sorted on the price. Each entry in the heap corresponds to a given price per bitcoin, and contains a heap of open orders at that price; this inner heap is ordered on the priority of the open orders, which is a matter of exchange policy - perhaps the oldest open orders get to "go first", or perhaps it's some arbitrary function of age, limit, whether the person placing the order has paid for Platinum Membership, or whatever.

Then the order-matching algorithm becomes one of traversing the price heap in order, and for each price, traversing the order heap within it, generating trades that eat away the limit of the input order. This loop terminates when we run out of price heap entries that match the price restriction on the input order, or the input order is entirely satisfied.
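
Here's a rough sketch of that matching core in Python, covering both the nested heaps described above and the traversal. Only the sell side is shown, the class and field names are my own, oldest-first stands in for whatever priority policy the exchange actually uses, and rounding, fees and limit-by-fiat orders are all ignored:

    import heapq
    from decimal import Decimal

    class OpenOrder:
        def __init__(self, order_id, price, amount_btc):
            self.order_id = order_id
            self.price = price              # minimum sell price for this open sell order
            self.amount_btc = amount_btc    # BTC still unsold

    class SellSide:
        """Open sell orders: a heap of price levels, each holding a heap of orders."""
        def __init__(self):
            self.price_heap = []    # heap of (price, per-price order heap)
            self.levels = {}        # price -> that same per-price order heap
            self._seq = 0           # arrival counter; stands in for the priority policy

        def add(self, order):
            level = self.levels.get(order.price)
            if level is None:
                level = self.levels[order.price] = []
                heapq.heappush(self.price_heap, (order.price, level))
            heapq.heappush(level, (self._seq, order.order_id, order))
            self._seq += 1

        def match_buy(self, max_price, amount_btc):
            """Consume open sells priced at or below max_price; return the trades
            generated and however much of the buy is left unsatisfied."""
            trades = []
            while amount_btc > 0 and self.price_heap and self.price_heap[0][0] <= max_price:
                price, level = self.price_heap[0]
                while amount_btc > 0 and level:
                    _, _, order = level[0]
                    traded = min(amount_btc, order.amount_btc)
                    trades.append((order.order_id, price, traded))
                    amount_btc -= traded
                    order.amount_btc -= traded
                    if order.amount_btc == 0:
                        heapq.heappop(level)            # open order fully satisfied
                if not level:
                    heapq.heappop(self.price_heap)      # this price level is exhausted
                    del self.levels[price]
            return trades, amount_btc

    # An incoming buy eats the cheapest sells first; anything left over would
    # become a new open buy order on the other side of the book.
    book = SellSide()
    book.add(OpenOrder(1, Decimal("95"), Decimal("2")))
    book.add(OpenOrder(2, Decimal("97"), Decimal("3")))
    trades, unfilled = book.match_buy(Decimal("96"), Decimal("4"))
    # trades == [(1, Decimal('95'), Decimal('2'))], unfilled == Decimal('2')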

Traversing heaps like that takes time linearly proportional to the number of open orders that we trade against, and I can't see how we can get better than that. We can update open orders in-place without changing their price, so the price heap doesn't need reorganising unless we completely satisfy an open order and delete it (which is O(log N), where N is the total number of open orders); but if the priority function used to disambiguate orders at the same price is a function of the remaining limits on open orders, then those inner heaps will need updating - also O(log N), but with a much smaller constant factor, as we only consider orders at a given price. And then finally we may need to insert a new open order if the input order is not entirely satisfied, which is O(log N). So overall, we can say that the time taken to process an order should be O(M log N), where N is the number of open orders and M is the number needed to satisfy the input order. Depending on how big the order is compared to the size of the order book, that will vary from O(log N) to O(N log N).

However, we don't need to store the entire order book in our heap. Most trades will only involve open orders that are "close to the spread", ie the top of the order book. To avoid needing enough RAM to store the entire order book, and having to re-shuffle a large heap on every trade, we can keep just the top entries of the order book in the heap, and push the rest off to disk in chunks of contiguous prices. If we empty the order heap, we load the next chunk from disk into the heap. If the in-memory order book grows past a size threshold, we push the bottom X entries off into a new contiguous chunk. And if we try to insert an order that would sit at the bottom of the heap because its price is so far out, we insert it directly into the appropriate chunk, preserving contiguity of the chunks; the heap, and each stored chunk, should hold orders whose prices fall within a range, and those ranges should not overlap.
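
A rough sketch of that spill-and-refill policy, with an in-memory list of chunks standing in for the on-disk chunk store; the thresholds, and the exact way chunks are keyed by price range, are assumptions of mine:

    import bisect
    import heapq

    class ChunkedSellBook:
        """Keep only the cheapest open sells in the RAM heap; the rest live in
        price-contiguous chunks (a plain list stands in for the on-disk store)."""
        def __init__(self, max_in_ram=8, spill_count=4):
            self.heap = []          # min-heap of (price, order_id): the cheapest orders
            self.chunks = []        # chunks of (price, order_id), cheapest chunk first,
                                    # each sorted by price, ranges never overlapping
            self.max_in_ram = max_in_ram
            self.spill_count = spill_count

        def insert(self, price, order_id):
            if self.chunks and price >= self.chunks[0][0][0]:
                # Too far from the spread for the heap: file it straight into the
                # chunk whose price range covers it, preserving contiguity.
                for i, chunk in enumerate(self.chunks):
                    is_last = i + 1 == len(self.chunks)
                    if is_last or price < self.chunks[i + 1][0][0]:
                        bisect.insort(chunk, (price, order_id))
                        return
            heapq.heappush(self.heap, (price, order_id))
            if len(self.heap) > self.max_in_ram:
                # Spill the most expensive in-RAM entries out as a new cheapest chunk.
                ordered = sorted(self.heap)
                self.heap = ordered[:-self.spill_count]
                heapq.heapify(self.heap)
                self.chunks.insert(0, ordered[-self.spill_count:])

        def refill_if_empty(self):
            # When matching has emptied the heap, pull the next chunk back in.
            if not self.heap and self.chunks:
                self.heap = self.chunks.pop(0)   # already sorted, so a valid heap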

Having the order book stored as two size-constrained nested heaps in RAM should provide good throughput, with good use of CPU caches.

Resilience

But storing the order book in RAM leads to issues if the system fails. To allow recovery, we need a resilience mechanism.

The order processor can be viewed as a black box, with an input stream of orders, an output stream of trades, and a bidirectional interface to a store of offline order book chunks. If we log chunk write operations into the same output stream as the trade stream, and provide a magic order type that sends a dump of the in-memory order book to the output stream, we have all we need to make it resilient.

A front-end system needs to sit in front of the order processor, logging all interaction with the processor to a stable store. This stable store can be parallelized for throughput and replicated for reliability - it can be a RAID6 array or something more sophisticated; it can also be where offline chunks of order book are stored. The log should list all orders that go into the trading engine, and everything that comes out of it. The front-end should also periodically insert requests to snapshot the heaps into the input stream.
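
As a sketch of what that front-end might look like - assuming an order processor object exposing a process(order) method that yields its output records, and JSON-serialisable orders; all names here are my own invention:

    import json
    import os
    import time

    class LoggingFrontEnd:
        """Sits in front of the order processor, appending every input order and
        every output record to an append-only log before acknowledging anything."""
        def __init__(self, processor, log_path, snapshot_every=1000):
            self.processor = processor      # assumed: process(order) -> iterable of records
            self.log = open(log_path, "a")
            self.snapshot_every = snapshot_every
            self.count = 0

        def _append(self, kind, payload):
            self.log.write(json.dumps({"t": time.time(), "kind": kind, "data": payload}) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())     # force it onto stable storage

        def submit(self, order):
            self._append("order_in", order)
            for record in self.processor.process(order):   # trades, chunk ops, ...
                self._append("out", record)
            self.count += 1
            if self.count % self.snapshot_every == 0:
                # The magic "snapshot" order makes the engine dump its in-memory book.
                for record in self.processor.process({"type": "snapshot"}):
                    self._append("snapshot", record)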

In the event of a failure of the order processor, once it is online again, the front-end can examine the logs to find the last snapshot. The list of trades and offline chunk operations (which represent orders moved in or out of the order book heaps) since that snapshot can then be replayed into the snapshot to bring it up to date, and that state loaded into the order processor; then input orders that had not been processed at the time of failure can be replayed into it, along with input orders that arrived while it was down (which would have been put into the log nonetheless).
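
Recovery is then mostly a matter of reading that log back; a sketch, assuming the record shapes used in the illustrative front-end sketch above:

    import json

    def recover(log_path):
        """Work out, from the log alone, what a freshly restarted engine needs:
        the last snapshot, the output records to replay into it, and the logged
        input orders that never produced output and so still need processing."""
        entries = [json.loads(line) for line in open(log_path)]
        last_snap = max(i for i, e in enumerate(entries) if e["kind"] == "snapshot")
        snapshot = entries[last_snap]["data"]
        replay = [e["data"] for e in entries[last_snap + 1:] if e["kind"] == "out"]
        processed = {e["data"].get("input_order_id") for e in entries if e["kind"] == "out"}
        pending = [e["data"] for e in entries
                   if e["kind"] == "order_in" and e["data"].get("order_id") not in processed]
        return snapshot, replay, pending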

The front-end stores no important state of its own; it's all in the log, so the front-end can be restarted in the event of failure. The only really stateful operation it performs is rebuilding the snapshot state by replaying logged events, but if that fails, it can just be restarted.

As such, the front-end can really be considered part of the order processor, itself logging its inputs and outputs and recovering its state from snapshots as required; binding the two components into one would simplify implementation, but it's useful to reason about their concerns separately.

Viewing the order book

For informational reasons, people wish to view the order book. This can be done by looking at the most recent snapshot of the heaps, if they are frequent enough, and consulting the offline chunks if necessary. If the heap snapshots can't be frequent enough to satisfy users, then the trading engine also needs to generate more frequent summary snapshots, perhaps just listing the total BTC available in each price range, thereby summarising lots of orders. Generating this takes time in O(N), where N is the number of orders in the heaps.
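
A summary snapshot could be as simple as this single pass over the in-heap orders (the field names follow the earlier sketches, and the bucket width is an arbitrary choice of mine):

    from collections import defaultdict
    from decimal import Decimal

    def summarise_book(open_orders, bucket=Decimal("1")):
        """One pass over the in-heap orders, totalling BTC on offer per price
        bucket - a much smaller thing to publish than a full order-by-order dump."""
        totals = defaultdict(Decimal)
        for order in open_orders:
            totals[(order.price // bucket) * bucket] += order.amount_btc
        return dict(sorted(totals.items()))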

Interface

The order processor and its resilience front-end should be a sealed unit, accessed purely through the very minimal input and output streams described above. But we need a multi-user interface on top that lets users authenticate, submit orders, and receive the results of their trades. This component needs to keep track of each user's bitcoin and fiat balances, in order to check they can place orders and to update their balances in response to completed trades.

Therefore, it needs its own stable storage mechanism to store such per-user information, but as noted above, off-the-shelf techniques to increase the throughput of that, and to make it resilient to failures, exist.

The interface system itself can be parallelized; operations in the context of a given user only introduce data dependencies on that user's data, so it can easily be "sharded" on the user ID to split the processing and storage load across a cluster.
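
The shard mapping itself can be trivial; for example, something like this hypothetical helper, where the only real requirement is that every node computes the same answer for the same user:

    import hashlib

    def shard_for(user_id, n_shards):
        """Map a user ID to the interface shard holding their data; any stable
        hash will do, as long as every node agrees on the mapping."""
        digest = hashlib.sha256(str(user_id).encode()).digest()
        return int.from_bytes(digest[:8], "big") % n_shards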

Each node in the cluster needs to be able to submit orders to the order processing engine and receive the log entry corresponding to each order when it comes back out. This can be done by including the node ID in the order metadata, and giving the order processing engine a routing system that sends output log entries back to the correct node. Successful receipt of output log entries needs to be noted in the order processor's log, so that a failed node that misses some log entries will get them later, when it is working again and the router retries delivery.

The interface cluster can then provide a per-user authenticated API to actual users. It will be a programmer-level RPC API; Web-based human interfaces then sit on top of it as an extra layer that just talks to the interface cluster to get stuff done. These Web interface nodes are stateless, so they can easily be scaled into a load-balanced cluster.

Overload protection

Internet-based exchanges get overloaded by legitimate users, and taken down by people trying to launch an attack, because the servers have a finite capacity in terms of orders they can handle per second, and because the Internet connection itself has a finite capacity to receive orders.

Also, handling orders over an HTTP-based API, or even a fully HTML-based Web interface, is quite a lot of work for the server. This can be mitigated by adding more servers to the cluster (they're stateless), but a cheaper way would be to offer an API based on lighter-weight RPC protocols. But that just moves the limits up by a factor and saves some hardware budget; it doesn't remove the limits.

Another approach is for the exchange to rent private links to subscribers. These could range from leased lines direct to the interface cluster, to colocating customer hardware with a dedicated link to the interface cluster, to hosting customer virtual machines on hardware linked to the interface cluster. In each case, as a dedicated link exists for each subscriber, they can each be given their own order-processing quota; and the public Internet link can have its own quota, calculated so that there is enough capacity beyond the worst the Internet can throw at it that the paying subscribers' quality of service stays within the agreed levels.
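
One common way to enforce such a per-link quota is a token bucket per link; a sketch, with the rate and burst figures obviously being whatever the subscriber has paid for:

    import time

    class TokenBucket:
        """One bucket per link: each private subscriber link, and the public
        Internet link, gets its own rate and burst allowance."""
        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self):
            """Return True if this link may submit another order right now."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False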

If a paying subscriber abuses the system in any way, then this abuse can easily be traced to their private link, and that link shut down.

As long as the access control in the interface layer manages to keep user data safe (eg, the system is not "hacked", which is another aspect that can be made harder with better architecture), the worst the public Internet connection can do is to make itself unusable. If the Internet connection becomes so toxic as to threaten the stability of the system by exploiting bugs in the interface cluster's quota mechanism, then it can be shut down - the paying subscribers will still have access to the trading engine.

Security

The protection of user data is handled by the interface layer. How can we make that as resilient as possible?

The naive approach to building a multi-user server platform is to make each request identify the user using it, and then to use that user ID to select the appropriate state from a big shared database with everyone's records lumped together.

This means that keeping users' data isolated from each other comes down to the care taken in each individual query. And if "public data" lives in the same database, to be viewed by parts of the interface accessible to unauthenticated users, then there is also a danger of those queries being manipulated to access private data.

As we will need to split the data by user ID anyway to allow for sharding to scale the interface layer across a cluster, the logical answer is to store each user's private data, and the public data, in entirely separate databases. When performing an API request, the request handler framework starts by authenticating the calling user and connecting to the appropriate database for their private data, then dispatches to the handler code for the type of request. It's also possible to encrypt each user's data with a different encryption key, either stored in the central user database or even provided by the user as part of the authentication process, as an extra layer of protection. However, in the latter case, care must be taken over changes to a user's data that don't come from their authenticated requests, such as incoming money transfers or completed trades. Perhaps those events need to sit in a cleartext queue until the user logs in, allowing them to be applied to their encrypted state.
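
For illustration, here's a toy version of the "separate database per user" idea, with SQLite files standing in for whatever real storage the interface layer uses; the paths and schema are made up:

    import sqlite3
    from pathlib import Path

    def open_user_db(user_id, base_dir="user_data"):
        """Connect to the calling user's own database file, created on demand;
        request handlers never touch any other user's file."""
        path = Path(base_dir) / f"user_{int(user_id)}.db"   # int() also blocks path tricks
        path.parent.mkdir(parents=True, exist_ok=True)
        conn = sqlite3.connect(str(path))
        conn.execute("CREATE TABLE IF NOT EXISTS balances"
                     " (currency TEXT PRIMARY KEY, amount TEXT NOT NULL)")
        return conn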

Of course, we also need protections against requests from different users interfering with each other; the usual mechanisms of inter-process memory protection on the interface node servers are probably trustworthy enough for that, if used correctly. One point that needs careful examination is the routing of trade logs from the order processor back to the interface layer; if you can submit an order that sells your bitcoins at a ruinously loss-making price, but cause the trade log to be routed to another user's account, then they will suffer the loss and you won't. The "originating user ID" part of the order metadata needs to be treated as a well-protected opaque token throughout the order processing system, so it is hard to cause it to be modified; the order could even be protected with a cryptographic signature (based on a key stored in the user's private data area) over the user ID, the entire order, and a unique order timestamp, so that it cannot be forged or transplanted into another order.
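
The signature itself could be as simple as an HMAC computed with the per-user key held in their private data area; a sketch (a public-key signature would work just as well, and the field layout here is my own):

    import hashlib
    import hmac
    import json

    def sign_order(user_key, user_id, order, timestamp):
        """Bind the originating user ID, the whole order, and a unique timestamp
        together under the user's key (user_key is bytes), so the routing token
        can't be altered or transplanted onto a different order."""
        message = json.dumps({"user_id": user_id, "order": order, "ts": timestamp},
                             sort_keys=True).encode()
        return hmac.new(user_key, message, hashlib.sha256).hexdigest()

    def verify_order(user_key, user_id, order, timestamp, signature):
        expected = sign_order(user_key, user_id, order, timestamp)
        return hmac.compare_digest(expected, signature)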

It should also be possible to have different sections of the interface layer cluster for different classes of user. One sub-cluster can handle users whose balances stay in the sub-$10,000 range, another for $10,000 - $1,000,000, and another for those dealing above that range. It is easy to set up a sub-$10,000 account, but any exploits you can use to "break into" other accounts in the same cluster will be limited to stealing from other sub-$10,000 accounts, until you've stolen enough to qualify for an above-$10,000 account - but that would probably require manual verification, and your deceit would risk discovery during that time. Within each balance range, purely arbitrary division of users into "shards" with their own storage and processing hardware would also help to limit the scope of an attack.

There are also security risks where currency flows in and out of the exchange; incoming fiat or Bitcoin money transfers need to be routed to the appropriate users' account balances, requests to make outgoing transfers need to be authenticated, and feedback about completed or failed outgoing transfers must be routed to the correct accounts without the possibility of interference (if you can fake a message saying that an outgoing transfer failed when it actually succeeded, then you can transfer the same balance out repeatedly). As with orders, including a cryptographically-signed order ID in requests and responses, along with simple and easy-to-verify routing of responses, should do the trick.

Insider threats

The biggest remaining danger is an insider threat. If somebody with access to the interface layer storage systems can just go in and edit people's account balances, or request arbitrary outgoing money transfers, or forge trades from users into the order processing engine, then they can bleed the system dry in a number of ways.

This can be made harder by requiring users' actions moving through the system to be signed with a cryptographic identity stored in that user's own area, and making it hard to steal the key, but that is impossible to entirely enforce.

The best defences against these kinds of attacks are to require administrative actions which override security to be counter-signed by multiple administrators, and to log all administrative actions.
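
A sketch of what checking such counter-signatures might look like; HMACs with per-administrator keys stand in here for what should really be public-key signatures, so that the verifying system never holds the administrators' own secrets:

    import hashlib
    import hmac
    import json

    def countersigned(action, signatures, admin_keys, required=2):
        """Accept a security-overriding action only if it carries valid signatures
        from at least `required` distinct administrators."""
        message = json.dumps(action, sort_keys=True).encode()
        valid = set()
        for admin, sig in signatures.items():
            key = admin_keys.get(admin)
            if key is None:
                continue
            expected = hmac.new(key, message, hashlib.sha256).hexdigest()
            if hmac.compare_digest(expected, sig):
                valid.add(admin)
        return len(valid) >= required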

The servers should only accept software updates that have been signed by both the development team and a separate QA team, for instance, to prevent developers, QA staff, or systems administrators from installing a backdoor.

Each administrator should only be able to access one type of server - and those servers should log their administrative logins to servers in another "type", so nobody can both administer a server and edit its administrative action logs.

The automated systems that make outgoing payments should operate from a dedicated account per type of currency, with the bulk of the stored funds kept in a number of offline accounts which can only be used to pay into the "live" accounts - and then only with an audit trail, and with access countersigned by multiple people.

All staff need to have their backgrounds checked, and enough information about them kept to make them easy to find if they decide to go missing. Not just random folks from IRC.
