Life beyond Distributed Transactions
Mar. 21st, 2014 09:30 am
[Another one for the programmers. This time, we're going to be talking about architecture, so it's mainly aimed at the hardcore folks who want to understand how modern super-gigantic systems can work.]
A few weeks ago, I got a pointer to the paper Life beyond Distributed Transactions: an Apostate's Opinion, and I just finished reading it. It's a position paper by Pat Helland, written while he was at Amazon, and it isn't new (it dates to 2007), but it is *enormously* important -- it turns out to establish a lot of the reasoning behind Akka, the platform that Querki is built upon.
The paper basically looks at the problem of building nigh-infinitely-scalable systems. Historically, most folks simply assumed that their program was going to run on a single machine, or on such a tight cluster of machines as to make little difference. A world like that leads you to certain assumptions -- in particular, the assumption that you can structure your application's data more or less arbitrarily, and let database transactions paper over the complexities. But in the modern web-based world, where a successful program is going to run on anywhere from dozens to thousands of nodes, often scattered geographically, that just plain doesn't *work* any more. Worse yet, at that scale you usually have to shard your database, and once the data a single operation touches is spread across shards, keeping it consistent would require a transaction that spans machines -- which is exactly what doesn't scale. So the paper asserts that distributed transactions have proven to be a scalability failure in practice, and that successful systems work quite differently.
(I discovered the paper when someone on the Akka mailing list asked how you support distributed transactions inside Akka, was told "don't do that", and was pointed to this paper to explain why. Indeed, Akka actually *had* support for in-memory transactions between Actors, and has consciously dropped that on the grounds that it can't scale. The Akka team is increasingly hardcore about the principle that any serious program should be scale-agnostic from the beginning, so they're deprecating elements that can't scale arbitrarily.)
The paper winds up outlining an architectural approach built around Entities that is quite similar to an Actor architecture. Moreover, it describes some of the techniques necessary to make this architecture robust, and how the business logic should be structured if you want to make it work. As far as I can tell, this paper is strongly influencing the direction of Akka, which has moved beyond simply duplicating Erlang in a better language, and is now rapidly turning into a much more serious and complete platform for building large-scale applications. The newest release of Akka actually contains built-in support for a number of the ideas suggested here, and the end result is *wildly* different from a conventional program architecture, especially in how the data gets managed.
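To give a flavor of what that Entity style looks like in practice, here is a minimal Scala sketch using plain Akka actors -- my own toy example, not code from the paper or from Akka's newer machinery, and the names (AccountEntity, Deposit and so on) are made up for illustration. The point is simply that each entity owns its own state, is addressed by a key, and handles one message at a time, so nothing inside the entity ever needs a lock or a transaction:

    import akka.actor.{Actor, ActorSystem, Props}

    // Hypothetical commands for a single "account" entity; the names are mine.
    case class Deposit(amount: BigDecimal)
    case class Withdraw(amount: BigDecimal)
    case object GetBalance

    // One actor per entity key. All of this entity's state lives inside it, and it
    // processes one message at a time, so no locks or transactions are needed
    // *within* the entity.
    class AccountEntity(accountId: String) extends Actor {
      private var balance: BigDecimal = 0

      def receive: Receive = {
        case Deposit(amount) =>
          balance += amount
          sender() ! balance
        case Withdraw(amount) if amount <= balance =>
          balance -= amount
          sender() ! balance
        case GetBalance =>
          sender() ! balance
      }
    }

    object EntityDemo extends App {
      val system = ActorSystem("entities")
      // In a real deployment the entity would be located (or created) by its key,
      // e.g. via cluster sharding; here we just create one directly.
      val account = system.actorOf(Props(new AccountEntity("acct-42")), "acct-42")
      account ! Deposit(BigDecimal(100))
    }

Anything that spans *two* entities -- say, moving money between accounts -- has to be done with messages between them rather than a transaction around them, which is exactly the shift in thinking the paper is about.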
I'm gratified to see that the paper's suggested approach is fairly close to the way I designed Querki, and for many of the same reasons. (I don't mind reinventing the wheel, so long as it's round.) That said, it provides some very important food for thought that I'm going to need to build in -- for instance, it argues that you need to focus on at-least-once messaging, and to be tolerant of duplication at the business-logic level. (Akka's messaging is naturally at-most-once, but support for at-least-once delivery is beginning to appear now.) I don't agree with every detail of the paper -- in particular, the author goes just a bit too far in asserting that individual Entities must be completely location-agnostic (Querki makes some extremely intentional assumptions about how Actors are grouped together, for efficiency) -- but by and large, it's pretty sound stuff.
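To make that duplicate-tolerance point concrete, here is another little Scala sketch of the receiving side -- again my own illustration, not the paper's code and not Akka's actual at-least-once API. The trick is just to tag each command with a unique id, remember which ids you have already applied, and acknowledge duplicates without re-applying them:

    import akka.actor.Actor

    // Illustrative command carrying a caller-assigned unique id so that
    // redeliveries can be recognized; these names are made up, not from the paper.
    case class ApplyPayment(messageId: String, amount: BigDecimal)
    case class Ack(messageId: String)

    // With at-least-once delivery the same message may arrive more than once, so
    // the business logic has to tolerate duplicates. The simplest approach:
    // remember the ids already applied, and acknowledge duplicates without
    // re-applying them. (A real entity would bound this set and persist it
    // alongside its state.)
    class PaymentEntity extends Actor {
      private var balance: BigDecimal = 0
      private var processed = Set.empty[String]

      def receive: Receive = {
        case ApplyPayment(id, amount) =>
          if (!processed.contains(id)) {
            balance += amount
            processed += id
          }
          // Acknowledge either way -- the sender only needs to know that the
          // payment has been applied exactly once.
          sender() ! Ack(id)
      }
    }
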
Anyway, go read it. It's dense stuff, but it's not very long, and it sets out some very sensible arguments that are grounded in the reality of what has been found to actually work. If you have any interest in the construction of truly large-scale systems, it is totally worth understanding...