Why Event Sourced Systems Fail
I have given literally hundreds of talks on what makes them great but I don't think I have ever discussed previously the common failures :O
I have had many discussions with teams and people recently who have had failures and I think a talk discussing the DOWNSIDES of such systems would make for a good talk ... especially coming from me as I have hundreds of talks on their POSITIVE aspects.
- Gregory Young coined the term “CQRS” (Command Query Responsibility Segregation) and it was instantly picked up by the community who have elaborated upon it ever since. Greg is an independent consultant and serial entrepreneur
- He has 15+ years of varied experience in computer science from embedded operating systems to business systems and he brings a pragmatic and often times unusual viewpoint to discussions
- He’s a frequent contributor to InfoQ, speaker/trainer at Skills Matter and also a well-known speaker at international conferences
- Greg also writes about CQRS, DDD and other hot topics on codebetter.com
So let's start with the talk, and then we'll see each other after your talk. So your title of the talk is Why Event-Sourced Systems Fail. So we can start. Are my slides visible right now? You need to do them full screen, please. Yep. Now we see them. Okay, so we're all good technically? Yep. All right, great. So I will just get started with the talk then for the people who are listening as opposed to talking about technical details. Then again, the entire point of the talk is, well, kind of to talk about technical details. So what we're going to be talking about for the next 45-50 minutes or so is a rather interesting subject for me because there's been a lot going on lately about it. And it's something that I've been dealing with for many, many years.
A lot of people associate me directly with event-sourced systems. And over the many years, I've talked a lot about the pros of event-sourced systems. I have not necessarily discussed a lot about the cons associated with event-sourced systems. So in this talk, what we're going to be looking at is why some people run into issues with event-sourced systems to the point where it can even become a project failure. Now, as I mentioned, I've been talking about event-sourced systems for a very, very long time. In fact, here's a talk about event-sourced systems before they were even called event-sourced systems. This is going all the way back to QCon San Francisco in 2007. And you may notice that we had different pattern names back then, such as asynchronous context mapping. Dear God, what is that? But the stuff that I'm going to be talking about is not something that's new or, you know, we came up with it six months ago. We have more than a decade of experience looking at these kinds of systems and how they generally work. Now, before we get into that, we're going to go through a very, very quick intro of what event-sourcing is.
And when I say a quick intro here, I mean, we're talking five, seven minutes. We're not talking a 20-minute intro. I want to spend most of the time today getting into the other stuff about why they fail, but there may be some people watching who are not aware of what event-sourcing is. And going through quickly will give us all a common point of what we are talking about. Now, with event-sourcing, and I've been using this particular slide for many, many years, what we do is we are an append-only model. We can do deletes and such, but we do deletes in different ways.
What we're basically doing is we are appending facts to a log. I can have a purchase order with n line items and some shipping information associated with it. This is a very standard structural model that you might see in a database or even a key value store. But instead of structuring things this way, we're going to structure things as a series of events. So here we have a cart created, three items were added, and then we had shipping information added. Note that this would be five events. Those three items added, each would be a separate event. It's just if I made them five separate events, the boxes got really small, and they don't necessarily look very good on top of your screen. But I can take five events like this, I can replay them quickly, and I can hand you back this purchase order.
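To make the replay idea concrete, here is a minimal sketch in Python. The event classes, field names, and the `replay` helper are all made up for illustration; they are not taken from the slides or from any particular event store. The sketch shows the five events from the example being folded, in order, back into a purchase order.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical event types; the names are illustrative only.
@dataclass(frozen=True)
class CartCreated:
    cart_id: str


@dataclass(frozen=True)
class ItemAdded:
    sku: str
    quantity: int


@dataclass(frozen=True)
class ShippingInformationAdded:
    address: str


@dataclass
class PurchaseOrder:
    """The structural model we rebuild from the log."""
    cart_id: str = ""
    line_items: dict = field(default_factory=dict)
    shipping_address: Optional[str] = None


def apply(order, event):
    """Fold a single event into the current state."""
    if isinstance(event, CartCreated):
        order.cart_id = event.cart_id
    elif isinstance(event, ItemAdded):
        current = order.line_items.get(event.sku, 0)
        order.line_items[event.sku] = current + event.quantity
    elif isinstance(event, ShippingInformationAdded):
        order.shipping_address = event.address
    return order


def replay(events):
    """Rebuild a purchase order by applying every event in order."""
    order = PurchaseOrder()
    for event in events:
        apply(order, event)
    return order


# Five events: one cart created, three items added, one shipping info added.
log = [
    CartCreated("cart-1"),
    ItemAdded("apple", 2),
    ItemAdded("pear", 1),
    ItemAdded("fig", 3),
    ShippingInformationAdded("221B Baker Street"),
]

order = replay(log)
```

The log is the thing being stored here; the `PurchaseOrder` is derived and can be thrown away and rebuilt at any time.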
The question with event-sourcing is what do we store? Do we store a structural model similar to this purchase order, or do I store a series of events and then rebuild my structural model off of the series of events that I'm storing? Interestingly, if you actually go into, let's say, a SQL database, a lot of SQL databases themselves will actually be essentially event-sourced internally. There are many pros and cons between these two different models. Before we get into the rest of it, I just want to go through one thing that's kind of tricky with event-sourced systems, and that is deletions and how do you delete information from your log. When we have a model like this, where I have a purchase order with n line items, I can just remove one of the line items. When I have a model that's append-only, and when we talk about append-only, remember, this could even be written to write-once media. It's actually a quite common thing to do to write your event log onto write-once media for regulatory purposes.
How would I delete something? Well, basically what you end up with is we have our cart created, three items added, one item removed, and then shipping information added. In other words, every time we delete something, the way that we delete it is by adding it, which is quite odd for some people. But there's some reasons for why that can actually be better. One of the main reasons that it can be better is, again, we have a full log of this. I can see that this item was added, and then later it was removed. In most systems, even when we are removing things like this, we will not actually get rid of the item added at any point. We'll just leave it there forever. Why? Because there might be reports that want to look at when that item was added, even though it's later been removed. There are times where we might need to remove this for regulatory purposes, things along those lines, and that can be done. But generally, we're not going to be doing that. So now we've gotten through our quick introduction to event sourcing. And obviously, everything is amazing. It's completely, totally awesome.
Correct? Well, there have been some people that have been bringing up some issues with it. And, you know, like everything, event sourcing has pros and cons associated to it. So let's look at what some of these things might actually be. How could these systems have possibly failed? And my issue with many of the failures that people have run into, is that they try to take the failure that they've actually had, and generalize that failure to the idea as a whole. This would be equivalent of me saying that I went and did a functional programming project, and we failed, therefore functional programming is bad. Well, it's possibly correlated to functional programming, it's possibly not correlated to functional programming. We don't really know.
So let's go through and take a look at some of the problems that people have run into, and look at whether or not they are really valid. So the biggest issue, and this is the one I see probably 80 to 90% of the time, let's say more than a first standard deviation, is that it's different. Well, yeah, it is different. The fact that it's different is also part of why you're using it. It's going to take you a couple projects of working on things until you kind of grasp what's really going on. It's not that you come through and, you know, knock your first project out of the park, because you know so well what you're doing when you're doing it for the first time.
You are going to run into some issues. And those issues are going to be different than what you would run into using other things. It just takes some time. Again, this is a real issue. And it should be an expected issue. You're working with something different, it's going to take time for you to learn what your pitfalls may be associated with that thing. A lot of this is just you're working with something different. I mean, if I were going to go and jump onto an Erlang project tomorrow, I would expect that working every day in an Erlang environment, I would have to work somewhat differently than the way I work today, that there would be different patterns or different ideas associated with it. This is not unusual.
The problem that many have been running into, though, is they have, let's say, 10 years of experience building systems. And they have three months of experience building event-sourced systems. And they consider themselves to be experts in it because they are seniors, they have 10 years of experience overall. You need to remember that you need the experience inside of the thing that you're working with, not just overall experience. Now, while that is the largest thing, there are some very, very valid arguments to bring up against event-sourced systems. So let's go through what some of the really valid arguments actually are. The single biggest issue that I have found with event-sourced systems is dealing with versioning over time. I am writing my events down to this log. This log, I do not delete from; I keep it forever. I know I can possibly delete from it. But let's imagine I'm continuing to append to it.
What happens over time? Is the event that I wrote two years ago the same as the event that I write today? If they've changed interpretations over time, how do I handle reading all the different versions of events that may have been written over long periods of time? And, again, this is a very, very valid issue to bring up with event-sourced systems. In fact, it's non-trivial. A lot of people like to kind of push this issue underneath the carpet, and we're not going to talk about it. There's some things that you can do, for instance, using weak schema, where you can only add and remove things and you can't rename things. And well, your versioning issues mostly kind of go away, but not completely. And in general, this versioning over long periods of time is a completely non-trivial issue to be dealing with. In fact, it's so non-trivial, I wrote a book on it. This is not something that you can just ignore. Versioning does take a bit of time. And it is different than what you would be dealing with in a lot of other systems. There are also some ways that you can make versioning relatively easy. One of the ways to make versioning really, really easy is you actually do a transform on every release. So when I do a release of my software, I release out an entire new event store as well.
All the events from the old event store get brought over to the new event store and they happen through a transformation process. In other words, we take an event from the old system, it runs through a function which can possibly transform it, and we write it to the new event store. Doing this will take away pretty much any versioning issue I would ever have. Because I have this transformation occurring. But it also requires a bit of infrastructure in order for me to do this. I need to be able to automate how to do this because you certainly don't want to do this manually on every single release. But doing such a thing will take care of most of your versioning issues. And be careful about making decisions on versioning too early in your project without having somebody who's done it before.
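As a rough illustration of the transform-on-release idea, here is a small Python sketch. The event shape (plain dictionaries), the `qty` to `quantity` rename, and the function names are assumptions invented for the example, not something taken from the talk or the book. Each old event runs through a function that may rewrite it, and the result is appended to a brand new store while the old one is left untouched.

```python
def upgrade(event):
    """Hypothetical per-release transform.

    Assumes an older release wrote ItemAdded events with a 'qty' field
    that the new release calls 'quantity'. Everything else passes through.
    """
    if event["type"] == "ItemAdded" and "qty" in event["data"]:
        data = dict(event["data"])          # copy, never mutate the old event
        data["quantity"] = data.pop("qty")
        return {"type": event["type"], "data": data}
    return event


def copy_and_transform(old_store, transform):
    """Build the new event store by running every old event through transform."""
    return [transform(event) for event in old_store]


old_store = [
    {"type": "CartCreated", "data": {"cart_id": "cart-1"}},
    {"type": "ItemAdded", "data": {"sku": "apple", "qty": 2}},
    {"type": "ItemAdded", "data": {"sku": "pear", "quantity": 1}},
]

new_store = copy_and_transform(old_store, upgrade)
```

After the copy, code in the new release only ever has to read the current shape of each event. In a real system this step would be automated as part of the release process, and the stores would be actual event stores instead of in-memory lists.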
It does take experience to actually learn how to best version events for a production system. It's not just, you should do it this way. If you pull down the book that I linked earlier, it goes through at least five separate strategies you can possibly be using on this. But I want to stress that this is not something that makes event-sourced systems hard. You just need to learn how to do it. Most of the complaints I find of people who are having a hard time with such systems are getting stuck with versioning. And they are then attributing their versioning problems back to the idea in general. And it's more so that they just haven't learned really how to version things properly. Let's move right on to number three. Modeling is different. I am not shitting you. I have actually had people tell me that event-sourced systems are bad because you have to model them differently. Well, okay. That's also kind of why they're good. I mean, we're modeling events as opposed to modeling some structured state. It's not that one is inherently right, and the other is inherently wrong. It's that they're different.
How I go through and model when we're dealing with a series of events is different than how I'm going to be modeling when we're talking about state. This is going to come through all the way to how I do analysis. This is going to come through to how we discuss things. It's different. And on some problems, well, it's better. And other problems, it's perhaps worse. You can't even discuss this without getting into what the problem is. This may be a benefit, it may be a drawback. We can't really say which one of the two we're actually dealing with, unless we're talking about what the problem is. This is no different than saying that functional programming is bad because you solve problems differently than you do in object-oriented programming. Well, yeah, that's kind of the goal.
Functional programming is not object-oriented programming. There are different trade-offs associated between the two. Those trade-offs do not make one inherently good or inherently bad unless we're talking about a concrete situation. There are times where an event-sourced system might be preferable over other models. There's times when other models might be preferable over an event-sourced system. It's not that one is right and the other is wrong. It's we need to talk about context. Now, we're going to get into my favorite issue that people have been bringing up. And that is eventual consistency and consistency.
And this is where most people who have been raising issues have been raising their largest issues. And this issue is just utterly bizarre for me. And the reason this issue is utterly bizarre for me is you start seeing diagrams like this. And this looks similar to some diagrams I have made in the past. Well, I shouldn't say similar. I mean, this is a direct copy and paste of one of the ones that people were using.
So, let's go back to my screen. We're here on the left. We have our event store, which for some bizarre reason is doing create, update, and delete. And there's this weird thing called an event queue, which is actually part of your event store. Like, you don't need an event queue if you actually have an event store. And then on the other side, we have our read model. Now, what's weird about this is a couple things. So, one, this read model, well, there's not normally one of them. There can easily be three of these over here. The second bit that's weird is we have this event queue, but we would never actually use an event queue in any of this. Our event store can act as our event queue for us. And this is eventually consistent in the way that this is done.
So, I don't think that they would be building this up, but that's not necessarily bad. And you could make this fully consistent between these two models if you wanted to. I mean, could I have my domain model, when it writes to the event store, also write to my read model if I only have one of them, and, for instance, use a distributed transaction between these two models in order to put them through at the same time?
And the answer is, yes, of course, I could do this. Now, of course, this wouldn't turn out very well if I actually tried, but I could do it. Now, what's utterly bizarre for me about this particular argument is that this argument is being used against CQRS. But when I actually go through and show people CQRS, my slide actually only has a single data storage on it. It's not event sourced. So, event sourcing and CQRS are two different things, and we can discuss whether or not that's an issue with event sourcing, but it's certainly not an issue with CQRS.
And you can get around all of this if you were to do the transactional update that we were talking about. And you can do that transactional update. I've seen project after project after project do it and succeed. But you need to understand that when you do that, you are giving up on certain things. And you need to think about whether those things were actually important to your project. The eventual consistency that exists there in many systems is there for reasons. So, let's go through and look at what some of the benefits of eventual consistency there are. The biggest one, and this is what makes me kind of laugh at this little diagram that was being used, is that they only have one read model.
So, they're writing to the event store, and then they write to one read model, and they can do their distributed transaction here, and yay, everything's great. What happens when I have nine read models? And my nine read models are of varying types. So, some of them might be OLAP, some of them might be MongoDB, some of them might be files. How do I do this process and do it completely transactionally to nine of these? Well, that doesn't really work. The reason why you have this bit of eventual consistency in here is because there may be many of these, and you want each one of them to be able to operate independently.
So, I want to be able to have my series of projections following that event store over there and writing to this instance of SQL Server. I can bring up five of these. I can bring up five of these that are following the event store and writing into their own instance. What would happen if I tried to do this transactionally with that event store? Well, this is going to turn into a huge problem, and I'm going to end up losing my ability to have many of these. The main reason that we're doing this non-transactionally and we're following this log and bringing everything over here is because I want to be able to have many read models. In any system that you're working with that's event sourced, you should end up with multiple read models. Literally, almost every system I have seen, I have seen places where it's, well, you know what, we should throw in a different read model for that.
Yeah, we're doing a bunch of OLAP processing in SQL Server, but we found that we really want to do this full text indexing. Well, yeah, we probably shouldn't do that in SQL Server. There's a lot of better choices for doing that. We should probably use one of them. Be very, very careful when you see event-sourced systems and they only end up with one read model associated with them. You almost always want to have multiple.
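Here is a minimal Python sketch of several read models following the same log independently. The `Projection` class, its `checkpoint` and `available` fields, and the two in-memory read models (stand-ins for something like a reporting database and a full text index) are all hypothetical. The sketch shows that each projection tracks its own position in the log, so one of them being down does not stop the other, and it simply catches up later.

```python
class Projection:
    """Hypothetical projection that follows the log at its own pace."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.checkpoint = 0      # number of log entries already applied
        self.available = True    # False simulates this read model being down

    def catch_up(self, log):
        if not self.available:
            return
        for event in log[self.checkpoint:]:
            self.handler(event)
            self.checkpoint += 1


totals = {}   # stand-in for a reporting read model
index = {}    # stand-in for a full text search read model


def update_totals(event):
    totals[event["sku"]] = totals.get(event["sku"], 0) + event["quantity"]


def update_index(event):
    for word in event["description"].lower().split():
        index.setdefault(word, set()).add(event["sku"])


reporting = Projection("reporting", update_totals)
search = Projection("search", update_index)

log = []
search.available = False   # the search read model is down for a while

log.append({"sku": "a1", "quantity": 2, "description": "Red apple"})
reporting.catch_up(log)
search.catch_up(log)

log.append({"sku": "p1", "quantity": 1, "description": "Green pear"})
reporting.catch_up(log)
search.catch_up(log)

lagging_checkpoint = search.checkpoint   # search has seen nothing yet

search.available = True    # it comes back and replays what it missed
search.catch_up(log)
```

Writes to the log never waited on either read model. Adding a third or a ninth projection is just another follower with its own checkpoint.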
Keep in mind that when we talk about having these multiple read models, it's not just different types of read models. It's not just that one can be in a document database and another one can be in a SQL database. It's also that I can put one in London, I can put one in Hong Kong, and I can put one in New York City. If I'm following this method of doing things, my ability to geographically distribute read models is essentially free. And there's lots of great examples of where this ability to geographically distribute read models becomes extraordinarily valuable. One of the most common ones is I'm hosting everything in the cloud. And then who would ever imagine that somebody wants to be able to run reports locally and not on the cloud? Who could even imagine such a feature request? Well, all I have to do is I just bring up a read model in their locality. And then they have a read model running locally. So now we have the cloud one and we have the one running locally for them.
And it's important to note that this ability to do this essentially comes for free inside of an event-sourced system. Now, I've seen people make the argument that they could then transactionally write to all three read models. Well, that's not going to turn out very well if you do that. Because if you're trying to write transactionally to three read models, and one of them is down, what happens? Well, now you can't write transactionally to three of them. So you stop. This is a fairly well known issue. And then people say, well, we'll just continue. Okay, yeah, now you're eventually consistent.
There's no way of really getting around this issue. The eventual consistency that's there, we're not putting that eventual consistency there because we want some massive scalability, or because we think it's cool. What we're looking at there is availability issues that can come up if we don't do it. It's kind of something that you need to be doing. We need to remember as well that geographic distribution is one of our primary goals in this type of system. I mean, you will find system after system after system, where once you have the ability to geographically distribute, it becomes, wow, this is like the coolest thing ever.
You may not even realize it up front. But your ability to actually just throw up a read model and bring it up anywhere in the world, so that people can work locally with that read model, is actually a huge benefit of this type of system. Now with our eventual consistency, we can hide it. In my class, I literally go through for over an hour talking about things that we can do to hide it. But the easiest one to deal with is when you write me an event, what if I wrote you back where in the log this event was written? So you write an event to me, I take the event, I write it to the log, I say that this event is at position 104. Now when you go to query, back to the OLAP side, you include the version that I gave back to you saying where it was written inside of the event store. If you happen to be using Event Store, the product, we actually return back a position, which is the position, the all position.
It is where this thing was written in the log, essentially a byte position. On the other side, when I go to do my query, I include that position when I go up to go do my read. Asynchronously, we've got our projections that are following the event store, and they are updating inside of my read model where they currently are in the event store. Remember, we can look at the event store as being a linearized log. As my read model is coming forward in that log, it's just saying this is how far along in the log that I am. When I go to do my query, I include that position on the result from the thing that I just wrote. Because I've done that, this read model can tell me whether or not it has actually seen that write that I did in order to do my query. If that number is lower, it just returns me a retry after. We are in its entirety here talking about a few lines of code. This is not a huge piece of work to be doing. But if you do this, you will never see eventual consistency. But it begs a question.
I do my write. I get back that it's been completed. I go to do my read with my version on it, and it tells me that my read is not ready yet. Okay, cool. So let's do a retry after. We'll wait five seconds, and I do it again. Okay, and it tells me that it's not ready yet. Okay, so I do a retry after. Now it's another five seconds. It's still not ready yet. At some point, I'm going to reach where it's better for me to return something than to say that it's not available. Where is this point? And it depends from system to system. This is not something that is a global decision, and I can tell you for your system what it actually is.
There will probably be different pieces of data in your system that have different requirements associated. But doing this, almost all of my eventual consistency issues would just go away. My reads would be coming back as being consistent. And don't get me wrong, this isn't going to solve every single possible issue that you could have, but this is going to solve the vast majority of them. And this is a very, very simple thing that we can be applying. This is something that can be implemented on top of an existing system within a few minutes. This is not a huge piece of work or some difficult technical idea.
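The position trick can be sketched in a few lines of Python. Everything here is hypothetical: the `EventLog` and `ReadModel` classes, the use of a simple event count as the position, and the `read_after_write` helper are made up for illustration and are not the API of Event Store or any other product. The write returns a position, the query carries that position, and the read model answers "not ready" until its projection has passed it. After a bounded number of retries the caller falls back to whatever the read model currently has.

```python
class EventLog:
    def __init__(self):
        self.events = []

    def append(self, event):
        """Append an event and return its position in the log (1-based here)."""
        self.events.append(event)
        return len(self.events)


class ReadModel:
    def __init__(self):
        self.position = 0   # how far along the log this read model is
        self.names = {}

    def catch_up(self, log):
        """Normally runs asynchronously; called by hand in this sketch."""
        for event in log.events[self.position:]:
            self.names[event["id"]] = event["name"]
            self.position += 1

    def query(self, key, min_position):
        """Return (ready, value); not ready means the caller should retry."""
        if self.position < min_position:
            return False, None
        return True, self.names.get(key)


def read_after_write(model, key, min_position, attempts=3, wait=lambda: None):
    """Retry until the read model has seen our write, then give up politely.

    Returns (value, consistent). In a real client, wait would sleep for the
    suggested retry interval; how many attempts to allow is a per-system
    decision.
    """
    for _ in range(attempts):
        ready, value = model.query(key, min_position)
        if ready:
            return value, True
        wait()
    return model.names.get(key), False   # possibly stale fallback


log = EventLog()
model = ReadModel()
position = log.append({"id": "c1", "name": "Ada"})

# The projection never runs: we exhaust our retries and get a stale answer.
stale = read_after_write(model, "c1", position)

# The projection catches up while we wait: the retry sees our own write.
fresh = read_after_write(model, "c1", position, wait=lambda: model.catch_up(log))
```

The `attempts` limit is where the "at some point it's better to return something" decision lives, and different queries can use different limits.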
Saying that the eventual consistency issues are what make event-sourced projects fail is ridiculous. I can take away almost all of my eventual consistency issues in just a few minutes. It's just I need to know how to do that. Literally, this is something that we could implement inside of your system in less than a day, most likely within an afternoon. It's really that easy to be able to do it. Now, a lot of these arguments that have been going on against event-sourced systems are being made by people who have built one.
They are not experts in the material. You have to be very careful these days, especially because there's a lot of people that are out in all sorts of different areas of expertise with very strong opinions and not necessarily the experience to back having such strong opinions. And I want to stress that this is more something that we get because of, I honestly believe, largely social media. And we see it not just in software. We see it in all sorts of topics. I mean, we can talk about your uncle who's somehow an expert in politics all of a sudden and suddenly knows all the details of how governments actually interact and will tell you about the personality of a given person he's never even met.
But keep in mind, a lot of this stuff is FUD. But having gone through many of their arguments that we've seen along the way, I have a real question for you guys. Do you want to know why their projects actually failed? There's a reason why they failed. And what's hilarious for me is again, it comes from industry to industry to industry. We see the same pattern going on. The reasons projects fail are the people associated with them. You don't fail a project because you chose to model something in a given way.
You fail a project because of the people that were involved with the project. Perhaps that modeling wasn't the right way of modeling something for that project and you chose to try to model in that way. That's still a people issue. It's not an issue with the technology or the concepts. It's an issue with the people associated. And we see this time after time after time after time. Ideas are not what generally fail. It's the people associated who fail. And with that, I will open things up for questions if we have any questions, but I'm not quite sure how questions might work. Oh, there she is.