Software Architecture for Humans! [eng]
Talk presentation
Software architecture only appears to be a technical topic. Of course, software needs technologies and structures, but people have to be the focus of the architecture. After all, the key challenge of software development and software architecture is that the software systems we build are too complex for a single human to understand. Conversely, the organization and management of people can also solve problems that relate to software architecture. Thus, the talk shows the dependencies between architecture, people, and organization - and how to use them to develop software more successfully.
- Eberhard Wolff has 20+ years of experience as an architect and consultant - often on the intersection of business and technology.
- He is the Head of Architecture at SWAGLab in Germany.
- As a speaker, he has given talks at international conferences, and as an author, he has written more than 100 articles and books, e.g. about Microservices and Continuous Delivery.
- His technological focus is on modern architectures – often involving Cloud, Domain-driven Design, DevOps, or Microservices.
- Site, Twitter, Mastodon, GitHub, LinkedIn
Talk transcription
Thank you so much for the introduction. I'm glad to be here. Thanks for having me. So, I will talk about software architecture for humans, not for computers. I'm the Head of Architecture at SWAGLab in Germany. I want to start off the presentation by asking the question, "Is this a great architecture?" That's a typical question that you would have in some architecture reviews. And that leads to the question, why are we doing architecture? The reason we are doing this is that humans have limited mental capacity and still need to be able to modify the system. However, the systems we are building are so complex that one human alone cannot possibly understand and modify them. Therefore, we need to split the system in a way that each human can change some part of it without understanding everything about everything; otherwise, they will be out of luck and won't be able to change the system because it's much too complex.
So, is this a great architecture? Well, actually, I don't know because it depends on who the architecture is for. Therefore, I would argue that what we are really looking at is some complexity, and there is some bound for the maximum complexity a team can handle efficiently. As long as the actual complexity of the system is below that threshold, we are fine. So, whether it is good architecture depends on the team and their mental capacity. Therefore, we can only review architecture when we consider the people, too. There is no absolute great architecture; it doesn't make a lot of sense to say, "Okay, this is a great architecture," without looking at the people who need to understand it. Therefore, we should use metrics with care. Even though the metrics might say this is good or bad architecture, we need to look at the people at hand to figure out whether it's really good or bad architecture.
Managing dependencies, managing the shape of the architecture only takes you so far. So, I would argue that you should do some interviews and figure out where the people think the problem is with the current architecture. Then you can support those findings with some metrics that say, "Okay, this part of the system is not good." And therefore, we should look at the people at hand to figure out whether the system is really too complex. Then you can think about improvements. It doesn't make a lot of sense to just take a look at the architecture and say, "Okay, this is good or bad"; you have to talk to the people. There are even some metrics that, in my opinion, are quite interesting, because they cover social aspects, such as who changes which part of the system, what changes frequently, what is changed only rarely, and so on.
The reason why this is interesting is that if there is some part of the system that only one person can change, that's probably a risk, and that person can never go on vacation. Stuff that is changed frequently but is really complex—well, you should probably improve that because chances are that quite a lot of time is wasted on it. While stuff that is really complex but is changed only seldomly—well, maybe it's fine that it's complex because it doesn't really matter that much after all. On Fridays, I usually do a stream about specific topics in software architecture. Most of the episodes are in German, I'm afraid, but there is one that talks about behavioral code analysis with Adam Tornhill. He is also the author of the book "Your Code as a Crime Scene." He walks us through concrete examples of how to do behavioral code analysis that takes these points into account, like who changes what, what is changed frequently, and so on.
Figuring out whether an architecture is good or bad is only one part of the game. The more important part is the question, so how do you improve the architecture? Usually, what you would do is say, "Okay, here is the threshold of the maximum capacity that the team can handle. This is the actual complexity. It's higher than what the team can handle efficiently. So therefore, we need to bring the complexity down and simplify the system." This is what we do in many cases to improve software architecture and make it easier to change the system.
However, Adam talked about one specific problem in a consulting engagement where the team was fine with one system, but there was this other system that they were working on, and that was a really bad system. When they looked at the metrics, they figured out that that system is actually well-structured. So it was not clear at all why this other system was so hard to handle and so hard to modify. The reason was this: the other system was well-structured, so it should be easy to understand, but the team never learned the system. It's not the structure; it's that the team never really got around to understanding that other system and working with it. If you try to fix that situation by improving the system and making it easier to change by decreasing complexity, it will probably not work. What you need to do instead is increase the threshold of what the team can handle; therefore, the team needs to learn the system, and you need to educate the team.
So, if you just look at the metrics, in this case, they are really great, and they say that this system is a great system, but you still need to optimize, and in this case, you need to optimize the people who are working on the system. Let's talk about legacy. The question is whether legacy is also a social problem. The traditional explanation for legacy is that over time, the complexity of your system increases because of software rot. Software becomes rotten after some time because the structure deteriorates and becomes worse and worse because of technical debt. This is how we usually think about legacy. I would argue that there are social explanations as well. The maximum threshold of what a team can handle might actually decrease because people quit. In that case, the problem is not so much about the complexity of the system; it's rather about the people who quit and had great knowledge about the system. The fix, in that case, would also be education: making people learn the system, making them understand the system, and things of that nature.
One of the interesting things and one of the interesting patterns in that regard is the big ball of mud. It's a documented pattern, so you can find descriptions of it out there. There is a paper that talks about it, and that paper argues that increasing complexity might actually be fine, as long as the system stays maintainable. You are actually good, and it's cheaper to build systems that way because making a system perfect in every regard is quite expensive. Therefore, you have to make some compromises. Generally speaking, not all parts of a system will be perfect. For that reason, there will be better and worse parts anyway. Therefore, it doesn't make a lot of sense to say:
"Okay, we are going to have great quality throughout the whole system." You will fail. You will have good parts. You will have bad parts. And somehow you have to deal with that. So, for that reason, you have to accept some additional complexity. However, I need to warn about the consequences. The system must stay efficiently maintainable. If the quality drops too much, then the consequences might actually be disastrous, because then you might reach a standstill where nothing can be changed in the system, where it's really unmaintainable, and you are basically screwed. But at the same time, as I said, there is no such thing as a perfect system.
I did an episode about the Big Ball of Mud, too, and you can find the original link to the Big Ball of Mud paper there. I would urge you to read it because there's a lot of interesting stuff to learn from that paper. Talking about that paper, there is another thing that I want to talk about… Oh, and I should add that this episode is also in German, I'm afraid. So if you don't know German, then you should read the original paper. Otherwise, you can follow the stream.
The other problem that this paper talks about is the question of whether you would like to be called a good developer. We usually want to be called a good developer. Not me. I don't really consider myself a developer anymore. But developers want to be called good developers. They want to be recognized as good developers and would also probably be praised for being good developers. And I will come back to that in a moment.
So, the graph that I've shown so far talks about the actual complexity of the system and the maximum complexity you can handle. As I said, there is the maximum threshold of what you can handle, which is somewhere here at the top. And then there is the actual complexity. Actually, this is too much of a simplification, because in your team, you will have good developers who are able to handle more complex code, and there are average developers who can only work with less complex code. I think this definition of good and average developers probably resonates with you and makes some sense, but we will come back to that later.
Now, the problem is: if you have your system with its complexity here and you let it increase too much, then you end up in a situation where the system is so complex that only good developers can change it meaningfully and still handle it efficiently. And the question is, well, can you avoid such growing complexity? As we saw, there is software rot, there is technical debt, so there is only so much you can do about it. The question is also, should you? Because it costs a lot of money to do so. In particular, the Big Ball of Mud paper says that oftentimes developers, particularly good developers, say, "Okay, we are just going to do some practical solution and not this theoretical, architecture-style game that we play with well-structured systems. That's much too much. We just do basic practical stuff." As I said, clean is really hard and requires some effort. So, therefore, you might go in that direction, where you end up building systems that are maybe not that greatly structured anymore.
Now, if you have that situation, then you will probably go to the good developers and tell them, "Well, you saved the day. You are great developers." And I have to admit that I have one developer in mind whom we went to to discuss lots of changes. And I thought he was really, really valuable for the project. But it's somewhat dangerous, because it means there is this one person, and he has some incentive to make the system more complex, because then he can be praised. And I don't think that such a developer would willingly increase the complexity. It's just that it's a slippery slope, and you might end up with increasing complexity for these reasons. It might be something that is rather involuntary and just happens. It's also good for job security, and it also provides interesting challenges. So, for that reason, it might be something to watch out for.
I did a little research, and there are a few people who talk about this that you can also watch in English. Actually, I did that in Lviv a few years back as a keynote at a conference. And this talk is basically about how we think that complex systems are interesting stuff and that we might be led to a situation where we value complexity and we think it's a great thing. Now you could argue that these are not good developers, and I have to admit that I would love to agree. However, if you look at Java certifications, for example, you will see that, in fact, we have an interesting idea about what good developers are.
In Java, we have primitive data types like the byte here, with a small "b." And we are calling a method that takes two bytes as input. The first method that could be called is one that takes an arbitrary number of byte arguments. The second one takes two longs, so the bytes would need to be converted to longs. And the third one uses Byte with a capital "B," so that's the object type. The question really is, which method is called? It's either the first, the second, or the third method, or it's a compilation error. And really, this is about how the bytes become converted. If the two bytes are converted to varargs, it's the first method. If they are converted to longs, it's the second method. If they are converted to the object type, it's the third method; otherwise, it's a compilation error. I took that from a blog that explains Java certification. And I'm not sure how you feel about that. Oh, by the way, obviously, the solution is B. And I figured that out by taking the code and compiling it, therefore having the compiler solve the problem for me.
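The puzzle can be reconstructed roughly like this (a minimal sketch; the class and method names are invented, since the slide code isn't in the transcript). Java's overload resolution prefers widening a primitive (byte to long) over boxing (byte to Byte) and over varargs, which is why the second method wins:

```java
public class OverloadPuzzle {

    static String call(byte... args)  { return "varargs"; } // option A
    static String call(long a, long b) { return "longs";   } // option B
    static String call(Byte a, Byte b) { return "boxed";   } // option C

    public static void main(String[] args) {
        byte x = 1, y = 2;
        // Phase 1 of overload resolution allows widening (byte -> long)
        // but neither boxing nor varargs, so call(long, long) is chosen.
        System.out.println(call(x, y)); // prints "longs"
    }
}
```

So the compiler happily answers the certification question for you, which is exactly the point: this is knowledge a compiler has, not knowledge a good developer needs.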
I have to admit that I'd rather not work on a project that requires understanding such code or where people write such code. So therefore, the question is, why do we ask for such knowledge? It means that we have some broken idea about what good developers are, in my opinion. Let me talk about another thing: micro and macro architecture. Micro and macro architecture is about delegating decisions. Macro architecture decisions are made for all modules. So, for example, you might come up with Kubernetes as a platform, and then there might be a macro architecture decision that says everything should be delivered as a Docker container. This is binding for all modules. There might also be a micro architecture: things that might be different for each module. Sticking to the example, you might use a different programming language in each Docker container. The micro architecture would be left to the team, which, in my example, decides about the programming language as long as they provide a Docker container for their microservice. That's the option that they have. And therefore, the decision about the programming language is delegated to the team.
Now let's talk about static code analysis. Static code analysis is something where you get some information about the code. For example, you get some information about the complexity, about unit test coverage, about blockers. And the question really is, should we make that part of the macro architecture? I would argue that there are three options. Yes, we can have predefined metrics: we could say, okay, you should have at least 80% code coverage. Or we could say, okay, you have to do static code analysis, but it's up to you to come up with meaningful metrics for your team. Or we could say, no, we are not forcing you to use static code analysis at all. Usually, when I ask this question, most people say, well, we have to have at least some kind of static code analysis, so we'd rather force the teams to have it. So they usually take one of the first two options and not the last one. I find that interesting because just a few slides back, we talked about micro and macro architecture and why we are doing this. The reason is that we want to delegate decisions. Now we are enforcing a decision. We are saying you have to do static code analysis. Why would we do that? It's not that static code analysis doesn't make any sense. It's about whether we force the teams to use static code analysis. And I would argue that ideally, we shouldn't force them.
The goal should be that the teams act autonomously. We don't interfere with them too much. They have to deliver some quality, and it's up to them to decide how to do that, with or without static code analysis. As long as they provide maintainable software of high quality, it's fine. So would we force them? Well, I would rather not. However, if I do this, I need to trust the teams to deliver quality, and I need to trust them to choose the right means to do so, which might or might not include static code analysis. But, you know, the trust might be limited. Teams might not be trusted: there might be some team from some external company that we know provides low-quality code, and we cannot trust them. Then we might put static code analysis in place and try to handle the problem that way.
However, there is a problem if you do that. And the problem is Goodhart's law. Goodhart's law says that every measure that becomes a target becomes a bad measure. A typical example is about rats. In the Middle Ages, some city, I'm not sure whether in Germany or someplace else, decided to put a bounty on rats. Every dead rat was rewarded with some money, and obviously, if you kill more rats, chances are that there are fewer rats in the city and you will get rid of them. However, what people started to do was breed rats, and therefore, the number of killed rats became irrelevant for fighting that rat plague, because there were just newly bred rats in some cellars that wouldn't exist if there wasn't a bounty on them.
And the same is true for static code analysis. If you have some reward for high code coverage, then chances are that people will start to game the system and write tests that execute the code but never check anything. So, you know, I have some code that adds two plus two. I execute the code. The code gives us a result of five. And I will never know that this is wrong, because I'm not checking the result. So for that reason, if you start managing by things like code coverage, chances are that you might fail and that your problem might actually become worse.
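As an illustration of how that gaming can look, here is a hypothetical sketch in plain Java (no real test framework; all names invented). The "gamed" test executes the buggy method, so a coverage tool would count its lines as covered, yet the bug is never caught:

```java
public class CoverageGaming {

    // Deliberately buggy: for the inputs 2 and 2 it returns 5.
    static int add(int a, int b) {
        return a + b + ((a == 2 && b == 2) ? 1 : 0);
    }

    // Inflates coverage without catching the bug: the method is
    // executed (so its lines count as "covered"), but the result
    // is thrown away and nothing is ever checked.
    static void gamedTest() {
        add(2, 2);
    }

    // An honest test checks the result and therefore exposes the bug.
    static boolean honestTest() {
        return add(2, 2) == 4;
    }

    public static void main(String[] args) {
        gamedTest();                      // "passes" silently
        System.out.println(honestTest()); // prints "false" - bug found
    }
}
```

A coverage dashboard would rate both tests identically, which is exactly why coverage as a managed target becomes a bad measure.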
Talking about micro and macro architecture, there are other approaches. One approach that I find interesting is what I call the requirements approach. There is a document that talks about the requirements and how you can handle them. That document has chapters: it talks about scaling your system, about security, about how to work with multiple teams. And in each of these chapters, there are two sections: one talks about requirements and one about possible solutions. The requirements for scaling might be: we are going to scale up the system; there will be more and more customers. And actually, there are business goals that say that there are more customers. These business goals are not set in stone; if they change, they will even be increased. And there might be peaks that are not planned for and that just happen arbitrarily. So this is what the document gives you as your problem statement or your requirements. And then it talks about what some solutions might be. You can get a larger machine, so scale up. You can have horizontal scaling, where you have more machines to handle the load. You can shard the data, so separate customers from different regions. You can have graceful degradation, where parts of the system are disabled when the load is too high. You can have asynchronous communication, and so on. And for each of these solutions, there is a description, a list of experts that you can talk to, and advantages and disadvantages.
And what I like about this approach very much is that it clearly communicates trade-offs. It says: okay, if you want to have scalable software, you can do horizontal scaling. That means it will be costly if you have a lot of load, but chances are that you will be able to handle the load, and you have to be careful because there shouldn't be any bottleneck concerning parallelism, for example. So now I can make my own decision. And I would feel very supported by this document, because it says: this is your problem, this is how you can handle it, this is who you can talk to to get it handled. And I feel quite autonomous about it. But at the same time, it requires even more trust, because what I'm basically saying is: here is your problem, figure out how to solve it, here is some help, and speak up if there is a problem that you can't solve yourself. And obviously, I would support the teams, but there is no such thing as control in this environment. So, the discussion that I would have is: are you able to tackle your problems? Can I support you somehow? But it's not going to be about: are you really on track to reach your goals? Let me see precisely how you do that, and so on. And that's quite different from those metrics, where I would say: okay, are you doing 80% code coverage, yes or no? There is much more freedom and autonomy, and it relies on the teams to actually fulfill that role and do what they are asked to do.
So in conclusion, when should I choose what? Well, this depends on people, culture, and trust. If there is a team that I cannot really trust and I give them the requirements approach, we will have a huge problem. Some people need to be controlled. Some want to be told what to do. Some need just some guidance. Some want to decide by themselves; those are really autonomous teams. And depending on that, you can choose one of these approaches. So if you have really autonomous teams, you will probably use the requirements approach. If you want to control people, chances are that you will enforce static code analysis and try to solve the problem that way.
So in conclusion, what now? I think that's an interesting question. You could try to fix the organization, right? You could try to make sure that people actually trust each other more. But I want to develop software, and I don't want to fix the organization. So for that reason, I'm not sure whether I would invest my resources in that. You probably need to live with what the organization is like, and therefore you might need to change your architecture, as I said. So if you can't trust the people, maybe you need to, well, control them and work that way.
If you found this interesting, send an email to this email address, and you will get a link to a OneDrive. That OneDrive includes a lot of material and excerpts from some other books. There is also a QR code that you can scan. The email is answered by an AWS Lambda microservice, so your email address will be logged for 14 days; that's the minimum retention that I can set up in AWS Lambda. And if you mistyped your email address, I will fix it myself; it's going to be handled manually. So that's about the data privacy. Thanks a lot for listening. Thanks a lot for taking the time.
And I'm happy to take your questions now.