Data Mesh in Kubernetes [eng]
Talk presentation
PwC Germany works with a lot of data from different domains and sources, access to which should be properly governed. To tackle those problems and to make access to the data more transparent and straightforward, we're building our internal Data Ecosystem.
In this talk, we will cover the following topics:
- Data storage and analytics evolution
- What is Data Mesh?
- How do we build it in Kubernetes?
- Challenges we have been dealing with and those we see ahead of us.
- PwC Deutschland, Tech Lead of Data Infrastructure
- Andrii was building the internet and making it reliable. Now, let’s crunch some numbers!
- Has over 10 years in the software engineering industry
- For the last 6 years he has been working in the heavily regulated healthcare, fintech, and consulting industries
- Quite interested in Domain-Driven Design, Kubernetes, clouds, and data
Talk transcription
Hello, everyone. Happy to see everybody. Okay, I don't see anybody, but I hope you will enjoy my talk about Data Mesh in Kubernetes. This talk is specifically designed to be Kubernetes-focused and about Data Mesh, basically. Our agenda for today: first, an introduction, which is what's happening now; then data storage and analytics evolution; what Data Mesh is; our motivation to use it; components, implementation, and evolution; the challenges we faced during all of those processes; and, of course, a Q&A session. First, the introduction.
Who am I? I am a domain-driven design enthusiast. I started my career as a software engineer, pivoted into the site-reliability field, and am now the tech lead of our data ecosystem. I'm quite curious about data and, therefore, working with Data Mesh as of now. Below, you can find me on LinkedIn, if you care. Anyway, let's go to the fun part: data storage and analytics evolution. Okay. So, data analytics is not quite a fresh field, so to say. It's not as old as software engineering in general; therefore, it took a lot of approaches and inspiration from software engineering design and was evolving constantly. Modern software architecture. What is it? Well, as you can see, it is not very modern, probably. All of us have faced some monolithic architectures. This one is quite service-oriented; you can see some decoupling happening here. But in general, when you start a small startup, you invest in it, you hire a team, and then you start building.
The first and the easiest one, and I'm not gonna go into detail, is monolithic architecture, where there is a server with some front-end and some back-end running there. Of course, you use some cool techniques here and there, but implementing microservice architecture from the get-go is a rather difficult and, talent-wise, expensive decision. Therefore, the majority of companies start with this. What you can see there is usually a single database that eventually will get sharded and eventually will get replicated. There is a monolith application that is growing, and if structured properly, that will last you for some time.
When you start digging into data, what you can do is try to build some business cases for your business to benefit from if you start building a platform for storing the data that will later be used for analytics. Usually, that ends up with a data warehouse, and it keeps growing and growing. The structure of the team and the approach basically stay the same. You have your back-end engineering team, your front-end engineering team, and then a data analytics team. They provide you with business insights, they dig deeper, and hopefully, they will find something that will help your business to grow. That sounds good, and the solution works, of course, for a while. But as the business matures and keeps growing, the architecture usually has to evolve with it.
For example, if you're running software that has been on the market for many years, then what you see in front of you is basically a bunch of microservices that form an end-to-end solution, right? Each is responsible for some functionality. It is decoupled, if everything goes well. Probably you're following domain-driven design principles and you have domains dedicated to some functionality that represents business domains. That is usually how you decouple, as the software industry mostly exists to help business and therefore tends to replicate business structures. This looks quite scalable, and you structure your teams accordingly: you build self-sufficient, so to say, independent teams that include front-end engineers and back-end engineers.
In the future, such a team might include data analysts working on some features. The teams are basically independent, using all the benefits of microservice architecture. They ship isolated features and functionality and serve isolated domains, making it much more scalable when you scale your business. When you are becoming a scale-up and getting a bunch of investors giving you money, you can finally build your products as you envision them. For that, of course, you need to increase hiring, standardize technology stacks, and factor in a lot of nuances. That is roughly how it usually looks, regardless of the infrastructure that lies behind it, and that is the most recent, so to say, go-to architecture.
But this talk is actually about data mesh. So, what is data mesh? Data mesh is a concept that I usually describe as a microservice architecture applied to data. The main principles are treating data as a product, treating the data you expose as a self-sufficient product available for consumption by other teams. It also includes federated governance, which is a very important part of any company. You need to have control over who can access what and so on. Domain ownership comes from the idea of microservice architecture and domain-driven design. A team owns the domain as a software domain and as a business domain, and therefore, as an analytics domain. In this case, a team that exposes data is responsible for the quality of the data, interfaces, and all the things needed to discover and work with this data. And last but not the least important one, self-serve.
Recently, one of the hottest trends in software architecture and engineering is platformifying everything. Whatever you have, make a platform out of it; if you can make a platform for platforms, good for you. Why is it a hot topic these days? Because it enables people to utilize common infrastructure and, in some cases, serve themselves. They would go to some Backstage UI, pick the service they need, use scaffolding, and in a few clicks provision a new service with all the required resources, infrastructure as code, and more. This is a crucial part of a team's independence, impacting the team's velocity and, therefore, the time to market for features. It determines whether you beat your competition or at least catch up with them. These are the main principles around data mesh, relevant not only to data mesh but also to the software side of architecture these days.
Why would we, as PwC Germany, dive into this mesh? The answer is reflected in the structure of the company. PwC, in general, is a global brand. Each territory, like Germany, the UK, or the USA, acts independently, and within each territory, there are partners who also act independently. When everyone around is so independent, they need a place to run their architecture. For example, you start with some analytics parts, then try to decentralize things so people get more freedom and can achieve more. But over time, you swing back, in some cases, to centralize some of the core features, functionality, and structures that you need. So a shared setup that the partnerships can take advantage of adds value. For example, let's talk about data.
Let's take, for example, publicly available data. When everybody accesses public open data at some point, they each implement different ways to access it, to work with it, to clean it. Therefore, there is a lot of redundancy, a lot of steps that could have been done only once. Also, you need to govern access to this data, and here we come. Why do we do that? It was working just fine yesterday, two days ago, one decade ago. It was always working just fine as it was before. But while the whole world is moving forward, you probably should too.
What it brings us is reusability of data products like public information and faster time to market. When you have standardized interfaces to access the data, it's much faster to produce any analytics and make decisions based on it. And of course, decentralized data teams: it basically enables everyone to work with the same data in different ways, the way they prefer. Now, let's discuss the components and their implementation. I'll share the components that we build from. As you remember, we are using Kubernetes, sitting a little bit on top and a little bit at the bottom of the layer where the clouds come in. We try to build our infrastructure as independent from vendor lock-in as possible. Also, I'll share the implementation, how all of it works, and we'll discuss a little bit of the evolution side of those components.
There is a bunch of software here, right? A lot of resources: Kubernetes, Apache Ranger, cert-manager, Helm, Istio, dbt. Anyone who works with data has probably heard about some of those, like dbt, Hive, Trino, Apache Atlas, DataHub. Those are great products that we try to implement in our system and offer to various data users and analytics teams. Of course, we are not providing them just Kubernetes clusters and a bunch of Helm charts, right? Out of all the tools that are requested from us, we try to build something reusable that can be used in the future by each and every team that joins our data ecosystem. Whenever you want to expose data or consume it, you get a lot of information about how to use it. Our landscape and overall architecture look like this. We have the governance domain, data products, pipelines, self-service, in this case Backstage, and what I would call query services, like Trino and Apache Superset, though I wasn't sure what to call them in this presentation.
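To make the query-services layer concrete, here is a minimal sketch of how a consuming team might reach a data product through Trino with the Python `trino` client. The hostname, catalog, schema, and table are placeholders for illustration, not our actual deployment values.

```python
import trino

# Connect to the platform's Trino endpoint (all names below are placeholders).
conn = trino.dbapi.connect(
    host="trino.data-ecosystem.example",
    port=443,
    http_scheme="https",
    user="analyst",
    catalog="hive",
    schema="public_data",
)

# Run an ad-hoc query against a hypothetical data product table.
cur = conn.cursor()
cur.execute("SELECT country, count(*) AS companies FROM companies GROUP BY country")
for country, companies in cur.fetchall():
    print(country, companies)
```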
As you can see, we are using Ranger, Keycloak, HashiCorp Vault, DataHub, Apache Atlas, and Great Expectations. All of them should work nicely together to ensure our governance. Apache Atlas is used for data governance and security and also serves as a data catalog; DataHub is mostly a data catalog. Great Expectations, if I'm not mistaken, is a data quality tool that also falls under the governance domain. Keycloak is a nice tool that helps us track all the users and what they can and cannot access, and Apache Ranger enforces policies.
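As an illustration of the data-quality side, here is a minimal Great Expectations sketch using the classic pandas-based API (newer releases moved to a data-context workflow, so entry points vary by version). The sample file and column names are hypothetical.

```python
import great_expectations as ge

# Load a sample export of a data product with the classic pandas-based API.
df = ge.read_csv("company_registry_sample.csv")  # hypothetical sample file

# Declare the expectations the data product promises to its consumers.
df.expect_column_values_to_not_be_null("company_id")
df.expect_column_values_to_be_unique("company_id")
df.expect_column_values_to_be_between("founded_year", min_value=1800, max_value=2030)

# Validate the sample; the result can feed into governance reporting.
result = df.validate()
print(result)
```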
Why this setup for governance? Apache Ranger is a known project, a known product that has been on the market for a while already, and it helps us to establish governance at, I would say, a low level. It has a good connection to Hive. With Ranger policies, we can control the access that users have down to the table and column level. Unfortunately, it's not a policy-as-code solution out of the box, but everything's fine. It stores its policies in a database, but that's a topic for another discussion.
Super Set and Trino are used to query and assess all of the data products. As you can see, we can have multiple ones, and we do have multiple stakeholders that store their data with us. Of course, we also provide them means to work with that. For example, Jupyter Notebooks. We use Hive to connect to S3 endpoints, to databases like Postgres, Microsoft databases. Also, we use Azure Data Lake, specifically Gen2. But whenever you work with products, we already covered the governance parts, we covered the querying parts. But how does data get into those data products? It comes from different pipeline solutions like Kafka, Apache Airflow, Argo Workflow, and if you wonder what this squid is doing, Apache Spark, Apache Camel, and dbt, why not? Yeah, all of those sources are available for you to ingest your data products, which, I think, is very useful. Also, we have a few more tools that we are using in the future to build data.
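To illustrate the pipeline side, here is a minimal Apache Airflow sketch of a daily batch ingestion into a hypothetical data product. It follows the Airflow 2.x API (older versions use `schedule_interval`), and the DAG, task, and dataset names are made up, not one of our real pipelines.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_public_data():
    # Placeholder for the real extract/clean/load logic of the data product.
    print("pulling the public dataset and loading it into the product's storage")


# A daily batch ingestion into a hypothetical "public companies" data product.
with DAG(
    dag_id="ingest_public_companies",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(task_id="ingest", python_callable=ingest_public_data)
```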
For example, we can also offer Power BI, which somebody can use to query the data and build analytics, creating Power BI dashboards or anything else considered valuable. Last but not least, the self-service domain. That basically consists of Backstage and all of the integrations into it. I didn't mention them here, but you can imagine that we are building some kind of forms that help you spin up data products and spin up integrations, serving as documentation and self-service endpoints for your teams. It is a crucial part of the whole setup. Of course, you can live without it, but then there would be some bottlenecks, and you would probably need different decisions based on the number of your customers. For example, if you serve one team, it's quite easy to provide the whole setup by yourself, right? When you serve 20 or 50 teams, that would be a rather annoying part of your job. So we automate and give as much power to the users to do whatever they want, whatever they consider will bring value. For that we use Backstage, of course.
Challenges that we are facing, that we were facing, and probably will keep facing. One of them is policy as code. As a tech lead, I prefer to follow the GitOps way, and as I mentioned before, some tools do not offer that. To make sure that we can recover from anything and restore our environment to the best possible state, preferably the state just before a disruption, we take some detours. As I mentioned already, we prefer everything to be done the GitOps way; that's why we have part of this infrastructure as Argo workflows and whatnot. We decided, for example, to put the policies for Ranger into JSON files, which then get synced with the environment, and for that particular case, we developed our own operator. So we can always be sure that we are using whatever is in Git. Data quality is another big topic. Usually, that's not a concern for my particular team; we basically try to provide the best environment for others to thrive. And therefore, we need to provide data quality tools like Deequ or Great Expectations.
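To illustrate the idea behind that operator (not its actual implementation), here is a rough sketch of pushing version-controlled policy files into Ranger through its public REST API. The URL, credentials, and repository layout are placeholders, and a real reconciler would also diff against and update existing policies rather than blindly creating them.

```python
import json
import pathlib

import requests

# Placeholder endpoint and credentials; in reality these would come from Vault.
RANGER_URL = "https://ranger.data-ecosystem.example"
AUTH = ("admin", "change-me")

# Apply every policy JSON file found in a Git checkout of the policy repo.
for policy_file in sorted(pathlib.Path("ranger-policies").glob("*.json")):
    policy = json.loads(policy_file.read_text())
    response = requests.post(
        f"{RANGER_URL}/service/public/v2/api/policy",  # Ranger admin REST API
        json=policy,
        auth=AUTH,
        timeout=30,
    )
    print(policy_file.name, response.status_code)
```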
Another challenge that you might see coming is a lot of small databases. In this architecture, it is not necessarily the case that every data product is a monster holding tens or hundreds of terabytes of data, maybe even petabytes. Maybe the data set that a customer wants to acquire is rather small, and they would prefer a simple Postgres database. In this case, you end up having quite a few of those small databases taking up space on the nodes, and it depends on how often they are used. If they are not queried enough, you can basically look at that as a waste of node space. Therefore, even though we try to build a cloud-agnostic solution that we can reuse on each and every cloud, we still end up provisioning some of the databases directly from the cloud provider. That is sometimes not avoidable, I would say, although we're trying to avoid any vendor lock-in, by which we mean commercial vendor lock-in.
Such lock-in can make migration take a long time, whether in case of an emergency or for any other reason. And by avoiding vendor lock-in, we can offer the solution to more of our customers. Maybe a customer of our customer is interested in something like that, right? Why not offer one more solution to them? Especially if you have the expertise in the team. Platformization of all the things: that is a challenge in itself. It just sounds cool: I just create a form and somebody will fill it in. But then what happens? You need to provide a repository, the resources, the infrastructure, the infrastructure as code, and all of the things to make sure that it's working. That also brings us to the maintenance topic, which is not the simplest one, as it takes, I would say, a huge part of our effort. As you have seen, that's a lot of software to maintain, to make sure that versions work smoothly with each other and everything is compatible.
And whenever you upgrade the Kubernetes cluster, you can imagine how much effort it might be if you're migrating from a version like 1.24, for example, or maybe even older ones. Our customers don't necessarily want to spend time and effort on doing it by themselves; therefore, we have this cloud, this data ecosystem solution. For example, one of the recent cases was the upgrade of Kubernetes from version 1.24 to 1.26. There were a couple of resources that were deprecated. Let's take Apache Camel, for example; that's a hot topic in our team right now. The Apache Camel operator, known as Camel K, is actually a great tool that helps you build integrations using YAML, using Java, using Python, using... isn't that enough? There are, of course, some ways to build integrations that we are not using in our team in particular; we prefer Java in this case.
But whenever you need to upgrade Kubernetes, that might affect your operators around the cluster, and that was the case for us. The version of the Apache Camel operator that we were running did not support the latest API version of CronJob. It is just a version bump, but it cost us a lot of effort to update Camel itself and the Camel K operator that provisions it. It cost us a lot of effort to update each and every integration and to synchronize with all of the stakeholders using this operator to make sure that they update to the latest version. And what do you know? Maybe you just finished this update and the next one is coming, and now you need to dig deeper into another operator, Istio, for example, or whatnot.
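For context, CronJob graduated from batch/v1beta1 to batch/v1, and the beta version was removed in Kubernetes 1.25, which is exactly the kind of bump that breaks operators and manifests during an upgrade from 1.24. A small helper like the sketch below can scan a GitOps checkout for such leftovers; the repository path and the version mapping are assumptions for illustration.

```python
import pathlib

import yaml  # PyYAML

# API versions removed around the Kubernetes 1.25 boundary; CronJob is the
# one that bit us. Extend the map for other resources as needed.
REMOVED = {("batch/v1beta1", "CronJob"): "batch/v1"}

# "manifests" stands in for a checkout of the cluster's GitOps repository.
for path in sorted(pathlib.Path("manifests").rglob("*.y*ml")):
    for doc in yaml.safe_load_all(path.read_text()):
        if not isinstance(doc, dict):
            continue
        key = (doc.get("apiVersion"), doc.get("kind"))
        if key in REMOVED:
            print(f"{path}: {key[1]} still uses {key[0]}, migrate to {REMOVED[key]}")
```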
Supporting such a variety of tools, of course, comes with a cost. The question is, how many of those are actually used at any given time? You might end up in a situation where nobody has used Spark for quite a while, or maybe nobody is using Apache Airflow for whatever reason. They just don't feel like it, and the teams that required it before have migrated to some other tools, right? And in the end, you'll end up retiring some of the solutions, because it just doesn't make sense to maintain something that is not used.
That brings us to the standardization of the ecosystem and the offerings. It happens in our team the same way it happens in the majority of software companies. First, you start with something small. Then you get traction, and of course, you start offering more and more and more, because you see there is demand. But eventually, you come to the conclusion: hey, that's great, everything is shiny, our customers are happy, but what about standardization? Maybe not every team needs its own separate tool. Maybe some use cases can live without some parts of it. Like, maybe don't use Apache Camel, that would be nice, and use something else. Or maybe you would try to provision Camel separately from the operator.
That comes with costs, sometimes even losing customers who were the only ones using one particular tool. For example, that happened with one of our offerings for machine learning. Only one team was asking for it, and they were not using it much in the staging environment. And then we faced a lot of issues when we came to production, because in this tool, basically, every communication and any workload was visible to anybody. That is not something that we can allow in our company, as we are regulated not just by the market and our reputation, so to say, but also by the government. And therefore, we retired some of the machine learning tooling that spawned workloads and allowed a functions-style approach in our cluster. And now we need to live with that decision.
The support was not worth it. The amount of effort it would cost us to bring it up to company standards would not justify serving this particular customer with this particular tool. They ended up using other tools, though. Yeah. That's about the challenges: about vendor lock-in and about the challenges of platformization. And the biggest challenge is to get it all to work together: different versions, different interfaces. Every solution has something that it does in its own way, which might or might not be compatible with other tools. As you might have seen, Hive and Ranger, for example, are part of the Hadoop family of things, and so is Apache Spark.
I would also like to talk about how you choose tools to satisfy your teams' needs. First, one of the most important aspects is to assess the workload. What are people trying to achieve? What do they want to do? Basically, understanding their plans and expectations. Maybe you don't need something like Kafka for streaming events if you don't have a continuous influx of traffic. In this case, Apache Spark might be a better solution, enabling batch processing without the need to maintain something like Kafka, as in the sketch below. Of course, in our case, we maintain both, because we have different customers who require data from various endpoints. Some of them need regular updates, and we have to siphon all that data into the platform.
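As a small sketch of that trade-off, a scheduled PySpark batch job can cover periodic data drops without running Kafka at all. The bucket paths and column name below are placeholders, not real data products.

```python
from pyspark.sql import SparkSession

# A scheduled batch job instead of a streaming pipeline; paths are placeholders.
spark = SparkSession.builder.appName("daily-public-data-ingest").getOrCreate()

# Read the latest raw drop, deduplicate, and publish it as the data product.
raw = spark.read.json("s3a://landing-zone/public-data/latest/")
cleaned = raw.dropDuplicates(["record_id"]).filter("record_id IS NOT NULL")
cleaned.write.mode("overwrite").parquet("s3a://data-products/public-data/")

spark.stop()
```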
A significant topic for data ecosystems and data meshes is data discoverability, which deserves a separate discussion. Data discoverability enables a team to find suitable data and possibly request it. When you have worked with data for a while, you can look at a sample and make a judgment call on whether it suits your business case. If some public data fits your business case, you can request it through the proper channels. The first evaluation and exposure to the data happen through tools like DataHub and, of course, Apache Atlas - solutions for data cataloging. These tools offer insights such as data lineage and relevancy. Knowing when data was acquired and how often it updates allows a data analyst to make a judgment call on its suitability. If it is suitable, then you can dive deeper. These aspects are crucial in any decoupled architecture, akin to service discoverability in software. If it's difficult to find data, you might end up duplicating the work of acquiring and exposing it, something we'd prefer you not to do. That's essentially it about policies, data quality, and the challenges we were facing. Feel free to ask any questions you may have.