The Developing Dev
The Peterman Pod
AWS Distinguished Eng: Learnings From 3000 Incidents And How Engineering Is Changing | Marc Brooker

Where caching is bad, thoughts on the industry, and learnings across his career

In this episode, I talked to Marc Brooker, a distinguished engineer at AWS who started there as a new grad and rose through the ranks. We discussed technical learnings from 3,000+ cloud system postmortems, how software engineering is changing with AI, how to find impactful problems and much more.

Check out the episode wherever you get your podcasts: YouTube, Spotify, Apple Podcasts.

Timestamps

1:27 - Finding problems that matter

11:42 - Learnings from 3000 postmortems

23:58 - Why caches are bad

29:37 - How AI will change software engineering

36:49 - Advice for junior engineers given AI

44:02 - Thoughts for senior engineers

49:59 - Why engineers should write

57:51 - Visibility and apparent expertise

1:04:23 - AWS engineers he admires

1:06:53 - Technical book recommendations

1:09:06 - Advice for his younger self

Transcript

1:27 — Finding problems that matter

Marc:

[1:27] Yeah, I think you have to go super broad. There's a set of those things that come in from customers, from the world: here is an unsolved problem. I spend a lot of time meeting with AWS customers and listening to them talk about what are the things they still find difficult in our space? What are they investing in? Where are they spending their time?

[1:48] Where would they prefer to not be spending their time, so they can focus on their core business instead? So that's one rich seam of ideas. Completely at the other level is looking at the technical trends. You can look at just the speeds and feeds: wow, networks have gotten faster, storage has gotten faster, we've seen this huge explosion in multicore and now in GPUs. And so there's a bottom-up innovation trend there too, which you can also look at and say, well, this enables all of these new things.

[2:27] And then, broadly, across the world: what are the big trends that are going on? What are the things that are changing in our industry and in the world? It's really those moments of change that bring with them the opportunity to build things and to recognize problems. So to pick one concretely.

[2:52] When I was working in the Lambda team in 2020, I was talking to a lot of customers who were super excited about building on serverless, super excited about building on containers. There had been this massive shift, and what people were seeing then was: wow, I love these serverless products, I love building this way, but the world of data, and especially relational data, doesn't fit super well into this paradigm, right?

[3:19] These relational databases are still very serverful: fantastically powerful products, but not operationally the same. And that thinking just felt super important to me: wow, these customers have brought me a gift of understanding something that's really important. And so I joined the Aurora team. We built Aurora Serverless, and then we built DSQL.

[3:45] We've been investing deeply across all of our database products to make them a better fit for these serverless and container workloads. So that's an example of a trend that was brought by a customer. But then there are also trends that have been driven by architecture or by other things going on, right? Faster networks, faster compute, faster connectivity.

[4:15] And so one of the big technical trends in the database world right now is object storage becoming the default backend, the default durability layer, for databases of all kinds, from analytics workloads to online workloads. And there's been this incredible explosion around that. So if you look at what we did with Aurora DSQL, for example, that was very much learning from that trend and taking a lead on it, and saying: we're going to make S3, this object store that we built 20 years ago, the

[4:55] underlying durability layer of this new database. But obviously it doesn't have the latency properties or the rich interface that an online database needs, so we're going to build an architecture on top of it that deals with all of those other things in a much better way, but doesn't have to worry about durability. So that was this perfect collision of a set of things I was hearing from customers and a set of technical trends coming together, and thinking: wow, we've got the opportunity to build something now that is going to be a market-leading product, something that would be hard to imagine without either of those input signals.

Ryan:

[5:39] I saw something that you wrote where you mentioned that you were on call for 15 years. And I've heard many stories of more senior engineers negotiating out of on call because, per unit time, it could be perceived as not that impactful. So why did you stay on call for so long?

Marc:

[5:59] I would say that the majority of my in-practice knowledge about how to build distributed systems has come from being on call and from analyzing and deeply understanding these postmortems and COEs. One of the challenges of running a company like AWS and running large-scale systems is that folks come out of college with often great knowledge of computer science fundamentals, great programming skills, great mathematical skills.

[6:34] All of that stuff is fantastic. But it comes without the grounded knowledge of what it actually means to run and understand systems, and on call is one of the best ways to learn those things. One of the best ways to see: how do systems really run? How do they really behave? How do customers really use them? What happens when customers use systems in unexpected ways?

[7:01] How can we make systems more resilient to customers using them in different ways? And I think that should almost be a goal of on call, right? If you have folks on your teams who are on call and they're just closing the same ticket over and over and over, well, that's where you need to just build some automation. And building automation is easier than ever; it's more powerful than ever.

[7:24] Fantastic. But where you really want to spend the time of the deep experts on your team is: here's something unexpected or unusual that's happened in the system. Let's deeply understand that, and let's bring that knowledge back, both to improving that system and to communicating broadly to the company and the outside community what we've learned from it. And so one of the most powerful things we do at AWS is this mechanism of a very broad weekly meeting where we all get together, engineers and senior leaders from across AWS, and talk about COEs, these postmortems that we write, and what we can learn from them and how we can apply those lessons across the whole company.

[8:22] And I think that particular mechanism, that particular Wednesday-morning meeting that we have, is one of the things that has been a core, almost causal factor behind AWS's success, because it has allowed us to, and forced us to, spend leadership bandwidth, expertise, and the time of our best engineers deeply understanding how our systems operate and why they operate the way they do.

[8:58] And that level of being just extremely grounded in reality helps you design better products, helps you architect better systems, and helps you think more clearly about the next round of things. It helps you fix issues. And so it's this fundamental learning exercise. It's a real blessing. So I would recommend on call to anybody who wants to learn about the practice of distributed systems.

[9:29] And I would certainly recommend spending time reading COEs, reading postmortems, and deeply reflecting on not only what can we fix tactically, but what can we fix organizationally and strategically, and what kind of tools might need to exist to prevent this kind of thing happening again. And you asked earlier about where ideas come from. This is another fantastic flow of ideas: wow, we seem to be solving the same problem over and over in different ways and getting it slightly wrong every time.

[10:06] Can we extract a tool to do that? Can we build a service around that? Can we build a feature around that, to make it easier for us to get right and easier for our customers to get right?

Ryan:

[10:20] Yeah. It's interesting, because I think if you ask most engineers, they really avoid on call, but it sounds like you kind of go towards it, and you've learned a lot from it because it's a major source of customer problems.

Marc:

[10:35] And again, for me it comes down to optimizing for finding the most important things to work on. If you aren't close to operating your actual system and you don't know how it's actually working, how are you supposed to identify what to fix? You can come up with some theories, but they're probably not going to be right. And again, I don't think there's a huge amount of value in the rote ticket-closing work of on call.

[11:06] I think automation should be doing that kind of work. But I think there's fantastic value in deep understanding, deep investigations, and deep reflection on what you learn from postmortems and COEs. I tried to estimate a couple of months ago, for a talk, how many industry postmortems and Amazon COEs I'd read over my career. The best estimate I could come up with, and this was about a year ago,

[11:33] was between 3,000 and 4,000. And so, you know, even a little bit of a lesson from each one tends to stick.

11:42 — Learnings from 3000 postmortems

Ryan:

[11:42] Yeah, that was my next question, actually. I looked at the slides from that internal presentation and it said, I’ve read approximately 3,000 cloud system postmortems from across the industry. And my immediate thought was, I wanted to ask you what makes a good postmortem?

Marc:

[12:00] So I think what makes a really great postmortem is, first, really getting into the details and making sure that you deeply understand what happened, rather than just assuming what happened based on the biases you bring in. And there's a kind of lesson one there: if you can't understand what happened, well, that teaches you something about your logging and metrics and observability and simulations and all of these other things.

[12:30] And then once you deeply understand what happened, a great postmortem steps through the whys behind it at multiple levels. Right? Why? Well, yeah, there was a code bug. Okay, sure. Code bugs, yes, we can fix that, but we can't stop there. Why was that missed in testing and validation? For these reasons. What can we improve? What can we build around those?

[12:58] Okay, next step. Why was our testing and validation where it was? Or why did we assume a certain thing about the behavior of the system that we wouldn't have assumed before? As you get through these deeper and deeper layers, a great postmortem not only identifies fixes to the proximal cause, but also identifies broader fixes to technology, to organizations, to products, and so on.

[13:31] And so that's a multiple-levels thing, right? You can't get stuck on the most proximal cause of an incident, but you also can't get stuck on: well, things fail sometimes, and what are we going to do about it? You have to come up with a set of really concrete action items to fix things at different levels. Fix this particular line in the software that caused something; fix the testing processes that didn't catch it; fix the, maybe, social or team processes that led to those technical processes.

[14:14] And then, if you're seeing patterns across multiple postmortems, level those up and say: well, clearly there's a hard underlying problem here. Can we build a service around that? Can we build a library around that? Can we build a community of practice around it? Are there technical changes we can make to avoid whole classes of things? So that's quite a long-winded answer.

[14:43] But I do think it all flows from understanding, and understanding at multiple levels: understanding immediately what happened, but also understanding broadly what happened technologically and organizationally and in context, and then the ability to connect that particular event or postmortem with other ones and extract those patterns. One of the things that we did in DSQL was, as we were designing it, we spent a lot of time looking at relational-database-related postmortems, both our own and our customers', and thinking about how we can design a database that helps people avoid falling into those traps.

[15:31] A really common outage pattern for folks with relational databases is that a client in a distributed system starts a transaction and then goes out to lunch for whatever reason. That could be a GC pause, or a lossy network, or a loss of connectivity. And now it's holding locks. And if you look at relational databases, they don't tend to be resilient to clients misbehaving in that way.

[15:59] And that's a really common cause of operational issues for systems built on relational databases. So as we were designing DSQL, we were thinking: how do we broadly avoid that class of problems? So folks can say, hey, I'm going to build on DSQL and just not have this whole class of problems. And I think that's a really powerful outer loop over the postmortem process: how do we turn all of these lessons into new services and into service improvements?
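The stalled-client outage Marc describes can be sketched with a toy pessimistic lock table. This is purely illustrative (the class and key names are invented, and no real database implements locking this simply), but it shows why a client that acquires a lock and then goes silent freezes every later writer:

```python
class LockManager:
    """Toy pessimistic lock table: the first transaction to lock a row wins."""
    def __init__(self):
        self.locks = {}  # row_key -> holding transaction id

    def acquire(self, txn_id, row_key):
        holder = self.locks.get(row_key)
        if holder is not None and holder != txn_id:
            return False  # blocked: another transaction holds this lock
        self.locks[row_key] = txn_id
        return True

    def release_all(self, txn_id):
        self.locks = {k: v for k, v in self.locks.items() if v != txn_id}

lm = LockManager()

# Transaction 1 touches a row under a pessimistic lock, then stalls
# (GC pause, lost connection) without committing or rolling back.
assert lm.acquire("txn1", "account:42")

# Every later writer to that row is now blocked until a timeout fires
# or an operator intervenes: the outage pattern described above.
assert not lm.acquire("txn2", "account:42")
```

Until `release_all("txn1")` runs, `txn2` can only retry and fail, which is why lock-based designs need aggressive lock timeouts while lock-free designs sidestep the problem entirely.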

Ryan:

[16:35] How do you prevent misbehaving clients from being a problem for the database?

Marc:

[16:40] Yeah, so in DSQL's case, we have no pessimistic locking. Within the scope of a transaction, all of the reads happen using this mechanism called multi-version concurrency control, where for every row in the database we store a history of versions. And so you can read an old version of a row without blocking writers, without saying: hey, you can't update this because I just read it.

[17:11] And then, locally within the query processor that's handling a connection, we spool the writes, and then you get to commit time and we do this optimistic check: can I commit this transaction at the transaction commit time? And so, combining those two mechanisms, multi-version concurrency control with the scale-out storage that comes with it, and the commit-time optimistic checks, we can strongly say that there is no way that a reader of a piece of data can block writers.

[17:46] And there's no way that a writer of data can block readers. Writers can't block writers either, but they can prevent other writers' transactions from eventually committing by making a bunch of changes. And that is inherent to the definition of the particular database isolation level.
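The two mechanisms Marc combines, versioned snapshot reads plus an optimistic commit-time check, can be sketched in a few dozen lines. This is a toy model, not DSQL's actual implementation; every class and method name here is invented for illustration:

```python
import itertools

class MVCCStore:
    """Toy multi-version store with optimistic commit-time checks."""
    def __init__(self):
        self.versions = {}           # key -> list of (commit_ts, value)
        self.clock = itertools.count(1)

    def begin(self):
        return {"start_ts": next(self.clock), "writes": {}}

    def read(self, txn, key):
        # Snapshot read: the latest version visible at the transaction's
        # start time. Takes no locks, so it never blocks writers.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= txn["start_ts"]]
        return visible[-1] if visible else None

    def write(self, txn, key, value):
        txn["writes"][key] = value   # spooled locally until commit time

    def commit(self, txn):
        # Optimistic check: abort if any key we wrote was committed by
        # another transaction after our snapshot was taken.
        for key in txn["writes"]:
            if any(ts > txn["start_ts"] for ts, _ in self.versions.get(key, [])):
                return False
        commit_ts = next(self.clock)
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((commit_ts, value))
        return True

db = MVCCStore()
t0 = db.begin()
db.write(t0, "k", "v0")
assert db.commit(t0)

t1, t2 = db.begin(), db.begin()
assert db.read(t1, "k") == "v0"   # snapshot read, no locks taken
db.write(t1, "k", "from-t1")
db.write(t2, "k", "from-t2")
assert db.commit(t2)              # first writer to commit wins
assert not db.commit(t1)          # concurrent writer aborts at commit time
```

Nothing ever waits in this model: readers see a consistent snapshot, and conflicting writers find out at commit time rather than blocking each other, matching the behavior described above.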

Ryan:

[18:18] Out of curiosity, in practice, what percent overhead would you expect for keeping copies of old rows for the sake of those stale reads?

Marc:

[18:28] Yeah, it’s actually surprisingly small. And it’s surprisingly small because if you look at the access patterns for most online databases, even ones that do a lot of write traffic, that write traffic tends to be quite concentrated. And it’s quite unusual for an online database workload, or even an analytics workload to make a second version of every row in the database. Typically what it’s doing is making a, you know, first, second, third, hundredth version of this row and a 50th version of that row.

[18:59] But the vast majority of data isn't changing. And so it's super workload-dependent, as is everything in the database world, but the overhead tends to be relatively small. I would say it's unusual for an online database workload for that storage overhead to be more than about 10%.
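As a back-of-envelope illustration of why the overhead stays small (the numbers below are invented for illustration, not AWS figures): if writes concentrate on a small hot set of rows and old versions are retained only briefly for in-flight readers, the extra versions are a tiny fraction of the table.

```python
# Hypothetical workload: a large table where write traffic is
# concentrated on a small hot set, and old row versions are retained
# only for a short window to serve in-flight snapshot reads.
total_rows = 10_000_000
hot_rows = 50_000                  # rows updated within the retention window
extra_versions_per_hot_row = 3     # old versions retained per hot row

extra_versions = hot_rows * extra_versions_per_hot_row
overhead = extra_versions / total_rows
print(f"storage overhead from old versions: {overhead:.1%}")  # 1.5%
```

Even tripling the hot set or the retained versions keeps this toy estimate well under the roughly 10% ceiling mentioned above.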

Ryan:

[19:23] From my experience, I've seen an interesting dichotomy between teams. Some teams really understand postmortem culture; they tend to be infrastructure teams, and they take it really seriously. Everyone on those teams, the tech leads, are asking you: hey, why did that happen? And they really follow up and make sure it's not a problem. Then I've also noticed other teams where that muscle is less strong, teams that don't take it too seriously.

[19:51] What would be your pitch for why they should take it seriously?

Marc:

[19:55] Yeah, it all comes down to where you want to spend your time, right? Do you want to spend your time improving your product and making it better, or do you want to spend your time fighting the same fire over and over? Really, building a great postmortem culture is about making sure that, at the product level and at the organizational level, you are fixing known issues and you are avoiding having the same problems multiple times.

[20:35] And typically, when I see teams that have poor postmortem culture, there's probably one of two failure modes going on. One of them is a lack of focus on the outcomes, right? A lack of, I wouldn't say caring enough, I think that's a little bit too personal, but of being really focused on: is this product performing super well?

[21:06] Are we really making our customers happy? And that is fundamentally a cultural and leadership problem of setting the right standards. Oh, and by the way, I don't think standards should be uniform, right? There are places where the details really, really matter, where things like durability are just critical, and you do need to have super high standards in those places. And there are places where you want to optimize for other things, and maybe have a higher production defect rate.

[21:44] And I think that's okay, as long as it's an intentional decision that's being made. So that's case one: insufficient focus on the outcome. I think case two, and this is a harder one to change, is normalization of operational heroics. We don't need to fix these root causes, because our on-calls are superheroic and they're going to stay up all night, they're going to hack around things, and they don't mind being paged 100 times a week.

[22:16] And it can feel from the inside like a good culture, right? Oh, wow, these people are super strong owners. They're super engaged, they really care, they're really working hard on call. And those are all good signals. But then when you look at it from the outside, it's like: wow, we're not actually fixing the causes of things. We're just making this fantastically expensive investment of taking all of these people and their strong ownership and their expertise and spending them on this break-fix cycle.

[22:46] And that's where you need to look at it from the outside and say: well, let's take this fantastic energy of this team and focus it on improving the service, getting out of the cycle, finding new things to fix, finding new things to build. And that can be hard, because for the folks who've been in that mode, it feels so good, it feels like caring about your customers and your product and your business. It's hard to realize that, oh no, we're actually caring about it at the wrong level, and we're not serving our business in the best possible way by being so narrowly and tactically focused on this break-fix cycle.

[23:34] And that’s where you sort of need to pop them out and say, well, let’s spend more time thinking about the postmortem. Let’s spend more time thinking about the causes of things. Let’s spend more time addressing these things in a more strategic way. And wow, okay, now you’ve got so much more time to do that because you’ve broken the cycle and you can improve your product in different ways.

23:58 — Why caches are bad

Ryan:

[23:58] I mean, since you have worked at AWS for almost two decades, I'm sure you have a lot of experience building distributed systems. And one of the most common pieces of advice that you hear, I guess in the context of system design, almost 100% of the time, is people saying: just throw a cache on it. Or you'll have a system design and you say, how do you make it better? Let's put a cache here, let's put a cache there.

And I saw you had a tweet saying that there are cases where caches are bad, despite people saying they're best practice. I was curious if you could explain that.

Marc:

[24:35] Yeah, so caching is good, right? It's: hey, I'm going to take these core ideas from computer science, temporal and spatial locality, and I'm going to exploit them to make my system faster, scale better, et cetera. Obviously very attractive. But the downside of caches, especially in distributed systems, is that they have modes. There's a mode where the cache is full, and full of the right data in time and space, and the system performs very well.

[25:08] And there's a mode where the cache is empty or contains the wrong data. In the first mode, the system is fast and happy and healthy. In the second mode, the system is slow, often down, because now the backend isn't scaled to deal with all of this uncached traffic. Customers are very disappointed. And often it is down in a stable way. This is the idea of metastable failures, where the system has switched from state one to state two.

[25:43] And in state two, it's still stable, right? It's down, but it's not going to come back up under its own energy, because, for example, all of this traffic is causing a huge amount of contention in my database, or is saturating the network, and so I can't even refill the cache; it's not even getting the right kind of data in. So when I talk about the downsides of caches, it's really about: how do we keep the value of caches, that fast mode, while avoiding the modality, the state where we're down?
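The two stable states Marc describes can be seen in a toy model. The numbers and update rules below are invented for illustration: a backend sized for warm-cache traffic serves everything while the hit rate is high, but once the cache is flushed, the overloaded backend can never refill it.

```python
def step(hit_rate, load_rps, backend_capacity_rps):
    """One tick of a toy cached system: returns (requests served, new hit rate)."""
    misses = load_rps * (1 - hit_rate)
    if misses <= backend_capacity_rps:
        served = load_rps
        new_hit_rate = min(0.99, hit_rate + 0.05)  # backend healthy, cache refills
    else:
        served = load_rps * hit_rate + backend_capacity_rps
        new_hit_rate = max(0.0, hit_rate - 0.05)   # timeouts, entries expire unfilled
    return served, new_hit_rate

load, capacity = 10_000, 2_000   # backend sized assuming an 80%+ hit rate

# State one: warm cache. Everything is served and the cache stays warm.
hit = 0.95
for _ in range(10):
    served, hit = step(hit, load, capacity)
assert hit >= 0.95

# State two: the cache is flushed (restart, bad deploy). The backend
# can't absorb the miss traffic, so the cache never refills: the
# system is now "stable" in the down state.
hit = 0.0
for _ in range(10):
    served, hit = step(hit, load, capacity)
assert hit == 0.0
```

In this model, any hit rate above 80% climbs back to healthy and any hit rate below it decays to zero, which is exactly the bistability that makes recovery require outside intervention, such as shedding load or turning it off and on again.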

[26:23] And so if I go back to DSQL, our answer there is: what we call the storage tier is essentially a cache, but it is a complete cache. It contains every row in the database. So it doesn't have this mode of having to recover from being empty or containing the wrong data; it contains all of the data. Similarly, if you look at a more classical relational database design like Aurora, the Aurora leader is constantly telling the potential failover targets: here's something you should cache, here's something you should cache, here's something you should cache.

[27:01] So when a failover happens, the cache is warm on the failover target. Those are the kinds of things you can do to avoid those modalities. But in general, and I wouldn't extract this as a rule or say it applies 100% of the time, I prefer to see the teams around me avoiding caching where possible. I prefer patterns where you have, let's say, a complete materialized view of the data.

[27:35] If you need very fast access to it, especially if it’s slow moving, just pull it down onto your local machine and work with it in memory. If it’s only being updated once a week, who cares? Just make lots of copies of it. So that’s one pattern. Or use a scalable backend, DSQL or DynamoDB or whatever your favorite scalable database is, and keep your database vendor honest about getting to the scale and performance you need rather than putting a cache in front of things.
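The first pattern Marc prefers, holding a complete copy of slow-moving data in memory and refreshing it on an interval, might be sketched like this. The class name, the `fetch_all` callable, and the refresh interval are all illustrative placeholders for whatever your real data source looks like:

```python
import time

class CompleteView:
    """A complete in-memory copy of a slow-moving dataset, refreshed
    periodically. Unlike an on-demand cache, there is no cold or
    partially-filled mode: every lookup is served locally."""
    def __init__(self, fetch_all, refresh_seconds=3600):
        self.fetch_all = fetch_all            # returns the FULL dataset as a dict
        self.refresh_seconds = refresh_seconds
        self.data = fetch_all()               # full snapshot at startup
        self.loaded_at = time.monotonic()

    def get(self, key):
        if time.monotonic() - self.loaded_at > self.refresh_seconds:
            self.data = self.fetch_all()      # swap in a fresh full snapshot
            self.loaded_at = time.monotonic()
        return self.data.get(key)             # always answered from memory

# Illustrative use: configuration that changes perhaps once a week.
view = CompleteView(lambda: {"region": "us-east-1", "tier": "gold"})
assert view.get("tier") == "gold"
```

Because the copy is complete, losing it costs one full re-fetch rather than a storm of per-key misses, which avoids the bimodal behavior of a demand-filled cache at the price of holding the whole dataset in memory.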

[28:06] So caching isn’t a bad pattern, but it is a pattern with some significant downsides that are really best avoided in practice.

Ryan:

[28:18] How often do you see that metastable failure though?

Marc:

[28:22] Yeah, it's not super common. You might go years without seeing something like that. But if you look across the biggest, most impactful system postmortems across the industry, I would say that these kinds of metastable failures have been an underlying cause in probably a majority of them. And it's super important that, as an industry and as a community of practice, we understand these things deeply.

[28:52] Because the cases where these do happen tend to be larger-scale issues, longer-recovery-time issues, and more complex to fix, right? You often have to turn it off and turn it back on again, which is a very painful thing for a team or an organization to do. So again, you might go years operating a system without seeing anything like this, but if you look at the most impactful issues, it's actually fairly common as an underlying cause.

[29:31] And so it’s kind of both of these things of being quite uncommon and being rather common.

29:37 — How AI will change software engineering

Ryan:

[29:37] I was reading your blog and you have a series of posts on how AI may impact the future of software engineering. And I kind of want to pick your brain on that. So what’s your perspective on how you think AI will impact software engineering and how it’ll change things?

Marc:

[29:54] Yeah, I mean, it’s maybe harder than ever to tell the future. And so this is a set of maybe guesses and predictions about the future. So I’ll say the first thing I deeply believe about software is we have only just started to see the impact that software is going to have on the world. There is such an opportunity for more software to exist, bigger software, better software, more personal software, all of these things.

[30:29] And so software has, throughout its 60-ish-year history, been supply-constrained, and I think that's going to remain true. The opportunity for software in the world is almost unbounded. And that's really exciting, right? It's really exciting to be at a moment when the economics of building software are changing, and changing rather quickly. That gives us an opportunity to think about what we could do in the world with a lot more software: a lot more software personalization, a lot more of just the right software in the right place at the right time.

[31:15] And that gives me a huge amount of excitement about the future of this industry, because we have a massive opportunity ahead of us, driven by these changing economics of software development. Now, with those changes, there are also going to be needs for us as software practitioners, people who build software, people who love software, to adapt. And that means that software careers are going to look different.

They're going to look different early on, and they're going to look different later on. I think the software business is going to look different. And the success of people and organizations over the next, who knows, five years, decades, is going to be largely predicated on their ability to adapt to that change and to lead that change.

Ryan:

[32:17] You told this story about a guy who bet on analog circuits when, obviously, we know digital became the more dominant way, yet he made good money. For the people who maybe don't want to adapt: you could still get by and succeed, it's not going to be a crazy thing. Is that the takeaway, and why you brought up that story?

Marc:

[32:41] Yeah, I think that's the right takeaway. If I break the world down into, say, three tiers: I think there's going to remain a huge amount of joy in the craft of software, like the craft of joinery with hand saws, right? It's a nice way to spend time. It's not a particularly economically interesting activity anymore, but not everything we do has to be an economically interesting opportunity.

[33:12] It can just be something I do because I enjoy it, because I enjoy the product of it, because I enjoy talking to people about it. I don't think that's going to go away. I think we're going to see a lot of interest in that, the way there's been interest in retrocomputing, people who run an Apple II as their desktop. Again, it's wildly impractical.

[33:34] It's not economically interesting, but it's fun, something I could do as a hobby. So that's going to remain part of the world of software probably forever. And then there's the story that I told in the blog post, and I think this relates to how driving change in the real world is always harder than it looks from the outside. As you get into the details, things become more difficult.

They become more dependent on people, on politics and policy, and on our various irrationalities as humans. And so, driven by that, there is going to be a huge amount, shrinking over time, but a huge amount, of the software industry that is run in what I might call the old way, right? Past techniques, past languages, past technologies. And there's real economic opportunity in engaging with that part of the world.

[34:43] As we saw with analog electronics: analog electronics still very much exist. In fact, there are parts of the world, like radio and power systems, where there's been incredible technological advancement in those fields, but they have become more niche. Digital became the mainstream; we wouldn't be talking like we are today if it wasn't for this 12-orders-of-magnitude-or-whatever explosion in digital transistor counts.

[35:17] But there's interesting opportunity there. And I think that opportunity is going to change shape and become more and more specialized, more and more niche, with great careers to be built there. And then there is the mainstream, which I think is going to adopt these new technologies, from agentic development to AI-powered development to specification-driven development and a whole lot of other new things whose names we don't even know yet, to build software at a speed and a cost that is unimaginable with the old techniques.

[35:57] And I think that is where, correctly, the majority of the industry is going to be going. I think that’s where the majority of careers are going to be built. I think that’s where the majority of economic opportunity is. It’s the space I’d be in if I was building a company today. It’s the space I’m in in my role and you know, the one I would sort of personally be most excited about. But yeah, it isn’t the only one.

[36:22] I think there’s going to be the spectrum of software practice and especially where software engages with the physical world. There are going to be some really interesting questions about how do we bring these new technologies, how do we bring these new practices into the various, many niches that software is going to and has over six decades kind of wormed its way into.

36:49 — Advice for junior engineers given AI

Ryan:

[36:49] It’s interesting you mentioned joinery. I wonder if down the road we will see apps on the App Store that people pay extra for because it’s marketed as this was written by a human or it was written by hand. So bespoke custom app. Crazy how the world is going to change. But so it sounds like, you know, change is obviously the common case. It’s the one that we should be thinking about. Maybe we can break up the conversation into two parts.

[37:18] One is for junior engineers, what is important given that code is kind of flowing like water.

Marc:

[37:26] Now, at risk of being a bit meta about our past conversation, it really is about finding those problems that matter and doing that early in a career. And that requires an understanding of customers, it requires an understanding of the business, it requires an understanding of economics and of systems. And I think that’s going to move from being, you know, almost kind of senior engineer work, of, like, oh, well, now you’re going to go and talk to customers and actually understand the context of the stuff you’re building, to being more and more part of even the earliest steps of an engineering career.

[38:09] Right. Like, here’s the context, here’s the problem, here’s the customer. Let’s go off and work together and solve this problem with all of this context. And I think that’s going to be super exciting for one set of folks and a little bit frustrating for people who have come in looking for a pure software development career. Right? Looking for a career where they sit down, open their IDE, start typing, and don’t stop for eight hours.

[38:21] I think that’s going to be a mode that we’re going to see fewer people in and a mode that’s going to be harder and harder to build a career around. Now, the other mode of, oh, I’m excited to go off and learn from my customers about what they’re building and what they need, I think that’s going to be ever more highly valuable, and so there’s a super exciting opportunity to build careers there. And then maybe.

[39:09] And this might come across as being a little bit paradoxical. I think there’s also a ton of opportunity for folks who are extremely technically deep, who are deep on optimization problems or deep on infrastructure problems, or deep on various scientific things, or deep on databases, or deep on, you know, one of the many, many topics that are behind our industry, because I think the ability to ask the right questions is also much more valuable than it has ever been.

[39:49] And so I think there is a ton of opportunity for people coming into the industry with deep technical or scientific knowledge to now leverage that in ways that, you know, maybe were hard before. Right. There was too much sort of boilerplate to really use that leverage that you have. And so I think we’re going to see a lot more of those kinds of careers, of really building expertise in a technical topic, in a scientific topic, and then being able to turn that into software and software products in a way that was really difficult before, and in some cases wasn’t possible before, and is now, you know, vastly easier.

Ryan:

[40:31] If I was to look at a career ladder’s expectations, some of what you described, of maybe engaging with the customers and understanding the business context, uniquely in software engineering, it feels like the earliest levels are insulated from all of that. You have your tech lead, the tech lead’s handing out tasks, and the early-level engineers are just given tasks to convert into code. And it sounds like that part’s relatively solved.

[41:00] If not now, maybe I’d be surprised if a year or two from now wasn’t completely solved. And I think that could scare a lot of junior engineers because they would think you’re going to expect me to graduate from college or start working as a software engineer, and then I would have the senior engineer expectations. What would you say to the scared software engineer that’s just entering the industry with all this change?

Marc:

[41:30] Yeah, I think. Well, I would remind them that we, as people who hire and build organizations of software engineers, and they, as people who are building software engineering careers, have really aligned incentives. It’s not valuable to hire a bunch of people and set them up to fail. Like, nobody wants that. It’s not an outcome that is good for anybody. And so, yeah, we’re going to need to figure out how do you support people on that path?

How do you help people learn those things? How do you give them the right guardrails? Hey, that first time that you go out and talk to a customer, yeah, it’s going to be scary. My first time talking to an AWS customer was, you know, super scary. But I got a bunch of help with that, and I got a bunch of advice, and I got a bunch of mentorship, and I got a bunch of feedback, and I got better and better at that over time.

[42:30] And I think that’s exactly what these things look like: you start off, you start small, and you learn, you know, as you go, and so that feedback loop goes faster. And so I don’t expect that people coming in from college will come in with all of this knowledge. I think, you know, it’s never been true that people coming into a technical or engineering career straight out of college know everything, or any career for that matter.

[42:59] Right. Like, you talk to teachers about, you know, what they’ve learned on the job versus what they learned studying: they learn a huge amount in things like internships and so on, and over the course of a career. Or doctors, or anybody in a field like that. And so, yeah, it is going to be about learning. And I think the emphasis on what people learn is going to be different.

[43:25] I think it is going to require leaders like me who care deeply about hiring and developing folks early in their career to be really thoughtful about what that new ladder looks like. And, you know, we’re doing a lot of that thinking. I think people are doing that kind of thinking across the industry. And yeah, it’s changing fast. It’s uncertain. It’s an interesting time to be graduating. But again, like, it’s a super exciting time.

[43:57] I think that just the scale of the opportunity is bigger than it’s ever been.

44:02 — Thoughts for senior engineers

Ryan:

[44:02] Sounds like your advice for senior engineers is different from that for junior engineers. What is your thinking there?

Marc:

[44:09] Yeah, I mean, I think for folks there, the challenge is how do you retain the value of this incredible experience and knowledge that you’ve gained over a career while not falling behind, while learning how to best use the tools? And when I look at senior folks, this is a challenge ahead of them. I think a lot of people have found themselves in influence and leadership type positions where they aren’t hands on building every day.

[44:44] And I think it’s going to be harder and harder to be in that kind of role and be able to influence and advise in a relevant way, in a positive way. And so really I think my advice for folks is you kind of got to get building, you got to get back in, get back into it. You need to deeply understand how the practice of building software and the practice of designing software has changed and is continuing to change.

[45:22] And so the challenge is, how do I really take advantage of all of this knowledge and expertise that I’ve built up in my career and be super curious and be super hands on and really be in the details? And the good news there? Well, I think there are two bits of good news. One of them is, because of these new tools, time spent as a practitioner is so much more leveraged than it has ever been. You can build such cool stuff, you know, during that period of time; the amount of kind of wasted time and boilerplate and so on is so much smaller.

[46:00] And so you really do have this opportunity. And the other one is, again, like, why did we get into the space? Well, I didn’t get into it so I could go to meetings and sound smart. I got into it because I love learning and because I love building technology and because I love solving my customers’ problems and because I love learning about new technologies and learning new things. And there’s more opportunity to do that than ever before, again because of this new set of tools and the leverage that comes with them.

[46:31] And so it really is getting back to, you know, why are you here? Why did you get into this career? And I think it really gets us as technology-focused people closer to our original answer to that. It’s really obvious to me right now when I speak to practitioners, you know, who is and who isn’t using a modern set of agentic-powered developer practices, right? The people who are have these really interesting things to say about the strengths and weaknesses of those approaches, and the work that still needs to be done, and the integrations that still need to be done, and the things that are working and aren’t. And the people who aren’t using them hands on have such a poor mental model of how they work, what they’re good at, and what they’re not good at, that the things they say about them tend to be essentially fiction.

[47:31] And so I think we are in this moment where, if you aren’t doing it hands on, your opinion about it is very likely to be completely wrong. And that takes a level of humility to admit. That is tough. It’s tough for folks with fancy titles and it’s tough for folks with distinguished careers, but I think it’s a must.

Ryan:

[47:55] I feel like there’s a common sentiment among software engineers when they work with someone who is a quote-unquote tech lead, but they’re not really hands on. They’ve kind of been in the docs for the last five years or so, and there are these minor things where they can tell that this person doesn’t actually understand the underlying thing. And it sounds like that gap will widen with these new tools: if you’re looking at things from a thousand feet up and you’re not actually using the tools, that’s just another thing that separates you from the people who are actually building, where you’ll be very out of touch.

Marc:

[48:35] And I think that’s always been true, and I think it is wider than, you know, ever before. But when I look at the engineering leaders that I’ve really respected and learned a huge amount from over my career, for example, some of the folks who built S3, you know, 20 years ago: that was such a successful product because those folks were so deep in the details and so grounded on the use cases and so deep in the economics, and really thought about both the kind of strategic world of, like, how is this cloud thing going to change the way people want to interact with storage?

[49:21] But also the, you know, minute-to-minute details of what’s fast now, what’s slow, what’s good, what’s bad. And I think, when you think about an extremely enduring product like S3 or EC2, it’s been that groundedness in the details from early on, from all levels of leadership, that has made those things so successful, where other products seemingly with the same amount of early promise didn’t turn out to be as successful.

49:59 — Why engineers should write

Ryan:

[49:59] I think one of the last topics that I wanted to ask you about was writing. You have a ton of awesome posts on your blog. The style of writing is incredibly clear. And I was curious, why do you write so much as an engineer?

Marc:

[50:19] Writing and speaking, but especially writing, have this incredible power, and for technical folks it’s this incredible multiplier in being able to take these ideas that are in your head and share them with the world. And, you know, you can take a set of technical ideas in your head and share them with the world by building a great product. And that’s a fantastic thing to do. You can share them with the world kind of one on one: mentorship, teaching people, small groups. Also a great way to spend time.

[50:54] But the multiplication factor of doing a talk or even more of writing something is so much higher, right? Like there are so many more people that you can share that with and it lasts for a much longer period of time. And so just having something written on my blog, even that I wrote like a decade ago that I can share with someone and say, you know, here’s, you know, here’s how to think about this problem.

[51:21] Here’s an insight that I wanted to share with you or have people discover that organically is just super powerful. And so writing lets you scale out the impact of your expertise in space and time in a way that’s really hard to do in other media. I think with video and with podcasts and so on, we’ve seen other ways to do that. But I think writing remains kind of uniquely powerful. And then there’s also this idea which is this kind of core belief culturally at Amazon.

[51:53] And I’ve obviously been affected by this over the years: that, you know, writing forces a level of mental clarity that speaking, making slide decks, etc. doesn’t. And, you know, that’s something that has also really been my experience: sitting down to write something forces me to think it through at a depth that I wouldn’t have been forced to otherwise. And I saw that one of your early conversations was with Leslie Lamport, who kind of takes that a step further and says, hey, you know, it’s formal mathematics that is the next step there.

[52:31] And I love that point, but I think writing is this really accessible thing for people to do that does force a level of thinking. And so I do a lot of writing sometimes just for myself, right? Like, I’ll write a doc, not ever intending to share it with anybody, but just to sharpen my own thinking on a particular point. And so it’s some combination of three things, right?

[52:56] Like I just have something to say and I want to say it. You know, I have something to say and I want to scale it out in time and space and I want to sharpen my own thinking on a subject or the thinking of a small group on a subject in a way that writing is just a super powerful tool to do.

Ryan:

[53:20] Definitely.

Marc:

[53:20] Yeah.

Ryan:

[53:21] I remember being surprised early in my career. I had a manager or tech lead who we would write these docs with, on either designs or strategy. And he said even if you just wrote it and threw it away, it would still be worthwhile, because you’ll realize things as you’re writing, and that clarity will save you a lot of time down the road. And it’s interesting to me because a lot of engineers complain about writing docs and all the stuff around the code.

[53:53] They kind of hate that. It’s a sign of slow big company processes. And what would you say to an engineer like that who’s saying just let me write the code?

Marc:

[54:03] Yeah, and that’s a great point. And I think it really depends on the level of problem you’re trying to solve. And so, you know, I’m going to pick on UML for a minute here, right? It’s a sort of semi-formal software design process, and not one that I’ve ever found useful, because I think it just happens at the wrong semantic level. I think it’s bogged down in details at a level that isn’t helpful.

[54:30] And I think a lot of the let’s-go-off-and-document-this has a similar problem, right? Like, does this actually require that level of reflection and thinking? And so what, for me, separates a valuable doc writing and thinking process from a busy-work process is understanding what you’re getting out of it. And what you’re getting out of it might be an artifact to share with the future, which is super valuable.

[55:02] Either your future self, if you’ve got a terrible memory like me, or new teams, new people. Or I want to share something with customers, or I want to share something with the world, and so that’s super valuable. Or I want to write something down so I can think through a really difficult, often one-way-door, hard-to-change technical decision or API design decision. And I’m not going to do that every time I make a technical decision.

It’s not worth it, because a lot of those technical decisions are either easy, or not as critical, or can just be taken back if we figure out they’re wrong. But I am going to spend my time that way when there are key decisions to make, when there are key insights to find. And I think it is that question of what the purpose of the writing is that separates well-spent time from poorly spent time. Now, there are people who still don’t like writing, even when it’s well-spent time, even when, you know, you have to explain this piece of technology to a future team.

[56:23] I think that’s a skill worth developing. You know, sometimes you do need to eat your vegetables, and it is a skill worth getting good at. And especially documenting the core technical decisions behind a design is so useful. And that’s useful in two ways, by the way. One of them is, as we think about building a big system, we make thousands of decisions.

[56:54] And some of those decisions are very carefully chosen, very particular and very impactful. And some of those decisions are the best thing we could guess in the moment based on having no data to make that decision. And it’s super useful for people who are coming in to improve that system down the line to be able to look at the design and say which of these things were very carefully chosen and thought through and which of these things were arbitrary.

Because the arbitrary things are like, okay, well, I’m going to just go ahead and change that, because I have better data now, I’ve watched the system run, I can go and change those. And these other ones are like, well, let me really engage with the reason that we made this decision. Maybe it was non-obvious, maybe there was some more advanced thinking. And so being able to understand the amount of thought that went into a decision is almost as important as understanding what that thought was.

57:51 — Visibility and apparent expertise

Ryan:

[57:51] You had a really interesting blog post. This was from a while back. It’s titled The Four Hobbies and Apparent Expertise. And you introduced this really interesting idea. It’s a two-by-two matrix: on one axis there’s doing versus discussing, and on the other there’s the hobby and the gear. Maybe I can overlay it for people who want to see. And then later you kind of liken that to your career.

[58:20] I guess maybe we could imagine the hobby is actually coding, and the gear is, let’s just say, your dev setup or something like that. You talked about this trade-off between expertise and visibility depending on which quadrant you’re in: imagine you’re really into coding and you’re really into doing. You’re going to be phenomenal in terms of expertise, but maybe not as visible, because you’re not talking with everyone about how cool your setup is and all of that.

[58:54] And on the flip side, if you’re really into the gear, or maybe your setup in this case, and you’re really into discussing, you’re on all the message boards, you might not actually be that good at coding, but you’re very visible and you have this apparent competence. And I thought that trade-off was really interesting, because I’ve seen it so much in software engineering too: there might be someone who’s a really quiet coder, they never write anything, but they know everything, because they’ve just been in the weeds all the time.

[59:26] And then there are people on the complete opposite end of the spectrum who are writing all the time, speaking all the time, but maybe not actually practicing as much. And my question to you is, how do you strike that balance? Because obviously too far in either direction is not optimal. So how do you strike that balance?

Marc:

[59:44] Yeah, that’s something that I reflect on a lot, you know, and I do explicitly think that being 100% on either of those ends is a failure mode. And I will say that I have a lot more personal enjoyment working with the people that are 100% on the doing side and 0% on the talking side. I appreciate and deeply love their expertise, but I do think that they could have more impact and leverage if they swung a little bit away from that.

[1:00:22] You know, I tend not to enjoy as much interacting with the people who are 100% on the speaking side, but I think they would have a lot more relevant things to say if they swung a little bit back towards the center. The other challenge of being 100% on the doing side, and it sort of gets back to that: how do you find the really important problems? And, you know, if your head’s down in your IDE all day, you could very likely be working on the wrong thing.

[1:00:59] You know, something that isn’t as important, isn’t as impactful, doesn’t have these properties that people want. So how do you find the optimal balance? I don’t have a recipe for what really is optimal. I tend to do about, let’s say, 75/25 practitioner versus teaching and communicating, maybe 80/20 at times. I’ve found that’s about what feels right for me.

[1:01:35] I would say that there are great people I work with anywhere from sort of 90/10 on that scale up to about 50/50 on that scale. I think, outside of those, folks tend to get into trouble as practitioners, right? Like, there are people whose job it is to be communicators, and that’s great, as long as they have the curiosity and are clear about what they know and don’t know.

[1:02:04] But, you know, I found that sweet spot at that sort of 75/25 point in my career, and that’s what’s worked for me, I think. And in this moment where things are changing so fast and there’s so much to learn, swinging a little bit more towards the practitioner side, I think, generally will help people. But again, you don’t want to go too far that way, because then you lose what’s important for you that comes with interacting with the outside world.

Ryan:

[1:02:37] On the doing-versus-discussing axis, I kind of view it as: if you go too far toward doing, you’ll be underrated, and if you go too far toward discussing, you’ll be overrated. And for someone who’s structuring their career, would you say it’s better to be overrated or underrated?

Marc:

[1:03:01] I think long term, if you’re using that terminology, it’s probably better to be underrated. I think being overrated can feel great in the moment, but it’s rarely sustainable and rarely gets you to where you need to be. I really enjoy, you know, things like sports and these sort of creative hobbies and crafts, because they do turn that, let’s say, perception-versus-reality knob to very much reality.

[1:03:42] Right. Like, as a sports person, you can’t fool the world for very long. It very quickly becomes very obvious who can and who can’t. As a craftsperson, the same, right? It very quickly becomes obvious who can and who can’t. And I think it takes a little bit longer in a field like ours, where there is so much kind of qualitative stuff that goes on. But long term, when I look at careers that I really admire and people I really admire, they tend to be people who are personally very honest about their level of knowledge and understanding and skill.

1:04:23 — AWS engineers he admires

Ryan:

[1:04:23] So people who walk the walk, not necessarily talk the talk. I see. Yeah. On engineers that you admire: I’d be curious, because you have worked at AWS for such a long time and you have seen so many legendary engineers. Who at AWS do you look up to, and why?

Marc:

[1:04:43] Yeah, I mean, just fantastic people. One of the blessings of working at a place like AWS is I get to work with so many great people. You know, maybe because he’s retired, I’ll talk a little bit about Allan Vermeulen. Al was one of the sort of early engineers at AWS, a huge contributor to the original design of S3, and a really big contributor to the design of a lot of our database services over time.

[1:05:11] Al was actually the CTO of Amazon for a period of time, until he realized, I think, that wasn’t the job he wanted to do. But what I really admired about Al from early in my career is that he was very clearly somebody who deeply understood the things he was doing, and he could work in these two modes. I have a great memory from around 2010 of arguing with Al about some of the edge cases in the Paxos paper.

[1:05:49] And, you know, he was super deep at that level, but could also get up to the really kind of executive level and talk about cloud strategy and the way we should be explaining things to people and some of the fundamental things that we need to be building. And I really admired that ability to work at almost every level. And I was like, wow, this is something I aspire to and want to model my own career after.

[1:06:16] And so, you know, that is the kind of person I’ve really enjoyed working with: people who do have that breadth. And I think one of the other things that is really admirable about a lot of these folks is, you know, they don’t want to be celebrities, right?

1:06:53 — Technical book recommendations

Ryan:

[1:06:53] For engineers who want to continue their engineering education and really remain on top of things and deeply understand the technology, do you have any top technical book recommendations?

Marc:

[1:07:04] You know, for anybody who’s building distributed systems things, I highly recommend Martin Kleppmann’s book. I think there’s a second edition of that coming out soon. There’s also a new edition of Hennessy and Patterson’s Computer Architecture book, the quantitative-approach one, which I also think is great; it’s a super useful one that covers a ton of ground. Beyond that, I read a ton of fiction and nonfiction, and mostly papers when I’m reading technical things.

[1:07:37] I find engaging at that level more useful for me. And by the way, that’s become way more accessible now. One of the great ways to dive into a paper is, hey, Claude, summarize this for me, and then I can dive into it and read the author’s words. And I find that mode is great, and it’s super accessible for people who haven’t been able to read papers in the past. And then there’s also a ton of insight in some really old stuff too.

[1:08:16] For example, some of the algorithms that we used in Lambda to manage traffic and manage bursts of traffic come from Erlang’s work, like 100 years ago, on managing telephone call traffic, and his book about that. And so folks also shouldn’t think that, oh well, the industry is changing super fast.
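The traffic work Marc mentions here is Agner Erlang’s early-1900s research on telephone systems. As a rough illustration of that kind of math, and emphatically not Lambda’s actual algorithm, here is the classic Erlang B formula, which estimates the probability that an incoming request is blocked given a number of servers and an offered load:

```python
# Illustrative sketch only: the classic Erlang B formula from Agner
# Erlang's early-1900s telephone-traffic work, not AWS's algorithm.
# `offered_load` is in erlangs (arrival rate times mean hold time);
# `servers` is the number of slots that can each serve one request.

def erlang_b(offered_load: float, servers: int) -> float:
    """Blocking probability via the numerically stable recurrence
    B(E, k) = E*B(E, k-1) / (k + E*B(E, k-1)), with B(E, 0) = 1."""
    b = 1.0  # zero servers: every request is blocked
    for k in range(1, servers + 1):
        b = (offered_load * b) / (k + offered_load * b)
    return b

# With 2 erlangs of load and 2 servers, about 40% of requests are
# blocked; a few extra servers drive blocking down very quickly.
print(round(erlang_b(2.0, 2), 3))  # 0.4
print(round(erlang_b(2.0, 5), 3))  # 0.037
```

The recurrence form avoids the huge factorials in the closed-form expression, which is why it is the standard way to compute this in practice.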

Ryan:

[1:08:59] more

Marc:

[1:09:00] Again, maybe more leveraged than ever before, you know, deeply understanding those topics.

1:09:06 — Advice for his younger self

Ryan:

[1:09:06] And then last question for you is if you could go back to your younger self when you just joined AWS and give yourself some advice, what would you say?

Marc:

[1:09:16] I think maybe be a little bit bolder. I really loved the teams that I worked with, especially in EC2 and in the early days in EBS, and I think I was a little bit more hesitant than was optimal about leaving those teams and looking for the next thing as, you know, my own learning and impact kind of tapered off a little bit in those places. And so I think I’ve changed organizations in a big way four times in my career, and maybe five or six would have been optimal.

[1:09:53] Not a lot more, but some more. And so don’t hesitate to think about: what am I learning, and who am I learning from, and is there a better environment to do that more quickly and to learn more things? And, you know, I’m personally highly motivated by being able to follow my curiosity, and every time I’ve done that in my career, I’ve found it a valuable move and something that I’ve personally enjoyed.

Ryan:

[1:10:27] Awesome. Okay, well, thank you so much for your time. I really appreciate it, Marc. Thank you for sharing with the audience.

Marc:

[1:10:34] This has been super fun. Thanks so much.
