Every breakage is an opportunity to make our systems more resilient. After leading hundreds of incident reviews, I've realized that each review can be reduced to three simple questions. In this post I’ll go over why these questions are exhaustive and what common tactics come from answering them.
Good points! How do you approach issues you can't easily trace? Retries at the function level, and maybe even at the worker level, make an app or service noticeably more resilient, but they don't fully help if you run out of memory, for example. It's often the case that you don't even see an error log because the service crashed. That's something I found very tricky in the past.
> It's often the case that you don't even see an error log because the service crashed
Logging to an external service works well for auditing what happened. That way, even if your main service crashes, you can still query and analyze the logs to see what happened up until the service stopped responding.
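As a rough sketch of that idea in Python: the handler below hands every log record to a sender as soon as it is emitted, instead of buffering it in the process. The names `ExternalLogHandler`, `build_logger`, and `send` are hypothetical and only for illustration; `send` stands in for whatever client your logging service provides, and the demo uses a plain list in its place so the example runs without a network.

```python
import json
import logging
from typing import Callable


class ExternalLogHandler(logging.Handler):
    """Hands each record to `send` immediately, so nothing waits in a local buffer."""

    def __init__(self, send: Callable[[str], None]) -> None:
        super().__init__()
        # `send` is an assumption: in practice it would wrap a network client.
        self._send = send

    def emit(self, record: logging.LogRecord) -> None:
        try:
            payload = json.dumps({
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "created": record.created,
            })
            self._send(payload)
        except Exception:
            # Never let a logging failure take down the service itself.
            self.handleError(record)


def build_logger(name: str, send: Callable[[str], None]) -> logging.Logger:
    """Create a logger whose records are shipped through `send`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(ExternalLogHandler(send))
    return logger


if __name__ == "__main__":
    shipped: list[str] = []  # stand-in for the external service
    log = build_logger("worker", shipped.append)
    log.info("starting batch %d", 7)
    log.warning("memory usage is high")
    for line in shipped:
        print(line)
```

Shipping per record trades some throughput for durability: whatever was logged before the crash has already left the process. A hard kill, such as an out-of-memory kill, can still lose whatever the process was about to log, so this narrows the blind spot rather than removing it.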
Awesome article, Ryan. I like the breakdown into the 3 questions. It’s super helpful to think about it the way you laid it out.
Thank you Jordan, glad you liked it!
If you're stuck on a bug you can't fix, take a break. It will put you in a better state of mind.
Awesome article! I would love to hear more about ways to protect the release process such as using canaries or other tactics to prevent incidents from happening.
Thank you Danilo, glad you liked it. That's a good idea, I'll add it to my notepad for a future article :)