Being On-Call is not easy. Neither is writing software. On-Call is not a magic solution: anyone who has done it can tell you that it is stressful, that you can be woken up in the middle of the night, and that you work under pressure. There are ways to mitigate that. While having software developers On-Call has its benefits, preserving those benefits requires special measures to mitigate the stress, the lack of sleep, and the loss of work-life balance that come with it. Many software developers will tell you that even when they were not contacted, the mere thought of being available 24/7 took its toll on them. On the other hand, a software developer who is On-Call gains many insights into troubleshooting, a sense of responsibility, and a deeper understanding of the code that they and their peers wrote.
Being On-Call has become a natural part of software development. Please note that I do not call software development "software engineering", because I think that doing software engineering would require a different process, one that answers the needs of today's business less well.
Just as software developers today increasingly own the end-to-end testing of the software they write, just as they are responsible for the cloud and the product, and just as microservices make a single team responsible for one or more services, so the advent of fast internet, mobile phones, chat applications, VPN connections, and laptops replacing developers' desktops has meant that software developers are also responsible for the support, availability, and stability of the software they write.
An inherent part of this, like it or not, is being On-Call.
Most of the FAANG companies and other industry leaders today have their software developers On-Call, responding to incidents in the software and services they write.
You may ask, of course: with all the automation that we have (after all, it is software that we develop, and in many cases this software automates what used to be a manual process), how come we still cannot automate On-Call itself to the point where the remaining On-Call effort is reduced to a minimum?
This is not just the wishful thinking of a software developer. In many organizations each second of downtime costs a lot of money, and the difference between a short outage and an hour-long outage, that is, the time it takes to resolve a problem, can these days amount to a loss of millions of dollars. So how come this problem is not solved yet? How come we still need a human in the loop to resolve such issues, when a compiler, you could argue, does such complex work without needing a human in the loop? What is missing?
The answer is this: even with all the automated testing and monitoring that we have today, systems are still complex. It is said that the terminal window, be it the standard Mac terminal or iTerm2, is actually slower today, with all the computing power that we have, than the terminal you had 20 years ago; this is a result of increased complexity. The more computing power we have, the more complex the software becomes and the more people are involved in creating it. So the fact that we still need a human in the loop is simply a consequence of ever more complex software, involving a huge number of developers and teams and ever-changing specs, designs, library versions, and source code versions. All of these contribute to increased complexity, so eventually you will be paged with an issue, and in these cases you need to know what to do, fast, because as we said this can result in a loss of millions of dollars to the business.
There is no magic visibility or monitoring tool, and no magic set of data you should collect. However, when collecting and visualizing data, it is best practice to focus on being able to understand the flow, to drill down from larger problems into details, and to run SQL (yes, SQL) on individual requests and on aggregations of your requests and responses.
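As a minimal sketch of what running SQL on requests could look like, the following uses Python's built-in sqlite3 module with an in-memory table. The table name, the columns, and the sample rows are all hypothetical and only stand in for whatever request data your system actually collects.

```python
import sqlite3

# Hypothetical request log; the schema and rows are assumptions for illustration.
SAMPLE_REQUESTS = [
    # (request_id, service, status, latency_ms)
    ("r1", "checkout", 200, 120),
    ("r2", "checkout", 500, 950),
    ("r3", "checkout", 500, 1100),
    ("r4", "search", 200, 40),
    ("r5", "search", 200, 55),
    ("r6", "search", 503, 30),
]


def load_requests(rows):
    """Create an in-memory database holding one row per request."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests ("
        "request_id TEXT, service TEXT, status INTEGER, latency_ms INTEGER)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", rows)
    return conn


def error_summary(conn):
    """Aggregate view: per service, total requests, failed requests, worst latency."""
    query = (
        "SELECT service, COUNT(*) AS total, "
        "SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) AS errors, "
        "MAX(latency_ms) AS max_latency_ms "
        "FROM requests GROUP BY service ORDER BY errors DESC, service"
    )
    return conn.execute(query).fetchall()


def failed_requests(conn, service):
    """Drill down: the individual failed requests of one service."""
    query = (
        "SELECT request_id, status, latency_ms FROM requests "
        "WHERE service = ? AND status >= 500 ORDER BY request_id"
    )
    return conn.execute(query, (service,)).fetchall()


if __name__ == "__main__":
    conn = load_requests(SAMPLE_REQUESTS)
    for row in error_summary(conn):
        print(row)
    print(failed_requests(conn, "checkout"))
```

The first query gives the aggregated picture and the second drills down into the individual failing requests of one service, which is the larger-problem-to-details movement described above.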
Good visibility means you have all the tooling you need to understand the state of your system. There is a balance between too much monitoring and too little, and a difference between good and bad monitoring. Good monitoring allows you to quickly triage problems, to differentiate between major and minor issues, and to tell which system is responsible for the problem.
Good monitoring provides you with the errors that happened at the time of the crisis, the latency of the systems, and the response times of the different services.
One of the most important properties that I would recommend is a complete view of the flow. If you can take the requests that failed and see, yes, actually see, their flow (what happened at each stage, the latency of each stage, and what failed), this may be one of the most crucial ways for you to triage and identify problems. Which component caused the site to break? Should you involve more people, or is it solely your system that caused it? How badly does it affect customers?
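To make the idea of seeing the flow of a failed request concrete, here is a small sketch that rebuilds the stages of one request, with the latency of each stage and the first stage that failed. The Stage record, its field names, and the sample data are assumptions made up for this illustration; a real system would read them from its tracing or logging store.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One step in the flow of a request. All field names are hypothetical."""
    request_id: str
    name: str
    start_ms: int  # offset from the start of the request
    end_ms: int
    ok: bool


# Hypothetical records, deliberately out of order as they might arrive from logs.
SAMPLE_STAGES = [
    Stage("r2", "inventory", 25, 925, False),
    Stage("r2", "gateway", 0, 5, True),
    Stage("r1", "gateway", 0, 4, True),
    Stage("r2", "auth", 5, 25, True),
]


def flow_of(stages, request_id):
    """Return the stages of one request in the order they started."""
    own = [s for s in stages if s.request_id == request_id]
    return sorted(own, key=lambda s: s.start_ms)


def first_failure(flow):
    """Return the name of the first stage that failed, or None if all passed."""
    for stage in flow:
        if not stage.ok:
            return stage.name
    return None


def render(flow):
    """Render one line per stage: name, latency, and outcome."""
    lines = []
    for stage in flow:
        latency = stage.end_ms - stage.start_ms
        outcome = "ok" if stage.ok else "FAILED"
        lines.append(f"{stage.name:<10} {latency:>5} ms  {outcome}")
    return "\n".join(lines)


if __name__ == "__main__":
    flow = flow_of(SAMPLE_STAGES, "r2")
    print(render(flow))
    print("first failure:", first_failure(flow))
```

Even a plain text rendering like this answers the triage questions quickly: which stage broke, how long each stage took, and whether the failing component is yours.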
You have a few minutes for the initial investigation, so the tools should be there to help you. You should also practice with them when there is no error; the first time you go through this process should not be while troubleshooting a real problem. It is best to use these tools as part of developing and analysing the new features you ship to production, as if an error had just occurred.
When you develop a feature, ask yourself: how do I know that it is working, and how do I know when it is not? What are the metrics that allow me to tell? It is hard to go through logs and understand whether a feature works, so it is better to have dashboards, graphs, summaries, and databases that contain automatically collected data about the processes that ran and, as we said, to actually view the full flow of requests in order to understand the flow of the system. This gives you the visibility that you eventually need.
In many companies, overview dashboards allow a quick glance into the state of the system, help determine the severity of errors, and correlate feature commits to errors. If, when an error happens, you can use a dashboard to quickly correlate it to traffic shape changes, code commits, database changes, upgrades, deployments, client requests, or configuration changes, you are in good shape. If that is going to take you time, you are not.
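The correlation a dashboard performs can be sketched in a few lines: given the time an error started, list the changes that landed shortly before it, most recent first. The change records, their kinds, and the 30-minute window are hypothetical choices for this example, not a recommendation.

```python
from datetime import datetime, timedelta

# Hypothetical change feed: (kind, description, time the change took effect).
SAMPLE_CHANGES = [
    ("deployment", "checkout service new build", datetime(2024, 1, 10, 1, 40)),
    ("configuration", "raise connection pool size", datetime(2024, 1, 10, 1, 55)),
    ("database", "add index on orders", datetime(2024, 1, 9, 14, 0)),
    ("deployment", "search service new build", datetime(2024, 1, 10, 2, 30)),
]


def changes_before(changes, error_time, window=timedelta(minutes=30)):
    """Return the changes that took effect within `window` before the error.

    Changes after the error are excluded, since they cannot have caused it.
    The result is ordered from most recent to oldest, because the most
    recent change is usually the first one worth checking.
    """
    candidates = [
        change
        for change in changes
        if timedelta(0) <= error_time - change[2] <= window
    ]
    return sorted(candidates, key=lambda change: change[2], reverse=True)


if __name__ == "__main__":
    error_started = datetime(2024, 1, 10, 2, 0)
    for kind, description, when in changes_before(SAMPLE_CHANGES, error_started):
        print(when.isoformat(), kind, description)
```

If producing this list takes more than a glance during an incident, that is a sign the change feed and the error graphs are not yet close enough to each other.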
An On-Call response procedure can be a developer's best friend, and whenever an error happens you should do a postmortem and check whether the procedures helped you identify the problem.
Of course, the more data you have in these procedure notebooks, the longer it will take the On-Call to consult them.
So it is best to have a kind of cheat sheet. Just like the quick cheat sheets you already use, try to organize the On-Call procedures as cheat sheets, if possible on one single sheet that captures all the procedures.
Communication during an issue is an important aspect of visibility. Stick to logging everything you know about the problem in one place so that others can learn from it; in the future someone may hit a similar error, and wouldn't it be great if they could search past errors and find how to work around it? It is important to keep posting updates on issues, at a pace that depends on the severity, so that everyone knows the current state. Make sure that you focus only on mitigating the problem and not on its root cause; the root cause is not solved at 02:00 am. Remember that you are now wearing the support hat and the client hat, not the software developer hat. The problem should be mitigated and overcome as fast as possible; then software developers should take a step back and work on it during the regular sprint. Root cause analysis and resolution is too important to be done as part of the On-Call shift; it should be part of the standard software development flow. First get the system back to stability.
If you search Google for "60 seconds performance analysis" you will find a great Netflix post about how to quickly check a system for performance issues. It is no coincidence that the same company that perfected Chaos Monkey and its procedures is able to provide such a blog post.
A system without changes usually has a very limited set of problems, if any. Almost every problem I have seen was caused by a change: a traffic spike or a change in traffic shape, a code change, a configuration change, a deployment change, or a dataset upload. When you reach the phase of resolving the actual problem, this is when you look deeply into the change that happened and ask how to avoid such a problem the next time we introduce a change or, even better, how we can improve visibility while simplifying the dashboards and procedures so that such changes are easily correlated to the problem.
Making sure the problem won't happen again requires time, but it is time that pays high dividends in the future. Every time the root cause of a problem is left unsolved, however much effort and time solving it would take, it is going to consume far more time later. It is like a loan you take out against developers' future time. You do not want to take that loan; you want to invest the time today so that you can move even faster in the future. If you don't, you will lose the software development game.
When all is solved, the stability of a system is the best indicator of how healthy your organization is. The more stable the system is while you are still able to develop new features, the happier everyone is going to be, but this requires a lot of examination, root cause analysis, and follow-up stories and tasks. One method that can work well if you are a team of a few people is that, as long as you have issues, one person works continuously on reducing downtime. At first this may seem like a lot of effort and time, but it will pay off. If you invest all your time only in development and problem resolution, you will find sooner rather than later that repaying the quality loan consumes your development time.
Troubleshooting problems as On-Call is not simply a matter of having shifts and assigning developers to them; it is something you should plan carefully. The result of this process is happier developers who can move forward with development while, on the one hand, troubleshooting problems as fast and as early as possible and, on the other hand, introducing the fewest issues or quickly finding an issue once it happens. However, once an issue has happened, it is a great opportunity to quickly turn it into a task that software developers can work on, so that in most cases it won't happen again and, at the least, if it does happen it can be mitigated quickly, and to keep improving on this as the software lifecycle continues.