Responsibilities of an On-call Developer
Being on-call means you can be contacted at any time to investigate and fix issues that arise in the system. You are expected to take whatever actions are necessary to resolve the issue and return services to a normal state.
On-call responsibilities extend beyond normal office hours: while you are on-call, you are expected to respond to issues even at 2 am.
Prepare
Have your laptop and an internet connection with you at all times (dongle, a phone with a tethering plan, laptop chargers, etc.).
Team alert escalation happens within 10 minutes, so set and stagger your notification timeouts (push, SMS, phone) accordingly. Make sure Jira updates and Slack notifications can bypass your “Do Not Disturb” settings.
Ensure the relevant project environments are set up on your system and tested. The current working copy of the necessary repository should be local and functioning.
Keep your credentials and access up to date at all times for GCP and its tools, GitLab, Stackdriver, and the other third-party tools used in the project. You will need these to debug an issue.
You must have full access to the Jira On-call bugs project and Status page.
Read our On-call process document to understand how we handle serious incidents and what the different roles and methods of communication are.
Be aware of your upcoming on-call roster and plan your week accordingly. Arrange swaps around travel plans, vacation, appointments, etc.
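The staggering advice above can be sanity-checked with a short script. This is only a minimal sketch: the 10-minute escalation window comes from this document, while the per-channel delays and the function name `validate_rules` are hypothetical and should be replaced with your own notification settings.

```python
from datetime import timedelta

# Escalation window stated in this document.
ESCALATION_WINDOW = timedelta(minutes=10)

# Hypothetical personal notification rules: (channel, delay after the alert fires).
# These delays are illustrative assumptions, not team policy.
NOTIFICATION_RULES = [
    ("push", timedelta(minutes=0)),
    ("sms", timedelta(minutes=3)),
    ("phone", timedelta(minutes=6)),
]


def validate_rules(rules, window=ESCALATION_WINDOW):
    """Return a list of problems; an empty list means the rules fit the window."""
    problems = []
    previous_delay = None
    for channel, delay in rules:
        # Every channel must fire before the alert escalates to the team.
        if delay >= window:
            problems.append(f"{channel} fires at or after escalation ({delay})")
        # Each channel must fire strictly later than the one before it.
        if previous_delay is not None and delay <= previous_delay:
            problems.append(f"{channel} is not staggered after the previous channel")
        previous_delay = delay
    return problems


if __name__ == "__main__":
    issues = validate_rules(NOTIFICATION_RULES)
    print("OK" if not issues else "\n".join(issues))
```

The check only verifies ordering and the escalation deadline; how much time you leave yourself between the last notification and the escalation is a personal judgment call.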
Triage
Acknowledge and act on alerts/incidents whenever you can. During work hours, the acknowledgement turnaround time (TAT) is 10 minutes from the time an issue is reported in the Jira On-call bugs project.
The Status page will be your single reporting tool. Once the issue is confirmed, you are required to update the incident on the status page.
Determine the urgency of the problem based on the severity of the incident/issue:
-> Is it something that should be worked on right now or escalated into a major incident (“production server down” situations, security alerts)? If so, do it.
-> Is it tactical work that doesn't have to happen during the night (the trend does not indicate impending doom or financial impact)? Snooze the alert until a more suitable time (working hours, the next morning...) and fix it then.
Check your team communication forums (WhatsApp, Google Chat, Slack) for current activity. Often (but not always), actions that could potentially cause incidents will be reported there.
Do the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? If it does not look like a problem you are the expert on, escalate to the responsible team/developer immediately.
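The triage questions above can be read as a small decision function. The following is a minimal sketch, not an official tool: the `Alert` fields, the `Action` names, and the order in which the checks run are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ESCALATE_TO_OWNER = "escalate to the responsible team/developer"
    WORK_NOW = "work on it now or raise a major incident"
    SNOOZE = "snooze until a more suitable time"


@dataclass
class Alert:
    # Hypothetical flags, filled in during the initial investigation.
    production_down: bool = False
    security_alert: bool = False
    trend_worsening: bool = False
    financial_impact: bool = False
    owned_by_other_team: bool = False


def triage(alert: Alert) -> Action:
    """Map an investigated alert to one of the three triage outcomes."""
    # Assumption: ownership is checked first, since escalation to the
    # responsible team is meant to happen immediately.
    if alert.owned_by_other_team:
        return Action.ESCALATE_TO_OWNER
    # "Production server down" situations and security alerts cannot wait.
    if alert.production_down or alert.security_alert:
        return Action.WORK_NOW
    # A worsening trend or financial impact also rules out snoozing.
    if alert.trend_worsening or alert.financial_impact:
        return Action.WORK_NOW
    # Tactical work that can wait for working hours.
    return Action.SNOOZE
```

For example, `triage(Alert(production_down=True))` yields `Action.WORK_NOW`, while an alert with no flags set yields `Action.SNOOZE`.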
Fix
You are empowered to dive into any problem and act to fix it.
Check whether the issue relates to a client-side configuration or a limitation of the product. These details are usually (but not always) mentioned in the runbook.
Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable time frame, or if the resolution is not in the runbook or is something you have not tackled before.
When you're not able to resolve an issue, assign the bug to a developer who can and track the resolution progress.
Remember, you remain solely responsible for the incident's resolution even if you’re not the acting developer. Hourly follow-ups and status page updates are mandatory.
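The hourly follow-up rule can be tracked with a couple of helper functions. This is a minimal sketch under stated assumptions: the function names and the idea of computing the schedule from the handover time are hypothetical, and only the one-hour interval comes from this document.

```python
from datetime import datetime, timedelta

# Follow-up interval stated in this document.
FOLLOW_UP_INTERVAL = timedelta(hours=1)


def follow_up_times(handed_over_at, until, interval=FOLLOW_UP_INTERVAL):
    """Return every follow-up time after the handover, up to and including `until`."""
    times = []
    next_time = handed_over_at + interval
    while next_time <= until:
        times.append(next_time)
        next_time += interval
    return times


def is_follow_up_overdue(last_follow_up, now, interval=FOLLOW_UP_INTERVAL):
    """True when more than one interval has passed since the last follow-up."""
    return now - last_follow_up > interval


if __name__ == "__main__":
    handover = datetime(2024, 1, 1, 2, 0)
    for t in follow_up_times(handover, datetime(2024, 1, 1, 5, 30)):
        print(t.strftime("%H:%M"))
```

Each follow-up time is also a natural point to post the corresponding status page update.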
Improve
If a particular issue keeps recurring, that is, it alerts often with the same root cause, find a way to address it with automation. If the issue cannot be addressed with automation, update the runbook with a step-by-step resolution guide.
If the information is difficult or impossible to find, write it down. You are responsible for constantly refactoring and improving our knowledge base and documentation.
Add redundant links and pointers if your mental model of the wiki/codebase does not match the way it is currently organized.
Support
When your on-call "shift" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note. This will typically be the “On-Call handover meeting” agenda.
Support each other: “On-call” is about close co-ordination towards achieving a single goal.
If you are making a change that impacts the on-call schedule (adding/removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance.
On-call Duties
Acknowledge the incident to the Support team
Analysis of the Incident
Identify whether the reported incident is a Bug or not
-> If it is a bug
Identify whether it is a Frontend or Backend bug
Assign priority to the bug
Report an incident on the status page
Inform the Support team that the complainant can monitor the resolution on the Status page
-> If it is not a bug (in these instances, you will inform the Support team)
Check for any client-side configuration issues
Check the runbook for functionality limitations
Not a bug at all: the functionality is working as expected
Resolve the bug according to its severity. The runbook will assist you with the resolution process.
If you’re unable to resolve the bug within a reasonable time, hand over the bug to the relevant developer. You are required to stay in full control of the incident, and it is recommended that you follow up hourly with the acting developer.
Once the bug is resolved and tested, you will update the status page with the “Resolved” status.
If the bug cannot be addressed by automation and requires manual intervention, add a step-by-step resolution guide for the incident to the runbook.
The On-call handover meeting is held weekly, on the day the roster changes.
You will assign all ongoing bugs to the new on-call developers. You will also let them know about the criticality of the issues.
Notify them about the follow-ups that need to be done
Notify them about the current and pending status page updates
Share any feedback or drawbacks you encountered in the system during the resolution process