On-Call Policy
How does it work in Eshopbox?
Two on-call employees (one front-end and one back-end member) are assigned each week.
They are required to monitor production and acknowledge any bugs or glitches at run time, coming from:
- Opsgenie for alerting (Stackdriver integration)
- Production Bugs reported in the Jira board (Intercom, Client, Team member, QA, anonymous)
- Glitches experienced by anyone
There will always be a primary and a secondary responder for an issue.
Any outage or issue must be acknowledged within 5 minutes of being reported. Once acknowledged, the issue must be classified by priority/criticality level (P1, P2, P3, P4).
In some cases, the client must be informed about the issue using the incident outage communication template, confirming that the development team is looking into it and will get back with a resolution. (Status page)
After the priority is assigned, they are required to resolve the issue (FAQs and infra diagram). P1 and P2 issues must be resolved with immediate effect. P3 and P4 issues are handled as per the on-call backlog.
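The acknowledgement window and the priority split above can be expressed as a small check. This is only a sketch under stated assumptions: the `Incident` record, `ack_breached`, and `route` are hypothetical names for illustration and are not part of Opsgenie, Jira, or any existing Eshopbox tooling. The rules for choosing a priority are not defined here, so the priority is taken as an input.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACK_DEADLINE = timedelta(minutes=5)   # policy: acknowledge within 5 minutes
IMMEDIATE_PRIORITIES = {"P1", "P2"}   # policy: resolve with immediate effect
BACKLOG_PRIORITIES = {"P3", "P4"}     # policy: handle via the on-call backlog


@dataclass
class Incident:
    """Hypothetical incident record (not an Opsgenie or Jira object)."""
    reported_at: datetime
    acknowledged_at: Optional[datetime] = None
    priority: Optional[str] = None  # "P1".."P4", assigned after acknowledgement


def ack_breached(incident: Incident, now: datetime) -> bool:
    """True if the incident was not acknowledged within the 5-minute window."""
    acked_at = incident.acknowledged_at or now
    return acked_at - incident.reported_at > ACK_DEADLINE


def route(incident: Incident) -> str:
    """Return the handling queue implied by the assigned priority."""
    if incident.priority in IMMEDIATE_PRIORITIES:
        return "immediate"
    if incident.priority in BACKLOG_PRIORITIES:
        return "on-call backlog"
    raise ValueError("priority must be assigned (P1-P4) before routing")
```

A check like `ack_breached` could run on a schedule against open incidents, with `route` deciding whether the on-call person drops everything or queues the issue.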
If an issue cannot be resolved by the POC, it will be escalated to senior members for further support:
Levels of support
Level 1 Support (Mentor, Manager)
Level 2 Support (Department Head/Tech lead)
Level 3 Support (CTO)
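The escalation path can be sketched as an ordered ladder. The level names come from the list above; the `next_escalation` helper and the idea of treating the on-call POC as the first rung are assumptions made for illustration.

```python
from typing import Optional

# Escalation ladder as listed in the policy; index 0 is the on-call POC.
ESCALATION_LADDER = [
    "On-call POC",
    "Level 1 Support (Mentor, Manager)",
    "Level 2 Support (Department Head/Tech lead)",
    "Level 3 Support (CTO)",
]


def next_escalation(current: str) -> Optional[str]:
    """Return the next support level, or None when already at the top."""
    idx = ESCALATION_LADDER.index(current)  # raises ValueError for unknown levels
    if idx + 1 < len(ESCALATION_LADDER):
        return ESCALATION_LADDER[idx + 1]
    return None
```

Keeping the ladder as plain data means a change in the support levels only touches the list, not the logic.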
Conducting an incident postmortem
On-call members are also required to run daily testing procedures.
Clearly define the on-call responsibilities
Responsibilities during on-call should be clearly defined. This helps prevent burnout, confusion, and frustration. We suggest documenting your incident response process and expectations for what it means to be on call.
Make sure alerts are being assigned to the right person
Getting your alerting tooling dialled in shouldn’t be overlooked. A clear alerting flow with the right notifications and overrides can avoid a lot of headaches.
Have primary and secondary responders
Life doesn’t stop just because someone is on call. Just like an unexpected personal emergency can take a developer offline during the workday, the same can happen when they’re on call. Putting a backup in place limits the potential damage from this kind of interruption.
Fine-tune your schedules
Teams are not static, and neither should your on-call schedule be. We recommend a culture of continuously reviewing, adjusting, and improving your on-call practices.
Make sure they have access to, and familiarity with, all the relevant diagnostic tools
Every team varies in the tools it uses to track operational health, application performance, resource utilization, etc. Make sure your on-call engineers are familiar with the tools used and have proper access to them.
Metrics to measure the on-call support
Incident Communication
Template for incident outage communication
Criticality logic for the incident
Communication Tool (Slack)
On-call Roster
2 SDs per team each week
The roster is set depending on the team members
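A weekly roster with a primary and a secondary responder could be generated along these lines. This is a sketch, not the agreed scheduling rule: the round-robin order, the `weekly_roster` function, and the choice to make one week's secondary the next week's primary are all assumptions.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_roster(
    members: List[str], start: date, weeks: int
) -> List[Tuple[date, str, str]]:
    """Build (week_start, primary, secondary) entries in round-robin order.

    Assumed policy: the secondary of one week becomes the primary of the
    next, so each handover keeps one person who already has context.
    """
    if len(members) < 2:
        raise ValueError("need at least two members for primary and secondary")
    roster = []
    for week in range(weeks):
        primary = members[week % len(members)]
        secondary = members[(week + 1) % len(members)]
        roster.append((start + timedelta(weeks=week), primary, secondary))
    return roster
```

For a team that must pair one front-end and one back-end member, the same idea applies with two separate member lists rotated independently.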
Bug reporting and Flow
Reporting from Intercom, Client, Team member, QA, anonymous
Resolved with simple steps from the on-call runbook
The on-call person should know where the bug should be assigned
The on-call person creates an incident on the status page and updates it hourly
Ensure the bug is assigned to the relevant team
Bug is resolved
RCA report and postmortem
Runbook update and handover meeting (Template for Runbook)
Journey | Actions | Points to note |
---|---|---|
Issue Reported on channel | | |
POC reports the issue in Jira | | |
On-call person takes charge | | |
On-call person resolution | | |
Bug handover to the developer (if not resolved) | | |
Bug is resolved by the developer | | |
RCA/Postmortem (Jira) | | |
Run book Updation | | |
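The journey in the table can be read as a small state machine. The stage names below are taken from the table; the transition map is an assumed reading of it (in particular, that an on-call resolution either goes straight to RCA or is handed over to a developer), and `advance` is a hypothetical helper.

```python
from typing import Dict, List

REPORTED = "Issue Reported on channel"
IN_JIRA = "POC reports the issue in Jira"
TAKEN = "On-call person takes charge"
ONCALL_RESOLUTION = "On-call person resolution"
HANDOVER = "Bug handover to the developer (if not resolved)"
DEV_RESOLVED = "Bug is resolved by the developer"
RCA = "RCA/Postmortem (Jira)"
RUNBOOK = "Run book Updation"

# Assumed transitions between the journey stages listed in the table.
NEXT_STAGES: Dict[str, List[str]] = {
    REPORTED: [IN_JIRA],
    IN_JIRA: [TAKEN],
    TAKEN: [ONCALL_RESOLUTION],
    ONCALL_RESOLUTION: [RCA, HANDOVER],  # resolved, or handed over if not
    HANDOVER: [DEV_RESOLVED],
    DEV_RESOLVED: [RCA],
    RCA: [RUNBOOK],
    RUNBOOK: [],                         # end of the journey
}


def advance(current: str, target: str) -> str:
    """Move a bug to the target stage, rejecting steps the flow does not allow."""
    if target not in NEXT_STAGES[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target
```

Encoding the flow this way makes skipped steps visible, for example a bug closed without an RCA.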
Status page configuration (setup)
Jira configuration for bug notification (Notification, Bug RCA questionnaire etc)
- Bug template from Gopal
- Bug format for description and custom fields to move bug from QA to done
Template to assign a bug from on-call to a developer (Actual, ideal)
Define Handover meeting format (agenda, stakeholder involvement and outcomes)
- Change the reporter and assignee for notifications
- Known responsibilities and issues to be handed over
- On-going bug assigned to incoming on-call persons
- Take control of the Status page (Closing previous incident, opening new tickets etc)
- Bug project configuration
Project wise team segregation for the on-call person (profile from Jira)
On-call Roster
Run book template
Defining the scope in the Runbook project wise
Weekly on-call handover meeting
run book
automation and avoidance
Which issues were not resolved and why, e.g. whether the steps in the runbook were inadequate.
Outcome of On-call
Eshopbox status page to report incidents; it must be integrated with Intercom
Bug reporting cycle and handover meeting (Runbook)