On-call Policy

How does it work in Eshopbox?

  • Two on-call employees (one frontend and one backend member) are assigned each week

  • They are required to monitor the runtime and acknowledge any bugs or glitches, which come from:
    - Opsgenie alerts (Stackdriver integration)
    - Production bugs reported on the Jira board (Intercom, Client, Team member, QA, anonymous)
    - Glitches experienced by anyone

  • There will always be a primary and secondary responder for an issue

  • Any outage or issue must be acknowledged within 5 minutes of being reported. Once acknowledged, the issue must be classified by priority/criticality level (P1, P2, P3, P4).

  • In some cases, the client must be informed about the issue using the incident communication template, confirming that the development team is looking into it and will get back with a resolution. (Status page)

  • After the priority is assigned, the on-call members must resolve the issue (using the FAQs and infra diagram). P1 and P2 issues must be resolved immediately. P3 and P4 issues are handled through the on-call backlog.

  • If an issue cannot be resolved by the POC, it is escalated to senior members for further support:
    Levels of support

    • Level 1 Support (Mentor, Manager)

    • Level 2 Support (Department Head/Tech lead)

    • Level 3 Support (CTO)

  • The on-call members conduct an incident postmortem

  • The on-call members are also required to carry out daily testing procedures
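
The acknowledgement window, priority handling, and escalation levels in the list above can be sketched as a small triage helper. This is a minimal sketch, not Eshopbox's actual tooling: the names `Incident`, `is_ack_on_time`, `handling_for`, and `next_escalation` are hypothetical. Only the 5-minute window, the P1 to P4 scale, and the three support levels come from the policy itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Values taken from the policy text.
ACK_WINDOW = timedelta(minutes=5)
IMMEDIATE_PRIORITIES = {"P1", "P2"}
BACKLOG_PRIORITIES = {"P3", "P4"}

# Escalation order once the on-call responder cannot resolve the issue.
ESCALATION_LEVELS = [
    "Level 1 Support (Mentor, Manager)",
    "Level 2 Support (Department Head/Tech lead)",
    "Level 3 Support (CTO)",
]


@dataclass
class Incident:
    """Hypothetical record for one reported issue."""
    reported_at: datetime
    acknowledged_at: Optional[datetime] = None
    priority: Optional[str] = None  # "P1".."P4", set after acknowledgement
    escalation_level: int = 0       # 0 = still with the on-call responder


def is_ack_on_time(incident: Incident) -> bool:
    """True if the incident was acknowledged within the 5-minute window."""
    if incident.acknowledged_at is None:
        return False
    return incident.acknowledged_at - incident.reported_at <= ACK_WINDOW


def handling_for(priority: str) -> str:
    """Map a priority to the handling the policy describes."""
    if priority in IMMEDIATE_PRIORITIES:
        return "resolve immediately"
    if priority in BACKLOG_PRIORITIES:
        return "add to on-call backlog"
    raise ValueError(f"unknown priority: {priority!r}")


def next_escalation(incident: Incident) -> Optional[str]:
    """Return the next support level to involve, or None if already at the top."""
    if incident.escalation_level >= len(ESCALATION_LEVELS):
        return None
    level = ESCALATION_LEVELS[incident.escalation_level]
    incident.escalation_level += 1
    return level
```

Keeping the window and the priority sets as module-level constants means a change to the policy only needs to be made in one place.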

 

 

  • Clearly define the on-call responsibilities
    Responsibilities during on-call should be clearly defined. This helps prevent burnout, confusion, and frustration. We suggest documenting your incident response process and expectations for what it means to be on call.

  • Make sure alerts are being assigned to the right person
    Getting your alerting tooling dialled in shouldn’t be overlooked. A clear alerting flow with the right notifications and overrides can avoid a lot of headaches.

  • Have primary and secondary responders
    Life doesn’t stop just because someone is on call. Just like an unexpected personal emergency can take a developer offline during the workday, the same can happen when they’re on call. Putting a backup in place limits the potential damage from this kind of interruption.

  • Fine-tune your schedules
    Teams are not static, and neither should your on-call schedule be. We recommend a culture of continuously reviewing, adjusting, and improving your on-call practices.

  • Make sure they have access to, and familiarity with, all the relevant diagnostic tools
    Every team varies in the tools they use to track operational health, application performance, resource utilization, etc. Make sure your on-call engineers are familiar with the tools used and have proper access to them.



Metrics to measure the on-call support

 

Incident Communication

  • Template for incident outage communication

  • Criticality logic for the incident

  • Communication Tool (Slack)

 

On-call Roster

  • 2 SDs per team each week

  • The roster is set depending on the team members
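
A weekly two-person rotation like the one described above could be generated along these lines. This is an illustrative sketch only: the function name `weekly_roster`, the round-robin order, and the assumption that the first person of each pair is the primary responder are not part of the policy, which only states that two people per team are on call each week.

```python
from datetime import date, timedelta
from typing import Dict, List


def weekly_roster(members: List[str], start: date, weeks: int) -> List[Dict[str, object]]:
    """Build a round-robin roster with two people on call each week.

    Assumption: the first person of each pair is the primary responder
    and the second is the secondary responder.
    """
    if len(members) < 2:
        raise ValueError("need at least two team members for a weekly pair")
    roster = []
    for week in range(weeks):
        # Step through the member list two at a time, wrapping around.
        primary = members[(2 * week) % len(members)]
        secondary = members[(2 * week + 1) % len(members)]
        roster.append({
            "week_start": start + timedelta(weeks=week),
            "primary": primary,
            "secondary": secondary,
        })
    return roster
```

With an odd team size the pairs shift every week, so the same two people are not always on call together. A real roster would also need to account for leave and overrides.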

 

 

Bug reporting and Flow

  • Reporting from Intercom, Client, Team member, QA, anonymous

  • Resolved with simple steps from the on-call runbook

  • On-call person should know where the bug should be assigned

  • Creates an incident on the status page and updates it hourly

  • Ensure the bug is assigned to the relevant team

  • Bug is resolved

  • RCA report and postmortem

  • Runbook update and handover meeting (Template for Runbook)
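
The flow above is essentially an ordered set of stages, which can be sketched as a small state machine. The stage names and the `advance` helper below are hypothetical labels chosen for illustration, not fields from the Jira Bugs project.

```python
from enum import Enum


class BugStage(Enum):
    REPORTED = "reported"
    TRIAGED = "triaged by on-call"
    HANDED_OVER = "assigned to relevant team"
    RESOLVED = "resolved"
    RCA_DONE = "RCA and postmortem complete"
    RUNBOOK_UPDATED = "runbook updated and handed over"


# Allowed transitions. A triaged bug is either resolved directly from the
# runbook or handed over to the relevant team first.
ALLOWED = {
    BugStage.REPORTED: {BugStage.TRIAGED},
    BugStage.TRIAGED: {BugStage.RESOLVED, BugStage.HANDED_OVER},
    BugStage.HANDED_OVER: {BugStage.RESOLVED},
    BugStage.RESOLVED: {BugStage.RCA_DONE},
    BugStage.RCA_DONE: {BugStage.RUNBOOK_UPDATED},
    BugStage.RUNBOOK_UPDATED: set(),
}


def advance(current: BugStage, target: BugStage) -> BugStage:
    """Move a bug to the next stage, rejecting steps the flow does not allow."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target
```

Encoding the transitions explicitly makes it harder to skip the RCA or the runbook update once a bug has been resolved.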

Journey

Actions

Points to note

Issue reported on channel

  • Ticket raised on Intercom

 

POC reports the issue in Jira

  • The bug is created in the Bugs project in Jira

  • Notification is sent to both on-call members

  • POC must know the on-call roster

  • Jira automation

  • Reporter and assignee

  • Status page check

On-call person takes charge

  • Must acknowledge the bug within 5 minutes of reporting

  • Identifies and verifies the issue on the status page

  • Analyses the report to determine whether it is a bug or a config/settings issue

  • Decides whether it is a frontend or backend bug

  • Assigns priority to the bug (P1, P2, P3, P4)

  • Updates the incident on the status page

  • Checks whether the bug is already reported and being worked on (Status page)

  • Understands which project the bug relates to

On-call person resolution

  • Checks for the solution in the runbook

  • Resolves the bug with the step-by-step guide

  • Deploys changes

 

Bug handover to the developer (if not resolved)

  • Assigns the bug to the relevant project & developer

  • Depending on priority, the bug is picked up by the developer

  • Follows up hourly with the developer to ensure resolution

  • Updates the hourly resolution status for the bug on the status page

  • Format of assigning a bug
    (What was tried, what the technical issue is)

Bug is resolved by the developer

  • Deploy on the environment

  • On-call person to close the Status page incident.

 

RCA/Postmortem (Jira)

  • Developer to fill out the RCA on the Jira bug assigned to them

  • Draft RCA form template

  • Permanent or temp fix

  • Resolved through automation or manually

Runbook update

  • The on-call person updates the runbook with the resolution given by the developer
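
The handover step in the journey above asks for a fixed format when assigning a bug (what was tried, what the technical issue is) and for hourly status updates until resolution. A minimal sketch of both follows. `HandoverNote`, its field names, and `status_update_times` are hypothetical and do not describe an existing Jira or status page integration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class HandoverNote:
    """Hypothetical shape of the note written when a bug goes to a developer."""
    bug_key: str          # illustrative Jira issue key
    priority: str         # P1..P4 as assigned during triage
    steps_tried: str      # what the on-call person already attempted
    technical_issue: str  # the suspected technical cause

    def render(self) -> str:
        """Format the note as plain text for the bug description."""
        return (
            f"[{self.priority}] {self.bug_key}\n"
            f"What was tried: {self.steps_tried}\n"
            f"Technical issue: {self.technical_issue}"
        )


def status_update_times(
    handover_at: datetime,
    until: datetime,
    interval: timedelta = timedelta(hours=1),
) -> List[datetime]:
    """List the times a status page update is due between handover and `until`."""
    times = []
    due = handover_at + interval
    while due < until:
        times.append(due)
        due += interval
    return times
```

The one-hour default mirrors the hourly follow-up in the journey. Passing a different `interval` would allow tighter updates for P1 incidents if the team chose to do that.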

 

 

  • Status page configuration (setup)

  • Jira configuration for bug notification (Notification, Bug RCA questionnaire etc)
    - Bug template from Gopal
    - Bug Format for description and custom fields to move bug from QA to done

  • Template to assign a bug from on-call to a developer (Actual, ideal)

  • Define Handover meeting format (agenda, stakeholder involvement and outcomes)
    - Change the reporter and assignee for notifications
    - Known responsibilities and issues to be handed over
    - Ongoing bugs assigned to the incoming on-call persons
    - Take control of the Status page (closing previous incidents, opening new tickets, etc.)




  • Bug project configuration

  • Project-wise team segregation for the on-call person (profile from Jira)

  • On-call Roster

  • Runbook template

  • Defining the scope in the runbook, project-wise

 

 

Weekly on-call handover meeting

  • Runbook

  • Automation and avoidance

  • Which issues were not resolved and why, including whether the steps in the runbook were inadequate.

 

Outcome of On-call

  • An Eshopbox status page for reporting incidents, which must be integrated with Intercom

  • Bug reporting cycle and handover meeting (Runbook)
