Shaw Systems Disaster Recovery and Business Continuity Plan
By: Colin Doherty, Cloud Services & IT Manager
In the event of a disaster, what are our priorities?
Our number one priority is the safety of our people. After that, we prioritize production systems, specifically client-facing systems such as email, JIRA, and Confluence. Next come business-critical systems such as code repositories and file shares. Finally, we try to mitigate any unforeseen circumstances that would hinder our people from working, such as the inability to reach the office.
What are the preparations for a disaster and how do we execute on the business continuity plan?
Once a potential business continuity event has been identified, our team meets and reviews the systems that could be affected and where the primary server for each resides. Once all potentially affected systems have been reviewed, a task is put in place to move those systems from their at-risk primaries to their secondaries. We have a primary and a secondary cluster for all mission-critical systems, spread across the Richmond and Houston offices.
We also meet to identify how we can best keep our people working and communicating, for example by providing Wi-Fi hotspots to critical support personnel and access instructions to the rest of the affected staff.
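The review-and-move step described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the `System` record, the `plan_failovers` helper, and the site assigned to each system in the sample inventory are hypothetical, since this plan does not state where each primary actually resides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """One mission-critical system and the offices hosting its clusters."""
    name: str
    primary_site: str
    secondary_site: str


def plan_failovers(systems, at_risk_site):
    """Return (name, from_site, to_site) for every system whose primary is at risk."""
    return [
        (s.name, s.primary_site, s.secondary_site)
        for s in systems
        if s.primary_site == at_risk_site
    ]


# Hypothetical inventory: the placement of each system is an assumption.
inventory = [
    System("Email", "Richmond", "Houston"),
    System("JIRA", "Houston", "Richmond"),
    System("Confluence", "Richmond", "Houston"),
]

# Example: an event threatens the Houston office.
for name, src, dst in plan_failovers(inventory, at_risk_site="Houston"):
    print(f"Move {name}: {src} -> {dst}")
```

Systems whose primary already sits in the unaffected office are left out of the move list, which mirrors the review described above: only at-risk primaries are moved.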
Which applications most greatly affect the clients of our company?
Due to the nature of our business, almost all IT systems can be considered mission critical and can affect profitability. That is why, for the majority of our systems, each office hosts either the primary or the secondary. We also have redundant connections between the offices and to the internet. The hardware our applications run on is also completely redundant, with no single point of failure. Finally, we rely on the wide geographic separation of the two offices, so the chance of both being affected by the same event is almost nonexistent.
What level of redundancy do we have?
We have multiple levels of redundancy for all our mission critical systems. These redundancies include:
- Backups – We have scheduled backups of all our systems.
- Redundant Storage – All of our applications run on redundant storage.
- Redundant Server Hardware – All of our servers are redundant; if we lost a physical server, we could run the logical servers on backup hardware.
- Clusters – We also have high availability clusters that allow multiple servers to work as one. We could lose one and the system would still operate.
- Multisite Redundancy – We have two offices with replicated setups, which allows us to run applications out of either office or both.
Once critical applications have been reviewed, at what level does each of the applications need to operate in order to ensure business operations remain stable? And what time frame is required for full recovery of services and applications?
Depending on the application, we can tolerate uptime as low as 99% without any issues. That being said, we aim for 99.99% uptime and consistently meet that goal. Our Recovery Point Objective is 1 to 2 hours at most; depending on the application, it can be as low as a few minutes.
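To put these figures in concrete terms, the sketch below converts an uptime percentage into a yearly downtime allowance and checks whether a recovery point falls within a Recovery Point Objective. The function names, the 365-day year, and the sample timestamps are assumptions made for illustration; they do not describe how our monitoring is actually implemented.

```python
from datetime import datetime, timedelta

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the maximum downtime, in hours, that still meets the uptime target."""
    return period_hours * (1 - uptime_percent / 100)


def within_rpo(last_recovery_point, now, rpo):
    """Return True if the newest recovery point is no older than the RPO."""
    return now - last_recovery_point <= rpo


# Downtime allowance for the two uptime figures mentioned above.
for target in (99.0, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours "
          f"({hours * 60:.1f} minutes) of downtime per year")

# Hypothetical timestamps: a recovery point taken 90 minutes before the event.
event_time = datetime(2024, 1, 1, 12, 0)
last_point = event_time - timedelta(minutes=90)
print(within_rpo(last_point, event_time, rpo=timedelta(hours=2)))  # 2-hour RPO
print(within_rpo(last_point, event_time, rpo=timedelta(hours=1)))  # 1-hour RPO
```

Over a 365-day year, 99% uptime corresponds to roughly 87.6 hours of downtime, while 99.99% corresponds to roughly 52.6 minutes, which shows how much tighter the target is than the level we could tolerate.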
What kind of testing do we have in place to test our strategy in the event of a disaster?
We require annual testing but usually test 2 to 4 times a year, including failovers performed for maintenance and other events. This testing involves a complete failover of systems and then using the failed-over systems as the production systems. We also monitor and test our backups, hardware, and network connections regularly.
Colin Doherty, Cloud Services & IT Manager
As the IT/Cloud manager, Colin oversees the day-to-day operation, support, and maintenance of internal IT systems; identifies, evaluates, purchases, and implements new IT systems; provides subject matter expert support to the sales team for both IT and Cloud questions; and manages the Shaw Cloud Services environment.