July 2017 IT service disruption summary
Nipissing’s University Technology Services (UTS) and Facilities departments have had an eventful summer. As previously communicated, two significant disruptions to IT services occurred. The university recognizes that the service disruptions were challenging for many individuals and would like to thank the university community for their patience and understanding during the outages. In an effort to communicate exactly what transpired, how it was addressed, and what steps have been taken to remedy the situation, UTS has created the following timeline.
On Friday, July 21, Nipissing University experienced a significant IT service disruption resulting in intermittent availability of all IT services for approximately four business days. Services were fully restored in the early morning of Thursday, July 27.
The issues began at approximately 10 a.m. on Friday, July 21, when UTS received system notifications indicating the failure of one firewall unit in the main IT server room. The UTS Infrastructure team was immediately deployed to the server room to investigate and take action.
About two hours later, UTS technologists heard loud electrical zapping sounds within the same server room and witnessed the immediate failure of multiple servers, switches and firewall units.
The Facilities department was immediately engaged and an electrician was onsite by 1 p.m.
Additional units continued to fail over the next several hours as electricians and technicians searched for the cause.
All main IT services were non-operational.
The university engaged several external professional service providers to work in conjunction with Facilities and UTS staff in an attempt to identify and rectify the cause of the disruption in the server room and to replace and restore downed equipment.
By 9:30 p.m. on Friday, July 21, the Nipissing University website was restored.
At 5 p.m. on Saturday, July 22, the on-campus wired internet connection was restored.
Unfortunately, shortly thereafter, the fire suppression system in the server room malfunctioned, forcing an evacuation of the room for safety.
Meanwhile, several more units in the server room were now showing signs of failure.
At this point, no specific cause had been identified and the server room was deemed unstable. The decision was made to relocate necessary equipment to a secondary server room, one that had been utilized previously and was currently housing several operational servers.
On Sunday, July 23, UTS staff moved 90% of the server room equipment to the new space. By 9 p.m. the equipment was in place and staff began powering the units up.
The main server unit could not be relocated due to its size and weight and had to remain in the unstable server room, along with the Bell equipment and associated devices.
On Monday, July 24, at 7 a.m., all IT services were restored, with the exception of on-campus wireless and residence wired/wireless internet connectivity.
Unfortunately, at 10 a.m. another electrical event occurred, which again significantly disrupted all services.
Following consultation with experts at HP, our service provider, the decision was made to take the main server offline in an effort to guard against data corruption until arrangements could be made for the relocation of this large and heavy equipment.
Arrangements were made to move the larger main server to the new server room on Wednesday, July 26, utilizing the services of a specialized IT moving company.
On Tuesday, July 25, some IT services were once again available, including the NU website, employee email, on-campus wired internet, and Blackboard via learn.nipissingu.ca. UTS technicians continued to work to fully restore functionality.
On Wednesday, July 26, the large server unit was successfully moved to the secure server room and the service restoration process continued at full speed.
On Thursday, July 27, at 2 a.m., all regular IT services were restored.
The investigation into the root cause of the electrical events and subsequent IT service disruption continued.
On Friday, August 11, an electrical storm damaged a sensor in the fire suppression system in the stable server room. This initiated a failsafe shutdown, causing a second IT service outage. Normally, UTS receives system notifications in advance of a failsafe shutdown being initiated, but an investigation revealed that the service provider had not connected the communications cable. The cable was immediately connected to ensure future notifications would be received. The same storm also damaged Bell equipment, resulting in an outage that lasted until Sunday evening.
UTS initiated a planned outage on Friday, August 18, at 10 p.m. in order to move the remaining Bell equipment and associated devices to the new server room.
The investigation continues into the cause of the July service disruption. There is no evidence whatsoever of any malicious activity. It should be noted that both server rooms fully meet both environmental and operational standards.
UTS has a disaster recovery plan that uses cloud services; however, this service is meant to be engaged during events of a lengthy duration and therefore was not engaged for these relatively short outages.
It is important to acknowledge the tireless work of the UTS Infrastructure team as they rebuilt, replaced and restored equipment and services night and day throughout this unexpected event. The Facilities department also worked in close collaboration with UTS throughout the outage.
Over the next few years, the majority of the existing server infrastructure will be moved to a cloud-based platform as UTS continues to work towards IT service solutions that are more stable, secure and well supported.
While even the largest IT service providers, including Microsoft, Google and Apple, experience unexpected outages, an event of this nature on campus serves to reinforce the need for every department to examine its disaster recovery strategy, including its ability to conduct critical operations when disruptions occur.
We hope that this information has been useful in creating an understanding of exactly what occurred and the steps that were taken in addressing the issues.