RFO Report for Network Service Incident
Summary
At approximately 17:00 on Tuesday 9 August 2022, our monitoring alerted us to a network-affecting issue resulting in a loss of all connectivity to our network.
Upon investigation, we identified that we had lost communication with both core switches, resulting in a network-wide outage.
Our network operations team attempted to hard-reboot the switches remotely to restore service; however, these attempts were unsuccessful.
At 17:40 our field engineer arrived at the rack and rebooted the affected devices. This restored network operation, and traffic levels began to recover.
At 18:15 we identified that traffic levels had recovered to only 30% of their pre-incident levels. Deeper investigation indicated this was likely caused by DNS caching on our internal resolvers. Flushing the DNS cache resolved the issue, and traffic returned to normal levels.
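For reference, a cache flush of this kind can be performed with commands such as the following. This is a sketch only, assuming BIND-based resolvers managed via rndc; the resolver software actually in use here is not stated, and other resolvers (e.g. Unbound, dnsmasq) use different commands.

```shell
# Flush the entire DNS cache on a BIND resolver (assumes rndc is configured)
rndc flush

# Confirm the resolver is answering with fresh records
# (example.com is a placeholder query name)
dig @127.0.0.1 example.com +short
```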
At 18:50 our network engineering team performed previously scheduled emergency maintenance to replace a routing engine on our core router.
Investigation and Root Cause Analysis
Following further investigation and consultation with the vendor's technical assistance centre, we identified that this issue was caused by a known manufacturer firmware issue affecting our current switch configuration.