Following on from yesterday's issues, the post-mortem is as follows.
At around 16:00 we identified traffic issues which were resulting in high latency and packet loss of around 25-35% for all internet-bound traffic. We then started our diagnostic work to isolate the potential cause, and within about 15 minutes had identified the issue as most likely beyond our control. At this point we reached out via our escalation routes to our IP transit providers, handed over the diagnostic information, and received our first response at around 16:30. They had completed some basic testing, couldn't see the same issue we could, and requested further diagnosis. It was at this point the job split into two tasks.
There appeared to be a delay of about 45 minutes whilst we were working through the first task and waiting for information back on the second, which, it now transpires, was down to the current Covid-19 action plan: those who needed to communicate didn't have the access they needed as quickly as they normally would.
At around 17:00 we decided to flip our circuit to the backup route. Even though we could still see an issue on both routes, we wanted to isolate the master switch and route traffic out via the backup switch to see if the fault lay with the switch itself. This required some manual cable re-routing and took about 15 minutes to complete, plus a further 10 minutes waiting for ARP entries to re-route across the network. Unfortunately our communication with the IP transit providers was hampered during this time, but it transpires they had been working on further diagnosis and had started implementing a fix.
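For anyone curious why the cutover needed that extra 10 minutes: neighbouring devices cache MAC addresses (ARP) and keep sending traffic to the stale entry until it ages out or is refreshed. A rough sketch of that behaviour is below; the timeout value, addresses, and class are illustrative, not our actual configuration.

```python
# Illustrative ARP cache: entries persist until they age out, which is
# why traffic kept heading to the old switch for a while after the
# cables were moved. The 600-second timeout here is an assumption.

class ArpCache:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.entries = {}                     # ip -> (mac, learned_at)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def resolve(self, ip, now):
        """Return the cached MAC while fresh; None once it has aged out."""
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at > self.timeout_s:
            del self.entries[ip]              # stale: must re-ARP, finding the backup switch
            return None
        return mac

cache = ArpCache(timeout_s=600)               # ~10 minutes, matching what we saw
cache.learn("192.0.2.1", "aa:bb:cc:dd:ee:01", now=0)
print(cache.resolve("192.0.2.1", now=300))    # still the old entry
print(cache.resolve("192.0.2.1", now=900))    # aged out: None, traffic re-ARPs
```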
At around 17:30 we thought we had identified the issue; however, the IP transit providers had been making changes, and nothing we could have done would have fixed it, as the fault lay with their kit. They, like us, had been upgrading core infrastructure: in our case so we can provide 1Gb connectivity, and in theirs so they can support the additional demands that we and companies like us place on them. The Covid-19 situation has meant that all non-essential work was paused, so upgrade work that should have been completed is currently on hold. They had replaced 3 of the 4 core switches prior to Covid-19; however, our inbound route, which is chosen via OSPF, happened to go via the one that hadn't been upgraded, and it was that switch that ran out of space in its routing table and caused the issue.
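For those interested in the technical detail, a routing table running out of space can be sketched roughly as follows. This is a simplified illustration with a made-up capacity and prefixes, not the behaviour of the actual switch: once the table is full, new routes cannot be programmed, and traffic to those destinations fails or degrades.

```python
# Simplified model of a hardware routing table with a fixed capacity.
# Real switches hold routes in limited TCAM; once it fills, newly learned
# prefixes cannot be installed, and traffic to them is dropped or punted
# to a much slower path -- which looks like high latency and packet loss.

class RoutingTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.routes = {}                      # prefix -> next hop

    def install(self, prefix, next_hop):
        """Try to install a route; fail once the table is full."""
        if prefix not in self.routes and len(self.routes) >= self.capacity:
            return False                      # table exhausted: route not programmed
        self.routes[prefix] = next_hop
        return True

    def lookup(self, prefix):
        """Return the next hop, or None (traffic to this prefix fails)."""
        return self.routes.get(prefix)

table = RoutingTable(capacity=2)              # tiny capacity for illustration
table.install("203.0.113.0/24", "peer-a")
table.install("198.51.100.0/24", "peer-b")
ok = table.install("192.0.2.0/24", "peer-a")  # third route: table already full
print(ok)                                     # False
print(table.lookup("192.0.2.0/24"))           # None: unreachable destination
```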
They have identified and resolved the issue, which should mean we don't see a recurrence before the switch is replaced. They are also working on replacing OSPF with BGP for us, which will provide more control. We have also worked with them on an action plan so that we can provide better initial diagnosis from our end and allow quicker targeting on their end. Finally, we have sorted out the communication problem that made it hard for us to exchange information during this period.
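The extra control BGP gives over OSPF can be illustrated with a simplified sketch: OSPF picks paths purely on cost, whereas BGP lets policy attributes such as local preference steer traffic. Real implementations weigh many more attributes, and the path names and values below are made up for illustration.

```python
# Simplified contrast between OSPF-style and BGP-style path selection.

def ospf_best(paths):
    # OSPF: lowest metric wins; there is no per-path policy knob.
    return min(paths, key=lambda p: p["cost"])

def bgp_best(paths):
    # BGP (simplified): highest local preference wins first,
    # then shorter AS path breaks ties.
    return max(paths, key=lambda p: (p["local_pref"], -p["as_path_len"]))

paths = [
    {"via": "upgraded-switch", "cost": 20, "local_pref": 200, "as_path_len": 2},
    {"via": "legacy-switch",   "cost": 10, "local_pref": 100, "as_path_len": 2},
]

print(ospf_best(paths)["via"])  # legacy-switch: cheapest, even if we'd rather avoid it
print(bgp_best(paths)["via"])   # upgraded-switch: policy steers traffic away
```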
I believe the lessons learned from this will mean that any future issues are resolved faster, especially during the current health crisis. The move away from OSPF will allow faster re-route times for us, and once we can get back to normality, the upgrades we are working on with them will make for faster, more resilient services overall.
Finally, I hope you are all keeping well and safe, and we look forward to meeting you all again on the other side.
Thank you and take care.