Following on from yesterday's issues, the post-mortem is as follows.
At around 16:00 we identified traffic issues which were resulting in high latency and packet loss of around 25-35% for all internet-bound traffic. We then started our diagnostic work to isolate the potential cause, and within about 15 minutes had identified the issue as most likely beyond our control. At this point we reached out via our escalation routes to our IP transit providers, handed over the diagnostic information, and received our first response at around 16:30. They had completed some basic testing, couldn't see the same issue we could, and requested further diagnosis. It was at this point the job split into two tasks.
There appeared to be a delay of about 45 minutes whilst we were working through the first task and waiting for information back on the second, which, it now transpires, was down to the current Covid-19 action plan: those who needed to communicate didn't have the access they needed as quickly as they normally would.
At around 17:00 we decided to flip our circuit to the backup route. Even though we could still see an issue on both routes, we wanted to isolate the master switch and route traffic out via the backup switch to see if the fault lay with the switch itself. This required some manual cable re-routing and took about 15 minutes to complete, plus a further 10 minutes waiting for ARP entries to re-route across the network. Unfortunately our communication with the IP transit providers was hampered during this time, but it transpires they had been working on further diagnosis and had started implementing a fix.
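For anyone curious why the cutover needed that extra 10 minutes: neighbouring devices cache MAC addresses (ARP) and keep sending traffic to the stale entry until it ages out or is refreshed. A rough sketch of that behaviour is below; the timeout value, addresses, and class are illustrative, not our actual configuration.

```python
# Illustrative ARP cache: entries persist until they age out, which is
# why traffic kept heading to the old switch for a while after the
# cables were moved. The 600-second timeout here is an assumption.

class ArpCache:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.entries = {}                     # ip -> (mac, learned_at)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def resolve(self, ip, now):
        """Return the cached MAC while fresh; None once it has aged out."""
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at > self.timeout_s:
            del self.entries[ip]              # stale: must re-ARP, finding the backup switch
            return None
        return mac

cache = ArpCache(timeout_s=600)               # ~10 minutes, matching what we saw
cache.learn("192.0.2.1", "aa:bb:cc:dd:ee:01", now=0)
print(cache.resolve("192.0.2.1", now=300))    # still the old entry
print(cache.resolve("192.0.2.1", now=900))    # aged out: None, traffic re-ARPs
```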
At around 17:30 we thought we had identified the issue; however, the IP transit providers had been making changes, and nothing we could have done would have fixed it, as the fault lay with their kit. They, like us, had been upgrading core infrastructure: in our case so we can provide 1Gb connectivity, and in theirs so they can support the additional demands that we and companies like us place on them. The Covid-19 situation has meant that all non-essential work was paused, so upgrade work that should have been completed is currently on hold. They had replaced 3 of the 4 core switches prior to Covid-19; however, our inbound route, which is chosen via OSPF, happened to go via the one that hadn't been upgraded, and it was that switch that ran out of space in its routing table and caused the issue.
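For those interested in the technical detail, a routing table running out of space can be sketched roughly as follows. This is a simplified illustration with a made-up capacity and prefixes, not the behaviour of the actual switch: once the table is full, new routes cannot be programmed, and traffic to those destinations fails or degrades.

```python
# Simplified model of a hardware routing table with a fixed capacity.
# Real switches hold routes in limited TCAM; once it fills, newly learned
# prefixes cannot be installed, and traffic to them is dropped or punted
# to a much slower path -- which looks like high latency and packet loss.

class RoutingTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.routes = {}                      # prefix -> next hop

    def install(self, prefix, next_hop):
        """Try to install a route; fail once the table is full."""
        if prefix not in self.routes and len(self.routes) >= self.capacity:
            return False                      # table exhausted: route not programmed
        self.routes[prefix] = next_hop
        return True

    def lookup(self, prefix):
        """Return the next hop, or None (traffic to this prefix fails)."""
        return self.routes.get(prefix)

table = RoutingTable(capacity=2)              # tiny capacity for illustration
table.install("203.0.113.0/24", "peer-a")
table.install("198.51.100.0/24", "peer-b")
ok = table.install("192.0.2.0/24", "peer-a")  # third route: table already full
print(ok)                                     # False
print(table.lookup("192.0.2.0/24"))           # None: unreachable destination
```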
They have identified and resolved the issue, which should mean we don't see a recurrence before the switch is replaced. They are also working on replacing OSPF with BGP for us, which will provide more control. We have also worked with them on an action plan so that we can provide better initial diagnosis from our end and allow quicker targeting on their end. Finally, we have sorted out the communication problem that made it hard for us to exchange information during this period.
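The extra control BGP gives over OSPF can be illustrated with a simplified sketch: OSPF picks paths purely on cost, whereas BGP lets policy attributes such as local preference steer traffic. Real implementations weigh many more attributes, and the path names and values below are made up for illustration.

```python
# Simplified contrast between OSPF-style and BGP-style path selection.

def ospf_best(paths):
    # OSPF: lowest metric wins; there is no per-path policy knob.
    return min(paths, key=lambda p: p["cost"])

def bgp_best(paths):
    # BGP (simplified): highest local preference wins first,
    # then shorter AS path breaks ties.
    return max(paths, key=lambda p: (p["local_pref"], -p["as_path_len"]))

paths = [
    {"via": "upgraded-switch", "cost": 20, "local_pref": 200, "as_path_len": 2},
    {"via": "legacy-switch",   "cost": 10, "local_pref": 100, "as_path_len": 2},
]

print(ospf_best(paths)["via"])  # legacy-switch: cheapest, even if we'd rather avoid it
print(bgp_best(paths)["via"])   # upgraded-switch: policy steers traffic away
```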
I believe the lessons learned from this will mean that any future issues are resolved faster, especially during the current health crisis. The move away from OSPF will allow faster re-route times for us, and once we can get back to normality, the upgrades we are working on with them will make for faster, more resilient services overall.
Finally, I hope you are all keeping well and safe, and we look forward to meeting you all again on the other side.
Thank you and take care.