Datacenter maintenance

The datacenter we use performed an upgrade on Sunday, and it didn't go as planned.

The upgrade this morning was a mass software upgrade in preparation for some large additions to our network and to take care of some general housekeeping.

The upgrade path was pretty clear-cut and, as usual, verified by a third party. The initial upgrade was successful, but around 3am certain routing loops caused unusually high processor utilization. This issue resulted in the datacenter's decision to back out of the upgrade. The process of downgrading the software on the core switches was impeded by a version mismatch between the existing firmware on the external routing cards and the management software on the switch itself. This left the datacenter in limbo between two versions, unable to reinstall the original code. They are currently waiting for the manufacturer to write a software patch that will let them either go fully forward or fall back to the old version.

Right now, the most obvious symptom of this condition is packet loss. The datacenter is working with the manufacturer, who is onsite, to resolve this.

We are planning a DNS-based failover for cases like this; more on that later. We also won't charge this month's hosting fees. Damn, we gotta get things working reliably before we start charging people.
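
To give a rough idea of what a DNS-based failover could look like, here is a minimal sketch in Python. It is not our actual setup, which isn't built yet. The function names (is_reachable, pick_record) and the two IP addresses are made up for illustration, and the step that actually updates the DNS record is left out because it depends entirely on the DNS provider.

```python
import socket


def is_reachable(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def pick_record(primary_ip, backup_ip, check=is_reachable):
    """Choose the IP address the DNS A record should point at.

    Hypothetical helper: prefers the primary, falls back to the backup,
    and keeps the primary if neither one answers.
    """
    if check(primary_ip):
        return primary_ip
    if check(backup_ip):
        return backup_ip
    return primary_ip


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges, not real servers.
    target = pick_record("192.0.2.10", "198.51.100.10")
    print("A record should point at:", target)
```

A script like this would run every minute or so from somewhere outside the datacenter, and the record it manages would need a short TTL, otherwise resolvers keep handing out the old address long after the switch.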

Update: there were about two hours of downtime and another 16 hours of packet loss, which meant slowness.