Virtual Server Disk Issue

Date: 
Wed, 06/03/2013 - 02:15 - 08:20

A storage issue is continuing to affect some virtual servers and services at our primary site. Our engineers are working on the issue.

We apologise for any inconvenience caused.

Update 06/03/2013 06:05: The issue has been rectified and our engineers are bringing the affected virtual machines back online as fast as possible. However, the association between disks and virtual machines has to be re-created manually, so this will take a little time.

Update 06/03/2013 08:20: All virtual machines affected by this issue are now back online. Further details of the issue will be compiled and detailed here shortly.

Update 06/03/2013 08:58: This issue was caused by two volume snapshots on our legacy SAN becoming locked, causing the data in that volume to grow larger than it otherwise would. This caused the volume to reach its maximum allowed size, and writes to the volume, which holds a large number of Linux virtual machines, began to fail.
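By way of illustration, the sketch below models how blocks pinned by locked snapshots add to a volume's on-disk footprint until the ceiling is reached and writes fail. The Volume class and the figures are entirely hypothetical and are not our SAN's real interface:

    # Simplified model of how locked copy-on-write snapshots inflate a
    # volume's footprint. The Volume class and figures are hypothetical,
    # not our SAN's real interface.
    class Volume:
        def __init__(self, max_size_gb, live_data_gb, snapshot_held_gb):
            self.max_size_gb = max_size_gb            # hard ceiling for the volume
            self.live_data_gb = live_data_gb          # data the running VMs reference
            self.snapshot_held_gb = snapshot_held_gb  # old blocks pinned by snapshots

        def used_gb(self):
            return self.live_data_gb + self.snapshot_held_gb

        def can_write(self, new_gb):
            # Writes fail once the combined footprint would exceed the ceiling.
            return self.used_gb() + new_gb <= self.max_size_gb

    vol = Volume(max_size_gb=2048, live_data_gb=1500, snapshot_held_gb=540)
    print(vol.used_gb(), vol.can_write(10))  # 2040 GB used, so a 10 GB write fails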

We quickly detected the issue and isolated the cause, removing snapshots to free up space and shutting down the affected virtual machines so that they could be restarted. Unfortunately, the XenServer hosts for these virtual machines misinterpreted the failed writes as an iSCSI multipathing failure.
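For illustration, the shutdown step uses the standard XenServer xe command line; the sketch below shows the shape of it, with placeholder UUIDs rather than real machines:

    # Sketch: force-shut-down a known list of affected VMs via the XenServer
    # "xe" CLI. The UUIDs are placeholders, not real machines.
    import subprocess

    AFFECTED_VM_UUIDS = [
        "11111111-1111-1111-1111-111111111111",  # hypothetical example UUIDs
        "22222222-2222-2222-2222-222222222222",
    ]

    for vm_uuid in AFFECTED_VM_UUIDS:
        # force=true because the guests could no longer flush writes cleanly
        subprocess.run(["xe", "vm-shutdown", f"uuid={vm_uuid}", "force=true"],
                       check=True)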

To recover from this we initially migrated machines off some hosts so that those hosts could be rebooted, but whilst the multipathing error cleared, these hosts still refused to mount the affected iSCSI LUNs.
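A host's view of a storage repository is held in its PBD (physical block device) objects, which is where the refusal to mount showed itself. A rough sketch of the kind of check involved, assuming the standard xe CLI and a placeholder SR UUID, is:

    # Sketch: list the PBDs for a storage repository and try to re-plug any
    # that are detached. The SR UUID is a placeholder.
    import subprocess

    SR_UUID = "33333333-3333-3333-3333-333333333333"  # hypothetical SR UUID

    def xe(*args):
        return subprocess.run(["xe", *args], capture_output=True,
                              text=True, check=True).stdout.strip()

    # A shared SR has one PBD per host; "currently-attached" shows whether
    # that host has the iSCSI LUN mounted.
    pbd_uuids = xe("pbd-list", f"sr-uuid={SR_UUID}", "params=uuid", "--minimal")
    for pbd_uuid in filter(None, pbd_uuids.split(",")):
        attached = xe("pbd-list", f"uuid={pbd_uuid}",
                      "params=currently-attached", "--minimal")
        if attached == "false":
            # In our case this step kept failing until the SR was re-introduced.
            xe("pbd-plug", f"uuid={pbd_uuid}")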

At this point we decided to attempt to bring the affected virtual machines up on our new SAN, to which these machines were scheduled to be migrated later this week and which already housed a replica of the volume. However, XenServer still refused to mount the LUNs, even from the new SAN.

To recover from this we removed the storage repository object for this volume in XenServer and re-introduced it. This step destroyed the metadata linking each virtual machine to its disks, which then had to be re-created manually for every virtual machine by mounting its file system, reading the network configuration to identify which virtual machine the disk belonged to, and creating the mapping. This process took just over two hours, with virtual machines coming back online throughout.
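For those interested in the mechanics, a rough sketch of the shape of that re-mapping process follows. It assumes the standard xe CLI; read_guest_network_identity() stands in for the manual step of mounting each disk in the control domain and reading its network configuration, and the SR UUID, device numbering and hostname-to-VM matching are illustrative rather than exact:

    # Sketch: re-create the VM-to-disk mapping after the storage repository
    # was forgotten and re-introduced. Assumes the standard XenServer "xe"
    # CLI; the helpers and UUIDs below are illustrative.
    import subprocess

    SR_UUID = "33333333-3333-3333-3333-333333333333"  # hypothetical SR UUID

    def xe(*args):
        return subprocess.run(["xe", *args], capture_output=True,
                              text=True, check=True).stdout.strip()

    def read_guest_network_identity(vdi_uuid):
        # Placeholder: in practice this meant activating the disk in the
        # control domain, mounting its root file system and reading files
        # such as /etc/network/interfaces and /etc/hostname to work out
        # which virtual machine the disk belonged to.
        raise NotImplementedError

    def vm_uuid_for(hostname):
        # Assumes, for the sketch only, that VM name-labels match hostnames.
        return xe("vm-list", f"name-label={hostname}", "params=uuid", "--minimal")

    # Walk every disk in the re-introduced SR and re-attach it to its VM.
    vdi_uuids = xe("vdi-list", f"sr-uuid={SR_UUID}", "params=uuid", "--minimal")
    for vdi_uuid in filter(None, vdi_uuids.split(",")):
        hostname = read_guest_network_identity(vdi_uuid)
        vm_uuid = vm_uuid_for(hostname)
        # Re-create the virtual block device linking the VM to its disk.
        # device=0 is used for simplicity; machines with more than one disk
        # needed distinct device numbers.
        xe("vbd-create", f"vm-uuid={vm_uuid}", f"vdi-uuid={vdi_uuid}",
           "device=0", "bootable=true", "mode=RW", "type=Disk")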

Because part of the resolution of this issue involved re-targeting the affected storage repository at our new SAN, customers affected by this issue will no longer be affected by the planned maintenance window previously advertised for this Sunday and can now enjoy the benefits of our newer storage infrastructure. Other customers will still have their virtual machines migrated at that time.

To prevent the issue from recurring, now that the volume is on the new SAN we are able to make it significantly larger. We will also be implementing volume size monitoring and associated alerts so that, if the issue does happen again, we can detect it before it causes a failure; a rough sketch of the kind of check we have in mind is included at the end of this post. Once again, we apologise for any inconvenience caused - please get in touch if you have any queries.
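As promised above, here is a rough sketch of the kind of volume-size check we have in mind. get_volume_used_gb() stands in for the SAN vendor's real reporting interface, and the ceiling, threshold and addresses are examples only:

    # Illustrative sketch of the planned volume-size monitoring. The helper
    # get_volume_used_gb() stands in for the SAN's real reporting interface;
    # the figures and addresses are examples, not our real configuration.
    import smtplib
    from email.message import EmailMessage

    VOLUME_MAX_GB = 4096      # example ceiling after enlarging the volume
    ALERT_THRESHOLD = 0.85    # warn well before writes can start to fail
    ALERT_RECIPIENT = "ops@example.com"

    def get_volume_used_gb(volume_name):
        # Placeholder for a query against the SAN's management interface.
        raise NotImplementedError

    def check_volume(volume_name):
        used = get_volume_used_gb(volume_name)
        if used / VOLUME_MAX_GB >= ALERT_THRESHOLD:
            msg = EmailMessage()
            msg["Subject"] = f"Volume {volume_name} is {used / VOLUME_MAX_GB:.0%} full"
            msg["From"] = "monitoring@example.com"
            msg["To"] = ALERT_RECIPIENT
            msg.set_content(f"{volume_name}: {used} GB of {VOLUME_MAX_GB} GB used")
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)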