Tech4Him – Technology with Integrity

A Christian technology chaos wrangler and his thoughts

VMware ESXi – VM Crashes during failed Snapshot Delete

Posted by admin on November 20, 2008

Okay, I’m really thanking the good Lord right now. He granted us discernment that kept us from losing an entire day’s worth of data for my employer. To Him be all the glory.

The Problem

So here is the scenario. I am removing an old VMware snapshot through the VI Client on a large file and print server. Of course, it is a virtual server running Windows Server 2003. The snapshot removal was being done in preparation for taking a new snapshot prior to extending one of the virtual drives on this virtual server. The snapshot delete gets to 95% and hangs. After 45 minutes I restart the management agents on the ESXi server, as a number of similar posts suggest.

Upon reconnecting to the server via the VI client, I see suddenly that this server is now listed in the inventory as “Unknown (invalid)”. What? This can’t be happening! Can it?

The long and short of it is that I ended up creating a new virtual machine that used the existing virtual disks (.vmdk files). I started the new VM up and there it was. My server was back, or was it? A quick look at the Event Viewer told me the horrible news: there were events from the last few minutes and then nothing until about three months ago, when we originally migrated the physical server to a virtual one.

Oh my, the snapshots! I had the new VM pointed at the original parent disks, not the latest snapshot files. I shut the server back down and said a quick prayer. I manually downloaded and modified the .vmx definition to point to the various server-000003.vmdk files (these were the last in the snapshot numbering scheme), uploaded the .vmx back and started the VM up.
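The .vmx edit described above can also be sketched in shell. This is only a minimal sketch, assuming the disk entries in the .vmx look like scsi0:0.fileName = "SERVER.vmdk"; the helper name retarget_vmx_disk and the file names in the usage line are made up for illustration. It prints the modified file to standard output rather than editing in place, so the original .vmx stays untouched until you decide to replace it.

```shell
# Hypothetical helper: print a copy of a .vmx with one disk reference swapped.
# Usage: retarget_vmx_disk VMX_FILE OLD_VMDK NEW_VMDK
# Assumes disk lines look like:  scsi0:0.fileName = "SERVER.vmdk"
retarget_vmx_disk() {
    vmx=$1
    old=$2
    new=$3
    # Only touch lines that carry a fileName key; everything else passes through.
    # Note: the "." in OLD_VMDK is treated as a regex wildcard by sed, which is
    # harmless here because the surrounding double quotes must match as well.
    sed "/fileName/ s|\"$old\"|\"$new\"|" "$vmx"
}

# Example (file names are placeholders):
#   retarget_vmx_disk SERVER.vmx SERVER.vmdk SERVER-000003.vmdk > SERVER.vmx.new
```

Review the output before uploading it, and repeat the call once per virtual drive if the VM has more than one.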

No luck, I get an error message stating that there was a problem with the disks and the snapshot, “parent virtual disk has been modified”. Oh crud! I killed the snapshot linkages by starting the VM with the parent disks the first go round instead of the snapshots.

Solution

Thanks to my good friend Google, I quickly found my way to the Drive:Activated blog and this posting about the disk file linkage issue.

This boils down to ensuring that you properly chain the parentCID in the highest-numbered .vmdk disk file to the CID of the next-highest-numbered .vmdk file. Continue this until you get to the root .vmdk. (It is the one that doesn’t have the -00000x numbering after the disk name.)
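The chaining rule can be checked mechanically once you have shell access to the files. Here is a minimal sketch, assuming each .vmdk named here is the small text descriptor containing lines of the form CID=xxxxxxxx and parentCID=xxxxxxxx. The function names get_cid, get_parent_cid and check_link are made up for illustration.

```shell
# Hypothetical helpers for reading the IDs out of a .vmdk descriptor file.
# The ^ anchor keeps "CID=" from also matching the "parentCID=" line.
get_cid() {
    grep '^CID=' "$1" | head -n 1 | cut -d= -f2
}

get_parent_cid() {
    grep '^parentCID=' "$1" | head -n 1 | cut -d= -f2
}

# check_link CHILD PARENT
# Succeeds only if the child's parentCID equals the parent's CID.
check_link() {
    child_parent=$(get_parent_cid "$1")
    parent_cid=$(get_cid "$2")
    if [ -n "$parent_cid" ] && [ "$child_parent" = "$parent_cid" ]; then
        echo "OK       $1 -> $2 ($parent_cid)"
    else
        echo "MISMATCH $1 parentCID=$child_parent, $2 CID=$parent_cid"
        return 1
    fi
}

# Example: walk one drive's chain from the newest snapshot down to the root.
#   check_link SERVER-000003.vmdk SERVER-000002.vmdk
#   check_link SERVER-000002.vmdk SERVER-000001.vmdk
#   check_link SERVER-000001.vmdk SERVER.vmdk
```

Every link should report OK before you try to power the VM on again. A MISMATCH line tells you which descriptor needs attention.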

Since we are running on ESXi I needed a way to read and write the .vmdk files. There are a couple of ways to do it.

1. Use the VI Client Storage browser, download the .vmdk files, edit them and then upload them.
2. Use shell utilities to do the same directly on the ESXi server.

ESXi is missing some things from its big brother, namely a console. Thanks to numerous posts including this one, I entered the unsupported ESXi console.

From here, I navigated to the storage area and followed the concepts in the Drive:Activated article. Specifically I first used the grep command to quickly ascertain the various CID values for each vmdk file in each virtual drive series.


> grep CID= SERVER.vmdk
> grep CID= SERVER-000001.vmdk
> grep CID= SERVER-000002.vmdk
> grep CID= SERVER-000003.vmdk

You get the idea. I had a number of virtual drives in this server so I made myself a table to keep track of the CID and ParentCID.
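If you would rather not build that table by hand, a short loop can print it. This is a minimal sketch under the same assumption as the grep commands above, namely that each file named is a text descriptor with CID= and parentCID= lines; the function name cid_table is made up for illustration.

```shell
# Hypothetical helper: print one row per descriptor file with its CID and parentCID.
# Usage: cid_table FILE...
cid_table() {
    printf '%-24s %-10s %-10s\n' FILE CID PARENTCID
    for f in "$@"; do
        # Anchored patterns so the CID lookup does not also pick up parentCID.
        cid=$(grep '^CID=' "$f" | head -n 1 | cut -d= -f2)
        parent=$(grep '^parentCID=' "$f" | head -n 1 | cut -d= -f2)
        printf '%-24s %-10s %-10s\n' "$f" "$cid" "$parent"
    done
}

# Example (names match the grep commands above):
#   cid_table SERVER.vmdk SERVER-00000?.vmdk
```

The SERVER-00000?.vmdk pattern is deliberate: a broader wildcard such as SERVER*.vmdk could also match the large data files that sit next to the descriptors, if your datastore names them that way.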

[Image: vmdk_cid_matrix, the table of CID and parentCID values for each .vmdk file]

As you can see above, the SERVER and SERVER2 disks had two parentCIDs that I needed to fix (RED). Best I can tell, the snapshot problem hung on the SERVER_1 disks as the 000001, 000002, 000003 disks all had the same CID. Kinda hard to link them together that way! :) By the grace of God, this was a small archive drive and I could restore it quickly from backups.
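Fixing a broken link means rewriting the parentCID line in the child descriptor so it matches the CID of its parent. A text editor works fine for this. For anyone who prefers to script it, here is a minimal sketch; the function name set_parent_cid is made up for illustration, and it assumes the file is the text descriptor with a single parentCID= line. It keeps a .bak copy first, because a wrong edit here makes things worse.

```shell
# Hypothetical helper: rewrite the parentCID line of a .vmdk descriptor.
# Usage: set_parent_cid DESCRIPTOR NEW_PARENT_CID
set_parent_cid() {
    file=$1
    new_cid=$2
    # Keep a backup of the descriptor before touching it.
    cp "$file" "$file.bak" || return 1
    # Replace only the line that starts with parentCID=; the CID= line is untouched.
    sed "s/^parentCID=.*/parentCID=$new_cid/" "$file.bak" > "$file"
}

# Example (the CID value is a placeholder; use the parent's real CID):
#   set_parent_cid SERVER-000002.vmdk 1a2b3c4d
```

Run it only with the VM powered off, and re-check the CID values with grep afterwards.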

Finally I restarted the VM and after an anxious 10 minutes for ESXi to boot the server, the lovely Windows logon screen greeted me. I logged in and what do you know, we were good to go.

The good news is, the server is back up and running. Everything was either good to go or was restored from backups (4GB of files out of over 500GB).
