Friday, September 18, 2009
Nehalem Processors + ESX 4 = Guest Monitoring Issues
After about a week or two, we started to notice that vCenter was regularly reporting guest memory at 95%-100% utilized and was constantly in an alert state. After some investigating, I noticed the guest machines themselves were not using very much memory at all. So I gathered some information, did some googling, and came across this thread on the VMware Communities Forums:
ESX4 + Nehalem Host + vMMU = Broken TPS !
It seems this is more an issue of vCenter reporting the information incorrectly, and it is not limited to Nehalem processors. The temporary fix is to set Mem.AllocGuestLargePage to 0 instead of its default of 1. It has been stated that this could cause a noticeable performance hit. However, the guests on the particular host I found this on should not really be affected, since they run some simple infrastructure services and a file server. So I made the change and had to reboot, since I could not vMotion the guests off (this is a Foundation box). Instantly it began to work and report the correct memory usage.
VMware stated that an actual fix for this should be released in Patch 2, with a rough time estimate of mid to late September. I still have not seen this patch released, and I now have two hosts that were experiencing this issue, both corrected with the workaround. I will be keeping my eyes on the new KBs and updates for this.
Wednesday, September 16, 2009
the parent virtual disk has been modified since the child was created
So today my colleague decided to move one of his test servers from an ESX Foundation box to our ESX cluster. The mistake he made was not removing the snapshots before he moved the guest to the cluster. He had also already deleted the original server from the ESX Foundation box, so the easiest fix was ruled out. He contacted me for help when he received this message while trying to boot the server:
the parent virtual disk has been modified since the child was created
Easy fix here, but backups must be taken of everything for safety's sake. Below is the VMDK descriptor from the actual base disk. This is a simple case because the guest had one base disk and one delta file. It would be easiest just to change the VMX file to point to the base disk VMDK, but the problem is that the snapshot had all the necessary apps loaded into it and the base disk was a plain installation.
# Disk DescriptorFile
version=1
CID=fc9c727e
parentCID=ffffffff
createType="vmfs"
# Extent description
RW 25165824 VMFS "flapjacks-flat.vmdk"
# The Disk Data Base
#DDB
ddb.adapterType = "lsilogic"
ddb.geometry.sectors = "63"
ddb.geometry.heads = "255"
ddb.geometry.cylinders = "1566"
ddb.uuid = "60 00 C2 9d ee 19 a7 ba-71 16 1c ac cc 2b 2b 09"
ddb.toolsVersion = "7202"
ddb.virtualHWVersion = "4"
See the CID above? Check the VMDK descriptor of the snapshot and I bet you money its parentCID doesn't match. Simply change the parentCID value in the snapshot's descriptor to match the CID of the base disk and the server should now boot. With the CIDs forced to match, ESX should think the disks were never out of sync.
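The CID edit described above can be sketched in Python. This is only an illustration working on descriptor text held in strings: the base-disk header is the one from this post, while the snapshot header (its CID values, its createType and the parentFileNameHint file name) is a made-up example, and the helper names read_field and repoint_child are my own, not part of any VMware tool.

```python
import re

# Base-disk descriptor header, copied from the listing above.
BASE_DESCRIPTOR = """# Disk DescriptorFile
version=1
CID=fc9c727e
parentCID=ffffffff
createType="vmfs"
"""

# Hypothetical snapshot descriptor header. The CID values and the file
# name in parentFileNameHint are invented for illustration only.
CHILD_DESCRIPTOR = """# Disk DescriptorFile
version=1
CID=a1b2c3d4
parentCID=0badf00d
createType="vmfsSparse"
parentFileNameHint="flapjacks.vmdk"
"""


def read_field(descriptor, field):
    """Return the value of a 'field=value' line in a descriptor."""
    pattern = rf"^{re.escape(field)}=(\S+)[ \t]*$"
    match = re.search(pattern, descriptor, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"{field} not found in descriptor")
    return match.group(1)


def repoint_child(child_descriptor, parent_descriptor):
    """Return the child descriptor with parentCID set to the parent's CID."""
    parent_cid = read_field(parent_descriptor, "CID")
    return re.sub(
        r"^parentCID=\S+",
        "parentCID=" + parent_cid,
        child_descriptor,
        count=1,
        flags=re.MULTILINE,
    )


fixed = repoint_child(CHILD_DESCRIPTOR, BASE_DESCRIPTOR)
print(fixed)
```

Only the parentCID line of the snapshot descriptor changes; its own CID and everything else is left alone. As with the manual edit, this belongs on a copy of the descriptor, after the backups are done.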
We had another problem. The old snapshot could not be deleted/merged because something still was not quite right. So I did a V2V with Converter to a new guest and was able to merge things that way. I know this is probably not a perfect outcome, but the new server is running stable now, so I will take it!
If this had been a more complex situation, with more changes made before the server was moved, the right call would have been to get it to boot, recover the data and then reload the server, because chances are it would not have been very stable.
Thursday, September 10, 2009
Quick thoughts on VMware long distance vMotion
It is nice to see Cisco, EMC and VMware teaming up on this, but right now it has some serious limitations. A minimum bandwidth of 622 Mbit/s isn't too bad, but in my mind a 5 ms latency limit is pretty low. At this point it might be useful for evacuation to some kind of disaster recovery site with a bigger pipe, but not quite for a 'follow the sun' approach between datacenters. When they figure out how to deal with higher latency and are able to go intercontinental with vMotion, this will change the way global companies' IT operations work!
Here is a video demonstration done at VMworld by Chad Sakac with EMC:
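To put rough numbers on those two limits, here is a back-of-the-envelope sketch. It rests on assumptions of mine, not on anything VMware published: the 5 ms is treated as round-trip time, light in fibre is taken as roughly 200 km per millisecond (about two thirds of its speed in vacuum), and the memory copy is a single pass with no protocol overhead and no re-sending of dirtied pages. The 4 GiB guest is just an example size.

```python
# Rough speed of light in optical fibre: about 2/3 of c, or ~200 km per ms.
FIBRE_KM_PER_MS = 200.0


def max_one_way_km(rtt_ms):
    """Upper bound on site separation for a given round-trip time.

    Ignores switching and routing delay, so real distances are shorter.
    """
    return rtt_ms / 2 * FIBRE_KM_PER_MS


def copy_seconds(memory_gib, link_mbit_per_s):
    """Time for one full pass over guest memory at the given link speed."""
    bits = memory_gib * 1024**3 * 8
    return bits / (link_mbit_per_s * 1_000_000)


print(f"5 ms round trip allows at most {max_one_way_km(5):.0f} km of fibre")
print(f"4 GiB of guest memory takes about {copy_seconds(4, 622):.0f} s at 622 Mbit/s")
```

Under these assumptions the latency limit caps the two sites at a few hundred kilometres of fibre at best, which is why this works for a regional DR site but not between continents.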
Thursday, September 3, 2009
Upgrading vCenter 2.5u4 to vCenter 4.0
Today I began my first phase of upgrading our 3.5u4 ESX environment in our Chicago office to 4.0. Upgrading vCenter is the first step. I would have preferred to create a whole new fresh install but decided I would upgrade and see if it came out ok. With snapshots and the like I always have the opportunity to go back. My next step is to change two of our hosts over to ESX 4.0 in a week or so and test it out for a couple of weeks before I fully vSphere-icize our environment. I am already fairly certain everything will be fine as I already have a 4.0 cluster going in our Zurich office. Here is a cut and paste of some braindumping I was doing into notepad as I was doing the upgrade:
upgraded memory on vCenter server from 1GB to 4GB
added a second processor to the vCenter Server
Separate Database Server. Bumped Memory from 1GB to 3GB
Double checked all my SQL dbo perms
Made backup of the virtualcenter db
Disabled HA – Taking very long and sluggish then suddenly finished
Ran Upgrade on vCenter Server
Updating the client
re-enabled HA
Test Drove it a little bit to make sure it was performing properly
Uh oh… Trouble
Database server is cranked.. Have to give it a second CPU. Should have seen that coming
Now DB server is fine and virtual center server is cranked 100%.. lol
statsupdate eating cpu on DB server – working on it decided to service pack SQL server
Turned out to be this problem
Hmmmmm…. it appears even tho it states 2.5 this is still a bug in 4.0 and the fix works
In this version tho you have to drop the views before you create them.
Installed new version of converter and update manager and tested to see if they worked.
All is running good now… Next phase re-install two of my hosts with ESX 4.0 and test for a few weeks before going all the way