Friday, September 18, 2009

Nehalem Processors + ESX 4 = Guest Monitoring Issues

Recently we bought a Dell PowerEdge T610 with a couple of Intel Xeon Nehalem L5520 processors to serve as an ESX Foundation server for our new Seoul, South Korea office. It was installed and set up on site with two guest machines running nothing heavy, leaving room for some growth.

After about a week or two we started to notice that vCenter was regularly reporting memory at 95%-100% utilized on the guests, and it was constantly in an alert state. After doing some investigating I noticed the actual guest machines were not using very much memory at all. So I gathered some information, did some googling, and came across this thread on the VMware Communities Forums:

ESX4 + Nehalem Host + vMMU = Broken TPS !

It seems that this is more an issue of vCenter reporting the information incorrectly (and not only on Nehalem processors). The temporary fix is to set Mem.AllocGuestLargePage to 0 instead of its default, which is 1. It has been stated that this could cause a noticeable performance hit. However, the guests on the particular host I found this on should not really be affected, since they are simple infrastructure services and a file server. So I made the change and had to reboot, since I could not vMotion the guests off (Foundation box). Instantly it began to work and report the correct memory usage.

VMware stated that an actual fix for this should be released in Patch 2, with a rough time estimate of mid to late September. I still have not seen this patch released, and I now have two hosts that experienced this issue, both of which I have corrected with the workaround. I will be keeping my eyes on the new KBs and updates for this.

Wednesday, September 16, 2009

the parent virtual disk has been modified since the child was created

Now I know this is an old one dating way back, but I just wanted to add this as a note on my blog here. If you can use it, great... I will start by telling a story:

So today my colleague decided to move one of his test servers from an ESX Foundation box to our ESX cluster. The mistake he made was not removing the snapshots before he moved the guest to the cluster. He had also already deleted the original server from the ESX Foundation box, so the easiest fix was ruled out. He contacted me for help when he received this message while trying to boot the server:

the parent virtual disk has been modified since the child was created


Easy fix here, but backups must be taken of everything for safety's sake. Below is the VMDK descriptor from the actual base disk. Now this is a simple one because it had one base disk and one delta file. It would be easiest just to change the VMX file to point to the base disk VMDK, but the problem here is that the snapshot had all the necessary apps loaded into it and the base disk was a plain installation.


# Disk DescriptorFile
version=1
CID=fc9c727e
parentCID=ffffffff
createType="vmfs"

# Extent description
RW 25165824 VMFS "flapjacks-flat.vmdk"

# The Disk Data Base
#DDB

ddb.adapterType = "lsilogic"
ddb.geometry.sectors = "63"
ddb.geometry.heads = "255"
ddb.geometry.cylinders = "1566"
ddb.uuid = "60 00 C2 9d ee 19 a7 ba-71 16 1c ac cc 2b 2b 09"
ddb.toolsVersion = "7202"
ddb.virtualHWVersion = "4"


See the CID above? Check the VMDK descriptor of the snapshot and I bet you money its parentCID doesn't match. Simply change the parentCID value in the snapshot's descriptor to match the CID on the base disk and the server should now boot. By forcing the CIDs to match, it should think it was never out of sync.
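To make the CID check concrete, here is a minimal Python sketch that compares the base disk's CID with the snapshot's parentCID and rewrites the latter if they differ. It works on descriptor text only. The base descriptor is the one shown above; the snapshot descriptor, its CID values and its file names are hypothetical stand-ins for illustration, and the function names are my own. Treat it as a sketch to run against copies of your descriptors, not a supported tool.

```python
import re

# Base disk descriptor fields taken from the listing in this post.
BASE_DESCRIPTOR = '''# Disk DescriptorFile
version=1
CID=fc9c727e
parentCID=ffffffff
createType="vmfs"

# Extent description
RW 25165824 VMFS "flapjacks-flat.vmdk"
'''

# Hypothetical snapshot descriptor: CID, parentCID and file names are made up.
SNAPSHOT_DESCRIPTOR = '''# Disk DescriptorFile
version=1
CID=1a2b3c4d
parentCID=0badc0de
createType="vmfsSparse"
parentFileNameHint="flapjacks.vmdk"

# Extent description
RW 25165824 VMFSSPARSE "flapjacks-000001-delta.vmdk"
'''

CID_LINE = re.compile(r"^\s*(CID|parentCID)\s*=\s*([0-9a-fA-F]{8})\s*$")
PARENT_LINE = re.compile(r"(?m)^([ \t]*parentCID[ \t]*=[ \t]*)[0-9a-fA-F]{8}[ \t]*$")


def read_cid_fields(descriptor_text):
    """Return a dict with the CID and parentCID values found in the text."""
    fields = {}
    for line in descriptor_text.splitlines():
        match = CID_LINE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).lower()
    return fields


def chain_is_consistent(base_text, snapshot_text):
    """True when the snapshot's parentCID equals the base disk's CID."""
    base_cid = read_cid_fields(base_text).get("CID")
    parent_cid = read_cid_fields(snapshot_text).get("parentCID")
    return base_cid is not None and base_cid == parent_cid


def fix_parent_cid(base_text, snapshot_text):
    """Return the snapshot text with parentCID rewritten to the base CID."""
    base_cid = read_cid_fields(base_text)["CID"]
    return PARENT_LINE.sub(lambda m: m.group(1) + base_cid, snapshot_text)


if __name__ == "__main__":
    if not chain_is_consistent(BASE_DESCRIPTOR, SNAPSHOT_DESCRIPTOR):
        print(fix_parent_cid(BASE_DESCRIPTOR, SNAPSHOT_DESCRIPTOR))
```

Only the parentCID line of the snapshot descriptor changes; the snapshot's own CID and the delta file are left alone.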

We had another problem. The old snapshot could not be deleted/merged because something still was not quite right. So I did a V2V with Converter to a new guest and was able to merge things that way. Now I know this is probably not a perfect situation, but the new server is running stable now so I will take it!

If this had been a more complex situation, with more changes made before the server was moved, the data should simply have been recovered after getting it to boot and the server then reloaded, because chances are in that situation it would not be very stable.

Thursday, September 10, 2009

Quick thoughts on VMware long distance VMotion

I can't say much more than what has already been said about VMware announcing long distance VMotion capabilities up to 200km, but I had some quick thoughts about it and figured I would put them down here.

It is nice to see Cisco, EMC and VMware teaming up on this, but right now it has some serious limitations. The minimum bandwidth of 622 Mbits/sec isn't too bad. In my mind, though, the 5 ms latency limit is pretty low. At this point it might be useful for evacuation to some kind of disaster recovery site with a bigger pipe, but not quite for a 'follow the sun' approach between datacenters. When they figure out how to deal with higher latency and are able to go inter-continental with VMotion, this will change the way global companies' IT operations work!
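A quick back-of-the-envelope sketch in Python shows why the latency figure is the tighter constraint. It assumes light in fiber covers roughly 200 km per millisecond, treats the 5 ms figure as a round-trip limit, and ignores switching and routing overhead. The 7,000 km path and the 4 GB guest are hypothetical numbers picked for illustration, not anything from the announcement.

```python
# Rough propagation speed of light in optical fiber (about two thirds of c).
FIBER_KM_PER_MS = 200.0

# Figures quoted in the post; treating the latency as round-trip is an assumption.
MAX_RTT_MS = 5.0
MIN_LINK_MBIT_PER_S = 622.0


def fiber_rtt_ms(distance_km):
    """Best-case round-trip time over fiber, ignoring equipment delays."""
    return 2.0 * distance_km / FIBER_KM_PER_MS


def memory_copy_seconds(memory_gb, link_mbit_per_s):
    """Time to push a guest's memory once over the link (decimal GB)."""
    return memory_gb * 8.0 * 1000.0 / link_mbit_per_s


if __name__ == "__main__":
    for label, km in [("announced 200 km limit", 200.0),
                      ("hypothetical 7,000 km path", 7000.0)]:
        rtt = fiber_rtt_ms(km)
        verdict = "within" if rtt <= MAX_RTT_MS else "over"
        print(f"{label}: about {rtt:.1f} ms round trip, {verdict} the 5 ms limit")

    seconds = memory_copy_seconds(4.0, MIN_LINK_MBIT_PER_S)
    print(f"hypothetical 4 GB guest at 622 Mbit/s: about {seconds:.0f} s per full copy")
```

Under these assumptions, distance alone puts an intercontinental path well past the limit before any equipment delay is counted, which is why the bandwidth number looks like the easier requirement to meet.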

Here is a video demonstration done at VMworld by Chad Sakac with EMC:

Thursday, September 3, 2009

Upgrading vCenter 2.5u4 to vCenter 4.0


Today I began my first phase of upgrading our 3.5u4 ESX environment in our Chicago office to 4.0. Upgrading vCenter is the first step. I would have preferred to create a whole new fresh install but decided I would upgrade and see if it came out ok. With snapshots and the like I always have the opportunity to go back. My next step is to change two of our hosts over to ESX 4.0 in a week or so and test it out for a couple of weeks before I fully vSphere-icize our environment. I am already fairly certain everything will be fine as I already have a 4.0 cluster going in our Zurich office. Here is a cut and paste of some braindumping I was doing into notepad as I was doing the upgrade:

upgraded memory on vCenter server from 1GB to 4GB

added a second processor to the vCenter Server

Separate database server. Bumped memory from 1GB to 3GB

Double checked all my SQL dbo perms

Made backup of the virtualcenter db

Disabled HA – Taking very long and sluggish then suddenly finished

Ran Upgrade on vCenter Server

Updating the client

re-enabled HA

Test Drove it a little bit to make sure it was performing properly

Uh oh… Trouble

Database server is cranked.. Have to give it a second CPU. Should have seen that coming

Now DB server is fine and virtual center server is cranked 100%.. lol

statsupdate eating CPU on DB server – while working on it, decided to service pack SQL Server

Turned out to be this problem

LINK TO VMWARE DOC

Hmmmmm…. it appears even tho it states 2.5 this is still a bug in 4.0 and the fix works

In this version tho you have to drop the views before you create them.

Installed new versions of Converter and Update Manager and tested to see if they worked.

All is running good now… Next phase re-install two of my hosts with ESX 4.0 and test for a few weeks before going all the way