Thoughts on Virtual Infrastructure Management

Is your cluster N+2, N+1, or N+0 from a capacity perspective?

Most of you know how vMotion and DRS work, and you make sure you do not “shoot yourself in the foot” by exceeding thresholds for too long.  You also know this is not load balancing.

If one physical host goes down and sheds its VMs successfully to the rest of the cluster, is everybody going to play nicely?  Take this very simple example. 

You have 100 VMs on five hosts, 20 per host.  One host goes down and its 20 VMs vMotion to the four remaining hosts, five apiece.  Each surviving host ends up handling about the same workload only if the five VMs it receives were loaded the same.  Otherwise, the workload increase on some hosts can be far out of proportion to the number of VMs they took on.
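The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the load units, the VM sizes, and the round-robin hand-off in `spread_by_count` are assumptions made up for the example, and this is not how DRS actually places VMs.

```python
def spread_by_count(survivor_loads, displaced_vm_loads):
    """Hand displaced VMs to surviving hosts round-robin, ignoring VM size.

    Hypothetical placement rule for illustration only.
    """
    loads = list(survivor_loads)
    for i, vm_load in enumerate(displaced_vm_loads):
        loads[i % len(loads)] += vm_load
    return loads


# Four surviving hosts, each already carrying 20 VMs at 1 load unit apiece.
survivors = [20.0, 20.0, 20.0, 20.0]

# Case 1: the 20 displaced VMs are identical (1 unit each).
after_equal = spread_by_count(survivors, [1.0] * 20)

# Case 2: same total of 20 units, but every fourth VM is a heavy one.
after_uneven = spread_by_count(survivors, [2.5, 0.5, 0.5, 0.5] * 5)

print(after_equal)   # every host takes on 5 units
print(after_uneven)  # one host takes on 12.5 units, the others 2.5
```

Both cases move five VMs to each host and the same total load, yet in the second case one host absorbs five times the extra work of its neighbors.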

If the VMs’ workloads were not all created equal, expect more VM transfers and some amount of pin-wheeling and rotation of the VMs around the remaining member hosts.

Some vendors offer a “capacity planning” option for their software that turns getting all the data about your datastores, and storage in general, into an exercise in Excel.  How does this type of planning make you or your team more efficient?  The key to planning is getting the right data into the right hands at the right time.  Is the right thing graphing 74 servers in one graph, or telling you that a spike in memory, CPU or I/O is an issue?

Another thing I’ve learned through working with a lot of virtualization administrators is that reports, such as BalancePoint’s Performance Index report, get IT folks out of counting “C drives” and into fixing problems, avoiding application slowdowns, and lowering their costs by getting the most out of their physical and human assets.

Total System Workload/Capacity Modeling

How can you get a handle on how much cluster capacity you have if you can’t get the right data at the right time in the right hands?  You would think the answer to a simple question like “is our cluster N+1?” should just pop out of a tool.  It doesn’t.  All VMs are not created equal.  Workloads vary per VM, and sometimes within a VM based on the time of day.  This is tough math and not a straight-line equation.  It is not simply drawing pretty graphs.  Remember: garbage in, garbage out.
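To make the N+1 question concrete, here is a minimal sketch of the simplest possible check: does the cluster's peak hourly demand still fit once the largest hosts are taken away? The function name `tolerates_failures`, the capacity units, and all the numbers are assumptions for illustration. It also only compares totals, so passing it is necessary but not sufficient; a real answer has to account for how individual VMs actually fit on individual hosts.

```python
def tolerates_failures(host_capacity, hourly_vm_demand, failures=1):
    """Return True if peak cluster demand fits on the hosts that remain
    after the `failures` largest hosts are lost (worst case).

    host_capacity:    list of per-host capacities (arbitrary units)
    hourly_vm_demand: list of hours, each a list of per-VM demand
    """
    # Dropping the largest hosts is the worst case for remaining capacity.
    surviving = sorted(host_capacity)[: len(host_capacity) - failures]
    available = sum(surviving)
    # Demand varies by time of day, so size against the busiest hour.
    peak = max(sum(hour) for hour in hourly_vm_demand)
    return peak <= available


hosts = [100.0] * 5          # five hosts, 100 units each (hypothetical)
quiet_hour = [3.5] * 100     # 100 VMs, 350 units in total
busy_hour = [4.5] * 100      # same VMs at their busy time, 450 units

print(tolerates_failures(hosts, [quiet_hour], failures=1))             # fits on four hosts
print(tolerates_failures(hosts, [quiet_hour, busy_hour], failures=1))  # busy hour does not
```

The same cluster looks N+1 if you only sample the quiet hour and N+0 once the busy hour is included, which is the "garbage in, garbage out" problem in miniature.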

The whole idea around Performance Indicators is to turn data into information.  For example, if I have a Performance Index of 90 across my cluster of 200 guests on 10 ESX servers, I am not N+1.  Ultimately, there will be an impact on the response time of my applications if I have a hardware failure, or even if I use vMotion to evacuate a host for maintenance.
