It has become evident to me as I move forward in my career that:
- Virtualization is a complicated subject that is not well understood even by IT professionals;
- Virtualization has been made available to the common man without any barriers to entry.
Following this, I thought I might make a quick “how-to” guide for people who might have dipped their toes already, but want to know a bit more in terms of “best practices” for smaller HyperV (or VMWare!) setups. It’s a collection of my own personal best practices that I hope may help someone else. I hope to expand it as I learn myself.
I’ve not recommended or installed a server with fewer than four Ethernet ports in years. Having said this, one common setup mistake I see is a single Ethernet cable hanging out of a server, the other ports screaming to be used. Here are a few recommendations:
- If your server is using iSCSI to connect to a SAN, make sure you have at least two interfaces’ worth of traffic to/from your SAN. Read about VMWare MPIO here in this PDF or Hyper-V MPIO here.
This gets you multiple paths to your storage, increasing total bandwidth to your SAN and giving you the redundancy of two networking paths.
- If you are using iSCSI to connect to a SAN, enable jumbo framing on your switch. After confirming your switch supports it, enable the feature (usually a global option and usually requiring a switch reboot). For the NICs involved (SAN and Server) change the MTU to 9000 (or whatever the least common denominator is for your equipment). On some Windows NICs you may have to enable other features in the NIC driver settings.
For most workloads, this gets you faster speeds and lower latencies on your Disk I/O.
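To make the “least common denominator” point concrete, here is a rough Python sketch. The device names and MTU values are made up for illustration, and the 40 bytes of per-frame overhead assumes plain IPv4 and TCP headers with no options (iSCSI’s own headers are ignored), so treat the numbers as ballpark only.

```python
STANDARD_MTU = 1500


def usable_mtu(device_mtus):
    """Return the largest MTU that every device on the iSCSI path can carry."""
    if not device_mtus:
        raise ValueError("need at least one device")
    return min(device_mtus.values())


def frames_needed(payload_bytes, mtu, header_bytes=40):
    """Frames needed for a payload, assuming 40 bytes of IP+TCP headers per frame."""
    per_frame = mtu - header_bytes
    return -(-payload_bytes // per_frame)  # ceiling division


# Hypothetical path: host NIC -> switch -> SAN NIC.
path = {"host_nic": 9000, "switch": 9216, "san_nic": 9000}

mtu = usable_mtu(path)
print("MTU to configure:", mtu)
print("frames per 1 MB at 1500:", frames_needed(1_000_000, STANDARD_MTU))
print("frames per 1 MB at", mtu, ":", frames_needed(1_000_000, mtu))
```

Fewer frames for the same data means fewer headers and fewer packets for every device on the path to process, which is where the gain comes from. If any one device in `path` were stuck at 1500, that would be the value to use everywhere.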
- Make use of NIC teaming for your VM connections – and disregard the “One NIC per VM” ideology unless there is a very, very specific reason to do so. Teaming four NICs together for VM I/O means each VM is sharing 4Gb/s. It means, for example, your File/Print VM could get up to 4Gb/s to your LAN – but it also means that an unplugged Ethernet cable isn’t going to bring down any VMs. Typically, this means simply adding physical interfaces to your virtual switches. Be aware that this is like having (4) 1Gb/s highways, not one 4Gb/s highway – no individual car is going faster than 1Gb/s, but cars are automatically distributed to the faster highway depending on where they are coming from/going to.
It’s the cheapest, easiest way possible to increase Network I/O and redundancy. Seriously.
Recommended Reading: VMWare Teaming, HyperV Teaming, Some Basic LACP vs. Static Info
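The highway analogy for teaming can be sketched in a few lines of Python. This is only an illustration: real teaming modes in Hyper-V, VMWare and LACP use their own distribution algorithms, and the CRC32 hash, the VM and client names, and the helper functions here are stand-ins I made up to show the idea that each conversation sticks to one link.

```python
import zlib

LINK_SPEED_GBPS = 1.0  # each physical NIC in the team


def pick_link(src, dst, team_size):
    """Map a flow (source/destination pair) to one team member with a stable hash."""
    key = f"{src}->{dst}".encode()
    return zlib.crc32(key) % team_size


def team_capacity(team_size, failed_links=0):
    """Aggregate bandwidth left in the team after some links fail."""
    return max(team_size - failed_links, 0) * LINK_SPEED_GBPS


# Hypothetical traffic: one File/Print VM talking to eight clients.
flows = [("fileprint-vm", f"client-{n}") for n in range(8)]
for src, dst in flows:
    print(src, "->", dst, "uses NIC", pick_link(src, dst, team_size=4))

print("ceiling for any single flow:", LINK_SPEED_GBPS, "Gb/s")
print("team capacity:", team_capacity(4), "Gb/s")
print("with one cable unplugged:", team_capacity(4, failed_links=1), "Gb/s")
```

A given flow always lands on the same NIC, so it never goes faster than one link, while many flows together can use the whole team. Losing a cable shrinks the total but takes nothing offline.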
- Before you use SR-IOV or other networking virtualization technologies, understand what they do and know they aren’t a panacea. Many people I’ve worked with will turn these babies on and then – something doesn’t work right or they aren’t able to connect to the LAN. There are specific needs when implementing SR-IOV and unless you can state them all and understand them, don’t turn it on. It’s also worth mentioning that these technologies really only kick in around the 4-5Gb/s range, so unless you’re pushing that traffic continuously out your VMs, it’s not likely to yield a large benefit.
Don’t use SR-IOV or other advanced features unless you know exactly why you’re turning them on.
- Always, always install the latest drivers to your physical system. For example, Broadcom NICs have an issue with VM Queuing (‘VMQ’) that could cause high latencies, packet loss or speed issues between your virtual switches and your external network (see here). These new technologies mean new bugs, so keep things up to date and save some time.
Most setups in my scope aren’t using SANs, but local storage on disks in the server chassis. This is fine for many setups, especially small businesses who are consolidating a few server boxes into one setup for cost savings. I fully expect some of my recommendations to tick a few people off, but again – I base this on my experiences fixing other people’s mistakes.
- Never, never, never use dynamic (dynamically expanding) disks unless you fully understand a few things:
- They will fragment your local storage significantly. As the VHD files grow, they consume blocks all over the physical disks. Internally, what the guest sees as 100MB of contiguous file space ends up scattered all over the VHD file as well, meaning your performance will only degrade over time.
- They are slower – because of the information above. As updates are installed, files written and changes made, the local drive heads have to flicker all over the disk to read.
- They are tough to plan for. Seriously, seriously tough. I’ve watched people ignore my advice, build 10 VMs and tell me how wrong I was, only to ask for my help two years later when the physical disks slowly filled to my favorite 0 Bytes Free status. Then you’re shrinking dynamic disks onto USB drives, then copying them back while the customer is down, and telling them they need to order two more disks for something you should have planned for 2 years ago.
- The best intentions of your VM OS (trying to defragment their disks, or allocate files in long contiguous chains) will affect your physical dynamic disk in the exact opposite manner – stretching out the list of non-contiguous blocks even further.
- Snapshots make the matter worse. Snapshots – which are essentially dynamic disks themselves – require all reads to go through two dynamically mapped disks.
- Backup solutions running on the physical host(s), which need the disks to simmer down for a few seconds so they can take a snapshot, take much longer with dynamic disks.
- Converting them to Fixed disks, even if you plan to do it later, requires the VM to be off. So, a lot of people just forget to do it – don’t be those people.
- Separate your system disks onto a smaller RAID1 array, where you keep your Windows installation and any other software that might compete for Disk I/O (you’re not really going to install anything else on a HyperV machine, right?). Put your VMs on their own disks with faster spindles and dedicated I/O.
- Don’t skimp on your RAID controller. Seriously, I’ve seen people order low-end RAID cards with no cache memory and drop four disks on it in RAID5 mode – and then act surprised when I/O stinks. Cache memory means the card can wait to write to disk when it’s busy and a cache battery means it can do it safely. Don’t cheap out on these. In fact, just don’t use a RAID card that has less than 512MB of RAM and a backup battery.
- If the goal of your project is to get your servers all under one chassis, don’t be afraid to install that separate RAID array for your Exchange databases or SQL data. If your storage is busy, it’s better to have your Exchange database on its own RAID1 array than to constantly lock up your shared storage.
- Avoid using pass-through disks for production data unless you have a good reason. I’ve seen this once but heard it talked about many more times, usually with the above rule – someone virtualized a database application and thought the SQL volume should go on a pass-through disk (“it’s faster, man!”). It’s about 2% faster with the oldest, least optimized technology. Tell me it’s faster when you have to migrate it somewhere else.
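Going back to the planning problem with dynamic disks: the trap is that the host looks fine today while the worst case is far beyond what the physical array can hold. A minimal Python sketch of that arithmetic follows. The VM names, sizes and the `overcommit_report` helper are all hypothetical, picked only to mirror the ten-VM story above.

```python
def overcommit_report(physical_gb, disks):
    """Summarize space for dynamically expanding disks.

    disks is a list of (name, current_gb, max_gb) tuples.
    """
    used = sum(current for _, current, _ in disks)
    worst_case = sum(maximum for _, _, maximum in disks)
    return {
        "used_gb": used,
        "worst_case_gb": worst_case,
        "free_now_gb": physical_gb - used,
        "shortfall_gb": max(worst_case - physical_gb, 0),
    }


# Hypothetical host: 1200 GB array, ten VMs at 40 GB today, each allowed to grow to 200 GB.
vms = [(f"vm{n}", 40, 200) for n in range(1, 11)]
report = overcommit_report(physical_gb=1200, disks=vms)

for key, value in report.items():
    print(key, "=", value)
```

If `shortfall_gb` is anything above zero, the host can eventually hit 0 Bytes Free without anyone creating a single new VM. With fixed disks the same sum is forced on you the day you build the VMs, which is the whole point.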
This section is much harder to understand – it reaches deeper into the hardware than you might be used to. It’s worth noting that most people assign what they think is the correct amount of RAM and CPU cores for a workload and then dust their hands clean. That’s not a smart idea. Check out this rough diagram of an HP ProLiant DL380 G8 I just installed for a customer:
Let’s imagine you undersized the host and later installed a VM that required more resources. For example:
- Your VM requires more than 16GB of RAM;
- Your VM requires more than 4 CPU cores;
- Your VM requires 8GB of RAM, but you have two other VMs running using 10GB each.
In each of these situations, you’d be doing what’s called NUMA Spanning. This means that either the memory allocated to your VM cannot be taken in whole from the free RAM connected to one CPU, or the VM is assigned more cores than one CPU can offer. Either way, the VM ends up running across two different CPUs.
This means that the CPU running one or more threads comprising your VM has to stop and ask the other CPU to read/write the remote memory connected to it – a huge performance penalty.
Often, when a VM host is sold, the bare minimum number of CPUs, cores and/or RAM is installed and your VMs cannot fit neatly into one NUMA node or the other. Sometimes it won’t affect the VM enough to matter, while other times it cripples performance. You can turn NUMA spanning off, but beware: if you do, your VM won’t start unless it can fit neatly in one NUMA node!
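The three spanning examples above can be checked with a small Python sketch. It assumes a host with two NUMA nodes of 4 cores and 16GB each, which is what the examples imply, and it assumes the two 10GB VMs sit on separate nodes. The `NumaNode` class and `fits_in_one_node` function are my own illustration, not anything the hypervisor exposes, and real schedulers are more subtle than this.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    cores: int
    ram_gb: int
    used_ram_gb: int = 0

    @property
    def free_ram_gb(self):
        return self.ram_gb - self.used_ram_gb


def fits_in_one_node(nodes, vm_cores, vm_ram_gb):
    """True if any single node can hold all of the VM's cores and RAM."""
    return any(
        vm_cores <= node.cores and vm_ram_gb <= node.free_ram_gb
        for node in nodes
    )


def two_node_host():
    """Assumed layout: two CPUs, each with 4 cores and 16GB of local RAM."""
    return [NumaNode(cores=4, ram_gb=16), NumaNode(cores=4, ram_gb=16)]


# Case 1: the VM needs more than 16GB of RAM.
print("20GB VM fits:", fits_in_one_node(two_node_host(), vm_cores=2, vm_ram_gb=20))

# Case 2: the VM needs more than 4 cores.
print("6-core VM fits:", fits_in_one_node(two_node_host(), vm_cores=6, vm_ram_gb=8))

# Case 3: the VM needs 8GB, but two other VMs already use 10GB each.
busy = two_node_host()
busy[0].used_ram_gb = 10
busy[1].used_ram_gb = 10
print("8GB VM fits on busy host:", fits_in_one_node(busy, vm_cores=2, vm_ram_gb=8))

# The same 8GB VM on an empty host fits without spanning.
print("8GB VM fits on empty host:", fits_in_one_node(two_node_host(), vm_cores=2, vm_ram_gb=8))
```

Any VM for which this returns false would either span nodes or, with spanning turned off, refuse to start. Running this kind of check against your planned VMs before ordering the host is much cheaper than finding out afterwards.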
Check out the Wikipedia entry for the NUMA topology here, but know it’s not incredibly specific with regard to the Windows implementation. How it affects HyperV can be found here, and VMWare’s more technical and official dive is found here.