UCS Manager 2.0.2r KVM bug

Well, we’ve been battling a KVM bug in our UCS installation that’s been driving me (and apparently Cisco L3 support and development) nuts. But let’s back up a bit. If you’ve worked with UCS before, you know that once you open the KVM console you’ll see the KVM itself with a few shortcut commands (Shutdown, Reset) and another tab that allows you to mount virtual media.

Once you open it up, it should look like this:

UCS Manager: KVM console working fine

Now, when we re-installed some of our servers (mostly the XenServers), all of a sudden the KVM virtual media stopped working. The UCS KVM would refuse to let us switch to the virtual media tab, saying that either the login had timed out or we had the wrong user and/or password, even when we tried with the most powerful user the UCS has, the local admin account.

UCS Manager – KVM virtual media tab rejecting authentication

 

So I opened a TAC case, and Cisco got to work on it immediately. After poking around in the depths of the fabric interconnect together with a Cisco L3 engineer, using a dplug extension from Cisco, and after about two months of development, I just got a call back from the Cisco support guy. Apparently development figured out why we’d get the above error message.

Once you put a hash sign (#) in the Service Profile’s User Label, you get the error message.

UCS Manager - User Label

Once I removed the hash sign, the KVM started working like it’s supposed to. So if anyone ever comes across this, that’s your solution. Apparently Cisco is going to fix this in an upcoming release, but until then, just remove the hash sign and everything is fine.
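If you have a lot of service profiles, you could check their user labels for the offending character in bulk. Here is a minimal sketch of that check in plain Python. It doesn't talk to UCS Manager at all; the profile names and labels are made up for illustration, and you'd have to feed in your own export.

```python
def labels_with_hash(user_labels):
    """Return the names of service profiles whose user label contains '#'.

    user_labels maps a service profile name to its user label string.
    """
    return sorted(name for name, label in user_labels.items() if "#" in label)


# Hypothetical service profile names and user labels, for illustration only.
example_labels = {
    "xen-host-01": "XenServer #1",
    "xen-host-02": "XenServer 2",
    "esx-host-01": "",
}

# Lists the profiles whose label would need the '#' removed.
print(labels_with_hash(example_labels))
```

Anything this returns is a candidate for the login error described above.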

VMware ESXi – Free memory limits corrected

Well, a coworker of mine asked me about the memory limit of the free ESXi edition. Since I didn’t know (yeah, I don’t know everything) I went to my trusted friend – Google – and searched for it. There seems to be a lot of confusion about this, so I thought I’d clarify it.

I ended up assigning a license to one of my hosts in vCenter to find out.

VMware ESXi Free Edition – Memory Limit

 

Yeah well, the host has a bit more memory than the license allows, which is 32GB vRAM per socket. The host has two sockets, so it is allowed to have 64GB of RAM.
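The arithmetic is simple enough, but for completeness here is a tiny sketch of it. The 32GB-per-socket figure is the one quoted in this post; the function and constant names are just made up for the example.

```python
# Per-socket limit as quoted in this post (not taken from any VMware API).
VRAM_PER_SOCKET_GB = 32


def allowed_memory_gb(sockets, per_socket_gb=VRAM_PER_SOCKET_GB):
    """Return the total memory allowed for a host with the given socket count."""
    return sockets * per_socket_gb


print(allowed_memory_gb(2))  # 64
```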

Dealing with SnapVault replication issues

Well, for the past two months I had a case open with NetApp to figure out a SnapVault replication issue we were seeing. The initial transfer of the SnapVault relationship would complete without a hiccup, and manual snapshot transfers also worked – just the scheduled, auto-created snapshots wouldn’t replicate.

At first I (and NetApp support) thought this was an issue with SnapVault itself. However, after being away for the last four weeks, I looked at the issue with fresh eyes. After a short peek into the logs, I found the same thing I had found back when I first looked into this.

SnapVault would create the daily snapshot on the SnapVault Primary and start the replication. However something (or someone, it wasn’t clear at this point) then created a FlexClone of a volume … And, just like back when we first encountered this, I was kinda puzzled.

But then I decided (please don’t ask me what made me look there) to look at the logs of the NetApp Filer on our logserver. As it turns out, back when I enabled syslogging to an external logserver I seem to have enabled debug logging … and it was great to have that! Below you’ll find the log I found – and as you can see, there’s at least a clue as to where that ghost snapshot is coming from.

Now, knowing which corner this issue originated from, it dawned on me: we have had a similar issue before. A quick peek into TSM Manager and I knew I was on the right track. The daily system backup starts around 21:15. Now, our TSM backup includes the System State backup (which in turn utilizes VSS – which triggers the NetApp snapshot!).
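If you want to sanity-check your own schedules for this kind of collision, a simple overlap test is enough. The sketch below is just that, a sketch: only the 21:15 backup start comes from our setup, while the SnapVault start time and both durations are hypothetical values I picked for the example. It also doesn't handle jobs that start on different calendar days.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, duration_a, start_b, duration_b):
    """Return True if the two time windows share any moment in time."""
    end_a = start_a + duration_a
    end_b = start_b + duration_b
    return start_a < end_b and start_b < end_a


# Arbitrary reference day; only the time of day matters here.
day = datetime(2000, 1, 1)

backup_start = day.replace(hour=21, minute=15)    # from the post
backup_duration = timedelta(minutes=60)           # hypothetical
snapvault_start = day.replace(hour=21, minute=0)  # hypothetical
snapvault_duration = timedelta(minutes=45)        # hypothetical

print(windows_overlap(backup_start, backup_duration,
                      snapvault_start, snapvault_duration))  # True
```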

After excluding the System State from the daily backup, the SnapVault stuff worked without a hiccup. I ended up removing SnapDrive from the server in question, since we don’t really need it there. Snapshots of the boot LUN are gonna be inconsistent anyhow (doesn’t matter if I do ’em from SnapDrive or the NetApp CLI).

That restored the default VSS handler, which enables TSM to back up the System State again.