Work – Page 3

Implementing SnapVault backups – the hard way

December 29, 2012June 21, 2013 Christian 1 Comment

Well, I recently had the pleasant task of implementing SnapVault backups, that are being shipped to an offsite location with SnapMirror.

That in itself isn’t the bad thing, however we decided against Protection Manager (since it was a charged product back when we decided on this). So I basically had the three tasks:

Actually implement the SnapVault stuff (and learn my way around it and also document it)
Write a bunch of scripts, that help us in creating scheduled backups of our databases
Create a monitoring script, that’ll fit into our Nagios environment already in place

Well, two months later (sadly it still has some kinks – I can’t figure out this one bug though for the life of it) and a few hundred hours of working on/with it and out came four things:

Bash-scripts to create the SnapVault/SnapMirror relations
Powershell scripts to trigger the SnapVault updates
a Nagios plugin, based on NetApp’s SDK for Data ONTAP (even if the API is crap from time to time – it’s still better than using SNMP)

I’ll post those things one after another, once I wrote up all the articles.

MDS9000: Setting summer time for CET

October 17, 2012June 21, 2013 Christian Leave a comment

After rebuilding two MDS9148, I wanted them to correctly switch the summer/winter time for my time zone. Currently I’m in CET (or CEST during the summer), so I googled for that. The search came up with Cisco-FAQ, however that needed a slight adjustment.

Apparently the NXOS doesn’t support the feature “recurring” in the clock configuration. So I had to slightly adapt the configuration line:

clock timezone MET 2 0
clock summer-time MET 5 Sun Mar 02:00 5 Sun Oct 03:00 60

1 2	clock timezone MET 2 0 clock summer-time MET 5 Sun Mar 02:00 5 Sun Oct 03:00 60

vCenter: Removing VSC custom attributes

October 17, 2012June 21, 2013 Christian Leave a comment

Well, yesterday I got pissed of those Virtual Storage Console custom attributes.

Currently we don’t use the Provisioning & Cloning feature of the VSC, thus we don’t need the custom attributes. After poking around, I decided to write a short PowerCLI script to do the task.

It’s really rather simple, so here goes:

param( [string] $vCenter )

# Add the VI-Snapin if it isn't loaded already
if ( (Get-PSSnapin -Name "VMware.VimAutomation.Core" -ErrorAction SilentlyContinue) -eq $null )
{
	Add-PSSnapin -Name "VMware.VimAutomation.Core"
}

If ( !($vCenter) )
{
	Write-Host
	Write-Host "vcenter-remove-vsc-attributes: <vcenter-server>"
	Write-Host
	Write-Host "   <vcenter-server>  - DNS name of your vCenter server."
	Write-Host
	exit 1
}

Connect-VIServer -Server $vCenter

Remove-CustomAttribute -CustomAttribute "PnC.CustSpec", "PnC.Deployed", "PnC.GroupID", "PnC.Source" -Confirm:$false

Disconnect-VIServer -server $vCenter -Confirm:$false

param( [string] $vCenter )

# Add the VI-Snapin if it isn't loaded already

if ( (Get-PSSnapin -Name "VMware.VimAutomation.Core" -ErrorAction SilentlyContinue) -eq $null )

{

Add-PSSnapin -Name "VMware.VimAutomation.Core"

}

If ( !($vCenter) )

{

Write-Host

Write-Host "vcenter-remove-vsc-attributes: <vcenter-server>"

Write-Host

Write-Host " <vcenter-server> - DNS name of your vCenter server."

Write-Host

exit 1

}

Connect-VIServer -Server $vCenter

Remove-CustomAttribute -CustomAttribute "PnC.CustSpec", "PnC.Deployed", "PnC.GroupID", "PnC.Source" -Confirm:$false

Disconnect-VIServer -server $vCenter -Confirm:$false

NetApp: Changing DS4243 shelf ID

October 11, 2012June 21, 2013 Christian Leave a comment

I’m working on a project right now, providing a SnapVault target for our “big” NetApp. So we moved our 3240 to it’s target location, I spent most of my time yesterday doing the cabling (SAS and ACP, as well as power).

I’m still not finished, I still need to “beautify” the power cables, need to fix the network cables (currently I don’t have none ….) and some other minor stuff. But lemme skip back a bit.

The 3240 initially had only two shelfs, one with the ID 10 and the other with the ID 50. When reimplementing the thing, I wanted to do two things:

Make the shelfs “proper” (i.e. adjacent shelf IDs)
Make sure it’s done right

So, I ended up googling the topic (or rather NetApping, since the NOW page isn’t being indexed), and found a NetApp Community post. As I already did a complete wipe/cleanconfig of both filers, I was left with this:

Halt both controllers (don’t power them off!)
Change the shelf ID using the front panel of the DS4243
Power-cycle each shelf
Wait at least 30 seconds
Boot both controllers

And that actually did it, my HA controllers are up and running, with the new shelf ID’s.

NetApp: Migrating FCP luns with ndmpcopy to another controller

October 6, 2012June 21, 2013 Christian Leave a comment

Well, I’m in a situation, where I need to move all volumes from one controller to two others. So I looked at the ways I had available:

Freshly implementing everything: No option at all!
vol copy: Is rather slow, thus no option
ndmpcopy: That’s exactly what I needed!

ndmpcopy is a great way to copy over a whole volume including it’s files (thus FCP luns) to another volume/controller.

First I threw in a crossover cable, since at around 6 PM our backup system starts it’s daily run, and everything else running via IP in between 6 PM and 6 AM is seriously impaired by this. Configured the additional ports on all three controllers (picked a private, not-routed range just in case) and then kicked of a simple bash script that ran the following:

ssh fas03 ndmpcopy -sa ndmp:ndmppass -da ndmp:ndmppass 
  192.168.2.30:/vol/vol_xen_boot 192.168.2.40:/vol/vol_xen_boot

1 2	ssh fas03 ndmpcopy -sa ndmp:ndmppass -da ndmp:ndmppass 192.168.2.30:/vol/vol_xen_boot 192.168.2.40:/vol/vol_xen_boot

Now, that in itself worked like a charm as you can see from the output below.

Ndmpcopy: Starting copy [ 32 ] ...
Ndmpcopy: 192.168.2.30: Notify: Connection established
Ndmpcopy: 192.168.2.40: Notify: Connection established
Ndmpcopy: 192.168.2.30: Connect: Authentication successful
Ndmpcopy: 192.168.2.40: Connect: Authentication successful
Ndmpcopy: 192.168.2.30: Log: DUMP: creating "/vol/vol_xen_boot/../snapshot_for_backup.17" snapshot.
Ndmpcopy: 192.168.2.30: Log: DUMP: Using Full Volume Dump
Ndmpcopy: 192.168.2.30: Log: DUMP: Date of this level 0 dump: Fri Oct 5 23:41:47 2012.
Ndmpcopy: 192.168.2.30: Log: DUMP: Date of last level 0 dump: the epoch.
Ndmpcopy: 192.168.2.30: Log: DUMP: Dumping /vol/vol_xen_boot to NDMP connection
Ndmpcopy: 192.168.2.30: Log: DUMP: mapping (Pass I)[regular files]
Ndmpcopy: 192.168.2.30: Log: DUMP: mapping (Pass II)[directories]
Ndmpcopy: 192.168.2.30: Log: DUMP: estimated 14916204 KB.
Ndmpcopy: 192.168.2.30: Log: DUMP: dumping (Pass III) [directories]
Ndmpcopy: 192.168.2.30: Log: DUMP: dumping (Pass IV) [regular files]
Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:53 2012: Begin level 0 restore
Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:53 2012: Reading directories from the backup
Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:55 2012: Creating files and directories.
Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:55 2012: Writing data to files.
Ndmpcopy: 192.168.2.30: Log: DUMP: dumping (Pass V) [ACLs]
Ndmpcopy: 192.168.2.30: Log: DUMP: 14857192 KB
Ndmpcopy: 192.168.2.30: Log: DUMP: DUMP IS DONE
Ndmpcopy: 192.168.2.30: Log: DUMP: Deleting "/vol/vol_xen_boot/../snapshot_for_backup.17" snapshot.
Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:44:52 2012: Restoring NT ACLs.
Ndmpcopy: 192.168.2.40: Log: RESTORE: RESTORE IS DONE
Ndmpcopy: 192.168.2.40: Log: RESTORE: The destination path is /vol/vol_xen_boot/
Ndmpcopy: 192.168.2.30: Notify: dump successful
Ndmpcopy: 192.168.2.40: Notify: restore successful
Ndmpcopy: Transfer successful [ 3 minutes 10 seconds ]
Ndmpcopy: Done

Ndmpcopy: Starting copy [ 32 ] ...

Ndmpcopy: 192.168.2.30: Notify: Connection established

Ndmpcopy: 192.168.2.40: Notify: Connection established

Ndmpcopy: 192.168.2.30: Connect: Authentication successful

Ndmpcopy: 192.168.2.40: Connect: Authentication successful

Ndmpcopy: 192.168.2.30: Log: DUMP: creating "/vol/vol_xen_boot/../snapshot_for_backup.17" snapshot.

Ndmpcopy: 192.168.2.30: Log: DUMP: Using Full Volume Dump

Ndmpcopy: 192.168.2.30: Log: DUMP: Date of this level 0 dump: Fri Oct 5 23:41:47 2012.

Ndmpcopy: 192.168.2.30: Log: DUMP: Date of last level 0 dump: the epoch.

Ndmpcopy: 192.168.2.30: Log: DUMP: Dumping /vol/vol_xen_boot to NDMP connection

Ndmpcopy: 192.168.2.30: Log: DUMP: mapping (Pass I)[regular files]

Ndmpcopy: 192.168.2.30: Log: DUMP: mapping (Pass II)[directories]

Ndmpcopy: 192.168.2.30: Log: DUMP: estimated 14916204 KB.

Ndmpcopy: 192.168.2.30: Log: DUMP: dumping (Pass III) [directories]

Ndmpcopy: 192.168.2.30: Log: DUMP: dumping (Pass IV) [regular files]

Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:53 2012: Begin level 0 restore

Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:53 2012: Reading directories from the backup

Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:55 2012: Creating files and directories.

Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:41:55 2012: Writing data to files.

Ndmpcopy: 192.168.2.30: Log: DUMP: dumping (Pass V) [ACLs]

Ndmpcopy: 192.168.2.30: Log: DUMP: 14857192 KB

Ndmpcopy: 192.168.2.30: Log: DUMP: DUMP IS DONE

Ndmpcopy: 192.168.2.30: Log: DUMP: Deleting "/vol/vol_xen_boot/../snapshot_for_backup.17" snapshot.

Ndmpcopy: 192.168.2.40: Log: RESTORE: Fri Oct 5 23:44:52 2012: Restoring NT ACLs.

Ndmpcopy: 192.168.2.40: Log: RESTORE: RESTORE IS DONE

Ndmpcopy: 192.168.2.40: Log: RESTORE: The destination path is /vol/vol_xen_boot/

Ndmpcopy: 192.168.2.30: Notify: dump successful

Ndmpcopy: 192.168.2.40: Notify: restore successful

Ndmpcopy: Transfer successful [ 3 minutes 10 seconds ]

Ndmpcopy: Done

However, once I switched the UCS into the correct VSAN and modified the Boot Policy, the XenServer would boot, but didn’t find *any* Storage Repository. So I went ahead and looked at the CLI of the XenServer, looked at /var/log/messages and saw that apparently the PBD’s weren’t there yet (for whatever reason).

Poked around in /dev/disk/by-id, looked at the output of xe pbd-list and found that the SCSI-IDs used in the PBD’s we’re actually not present yet. So I was like *wtf* for a moment, however then took a quick peek at the output of lun show -v /vol/vol_xen_boot on both NetApp controllers and found the cause for my troubles:

FAS03> lun show -v /vol/vol_xen_boot/xen01_lun00
/vol/vol_xen_boot/xen01_lun00 15g (16106127360) (r/w, online, mapped)
Comment: "xen01_lun00"
Serial#: dnNtFoiCJH1c
Share: none
Space Reservation: disabled
Multiprotocol Type: linux
Maps: XEN01=0
Occupied Size: 3.3g (3571474432)
Creation Time: Mon Feb 13 10:26:16 CET 2012
Cluster Shared Volume Information: 0x0

FAS03> lun show -v /vol/vol_xen_boot/xen01_lun00

/vol/vol_xen_boot/xen01_lun00 15g (16106127360) (r/w, online, mapped)

Comment: "xen01_lun00"

Serial#: dnNtFoiCJH1c

Share: none

Space Reservation: disabled

Multiprotocol Type: linux

Maps: XEN01=0

Occupied Size: 3.3g (3571474432)

Creation Time: Mon Feb 13 10:26:16 CET 2012

Cluster Shared Volume Information: 0x0

FAS01> lun show -v /vol/vol_xen_boot/xen01_lun00
/vol/vol_xen_boot/xen01_lun00 15g (16106127360) (r/w, online, mapped)
Comment: "xen01_lun00"
Serial#: dfb8c4mpKwd9
Share: none
Space Reservation: disabled
Multiprotocol Type: linux
Maps: XEN01=0
Occupied Size: 3.3g (3571474432)
Creation Time: Mon Feb 13 10:26:16 CET 2012
Cluster Shared Volume Information: 0x0

FAS01> lun show -v /vol/vol_xen_boot/xen01_lun00

/vol/vol_xen_boot/xen01_lun00 15g (16106127360) (r/w, online, mapped)

Comment: "xen01_lun00"

Serial#: dfb8c4mpKwd9

Share: none

Space Reservation: disabled

Multiprotocol Type: linux

Maps: XEN01=0

Occupied Size: 3.3g (3571474432)

Creation Time: Mon Feb 13 10:26:16 CET 2012

Cluster Shared Volume Information: 0x0

As you can see, the lun itself is available and mapped with the correct LUN ID. However, if you look closely at the serial of both LUNs you might notice what I noticed. So it turns out, ndmpcopy does the copy-process, however you need to adjust the LUN serial on the destination controller to match the one from the source controller, otherwise it’ll throw any system out of whack.

After adjusting that, everything came up just fine. And I’m finished with my first XenServer environment, only the big one is still copying.

UCS 5108 power redundancy lost

September 20, 2012June 21, 2013 Christian 2 Comments

Well, another day – another UCS error. Out of the blue, one of our chassis started displaying that one PSU had failed, however the UCS was showing no PSU had failed *shrug*

Well, as it turns out – this is yet another known bug in 2.0.2(r). You’ll either have to unplug and plug all the power cables (that’s four) in a maintainance window – or simply change the Equipment Power Policy (found in the Root of your UCS, tab Policy)

from “N+1” to “Non Redundant”; wait a minute or two till the error is “fixed” and then change it back to “N+1”. Problem solved for now … 😀

NetApp LUN creation/vol sizing

September 20, 2012June 21, 2013 Christian Leave a comment

Well, as you might know I’ve been tinkering with a NetApp FAS at work. The last few months, I’ve been trying to figure out a few things, which I actually did.

One “error” I ran into with creating the lun’s and volumes by hand was that the volumes were running out of space. Even if the volume was a bit larger than the LUN. After that happened a few times, I decided to see how to fix that. As it turns out, the GUI “fixes” that already in a way I wouldn’t have expected.

The GUI wizard for creating a new LUN simply enlarges the hosting volume by three percent (that’s 3%!). So if you create a 300GiB LUN, the GUI will create a volume with 309GiB (well about that – the GUI calculates in KiB thus you’ll see something like 324009984k in the output of vol size).

I also wrote a short script, which will sum up the space of all LUNs contained inside a volume and then based on your snap reserve and the actual LUN space give you the current vol size and the vol size it should be. I’ll post the script later on.

UCS 5108: VIF down

September 17, 2012June 21, 2013 Christian Leave a comment

Well, I have yet another weird UCS problem. I have a single blade, that has trouble with it’s primary fabric attachment.

The problem get’s even more weird, if you look at the details.

After looking at the IO modules, the error doesn’t become any clearer:

So far, I have tried nearly everything. I’ve tried resetting the active and passive Connectivity of the vNIC, I tried resetting the DCE adapter for the vNIC, but nothing. I even tried resetting the vHBA that’s associated with this fabric, but that didn’t result to anything. Not even the usual flogi (fibre channel login) errors, that you get when either booting/resetting the blade.

Well, I opened a TAC case and the Cisco engineer looked over the CLI of my fabric interconnects, hacked away at the thousand logs the UCS keeps, and asked me if I could switch the blade to another slot. However, since I don’t have any slots to spare – the five chassis are full – he said he’d start the RMA process for the MK81R mezzanine card.

The new MK81R arrived a few days later, however the issue still persisted. So the TAC engineer suggested I’d pull the IO module and plug in a “new” one (one that I knew was working). So I picked a Friday afternoon for the maintainance window (since I didn’t know if the blades would survive a IO module failover) and pulled one from my newly arrived chassis and reseated the one I knew to be faulty.

Guess what: after swapping the IO modules, the error is completely gone …. *shrug* I don’t have a clue how this error happened or why it was fixed by pulling and plugging the IO module. Guess another error for the odd category.

UCS 5108: Power problem

September 4, 2012June 21, 2013 Christian Leave a comment

Well, I recently had yet another UCS display/I2C communication problem. Somehow one of my chassis’ started to think, that the power redundancy was lost.

After looking at it a bit deeper, it seems only the GUI or the chassis did notice this power glitch:

As you can see, all PSU’s still have power. Now, since I had a big maintainance window the last weekend anyhow (and I spent ~14 hours at work), I decided to restart the IO modules in that chassis. And guess what: The error is gone! Another weird I2C communication issue with the firmware release 2.0.2 …

SMT mirroring troubles

August 30, 2012June 21, 2013 Christian Leave a comment

Well, I’ve been having issues with our SMT. Basically after the release of SLES11 SP2 in February, I’ve been waiting for the repositories to turn “mirrorable”, which they haven’t yet. So I wrote a mail to the EMEA customer support.

And yesterday I got a reply, stating that my mirror credentials had been regenerated. And guess what ? I can mirror stuff again. So if you ever have any issues like the ones I described (or any of the people in the forum threads experienced), write a mail to your Novell customer support. 😛