Well, for the past two months I had a case open with NetApp to figure out this SnapVault replication issue we were seeing. The initial transfer of the SnapVault relation would complete with a hick up, manual snapshot transfers also work – just the scheduled, auto-created Snapshots won’t replicate.
At first I (and the NetApp support) thought this was an issue with SnapVault itself, however after being away for the last four weeks I looked at the issue with fresh eyes. After a short peek into the logs, I found what I had found back when I first looked into this.
| 1 2 3 4 5 6 | Mon Apr  1 21:16:46 CEST [fas01: lun.offline:warning]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.aux has been taken offline Mon Apr  1 21:16:46 CEST [fas01: lun.offline:warning]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.rws has been taken offline Mon Apr  1 23:06:35 CEST [fas01: rpl.src.lun.invalid_clone:error]: Replication source transfer failed due to an invalid LUN clone with fileid 32466 in snapid 67 in volume flex_windows_boot. Mon Apr  1 23:06:35 CEST [fas01: replication.src.err:error]: SnapVault: source transfer from /vol/flex_windows_boot/sv/ to fas03:/vol/flex_windows_boot/sv : replication source found an invalid lun clone. Mon Apr  1 23:08:35 CEST [fas01: rpl.src.lun.invalid_clone:error]: Replication source transfer failed due to an invalid LUN clone with fileid 32466 in snapid 67 in volume flex_windows_boot. Mon Apr  1 23:08:36 CEST [fas01: replication.src.err:error]: SnapVault: source transfer from /vol/flex_windows_boot/sv/ to fas03:/vol/flex_windows_boot/sv : replication source found an invalid lun clone. | 
SnapVault would create the daily snapshot on the SnapVault Primary and start the replication. However something (or someone, wasn’t clear at this point) then created a FlexClone of a volume … And as, back when we first encountered this, I was kinda puzzled.
But then I decided (please don’t ask me what made me look there) to look at the logs of the NetApp Filer on our logserver. As it turns out, back when I enabled syslogging to an external logserver I seem to have enabled debug logging … and it was great to have that! Below you’ll find the log I found – and as you can see there’s at least a clue as to from where that ghost snapshot is coming from.
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | Apr  1 21:16:46 fas01 [fas01: lun.offline:warning]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.aux has been taken offline Apr  1 21:16:46 fas01 [fas01: lun.destroy:info]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.aux destroyed Apr  1 21:16:46 fas01 [fas01: lun.offline:warning]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.rws has been taken offline Apr  1 21:16:48 fas01 [fas01: lun.map:info]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.rws was mapped to initiator group viaRPC.20:00:00:25:b5:02:0a:4c.b230-5=3 Apr  1 21:16:48 fas01 [fas01: lun.map:info]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.rws was mapped to initiator group viaRPC.20:00:00:25:b5:02:0b:4c.b230-5=3 Apr  1 22:05:31 fas01 [fas01: lun.map.unmap:info]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.rws unmapped from initiator group viaRPC.20:00:00:25:b5:02:0a:4c.b230-5 Apr  1 22:05:31 fas01 [fas01: lun.map.unmap:info]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.rws unmapped from initiator group viaRPC.20:00:00:25:b5:02:0b:4c.b230-5 Apr  1 22:05:35 fas01 [fas01: lun.destroy:info]: LUN /vol/flex_windows_boot/sv/{6f3899ab-6c8a-402e-bbb7-d7b7298d254f}.rws destroyed Apr  1 22:05:37 fas01 [fas01: wafl.snap.delete:info]: Snapshot copy {25b8da84-9351-4f20-987c-e7b02d76f15e} on volume flex_windows_boot NetApp was deleted by the Data ONTAP function zapi_snapshot_ delete. The unique ID for this Snapshot copy is (64, 3471969). Apr  1 23:06:35 fas01 [fas01: rpl.src.lun.invalid_clone:error]: Replication source transfer failed due to an invalid LUN clone with fileid 32466 in snapid 67 in volume flex_windows_boot. Apr  1 23:06:35 fas01 [fas01: snapdiff.abnormal.abort:debug]: Encountered unexpected error while computing differences between Snapshot copies. Apr  1 23:06:35 fas01 [fas01: replication.src.err:error]: SnapVault: source transfer from /vol/flex_windows_boot/sv/ to fas03:/vol/flex_windows_boot/sv : replication source found an invalid lun clone. Apr  1 23:08:35 fas01 [fas01: rpl.src.lun.invalid_clone:error]: Replication source transfer failed due to an invalid LUN clone with fileid 32466 in snapid 67 in volume flex_windows_boot. Apr  1 23:08:35 fas01 [fas01: snapdiff.abnormal.abort:debug]: Encountered unexpected error while computing differences between Snapshot copies. Apr  1 23:08:36 fas01 [fas01: replication.src.err:error]: SnapVault: source transfer from /vol/flex_windows_boot/sv/ to fas03:/vol/flex_windows_boot/sv : replication source found an invalid lun clone. | 
Now, with knowing from which corner this issue originated it dawned on me, we have had a similar issue before. A quick peek into TSM Manager and I knew I was on the right track. The daily system backup starts around 21:15. Now our TSM backup includes the System State backup (which in turn utilizes VSS – which triggers the NetApp Snapshot!).
After excluding the System State from the Daily Backup the SnapVault stuff worked without a hickup. I ended up removing SnapDrive from the Server in question, since we don’t really need it there. Snapshots created from SnapDrive of the boot lun are gonna be inconsistent anyhow (doesn’t matter if I do ’em from SnapDrive or the NetApp CLI).
That restored the default VSS handler, which enables TSM to backup the System State again.