Bug #25270

VAAI Issues

Added by Tom Schneider about 3 years ago. Updated about 3 years ago.

Status:
Closed: Behaves correctly
Priority:
Important
Assignee:
Alexander Motin
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

I have a simple setup: a FreeNAS Mini XL with one ESX 6.0.0 (patch level 505093) host connected via iSCSI. I turned on VAAI after upgrading to the latest FreeNAS 10 version. I am now seeing errors on the FreeNAS side (debug attached) as well as multiple latency and disconnect issues on the LUNs in the ESX event log. According to the ESX log, the outages last under 1 or 2 seconds. The I/O latency issues last longer, about the duration of the backup jobs.

I have seen the ESX errors ever since using the FreeNAS Mini, but the reason for this bug report is the new errors on the FreeNAS console: ctl_datamove: tag 0x33dbbc on (0:5:2) aborted
These errors occurred with different tag and LUN references.

Please let me know if there is anything else you need.

History

#1 Updated by Tom Schneider about 3 years ago

  • File debug-freenas-20170723103012.txz added

#2 Updated by Dru Lavigne about 3 years ago

  • Status changed from Unscreened to 15
  • Assignee changed from Release Council to Alexander Motin
  • Seen in changed from Unspecified to 9.10.2-U5

Tom: are you able to update this system to 11.0-U2? If so, does the issue persist?

#3 Updated by Tom Schneider about 3 years ago

Dru Lavigne wrote:

Tom: are you able to update this system to 11.0-U2? If so, does the issue persist?

I have not tried yet. I can update if that is the right course. With the Mini XL, are there any issues with upgrading at this point?

If it helps to diagnose, I can update and then re-run my backups to see.

#4 Updated by Dru Lavigne about 3 years ago

Upgrading would make sure you're on the latest drivers, which might resolve the issue. If it doesn't, it will make it easier to apply a fix, as development is now on the 11.x series.

#5 Updated by Tom Schneider about 3 years ago

Dru Lavigne wrote:

Upgrading would make sure you're on the latest drivers, which might resolve the issue. If it doesn't, it will make it easier to apply a fix, as development is now on the 11.x series.

Sounds good - I will update then. I will retest after the update and let you know.

#6 Updated by Tom Schneider about 3 years ago

OK - I have upgraded to FreeNAS-11.0-U2 (e417d8aa5).

The same errors are occurring on the console:
Jul 23 13:25:54 freenas ctl_datamove: tag 0x4536a on (1:5:2) aborted
Jul 23 13:25:54 freenas ctl_datamove: tag 0x4536b on (1:5:2) aborted
Jul 23 13:25:54 freenas ctl_datamove: tag 0x44a8c on (0:5:2) aborted
Jul 23 13:25:54 freenas ctl_datamove: tag 0x45368 on (1:5:2) aborted
Jul 23 13:25:54 freenas ctl_datamove: tag 0x6e8c on (0:6:3) aborted
Jul 23 13:25:54 freenas ctl_datamove: tag 0x6e8b on (0:6:3) aborted
Jul 23 14:23:53 freenas ctl_datamove: tag 0x11837 on (1:6:3) aborted
Jul 23 14:23:53 freenas ctl_datamove: tag 0x117e9 on (0:6:3) aborted

#7 Updated by Alexander Motin about 3 years ago

  • Priority changed from No priority to Important
  • Target version set to 11.1
  • Seen in changed from 9.10.2-U5 to 11.0-U2

Aborted commands are a consequence, not the source, of the problem. Could you attach a new debug archive exposing the problem, and also collect additional info about NAS operation while the problem occurs (`ctladm dumpooa`, `gstat -I 1s -p`, `top -SHIz`, ...)?

#8 Updated by Alexander Motin about 3 years ago

Not related, but note that your boot pool is quite full. You may need to delete some old boot environments.

#9 Updated by Tom Schneider about 3 years ago

  • File debug-freenas-20170723151151.tgz added
  • File 20170723 command line output.docx added

OK - I will remove some boot environments.

I hope these attachments make sense...

#10 Updated by Alexander Motin about 3 years ago

They make some sense, but unfortunately the `ctladm dumpooa` output is useless: it shows only commands running for up to 32 ms, which is not enough to trigger an abort request on timeout. Could you please try to run it closer to the time you experience the problem, to try to catch latencies measured in seconds?

Also, the provided `top` output shows python eating quite a lot of CPU time. Could you track whether that is persistent and what that process is exactly, by running `ps lw {pid}` on it?

Also, if the problem is reproducible, please describe what you are doing.
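As a side note for anyone triaging a similar report: the `ctladm dumpooa` snapshots can be filtered mechanically for long-running commands instead of eyeballing them. A minimal sketch (a hypothetical helper, assuming the line format shown in the dumps pasted later in this ticket):

```python
import re

# Filter `ctladm dumpooa` output for long-running commands.
# Line format assumed from the dumps in this ticket, e.g.:
#   LUN 3 tag 0x765ea RTR: EXTENDED COPY. CDB: (597 ms)
LINE_RE = re.compile(
    r"LUN (?P<lun>\d+) tag (?P<tag>0x[0-9a-f]+) (?P<state>\w+): "
    r"(?P<cmd>.+?)\. CDB:.*\((?P<ms>\d+) ms\)"
)

def slow_commands(dump_text, threshold_ms=500):
    """Return (lun, tag, state, command, age_ms) for entries at/above the threshold."""
    hits = []
    for line in dump_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and int(m.group("ms")) >= threshold_ms:
            hits.append((int(m.group("lun")), m.group("tag"), m.group("state"),
                         m.group("cmd"), int(m.group("ms"))))
    return hits
```

Run over a periodic `ctladm dumpooa` capture, this picks out exactly the seconds-scale stragglers Motin is asking for.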

#11 Updated by Tom Schneider about 3 years ago

I will be happy to run whatever you like.
I am running a stock Mini XL with no hosts/plugins/etc., just iSCSI/NFS/Samba.

I don't see python running now - maybe it was running to generate the debug file?

I am performing backups of my ESX host environment to a local LUN via a script called ghettoVCB on the ESX host. What it does is the following:
1) Snapshot the VM
2) Perform a "copy" of the VM files using something like the following:
/sbin/vmkfstools -i "/vmfs/volumes/dsf02/DC1/DC1.vmdk" -a "lsilogic" -d "thin" "/vmfs/volumes/vmbackupslcl/VM_BACKUPS/DC1/DC1-2017-07-23_16-58-11/DC1.vmdk"
3) Delete the snapshot

Once all of the backups are completed, I then rsync them to an NFS fileshare via the FreeNAS rsync service.

Note: /vmfs/volumes/dsf02 and /vmfs/volumes/vmbackupslcl are LUNs provided by FreeNAS.

If there are any commands you want me to run, please let me know.
-Tom

#12 Updated by Tom Schneider about 3 years ago

I also ran ctladm dumpooa several times in a row - here are the results:
LUN 2 tag 0xdf480 RTR: WRITE. CDB: 2a 00 10 a6 93 50 00 00 08 00 (3 ms)
LUN 3 tag 0x765e7 RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0x765e8 RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0x765e9 RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0x765ea RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0x765eb RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0x765ec RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0x765ed RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0x76637 RTR: EXTENDED COPY. CDB: (83 ms)
LUN 3 tag 0xb41a2 RTR: WRITE. CDB: 2a 00 2c 06 d0 80 00 08 00 00 (80 ms)
LUN 3 tag 0xb41ab RTR: WRITE. CDB: 2a 00 2c 06 d8 80 00 08 00 00 (78 ms)
LUN 3 tag 0xb41b4 RTR: WRITE. CDB: 2a 00 2c 06 e0 80 00 08 00 00 (77 ms)
LUN 3 tag 0xb41b5 RTR: WRITE. CDB: 2a 00 2c 06 e8 80 00 08 00 00 (75 ms)
LUN 3 tag 0xb41b6 RTR: WRITE. CDB: 2a 00 2c 06 f0 80 00 08 00 00 (73 ms)
LUN 3 tag 0xb41b7 RTR: WRITE. CDB: 2a 00 2c 06 f8 80 00 08 00 00 (72 ms)
LUN 3 tag 0xb41b8 RTR: WRITE. CDB: 2a 00 2c 07 00 80 00 08 00 00 (69 ms)
LUN 3 tag 0xb41b9 RTR: WRITE. CDB: 2a 00 2c 07 08 80 00 08 00 00 (66 ms)
LUN 3 tag 0xb41ba RTR: WRITE. CDB: 2a 00 2c 07 10 80 00 08 00 00 (63 ms)
LUN 3 tag 0xb41bb RTR: WRITE. CDB: 2a 00 2c 07 18 80 00 08 00 00 (58 ms)
LUN 3 tag 0xb41bc RTR: WRITE. CDB: 2a 00 2c 07 20 80 00 08 00 00 (51 ms)
LUN 3 tag 0xb41bd RTR: WRITE. CDB: 2a 00 2c 07 28 80 00 08 00 00 (49 ms)
LUN 3 tag 0xb41be RTR: WRITE. CDB: 2a 00 2c 07 30 80 00 08 00 00 (44 ms)
LUN 3 tag 0xb41bf RTR: WRITE. CDB: 2a 00 2c 07 38 80 00 08 00 00 (41 ms)
LUN 3 tag 0xb41c0 RTR: WRITE. CDB: 2a 00 2c 07 40 80 00 08 00 00 (37 ms)
LUN 3 tag 0xb41c1 RTR: WRITE. CDB: 2a 00 2c 07 48 80 00 08 00 00 (33 ms)
LUN 3 tag 0xb41c2 RTR: WRITE. CDB: 2a 00 2c 07 50 80 00 08 00 00 (30 ms)
LUN 3 tag 0xb41c3 RTR: WRITE. CDB: 2a 00 2c 07 58 80 00 08 00 00 (28 ms)
LUN 3 tag 0xb41c4 RTR: WRITE. CDB: 2a 00 2c 07 60 80 00 08 00 00 (27 ms)
LUN 3 tag 0xb41c5 RTR: WRITE. CDB: 2a 00 2c 07 68 80 00 08 00 00 (24 ms)
LUN 3 tag 0xb41c6 RTR: WRITE. CDB: 2a 00 2c 07 70 80 00 08 00 00 (21 ms)
LUN 3 tag 0xb41c7 RTR: WRITE. CDB: 2a 00 2c 07 78 80 00 08 00 00 (18 ms)
LUN 3 tag 0xb41c8 RTR: WRITE. CDB: 2a 00 2c 07 80 80 00 08 00 00 (16 ms)
LUN 3 tag 0xb41c9 RTR: WRITE. CDB: 2a 00 2c 07 88 80 00 08 00 00 (13 ms)
LUN 3 tag 0xb41ca RTR: WRITE. CDB: 2a 00 2c 07 90 80 00 08 00 00 (10 ms)
LUN 3 tag 0xb41cb RTR: WRITE. CDB: 2a 00 2c 07 98 80 00 08 00 00 (8 ms)
LUN 3 tag 0xb41cc RTR: WRITE. CDB: 2a 00 2c 07 a0 80 00 08 00 00 (5 ms)
LUN 3 tag 0xb41cd RTR: WRITE. CDB: 2a 00 2c 07 a8 80 00 08 00 00 (3 ms)
LUN 3 tag 0xb41ce RTR: WRITE. CDB: 2a 00 2c 07 b0 80 00 08 00 00 (0 ms)
OOA queues dump done
root@freenas:~ # ctladm dumpooa
Dumping OOA queues
LUN 2 tag 0xdffab RTR: READ. CDB: 28 00 07 38 ff 50 00 00 10 00 (13 ms)
LUN 3 tag 0x765ea RTR: EXTENDED COPY. CDB: (597 ms)
LUN 3 tag 0x765eb RTR: EXTENDED COPY. CDB: (597 ms)
LUN 3 tag 0x765ec RTR: EXTENDED COPY. CDB: (597 ms)
LUN 3 tag 0x765ed RTR: EXTENDED COPY. CDB: (597 ms)
LUN 3 tag 0x76637 RTR: EXTENDED COPY. CDB: (597 ms)
LUN 3 tag 0xb41c0 RTR: WRITE. CDB: 2a 00 2c 07 40 80 00 08 00 00 (551 ms)
LUN 3 tag 0xb41c1 RTR: WRITE. CDB: 2a 00 2c 07 48 80 00 08 00 00 (547 ms)
LUN 3 tag 0xb41c2 RTR: WRITE. CDB: 2a 00 2c 07 50 80 00 08 00 00 (544 ms)
LUN 3 tag 0xb41c3 RTR: WRITE. CDB: 2a 00 2c 07 58 80 00 08 00 00 (541 ms)
LUN 3 tag 0xb41c4 RTR: WRITE. CDB: 2a 00 2c 07 60 80 00 08 00 00 (540 ms)
LUN 3 tag 0xb41c5 RTR: WRITE. CDB: 2a 00 2c 07 68 80 00 08 00 00 (537 ms)
LUN 3 tag 0xb41c6 RTR: WRITE. CDB: 2a 00 2c 07 70 80 00 08 00 00 (535 ms)
LUN 3 tag 0xb41c7 RTR: WRITE. CDB: 2a 00 2c 07 78 80 00 08 00 00 (532 ms)
LUN 3 tag 0xb41c8 RTR: WRITE. CDB: 2a 00 2c 07 80 80 00 08 00 00 (530 ms)
LUN 3 tag 0xb41c9 RTR: WRITE. CDB: 2a 00 2c 07 88 80 00 08 00 00 (527 ms)
LUN 3 tag 0xb41ca RTR: WRITE. CDB: 2a 00 2c 07 90 80 00 08 00 00 (524 ms)
LUN 3 tag 0xb41cb RTR: WRITE. CDB: 2a 00 2c 07 98 80 00 08 00 00 (521 ms)
LUN 3 tag 0xb41cc RTR: WRITE. CDB: 2a 00 2c 07 a0 80 00 08 00 00 (519 ms)
LUN 3 tag 0xb41cd RTR: WRITE. CDB: 2a 00 2c 07 a8 80 00 08 00 00 (516 ms)
LUN 3 tag 0xb41ce RTR: WRITE. CDB: 2a 00 2c 07 b0 80 00 08 00 00 (514 ms)
LUN 3 tag 0xb41cf RTR: WRITE. CDB: 2a 00 2c 07 b8 80 00 08 00 00 (510 ms)
LUN 3 tag 0xb41d0 RTR: WRITE. CDB: 2a 00 2c 07 c0 80 00 08 00 00 (507 ms)
LUN 3 tag 0xb41d1 RTR: WRITE. CDB: 2a 00 2c 07 c8 80 00 08 00 00 (504 ms)
LUN 3 tag 0x76638 RTR: COMPARE AND WRITE. CDB: (128 ms)
OOA queues dump done
root@freenas:~ # ctladm dumpooa
Dumping OOA queues
LUN 0 tag 0xb4221 RTR: READ. CDB: 28 00 0b aa 78 00 00 08 00 00 (82 ms)
LUN 0 tag 0xb4222 BLOCKED: READ. CDB: 28 00 0b aa 80 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb4223 BLOCKED: READ. CDB: 28 00 0b aa 88 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb4224 BLOCKED: READ. CDB: 28 00 0b aa 90 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb4225 BLOCKED: READ. CDB: 28 00 0b aa 98 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb4226 BLOCKED: READ. CDB: 28 00 0b aa a0 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb4227 BLOCKED: READ. CDB: 28 00 0b aa a8 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb4228 BLOCKED: READ. CDB: 28 00 0b aa b0 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb4229 BLOCKED: READ. CDB: 28 00 0b aa b8 00 00 08 00 00 (81 ms)
LUN 0 tag 0xb422a BLOCKED: READ. CDB: 28 00 0b aa c0 00 00 08 00 00 (80 ms)
LUN 0 tag 0xb422b BLOCKED: READ. CDB: 28 00 0b aa c8 00 00 08 00 00 (80 ms)
LUN 0 tag 0xb422c BLOCKED: READ. CDB: 28 00 0b aa d0 00 00 08 00 00 (80 ms)
LUN 0 tag 0xb422d BLOCKED: READ. CDB: 28 00 0b aa d8 00 00 08 00 00 (80 ms)
LUN 2 tag 0xdffaf RTR: READ. CDB: 28 00 07 39 05 30 00 00 10 00 (14 ms)
LUN 3 tag 0x765f9 RTR: EXTENDED COPY. CDB: (84 ms)
LUN 3 tag 0x765fa RTR: EXTENDED COPY. CDB: (84 ms)
LUN 3 tag 0x765fb RTR: EXTENDED COPY. CDB: (84 ms)
LUN 3 tag 0x765fc RTR: EXTENDED COPY. CDB: (84 ms)
LUN 3 tag 0x765fd RTR: EXTENDED COPY. CDB: (84 ms)
LUN 3 tag 0x7664a RTR: EXTENDED COPY. CDB: (84 ms)
LUN 3 tag 0xb423b RTR: WRITE. CDB: 2a 00 2c 08 f0 80 00 08 00 00 (48 ms)
LUN 3 tag 0xb423c RTR: WRITE. CDB: 2a 00 2c 09 38 80 00 08 00 00 (47 ms)
LUN 3 tag 0xb423d RTR: WRITE. CDB: 2a 00 2c 08 f8 80 00 08 00 00 (41 ms)
LUN 3 tag 0xb423e RTR: WRITE. CDB: 2a 00 2c 09 00 80 00 08 00 00 (39 ms)
LUN 3 tag 0xb423f RTR: WRITE. CDB: 2a 00 2c 09 08 80 00 08 00 00 (36 ms)
LUN 3 tag 0xb4240 RTR: WRITE. CDB: 2a 00 2c 09 40 80 00 08 00 00 (26 ms)
LUN 3 tag 0xb4241 RTR: WRITE. CDB: 2a 00 2c 09 48 80 00 08 00 00 (21 ms)
LUN 3 tag 0xb4242 RTR: WRITE. CDB: 2a 00 2c 09 50 80 00 08 00 00 (13 ms)
LUN 3 tag 0xb4243 RTR: WRITE. CDB: 2a 00 2c 09 58 80 00 08 00 00 (8 ms)
LUN 3 tag 0xb4244 RTR: WRITE. CDB: 2a 00 2c 09 60 80 00 08 00 00 (1 ms)
OOA queues dump done
root@freenas:~ # ctladm dumpooa
Dumping OOA queues
LUN 2 tag 0xdf4af RTR: READ. CDB: 28 00 07 39 2c 90 00 00 08 00 (44 ms)
LUN 3 tag 0x76604 RTR: EXTENDED COPY. CDB: (284 ms)
LUN 3 tag 0xb428f RTR: WRITE. CDB: 2a 00 2c 0a b8 80 00 08 00 00 (159 ms)
LUN 3 tag 0xb4290 RTR: WRITE. CDB: 2a 00 2c 0a c0 80 00 08 00 00 (157 ms)
LUN 3 tag 0xb4291 RTR: WRITE. CDB: 2a 00 2c 0a c8 80 00 08 00 00 (152 ms)
OOA queues dump done
root@freenas:~ # ctladm dumpooa
Dumping OOA queues
LUN 2 tag 0xdffc6 RTR: READ. CDB: 28 00 07 39 35 d8 00 00 08 00 (67 ms)
LUN 3 tag 0x76665 RTR: EXTENDED COPY. CDB: (153 ms)
LUN 3 tag 0x76666 RTR: EXTENDED COPY. CDB: (153 ms)
LUN 3 tag 0x76667 RTR: EXTENDED COPY. CDB: (153 ms)
LUN 3 tag 0x76668 RTR: EXTENDED COPY. CDB: (153 ms)
LUN 3 tag 0xb4303 RTR: WRITE. CDB: 2a 00 2c 0c 38 80 00 08 00 00 (101 ms)
LUN 3 tag 0xb4304 RTR: WRITE. CDB: 2a 00 2c 0c 40 80 00 08 00 00 (97 ms)
LUN 3 tag 0xb4305 RTR: WRITE. CDB: 2a 00 2c 0c 48 80 00 08 00 00 (94 ms)
LUN 3 tag 0xb4306 RTR: WRITE. CDB: 2a 00 2c 0c 50 80 00 08 00 00 (91 ms)
LUN 3 tag 0xb4307 RTR: WRITE. CDB: 2a 00 2c 0c 58 80 00 08 00 00 (85 ms)
LUN 3 tag 0xb4308 RTR: WRITE. CDB: 2a 00 2c 0c 60 80 00 08 00 00 (83 ms)
LUN 3 tag 0xb4309 RTR: WRITE. CDB: 2a 00 2c 0c 68 80 00 08 00 00 (77 ms)
LUN 3 tag 0xb430a RTR: WRITE. CDB: 2a 00 2c 0c 70 80 00 08 00 00 (72 ms)
LUN 3 tag 0xb430b RTR: WRITE. CDB: 2a 00 2c 0c 78 80 00 08 00 00 (66 ms)
LUN 3 tag 0xb430c RTR: WRITE. CDB: 2a 00 2c 0c 80 80 00 08 00 00 (61 ms)
LUN 3 tag 0xb430d RTR: WRITE. CDB: 2a 00 2c 0c 88 80 00 08 00 00 (56 ms)
LUN 3 tag 0xb430e RTR: WRITE. CDB: 2a 00 2c 0c 90 80 00 08 00 00 (53 ms)
LUN 3 tag 0xb430f RTR: WRITE. CDB: 2a 00 2c 0c 98 80 00 08 00 00 (49 ms)
LUN 3 tag 0xb4310 RTR: WRITE. CDB: 2a 00 2c 0c a0 80 00 08 00 00 (45 ms)
LUN 3 tag 0xb4311 RTR: WRITE. CDB: 2a 00 2c 0c a8 80 00 08 00 00 (41 ms)
OOA queues dump done
root@freenas:~ # ctladm dumpooa
Dumping OOA queues
LUN 0 tag 0xb4312 RTR: READ. CDB: 28 00 0b ad e0 00 00 08 00 00 (318 ms)
LUN 0 tag 0xb4313 BLOCKED: READ. CDB: 28 00 0b ad e8 00 00 08 00 00 (318 ms)
LUN 0 tag 0xb4314 BLOCKED: READ. CDB: 28 00 0b ad f0 00 00 08 00 00 (318 ms)
LUN 0 tag 0xb4315 BLOCKED: READ. CDB: 28 00 0b ad f8 00 00 08 00 00 (318 ms)
LUN 0 tag 0xb4316 BLOCKED: READ. CDB: 28 00 0b ae 00 00 00 08 00 00 (317 ms)
LUN 0 tag 0xb4317 BLOCKED: READ. CDB: 28 00 0b ae 08 00 00 08 00 00 (317 ms)
LUN 0 tag 0xb4318 BLOCKED: READ. CDB: 28 00 0b ae 10 00 00 08 00 00 (317 ms)
LUN 0 tag 0xb4319 BLOCKED: READ. CDB: 28 00 0b ae 18 00 00 08 00 00 (317 ms)
LUN 0 tag 0xb431a BLOCKED: READ. CDB: 28 00 0b ae 20 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb431b BLOCKED: READ. CDB: 28 00 0b ae 28 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb431c BLOCKED: READ. CDB: 28 00 0b ae 30 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb431d BLOCKED: READ. CDB: 28 00 0b ae 38 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb431e BLOCKED: READ. CDB: 28 00 0b ae 40 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb431f BLOCKED: READ. CDB: 28 00 0b ae 48 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb4320 BLOCKED: READ. CDB: 28 00 0b ae 50 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb4321 BLOCKED: READ. CDB: 28 00 0b ae 58 00 00 08 00 00 (316 ms)
LUN 0 tag 0xb4322 BLOCKED: READ. CDB: 28 00 0b ae 60 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb4323 BLOCKED: READ. CDB: 28 00 0b ae 68 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb4324 BLOCKED: READ. CDB: 28 00 0b ae 70 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb4325 BLOCKED: READ. CDB: 28 00 0b ae 78 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb4326 BLOCKED: READ. CDB: 28 00 0b ae 80 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb4327 BLOCKED: READ. CDB: 28 00 0b ae 88 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb4328 BLOCKED: READ. CDB: 28 00 0b ae 90 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb4329 BLOCKED: READ. CDB: 28 00 0b ae 98 00 00 08 00 00 (315 ms)
LUN 0 tag 0xb432a BLOCKED: READ. CDB: 28 00 0b ae a0 00 00 08 00 00 (314 ms)
LUN 0 tag 0xb432b BLOCKED: READ. CDB: 28 00 0b ae a8 00 00 08 00 00 (314 ms)
LUN 0 tag 0xb432c BLOCKED: READ. CDB: 28 00 0b ae b0 00 00 08 00 00 (314 ms)
LUN 0 tag 0xb432d BLOCKED: READ. CDB: 28 00 0b ae b8 00 00 08 00 00 (314 ms)
LUN 0 tag 0xb432e BLOCKED: READ. CDB: 28 00 0b ae c0 00 00 08 00 00 (313 ms)
LUN 0 tag 0xb432f BLOCKED: READ. CDB: 28 00 0b ae c8 00 00 08 00 00 (313 ms)
LUN 0 tag 0xb4330 BLOCKED: READ. CDB: 28 00 0b ae d0 00 00 08 00 00 (313 ms)
LUN 0 tag 0xb4331 BLOCKED: READ. CDB: 28 00 0b ae d8 00 00 08 00 00 (313 ms)
LUN 2 tag 0xdf4df RTR: READ. CDB: 28 00 07 38 fe 80 00 00 08 00 (25 ms)
LUN 3 tag 0x76669 RTR: EXTENDED COPY. CDB: (318 ms)
LUN 3 tag 0x7666a RTR: EXTENDED COPY. CDB: (318 ms)
LUN 3 tag 0x7666b RTR: EXTENDED COPY. CDB: (318 ms)
LUN 3 tag 0x7666c RTR: EXTENDED COPY. CDB: (318 ms)
LUN 3 tag 0x7666d RTR: EXTENDED COPY. CDB: (318 ms)
LUN 3 tag 0x7666e RTR: EXTENDED COPY. CDB: (318 ms)
LUN 3 tag 0x7666f RTR: EXTENDED COPY. CDB: (318 ms)
LUN 3 tag 0x76670 RTR: EXTENDED COPY. CDB: (318 ms)
OOA queues dump done
root@freenas:~ # ctladm dumpooa
Dumping OOA queues
LUN 0 tag 0xb4316 RTR: READ. CDB: 28 00 0b ae 00 00 00 08 00 00 (765 ms)
LUN 0 tag 0xb4317 BLOCKED: READ. CDB: 28 00 0b ae 08 00 00 08 00 00 (765 ms)
LUN 0 tag 0xb4318 BLOCKED: READ. CDB: 28 00 0b ae 10 00 00 08 00 00 (765 ms)
LUN 0 tag 0xb4319 BLOCKED: READ. CDB: 28 00 0b ae 18 00 00 08 00 00 (765 ms)
LUN 0 tag 0xb431a BLOCKED: READ. CDB: 28 00 0b ae 20 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb431b BLOCKED: READ. CDB: 28 00 0b ae 28 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb431c BLOCKED: READ. CDB: 28 00 0b ae 30 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb431d BLOCKED: READ. CDB: 28 00 0b ae 38 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb431e BLOCKED: READ. CDB: 28 00 0b ae 40 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb431f BLOCKED: READ. CDB: 28 00 0b ae 48 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb4320 BLOCKED: READ. CDB: 28 00 0b ae 50 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb4321 BLOCKED: READ. CDB: 28 00 0b ae 58 00 00 08 00 00 (764 ms)
LUN 0 tag 0xb4322 BLOCKED: READ. CDB: 28 00 0b ae 60 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb4323 BLOCKED: READ. CDB: 28 00 0b ae 68 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb4324 BLOCKED: READ. CDB: 28 00 0b ae 70 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb4325 BLOCKED: READ. CDB: 28 00 0b ae 78 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb4326 BLOCKED: READ. CDB: 28 00 0b ae 80 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb4327 BLOCKED: READ. CDB: 28 00 0b ae 88 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb4328 BLOCKED: READ. CDB: 28 00 0b ae 90 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb4329 BLOCKED: READ. CDB: 28 00 0b ae 98 00 00 08 00 00 (763 ms)
LUN 0 tag 0xb432a BLOCKED: READ. CDB: 28 00 0b ae a0 00 00 08 00 00 (762 ms)
LUN 0 tag 0xb432b BLOCKED: READ. CDB: 28 00 0b ae a8 00 00 08 00 00 (762 ms)
LUN 0 tag 0xb432c BLOCKED: READ. CDB: 28 00 0b ae b0 00 00 08 00 00 (762 ms)
LUN 0 tag 0xb432d BLOCKED: READ. CDB: 28 00 0b ae b8 00 00 08 00 00 (762 ms)
LUN 0 tag 0xb432e BLOCKED: READ. CDB: 28 00 0b ae c0 00 00 08 00 00 (761 ms)
LUN 0 tag 0xb432f BLOCKED: READ. CDB: 28 00 0b ae c8 00 00 08 00 00 (761 ms)
LUN 0 tag 0xb4330 BLOCKED: READ. CDB: 28 00 0b ae d0 00 00 08 00 00 (761 ms)
LUN 0 tag 0xb4331 BLOCKED: READ. CDB: 28 00 0b ae d8 00 00 08 00 00 (761 ms)
LUN 2 tag 0xe0009 RTR: READ. CDB: 28 00 07 3a 4b b0 00 00 10 00 (32 ms)
LUN 3 tag 0x76669 RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0x7666a RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0x7666b RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0x7666c RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0x7666d RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0x7666e RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0x7666f RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0x76670 RTR: EXTENDED COPY. CDB: (766 ms)
LUN 3 tag 0xb4333 RTR: WRITE. CDB: 2a 00 2c 0c d8 80 00 08 00 00 (23 ms)
LUN 3 tag 0xb4334 RTR: WRITE. CDB: 2a 00 2c 0c e0 80 00 08 00 00 (5 ms)
LUN 3 tag 0xb4335 RTR: WRITE. CDB: 2a 00 2c 0c e8 80 00 08 00 00 (0 ms)
OOA queues dump done
root@freenas:~ # ctladm dumpooa
Dumping OOA queues
LUN 0 tag 0xb4366 RTR: READ. CDB: 28 00 0b af 80 00 00 08 00 00 (115 ms)
LUN 0 tag 0xb4367 BLOCKED: READ. CDB: 28 00 0b af 88 00 00 08 00 00 (115 ms)
LUN 0 tag 0xb4368 BLOCKED: READ. CDB: 28 00 0b af 90 00 00 08 00 00 (115 ms)
LUN 0 tag 0xb4369 BLOCKED: READ. CDB: 28 00 0b af 98 00 00 08 00 00 (115 ms)
LUN 0 tag 0xb436a BLOCKED: READ. CDB: 28 00 0b af a0 00 00 08 00 00 (114 ms)
LUN 0 tag 0xb436b BLOCKED: READ. CDB: 28 00 0b af a8 00 00 08 00 00 (114 ms)
LUN 0 tag 0xb436c BLOCKED: READ. CDB: 28 00 0b af b0 00 00 08 00 00 (114 ms)
LUN 0 tag 0xb436d BLOCKED: READ. CDB: 28 00 0b af b8 00 00 08 00 00 (114 ms)
LUN 0 tag 0xb436f BLOCKED: READ. CDB: 28 00 0b af c0 00 00 08 00 00 (113 ms)
LUN 0 tag 0xb4370 BLOCKED: READ. CDB: 28 00 0b af c8 00 00 08 00 00 (113 ms)
LUN 0 tag 0xb4371 BLOCKED: READ. CDB: 28 00 0b af d0 00 00 08 00 00 (113 ms)
LUN 0 tag 0xb4372 BLOCKED: READ. CDB: 28 00 0b af d8 00 00 08 00 00 (113 ms)
LUN 2 tag 0xdf51c RTR: READ. CDB: 28 00 07 39 86 c8 00 00 18 00 (11 ms)
LUN 3 tag 0x76673 RTR: EXTENDED COPY. CDB: (118 ms)
LUN 3 tag 0x76674 RTR: EXTENDED COPY. CDB: (118 ms)
LUN 3 tag 0x76675 RTR: EXTENDED COPY. CDB: (118 ms)
LUN 3 tag 0x76676 RTR: EXTENDED COPY. CDB: (118 ms)
LUN 3 tag 0x76677 RTR: EXTENDED COPY. CDB: (118 ms)
LUN 3 tag 0x76678 RTR: EXTENDED COPY. CDB: (118 ms)
LUN 3 tag 0x76679 RTR: EXTENDED COPY. CDB: (118 ms)
LUN 3 tag 0xb4379 RTR: WRITE. CDB: 2a 00 2c 0e 08 80 00 08 00 00 (93 ms)
LUN 3 tag 0xb437a RTR: WRITE. CDB: 2a 00 2c 0e 10 80 00 08 00 00 (90 ms)
LUN 3 tag 0xb437b RTR: WRITE. CDB: 2a 00 2c 0e 18 80 00 08 00 00 (88 ms)
LUN 3 tag 0xb437c RTR: WRITE. CDB: 2a 00 2c 0e 20 80 00 08 00 00 (84 ms)
LUN 3 tag 0xb437d RTR: WRITE. CDB: 2a 00 2c 0e 28 80 00 08 00 00 (83 ms)
LUN 3 tag 0xb437e RTR: WRITE. CDB: 2a 00 2c 0e 30 80 00 08 00 00 (81 ms)
LUN 3 tag 0xb437f RTR: WRITE. CDB: 2a 00 2c 0e 38 80 00 08 00 00 (78 ms)
LUN 3 tag 0xb4380 RTR: WRITE. CDB: 2a 00 2c 0e 40 80 00 08 00 00 (76 ms)
LUN 3 tag 0xb4381 RTR: WRITE. CDB: 2a 00 2c 0e 48 80 00 08 00 00 (74 ms)
LUN 3 tag 0xb4382 RTR: WRITE. CDB: 2a 00 2c 0e 50 80 00 08 00 00 (44 ms)
LUN 3 tag 0xb4383 RTR: WRITE. CDB: 2a 00 2c 0e 58 80 00 08 00 00 (34 ms)
LUN 3 tag 0xb4384 RTR: WRITE. CDB: 2a 00 2c 0e 60 80 00 08 00 00 (31 ms)
LUN 3 tag 0xb4385 RTR: WRITE. CDB: 2a 00 2c 0e 68 80 00 08 00 00 (29 ms)
OOA queues dump done
root@freenas:~ #
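For reference, the hex CDBs in these dumps decode straightforwardly: byte 0 is the SCSI opcode (0x28 = READ(10), 0x2a = WRITE(10)), bytes 2-5 are the big-endian LBA, and bytes 7-8 the transfer length in blocks. A small sketch (assuming a 512-byte logical block size, the common iSCSI default):

```python
# Decode the 10-byte READ/WRITE CDBs printed by `ctladm dumpooa`:
# byte 0 = opcode, bytes 2-5 = big-endian LBA, bytes 7-8 = big-endian
# transfer length in blocks (SCSI READ(10)/WRITE(10) layout).
OPCODES = {0x28: "READ(10)", 0x2a: "WRITE(10)"}

def decode_cdb10(hex_bytes, block_size=512):
    """hex_bytes: space-separated hex as printed, e.g. '2a 00 2c 07 40 80 00 08 00 00'."""
    cdb = bytes(int(b, 16) for b in hex_bytes.split())
    opcode = OPCODES.get(cdb[0], "opcode 0x%02x" % cdb[0])
    lba = int.from_bytes(cdb[2:6], "big")
    nblocks = int.from_bytes(cdb[7:9], "big")
    return opcode, lba, nblocks, nblocks * block_size
```

Decoded this way, every WRITE in the dumps above transfers 0x800 blocks, i.e. 1 MiB per command at a 512-byte block size, so dozens of them queued alongside the EXTENDED COPY commands adds up quickly on a four-disk pool.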

#13 Updated by Alexander Motin about 3 years ago

  • Status changed from 15 to Investigation

OK, this is more informative. But if that is all, then it seems to be more a question of tuning than a real bug.

You have a pool of only four disks, which cannot handle many random IOPS. Virtual machines quite often tend to fragment their data and the pool with random writes, which makes bulk sequential operations like VM copying not so easy. The fact that the vg01/ds02 ZVOL has a block size of only 8 KB does not help to reduce the fragmentation either. FreeNAS for quite some time now prefers to use at least 16 KB block sizes, which trade lower fragmentation for a higher probability of read-modify-write sequences on random data rewrites.

On the other side, VMware tends to send quite a lot of simultaneous XCOPY operations. That may be good for big storages with many spindles, but for the Mini it can be too much, only increasing latency without giving a real copy performance benefit. IIRC there is a way to tune the XCOPY size on the VMware side. I don't have a ready link for you, but you should find it if you google.

Third, you may investigate a different backup strategy, utilizing ZFS snapshots instead of doing two copy operations within the same pretty small and slow pool every time.
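The block-size trade-off described above can be illustrated with a toy calculation (a sketch, not ZFS's actual I/O path; it ignores the ARC, compression, and RAID-Z parity): ZFS rewrites whole volblocksize blocks, so any write that only partially covers a block forces a read-modify-write.

```python
# Toy model of the volblocksize trade-off: a write that only partially
# covers a block forces ZFS to read the old block, merge, and write it
# back. Ignores caching, compression, and RAID-Z parity entirely.
def rmw_cost(offset, length, blocksize):
    """Return (bytes_read, bytes_written) for one logical write."""
    first = offset // blocksize
    last = (offset + length - 1) // blocksize
    reads = 0
    # Only the first and last touched blocks can be partially covered.
    for blk in {first, last}:
        blk_start, blk_end = blk * blocksize, (blk + 1) * blocksize
        if not (offset <= blk_start and offset + length >= blk_end):
            reads += 1
    return reads * blocksize, (last - first + 1) * blocksize
```

For a guest issuing 4 KiB random writes, an 8 KiB ZVOL moves 16 KiB per write and a 16 KiB ZVOL moves 32 KiB: that is the higher read-modify-write cost of bigger blocks, paid for by less fragmentation on bulk sequential operations like the XCOPY-based clone.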

#14 Updated by Tom Schneider about 3 years ago

So - slow drives will cause the LUNs to be lost to ESX? This does not make sense to me.
Is there a way to change the block size on the array from 8K to 16K?

I will look for the XCOPY tuning, but this only happens when I turn on VAAI; when I turn off VAAI I don't get all of these disconnects.

I have had this environment for well over a year, and these issues only seem to happen when I turn on VAAI.

#15 Updated by Alexander Motin about 3 years ago

Tom Schneider wrote:

So - slow drives will cause the LUNs to be lost to ESX? This does not make sense to me.

Slow drives may cause latency and command aborts. I don't know about losses; there is not enough info.

Is there a way to change the block size on the array from 8K to 16K?

No, it is read-only. And it is not "on the array", but a property of the specific ZVOL. You may create a new ZVOL and move the data.

I will look for the XCOPY tuning, but this only happens when I turn on VAAI; when I turn off VAAI I don't get all of these disconnects.

XCOPY is one of VAAI primitives.

I have had this environment for well over a year, and these issues only seem to happen when I turn on VAAI.

Maybe before you were limited by network speed. Or the non-VAAI copy has a lower request queue depth and does not create such high latency.

#16 Updated by Tom Schneider about 3 years ago

OK - let's start over.

Given my configuration on the Mini XL, with 2 full-duplex 1Gb Ethernet iSCSI paths, should I be experiencing the following on multiple LUNs presented to ESXi:

Lost access to volume 57d319b1-3db52f3c-87b8-001517bfcde6 (dsf02) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/23/2017 5:40:56 PM
dsf02

Lost access to volume 57d3196b-5bc36e46-34c9-001517bfcde6 (dsf01) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/23/2017 5:40:56 PM
dsf01

Lost access to volume 57d319c5-48978f39-272a-001517bfcde6 (dsf03) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/23/2017 5:40:56 PM
dsf03

Lost access to volume 57d569fb-a00f41db-187c-001517bfcde6 (vmbackupslcl) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/23/2017 5:40:56 PM
vmbackupslcl

NOTE: It took 8 seconds to successfully restore access

#17 Updated by Alexander Motin about 3 years ago

Tom Schneider wrote:

should i be experiencing the following on multiple LUNS presented to ESXi:

No, you shouldn't. But other than the delays demonstrated above, you have provided no other input to start with.

#18 Updated by Tom Schneider about 3 years ago

OK - please advise if I should start another thread.

What I am planning on doing is adding 4 more spindles at a 16K block size and using that as the target for the copies.

I will report on the events seen from the ESX side as well as the FreeNAS side.

Other than a debug dump, what else would you like to see if this reproduces the problem?

#19 Updated by Alexander Motin about 3 years ago

  • Status changed from Investigation to 15

Any information about why VMware drops the datastore would be helpful - any errors other than just aborts. If aborts are the only symptom, then use `ctladm dumpooa` to find which commands cause delays, and observe disk load/delays with `gstat -I 1s -p`.

And as I said, search for VMware XCOPY tuning. I think that reducing the operation size may reduce latency.

#20 Updated by Tom Schneider about 3 years ago

  • File hostd.log added
  • File vmkernel.log added
  • File vobd.log added
  • File ctladm 20170724.txt added
  • File debug-freenas-20170724112553.tgz added

OK - regarding XCOPY, this was the recommendation: esxcfg-advcfg -s 16384 /DataMover/MaxHWTransferSize
It did not make any change to the LUN disconnects.

I did create a new zvol with a 16 KB block size and used it as the destination for the backups.

As a test, I am just backing up ONE VM from a single LUN (dsf02) to the newly created LUN (vmbackups2).

When running this I temporarily lose connectivity on most of my LUNs:
Successfully restored access to volume 57d319c5-48978f39-272a-001517bfcde6 (dsf03) following connectivity issues.
info
7/24/2017 10:12:27 AM
esx01.hq.cs2corp.com

Successfully restored access to volume 57d3196b-5bc36e46-34c9-001517bfcde6 (dsf01) following connectivity issues.
info
7/24/2017 10:12:27 AM
esx01.hq.cs2corp.com

Successfully restored access to volume 57d319b1-3db52f3c-87b8-001517bfcde6 (dsf02) following connectivity issues.
info
7/24/2017 10:12:27 AM
esx01.hq.cs2corp.com

Lost access to volume 57d319b1-3db52f3c-87b8-001517bfcde6 (dsf02) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/24/2017 10:12:25 AM
dsf02

Lost access to volume 57d3196b-5bc36e46-34c9-001517bfcde6 (dsf01) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/24/2017 10:12:25 AM
dsf01

Lost access to volume 57d319c5-48978f39-272a-001517bfcde6 (dsf03) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/24/2017 10:12:25 AM
dsf03

Successfully restored access to volume 57d3196b-5bc36e46-34c9-001517bfcde6 (dsf01) following connectivity issues.
info
7/24/2017 10:00:26 AM
esx01.hq.cs2corp.com

Successfully restored access to volume 57d319c5-48978f39-272a-001517bfcde6 (dsf03) following connectivity issues.
info
7/24/2017 10:00:26 AM
esx01.hq.cs2corp.com

Successfully restored access to volume 57d319b1-3db52f3c-87b8-001517bfcde6 (dsf02) following connectivity issues.
info
7/24/2017 10:00:26 AM
esx01.hq.cs2corp.com

Lost access to volume 57d319b1-3db52f3c-87b8-001517bfcde6 (dsf02) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/24/2017 10:00:25 AM
dsf02

Lost access to volume 57d3196b-5bc36e46-34c9-001517bfcde6 (dsf01) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/24/2017 10:00:25 AM
dsf01

Lost access to volume 57d319c5-48978f39-272a-001517bfcde6 (dsf03) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
7/24/2017 10:00:25 AM
dsf03

There is a KB for ESXi regarding this scenario: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2136081
--------------
This article provides information on understanding the lost access to volume related messages in ESXi 5.5/6.x.

The VMFS datastores are monitored through heartbeats that are issued in the form of write operations approximately once every 3 seconds to the VMFS volumes from the hosts. Each ESXi host accessing the VMFS datastores expects these heartbeat write I/O operations to complete within an 8-second window. If the heartbeat I/O does not complete within an 8-second window, the I/O is timed out and a subsequent heartbeat I/O is issued. If the heartbeat I/O still does not complete within a total 16-second window, the datastore is marked offline and a Lost access to volume log message is generated by hostd to reflect this behavior.

After a VMFS datastore is marked in an offline state, ESXi issues heartbeat I/O to the datastore approximately every 1 second until connectivity is restored. If a heartbeat I/O completes, the datastore is marked back online and host I/O is allowed to continue.
----------
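The timeout logic the KB describes can be summarized in a toy model (a sketch of the described behavior, not ESXi's actual implementation):

```python
# Toy model of the VMFS heartbeat timeout described in the KB above: a
# heartbeat write that has not completed within 8 s is timed out and
# reissued; if roughly 16 s pass with no completion at all, the datastore
# is marked offline ("Lost access to volume ...").
HB_TIMEOUT_S = 8
OFFLINE_AFTER_S = 16

def datastore_offline(attempt_latencies):
    """attempt_latencies: seconds each successive heartbeat I/O would take."""
    elapsed = 0.0
    for latency in attempt_latencies:
        if latency <= HB_TIMEOUT_S:       # this attempt completes in time
            return elapsed + latency >= OFFLINE_AFTER_S
        elapsed += HB_TIMEOUT_S           # timed out; reissue the heartbeat
        if elapsed >= OFFLINE_AFTER_S:
            return True
    return False
```

In this model a single slow heartbeat followed by a fast one does not drop the datastore, but two consecutive stalls of over 8 s each are enough to trigger the "Lost access to volume" message - which is why seconds-scale command latencies on the FreeNAS side are sufficient to cause the disconnects.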
This KB referred to 3 ESXi log files, which I have attached.

I also attached the latest debug from FreeNAS.

I tried to run ctladm dumpooa while the backup was running; the output is in ctladm 20170724.txt.

I can now recreate this within 30 minutes or so of running the backup script on one VM. If there is anything else I can try, please let me know.

#21 Updated by Alexander Motin about 3 years ago

Tom Schneider wrote:

OK - regarding XCOPY, this was the recommendation: esxcfg-advcfg -s 16384 /DataMover/MaxHWTransferSize
It did not make any change to the LUN disconnects.

Have you tried reducing it to just 1-2 MB?

I did create a new zvol with a 16 KB block size and used it as the destination for the backups.

OK, but it would be more helpful for the read ZVOL than for the write ZVOL, though it should be good as well.

I can now recreate this within 30 minutes or so of running the backup script on one VM. If there is anything else I can try, please let me know.

If you can easily reproduce it, it could be useful to grab a packet dump of the conversation between the storage and VMware exactly at the time of the problem, with `tcpdump -pvni {interface} -s0 -w {file} port 3260`. Though the amount of data can be too large unless you can reproduce the problem within a few minutes.
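One way to keep such a capture bounded is tcpdump's ring-buffer mode (`-C` file size in millions of bytes, `-W` number of files to rotate through), so the capture can run until the disconnect happens without filling the disk. A dry-run sketch that only prints the command it would run (the interface name and output path come from this ticket and are otherwise placeholders):

```shell
#!/bin/sh
# Ring-buffer capture of iSCSI traffic: rotate through 10 files of ~100 MB
# each, so tcpdump can run indefinitely with bounded disk usage.
IFACE=igb0    # interface used elsewhere in this ticket
CAPTURE_CMD="tcpdump -pni $IFACE -s0 -C 100 -W 10 -w /tmp/iscsi.pcap port 3260"
# Dry run: print the command instead of executing it.
echo "$CAPTURE_CMD"
```

Removing the `echo` (and running as root) would start the actual capture, writing /tmp/iscsi.pcap0 through /tmp/iscsi.pcap9 in rotation.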

#22 Updated by Tom Schneider about 3 years ago

  • File tcpdumpout added

OK - the XCOPY max transfer size was 4096 KB by default; I will try again with 2048.

I was using tcpdump -pvni igb0 -s0 -w /tmp/tcpdumpout port 3260 and I was getting latency changes but no disconnects. After a while I stopped tcpdump because I was concerned about the output size. Once I stopped tcpdump, the volumes disconnected. It took about 3 seconds for ESX to reconnect the LUNs.

I restarted tcpdump to see if I could capture the event that way.

With a timestamp of 2:07:04 PM from ESX, I was able to capture traffic during the event. I hope the attached "tcpdumpout" shows it.

#23 Updated by Tom Schneider about 3 years ago

OK - regarding the block size of the first 3 zvols (dsf01, dsf02, dsf03): I had created these with the wizard rather than the UI, and I never got the chance to select a block size. When I created the other zvols, I do see they were created with a 16K block size. I will work on recreating these, but it will take a while due to the size of the volumes...

#24 Updated by Tom Schneider about 3 years ago

  • File debug-freenas-20170725135753.tgz added

OK - I recreated the 3 zvols that housed my VMs with the 16K block size. I have completed 2 full backups with no errors being reported on the ESX events screen regarding lost LUNs. I am wondering now if all of my issues were with the original zvols created by the initial wizard. I have attached a debug dump that was taken during my 3rd backup, while the largest VM was being preserved.

If I see any other issues (other than those already reported) I will let you know.

Thank you for your patience.

#25 Updated by Dru Lavigne about 3 years ago

  • Status changed from 15 to 46

Sasha: is there anything to be fixed here?

#26 Updated by Tom Schneider about 3 years ago

FYI... I have been through multiple backup cycles and everything seems to be solved since I recreated the old datasets.

This can be closed from my perspective.

#27 Updated by Alexander Motin about 3 years ago

  • Status changed from 46 to Closed: Behaves correctly
  • Target version changed from 11.1 to N/A

Let's close this for now. While this area could probably be improved, I don't see obvious bugs on our side that could be easily fixed.

#28 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170723103012.txz)

#29 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170723151151.tgz)

#30 Updated by Dru Lavigne about 3 years ago

  • File deleted (20170723 command line output.docx)

#31 Updated by Dru Lavigne about 3 years ago

  • File deleted (vmkernel.log)

#32 Updated by Dru Lavigne about 3 years ago

  • File deleted (vobd.log)

#33 Updated by Dru Lavigne about 3 years ago

  • File deleted (hostd.log)

#34 Updated by Dru Lavigne about 3 years ago

  • File deleted (ctladm 20170724.txt)

#35 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170724112553.tgz)

#36 Updated by Dru Lavigne about 3 years ago

  • File deleted (tcpdumpout)

#37 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170725135753.tgz)

#38 Updated by Dru Lavigne about 3 years ago

  • Private changed from Yes to No
