Bug #23818

ctl_datamove: tag 0x14d77e8 on (1:3:4) aborted On high load

Added by Arend de Groot about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
No priority
Assignee:
Alexander Motin
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

1x M3G77A Hewlett Packard Enterprise DL380 Gen9 2X E5-2620v3 SP8040TV
5x 726719-B21 Hewlett Packard Enterprise HP 16GB 2RX4 PC4-2133P-R KIT
1x 727060-B21 HPE FlexFabric 10Gb 2-port 556FLR-SFP+ Adapter
1x SAS 9305-24i Host Bus Adapter
2x Samsung SSD 850 EVO Basic 250GB
24x MZ-75E1T0B/EU Samsung SSD 850 EVO Basic 1TB

ChangeLog Required:
No

Description

Hello, I'm building a new, fully SSD storage system. When I move 3 VMs over iSCSI (Storage vMotion) at the same time (30 GB+ VMs), I get ctl_datamove errors and very high disk latency, around 3000 ms.
After these errors the storage stays slow.

The 9305-24i has LSI firmware version P14 with driver version P15, as recommended.
Before this we tried the older FreeNAS 9.10 with driver P13 and firmware P12; we had more trouble with that combination and saw the same errors. (We had only 1 CPU then and now have 2, because there were high CPU spikes; those are gone now.)
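
(For reference, a quick way to double-check the firmware/driver pairing from the shell, assuming the adapter attaches via the mpr(4) driver as mpr0 and exposes these sysctls — both of which are assumptions on my part:)

    # sysctl dev.mpr.0.driver_version dev.mpr.0.firmware_version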

We use VMware 6.5 with iSCSI over a single 10 Gb DAC link through an HP switch (same NIC in the VM hosts as in the storage).
We have a default FreeNAS ZFS pool with compression enabled and ashift=12; the pool consists of 2 raidz groups of 12 disks in a stripe.
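
(Roughly what that layout corresponds to if built by hand — a sketch only, since the pool was actually created through the FreeNAS GUI; the pool name "tank" and the da0..da23 device names are placeholders:)

    # sysctl vfs.zfs.min_auto_ashift=12
    # zpool create tank \
        raidz da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 \
        raidz da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23
    # zfs set compression=lz4 tank      (lz4 assumed; the report only says "compression enabled")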

The complete error list is this:
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77e8 on (1:3:4) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77f2 on (1:3:4) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77ee on (1:3:1) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77f3 on (1:3:4) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77ec on (1:3:4) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77e9 on (1:3:4) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77ed on (1:3:3) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77ef on (1:3:3) aborted
May 6 11:06:38 zfs1 ctl_datamove: tag 0x14d77f0 on (1:3:3) aborted
May 6 11:06:58 zfs1 ctl_datamove: tag 0x14d77ea on (1:3:5) aborted
May 6 11:06:58 zfs1 ctl_datamove: tag 0x14d77eb on (1:3:5) aborted
May 6 11:06:58 zfs1 ctl_datamove: tag 0x14d77f7 on (1:3:5) aborted
May 6 11:06:58 zfs1 ctl_datamove: tag 0x14d77f8 on (1:3:5) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83fe on (1:3:3) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8407 on (1:3:3) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8408 on (1:3:3) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83f6 on (1:3:3) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83f5 on (1:3:3) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83fb on (1:3:3) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83fc on (1:3:1) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83f8 on (1:3:5) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8403 on (1:3:5) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8404 on (1:3:5) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8405 on (1:3:5) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8406 on (1:3:5) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x1721bf6 on (0:3:1) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83ff on (1:3:4) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8400 on (1:3:4) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8401 on (1:3:4) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d8402 on (1:3:4) aborted
May 6 11:07:39 zfs1 ctl_datamove: tag 0x14d83f7 on (1:3:4) aborted

I also attached an image of the latency for disks 21 and 22; all the other disks show about the same values. I attached the graph because I don't know how to get the latency figures from the command line.
Can you please help me fix this?

Disk latency.png (122 KB) — iSCSI performance hung — Arend de Groot, 05/06/2017 04:46 AM

Related issues

Related to FreeNAS - Bug #23846: Wrong graphs shown for disks latency (Resolved, 2017-05-08)

History

#1 Updated by Arend de Groot about 4 years ago

  • File messages added

Today I tried a few fixes that I found in previous forum posts and elsewhere: I disabled TRIM and set the ARC max to 70 of the 90 GB.
At first it looked better, with no error logging, but later it hung again with a lot of error messages.
gstat shows no I/O at that moment; the CPU and all disks are idle while it hangs for 1 to 5 minutes, and then it starts working again.
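
(For completeness, those two changes correspond roughly to the following loader tunables — a sketch, assuming the stock FreeBSD ZFS tunable names; in practice they are set via the FreeNAS Tunables page:)

    # /boot/loader.conf
    vfs.zfs.trim.enabled=0        # disable ZFS TRIM
    vfs.zfs.arc_max=75161927680   # cap the ARC at ~70 GB (70 * 1024^3 bytes)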

I attached the messages log file; I hope you can do something with it.

The errors that I got look like this:
May 6 19:39:08 zfs1 (1:3:1/0): Task I/O type: 0, Tag: 0x2be23de, Tag Type: 1
May 6 19:39:08 zfs1 (1:3:1/0): ctl_process_done: 704 seconds
May 6 19:39:08 zfs1 (0:3:1/0): Task I/O type: 0, Tag: 0x1e7b45a, Tag Type: 1
May 6 19:39:08 zfs1 (0:3:1/0): ctl_process_done: 703 seconds
May 6 19:39:08 zfs1 (1:3:1/0): Task I/O type: 0, Tag: 0x2be23de, Tag Type: 1

#2 Updated by Arend de Groot about 4 years ago

  • File ctladm dumpooa.txt added

I just read about running the command ctladm dumpooa when the system hangs.
I let the system hang and ran the command; the output looks like this, and I also attached it as a text file.

    # ctladm dumpooa
    Dumping OOA queues
    LUN 3 tag 0x152aa9 RTR: READ. CDB: 28 00 3b b0 b0 00 00 08 00 00 (28167 ms)
    LUN 3 tag 0x152aaa RTR: READ. CDB: 28 00 3b b0 b8 00 00 08 00 00 (28167 ms)
    LUN 3 tag 0x152ab1 BLOCKED: READ. CDB: 28 00 3b b0 c0 00 00 08 00 00 (28166 ms)
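
(A simple way to keep capturing this while a hang is in progress, if more samples are needed — the log path and the 10-second interval are arbitrary choices:)

    # while true; do date; ctladm dumpooa; sleep 10; done >> /var/tmp/dumpooa.log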

#3 Updated by Arend de Groot about 4 years ago

I saw that FreeNAS-11.0-RC is also stable now, so I upgraded to it, and the problem seems to be resolved.
I couldn't reproduce it anymore.

The only thing that is left is the disk latency graph.
I think there is something wrong with it, because it shows µs as ms.

#4 Updated by Alexander Motin about 4 years ago

  • Related to Bug #23846: Wrong graphs shown for disks latency added

#5 Updated by Alexander Motin about 4 years ago

  • Category changed from 76 to 89
  • Status changed from Unscreened to Resolved
  • Target version set to 11.0-RC

Latency graphs are indeed broken. I'll move that part into another ticket and close this one, since you confirm the problem is gone in 11.0-RC.
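
In the meantime, per-disk latency can also be watched directly from the shell with the standard FreeBSD tools, for example:

    # gstat -p           (the ms/r and ms/w columns show per-disk read/write latency)
    # iostat -x -w 1     (extended per-device statistics, refreshed every second)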

#6 Updated by Dru Lavigne over 3 years ago

  • File deleted (messages)

#7 Updated by Dru Lavigne over 3 years ago

  • File deleted (ctladm dumpooa.txt)
