Bug #9347

LSI 9201-16i, Firmware 16 permanently resetting under load

Added by Harald Linden over 5 years ago. Updated about 3 years ago.

Status:
Closed: Cannot reproduce
Priority:
Important
Assignee:
Josh Paetzel
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

Hardware:

Intel S2600CP4
LSI 9201-16i, FW 16.00.00.00
9 Seagate Constellation ES.3, FW 0004
5 Hitachi/HGST Ultrastar 7K4000, FW MJ6OA580

Under load, the controller is resetting about every 10 seconds:

Apr 21 18:09:37 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
Apr 21 18:09:37 cik-dc-m7 mps0: Reinitializing controller,
Apr 21 18:09:37 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
Apr 21 18:09:37 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Apr 21 18:09:37 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3
Apr 21 18:09:48 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
Apr 21 18:09:48 cik-dc-m7 mps0: Reinitializing controller,
Apr 21 18:09:48 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
Apr 21 18:09:48 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Apr 21 18:09:48 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3
Apr 21 18:10:00 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
Apr 21 18:10:00 cik-dc-m7 mps0: Reinitializing controller,
Apr 21 18:10:00 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
Apr 21 18:10:00 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Apr 21 18:10:00 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3

This affects throughput massively (a scrub is currently running at 55 MB/s).
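
A quick way to quantify the reset frequency from the log (a sketch; the log path and message format match the excerpt above, adjust for your syslog setup):

# Count IOC Fault resets per minute in the system log
grep 'IOC Fault' /var/log/messages | awk '{print $1, $2, substr($3, 1, 5)}' | uniq -c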

History

#1 Updated by Harald Linden over 5 years ago

This problem has not been reproducible on an almost identical system fitted with Seagate Constellation ES.3 SAS disks.

#2 Updated by Josh Paetzel over 5 years ago

  • Category set to 76
  • Status changed from Unscreened to Screened
  • Assignee set to Josh Paetzel
  • Target version set to 49

There are a couple of possibilities here. One is that a particular piece of hardware is bad. You'd isolate this by removing the drives one by one and seeing whether the controller resets persist. Given that the problem impacts resilver performance, this is somewhat risky, so make sure you have good backups before starting. Note that if you narrow the problem down to a particular drive, you still haven't fully isolated it: you'll want to swap drives between slots to see whether the problem follows the drive. If it does, it's the drive. If it doesn't, it could be the slot, the cable, or possibly even the controller itself.
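
To keep track of which drive sits in which slot during such a swap test, it helps to record device names and serial numbers first. A minimal sketch using stock FreeBSD tools (device names are examples):

# List the devices the controller currently sees
camcontrol devlist
# Record each drive's serial number so swaps can be verified later
for d in /dev/da0 /dev/da1; do smartctl -i $d | grep -i 'serial'; done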

The second possibility is an incompatibility between the controller and the drives themselves. It's probably easiest to track this down by contacting Avago/LSI support with your drive model numbers and firmware revisions, plus the controller's serial number and firmware version.

We've been on the fence about updating the driver; a new version will appear in the nightlies soon. Trying that out would also be a possibility, but it will also entail updating the firmware on your controller.
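
If you go that route, LSI's sas2flash utility can report the current firmware level first (a sketch; the actual flash step is commented out and the image name is a placeholder):

# Show adapter, BIOS, and firmware versions for all LSI SAS2 HBAs
sas2flash -listall
# Flashing a newer IT-mode image would look roughly like this:
# sas2flash -o -f 9201-16i_IT_P20.bin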

#3 Updated by Jordan Hubbard over 5 years ago

It seems that if this were an actual bug (for us to fix), we'd be getting a lot more people reporting it, since version 16 is the "official" version of the firmware.

#4 Updated by Harald Linden over 5 years ago

After power-cycling the machine (not just resetting it!), the problem seems to be gone, just like in my forum post back in November (a different machine, but the same kind of controller).

#5 Updated by Xin Li over 5 years ago

Jordan Hubbard wrote:

It seems that if this were an actual bug (for us to fix) we'd be getting a lot more people reporting this, since version 16 is the "official" version of the firmware.

Well, the combination could be rare -- for instance, are these Hitachi/HGST Ultrastar 7K4000 (FW MJ6OA580) drives SAS or SATA? The model is available in both variants, and the Google results I have found suggest they are SATA. Also, is a SAS expander used in the configuration?

Mixing SAS and SATA drives behind a SAS expander can lead to bad results when a SATA drive behaves strangely, which is possible under high load. Newer firmware may be able to work around the issue and improve reliability, but after a quick glance I haven't seen such a discussion in the P17-P20 release notes.
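
For what it's worth, the transport type is visible from the OS without pulling a drive (a sketch with da0 as an example device; SAS drives report a "Transport protocol: SAS" line, while SATA drives show ATA identify data):

# Check whether a drive behind the HBA is SAS or SATA
smartctl -i /dev/da0 | grep -iE 'transport|sata'
# camcontrol's device list also distinguishes ATA from SCSI/SAS inquiry strings
camcontrol devlist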

#6 Updated by Josh Paetzel over 5 years ago

I'd bet they are SATA and this system is direct-wired; otherwise I'd expect to see an 8i or 4i controller.

#7 Updated by Harald Linden over 5 years ago

Both systems are direct-wired. The one that was affected yesterday is entirely SATA; the one that showed this behaviour a few months ago is entirely SAS.

We updated the Seagate disks to firmware 0004 yesterday, because Seagate announced that update as "critical/important" without going into details. We powered the system down for this and flashed the disks in a different machine where we could connect them directly to the mainboard (the ISO supplied by Seagate doesn't support SAS controllers). I also updated the previously affected SAS-based machine, but there I did not remove the disks; instead I flashed them from a live Linux. That machine hasn't shown any problem with the new drive firmware yet.

#8 Updated by Xin Li over 5 years ago

Harald Linden wrote:

Both systems are direct-wired. The one that was affected yesterday is entirely SATA; the one that showed this behaviour a few months ago is entirely SAS.

We updated the Seagate disks to firmware 0004 yesterday, because Seagate announced that update as "critical/important" without going into details. We powered the system down for this and flashed the disks in a different machine where we could connect them directly to the mainboard (the ISO supplied by Seagate doesn't support SAS controllers). I also updated the previously affected SAS-based machine, but there I did not remove the disks; instead I flashed them from a live Linux. That machine hasn't shown any problem with the new drive firmware yet.

Thanks for the update. How often do you typically see the issue when there is load? If we know that the issue was resolved by a firmware update, we could probably put that in a FAQ and possibly also give smartmontools developers a heads-up so they can warn if the firmware is old.

#9 Updated by Harald Linden over 5 years ago

  • File m7_messages_redacted added

We encountered the problem again tonight; this time I have attached /var/log/messages below. So the firmware upgrade on the disks doesn't seem to help. I have once again checked configuration differences between the machine that encountered the error a few months ago (and never since) and this one, and it turns out that smartd is not running on the former. Our course of action will be:

  • turn off smartd on this machine as well (see the sketch after this list)
  • replace Seagate disks with HGST disks
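
(On a plain FreeBSD system, the smartd part would look roughly like the sketch below; on FreeNAS the S.M.A.R.T. service is normally toggled through the GUI, so treat this as illustrative only.)

# Disable and stop smartd (stock FreeBSD)
sysrc smartd_enable=NO
service smartd stop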

#10 Updated by Harald Linden over 5 years ago

OK, without smartd we're still seeing the error, though a lot less frequently:

Apr 30 16:19:27 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
Apr 30 16:19:27 cik-dc-m7 mps0: Reinitializing controller,
Apr 30 16:19:27 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
Apr 30 16:19:27 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Apr 30 16:19:27 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3
Apr 30 17:12:13 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
Apr 30 17:12:13 cik-dc-m7 mps0: Reinitializing controller,
Apr 30 17:12:13 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
Apr 30 17:12:13 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Apr 30 17:12:13 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3
May 1 00:00:00 cik-dc-m7 syslog-ng[2803]: Configuration reload request received, reloading configuration;
May 1 02:23:55 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
May 1 02:23:55 cik-dc-m7 mps0: Reinitializing controller,
May 1 02:23:55 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
May 1 02:23:55 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May 1 02:23:55 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3
May 1 08:36:16 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
May 1 08:36:16 cik-dc-m7 mps0: Reinitializing controller,
May 1 08:36:16 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
May 1 08:36:16 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May 1 08:36:16 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3
May 1 08:55:14 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
May 1 08:55:14 cik-dc-m7 mps0: Reinitializing controller,
May 1 08:55:14 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
May 1 08:55:14 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May 1 08:55:14 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3
May 1 12:57:04 cik-dc-m7 mps0: IOC Fault 0x40000d04, Resetting
May 1 12:57:04 cik-dc-m7 mps0: Reinitializing controller,
May 1 12:57:04 cik-dc-m7 mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd
May 1 12:57:04 cik-dc-m7 mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May 1 12:57:04 cik-dc-m7 mps0: mps_reinit finished sc 0xffffff8000aba000 post 4 free 3

#11 Updated by Jordan Hubbard over 5 years ago

Is there some reason we are hanging on to what may be a hardware malfunction in the FUTURE milestone? It's not clear to me that there's anything we can do in the face of what may be a bad card, bad cabling, a bad enclosure, or some other short-term configuration problem, whereas bugs in FUTURE are meant for long-term resolution, where we anticipate that some day we may have an answer.

#12 Updated by Harald Linden over 5 years ago

This is somewhat solved:

The problem remains unreproducible on the same hardware under Linux: no load scenario run under Linux 3.19 showed any trouble whatsoever with the controller.
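
(A typical load scenario of this kind -- sustained parallel reads across all drives -- might look like the sketch below; device names are Linux examples:)

# Generate sustained read load on every drive simultaneously
for d in /dev/sd?; do dd if=$d of=/dev/null bs=1M & done; wait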

However, it seems we have something of a workaround: we have turned off everything in the machine that we do not need (serial ports, additional NICs, the onboard SATA controller, etc.). So far, we have not seen the error again. It seems that back in November 2014, with the SATA-based machine, we did the same thing, and we haven't had any problems there either. Not really a proper solution, and I'm not happy with it, but well.
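
For reference, besides disabling devices in the BIOS, FreeBSD can be told to skip probing unused hardware via /boot/device.hints (a sketch; the unit numbers are examples and depend on the machine):

# /boot/device.hints -- skip probing the serial ports
hint.uart.0.disabled="1"
hint.uart.1.disabled="1"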

#13 Updated by Josh Paetzel over 5 years ago

  • Status changed from Screened to Closed: Cannot reproduce
  • Seen in changed from to 9.3-RELEASE

#14 Updated by Kris Moore about 3 years ago

  • Target version changed from 49 to N/A

#15 Updated by Dru Lavigne almost 3 years ago

  • File deleted (m7_messages_redacted)
