Bug #28235

Bump default number of chain frames for mps(4) and mpr(4)

Added by Jamie McParland about 1 year ago. Updated about 1 year ago.

Status: Done
Priority: Important
Assignee: Alexander Motin
Category: OS
Target version:
Seen in:
Severity: Medium
Reason for Closing:
Reason for Blocked: Need verification
Needs QA: No
Needs Doc: No
Needs Merging: No
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

I've been working with Michael Dexter and he suggested I put in a bug ticket.

We have a few FreeNAS boxes here at the school district. Since upgrading to version 11, we've been seeing the following errors:
Device: /dev/da38, failed to read SMART values
Device: /dev/da44, failed to read SMART values
Device: /dev/da40, failed to read SMART values
Device: /dev/da53, failed to read SMART values

We're also seeing these in the GUI.

But as you can see here, when I run smartctl -a /dev/da40, for example, SMART reports that everything is OK.
https://pastebin.com/raw/xb5vdeyf

We've updated to FreeNAS-11.1-U1 but are still having the same issue.
This is happening on all three of our FreeNAS-11.1-U1 machines, but not on any of our 9.x machines.

After a reboot the error often clears, but it comes back later, complaining about a different set of drives.
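
Because the affected drives change from one occurrence to the next, it can help to tally which devices are named in the messages over time. The following is only a minimal sketch: it assumes the log lines have the same form as the four quoted above, and the function name `failing_devices` is made up for this illustration.

```python
import re
from collections import Counter

# Matches messages of the form quoted in this ticket, e.g.
# "Device: /dev/da38, failed to read SMART values"
SMART_FAIL = re.compile(r"Device: (/dev/\w+), failed to read SMART values")


def failing_devices(lines):
    """Count how often each device is named in a SMART read failure."""
    counts = Counter()
    for line in lines:
        match = SMART_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The four messages from the description above.
sample = [
    "Device: /dev/da38, failed to read SMART values",
    "Device: /dev/da44, failed to read SMART values",
    "Device: /dev/da40, failed to read SMART values",
    "Device: /dev/da53, failed to read SMART values",
]
counts = failing_devices(sample)
for device in sorted(counts):
    print(device, counts[device])
```

If the same few devices dominate the tally, a drive or cabling fault is more likely; an even spread across many drives points toward the controller or driver instead.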

Lastly, I saw someone on the forums having the same issue.
https://forums.freenas.org/index.php?threads/read-smart-self-test-log-failed.61359/#post-436173


Related issues

Related to FreeNAS - Bug #28201: Fix queue length reporting in mps(4) and mpr(4) (Done, 2018-02-05)

Associated revisions

Revision 6e4c786b (diff)
Added by Alexander Motin about 1 year ago

Bump default number of chain frames for mps(4) and mpr(4).

The original value was heavily underestimated, that caused warning
messages in logs, some random SMART errors and reduced performance.
This is a temporary local hack for us, since more universal I am
working on will take more time, while won't give us much benefits,
comparing to this quick hack.

Ticket: #28235

Revision e5304e8d (diff)
Added by Alexander Motin about 1 year ago

Bump default number of chain frames for mps(4) and mpr(4).

The original value was heavily underestimated, that caused warning
messages in logs, some random SMART errors and reduced performance.
This is a temporary local hack for us, since more universal I am
working on will take more time, while won't give us much benefits,
comparing to this quick hack.

Ticket: #28235
(cherry picked from commit 6e4c786be4208f49ff4cf44e2f469ad7d4f537ab)

Revision 203ec4f9 (diff)
Added by Alexander Motin about 1 year ago

Bump default number of chain frames for mps(4) and mpr(4).

The original value was heavily underestimated, that caused warning
messages in logs, some random SMART errors and reduced performance.
This is a temporary local hack for us, since more universal I am
working on will take more time, while won't give us much benefits,
comparing to this quick hack.

Ticket: #28235

History

#1 Updated by Dru Lavigne about 1 year ago

  • Private changed from No to Yes
  • Seen in changed from 11.1-U1 to 11.1-U1

Jamie: please attach a debug (System -> Advanced -> Save Debug).

#2 Updated by Jamie McParland about 1 year ago

  • File debug-san3-20180207115251.tgz added
  • File debug-san2-20180207115130.tgz added

I'm not sure if it's related, but on two of our three 11.x systems I'm also getting this error:
mps0: Out of chain frames, consider increasing hw.mps.max_chains.

Working with Michael Dexter, we've raised hw.mps.max_chains on SAN3 a little at a time. We're currently at 8096 but are still getting the error.
I've left that setting alone on SAN2.

Looking through our syslog, I noticed these errors started within a day or so of installing 11.x:

Installed - 11.1 Release 12-22-2017
2017-12-24T01:56:27-08:00 san2 mps0: Out of chain frames, consider increasing hw.mps.max_chains.

Installed - 11.1 Release 12-22-2017
2017-12-24T03:34:41-08:00 san3 mps0: Out of chain frames, consider increasing hw.mps.max_chains.

On SAN3 I installed 11.1-U1 on 01/23/2018, but the hw.mps.max_chains warning is still appearing.

We have another box called ipcamsan, which has the same "failed to read SMART values" issue, but we're NOT seeing the hw.mps.max_chains warning on that box.
The only real difference between san2, san3, and ipcamsan is that ipcamsan has only one HBA. The other two boxes have more than one.
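
To compare when the warning first appeared on each box, the earliest "Out of chain frames" entry per host and controller can be pulled from syslog. This is a rough sketch rather than a tested tool: it assumes every line starts with an ISO 8601 timestamp and a hostname, as in the two entries quoted above. The function name `first_chain_warnings` is made up for this illustration, and matching mpr(4) controllers as well is an assumption.

```python
import re
from datetime import datetime

# "<ISO timestamp> <host> <controller>: Out of chain frames, ..."
# Matching mprN as well as mpsN is an assumption; only mps0 appears in this ticket.
CHAIN_MSG = re.compile(r"^(\S+) (\S+) (mp[sr]\d+): Out of chain frames")


def first_chain_warnings(lines):
    """Return {(host, controller): datetime of the earliest warning}."""
    first = {}
    for line in lines:
        match = CHAIN_MSG.match(line)
        if not match:
            continue
        stamp = datetime.fromisoformat(match.group(1))
        key = (match.group(2), match.group(3))
        if key not in first or stamp < first[key]:
            first[key] = stamp
    return first


# The two syslog entries quoted in this comment.
sample = [
    "2017-12-24T01:56:27-08:00 san2 mps0: Out of chain frames, consider increasing hw.mps.max_chains.",
    "2017-12-24T03:34:41-08:00 san3 mps0: Out of chain frames, consider increasing hw.mps.max_chains.",
]
first = first_chain_warnings(sample)
for (host, controller), stamp in sorted(first.items()):
    print(host, controller, stamp.isoformat())
```

Setting each first-seen time against the install date for that box shows how quickly the warning followed the upgrade.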

#3 Updated by Dru Lavigne about 1 year ago

  • Assignee changed from Release Council to Alexander Motin
  • Reason for Blocked set to Need verification

Starting with Alexander to see if it is driver related (which was fixed for U2) or different than any of the several open SMART tickets.

#4 Updated by Alexander Motin about 1 year ago

  • Related to Bug #28201: Fix queue length reporting in mps(4) and mpr(4) added

#5 Updated by Alexander Motin about 1 year ago

  • Category changed from Middleware to OS
  • Status changed from Not Started to In Progress
  • Priority changed from No priority to Important
  • Target version set to 11.1-U2
  • Severity changed from High to Medium
  • Needs Doc changed from Yes to No

This problem does sound like it is caused by transient I/O errors, and there is a good chance they are triggered by the issue in #28201. "Out of chain frames" may also be a cause. I see an obvious issue there too, but I am still trying to work out what people were thinking when they tuned it the way it is now.

#6 Updated by Alexander Motin about 1 year ago

  • Subject changed from failed to read SMART values to Bump default number of chain frames for mps(4) and mpr(4)
  • Status changed from In Progress to Done
  • Needs QA changed from Yes to No
  • Needs Merging changed from Yes to No

I've pushed the quick fix; a proper, more universal one will probably come at some point later.

#7 Updated by Dru Lavigne about 1 year ago

  • File deleted (debug-san2-20180207115130.tgz)

#8 Updated by Dru Lavigne about 1 year ago

  • File deleted (debug-san3-20180207115251.tgz)

#9 Updated by Dru Lavigne about 1 year ago

  • Private changed from Yes to No

#10 Updated by Michael Dexter about 1 year ago

Related forum post:

https://forums.freenas.org/index.php?threads/mps-lsi-hw-mps-max_chains.23067/#post-139146

Related FreeBSD mailing list post:

https://lists.freebsd.org/pipermail/freebsd-stable/2016-March/084316.html

Both systems with this symptom have had multiple pools.

#11 Updated by Jamie McParland about 1 year ago

Michael Dexter wrote:

Related forum post:

https://forums.freenas.org/index.php?threads/mps-lsi-hw-mps-max_chains.23067/#post-139146

Related FreeBSD mailing list post:

https://lists.freebsd.org/pipermail/freebsd-stable/2016-March/084316.html

Both systems with this symptom have had multiple pools.

Alexander Motin wrote:

I've pushed the quick fix; a proper, more universal one will probably come at some point later.

It's been 24 hours since I updated to FreeNAS-11.1-U2 and I haven't seen any more warnings, so I think this is solved. Thanks so much!