Bug #25835

Red critical alert even though disk drive has been replaced

Added by Robert Pierce about 3 years ago. Updated almost 3 years ago.

Status:
Closed: Behaves correctly
Priority:
No priority
Assignee:
Alexander Motin
Category:
GUI (new)
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

Supermicro motherboard, 256 GB memory, 192 TB raw disk, 12 TB read cache, 1 TB write cache, AIC 60-drive box installed last Feb.

ChangeLog Required:
No

Description

I am receiving a red-light critical alert saying I have a failed disk drive. The disk drive was replaced last Tuesday at 9 PM and resilvering completed last Sunday at 3 PM. All disk drives are showing OK now. Why am I still getting this alert? I use the old interface because the new one does not have the information I need.

Freenas Critical Bug.docx (113 KB) - Screenshots of message and Volume - Robert Pierce, 09/11/2017 08:31 AM

History

#1 Updated by Dru Lavigne about 3 years ago

  • Status changed from Unscreened to 15

Robert: please also attach a debug (System -> Advanced -> Save Debug). We'll mark the ticket private until a dev has a chance to review it.

#2 Updated by Robert Pierce about 3 years ago

  • File debug-DALPNASPROD1-20170911131159.tgz added

Debug attached. I neglected to mention that previously I was receiving an error message because I had used more than 80% of the volume. When I cleared some data off the volume, the message went away.
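
For reference, that capacity alert is driven by pool usage. A quick way to check it from the shell, assuming the pool is named Volume1 as elsewhere in this ticket:

    # Overall pool size, allocation, and capacity percentage
    zpool list Volume1

    # Per-dataset space breakdown within the pool
    zfs list -o space -r Volume1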

#3 Updated by Dru Lavigne about 3 years ago

  • Status changed from 15 to Unscreened
  • Assignee changed from Release Council to William Grzybowski
  • Private changed from No to Yes

#4 Updated by William Grzybowski about 3 years ago

  • Assignee changed from William Grzybowski to Alexander Motin

Sasha, any idea why zpool status is showing this message even though there are no read/write/cksum errors on any of the disks?

#5 Updated by Alexander Motin about 3 years ago

  • Status changed from Unscreened to Closed: Behaves correctly

I didn't notice it on my first look, but there is one checksum error logged against one of the drives:

gptid/216a513c-fd3b-11e6-8fd4-0cc47adf0a3e  ONLINE       0     0     1

It does not seem to be the disk you replaced: you replaced da53, while this is da51. I also see some I/O errors from da51 in the logs, so that drive may not be in perfect health either.

You can clear the counters with `zpool clear Volume1`, which should also clear the alert, then watch how the drive behaves. It may be worth checking the status periodically or running a SMART test on da51.
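
As a concrete sketch of those steps, assuming the pool name Volume1 and device da51 from this ticket (the gptid prefix is taken from the status line above):

    # Map the gptid from zpool status back to a device name (FreeBSD/FreeNAS)
    glabel status | grep 216a513c

    # Clear the logged error counters, which also clears the alert
    zpool clear Volume1

    # Confirm the counters are back to zero
    zpool status Volume1

    # Run a short SMART self-test on the suspect drive, then review the results
    smartctl -t short /dev/da51
    smartctl -a /dev/da51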

#6 Updated by Robert Pierce about 3 years ago

Thanks for the help. I did notice da51 resilvering at the same time as da53 but did not know why. I have 48 drives in this box, grouped into 6 groups of 8, with each group being a RAID-Z2 array. When I replaced da53 the resilvering process ran for about 5 days and resilvered all 143 TB.

Why didn't it just resilver the one group that da53 was in?

All appears OK after zpool clear Volume1. I ran smartctl and the results look OK.

Do hope you can answer my question above.
Thanks
RP

#7 Updated by Alexander Motin about 3 years ago

Robert Pierce wrote:

Why didn't it just resilver the one group that da53 was in?

Because that is how ZFS resilvering works: it has to traverse all of the metadata in the pool to verify the data checksums. Data blocks in ZFS have no back references to the files that reference them, so there is no way to limit the traversal to a single RAID-Z group.
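
You can see this pool-wide traversal in practice: the scan progress reported during a resilver counts data across the entire pool, not just the affected group. A minimal way to watch it, using the pool name from this ticket:

    # "scan: resilver in progress" reports bytes scanned across the whole pool
    zpool status -v Volume1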

#8 Updated by Dru Lavigne almost 3 years ago

  • File deleted (debug-DALPNASPROD1-20170911131159.tgz)

#9 Updated by Dru Lavigne almost 3 years ago

  • Target version set to N/A
  • Private changed from Yes to No
