Feature #25182

Add Offline button for faulted devices

Added by Stuart Espey over 1 year ago. Updated 12 months ago.

Status: Resolved
Priority: Important
Assignee: William Grzybowski
Category: GUI (new)
Target version:
Estimated time:
Sprint:
Severity: New
Backlog Priority:
Reason for Closing:
Reason for Blocked:
Needs QA: No
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:

Description

Tested with 11.0-U1

In the Storage Volumes tab, if you click on the volume and then the Volume Status button, you can see all the component devices that make up each vdev.

If you click on a component device, you can then click Offline.

But if the component device is faulted, you cannot.

You should be able to.

The reason is that if you wish to replace a faulted device with itself, you have to offline the device, quick-wipe it, and then replace the offlined device with the now-wiped drive. You might do this if you have run a long test over the drive, the drive appears okay, and you suspect the issue that caused the fault was not due to the device per se; either way, you want to retry the device.

Without the ability to offline a faulted device in the GUI, you need to offline it via the CLI. The rest of the steps can be done in the UI. Without offlining, you can't wipe the drive without unmounting the pool.
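For reference, the manual workaround looks roughly like this (a minimal sketch; the gptid is the faulted member from the zpool status output below, and da4 is an assumed device node for that disk, so double-check it before wiping):

# offline the faulted member (currently only possible from the CLI)
root@rhea:~ # zpool offline tank gptid/e3e200ca-86f2-11e6-bb77-001cc0071f3f
# quick-wipe the start of the disk (assumes the faulted disk is da4)
root@rhea:~ # dd if=/dev/zero of=/dev/da4 bs=1m count=32
# the replace step itself can then be done from the GUI (Volume Status -> Replace)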

An example of this situation:

root@rhea:~ # zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 88K in 5h11m with 0 errors on Sun Jun 18 05:11:13 2017
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/e16ecdcb-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/e24c858c-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/e3250ec0-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/e3e200ca-86f2-11e6-bb77-001cc0071f3f  FAULTED      0   123     0  too many errors
        gptid/e4b6d326-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0

errors: No known data errors

Because I can't offline the faulted device via the UI, I have to do it via the CLI:

root@rhea:~ # zpool offline tank gptid/e3e200ca-86f2-11e6-bb77-001cc0071f3f

You can see the device has been successfully offlined:

root@rhea:~ # zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 88K in 5h11m with 0 errors on Sun Jun 18 05:11:13 2017
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/e16ecdcb-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/e24c858c-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/e3250ec0-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        2223584477133409854                         OFFLINE      0   123     0  was /dev/gptid/e3e200ca-86f2-11e6-bb77-001cc0071f3f
        gptid/e4b6d326-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0

errors: No known data errors

And now, when I wipe/replace via the GUI, it begins resilvering.

pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 18 13:44:02 2017
        145M scanned out of 8.74T at 7.62M/s, 334h12m to go
        27.0M resilvered, 0.00% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/e16ecdcb-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/e24c858c-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/e3250ec0-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0
        gptid/4dfc43ea-6b6b-11e7-bea4-001cc0071f3f  ONLINE       0     0     0  (resilvering)
        gptid/e4b6d326-86f2-11e6-bb77-001cc0071f3f  ONLINE       0     0     0

I'll check the results, scrub, and keep an eye on the drive. If it faults again, I'll replace it (as it's <12 months old), but since the SMART results aren't showing any errors, I can't easily RMA it until I can prove a drive fault.
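For reference, both of those checks can also be done from the CLI (a minimal sketch; da4 is an assumed device node for the drive being watched):

# start a scrub and check its progress
root@rhea:~ # zpool scrub tank
root@rhea:~ # zpool status tank
# review the drive's SMART attributes and self-test log (assumes the drive is da4)
root@rhea:~ # smartctl -a /dev/da4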

I haven't checked whether this is supported in the new UI, but if it isn't, it should be added there too.


Related issues

Related to FreeNAS - Bug #25737: Clarify replace disk instructions in Guide (Resolved, 2017-08-31)

Associated revisions

Revision 3c4a8058 (diff)
Added by William Grzybowski over 1 year ago

fix(gui): show Offline for FAULTED disks

Ticket: #25182

Revision f594b7f6 (diff)
Added by William Grzybowski about 1 year ago

fix(gui): show Offline for FAULTED disks

Ticket: #25182

History

#1 Updated by Stuart Espey over 1 year ago

Maybe the correct course of action is to issue a "zpool clear". I'm not sure whether that would resolve the faulted state, but either way, you should be able to offline a faulted drive.
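For reference, a clear can target the whole pool or a single device (a minimal sketch using the pool and gptid from the description):

root@rhea:~ # zpool clear tank
root@rhea:~ # zpool clear tank gptid/e3e200ca-86f2-11e6-bb77-001cc0071f3f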

#2 Updated by Dru Lavigne over 1 year ago

  • Assignee changed from Release Council to William Grzybowski

William: thoughts?

#3 Updated by Stuart Espey over 1 year ago

Dru Lavigne wrote:

William: thoughts?

So, I'm working through my little disk problem, and I was able to reproduce it again. (FWIW, I believe it's a backplane issue.)

It appears the right approach to clear the faulted state is to run zpool clear. As soon as that is executed, the FAULTED device is switched to ONLINE and begins resilvering.

So, is there a way in the GUI to clear the status of a pool?

#4 Updated by William Grzybowski over 1 year ago

  • Status changed from Unscreened to Screened
  • Target version set to 11.2-BETA1

While offlining a device in that case is possible, your procedure does not make sense to me.

If such a thing happens, the correct way to repair is to scrub the pool, as opposed to offlining the disk, wiping it, and replacing it. That approach has no advantage.

zpool clear also simply masks the problem; it assumes you have actually replaced something.
I suggest you read more about ZFS to understand how it works; Oracle has good documentation.

I'll enable the Offline button for faulted disks, although that's not a recommended approach.

#5 Updated by William Grzybowski over 1 year ago

  • Status changed from Screened to Ready For Release
  • Priority changed from Expected to Important
  • Target version changed from 11.2-BETA1 to 11.1

#6 Updated by William Grzybowski over 1 year ago

  • Needs QA changed from Yes to No

#7 Updated by Dru Lavigne about 1 year ago

  • Related to Bug #25737: Clarify replace disk instructions in Guide added

#8 Updated by Dru Lavigne about 1 year ago

  • Target version changed from 11.1 to 11.1-BETA1

#9 Updated by Dru Lavigne 12 months ago

  • Status changed from Ready For Release to Resolved
