Feature #286

GUI for ZFS pool scrubbing: configurable frequency and archived results / error trends

Added by Jason L over 10 years ago. Updated about 8 years ago.

Status: Closed
Priority: Important
Assignee: Josh Paetzel
Category: Middleware
Target version: -
Estimated time:
Severity: New
Reason for Closing:
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:

Description

(Enhancement request to maximize zpool reliability/availability)

Automatic, periodic ZFS pool scrubbing with results collection, reporting, and alerts.

It would be great to have functionality in the GUI to streamline the setting of weekly or monthly zpool scrubs, collect and archive scrub results, and include the last few scrub results in periodic emails.
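
As a rough illustration of the "collect and archive" part of this request, the sketch below saves the `zpool status -v` report for one pool into a timestamped file. The function name `archive_scrub_result`, the archive directory and the file naming are assumptions made up for this example, not anything FreeNAS ships.

```shell
# Assumption: zpool lives at /sbin/zpool (override ZPOOL to change it).
ZPOOL=${ZPOOL:-/sbin/zpool}

# archive_scrub_result POOL DIR
# Hypothetical helper: store the current `zpool status -v` report for POOL
# in DIR as POOL-YYYYmmdd-HHMMSS.txt and print the resulting file name.
archive_scrub_result() {
    pool=$1
    dir=$2
    mkdir -p "$dir" || return 1
    out="$dir/$pool-$(date +%Y%m%d-%H%M%S).txt"
    $ZPOOL status -v "$pool" > "$out" || return 1
    printf '%s\n' "$out"
}
```

Calling something like `archive_scrub_result tank /var/log/scrub-archive` after each scrub would leave one file per run, and the newest few could then be pasted into a periodic email or compared by hand to spot error trends.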

A reasonable default would be to set a weekly zpool scrub at e.g. 3am Sunday for all zpools upon creation.

Some background from the ZFS Best Practices wiki:

"Run zpool scrub on a regular basis to identify data integrity problems. If you have consumer-quality drives, consider a weekly scrubbing schedule. If you have datacenter-quality drives, consider a monthly scrubbing schedule. You should also run a scrub prior to replacing devices or temporarily reducing a pool's redundancy to ensure that all devices are currently operational."
-- http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

The implicit rationale for periodic scrubbing is to proactively detect bit errors and stay ahead of drive failures. Having recently completed a scrub maximizes the probability that the data you have on disk is correct, since if/when you enter degraded mode you can no longer correct latent errors.

History

#1 Updated by Jason L over 10 years ago

Hmm... I just noticed that there is monthly zfs scrubbing configured in /etc/periodic.conf on the default FreeNAS installation... Although, I don't see any references to 'periodic' having been run by cron on my system... I fear I'm missing something and off-base on the original enhancement request... If it's actually running under the covers, then that would remove a lot of the urgency from this enhancement request, although defaulting to weekly would still be highly desirable.

#2 Updated by Jason L over 10 years ago

Ugh. I had misconfigured email on my box. I see that periodic is running and the next zpool scrub will be in 16 days. Very sorry for my confusion!

Renaming this enhancement request to make clear that it is asking for a GUI to set the scrub threshold (weekly vs. monthly) and to show details of past runs to look for error trends.
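
For reference, stock FreeBSD controls the 800.scrub-zfs periodic script through variables in /etc/periodic.conf. Assuming FreeNAS honours the same variables (this ticket does not confirm that), a weekly threshold might look like the fragment below. The values are illustrative and the pool name "tank" is hypothetical.

```shell
# /etc/periodic.conf fragment (sh syntax); values are examples only
daily_scrub_zfs_enable="YES"            # let periodic(8) run 800.scrub-zfs
daily_scrub_zfs_pools=""                # empty string selects all pools
daily_scrub_zfs_default_threshold="7"   # minimum days between scrubs
daily_scrub_zfs_tank_threshold="30"     # per-pool override for a pool named "tank" (hypothetical)
```

Note that the threshold only sets the minimum number of days between scrubs. The scrub still starts from the daily periodic run, so this does not pick a weekday or a time of day.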

#3 Updated by Josh Paetzel over 10 years ago

This ticket is very valid. As you've noticed there is a sort of undocumented scrub that takes place behind the scenes, but it would be desirable to have a scrubbing "service" much like the periodic snapshot service that makes this a more GUI orientated task.

#4 Updated by Ryan - over 10 years ago

++

#5 Updated by Xin Li about 10 years ago

++

#6 Updated by blouis79 - about 10 years ago

What is especially tricky about the current scrub is that based on the script in "/etc/periodic/daily/800.scrub-zfs", it will run when it wants to about every 30 days and clog up the system without warning. Mine was unusable - AFP could not load a directory and I had no idea a scrub was running since the GUI only mentions manual scrubs and documentation does not mention automatic scrubs.

Prefer scrubs to be run in the early hours. And 28 days could be a friendlier default than 30; otherwise, a monthly scrub could run on the first or last day of the month, so that at least the date is known.

#7 Updated by Neil MacLeod about 10 years ago

Or schedule your own scrub using cron which - if run more frequently than every 30 days - will override the automatic periodic scrub.

The following is what I use, every Sunday at 3am. It sends a status email when the scrub is complete (it will scrub all available pools, in sequence, sending a single email at the end).

#!/bin/bash
# 
#VERSION: 0.1
#AUTHOR: Milhouse
#DESCRIPTION: Created on FreeNAS 0.7RC1 (Sardaukar), works also on FreeNAS 8.0.1+
# This script will start a scrub on each ZFS pool (one at a time) and
# will send an e-mail or display the result when everything is completed.
#
#CHANGELOG
# 

# e-mail variables
FROM=from.email@domain.com
TO=to.email@domain.com

SUBJECT="$(hostname): ZFS Scrub results"
BODY="" 

# arguments
VERBOSE=0
SENDEMAIL=1
# walk the arguments one at a time ("$*" would join them into a single word)
for arg in "$@"; do
    case $arg in
        "-v" | "--verbose")
            VERBOSE=1
            ;;
        "-n" | "--noemail")
            SENDEMAIL=0
            ;;
        "-a" | "--author")
            echo "by gimpe at hype-o-thetic.com" 
            exit
            ;;
        "-h" | "--help" | *)
            echo " 
usage: $0 [-v --verbose|-n --noemail]
    -v --verbose    output display
    -n --noemail    don't send an e-mail with result
    -a --author     display author info (by gimpe at hype-o-thetic.com)
    -h --help       display this help
" 
            exit
            ;;
    esac
done

# work variables
ERROR=0
RUNNING=1
SEP="-------------------------------------------------------------" 

# commands & configuration
ZPOOL=/sbin/zpool
PRINTF=/usr/bin/printf
MSMTP=/usr/local/bin/msmtp
MSMTPCONF=/var/etc/msmtp.conf

# print a log message
function _log {
# add message to e-mail body
    BODY="${BODY}$1\n" 
# output to console if verbose mode
    [ $VERBOSE = 1 ] && echo "$1" 
}

# find all pools
pools=$($ZPOOL list -H -o name)

# for each pool
for pool in $pools; do
    # start scrub for $pool
    _log "$(date +"%Y-%m-%d %H:%M:%S"): Starting scrub on $pool"
    $ZPOOL scrub $pool
    RUNNING=1
    # wait until scrub for $pool has finished running
    while [ $RUNNING = 1 ]; do
        # still running?
        if $ZPOOL status -v $pool | grep -q "scrub in progress"; then
            sleep 60
        # not running
        else
            # finished with this pool, exit
            RUNNING=0
            _log "$(date +"%Y-%m-%d %H:%M:%S"): Finished scrub on $pool"
            _log
            _log "$($ZPOOL status -v $pool)"
            # check for errors
            if ! $ZPOOL status -v $pool | grep -q "No known data errors"; then
                _log
                _log "*** DATA ERRORS DETECTED ON $pool ***" 
                ERROR=1
            fi
            _log
            _log "$SEP" 
        fi
    done
done

# change e-mail subject if there was error
if [ $ERROR = 1 ]; then
    SUBJECT="${SUBJECT}: ERROR(S) DETECTED" 
fi

# send e-mail
if [ $SENDEMAIL = 1 ]; then
    [ -f $MSMTP ] && $PRINTF "From:$FROM\nTo:$TO\nSubject:$SUBJECT\n\n$BODY" | $MSMTP --file=$MSMTPCONF -t || $PRINTF "$BODY" | mail -s "$SUBJECT" $TO
fi

#8 Updated by Xin Li about 10 years ago

Replying to MilhouseVH (comment 7):

It seems that the script effectively turns scrubbing into a sequential operation, which might not be optimal. Also, I think detection of data errors should be done at runtime (i.e. in the alert system) rather than just during scrubbing.

#9 Updated by Neil MacLeod about 10 years ago

Replying to delphij (comment 8):

> It seems that the script effectively turns scrubbing into a sequential operation, which might not be optimal. Also, I think detection of data errors should be done at runtime (i.e. in the alert system) rather than just during scrubbing.

True, it does - but would multiple scrubs running in parallel complete any faster than the same scrubs run in sequence? Genuine question, as I don't know - I'd suspect there's a good chance the parallel scrubs would then become IO bound, if not also CPU bound, and thus highly detrimental to any users of the system. Alternatively it wouldn't be difficult to modify the script to take a pool name argument, so that separate scrubs could be scheduled at different times.
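
The pool-name idea could look roughly like the sketch below, which scrubs one named pool and waits for it to finish, so that separate cron entries can handle separate pools. `scrub_one_pool` and `POLL_SECONDS` are names invented here; the "scrub in progress" match is the same one the script in comment 7 uses.

```shell
# Assumption: zpool lives at /sbin/zpool (override ZPOOL to change it).
ZPOOL=${ZPOOL:-/sbin/zpool}
POLL_SECONDS=${POLL_SECONDS:-60}

# scrub_one_pool POOL
# Hypothetical helper: start a scrub on POOL, poll until it is no longer
# reported as in progress, then print the final status report.
scrub_one_pool() {
    pool=$1
    [ -n "$pool" ] || { echo "usage: scrub_one_pool POOL" >&2; return 2; }
    $ZPOOL scrub "$pool" || return 1
    while $ZPOOL status "$pool" | grep -q "scrub in progress"; do
        sleep "$POLL_SECONDS"
    done
    $ZPOOL status -v "$pool"
}
```

Wrapped in a small script, this would let one cron entry scrub a pool named, say, `tank` on one night and another entry scrub a second pool on a different night.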

Even so, having a long running sequential scrub kick off predictably every weekend (or perhaps late on a Friday evening after the "office" has closed, if time is tight) is surely preferable to having scrubs start on a semi-random day at 3am and thus be likely to run into business hours? This is the biggest drawback with the current periodic solution, which this script - any script - would mitigate.

As for error detection and general health of the pool, I agree it should be done by the alert system (which unfortunately isn't at all reliable right now - see the forum posts on disk failures that are not being detected in 8.0.1-Release, a separate issue to the email notification system not working).

As it is I've got additional scripts which run several times a day to help me identify potential data error and health issues - I take a belt & braces approach. Most of these scripts were used with FreeNAS 0.7 and have been updated to work in FN8.

#10 Updated by Xin Li about 10 years ago

Replying to MilhouseVH (comment 9):

> > It seems that the script effectively turns scrubbing into a sequential operation, which might not be optimal. Also, I think detection of data errors should be done at runtime (i.e. in the alert system) rather than just during scrubbing.

> True, it does - but would multiple scrubs running in parallel complete any faster than the same scrubs run in sequence? Genuine question, as I don't know - I'd suspect there's a good chance the parallel scrubs would then become IO bound, if not also CPU bound, and thus highly detrimental to any users of the system. Alternatively it wouldn't be difficult to modify the script to take a pool name argument, so that separate scrubs could be scheduled at different times.

It will be faster, since a scrub is not CPU bound but disk-bandwidth bound. Even when we fire up more scrub processes, they are still unlikely to eat up the bandwidth on the PCIe bus, because disks are still much slower than the system bus.
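
Because `zpool scrub` returns as soon as the scrub has been started, running the scrubs in parallel mostly means separating the "start" loop from the "wait" loop. The following is only a sketch of that idea; `scrub_all_parallel` and `POLL_SECONDS` are hypothetical names, and the progress check reuses the "scrub in progress" text from the script in comment 7.

```shell
# Assumption: zpool lives at /sbin/zpool (override ZPOOL to change it).
ZPOOL=${ZPOOL:-/sbin/zpool}
POLL_SECONDS=${POLL_SECONDS:-60}

# Hypothetical helper: start a scrub on every pool, then wait for all of them.
scrub_all_parallel() {
    pools=$($ZPOOL list -H -o name)
    # start every scrub first; each call returns immediately
    for pool in $pools; do
        $ZPOOL scrub "$pool"
    done
    # `zpool status` without a pool name reports on all pools at once
    while $ZPOOL status | grep -q "scrub in progress"; do
        sleep "$POLL_SECONDS"
    done
}
```

Whether this finishes sooner than the sequential version on a given box depends on how the pools share disks and controllers, which is the open question in this exchange.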

> Even so, having a long running sequential scrub kick off predictably every weekend (or perhaps late on a Friday evening after the "office" has closed, if time is tight) is surely preferable to having scrubs start on a semi-random day at 3am and thus be likely to run into business hours? This is the biggest drawback with the current periodic solution, which this script - any script - would mitigate.

Applying s/daily/weekly/g to 800.scrub-zfs and moving it to /etc/periodic/weekly/ would give the predictability without any new code. This is not a "not invented here" argument -- 800.scrub-zfs does take many other things into consideration, like the interval between scrubs, and has better integration with system reports.
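
Spelled out as commands, that suggestion might look like the sketch below. The function name `move_scrub_to_weekly` and its optional directory argument are made up for illustration; the argument exists only so the steps can be tried on a copy of /etc/periodic first.

```shell
# move_scrub_to_weekly [PERIODIC_DIR]
# Hypothetical helper; PERIODIC_DIR defaults to /etc/periodic.
move_scrub_to_weekly() {
    base=${1:-/etc/periodic}
    src="$base/daily/800.scrub-zfs"
    dst="$base/weekly/800.scrub-zfs"
    [ -f "$src" ] || return 1
    # rewrite every "daily" to "weekly" while copying into the weekly directory
    sed 's/daily/weekly/g' "$src" > "$dst" || return 1
    chmod 755 "$dst"
    # remove the daily copy so the scrub is not run from both places
    rm "$src"
}
```

One side effect to keep in mind: the substitution also renames any `daily_scrub_zfs_*` variables the script reads, so the matching settings in periodic.conf would presumably need the `weekly_` prefix as well.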

> As for error detection and general health of the pool, I agree it should be done by the alert system (which unfortunately isn't at all reliable right now - see the forum posts on disk failures that are not being detected in 8.0.1-Release, a separate issue to the email notification system not working).

We need to fix the alert system for sure.

> As it is I've got additional scripts which run several times a day to help me identify potential data error and health issues - I take a belt & braces approach. Most of these scripts were used with FreeNAS 0.7 and have been updated to work in FN8.

#11 Updated by Neil MacLeod about 10 years ago

Replying to delphij (comment 10):

> Applying s/daily/weekly/g to 800.scrub-zfs and moving it to /etc/periodic/weekly/ would give the predictability without any new code. This is not a "not invented here" argument -- 800.scrub-zfs does take many other things into consideration, like the interval between scrubs, and has better integration with system reports.

Don't get me wrong, I actually think having the periodic scrub after 30 days is a good thing, as it's better than nothing at all for those who aren't familiar with ZFS and who may otherwise never scrub at all.

But how would I be able to configure a long-running scrub to start at 9pm on every Friday night using the daily or weekly periodic systems? Moving 800.scrub-zfs to weekly would ensure the scrub started on a weekend (Saturday@04:15) but this may still not be the right solution in the longer term, particularly for users who want the scrub to start at other times/days.
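
For comparison, plain cron expresses "9pm every Friday" directly. The small sketch below just prints a crontab line for a given hour and weekday; `scrub_cron_entry` and the script path `/root/zfs_scrub.sh` are hypothetical.

```shell
# scrub_cron_entry HOUR WEEKDAY COMMAND
# Prints a user crontab(5) line: minute hour day-of-month month day-of-week command.
# WEEKDAY follows cron numbering: 0 = Sunday ... 5 = Friday, 6 = Saturday.
scrub_cron_entry() {
    printf '0 %s * * %s %s\n' "$1" "$2" "$3"
}

# 21:00 every Friday; /root/zfs_scrub.sh is a hypothetical script path
scrub_cron_entry 21 5 /root/zfs_scrub.sh
```

The printed line is in the format used by a per-user crontab (edited with `crontab -e`); the system-wide /etc/crontab takes an extra user field before the command.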

Perhaps a system of pre-canned "system tasks" (scrubs, health checks etc.) that can be scheduled via cron would be an option?

#12 Updated by Anonymous almost 10 years ago

There are some other suggestions that may or may not be useful in ticket #1054.

#13 Updated by William Grzybowski almost 10 years ago

Scrub GUI added in r10043, r10042, r10041, r10040, r10039, r10036 and r10035

#14 Updated by Josh Paetzel about 8 years ago

  • Status changed from Unscreened to Closed

This was added long ago. Closing it out.
