Bug #27409

Scrub problem over one particular volume

Added by Andrej Gortchivkin over 3 years ago. Updated over 3 years ago.

Status:
Closed: Duplicate
Priority:
No priority
Assignee:
Alexander Motin
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

Hi All,

I would like to report a problem related to scrubbing a ZFS volume, specifically with FreeNAS 11.1. While it took me some time to narrow down the possibilities, I'm quite confident my issue occurs only on one particular volume and only when a scrub is engaged over it.

So here are some background details on the events I've been through so far...

Last night my FN system decided to become completely unresponsive and therefore I had to reboot it, so I hit the reset button and waited for the reboot to complete. My very first problem came right at the boot sequence when I noticed that the box got stuck at the "Beginning ZFS volume imports" step. I waited a few hours before moving forward with troubleshooting, but unfortunately the system just stayed at that step and eventually started to post the following messages on the console screen:

slow spa_sync: started 1660 seconds ago, calls 108
slow spa_sync: started 1674 seconds ago, calls 109
slow spa_sync: started 1678 seconds ago, calls 110

Since the box didn't want to go anywhere from this stage, I decided to go into single-user mode and check the ZFS pools manually for any potential issues. This is where I first noticed that one particular volume wouldn't finish the manual import, so effectively the box got stuck once again and I had to reboot it manually.

I retried this procedure a few times but the result was always the same. So after thinking a bit, I decided to try rebooting the box into a previous boot environment, which in my case was FreeNAS 11.0 U4. To my great surprise the volume import step went through immediately and the boot sequence completed just as it normally would. Once I logged into the box I checked all the volumes for errors (none were found); however, I noticed that last night the scheduled monthly scrubs had been engaged over all volumes, so I started to get suspicious. While still in FreeNAS 11.0 U4, I decided to wait until the problematic volume had reached at least 50% of the task, then reboot the box back into FreeNAS 11.1 and check the volume state once again from there.

This time the boot sequence went successfully past the ZFS import steps; however, once the box was fully loaded I noticed that although the scrub task was still active, it just didn't do anything. The "zpool iostat" command didn't show any data traffic for the problematic volume, so the scrub was just hanging there, effectively making the volume extremely unresponsive. My other volumes were doing their scrubs in the meantime without any issues.
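
(For reference, I was watching per-pool traffic with something along these lines, where "tank" is just a placeholder for the pool name and 5 is the sampling interval in seconds:)

zpool iostat -v tank 5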

At this point I decided to try stopping the active scrub task via the command line, but the command just hung there, so I once again had to reboot the box.
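
(For completeness, the cancel attempt was just the standard stop command, again with "tank" standing in for my pool name:)

zpool scrub -s tank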

While I managed to finalize the scrub task via the FreeNAS 11.0 U4 version, I now had to disable scrub for this volume in order to keep the box stable in FreeNAS 11.1. This is definitely happening only with FreeNAS 11.1 and I have absolutely no clue why it is happening this way only on one particular volume while the rest are running just fine.

My problematic volume has 8 x 15K SAS HDDs, while all other volumes are made of SATA drives, and that's the only major difference I can think of. All HDDs are connected to LSI SAS controllers, so right now I'm completely out of ideas and would need some help in resolving this issue.

Anyone else experiencing similar behavior?


Related issues

Related to FreeNAS - Bug #27666: Fixes related to jails for volume unlock (Resolved, 2018-01-06)
Related to FreeNAS - Bug #27773: Fix livelock in in sequential scrub (Resolved, 2018-01-11)
Has duplicate FreeNAS - Bug #27437: FN11.1 - sharing services become unresponsive during scrub (Closed: Duplicate, 2017-12-26)
Has duplicate FreeNAS - Bug #27358: Lockup (Closed: Duplicate, 2017-12-21)

History

#1 Updated by Andrej Gortchivkin over 3 years ago

  • File debug-sns01storvm01-20171224061959.txz added

#2 Updated by Andrej Gortchivkin over 3 years ago

  • Seen in changed from Unspecified to 11.1

#3 Updated by Sean Fagan over 3 years ago

  • Category changed from 1 to 129

#4 Updated by Sean Fagan over 3 years ago

  • Status changed from Unscreened to 15
  • Assignee changed from Release Council to Sean Fagan

Does it reproduce if you do "sysctl -w vfs.zfs.zfs_scan_legacy=1" before starting a scrub?
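
(A minimal sketch of the sequence, assuming a pool named "tank":)

sysctl -w vfs.zfs.zfs_scan_legacy=1
zpool scrub tank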

#5 Updated by Andrej Gortchivkin over 3 years ago

Yes it does.
I just tried out a manual scrub after modifying the setting you described but the scrub is simply hanging.

Apparently it scrubbed 175M in total but now it isn't progressing at all.
Unfortunately I’m not able to stop the scrub from FreeNAS 11.1 so I’ll have to make another intermediate move via FreeNAS 11.0 U4.

#6 Updated by Andrej Gortchivkin over 3 years ago

Also I can confirm that canceling the scrub via FreeNAS 11.0 U4 restores functionality for the volume in FreeNAS 11.1.

#7 Updated by Sean Fagan over 3 years ago

  • Assignee changed from Sean Fagan to Alexander Motin

Thoughts, Alexander? This still, I think, points a finger at the new scrub code -- even with the legacy option, it's still using a scrub task.

#8 Updated by Sean Fagan over 3 years ago

The other one to try, perhaps, would be vfs.zfs.trim.enabled=0?

#9 Updated by Andrej Gortchivkin over 3 years ago

Well... I could give this a try of course, although I'm not quite sure how "trim" relates to scrubbing in this particular issue. Furthermore, I believe "trim" existed in previous FN versions where I never had a scrub get stuck. I've been upgrading my system regularly with new releases and I never experienced such an issue with any of the 11.0 versions. I'm using the default scrubbing settings for each of the pools and those were working just fine until FN 11.1.

The strangest thing here is that this happens only with my zpool containing SAS drives, while the SATA pools (both HDD and SSD ones) still scrub perfectly with FN 11.1.

#10 Updated by Sean Fagan over 3 years ago

There's new scrub code now. It looks like there's a deadlock -- this appears to be related to cleaning up blocks that have been released, and the trim code path and the scrub code path are both waiting for a notification of some sort.

I'm guessing a scrub is happening while snapshots are being cleaned up, but that's just a guess. Alexander knows more about it than I do, although I've been going over the scrub path again today.

#11 Updated by Andrej Gortchivkin over 3 years ago

Thanks for the clarification Sean.

I just tried your suggestion, but unfortunately there is no change in the behavior at all. Since this parameter is read-only at runtime, I had to add it as a new loader tunable followed by a reboot. Once I started the scrub over my pool, I noticed the very same stuck situation, so I reverted once again to FN 11.0 U4, where I was able to cancel the ongoing scrub task. It seems that for some reason, once the scrub is engaged over the volume, it locks it down completely, so cancellation from within FN 11.1 definitely becomes impossible. There is certainly something wrong with the FN 11.1 ZFS code, and while the volume lockup is not immediately noticeable when the scrub task starts, it will most definitely lock up both GUI and CLI functionality at some point, which in fact was how I initially noticed there was something wrong with my FN server. I would say the potential impact to FN servers is quite high.
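
(In case it helps anyone else reproducing this, the tunable I added is equivalent to the following loader.conf entry; on FreeNAS I added it through the GUI tunables screen rather than editing the file by hand:)

vfs.zfs.trim.enabled="0"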

Let me know about any other suggestions you may have. I guess for the time being I have no option but to keep scrub tasks disabled.

#12 Updated by Dru Lavigne over 3 years ago

  • Status changed from 15 to Unscreened
  • Target version set to 11.2-BETA1

#13 Updated by Sean Fagan over 3 years ago

  • Has duplicate Bug #27437: FN11.1 - sharing services become unresponsive during scrub added

#14 Updated by Sean Fagan over 3 years ago

#15 Updated by Sean Fagan over 3 years ago

Note that I haven't been able to replicate it, but the text dumps supplied have helped me form a hypothesis:

The deadlock appears to happen when a large delete and a scrub are both in progress. As a guess, I'm thinking that when a snapshot is destroyed, and that snapshot had a large amount of files / space that had since been deleted, and a scrub is attempting to go over the same transaction group(s) that the deletion is processing, either they both depend on multiple locks that aren't ordered properly, or they are both waiting on the same condition variable (and since both are waiting, it will never be set).

One more question I have is whether the scrub in progress can be cancelled or suspended. Probably not, but see my first statement above.

#16 Updated by Andrej Gortchivkin over 3 years ago

If it is snapshot related, then this could partially answer the question of why I'm seeing this only on one particular volume. On the other hand, I'm using all my volumes constantly (all of them have regular snapshots set), so this somewhat undermines that explanation.

By the way, when I was trying to revive my system I also tried removing all active snapshots from the affected volume, but that didn't change the scrub behavior at all.

#17 Updated by Benno Rice over 3 years ago

Could you also try the following before starting the scrub?

sysctl -w vfs.zfs.per_txg_dirty_frees_percent=0

This disables the delete throttling that was introduced in 11.1.

#18 Updated by Andrej Gortchivkin over 3 years ago

I just tried that out but the result is the same. The scrub starts and then gets stuck at the very beginning. Once again, canceling it wasn't possible until I rebooted back to FN 11.0 U4.

#19 Updated by Alberto Lorenzo Pulido over 3 years ago

Hi, I have the same behaviour and had to reboot into a previous version to unlock the scrub... I tried backing up the config and reinstalling 11.1 directly; the scrub reset but just got stuck at 262GB, doing nothing. Disk activity is nil with all jails and VMs stopped.

Rebooting with my old boot drive, which has all the versions I have upgraded through since 9.x, did the trick, and now everything is working on 11.0 U4.

#20 Updated by Nicolas Gilles over 3 years ago

Hi! I have also been hit by this issue on one of my pools (an SSD mirrored pair), although in my case I have no scheduled tasks (like snapshots) on this pool, and the scrub started at midnight with no particular activity on the pool; no concurrent big deletion or other event was needed to trigger this potential deadlock.

It seems that the scrub went about halfway through the data before locking up. In fact, in my case the lockup only blocks the scrub; I can at least read data off the pool with seemingly no issues (I haven't tried doing writes explicitly).

I tried rebooting once (back into 11.1) and also experienced the slow import step, which eventually finished: the console logged some txg scanning messages where initially it read 88k blocks in 50000ms, then that number fell to 15 blocks / ~50000ms for the rest of the import.

Let me know if there is anything I can do to give some more data points to help track this issue.

#21 Updated by William Grzybowski over 3 years ago

  • Related to Bug #27666: Fixes related to jails for volume unlock added

#22 Updated by Alexander Motin over 3 years ago

  • Related to Bug #27773: Fix livelock in in sequential scrub added

#23 Updated by Alexander Motin over 3 years ago

  • Status changed from Unscreened to Investigation

Just yesterday we diagnosed one more scrub problem, where a scrub could become incredibly slow and eat enormous CPU time, up to blocking the system completely for a number of seconds. It should be fixed in the U1 update soon. Andrej, have you seen significant CPU load on your system during the stuck scrub, or was it completely idle?
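
(If it happens again, a rough way to check is something like the following, which shows system processes and per-thread CPU usage so that busy kernel/ZFS threads stand out:)

top -SH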

#24 Updated by Andrej Gortchivkin over 3 years ago

Unfortunately I didn't check any of the CPU stats at the time, but I'll do so if I encounter the situation again.

#25 Updated by Andrej Gortchivkin over 3 years ago

Just noticed the following from last night's scheduled scrub over one of my RAIDZ volumes, which doesn't lock up like my SAS HDD based RAID10 volume does...

I received a status mail from it, and apparently it contains a huge number of messages like the following:

(0:8:5/12): COMPARE AND WRITE. CDB: 89 00 00 00 00 00 00 00 a9 48 00 00 00 01 00 00
(5:6:12/57): WRITE. CDB: 2a 00 3e 0b 55 78 00 00 08 00
(0:8:3/10): WRITE. CDB: 2a 00 3d 4f aa 90 00 00 10 00
(3:6:15/60): WRITE. CDB: 2a 00 3a c7 c3 60 00 00 08 00
(0:8:5/12): Tag: 0xa117691, type 1
(3:6:15/60): Tag: 0xb21b6b4, type 1
(3:6:15/60): ctl_datamove: 159 seconds
ctl_datamove: tag 0xb21b6b4 on (3:6:15) aborted
(3:6:15/60): WRITE. CDB: 2a 00 3a c7 cc 78 00 00 20 00
(5:6:12/57): Tag: 0x8ad14eb, type 1
(0:8:5/12): ctl_process_done: 156 seconds
(3:8:7/14): WRITE. CDB: 2a 00 3a 9e bf 70 00 00 10 00
(5:6:12/57): ctl_datamove: 151 seconds
ctl_datamove: tag 0x8ad14eb on (5:6:12) aborted
(5:6:12/57): WRITE. CDB: 2a 00 0d 08 59 78 00 00 08 00
(3:6:15/60): Tag: 0xb21b6b5, type 1
(3:8:7/14): Tag: 0x8156afc, type 1
(3:8:7/14): ctl_process_done: 158 seconds
(0:8:3/10): Tag: 0xa11786f, type 1
(3:6:15/60): ctl_datamove: 159 seconds
(0:8:3/10): ctl_datamove: 151 seconds
(3:8:7/14): WRITE. CDB: 2a 00 17 b0 39 88 00 00 08 00
ctl_datamove: tag 0xb21b6b5 on (3:6:15) aborted
(5:6:12/57): Tag: 0x8ad14ee, type 1
(5:6:12/57): ctl_datamove: 150 seconds
(3:6:15/60): WRITE. CDB: 2a 00 3a c7 cc a0 00 00 08 00
(3:8:7/14): Tag: 0x8156b0f, type 1
(3:8:7/14): ctl_datamove: 154 seconds
ctl_datamove: tag 0x8156b0f on (3:8:7) aborted
(3:6:15/60): Tag: 0xb21b6b6, type 1
(3:8:7/14): WRITE. CDB: 2a 00 17 a4 c2 60 00 00 08 00
(3:6:15/60): ctl_datamove: 159 seconds
ctl_datamove: tag 0xb21b6b6 on (3:6:15) aborted
(3:8:7/14): Tag: 0x8156b11, type 1
(3:8:7/14): ctl_datamove: 154 seconds
ctl_datamove: tag 0x8156b11 on (3:8:7) aborted
(3:8:7/14): WRITE. CDB: 2a 00 3b 58 d1 e0 00 00 08 00
(3:6:15/60): WRITE. CDB: 2a 00 3a c7 cc c0 00 00 08 00
(5:6:12/57): WRITE. CDB: 2a 00 0d 09 02 40 00 00 08 00
(3:8:7/14): Tag: 0x8156b12, type 1
(3:8:7/14): ctl_datamove: 154 seconds
(3:6:15/60): Tag: 0xb21b6b7, type 1
(3:6:15/60): ctl_datamove: 159 seconds
ctl_datamove: tag 0xb21b6b7 on (3:6:15) aborted
(3:6:15/60): WRITE. CDB: 2a 00 3a cb 42 78 00 00 08 00
ctl_datamove: tag 0x8156b12 on (3:8:7) aborted
(3:8:7/14): WRITE. CDB: 2a 00 3b 5c d1 10 00 00 08 00
(3:6:15/60): Tag: 0xb21b6b8, type 1
(3:6:15/60): ctl_datamove: 159 seconds
(3:8:7/14): Tag: 0x8156b13, type 1
(5:6:12/57): Tag: 0x8ad14b2, type 1
(3:8:7/14): ctl_datamove: 154 seconds
(5:6:12/57): ctl_process_done: 154 seconds
(0:8:3/10): WRITE. CDB: 2a 00 01 24 60 88 00 00 08 00
ctl_datamove: tag 0x8156b13 on (3:8:7) aborted
(0:8:3/10): Tag: 0xa11782a, type 1
(0:8:3/10): ctl_process_done: 155 seconds
(3:8:7/14): WRITE. CDB: 2a 00 3b 5c d1 58 00 00 08 00
ctl_datamove: tag 0xb21b6b8 on (3:6:15) aborted
(3:6:15/60): WRITE. CDB: 2a 00 3a c6 37 00 00 00 08 00
(3:8:7/14): Tag: 0x8156b14, type 1
(3:8:7/14): ctl_datamove: 154 seconds
(3:6:15/60): WRITE. CDB: 2a 00 3a e3 65 f0 00 00 08 00
(3:6:15/60): Tag: 0xb21b6ad, type 1
(3:6:15/60): ctl_process_done: 159 seconds
(3:6:15/60): WRITE. CDB: 2a 00 3a c7 c3 60 00 00 08 00
(3:6:15/60): Tag: 0xb21b6b9, type 1
(3:6:15/60): ctl_datamove: 159 seconds
(3:6:15/60): Tag: 0xb21b6b4, type 1
(3:6:15/60): ctl_process_done: 159 seconds
(5:6:12/57): WRITE. CDB: 2a 00 3e 0b 55 78 00 00 08 00
ctl_datamove: tag 0xb21b6b9 on (3:6:15) aborted
(3:6:15/60): WRITE. CDB: 2a 00 3a fb d1 d0 00 00 08 00
ctl_datamove: tag 0x8156b14 on (3:8:7) aborted
(5:6:12/57): Tag: 0x8ad14eb, type 1
(5:6:12/57): ctl_process_done: 151 seconds

Furthermore, there are a number of messages related to my ESXi and Windows hosts losing their iSCSI connections with the FN server. I checked the reports for past CPU usage over the time frame while the scrub was running, and it seems that the CPU was around 60% constantly.

I don't recall seeing such behavior with any of the past builds.

#26 Updated by Alexander Motin over 3 years ago

I've pushed fixes for ticket #27773, which may be about the same bug as the one you spotted. You may try the next nightly build to see whether scrub works better for you now.

#27 Updated by Andrej Gortchivkin over 3 years ago

Just tried the latest nightly build on my FN box but unfortunately I'm unable to boot it properly due to the following error:

stopped at vga_bitblt_text+0x14a: movi (%rax,%rcx,4),%15d

I'm not sure if this happens because my FN is virtualized, but anyway I guess I'll have to wait for another, more stable nightly build before I can test against the latest scrub related changes.

I was using the 11-MASTER-201801160456 build during this test.

#28 Updated by Alexander Motin over 3 years ago

  • Status changed from Investigation to Closed: Duplicate

OK, let's close this for now, hoping one of the two scrub issues targeted for 11.1-U1 fixes it.

#29 Updated by Kurt D over 3 years ago

I had the same problem where a scrub started, wouldn't complete, and couldn't be canceled. FWIW, it would be helpful for others who run into this and find this bug to have instructions on how to clear the flag so that, after performing a reboot, the scrub can be aborted. Downgrading to 11.0 U4 resolved the issue for me.

#30 Updated by Andrej Gortchivkin over 3 years ago

Hey Alexander,

Any idea on the 11.1 U1 release date?

Also, I would suggest that we keep this ticket open until the issue is fully resolved. The problem seems to be affecting many people now, and the only workaround is to downgrade back to 11.0 U4 or an earlier version.

#31 Updated by Dru Lavigne over 3 years ago

Andrej: please refer to the resolved tickets that this one is a duplicate of.

We're waiting on the ETA of the FreeBSD Spectre fix to determine the U1 release date. We're also determining the best place to document the sysctl mentioned in comment 17. Thanks!

#32 Updated by Andrej Gortchivkin over 3 years ago

Thanks Dru,

Then I'll just keep watching for new releases.
If I get some spare time I'll try to test with a different nightly build.

#33 Updated by Andrej Gortchivkin over 3 years ago

Good news...

I just upgraded my FN box to 11.1-U1 and apparently the scrub is now working properly over my SAS volume, so things seem to be in order. I'll wait until the scrub is fully done over all my volumes, then report back the results.

As a side note, while the scrub is still running at this very moment, the CPU utilization so far seems to be around 20 - 30 percent, which is pretty decent. The only thing that confused me a bit is the new status messages for scrub operations:

scan: scrub in progress since Fri Jan 19 07:54:14 2018
346G scanned at 682M/s, 34.7G issued at 68.4M/s, 805G total
0 repaired, 4.31% done, 0 days 03:12:18 to go
scan: scrub in progress since Fri Jan 19 07:54:14 2018
405G scanned at 551M/s, 101G issued at 137M/s, 805G total
0 repaired, 12.48% done, 0 days 01:28:01 to go
scan: scrub in progress since Fri Jan 19 07:54:14 2018
481G scanned at 497M/s, 170G issued at 176M/s, 805G total
0 repaired, 21.16% done, 0 days 01:01:36 to go

Not sure what exactly the scanned vs. issued statistics mean here, but while observing the zpool status output I see a lot of fluctuation in the read bandwidth, which varies mostly between 200 and 450 MB/sec but sometimes drops as low as 8 to 20 MB/sec for a few moments. The gap between the scanned and issued rates also seems to fluctuate a bit every now and then, but I guess this is normal. I suppose this is related to the sequential scrub methodology, but if someone could shed some light on this, I would appreciate it.

Anyway, so far things seem to be stable. I'll report back the end results once I have them...

#34 Updated by Sean Fagan over 3 years ago

The old scrub code examined each transaction group, issuing each I/O as it read it. On a fragmented pool, this meant a lot of time seeking.

The new scrub code examines each transaction group, and sorts the blocks (that's the scanning part), and queues up the reads to be processed asynchronously (that's the issued part). This means less time spent seeking.

The scan rate and issued rate will, eventually, converge.
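
(If you want to watch the two rates converge, a quick-and-dirty loop like this works, with "tank" standing in for the pool name and a 60-second interval:)

while true; do zpool status tank | grep -E 'scanned|repaired'; sleep 60; done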

#35 Updated by Andrej Gortchivkin over 3 years ago

Thanks for the clarification, Sean. I just found the presentation on this topic (links below for those interested in the details).

https://www.youtube.com/watch?v=SZFwv8BdBj4&feature=youtu.be
https://drive.google.com/file/d/0B5hUzsxe4cdmVU91cml1N0pKYTQ/view

So my pools are now fully scrubbed. Apart from 3 checksum errors on the SAS volume, probably caused by the previous lockups and reverts back to FN 11.0-U4, everything else seems to be absolutely fine. Unfortunately I'm now hitting some data errors, as one of the checksum errors affects two HDDs of a mirror VDEV, but thankfully it's only on a single old snapshot, so hopefully when I remove it the pool will get back to its fully normal state. The worst case scenario will be restoring data from backups, but I can live with that.

I can definitely confirm that the new scrub method works really well over my pools, and I would say that its completion time improved substantially. So for the moment I'm treating this issue as resolved on my end. It would be nice to hear back from other affected users about their scrub experience with FN 11.1-U1.

Many thanks to everyone for the help.

#36 Updated by Andrej Gortchivkin over 3 years ago

Just a few additional notes regarding the upgrades & scrubs...

As I mentioned in my previous post, I encountered some checksum errors on my zpool which turned out to be pretty serious, and I therefore faced the situation of having a single ZVOL which couldn't be read at all. Initially I thought that only the snapshots were affected, but then it turned out that the whole ZVOL was acting weird and eventually generating even more checksum errors while I was accessing data from it.

While the operating system was still seeing the mounted data underneath, it just couldn't read the files and folders (various CRC errors), which forced me to use restore procedures. While the CRC errors appeared inside the OS, the zpool at the same time went into degraded mode due to too many checksum errors. So the most important thing to note here is that if you had a volume lock up caused by the previous upgrades, make sure to run a full scrub after installing the latest FN version and let it finish properly.
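
(A minimal sketch of that check, with "tank" again standing in for the pool name:)

zpool scrub tank
zpool status -v tank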

I'm pretty confident that in my case the checksum errors were the direct result of the earlier forced reboots while my system was unresponsive, so just note that if you were doing the same because you had no other options, then potential data loss might be encountered. I assume that while my volume was locked up, there were still some unfinished transactions which may have scrambled some of the data.

#37 Updated by Dru Lavigne over 3 years ago

  • Target version changed from 11.2-BETA1 to N/A

#38 Updated by Dru Lavigne over 3 years ago

  • File deleted (debug-sns01storvm01-20171224061959.txz)
