Bug #35737

Use sane default zvol blocksize based on pool topology

Added by Victor Hooi over 2 years ago. Updated about 2 years ago.

Status:
Done
Priority:
No priority
Assignee:
Brandon Schneider
Category:
Middleware
Target version:
Severity:
Med High
Reason for Closing:
Reason for Blocked:
Needs QA:
No
Needs Doc:
No
Needs Merging:
No
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

I have a FreeNAS machine that becomes unresponsive, and the console is filled with messages saying "swap_pager_getswapspace".

Hardware is:
  • SuperMicro A2SDi-8C+-HLN4F
  • 64GB RAM
  • 6 x 8TB WD HDDs, currently in RAID-Z1

Software is FreeNAS-11.2-MASTER-201806210452.

Within this, I have a single Bhyve VM with 8GB of RAM, and running Ubuntu 18.04.

Within this, I have a Docker container running https://github.com/wernight/docker-qbittorrent.

When running FreeNAS + Bhyve VM - it seems ok.

But some time after spinning up the Docker container - the entire machine becomes unresponsive. This is both the Bhyve VM, as well as the parent FreeNAS machine.

This is related to this forum post:

https://forums.freenas.org/index.php?threads/freenas-becomes-unresponsive-console-keeps-printing-swap_pager_getswapspace.65226/

freenas_creating_zvol_for_bhyve.png (123 KB), Victor Hooi, 06/25/2018 02:43 PM
bhyve_arc_max_tunable.png (218 KB), Victor Hooi, 06/25/2018 02:44 PM
Screen Shot 2018-06-26 at 15.32.44.png (121 KB), Victor Hooi, 06/25/2018 10:33 PM

Associated revisions

Revision 66f76f5d (diff)
Added by Brandon Schneider over 2 years ago

feat(vm): Use 32K as default for volblocksize

512 bytes is too small for an optimal volblocksize with RAIDZ. While this should be configurable down the road for the user, 32K is a sane starting point.

Ticket: #35737

Revision ae27ea3e (diff)
Added by Brandon Schneider about 2 years ago

feat(vm/create): Create zvols instead of the UI

Beginning of the transition from a bunch of private API calls to fewer public ones handling those private parts.

Ticket: #35737

Revision 1f489b4a (diff)
Added by Brandon Schneider about 2 years ago

feat(vm): Use 32K as default for volblocksize

512 bytes is too small for an optimal volblocksize. This removes that, and creating a zvol is moved to the VM middleware plugin where it should live.

Related PR: https://github.com/freenas/freenas/pull/1451
Ticket: #35737

Revision 9fe61e3f (diff)
Added by Brandon Schneider about 2 years ago

feat(vm): Use recommended value as default for volblocksize

512 bytes is too small for an optimal volblocksize. This removes that, and creating a zvol is moved to the VM middleware plugin where it should live.

Related PR: https://github.com/freenas/freenas/pull/1451
Ticket: #35737

Revision 1b0e3bc8 (diff)
Added by Brandon Schneider about 2 years ago

feat(vm/create): Create zvols instead of the UI (#1451)

* feat(zfs): pool.dataset.recommended_zvol_blocksize. Ticket: #35757
* feat(vm/create): Create zvols instead of the UI. Beginning of the transition from a bunch of private API calls to fewer public ones handling those private parts. Ticket: #35737
* Use consistent quotes (OCD)
* Address William's review

History

#1 Updated by Victor Hooi over 2 years ago

  • File debug-freenas-naulty-place-20180625145301.txz added
  • Private changed from No to Yes

#2 Updated by Dru Lavigne over 2 years ago

  • Assignee changed from Release Council to Alexander Motin
  • Seen in changed from Unspecified to Master - FreeNAS Nightlies

#3 Updated by Alexander Motin over 2 years ago

  • Subject changed from Freenas becomes unresponsive, console keeps printing "swap_pager_getswapspace" to Our of memory due to ARC wasted by too small blocks
  • Status changed from Unscreened to Screened

I think I have an idea of what is going wrong with your system. The system becomes unresponsive as a result of running out of swap space: everything that can be pushed out to swap already has been, but it is not enough. The cause I see is the ZFS ARC, which, while storing only 14GB of real data, wasted an additional 39GB of RAM in unused allocation chunks. I see two problems there: one is that ZFS seems to account memory by net size instead of gross size, which in the case of very small allocations can be hugely different; the second, which probably caused the first, is the two ZVOLs backing your VMs, which were created with 512-byte block sizes. Blocks that small are extremely inefficient in many ways. You should evacuate the data from those ZVOLs/VMs and recreate them with a reasonable block size of at least 16KB.
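
For example, recreating a zvol with a saner block size looks roughly like this (a sketch only; pool and zvol names are placeholders):

zfs get volblocksize tank/old-vm-disk                    # reports 512 in this case
zfs create -V 300G -o volblocksize=16K tank/new-vm-disk  # replacement with 16K blocks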

#4 Updated by Victor Hooi over 2 years ago


Firstly, I was doing random Googling, and a post mentioned ARC cache (http://freebsd.1045724.x6.nabble.com/bhyve-uses-all-available-memory-during-IO-intensive-operations-td6223197.html), so I thought I'd try tuning it down to half my available RAM (see attached). Would this help at all?

Secondly, how do I migrate those ZVOLs to a different block size? Or do I need to completely re-create them?

Also, I created these Bhyve instances via the FreeNAS GUI, also using the GUI to create a ZVOL - I assume it simply defaults to 512 byte block sizes?

I just went through the GUI again, couldn't see any option to specify block size (see attached). Should this be an exposed option, or should the default be changed?

#5 Updated by Alexander Motin over 2 years ago

  • Subject changed from Our of memory due to ARC wasted by too small blocks to Out of memory due to ARC wasted by too small ZVOL blocks
  • Status changed from Screened to Unscreened
  • Assignee changed from Alexander Motin to William Grzybowski
  • Target version changed from Backlog to 11.2-BETA2
  • Severity changed from New to Med High

Victor Hooi wrote:

Firstly, I was doing random Googling, and a post mentioned ARC cache (http://freebsd.1045724.x6.nabble.com/bhyve-uses-all-available-memory-during-IO-intensive-operations-td6223197.html), so I thought I'd try tuning it down to half my available RAM (see attached). Would this help at all?

Generally it would help, but probably not in this case. Your problem is that for every stored block, ARC these days allocates at least a 4KB memory page. In your case that means 4KB per 512 bytes. So even half of RAM will be too much unless you fix the original issue.
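
To put rough numbers on that (illustrative arithmetic, not values measured on this system):

4096 B page / 512 B block = 8x memory footprint per byte stored
4096 B - 512 B = 3584 B wasted per block, i.e. 7x the data size
so roughly 5.6 GB of 512-byte-block data alone could waste ~39 GB of RAM (5.6 GB * 7), in line with the waste reported above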

Secondly, how do I migrate those ZVOLs to a different block size? Or do I need to completely re-create them?

Unfortunately the ZVOL block size cannot be changed after creation. The only way is to create a new one and copy the data over.

Also, I created these Bhyve instances via the FreeNAS GUI, also using the GUI to create a ZVOL - I assume it simply defaults to 512 byte block sizes?

I haven't tested that myself, but that sounds possible.

I just went through the GUI again, couldn't see any option to specify block size (see attached). Should this be an exposed option, or should the default be changed?

In the standalone ZVOL creation UI it is an option among the advanced settings, and there is also logic to properly choose the default value. I guess in the case of VM creation neither of those is present. I'll forward the ticket to the middleware team so they can look at whether it is a UI or a middleware issue.

#6 Updated by Victor Hooi over 2 years ago

Ok - so it sounds like simply setting the block size from 512 bytes to 16KB or 128KB would resolve my issue then?

There's no need to set the arc_max tunable? (I'm happy to leave this alone if ordinary users shouldn't be touching this).

OK, I can re-create the ZVOL if that's the best way. Sorry if this is a basic question - I'm thinking I can create a new ZVOL manually via the FreeNAS GUI (and use the Advanced dropdown to set the ZVOL block size). I can also edit the Bhyve VM via the FreeNAS GUI to point to this new ZVOL. Is there then an easy way to copy the file contents from one ZVOL to another? Or is it easier to just re-create the entire Bhyve VM from scratch?

Yes, I think there is a bug in the Bhyve VM creation if it's defaulting to 512 bytes and it doesn't let you override it. Surely other people are hitting this issue though? Or is it that they're not doing IO-intensive things on FreeNAS Bhyve VMs yet?

#7 Updated by Alexander Motin over 2 years ago

Victor Hooi wrote:

Ok - so it sounds like simply setting the block size from 512 bytes to 16KB or 128KB would resolve my issue then?

Yes. Considering you are using RAIDZ, which is not perfect for storing small objects, a value of 32KB would probably be even better.

There's no need to set the arc_max tunable? (I'm happy to leave this alone if ordinary users shouldn't be touching this).

You may reduce it, but it should not be required.

OK, I can re-create the ZVOL if that's the best way. Sorry if this is a basic question - I'm thinking I can create a new ZVOL manually via the FreeNAS GUI (and use the Advanced dropdown to set the ZVOL block size). I can also edit the Bhyve VM via the FreeNAS GUI to point to this new ZVOL. Is there then an easy way to copy the file contents from one ZVOL to another? Or is it easier to just re-create the entire Bhyve VM from scratch?

If the ZVOL sizes match, it should be possible to copy the content, for example with `dd if=/dev/zvol/{pool}/{fromzvol} of=/dev/zvol/{pool}/{tozvol} bs=1m` from the command line.
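
Before copying, the sizes can be compared from the shell (a sketch; dataset names are placeholders):

zfs get -H -o name,value volsize tank/fromzvol tank/tozvol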

Yes, I think there is a bug in the Bhyve VM creation if it's defaulting to 512 bytes and it doesn't let you override it. Surely other people are hitting this issue though? Or is it that they're not doing IO-intensive things on FreeNAS Bhyve VMs yet?

My guess is that not so many people are using the new UI and nightly builds so far.

#8 Updated by Victor Hooi over 2 years ago


I created a new ZVOL with block size set to 32 KB.

Is this a good middle-ground? What (if any) are the advantages of going to 128 KB? Underlying filesystem is ext4.

I'm currently doing a dd to the new ZVOL (300 GiB) - the machine seemed to lock up the first time I did it. Not sure why.

The Bhyve instance isn't booted, and I've set arc_max to 30GB - is it possible I'm still hitting this issue though?

Is there any workaround I can use to still copy the data across to the new ZVOL?

I've attached a graph from FreeNAS GUI showing memory usage, after a fresh boot and running dd immediately. You can see wired growing aggressively.

#9 Updated by William Grzybowski over 2 years ago

  • Category changed from OS to Middleware
  • Assignee changed from William Grzybowski to Brandon Schneider

#10 Updated by Alexander Motin over 2 years ago

Victor Hooi wrote:

I created a new ZVOL with block size set to 32 KB.

Is this a good middle-ground? What (if any) are the advantages of going to 128 KB? Underlying filesystem is ext4.

Yes, 32KB is usually about as good a middle-ground as possible for VMs on top of RAIDZ. Bigger blocks could give better sequential throughput, but suffer more on random I/O. Smaller blocks lead to lower space efficiency.

I'm currently doing a dd to the new ZVOL (300 GiB) - the machine seemed to lock up the first time I did it. Not sure why.

Just a guess, but it may be the same issue. I'd try either turning arc_max down to something small, or completely disabling data caching for that ZVOL with `zfs set primarycache=metadata ...`.

The Bhyve instance isn't booted, and I've set arc_max to 30GB - is it possible I'm still hitting this issue though?

Yes, the problem is not bhyve. It is just a trigger and another memory consumer.

Is there any workaround I can use to still copy the data across to the new ZVOL?

I've attached a graph from FreeNAS GUI showing memory usage, after a fresh boot and running dd immediately. You can see wired growing aggressively.

That is what ARC should do on a freshly booted system. The only question is how far it goes and how much of that memory is used for a good reason rather than wasted.

#11 Updated by Brandon Schneider over 2 years ago

  • Status changed from Unscreened to In Progress

#12 Updated by Victor Hooi over 2 years ago

Thanks!

I was able to successfully copy over the data now using:

freenas-naulty-place# zfs get primarycache datastore-naulty-place/freenas-torrent-ne8r4s
NAME                                           PROPERTY      VALUE         SOURCE
datastore-naulty-place/freenas-torrent-ne8r4s  primarycache  all           default
freenas-naulty-place# zfs set primarycache=metadata datastore-naulty-place/freenas-torrent-ne8r4s
freenas-naulty-place# zfs get primarycache datastore-naulty-place/freenas-torrent-ne8r4s
NAME                                           PROPERTY      VALUE         SOURCE
datastore-naulty-place/freenas-torrent-ne8r4s  primarycache  metadata      local
freenas-naulty-place# dd if=/dev/zvol/datastore-naulty-place/freenas-torrent-ne8r4s of=/dev/zvol/datastore-naulty-place/freenas-torrent-3 bs=1m

I thought I'd try editing the Bhyve VM disk in the FreeNAS GUI to point to the new disk - however, the VM didn't seem to boot successfully. The machine would start, then switch to Stopped within a few seconds.

I then edited the disk back to the old disk - but it still won't boot.

This is quite odd =(.

Serial and VNC both show nothing.

Is there any other way of manually accessing the data on the ZVOL easily from within FreeNAS?

#13 Updated by Alexander Motin over 2 years ago

Victor Hooi wrote:

Is there any other way of manually accessing the data on the ZVOL easily from within FreeNAS?

By default, ZVOLs look like a character device you can read or write. But if you set the volmode property to geom, after a reboot or pool re-import they will turn into a disk that can be mounted. The only question is what file system is there and whether FreeNAS can mount it directly. Generally this is not a very commonly used approach. As an alternative, you could attach that disk to some other VM that supports the file system.
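
A sketch of that approach (the zvol name is a placeholder):

zfs set volmode=geom datastore/vm-disk
# after a reboot or pool re-import, partitions appear as e.g. /dev/zvol/datastore/vm-diskp1
# whether they can be mounted depends on the file system inside (ext4 would need extra tooling on FreeNAS)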

#14 Updated by Victor Hooi over 2 years ago

As an alternative - is there any way to debug why the Bhyve VM instance might be starting, then stopping a few seconds later?

Some kind of low-level debug interface? Or logfiles?

Happy to provide all the details, if that's a separate bug in FreeNAS as well.

#15 Updated by Brandon Schneider over 2 years ago

PR: https://github.com/freenas/webui/pull/932

DESC: Changing the default volblocksize for zvols to 32K from 512B.
RISK: Low
ACCEPTANCE: Create a VM, check the volblocksize with zfs get
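
A quick shell form of that acceptance check (a sketch; the pool/zvol name is a placeholder):

zfs get -H -o value volblocksize tank/my-vm-zvol
# expect 32K rather than 512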

#16 Updated by Brandon Schneider over 2 years ago

Victor Hooi wrote:

As an alternative - is there any way to debug why the Bhyve VM instance might be starting, then stopping a few seconds later?

Some kind of low-level debug interface? Or logfiles?

Happy to provide all the details, if that's a separate bug in FreeNAS as well.

You can `tail -f /var/log/middleware.log` to see what the VM plugin is reporting back; otherwise @araujo will know more.

#17 Updated by Alexander Motin over 2 years ago

Brandon Schneider wrote:

PR: https://github.com/freenas/webui/pull/932

Brandon, while this is better than nothing, couldn't we make it more clever, similar to the ZVOL creation code in the old UI, where volblocksize was decided depending on pool topology? 32KB is a good number for a mid-size RAIDZ, while for a MIRROR it is usually better to reduce it to 16K, and for a very wide RAIDZ2 even 64K may make sense. You may ask William; he implemented the previous code years ago.

#18 Updated by Brandon Schneider over 2 years ago

Alexander Motin wrote:

Brandon Schneider wrote:

PR: https://github.com/freenas/webui/pull/932

Brandon, while this is better than nothing, couldn't we make it more clever, similar to the ZVOL creation code in the old UI, where volblocksize was decided depending on pool topology? 32KB is a good number for a mid-size RAIDZ, while for a MIRROR it is usually better to reduce it to 16K, and for a very wide RAIDZ2 even 64K may make sense. You may ask William; he implemented the previous code years ago.

I suppose. I made the least invasive change since the severity is high and we wanted it quickly. I could spend some time and try to make it more clever instead of hardcoded.
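
For illustration only, a hypothetical shell sketch (not the actual middleware code) of topology-based selection along the lines mav describes:

pool=tank
vdev=$(zpool status "$pool" | awk '$1 ~ /^(mirror|raidz)/ {print $1; exit}')
case "$vdev" in
  mirror*) volblocksize=16K ;;   # mirrors: smaller blocks are fine
  raidz2*) volblocksize=64K ;;   # assuming a very wide RAIDZ2 stripe
  raidz*)  volblocksize=32K ;;   # mid-size RAIDZ
  *)       volblocksize=16K ;;   # plain stripe of disks: fall back to 16K
esac
zfs create -V 20G -o volblocksize="$volblocksize" "$pool/vm-disk"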

#19 Updated by Alexander Motin over 2 years ago

Thank you for the quick fix, and thank you in advance for the better implementation.

#20 Updated by Victor Hooi about 2 years ago

I think I might have hit this bug as well:

https://redmine.ixsystems.com/issues/28157

Specifically, in /var/log/middleware.log, I see:

[2018/06/27 20:36:47] (WARNING) VMService.__init_guest_vmemory():830 - ===> Cannot guarantee memory for guest id: 4

I'm guessing this may be related to ARC chewing up all my memory?

I was able to start the VM, after rebooting FreeNAS =) - the old "did you try turning it off and on again" trick.

But then when I shut down that machine, and started up another one, it died within a few seconds - and I saw that in the middleware.log.

#21 Updated by Brandon Schneider about 2 years ago

Middleware PR with mav's suggestions: https://github.com/freenas/freenas/pull/1451

#22 Updated by Brandon Schneider about 2 years ago

  • Status changed from In Progress to Ready for Testing
  • Needs Merging changed from Yes to No

DESC: Create ZVOLs with a suggested block size instead of a fixed 512 bytes.
RISK: Low
ACCEPTANCE: Create a VM and verify the ZVOL is created with a blocksize over 512b.

#23 Updated by Dru Lavigne about 2 years ago

  • File deleted (debug-freenas-naulty-place-20180625145301.txz)

#24 Updated by Dru Lavigne about 2 years ago

  • Subject changed from Out of memory due to ARC wasted by too small ZVOL blocks to Use 32K as default zvol blocksize
  • Private changed from Yes to No

#25 Updated by Timothy Moore II about 2 years ago

  • Status changed from Ready for Testing to Failed Testing

Testing with FreeNAS [Mini | system] updated to FreeNAS-11.2-MASTER-201807120858:

Go to VMs and add a new standard VM. Go to Storage/Pools and verify name of VM zvol. Go to shell and enter `zfs get all <pool>/<vm-zvol-name> | less`. Check “volblocksize” property is >512.

Retested with all three types of standard VMs: Windows, Linux, and FreeBSD. Each VM reports a "volblocksize" of exactly "512". I also tried manually creating a zvol with a block size of 32K, creating a standard VM that uses that zvol, and confirming the "volblocksize" reports as "32K".

#26 Updated by Nick Principe about 2 years ago

Why are we choosing to diverge from the default zvol block size of 16K?

For VM zvols there is almost guaranteed to be a file system on top, for which people will likely not change (or cannot change) the logical block size from 4K, and the read-modify-write penalty is quite harsh for zvols.

The block size that makes sense for VM boot volumes is 4K, but even 16K is preferable to diverging from the default with a value like 32K that does not seem to make sense for the use case.

#27 Updated by Alexander Motin about 2 years ago

The committed code should still try to get the best block size from the pool config, which should end up as a 16K volblocksize in most cases. 32K was specified there only as a last-resort value in case of some unimaginable errors. But according to Tim the result is still 512 bytes, which probably means that neither of those is actually working.

#28 Updated by William Grzybowski about 2 years ago

  • Status changed from Failed Testing to In Progress

This needs to be tested after https://github.com/freenas/webui/pull/932 is merged.

#29 Updated by Nick Principe about 2 years ago

I agree 32K is better than 512b, but I would still prefer a 16K fallback if we think we should be using that most of the time anyway on FreeNAS.

#30 Updated by Timothy Moore II about 2 years ago

Retested after https://github.com/freenas/webui/pull/932 was merged. Testing with FreeNAS System updated to FreeNAS-11.2-MASTER-201807160837:

Went to VMs and created "TrueOS 18.06", "Windows Server", and "Gentoo Minimal" VMs. Went to shell and ran `zfs get all` on each VM zvol. Each VM reports a volblocksize of 16K.

#32 Updated by Alexander Motin about 2 years ago

Please correct me if I am wrong, but looking at the final code now, I can't see the 32K value at all. That value was proposed as a quick fix for BETA1, but I am not sure it landed even there. I think the ticket title should be updated to reflect what was really done, so as not to confuse people.

#33 Updated by William Grzybowski about 2 years ago

  • Subject changed from Use 32K as default zvol blocksize to Use sane default zvol blocksize based on pool topology
  • Status changed from In Progress to Passed Testing

#34 Updated by Dru Lavigne about 2 years ago

  • Status changed from Passed Testing to Done
  • Needs QA changed from Yes to No
  • Needs Doc changed from Yes to No
