Bug #18277

ZFS SLOG performance optimization to saturate NVDIMM

Added by Alexander Motin almost 3 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Expected
Assignee: Alexander Motin
Category: OS
Target version:
Seen in:
Severity: New
Reason for Closing:
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

In sync=always mode, ZFS spends too much time on context switches while doing synchronous SLOG writes, where low latency is critical for reaching reasonable throughput. The issue was already known with SSD-based SLOG devices, but it became critical with NVDIMM, whose latency is effectively zero.

Associated revisions

Revision 2be7cbc3 (diff)
Added by Alexander Motin almost 3 years ago

Optimize ZIL itx memory allocation on FreeBSD.

These allocations can reach up to 128KB, while FreeBSD kernel allocator
can cache allocations only up to 64KB. To avoid expensive allocations
for each large ZIL write use caching zio_buf_alloc() allocator instead.

To make it possible de-inline few instances of zil_itx_destroy().

Ticket: #18277

(cherry picked from commit 81423ea480ebc62d94772351d76f01aca5b14bfe)
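
The idea in the commit message above (reuse freed large buffers instead of paying for a fresh allocation on every ZIL write) can be illustrated with a small userspace sketch. This is not the zio_buf_alloc() implementation; struct buf_cache, CACHE_SLOTS and the buf_cache_*() functions are hypothetical names invented for the illustration, and malloc() stands in for the expensive uncached path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 8              /* hypothetical cache depth */

/* Hypothetical cache of freed buffers of one fixed size. */
struct buf_cache {
    size_t size;                   /* buffer size served by this cache */
    void  *slots[CACHE_SLOTS];     /* stack of freed buffers */
    int    count;                  /* buffers currently cached */
    int    hits;                   /* allocations served from the cache */
    int    misses;                 /* allocations that fell through to malloc() */
};

static void
buf_cache_init(struct buf_cache *bc, size_t size)
{
    assert(size > 0);
    bc->size = size;
    bc->count = 0;
    bc->hits = 0;
    bc->misses = 0;
}

/* Reuse a cached buffer when one is available, else allocate a new one. */
static void *
buf_cache_alloc(struct buf_cache *bc)
{
    if (bc->count > 0) {
        bc->hits++;
        return (bc->slots[--bc->count]);
    }
    bc->misses++;
    return (malloc(bc->size));
}

/* Keep the buffer for reuse if there is room, else release it. */
static void
buf_cache_free(struct buf_cache *bc, void *buf)
{
    if (bc->count < CACHE_SLOTS)
        bc->slots[bc->count++] = buf;
    else
        free(buf);
}

/* Release every cached buffer. */
static void
buf_cache_drain(struct buf_cache *bc)
{
    while (bc->count > 0)
        free(bc->slots[--bc->count]);
}
```

With a cache initialized for 128KB buffers, an alloc/free/alloc sequence hands back the same buffer on the second call, so only the first write pays the allocation cost.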

Revision e027c6cf (diff)
Added by Alexander Motin almost 3 years ago

Completely skip cache flushing for not supporting log devices.

NVDIMM driver now writes data in non-temporal store mode, completely
bypassing CPU caches, so there is nothing to flush.

Ticket: #18277

Revision a1cde520 (diff)
Added by Alexander Motin almost 3 years ago

Execute last ZIO of log commit synchronously.

For short transactions overhead of context switch can be too large.
Skipping it gives significant latency reduction. For large ones,
including multiple ZIOs, latency is less critical, while throughput
there may become limited by checksumming speed of single CPU core.
To get best of both cases, execute last ZIO directly from calling
thread context to save latency, while all others (if there are any)
enqueue to taskqueues in traditional way.

Ticket: #18277
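
The dispatch policy described in the commit message above can be sketched in userspace as follows. This is not the ZFS code: struct work, struct queue, queue_dispatch(), queue_run(), commit_issue() and count_work() are hypothetical names, and the "taskqueue" is simulated by an array that is drained later rather than by real worker threads.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUED 16              /* hypothetical queue capacity */

typedef void (*work_fn)(void *arg);

struct work {
    work_fn fn;
    void   *arg;
};

/* Stand-in for a taskqueue: items are recorded now and run later. */
struct queue {
    struct work items[MAX_QUEUED];
    size_t      len;
};

/* Example work item: bump an int counter. */
static void
count_work(void *arg)
{
    (*(int *)arg)++;
}

static int
queue_dispatch(struct queue *q, struct work w)
{
    if (q->len == MAX_QUEUED)
        return (-1);
    q->items[q->len++] = w;
    return (0);
}

/* Simulates the worker side: run everything that was queued. */
static void
queue_run(struct queue *q)
{
    for (size_t i = 0; i < q->len; i++)
        q->items[i].fn(q->items[i].arg);
    q->len = 0;
}

/*
 * Issue n work items: all but the last are queued, the last one runs
 * directly in the caller's context.  Returns how many items ran inline.
 */
static size_t
commit_issue(struct queue *q, const struct work *w, size_t n)
{
    size_t inline_runs = 0;

    assert(q != NULL);
    if (n == 0)
        return (0);
    for (size_t i = 0; i + 1 < n; i++) {
        if (queue_dispatch(q, w[i]) != 0) {
            w[i].fn(w[i].arg);     /* queue full: fall back to inline */
            inline_runs++;
        }
    }
    w[n - 1].fn(w[n - 1].arg);     /* last item: no context switch */
    return (inline_runs + 1);
}
```

A single-item commit therefore never touches the queue at all, which is the short-transaction case where the context switch dominates latency.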

Revision 72080dfc (diff)
Added by Alexander Motin almost 3 years ago

Skip context switch on GEOM I/O completion if context looks sleepable.

This allows to reduce I/O latency for devices like NVDIMM, having no
interrupts and reporting completion directly from calling thread.

This is a dirty hack, since used THREAD_CAN_SLEEP() is not generally
reliable and supposed to be used only for assertions.

Ticket: #18277
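
A rough userspace sketch of the decision described in the commit message above: deliver the completion directly when the current context may sleep, otherwise hand it to a separate completion path. It is not the GEOM code; struct io_req, struct done_queue, io_deliver(), io_mark_done() and done_queue_drain() are hypothetical names, and the ctx_can_sleep flag only stands in for the THREAD_CAN_SLEEP() check mentioned in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16             /* hypothetical queue capacity */

struct io_req {
    void (*done)(struct io_req *); /* completion callback */
    int   completed;               /* set by the example callback */
};

/* Stand-in for the thread that normally delivers completions. */
struct done_queue {
    struct io_req *pending[MAX_PENDING];
    size_t         len;
};

/* Example completion callback. */
static void
io_mark_done(struct io_req *req)
{
    req->completed = 1;
}

/*
 * Deliver a completion.  Returns true if the callback ran inline in the
 * caller's context, false if it was deferred to the queue.
 */
static bool
io_deliver(struct done_queue *dq, struct io_req *req, bool ctx_can_sleep)
{
    if (ctx_can_sleep) {
        req->done(req);            /* no hand-off, no context switch */
        return (true);
    }
    assert(dq->len < MAX_PENDING);
    dq->pending[dq->len++] = req;  /* defer to the completion path */
    return (false);
}

/* Simulates the completion thread: run all deferred callbacks. */
static void
done_queue_drain(struct done_queue *dq)
{
    for (size_t i = 0; i < dq->len; i++)
        dq->pending[i]->done(dq->pending[i]);
    dq->len = 0;
}
```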

Revision 878d1f24 (diff)
Added by Alexander Motin almost 3 years ago

Add vfs.zfs.zil_log_limit sysctl.

It is at least partially broken now, but that is another question.

Ticket: #18277

(cherry picked from commit 1c84b59fa5ef8425441d55542533c674f4d744a6)

Revision a9e82df1 (diff)
Added by Alexander Motin over 2 years ago

Optimize ZIL itx memory allocation on FreeBSD.

These allocations can reach up to 128KB, while FreeBSD kernel allocator
can cache allocations only up to 64KB. To avoid expensive allocations
for each large ZIL write use caching zio_buf_alloc() allocator instead.

To make it possible de-inline few instances of zil_itx_destroy().

Ticket: #18277

(cherry picked from commit 81423ea480ebc62d94772351d76f01aca5b14bfe)

Revision be2282c3 (diff)
Added by Alexander Motin over 2 years ago

Completely skip cache flushing for not supporting log devices.

NVDIMM driver now writes data in non-temporal store mode, completely
bypassing CPU caches, so there is nothing to flush.

Ticket: #18277

Revision 0853ddfb (diff)
Added by Alexander Motin over 2 years ago

Execute last ZIO of log commit synchronously.

For short transactions overhead of context switch can be too large.
Skipping it gives significant latency reduction. For large ones,
including multiple ZIOs, latency is less critical, while throughput
there may become limited by checksumming speed of single CPU core.
To get best of both cases, execute last ZIO directly from calling
thread context to save latency, while all others (if there are any)
enqueue to taskqueues in traditional way.

Ticket: #18277

Revision 50fb51f0 (diff)
Added by Alexander Motin over 2 years ago

Skip context switch on GEOM I/O completion if context looks sleepable.

This allows to reduce I/O latency for devices like NVDIMM, having no
interrupts and reporting completion directly from calling thread.

This is a dirty hack, since used THREAD_CAN_SLEEP() is not generally
reliable and supposed to be used only for assertions.

Ticket: #18277

Revision 305f8523 (diff)
Added by Alexander Motin over 2 years ago

Add vfs.zfs.zil_log_limit sysctl.

It is at least partially broken now, but that is another question.

Ticket: #18277

(cherry picked from commit 1c84b59fa5ef8425441d55542533c674f4d744a6)

Revision 275c639e (diff)
Added by Alexander Motin over 2 years ago

Completely skip cache flushing for not supporting log devices.

NVDIMM driver now writes data in non-temporal store mode, completely
bypassing CPU caches, so there is nothing to flush.

Ticket: #18277
(cherry picked from commit e027c6cf2d505cb114d550c607266bf7e8115906)

Revision 5984355e (diff)
Added by Alexander Motin over 2 years ago

Execute last ZIO of log commit synchronously.

For short transactions overhead of context switch can be too large.
Skipping it gives significant latency reduction. For large ones,
including multiple ZIOs, latency is less critical, while throughput
there may become limited by checksumming speed of single CPU core.
To get best of both cases, execute last ZIO directly from calling
thread context to save latency, while all others (if there are any)
enqueue to taskqueues in traditional way.

Ticket: #18277
(cherry picked from commit a1cde5209669b8eaeabc768cf330847eb724bb5d)

Revision dc723cd8 (diff)
Added by Alexander Motin over 2 years ago

Skip context switch on GEOM I/O completion if context looks sleepable.

This allows to reduce I/O latency for devices like NVDIMM, having no
interrupts and reporting completion directly from calling thread.

This is a dirty hack, since used THREAD_CAN_SLEEP() is not generally
reliable and supposed to be used only for assertions.

Ticket: #18277
(cherry picked from commit 72080dfcb1a8b4f28eeb0028194d839f257c8721)

Revision 8a6a0458 (diff)
Added by Alexander Motin over 2 years ago

Completely skip cache flushing for not supporting log devices.

NVDIMM driver now writes data in non-temporal store mode, completely
bypassing CPU caches, so there is nothing to flush.

Ticket: #18277
(cherry picked from commit e027c6cf2d505cb114d550c607266bf7e8115906)

Revision 71977676 (diff)
Added by Alexander Motin over 2 years ago

Execute last ZIO of log commit synchronously.

For short transactions overhead of context switch can be too large.
Skipping it gives significant latency reduction. For large ones,
including multiple ZIOs, latency is less critical, while throughput
there may become limited by checksumming speed of single CPU core.
To get best of both cases, execute last ZIO directly from calling
thread context to save latency, while all others (if there are any)
enqueue to taskqueues in traditional way.

Ticket: #18277
(cherry picked from commit a1cde5209669b8eaeabc768cf330847eb724bb5d)

Revision 2c9b0bde (diff)
Added by Alexander Motin over 2 years ago

Skip context switch on GEOM I/O completion if context looks sleepable.

This allows to reduce I/O latency for devices like NVDIMM, having no
interrupts and reporting completion directly from calling thread.

This is a dirty hack, since used THREAD_CAN_SLEEP() is not generally
reliable and supposed to be used only for assertions.

Ticket: #18277
(cherry picked from commit 72080dfcb1a8b4f28eeb0028194d839f257c8721)

History

#1 Updated by Alexander Motin almost 3 years ago

All patches available at the moment are committed to the freebsd10 branch.

#2 Updated by Kris Moore almost 3 years ago

Sasha, any other patches to bring in, or can we send this over to VB to get wrapped up?

#3 Updated by Alexander Motin almost 3 years ago

One more patch is under review in the OpenZFS git. It is hard to tell how long that may take, so maybe I'll push it to FreeBSD and merge it down anyway.

#4 Updated by Alexander Motin almost 3 years ago

  • Status changed from Fix In Progress to Ready For Release

OK. Now everything I have is in the freebsd10 branch and so is ready for 9.10.2. Optimization is usually an endless process, but so far I am out of ideas.

#5 Updated by Dru Lavigne over 1 year ago

  • Status changed from Ready For Release to Resolved
