[SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998

[SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998

Mauricio Faria de Oliveira-3
BugLink: https://bugs.launchpad.net/bugs/1810998

[Impact]

 * Users may experience CPU hard lockups when performing
   heavy writes to NVMe drives.

 * The fix addresses a scheduling issue in the original
   implementation of wbt/writeback throttling.

 * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
   contention and thundering herd issue in wbt_wait"),
   plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
   waking of tasks").

 * Plus a few dependency commits for each fix.

 * The backports are trivial: mainly replace rq_wait_inc_below()
   with the equivalent atomic_inc_below() (see the sketch below),
   and keep the __wbt_done() signature, both due to the lack of
   commit a79050434b45 ("blk-rq-qos: refactor out common elements
   of blk-wbt"), which changes a lot of other, unrelated code.
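
 * For reference only (not part of the patches), atomic_inc_below()
   in the Bionic/Cosmic block/blk-wbt.c is roughly the helper below;
   upstream's rq_wait_inc_below(rqw, limit) is equivalent to
   atomic_inc_below(&rqw->inflight, limit), which is what the
   backport substitutes:

   /* atomically increment *v only if it is currently below 'below' */
   static bool atomic_inc_below(atomic_t *v, int below)
   {
           int cur = atomic_read(v);

           for (;;) {
                   int old;

                   if (cur >= below)
                           return false;
                   /* try to take a slot; retry if another task raced us */
                   old = atomic_cmpxchg(v, cur, cur + 1);
                   if (old == cur)
                           break;
                   cur = old;
           }

           return true;
   }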

[Test Case]

 * This command has been reported to reproduce the problem:

   $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

 * It generates stack traces as below in the original kernel,
   and does not generate them in the modified/patched kernel.

 * The user/reporter verified that the test kernel with these
   patches resolved the problem.

 * The developer checked for regressions on 2 systems (4-core and
   24-core, both without NVMe), and no error messages were logged
   to dmesg.

[Regression Potential]

 * The regression potential is contained within the writeback
   throttling mechanism (block/blk-wbt.*).

 * The commits have been checked for follow-up fixes in linux-next
   as of 2019-01-08, and all known fix commits are included.

[Other Info]

 * The problem was introduced with the blk-wbt mechanism in
   v4.10-rc1, and the fix commits landed in v4.19-rc1 and -rc2,
   so only Bionic and Cosmic need this.

[Stack Traces]

[ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
...
[ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 393.628720] Call Trace:
[ 393.628721] <IRQ>
[ 393.628724] enqueue_task_fair+0x6c/0x7f0
[ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628731] activate_task+0x57/0xc0
[ 393.628735] ? sched_clock+0x9/0x10
[ 393.628736] ? sched_clock+0x9/0x10
[ 393.628738] ttwu_do_activate+0x49/0x90
[ 393.628739] try_to_wake_up+0x1df/0x490
[ 393.628741] default_wake_function+0x12/0x20
[ 393.628743] autoremove_wake_function+0x12/0x40
[ 393.628744] __wake_up_common+0x73/0x130
[ 393.628745] __wake_up_common_lock+0x80/0xc0
[ 393.628746] __wake_up+0x13/0x20
[ 393.628749] __wbt_done.part.21+0xa4/0xb0
[ 393.628749] wbt_done+0x72/0xa0
[ 393.628753] blk_mq_free_request+0xca/0x1a0
[ 393.628755] blk_mq_end_request+0x48/0x90
[ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 393.628764] __blk_mq_complete_request+0xd2/0x140
[ 393.628766] blk_mq_complete_request+0x18/0x20
[ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 393.628768] nvme_irq+0x23/0x50 [nvme]
[ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
[ 393.628773] handle_irq_event_percpu+0x32/0x80
[ 393.628774] handle_irq_event+0x3b/0x60
[ 393.628778] handle_edge_irq+0x7c/0x190
[ 393.628779] handle_irq+0x20/0x30
[ 393.628783] do_IRQ+0x46/0xd0
[ 393.628784] common_interrupt+0x84/0x84
[ 393.628785] </IRQ>
...
[ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
[ 393.628796] cpuidle_enter+0x17/0x20
[ 393.628797] call_cpuidle+0x23/0x40
[ 393.628798] do_idle+0x18c/0x1f0
[ 393.628799] cpu_startup_entry+0x73/0x80
[ 393.628802] start_secondary+0x1a6/0x200
[ 393.628804] secondary_startup_64+0xa5/0xb0
[ 393.628805] Code: ...

[ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled

[ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
[ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
[ 435.617025] Sending NMI from CPU 8 to CPUs 30:
[ 435.617029] NMI backtrace for cpu 30
[ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 435.617047] Call Trace:
[ 435.617048] <IRQ>
[ 435.617051] enqueue_entity+0x9f/0x6b0
[ 435.617053] enqueue_task_fair+0x6c/0x7f0
[ 435.617056] activate_task+0x57/0xc0
[ 435.617059] ? sched_clock+0x9/0x10
[ 435.617060] ? sched_clock+0x9/0x10
[ 435.617061] ttwu_do_activate+0x49/0x90
[ 435.617063] try_to_wake_up+0x1df/0x490
[ 435.617065] default_wake_function+0x12/0x20
[ 435.617067] autoremove_wake_function+0x12/0x40
[ 435.617068] __wake_up_common+0x73/0x130
[ 435.617069] __wake_up_common_lock+0x80/0xc0
[ 435.617070] __wake_up+0x13/0x20
[ 435.617073] __wbt_done.part.21+0xa4/0xb0
[ 435.617074] wbt_done+0x72/0xa0
[ 435.617077] blk_mq_free_request+0xca/0x1a0
[ 435.617079] blk_mq_end_request+0x48/0x90
[ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 435.617088] __blk_mq_complete_request+0xd2/0x140
[ 435.617090] blk_mq_complete_request+0x18/0x20
[ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 435.617093] nvme_irq+0x23/0x50 [nvme]
[ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
[ 435.617097] handle_irq_event_percpu+0x32/0x80
[ 435.617098] handle_irq_event+0x3b/0x60
[ 435.617101] handle_edge_irq+0x7c/0x190
[ 435.617102] handle_irq+0x20/0x30
[ 435.617106] do_IRQ+0x46/0xd0
[ 435.617107] common_interrupt+0x84/0x84
[ 435.617108] </IRQ>
...
[ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
[ 435.617118] cpuidle_enter+0x17/0x20
[ 435.617119] call_cpuidle+0x23/0x40
[ 435.617121] do_idle+0x18c/0x1f0
[ 435.617122] cpu_startup_entry+0x73/0x80
[ 435.617125] start_secondary+0x1a6/0x200
[ 435.617127] secondary_startup_64+0xa5/0xb0
[ 435.617128] Code: ...

Anchal Agarwal (1):
  blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Jens Axboe (5):
  blk-wbt: move disable check into get_limit()
  blk-wbt: use wq_has_sleeper() for wq active check
  blk-wbt: fix has-sleeper queueing check
  blk-wbt: abstract out end IO completion handler
  blk-wbt: improve waking of tasks

 block/blk-wbt.c | 107 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 75 insertions(+), 32 deletions(-)

--
2.17.1


[SRU C][PATCH v2 1/6] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Mauricio Faria de Oliveira-3
From: Anchal Agarwal <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled, with the hung task detector also
kicking in.

Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).

[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.948138] Call Trace:
[    0.948139]  <IRQ>
[    0.948142]  do_raw_spin_lock+0xad/0xc0
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.948149]  ? __wake_up_common_lock+0x53/0x90
[    0.948150]  __wake_up_common_lock+0x53/0x90
[    0.948155]  wbt_done+0x7b/0xa0
[    0.948158]  blk_mq_free_request+0xb7/0x110
[    0.948161]  __blk_mq_complete_request+0xcb/0x140
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
[    0.948169]  nvme_irq+0x23/0x50 [nvme]
[    0.948173]  __handle_irq_event_percpu+0x46/0x300
[    0.948176]  handle_irq_event_percpu+0x20/0x50
[    0.948179]  handle_irq_event+0x34/0x60
[    0.948181]  handle_edge_irq+0x77/0x190
[    0.948185]  handle_irq+0xaf/0x120
[    0.948188]  do_IRQ+0x53/0x110
[    0.948191]  common_interrupt+0x87/0x87
[    0.948192]  </IRQ>
....
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.311154] Call Trace:
[    0.311157]  do_raw_spin_lock+0xad/0xc0
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
[    0.311167]  wbt_wait+0x127/0x330
[    0.311169]  ? finish_wait+0x80/0x80
[    0.311172]  ? generic_make_request+0xda/0x3b0
[    0.311174]  blk_mq_make_request+0xd6/0x7b0
[    0.311176]  ? blk_queue_enter+0x24/0x260
[    0.311178]  ? generic_make_request+0xda/0x3b0
[    0.311181]  generic_make_request+0x10c/0x3b0
[    0.311183]  ? submit_bio+0x5c/0x110
[    0.311185]  submit_bio+0x5c/0x110
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
[    0.311229]  ? do_writepages+0x3c/0xd0
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[    0.311240]  do_writepages+0x3c/0xd0
[    0.311243]  ? _raw_spin_unlock+0x24/0x30
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
[    0.311253]  file_write_and_wait_range+0x34/0x90
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
[    0.311267]  do_fsync+0x38/0x60
[    0.311270]  SyS_fsync+0xc/0x10
[    0.311272]  do_syscall_64+0x6f/0x170
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only; however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:

if (waitqueue_active(&rqw->wait) &&
            rqw->wait.head.next != &wait->entry)
                return false;

The problem with this is that the wait entry in wbt_wait is
defined with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.
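
(For reference, DEFINE_WAIT and autoremove_wake_function() are the
stock helpers from include/linux/wait.h and kernel/sched/wait.c;
expanded, they are roughly:

    #define DEFINE_WAIT(name)                                       \
            struct wait_queue_entry name = {                        \
                    .private = current,                             \
                    .func    = autoremove_wake_function,            \
                    .entry   = LIST_HEAD_INIT((name).entry),        \
            }

    int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                 unsigned mode, int sync, void *key)
    {
            int ret = default_wake_function(wq_entry, mode, sync, key);

            /* on a successful wakeup the entry removes itself */
            if (ret)
                    list_del_init(&wq_entry->entry);
            return ret;
    }

so any entry that was woken is no longer on the wait list by the time
the check above runs.)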

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.

A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.

Signed-off-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Frank van der Linden <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit 2887e41b910bb14fd847cf01ab7a5993db989d88)
[mfo: backport:
 - s/rq_wait_inc_below(rqw/atomic_inc_below(&rqw->inflight/]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 55 +++++++++++++++++++++----------------------------
 1 file changed, 24 insertions(+), 31 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4f89b28fa652..5733d3ab8ed5 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -186,7 +186,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
- wake_up_all(&rqw->wait);
+ wake_up(&rqw->wait);
  }
 }
 
@@ -533,30 +533,6 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  return limit;
 }
 
-static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
-     wait_queue_entry_t *wait, unsigned long rw)
-{
- /*
- * inc it here even if disabled, since we'll dec it at completion.
- * this only happens if the task was sleeping in __wbt_wait(),
- * and someone turned it off at the same time.
- */
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- return true;
- }
-
- /*
- * If the waitqueue is already active and we are not the next
- * in line to be woken up, wait for our turn.
- */
- if (waitqueue_active(&rqw->wait) &&
-    rqw->wait.head.next != &wait->entry)
- return false;
-
- return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
-}
-
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -567,16 +543,32 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  __acquires(lock)
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
- DEFINE_WAIT(wait);
+ DECLARE_WAITQUEUE(wait, current);
+
+ /*
+ * inc it here even if disabled, since we'll dec it at completion.
+ * this only happens if the task was sleeping in __wbt_wait(),
+ * and someone turned it off at the same time.
+ */
+ if (!rwb_enabled(rwb)) {
+ atomic_inc(&rqw->inflight);
+ return;
+ }
 
- if (may_queue(rwb, rqw, &wait, rw))
+ if (!waitqueue_active(&rqw->wait)
+ && atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  return;
 
+ add_wait_queue_exclusive(&rqw->wait, &wait);
  do {
- prepare_to_wait_exclusive(&rqw->wait, &wait,
- TASK_UNINTERRUPTIBLE);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ if (!rwb_enabled(rwb)) {
+ atomic_inc(&rqw->inflight);
+ break;
+ }
 
- if (may_queue(rwb, rqw, &wait, rw))
+ if (atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  break;
 
  if (lock) {
@@ -587,7 +579,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  io_schedule();
  } while (1);
 
- finish_wait(&rqw->wait, &wait);
+ __set_current_state(TASK_RUNNING);
+ remove_wait_queue(&rqw->wait, &wait);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
--
2.17.1


[SRU C][PATCH v2 2/6] blk-wbt: move disable check into get_limit()

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Check it in one place, instead of in multiple places.

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit ffa358dcaae1f2f00926484e712e06daa8953cb4)
[mfo: backport:
 - blk-wbt.c:
   - hunk 2: s/rq_wait_inc_below(rqw/atomic_inc_below(&rqw->inflight/
   - hunk 3: s/rq_wait_inc_below(rqw/atomic_inc_below(&rqw->inflight/
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 5733d3ab8ed5..84e5cefbb3bb 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -508,6 +508,13 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 {
  unsigned int limit;
 
+ /*
+ * If we got disabled, just return UINT_MAX. This ensures that
+ * we'll properly inc a new IO, and dec+wakeup at the end.
+ */
+ if (!rwb_enabled(rwb))
+ return UINT_MAX;
+
  if ((rw & REQ_OP_MASK) == REQ_OP_DISCARD)
  return rwb->wb_background;
 
@@ -545,16 +552,6 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
 
- /*
- * inc it here even if disabled, since we'll dec it at completion.
- * this only happens if the task was sleeping in __wbt_wait(),
- * and someone turned it off at the same time.
- */
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- return;
- }
-
  if (!waitqueue_active(&rqw->wait)
  && atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  return;
@@ -563,11 +560,6 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  do {
  set_current_state(TASK_UNINTERRUPTIBLE);
 
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- break;
- }
-
  if (atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  break;
 
--
2.17.1


[SRU C][PATCH v2 3/6] blk-wbt: use wq_has_sleeper() for wq active check

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We need the memory barrier before checking the list head;
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().
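
(For reference, wq_has_sleeper() is the stock helper from
include/linux/wait.h; it is essentially:

    static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
    {
            /*
             * Pair with the barrier on the waiting side
             * (set_current_state()) so the list-head check is not
             * reordered before the waiter becomes visible.
             */
            smp_mb();
            return waitqueue_active(wq_head);
    }

i.e. waitqueue_active() with the needed full barrier in front.)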

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit b78820937b4762b7d30b807d7156bec1d89e4dd3)
[mfo: backport:
 - hunk 3: s/rq_wait_inc_below(rqw/atomic_inc_below(&rqw->inflight/]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 84e5cefbb3bb..08472c1a7858 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -139,7 +139,7 @@ static void rwb_wake_all(struct rq_wb *rwb)
  for (i = 0; i < WBT_NUM_RWQ; i++) {
  struct rq_wait *rqw = &rwb->rq_wait[i];
 
- if (waitqueue_active(&rqw->wait))
+ if (wq_has_sleeper(&rqw->wait))
  wake_up_all(&rqw->wait);
  }
 }
@@ -182,7 +182,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
  if (inflight && inflight >= limit)
  return;
 
- if (waitqueue_active(&rqw->wait)) {
+ if (wq_has_sleeper(&rqw->wait)) {
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
@@ -552,8 +552,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
 
- if (!waitqueue_active(&rqw->wait)
- && atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
+ if (!wq_has_sleeper(&rqw->wait) &&
+    atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  return;
 
  add_wait_queue_exclusive(&rqw->wait, &wait);
--
2.17.1


[SRU C][PATCH v2 4/6] blk-wbt: fix has-sleeper queueing check

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit c45e6a037a536530bd25781ac7c989e52deb2a63)
[mfo: backport:
 - hunk 1: s/rq_wait_inc_below(rqw/atomic_inc_below(&rqw->inflight/]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 08472c1a7858..d4f7a1bc1056 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -551,16 +551,17 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
+ bool has_sleeper;
 
- if (!wq_has_sleeper(&rqw->wait) &&
-    atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
+ has_sleeper = wq_has_sleeper(&rqw->wait);
+ if (!has_sleeper && atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  return;
 
  add_wait_queue_exclusive(&rqw->wait, &wait);
  do {
  set_current_state(TASK_UNINTERRUPTIBLE);
 
- if (atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
+ if (!has_sleeper && atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  break;
 
  if (lock) {
@@ -569,6 +570,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  spin_lock_irq(lock);
  } else
  io_schedule();
+ has_sleeper = false;
  } while (1);
 
  __set_current_state(TASK_RUNNING);
--
2.17.1


[SRU C][PATCH v2 5/6] blk-wbt: abstract out end IO completion handler

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Prep patch for calling the handler from a different context,
no functional changes in this patch.

Tested-by: Agarwal, Anchal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit 061a5427530633de93ace4ef001b99961984af62)
[mfo: backport: __wbt_done():
 - keep signature (not static; parameters; no 'rqos')
 - remove the cast to 'rwb' from 'rqos' (it doesn't exist).]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index d4f7a1bc1056..fe20486bd9b4 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -144,15 +144,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
  }
 }
 
-void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
+ enum wbt_flags wb_acct)
 {
- struct rq_wait *rqw;
  int inflight, limit;
 
- if (!(wb_acct & WBT_TRACKED))
- return;
-
- rqw = get_rq_wait(rwb, wb_acct);
  inflight = atomic_dec_return(&rqw->inflight);
 
  /*
@@ -190,6 +186,17 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
  }
 }
 
+void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+{
+ struct rq_wait *rqw;
+
+ if (!(wb_acct & WBT_TRACKED))
+ return;
+
+ rqw = get_rq_wait(rwb, wb_acct);
+ wbt_rqw_done(rwb, rqw, wb_acct);
+}
+
 /*
  * Called on completion of a request. Note that it's also called when
  * a request is merged, when the request gets freed.
--
2.17.1


[SRU C][PATCH v2 6/6] blk-wbt: improve waking of tasks

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We have two potential issues:

1) After commit 2887e41b910b, we only wake one process at a time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.

Based on a patch from Jianchao Wang <[hidden email]>.

Tested-by: Agarwal, Anchal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit 38cfb5a45ee013bfab5d1ae4c4738815e744b440)
[mfo: backport:
 - hunk 2: s/rq_wait_inc_below(data->rqw/atomic_inc_below(&data->rqw->inflight/
 - hunk 3: s/rq_wait_inc_below(rqw/atomic_inc_below(&rqw->inflight/]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 63 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index fe20486bd9b4..e9efcfc3a0d5 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -182,7 +182,7 @@ static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
- wake_up(&rqw->wait);
+ wake_up_all(&rqw->wait);
  }
 }
 
@@ -547,6 +547,34 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  return limit;
 }
 
+struct wbt_wait_data {
+ struct wait_queue_entry wq;
+ struct task_struct *task;
+ struct rq_wb *rwb;
+ struct rq_wait *rqw;
+ unsigned long rw;
+ bool got_token;
+};
+
+static int wbt_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+     int wake_flags, void *key)
+{
+ struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data,
+ wq);
+
+ /*
+ * If we fail to get a budget, return -1 to interrupt the wake up
+ * loop in __wake_up_common.
+ */
+ if (!atomic_inc_below(&data->rqw->inflight, get_limit(data->rwb, data->rw)))
+ return -1;
+
+ data->got_token = true;
+ list_del_init(&curr->entry);
+ wake_up_process(data->task);
+ return 1;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -557,19 +585,40 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  __acquires(lock)
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
- DECLARE_WAITQUEUE(wait, current);
+ struct wbt_wait_data data = {
+ .wq = {
+ .func = wbt_wake_function,
+ .entry = LIST_HEAD_INIT(data.wq.entry),
+ },
+ .task = current,
+ .rwb = rwb,
+ .rqw = rqw,
+ .rw = rw,
+ };
  bool has_sleeper;
 
  has_sleeper = wq_has_sleeper(&rqw->wait);
  if (!has_sleeper && atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
  return;
 
- add_wait_queue_exclusive(&rqw->wait, &wait);
+ prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
  do {
- set_current_state(TASK_UNINTERRUPTIBLE);
+ if (data.got_token)
+ break;
 
- if (!has_sleeper && atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
+ if (!has_sleeper &&
+    atomic_inc_below(&rqw->inflight, get_limit(rwb, rw))) {
+ finish_wait(&rqw->wait, &data.wq);
+
+ /*
+ * We raced with wbt_wake_function() getting a token,
+ * which means we now have two. Put our local token
+ * and wake anyone else potentially waiting for one.
+ */
+ if (data.got_token)
+ wbt_rqw_done(rwb, rqw, wb_acct);
  break;
+ }
 
  if (lock) {
  spin_unlock_irq(lock);
@@ -577,11 +626,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  spin_lock_irq(lock);
  } else
  io_schedule();
+
  has_sleeper = false;
  } while (1);
 
- __set_current_state(TASK_RUNNING);
- remove_wait_queue(&rqw->wait, &wait);
+ finish_wait(&rqw->wait, &data.wq);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
--
2.17.1


ACK: [SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998

Stefan Bader-2
In reply to this post by Mauricio Faria de Oliveira-3
On 11.01.19 12:08, Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1810998
>
> [snip]
>
>  block/blk-wbt.c | 107 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 75 insertions(+), 32 deletions(-)
>
Ok, this delta looks maintainable, touching only a very specific
driver, and there is a test case to make it verifiable.

Acked-by: Stefan Bader <[hidden email]>


Re: ACK: [SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998

Mauricio Faria de Oliveira-2
On Fri, Jan 11, 2019 at 9:18 AM Stefan Bader <[hidden email]> wrote:
>
> On 11.01.19 12:08, Mauricio Faria de Oliveira wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1810998
[snip]
> >  block/blk-wbt.c | 107 +++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 75 insertions(+), 32 deletions(-)
> >
>
> Ok, this delta looks maintainable and with touching only a very specific driver
> and a test-case to make it verifiable.
>
> Acked-by: Stefan Bader <[hidden email]>

Great; thanks for reviewing, Stefan.

cheers,

--
Mauricio Faria de Oliveira

ACK: [SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998

Kleber Souza
In reply to this post by Mauricio Faria de Oliveira-3
On 1/11/19 12:08 PM, Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1810998
>
> [snip]
>
>  block/blk-wbt.c | 107 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 75 insertions(+), 32 deletions(-)
>
Acked-by: Kleber Sacilotto de Souza <[hidden email]>


APPLIED: [SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998

Kleber Souza
In reply to this post by Mauricio Faria de Oliveira-3
On 1/11/19 12:08 PM, Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1810998
>
> [snip]
>
>  block/blk-wbt.c | 107 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 75 insertions(+), 32 deletions(-)
>
Applied to cosmic/master-next branch.

Thanks,
Kleber

