[SRU C][PATCH 0/8] blk-wbt: fix for LP#1810998

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 0/8] blk-wbt: fix for LP#1810998

Mauricio Faria de Oliveira-3
BugLink: https://bugs.launchpad.net/bugs/1810998

[Impact]

 * Users may experience cpu hard lockups when performing
   rigorous writes to NVMe drives.

 * The fix addresses an scheduling issue in the original
   implementation of wbt/writeback throttling

 * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
   contention and thundering herd issue in wbt_wait"),
   plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
   waking of tasks").

 * There are additional commits to help with a cleaner
   backport and future maintenance:
   - Cosmic: total of 8 clean cherry picks.
   - Bionic: total of 13 commits, 9 clean cherry-picks
     and 4 backports, which are just changes to context
     lines (i.e. refresh) without any functional changes
     in the backport itself.

[Test Case]

 * This command has been reported to reproduce the problem:

   $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

 * It generates stack traces as below in the original kernel,
   and does not generate them in the modified/patched kernel.

 * The user/reporter verified the test kernel with these patches
   resolved the problem.

 * The developer verified in 2 systems (4-core and 24-core but
   no NVMe) for regressions, and no error messages were logged
   to dmesg.

[Regression Potential]

 * The commits have been verified for fixes in later commits in
   linux-next as of 2019-01-08 and all known fix commits are in.

 * The regression potential is mostly contained in the writeback
   throttling (block/blk-wbt.*), as almost all of the 13 patches
   change exclusively that, except for 4 of them (2 of which are
   sysfs):

   - blk-rq-qos: refactor out common elements of blk-wbt (block/)
   - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-sysfs.c)
   - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-{mq-}sysfs.c)
   - block: pass struct request instead of struct blk_issue_stat to wbt (block/, still mostly blk-wbt.*)

[Other Info]

 * Alternatively, it is probably possible to introduce just the
   two commits that fix this with some changes to their code in
   the backport, but since the 'blk-rq-qos: refactor ..' commit
   may become a dependency for many additional/future fixes, it
   seemed interesting to pull it in earlier in the 18.04 cycle.

 * The problem has been introduced with the blk-wbt mechanism,
   in v4.10-rc1, and the fix commits in v4.19-rc1 and -rc2,
   so only Bionic and Cosmic needs this.

[Stack Traces]

[ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
...
[ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 393.628720] Call Trace:
[ 393.628721] <IRQ>
[ 393.628724] enqueue_task_fair+0x6c/0x7f0
[ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628731] activate_task+0x57/0xc0
[ 393.628735] ? sched_clock+0x9/0x10
[ 393.628736] ? sched_clock+0x9/0x10
[ 393.628738] ttwu_do_activate+0x49/0x90
[ 393.628739] try_to_wake_up+0x1df/0x490
[ 393.628741] default_wake_function+0x12/0x20
[ 393.628743] autoremove_wake_function+0x12/0x40
[ 393.628744] __wake_up_common+0x73/0x130
[ 393.628745] __wake_up_common_lock+0x80/0xc0
[ 393.628746] __wake_up+0x13/0x20
[ 393.628749] __wbt_done.part.21+0xa4/0xb0
[ 393.628749] wbt_done+0x72/0xa0
[ 393.628753] blk_mq_free_request+0xca/0x1a0
[ 393.628755] blk_mq_end_request+0x48/0x90
[ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 393.628764] __blk_mq_complete_request+0xd2/0x140
[ 393.628766] blk_mq_complete_request+0x18/0x20
[ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 393.628768] nvme_irq+0x23/0x50 [nvme]
[ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
[ 393.628773] handle_irq_event_percpu+0x32/0x80
[ 393.628774] handle_irq_event+0x3b/0x60
[ 393.628778] handle_edge_irq+0x7c/0x190
[ 393.628779] handle_irq+0x20/0x30
[ 393.628783] do_IRQ+0x46/0xd0
[ 393.628784] common_interrupt+0x84/0x84
[ 393.628785] </IRQ>
...
[ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
[ 393.628796] cpuidle_enter+0x17/0x20
[ 393.628797] call_cpuidle+0x23/0x40
[ 393.628798] do_idle+0x18c/0x1f0
[ 393.628799] cpu_startup_entry+0x73/0x80
[ 393.628802] start_secondary+0x1a6/0x200
[ 393.628804] secondary_startup_64+0xa5/0xb0
[ 393.628805] Code: ...

[ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled

[ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
[ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
[ 435.617025] Sending NMI from CPU 8 to CPUs 30:
[ 435.617029] NMI backtrace for cpu 30
[ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 435.617047] Call Trace:
[ 435.617048] <IRQ>
[ 435.617051] enqueue_entity+0x9f/0x6b0
[ 435.617053] enqueue_task_fair+0x6c/0x7f0
[ 435.617056] activate_task+0x57/0xc0
[ 435.617059] ? sched_clock+0x9/0x10
[ 435.617060] ? sched_clock+0x9/0x10
[ 435.617061] ttwu_do_activate+0x49/0x90
[ 435.617063] try_to_wake_up+0x1df/0x490
[ 435.617065] default_wake_function+0x12/0x20
[ 435.617067] autoremove_wake_function+0x12/0x40
[ 435.617068] __wake_up_common+0x73/0x130
[ 435.617069] __wake_up_common_lock+0x80/0xc0
[ 435.617070] __wake_up+0x13/0x20
[ 435.617073] __wbt_done.part.21+0xa4/0xb0
[ 435.617074] wbt_done+0x72/0xa0
[ 435.617077] blk_mq_free_request+0xca/0x1a0
[ 435.617079] blk_mq_end_request+0x48/0x90
[ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 435.617088] __blk_mq_complete_request+0xd2/0x140
[ 435.617090] blk_mq_complete_request+0x18/0x20
[ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 435.617093] nvme_irq+0x23/0x50 [nvme]
[ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
[ 435.617097] handle_irq_event_percpu+0x32/0x80
[ 435.617098] handle_irq_event+0x3b/0x60
[ 435.617101] handle_edge_irq+0x7c/0x190
[ 435.617102] handle_irq+0x20/0x30
[ 435.617106] do_IRQ+0x46/0xd0
[ 435.617107] common_interrupt+0x84/0x84
[ 435.617108] </IRQ>
...
[ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
[ 435.617118] cpuidle_enter+0x17/0x20
[ 435.617119] call_cpuidle+0x23/0x40
[ 435.617121] do_idle+0x18c/0x1f0
[ 435.617122] cpu_startup_entry+0x73/0x80
[ 435.617125] start_secondary+0x1a6/0x200
[ 435.617127] secondary_startup_64+0xa5/0xb0
[ 435.617128] Code: ...

Anchal Agarwal (1):
  blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Jens Axboe (5):
  blk-wbt: move disable check into get_limit()
  blk-wbt: use wq_has_sleeper() for wq active check
  blk-wbt: fix has-sleeper queueing check
  blk-wbt: abstract out end IO completion handler
  blk-wbt: improve waking of tasks

Josef Bacik (2):
  blk-rq-qos: refactor out common elements of blk-wbt
  blk-wbt: wake up all when we scale up, not down

 block/Makefile         |   2 +-
 block/blk-core.c       |  12 +-
 block/blk-mq.c         |  12 +-
 block/blk-rq-qos.c     | 178 +++++++++++++++++
 block/blk-rq-qos.h     | 106 ++++++++++
 block/blk-settings.c   |   4 +-
 block/blk-sysfs.c      |  22 +--
 block/blk-wbt.c        | 427 +++++++++++++++++++++--------------------
 block/blk-wbt.h        |  63 +++---
 include/linux/blkdev.h |   4 +-
 10 files changed, 550 insertions(+), 280 deletions(-)
 create mode 100644 block/blk-rq-qos.c
 create mode 100644 block/blk-rq-qos.h

--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 1/8] blk-rq-qos: refactor out common elements of blk-wbt

Mauricio Faria de Oliveira-3
From: Josef Bacik <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit a79050434b45959f397042080fd1d70ffa9bd9df)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/Makefile         |   2 +-
 block/blk-core.c       |  12 +-
 block/blk-mq.c         |  12 +-
 block/blk-rq-qos.c     | 178 ++++++++++++++++++++++
 block/blk-rq-qos.h     | 106 ++++++++++++++
 block/blk-settings.c   |   4 +-
 block/blk-sysfs.c      |  22 ++-
 block/blk-wbt.c        | 326 ++++++++++++++++++-----------------------
 block/blk-wbt.h        |  63 +++-----
 include/linux/blkdev.h |   4 +-
 10 files changed, 478 insertions(+), 251 deletions(-)
 create mode 100644 block/blk-rq-qos.c
 create mode 100644 block/blk-rq-qos.h

diff --git a/block/Makefile b/block/Makefile
index 6a56303b9925..981605042cad 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
  blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
  blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
  genhd.o partition-generic.o ioprio.o \
- badblocks.o partitions/
+ badblocks.o partitions/ blk-rq-qos.o
 
 obj-$(CONFIG_BOUNCE) += bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o
diff --git a/block/blk-core.c b/block/blk-core.c
index f9d2e1b66e05..1f5ce35a2630 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1652,7 +1652,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
  blk_delete_timer(rq);
  blk_clear_rq_complete(rq);
  trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, rq);
+ rq_qos_requeue(q, rq);
 
  if (rq->rq_flags & RQF_QUEUED)
  blk_queue_end_tag(q, rq);
@@ -1759,7 +1759,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
  /* this is a bio leak */
  WARN_ON(req->bio != NULL);
 
- wbt_done(q->rq_wb, req);
+ rq_qos_done(q, req);
 
  /*
  * Request may not have originated from ll_rw_blk. if not,
@@ -2051,7 +2051,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
  }
 
 get_rq:
- wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);
+ wb_acct = rq_qos_throttle(q, bio, q->queue_lock);
 
  /*
  * Grab a free request. This is might sleep but can not fail.
@@ -2061,7 +2061,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
  req = get_request(q, bio->bi_opf, bio, 0, GFP_NOIO);
  if (IS_ERR(req)) {
  blk_queue_exit(q);
- __wbt_done(q->rq_wb, wb_acct);
+ rq_qos_cleanup(q, wb_acct);
  if (PTR_ERR(req) == -ENOMEM)
  bio->bi_status = BLK_STS_RESOURCE;
  else
@@ -2996,7 +2996,7 @@ void blk_start_request(struct request *req)
  req->throtl_size = blk_rq_sectors(req);
 #endif
  req->rq_flags |= RQF_STATS;
- wbt_issue(req->q->rq_wb, req);
+ rq_qos_issue(req->q, req);
  }
 
  BUG_ON(blk_rq_is_complete(req));
@@ -3216,7 +3216,7 @@ void blk_finish_request(struct request *req, blk_status_t error)
  blk_account_io_done(req, now);
 
  if (req->end_io) {
- wbt_done(req->q->rq_wb, req);
+ rq_qos_done(q, req);
  req->end_io(req, error);
  } else {
  if (blk_bidi_rq(req))
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90e8058ae557..0daa4c32727b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -506,7 +506,7 @@ void blk_mq_free_request(struct request *rq)
  if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
  laptop_io_completion(q->backing_dev_info);
 
- wbt_done(q->rq_wb, rq);
+ rq_qos_done(q, rq);
 
  if (blk_rq_rl(rq))
  blk_put_rl(blk_rq_rl(rq));
@@ -529,7 +529,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
  blk_account_io_done(rq, now);
 
  if (rq->end_io) {
- wbt_done(rq->q->rq_wb, rq);
+ rq_qos_done(rq->q, rq);
  rq->end_io(rq, error);
  } else {
  if (unlikely(blk_bidi_rq(rq)))
@@ -641,7 +641,7 @@ void blk_mq_start_request(struct request *rq)
  rq->throtl_size = blk_rq_sectors(rq);
 #endif
  rq->rq_flags |= RQF_STATS;
- wbt_issue(q->rq_wb, rq);
+ rq_qos_issue(q, rq);
  }
 
  WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
@@ -667,7 +667,7 @@ static void __blk_mq_requeue_request(struct request *rq)
  blk_mq_put_driver_tag(rq);
 
  trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, rq);
+ rq_qos_requeue(q, rq);
 
  if (blk_mq_request_started(rq)) {
  WRITE_ONCE(rq->state, MQ_RQ_IDLE);
@@ -1776,13 +1776,13 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
  if (blk_mq_sched_bio_merge(q, bio))
  return BLK_QC_T_NONE;
 
- wb_acct = wbt_wait(q->rq_wb, bio, NULL);
+ wb_acct = rq_qos_throttle(q, bio, NULL);
 
  trace_block_getrq(q, bio, bio->bi_opf);
 
  rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
  if (unlikely(!rq)) {
- __wbt_done(q->rq_wb, wb_acct);
+ rq_qos_cleanup(q, wb_acct);
  if (bio->bi_opf & REQ_NOWAIT)
  bio_wouldblock_error(bio);
  return BLK_QC_T_NONE;
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
new file mode 100644
index 000000000000..d2f2af8aa10c
--- /dev/null
+++ b/block/blk-rq-qos.c
@@ -0,0 +1,178 @@
+#include "blk-rq-qos.h"
+
+#include "blk-wbt.h"
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+ int cur = atomic_read(v);
+
+ for (;;) {
+ int old;
+
+ if (cur >= below)
+ return false;
+ old = atomic_cmpxchg(v, cur, cur + 1);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+
+ return true;
+}
+
+bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit)
+{
+ return atomic_inc_below(&rq_wait->inflight, limit);
+}
+
+void rq_qos_cleanup(struct request_queue *q, enum wbt_flags wb_acct)
+{
+ struct rq_qos *rqos;
+
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->cleanup)
+ rqos->ops->cleanup(rqos, wb_acct);
+ }
+}
+
+void rq_qos_done(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->done)
+ rqos->ops->done(rqos, rq);
+ }
+}
+
+void rq_qos_issue(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->issue)
+ rqos->ops->issue(rqos, rq);
+ }
+}
+
+void rq_qos_requeue(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->requeue)
+ rqos->ops->requeue(rqos, rq);
+ }
+}
+
+enum wbt_flags rq_qos_throttle(struct request_queue *q, struct bio *bio,
+       spinlock_t *lock)
+{
+ struct rq_qos *rqos;
+ enum wbt_flags flags = 0;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->throttle)
+ flags |= rqos->ops->throttle(rqos, bio, lock);
+ }
+ return flags;
+}
+
+/*
+ * Return true, if we can't increase the depth further by scaling
+ */
+bool rq_depth_calc_max_depth(struct rq_depth *rqd)
+{
+ unsigned int depth;
+ bool ret = false;
+
+ /*
+ * For QD=1 devices, this is a special case. It's important for those
+ * to have one request ready when one completes, so force a depth of
+ * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+ * since the device can't have more than that in flight. If we're
+ * scaling down, then keep a setting of 1/1/1.
+ */
+ if (rqd->queue_depth == 1) {
+ if (rqd->scale_step > 0)
+ rqd->max_depth = 1;
+ else {
+ rqd->max_depth = 2;
+ ret = true;
+ }
+ } else {
+ /*
+ * scale_step == 0 is our default state. If we have suffered
+ * latency spikes, step will be > 0, and we shrink the
+ * allowed write depths. If step is < 0, we're only doing
+ * writes, and we allow a temporarily higher depth to
+ * increase performance.
+ */
+ depth = min_t(unsigned int, rqd->default_depth,
+      rqd->queue_depth);
+ if (rqd->scale_step > 0)
+ depth = 1 + ((depth - 1) >> min(31, rqd->scale_step));
+ else if (rqd->scale_step < 0) {
+ unsigned int maxd = 3 * rqd->queue_depth / 4;
+
+ depth = 1 + ((depth - 1) << -rqd->scale_step);
+ if (depth > maxd) {
+ depth = maxd;
+ ret = true;
+ }
+ }
+
+ rqd->max_depth = depth;
+ }
+
+ return ret;
+}
+
+void rq_depth_scale_up(struct rq_depth *rqd)
+{
+ /*
+ * Hit max in previous round, stop here
+ */
+ if (rqd->scaled_max)
+ return;
+
+ rqd->scale_step--;
+
+ rqd->scaled_max = rq_depth_calc_max_depth(rqd);
+}
+
+/*
+ * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
+ * had a latency violation.
+ */
+void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle)
+{
+ /*
+ * Stop scaling down when we've hit the limit. This also prevents
+ * ->scale_step from going to crazy values, if the device can't
+ * keep up.
+ */
+ if (rqd->max_depth == 1)
+ return;
+
+ if (rqd->scale_step < 0 && hard_throttle)
+ rqd->scale_step = 0;
+ else
+ rqd->scale_step++;
+
+ rqd->scaled_max = false;
+ rq_depth_calc_max_depth(rqd);
+}
+
+void rq_qos_exit(struct request_queue *q)
+{
+ while (q->rq_qos) {
+ struct rq_qos *rqos = q->rq_qos;
+ q->rq_qos = rqos->next;
+ rqos->ops->exit(rqos);
+ }
+}
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
new file mode 100644
index 000000000000..f9a39bd6ece3
--- /dev/null
+++ b/block/blk-rq-qos.h
@@ -0,0 +1,106 @@
+#ifndef RQ_QOS_H
+#define RQ_QOS_H
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk_types.h>
+#include <linux/atomic.h>
+#include <linux/wait.h>
+
+enum rq_qos_id {
+ RQ_QOS_WBT,
+ RQ_QOS_CGROUP,
+};
+
+struct rq_wait {
+ wait_queue_head_t wait;
+ atomic_t inflight;
+};
+
+struct rq_qos {
+ struct rq_qos_ops *ops;
+ struct request_queue *q;
+ enum rq_qos_id id;
+ struct rq_qos *next;
+};
+
+struct rq_qos_ops {
+ enum wbt_flags (*throttle)(struct rq_qos *, struct bio *,
+   spinlock_t *);
+ void (*issue)(struct rq_qos *, struct request *);
+ void (*requeue)(struct rq_qos *, struct request *);
+ void (*done)(struct rq_qos *, struct request *);
+ void (*cleanup)(struct rq_qos *, enum wbt_flags);
+ void (*exit)(struct rq_qos *);
+};
+
+struct rq_depth {
+ unsigned int max_depth;
+
+ int scale_step;
+ bool scaled_max;
+
+ unsigned int queue_depth;
+ unsigned int default_depth;
+};
+
+static inline struct rq_qos *rq_qos_id(struct request_queue *q,
+       enum rq_qos_id id)
+{
+ struct rq_qos *rqos;
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->id == id)
+ break;
+ }
+ return rqos;
+}
+
+static inline struct rq_qos *wbt_rq_qos(struct request_queue *q)
+{
+ return rq_qos_id(q, RQ_QOS_WBT);
+}
+
+static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
+{
+ return rq_qos_id(q, RQ_QOS_CGROUP);
+}
+
+static inline void rq_wait_init(struct rq_wait *rq_wait)
+{
+ atomic_set(&rq_wait->inflight, 0);
+ init_waitqueue_head(&rq_wait->wait);
+}
+
+static inline void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
+{
+ rqos->next = q->rq_qos;
+ q->rq_qos = rqos;
+}
+
+static inline void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
+{
+ struct rq_qos *cur, *prev = NULL;
+ for (cur = q->rq_qos; cur; cur = cur->next) {
+ if (cur == rqos) {
+ if (prev)
+ prev->next = rqos->next;
+ else
+ q->rq_qos = cur;
+ break;
+ }
+ prev = cur;
+ }
+}
+
+bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit);
+void rq_depth_scale_up(struct rq_depth *rqd);
+void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle);
+bool rq_depth_calc_max_depth(struct rq_depth *rqd);
+
+void rq_qos_cleanup(struct request_queue *, enum wbt_flags);
+void rq_qos_done(struct request_queue *, struct request *);
+void rq_qos_issue(struct request_queue *, struct request *);
+void rq_qos_requeue(struct request_queue *, struct request *);
+enum wbt_flags rq_qos_throttle(struct request_queue *, struct bio *, spinlock_t *);
+void rq_qos_exit(struct request_queue *);
+#endif
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 24fff4a3d08a..ffd459969689 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -875,7 +875,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
  q->queue_depth = depth;
- wbt_set_queue_depth(q->rq_wb, depth);
+ wbt_set_queue_depth(q, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
@@ -900,7 +900,7 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
  queue_flag_clear(QUEUE_FLAG_FUA, q);
  spin_unlock_irq(q->queue_lock);
 
- wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+ wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 96c7dfc04852..380bc284ced1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -422,16 +422,16 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 
 static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
 {
- if (!q->rq_wb)
+ if (!wbt_rq_qos(q))
  return -EINVAL;
 
- return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+ return sprintf(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
 }
 
 static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
   size_t count)
 {
- struct rq_wb *rwb;
+ struct rq_qos *rqos;
  ssize_t ret;
  s64 val;
 
@@ -441,23 +441,21 @@ static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
  if (val < -1)
  return -EINVAL;
 
- rwb = q->rq_wb;
- if (!rwb) {
+ rqos = wbt_rq_qos(q);
+ if (!rqos) {
  ret = wbt_init(q);
  if (ret)
  return ret;
  }
 
- rwb = q->rq_wb;
  if (val == -1)
- rwb->min_lat_nsec = wbt_default_latency_nsec(q);
+ val = wbt_default_latency_nsec(q);
  else if (val >= 0)
- rwb->min_lat_nsec = val * 1000ULL;
+ val *= 1000ULL;
 
- if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
- rwb->enable_state = WBT_STATE_ON_MANUAL;
+ wbt_set_min_lat(q, val);
 
- wbt_update_limits(rwb);
+ wbt_update_limits(q);
  return count;
 }
 
@@ -979,7 +977,7 @@ void blk_unregister_queue(struct gendisk *disk)
  kobject_del(&q->kobj);
  blk_trace_remove_sysfs(disk_to_dev(disk));
 
- wbt_exit(q);
+ rq_qos_exit(q);
 
  mutex_lock(&q->sysfs_lock);
  if (q->request_fn || (q->mq_ops && q->elevator))
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4f89b28fa652..6fe20fb823e4 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -25,6 +25,7 @@
 #include <linux/swap.h>
 
 #include "blk-wbt.h"
+#include "blk-rq-qos.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/wbt.h>
@@ -78,28 +79,6 @@ static inline bool rwb_enabled(struct rq_wb *rwb)
  return rwb && rwb->wb_normal != 0;
 }
 
-/*
- * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
- * false if 'v' + 1 would be bigger than 'below'.
- */
-static bool atomic_inc_below(atomic_t *v, int below)
-{
- int cur = atomic_read(v);
-
- for (;;) {
- int old;
-
- if (cur >= below)
- return false;
- old = atomic_cmpxchg(v, cur, cur + 1);
- if (old == cur)
- break;
- cur = old;
- }
-
- return true;
-}
-
 static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
 {
  if (rwb_enabled(rwb)) {
@@ -116,7 +95,7 @@ static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
  */
 static bool wb_recent_wait(struct rq_wb *rwb)
 {
- struct bdi_writeback *wb = &rwb->queue->backing_dev_info->wb;
+ struct bdi_writeback *wb = &rwb->rqos.q->backing_dev_info->wb;
 
  return time_before(jiffies, wb->dirty_sleep + HZ);
 }
@@ -144,8 +123,9 @@ static void rwb_wake_all(struct rq_wb *rwb)
  }
 }
 
-void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 {
+ struct rq_wb *rwb = RQWB(rqos);
  struct rq_wait *rqw;
  int inflight, limit;
 
@@ -194,10 +174,9 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
  * Called on completion of a request. Note that it's also called when
  * a request is merged, when the request gets freed.
  */
-void wbt_done(struct rq_wb *rwb, struct request *rq)
+static void wbt_done(struct rq_qos *rqos, struct request *rq)
 {
- if (!rwb)
- return;
+ struct rq_wb *rwb = RQWB(rqos);
 
  if (!wbt_is_tracked(rq)) {
  if (rwb->sync_cookie == rq) {
@@ -209,72 +188,11 @@ void wbt_done(struct rq_wb *rwb, struct request *rq)
  wb_timestamp(rwb, &rwb->last_comp);
  } else {
  WARN_ON_ONCE(rq == rwb->sync_cookie);
- __wbt_done(rwb, wbt_flags(rq));
+ __wbt_done(rqos, wbt_flags(rq));
  }
  wbt_clear_state(rq);
 }
 
-/*
- * Return true, if we can't increase the depth further by scaling
- */
-static bool calc_wb_limits(struct rq_wb *rwb)
-{
- unsigned int depth;
- bool ret = false;
-
- if (!rwb->min_lat_nsec) {
- rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
- return false;
- }
-
- /*
- * For QD=1 devices, this is a special case. It's important for those
- * to have one request ready when one completes, so force a depth of
- * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
- * since the device can't have more than that in flight. If we're
- * scaling down, then keep a setting of 1/1/1.
- */
- if (rwb->queue_depth == 1) {
- if (rwb->scale_step > 0)
- rwb->wb_max = rwb->wb_normal = 1;
- else {
- rwb->wb_max = rwb->wb_normal = 2;
- ret = true;
- }
- rwb->wb_background = 1;
- } else {
- /*
- * scale_step == 0 is our default state. If we have suffered
- * latency spikes, step will be > 0, and we shrink the
- * allowed write depths. If step is < 0, we're only doing
- * writes, and we allow a temporarily higher depth to
- * increase performance.
- */
- depth = min_t(unsigned int, RWB_DEF_DEPTH, rwb->queue_depth);
- if (rwb->scale_step > 0)
- depth = 1 + ((depth - 1) >> min(31, rwb->scale_step));
- else if (rwb->scale_step < 0) {
- unsigned int maxd = 3 * rwb->queue_depth / 4;
-
- depth = 1 + ((depth - 1) << -rwb->scale_step);
- if (depth > maxd) {
- depth = maxd;
- ret = true;
- }
- }
-
- /*
- * Set our max/normal/bg queue depths based on how far
- * we have scaled down (->scale_step).
- */
- rwb->wb_max = depth;
- rwb->wb_normal = (rwb->wb_max + 1) / 2;
- rwb->wb_background = (rwb->wb_max + 3) / 4;
- }
-
- return ret;
-}
-
 static inline bool stat_sample_valid(struct blk_rq_stat *stat)
 {
  /*
@@ -307,7 +225,8 @@ enum {
 
 static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
 {
- struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
+ struct backing_dev_info *bdi = rwb->rqos.q->backing_dev_info;
+ struct rq_depth *rqd = &rwb->rq_depth;
  u64 thislat;
 
  /*
@@ -351,7 +270,7 @@ static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
  return LAT_EXCEEDED;
  }
 
- if (rwb->scale_step)
+ if (rqd->scale_step)
  trace_wbt_stat(bdi, stat);
 
  return LAT_OK;
@@ -359,58 +278,48 @@ static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
 
 static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
 {
- struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
+ struct backing_dev_info *bdi = rwb->rqos.q->backing_dev_info;
+ struct rq_depth *rqd = &rwb->rq_depth;
 
- trace_wbt_step(bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
- rwb->wb_background, rwb->wb_normal, rwb->wb_max);
+ trace_wbt_step(bdi, msg, rqd->scale_step, rwb->cur_win_nsec,
+ rwb->wb_background, rwb->wb_normal, rqd->max_depth);
 }
 
-static void scale_up(struct rq_wb *rwb)
+static void calc_wb_limits(struct rq_wb *rwb)
 {
- /*
- * Hit max in previous round, stop here
- */
- if (rwb->scaled_max)
- return;
+ if (rwb->min_lat_nsec == 0) {
+ rwb->wb_normal = rwb->wb_background = 0;
+ } else if (rwb->rq_depth.max_depth <= 2) {
+ rwb->wb_normal = rwb->rq_depth.max_depth;
+ rwb->wb_background = 1;
+ } else {
+ rwb->wb_normal = (rwb->rq_depth.max_depth + 1) / 2;
+ rwb->wb_background = (rwb->rq_depth.max_depth + 3) / 4;
+ }
+}
 
- rwb->scale_step--;
+static void scale_up(struct rq_wb *rwb)
+{
+ rq_depth_scale_up(&rwb->rq_depth);
+ calc_wb_limits(rwb);
  rwb->unknown_cnt = 0;
-
- rwb->scaled_max = calc_wb_limits(rwb);
-
- rwb_wake_all(rwb);
-
- rwb_trace_step(rwb, "step up");
+ rwb_trace_step(rwb, "scale up");
 }
 
-/*
- * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
- * had a latency violation.
- */
 static void scale_down(struct rq_wb *rwb, bool hard_throttle)
 {
- /*
- * Stop scaling down when we've hit the limit. This also prevents
- * ->scale_step from going to crazy values, if the device can't
- * keep up.
- */
- if (rwb->wb_max == 1)
- return;
-
- if (rwb->scale_step < 0 && hard_throttle)
- rwb->scale_step = 0;
- else
- rwb->scale_step++;
-
- rwb->scaled_max = false;
- rwb->unknown_cnt = 0;
+ rq_depth_scale_down(&rwb->rq_depth, hard_throttle);
  calc_wb_limits(rwb);
- rwb_trace_step(rwb, "step down");
+ rwb->unknown_cnt = 0;
+ rwb_wake_all(rwb);
+ rwb_trace_step(rwb, "scale down");
 }
 
 static void rwb_arm_timer(struct rq_wb *rwb)
 {
- if (rwb->scale_step > 0) {
+ struct rq_depth *rqd = &rwb->rq_depth;
+
+ if (rqd->scale_step > 0) {
  /*
  * We should speed this up, using some variant of a fast
  * integer inverse square root calculation. Since we only do
@@ -418,7 +327,7 @@ static void rwb_arm_timer(struct rq_wb *rwb)
  * though.
  */
  rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
- int_sqrt((rwb->scale_step + 1) << 8));
+ int_sqrt((rqd->scale_step + 1) << 8));
  } else {
  /*
  * For step < 0, we don't want to increase/decrease the
@@ -433,12 +342,13 @@ static void rwb_arm_timer(struct rq_wb *rwb)
 static void wb_timer_fn(struct blk_stat_callback *cb)
 {
  struct rq_wb *rwb = cb->data;
+ struct rq_depth *rqd = &rwb->rq_depth;
  unsigned int inflight = wbt_inflight(rwb);
  int status;
 
  status = latency_exceeded(rwb, cb->stat);
 
- trace_wbt_timer(rwb->queue->backing_dev_info, status, rwb->scale_step,
+ trace_wbt_timer(rwb->rqos.q->backing_dev_info, status, rqd->scale_step,
  inflight);
 
  /*
@@ -469,9 +379,9 @@ static void wb_timer_fn(struct blk_stat_callback *cb)
  * currently don't have a valid read/write sample. For that
  * case, slowly return to center state (step == 0).
  */
- if (rwb->scale_step > 0)
+ if (rqd->scale_step > 0)
  scale_up(rwb);
- else if (rwb->scale_step < 0)
+ else if (rqd->scale_step < 0)
  scale_down(rwb, false);
  break;
  default:
@@ -481,19 +391,50 @@ static void wb_timer_fn(struct blk_stat_callback *cb)
  /*
  * Re-arm timer, if we have IO in flight
  */
- if (rwb->scale_step || inflight)
+ if (rqd->scale_step || inflight)
  rwb_arm_timer(rwb);
 }
 
-void wbt_update_limits(struct rq_wb *rwb)
+static void __wbt_update_limits(struct rq_wb *rwb)
 {
- rwb->scale_step = 0;
- rwb->scaled_max = false;
+ struct rq_depth *rqd = &rwb->rq_depth;
+
+ rqd->scale_step = 0;
+ rqd->scaled_max = false;
+
+ rq_depth_calc_max_depth(rqd);
  calc_wb_limits(rwb);
 
  rwb_wake_all(rwb);
 }
 
+void wbt_update_limits(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return;
+ __wbt_update_limits(RQWB(rqos));
+}
+
+u64 wbt_get_min_lat(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return 0;
+ return RQWB(rqos)->min_lat_nsec;
+}
+
+void wbt_set_min_lat(struct request_queue *q, u64 val)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return;
+ RQWB(rqos)->min_lat_nsec = val;
+ RQWB(rqos)->enable_state = WBT_STATE_ON_MANUAL;
+ __wbt_update_limits(RQWB(rqos));
+}
+
+
 static bool close_io(struct rq_wb *rwb)
 {
  const unsigned long now = jiffies;
@@ -520,7 +461,7 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  * IO for a bit.
  */
  if ((rw & REQ_HIPRIO) || wb_recent_wait(rwb) || current_is_kswapd())
- limit = rwb->wb_max;
+ limit = rwb->rq_depth.max_depth;
  else if ((rw & REQ_BACKGROUND) || close_io(rwb)) {
  /*
  * If less than 100ms since we completed unrelated IO,
@@ -554,7 +495,7 @@ static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
     rqw->wait.head.next != &wait->entry)
  return false;
 
- return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
+ return rq_wait_inc_below(rqw, get_limit(rwb, rw));
 }
 
 /*
@@ -614,8 +555,10 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
  * in an irq held spinlock, if it holds one when calling this function.
  * If we do sleep, we'll release and re-grab it.
  */
-enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
+static enum wbt_flags wbt_wait(struct rq_qos *rqos, struct bio *bio,
+       spinlock_t *lock)
 {
+ struct rq_wb *rwb = RQWB(rqos);
  enum wbt_flags ret = 0;
 
  if (!rwb_enabled(rwb))
@@ -643,8 +586,10 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
  return ret | WBT_TRACKED;
 }
 
-void wbt_issue(struct rq_wb *rwb, struct request *rq)
+void wbt_issue(struct rq_qos *rqos, struct request *rq)
 {
+ struct rq_wb *rwb = RQWB(rqos);
+
  if (!rwb_enabled(rwb))
  return;
 
@@ -661,8 +606,9 @@ void wbt_issue(struct rq_wb *rwb, struct request *rq)
  }
 }
 
-void wbt_requeue(struct rq_wb *rwb, struct request *rq)
+void wbt_requeue(struct rq_qos *rqos, struct request *rq)
 {
+ struct rq_wb *rwb = RQWB(rqos);
  if (!rwb_enabled(rwb))
  return;
  if (rq == rwb->sync_cookie) {
@@ -671,39 +617,30 @@ void wbt_requeue(struct rq_wb *rwb, struct request *rq)
  }
 }
 
-void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
- if (rwb) {
- rwb->queue_depth = depth;
- wbt_update_limits(rwb);
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (rqos) {
+ RQWB(rqos)->rq_depth.queue_depth = depth;
+ __wbt_update_limits(RQWB(rqos));
  }
 }
 
-void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
-{
- if (rwb)
- rwb->wc = write_cache_on;
-}
-
-/*
- * Disable wbt, if enabled by default.
- */
-void wbt_disable_default(struct request_queue *q)
+void wbt_set_write_cache(struct request_queue *q, bool write_cache_on)
 {
- struct rq_wb *rwb = q->rq_wb;
-
- if (rwb && rwb->enable_state == WBT_STATE_ON_DEFAULT)
- wbt_exit(q);
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (rqos)
+ RQWB(rqos)->wc = write_cache_on;
 }
-EXPORT_SYMBOL_GPL(wbt_disable_default);
 
 /*
  * Enable wbt if defaults are configured that way
  */
 void wbt_enable_default(struct request_queue *q)
 {
+ struct rq_qos *rqos = wbt_rq_qos(q);
  /* Throttling already enabled? */
- if (q->rq_wb)
+ if (rqos)
  return;
 
  /* Queue not registered? Maybe shutting down... */
@@ -741,6 +678,41 @@ static int wbt_data_dir(const struct request *rq)
  return -1;
 }
 
+static void wbt_exit(struct rq_qos *rqos)
+{
+ struct rq_wb *rwb = RQWB(rqos);
+ struct request_queue *q = rqos->q;
+
+ blk_stat_remove_callback(q, rwb->cb);
+ blk_stat_free_callback(rwb->cb);
+ kfree(rwb);
+}
+
+/*
+ * Disable wbt, if enabled by default.
+ */
+void wbt_disable_default(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ struct rq_wb *rwb;
+ if (!rqos)
+ return;
+ rwb = RQWB(rqos);
+ if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
+ rwb->wb_normal = 0;
+}
+EXPORT_SYMBOL_GPL(wbt_disable_default);
+
+
+static struct rq_qos_ops wbt_rqos_ops = {
+ .throttle = wbt_wait,
+ .issue = wbt_issue,
+ .requeue = wbt_requeue,
+ .done = wbt_done,
+ .cleanup = __wbt_done,
+ .exit = wbt_exit,
+};
+
 int wbt_init(struct request_queue *q)
 {
  struct rq_wb *rwb;
@@ -756,39 +728,29 @@ int wbt_init(struct request_queue *q)
  return -ENOMEM;
  }
 
- for (i = 0; i < WBT_NUM_RWQ; i++) {
- atomic_set(&rwb->rq_wait[i].inflight, 0);
- init_waitqueue_head(&rwb->rq_wait[i].wait);
- }
+ for (i = 0; i < WBT_NUM_RWQ; i++)
+ rq_wait_init(&rwb->rq_wait[i]);
 
+ rwb->rqos.id = RQ_QOS_WBT;
+ rwb->rqos.ops = &wbt_rqos_ops;
+ rwb->rqos.q = q;
  rwb->last_comp = rwb->last_issue = jiffies;
- rwb->queue = q;
  rwb->win_nsec = RWB_WINDOW_NSEC;
  rwb->enable_state = WBT_STATE_ON_DEFAULT;
- wbt_update_limits(rwb);
+ rwb->wc = 1;
+ rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
+ __wbt_update_limits(rwb);
 
  /*
  * Assign rwb and add the stats callback.
  */
- q->rq_wb = rwb;
+ rq_qos_add(q, &rwb->rqos);
  blk_stat_add_callback(q, rwb->cb);
 
  rwb->min_lat_nsec = wbt_default_latency_nsec(q);
 
- wbt_set_queue_depth(rwb, blk_queue_depth(q));
- wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+ wbt_set_queue_depth(q, blk_queue_depth(q));
+ wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 
  return 0;
 }
-
-void wbt_exit(struct request_queue *q)
-{
- struct rq_wb *rwb = q->rq_wb;
-
- if (rwb) {
- blk_stat_remove_callback(q, rwb->cb);
- blk_stat_free_callback(rwb->cb);
- q->rq_wb = NULL;
- kfree(rwb);
- }
-}
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 300df531d0a6..53b20a58c0a2 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -9,6 +9,7 @@
 #include <linux/ktime.h>
 
 #include "blk-stat.h"
+#include "blk-rq-qos.h"
 
 enum wbt_flags {
  WBT_TRACKED = 1, /* write, tracked for throttling */
@@ -35,20 +36,12 @@ enum {
  WBT_STATE_ON_MANUAL = 2,
 };
 
-struct rq_wait {
- wait_queue_head_t wait;
- atomic_t inflight;
-};
-
 struct rq_wb {
  /*
  * Settings that govern how we throttle
  */
  unsigned int wb_background; /* background writeback */
  unsigned int wb_normal; /* normal writeback */
- unsigned int wb_max; /* max throughput writeback */
- int scale_step;
- bool scaled_max;
 
  short enable_state; /* WBT_STATE_* */
 
@@ -67,15 +60,20 @@ struct rq_wb {
  void *sync_cookie;
 
  unsigned int wc;
- unsigned int queue_depth;
 
  unsigned long last_issue; /* last non-throttled issue */
  unsigned long last_comp; /* last non-throttled comp */
  unsigned long min_lat_nsec;
- struct request_queue *queue;
+ struct rq_qos rqos;
  struct rq_wait rq_wait[WBT_NUM_RWQ];
+ struct rq_depth rq_depth;
 };
 
+static inline struct rq_wb *RQWB(struct rq_qos *rqos)
+{
+ return container_of(rqos, struct rq_wb, rqos);
+}
+
 static inline unsigned int wbt_inflight(struct rq_wb *rwb)
 {
  unsigned int i, ret = 0;
@@ -86,6 +84,7 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)
  return ret;
 }
 
+
 #ifdef CONFIG_BLK_WBT
 
 static inline void wbt_track(struct request *rq, enum wbt_flags flags)
@@ -93,19 +92,16 @@ static inline void wbt_track(struct request *rq, enum wbt_flags flags)
  rq->wbt_flags |= flags;
 }
 
-void __wbt_done(struct rq_wb *, enum wbt_flags);
-void wbt_done(struct rq_wb *, struct request *);
-enum wbt_flags wbt_wait(struct rq_wb *, struct bio *, spinlock_t *);
 int wbt_init(struct request_queue *);
-void wbt_exit(struct request_queue *);
-void wbt_update_limits(struct rq_wb *);
-void wbt_requeue(struct rq_wb *, struct request *);
-void wbt_issue(struct rq_wb *, struct request *);
+void wbt_update_limits(struct request_queue *);
 void wbt_disable_default(struct request_queue *);
 void wbt_enable_default(struct request_queue *);
 
-void wbt_set_queue_depth(struct rq_wb *, unsigned int);
-void wbt_set_write_cache(struct rq_wb *, bool);
+u64 wbt_get_min_lat(struct request_queue *q);
+void wbt_set_min_lat(struct request_queue *q, u64 val);
+
+void wbt_set_queue_depth(struct request_queue *, unsigned int);
+void wbt_set_write_cache(struct request_queue *, bool);
 
 u64 wbt_default_latency_nsec(struct request_queue *);
 
@@ -114,43 +110,30 @@ u64 wbt_default_latency_nsec(struct request_queue *);
 static inline void wbt_track(struct request *rq, enum wbt_flags flags)
 {
 }
-static inline void __wbt_done(struct rq_wb *rwb, enum wbt_flags flags)
-{
-}
-static inline void wbt_done(struct rq_wb *rwb, struct request *rq)
-{
-}
-static inline enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio,
-      spinlock_t *lock)
-{
- return 0;
-}
 static inline int wbt_init(struct request_queue *q)
 {
  return -EINVAL;
 }
-static inline void wbt_exit(struct request_queue *q)
-{
-}
-static inline void wbt_update_limits(struct rq_wb *rwb)
+static inline void wbt_update_limits(struct request_queue *q)
 {
 }
-static inline void wbt_requeue(struct rq_wb *rwb, struct request *rq)
+static inline void wbt_disable_default(struct request_queue *q)
 {
 }
-static inline void wbt_issue(struct rq_wb *rwb, struct request *rq)
+static inline void wbt_enable_default(struct request_queue *q)
 {
 }
-static inline void wbt_disable_default(struct request_queue *q)
+static inline void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 }
-static inline void wbt_enable_default(struct request_queue *q)
+static inline void wbt_set_write_cache(struct request_queue *q, bool wc)
 {
 }
-static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+static inline u64 wbt_get_min_lat(struct request_queue *q)
 {
+ return 0;
 }
-static inline void wbt_set_write_cache(struct rq_wb *rwb, bool wc)
+static inline void wbt_set_min_lat(struct request_queue *q, u64 val)
 {
 }
 static inline u64 wbt_default_latency_nsec(struct request_queue *q)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 79226ca8f80f..0cb4ab9ce18a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -42,7 +42,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
-struct rq_wb;
+struct rq_qos;
 struct blk_queue_stats;
 struct blk_stat_callback;
 
@@ -445,7 +445,7 @@ struct request_queue {
  atomic_t shared_hctx_restart;
 
  struct blk_queue_stats *stats;
- struct rq_wb *rq_wb;
+ struct rq_qos *rq_qos;
 
  /*
  * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 2/8] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Anchal Agarwal <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled,with the hung task detector also
kicking in.

Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).

[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.948138] Call Trace:
[    0.948139]  <IRQ>
[    0.948142]  do_raw_spin_lock+0xad/0xc0
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.948149]  ? __wake_up_common_lock+0x53/0x90
[    0.948150]  __wake_up_common_lock+0x53/0x90
[    0.948155]  wbt_done+0x7b/0xa0
[    0.948158]  blk_mq_free_request+0xb7/0x110
[    0.948161]  __blk_mq_complete_request+0xcb/0x140
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
[    0.948169]  nvme_irq+0x23/0x50 [nvme]
[    0.948173]  __handle_irq_event_percpu+0x46/0x300
[    0.948176]  handle_irq_event_percpu+0x20/0x50
[    0.948179]  handle_irq_event+0x34/0x60
[    0.948181]  handle_edge_irq+0x77/0x190
[    0.948185]  handle_irq+0xaf/0x120
[    0.948188]  do_IRQ+0x53/0x110
[    0.948191]  common_interrupt+0x87/0x87
[    0.948192]  </IRQ>
....
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.311154] Call Trace:
[    0.311157]  do_raw_spin_lock+0xad/0xc0
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
[    0.311167]  wbt_wait+0x127/0x330
[    0.311169]  ? finish_wait+0x80/0x80
[    0.311172]  ? generic_make_request+0xda/0x3b0
[    0.311174]  blk_mq_make_request+0xd6/0x7b0
[    0.311176]  ? blk_queue_enter+0x24/0x260
[    0.311178]  ? generic_make_request+0xda/0x3b0
[    0.311181]  generic_make_request+0x10c/0x3b0
[    0.311183]  ? submit_bio+0x5c/0x110
[    0.311185]  submit_bio+0x5c/0x110
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
[    0.311229]  ? do_writepages+0x3c/0xd0
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[    0.311240]  do_writepages+0x3c/0xd0
[    0.311243]  ? _raw_spin_unlock+0x24/0x30
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
[    0.311253]  file_write_and_wait_range+0x34/0x90
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
[    0.311267]  do_fsync+0x38/0x60
[    0.311270]  SyS_fsync+0xc/0x10
[    0.311272]  do_syscall_64+0x6f/0x170
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:

if (waitqueue_active(&rqw->wait) &&
            rqw->wait.head.next != &wait->entry)
                return false;

The problem with this is that the wait entry in wbt_wait is
define with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.

A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.

Signed-off-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Frank van der Linden <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 2887e41b910bb14fd847cf01ab7a5993db989d88)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 55 +++++++++++++++++++++----------------------------
 1 file changed, 24 insertions(+), 31 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 6fe20fb823e4..e817970202a6 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -166,7 +166,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
- wake_up_all(&rqw->wait);
+ wake_up(&rqw->wait);
  }
 }
 
@@ -474,30 +474,6 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  return limit;
 }
 
-static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
-     wait_queue_entry_t *wait, unsigned long rw)
-{
- /*
- * inc it here even if disabled, since we'll dec it at completion.
- * this only happens if the task was sleeping in __wbt_wait(),
- * and someone turned it off at the same time.
- */
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- return true;
- }
-
- /*
- * If the waitqueue is already active and we are not the next
- * in line to be woken up, wait for our turn.
- */
- if (waitqueue_active(&rqw->wait) &&
-    rqw->wait.head.next != &wait->entry)
- return false;
-
- return rq_wait_inc_below(rqw, get_limit(rwb, rw));
-}
-
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -508,16 +484,32 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  __acquires(lock)
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
- DEFINE_WAIT(wait);
+ DECLARE_WAITQUEUE(wait, current);
+
+ /*
+ * inc it here even if disabled, since we'll dec it at completion.
+ * this only happens if the task was sleeping in __wbt_wait(),
+ * and someone turned it off at the same time.
+ */
+ if (!rwb_enabled(rwb)) {
+ atomic_inc(&rqw->inflight);
+ return;
+ }
 
- if (may_queue(rwb, rqw, &wait, rw))
+ if (!waitqueue_active(&rqw->wait)
+ && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
 
+ add_wait_queue_exclusive(&rqw->wait, &wait);
  do {
- prepare_to_wait_exclusive(&rqw->wait, &wait,
- TASK_UNINTERRUPTIBLE);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ if (!rwb_enabled(rwb)) {
+ atomic_inc(&rqw->inflight);
+ break;
+ }
 
- if (may_queue(rwb, rqw, &wait, rw))
+ if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  break;
 
  if (lock) {
@@ -528,7 +520,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  io_schedule();
  } while (1);
 
- finish_wait(&rqw->wait, &wait);
+ __set_current_state(TASK_RUNNING);
+ remove_wait_queue(&rqw->wait, &wait);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 3/8] blk-wbt: move disable check into get_limit()

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Check it in one place, instead of in multiple places.

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit ffa358dcaae1f2f00926484e712e06daa8953cb4)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index e817970202a6..70075f21482f 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -449,6 +449,13 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 {
  unsigned int limit;
 
+ /*
+ * If we got disabled, just return UINT_MAX. This ensures that
+ * we'll properly inc a new IO, and dec+wakeup at the end.
+ */
+ if (!rwb_enabled(rwb))
+ return UINT_MAX;
+
  if ((rw & REQ_OP_MASK) == REQ_OP_DISCARD)
  return rwb->wb_background;
 
@@ -486,16 +493,6 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
 
- /*
- * inc it here even if disabled, since we'll dec it at completion.
- * this only happens if the task was sleeping in __wbt_wait(),
- * and someone turned it off at the same time.
- */
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- return;
- }
-
  if (!waitqueue_active(&rqw->wait)
  && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
@@ -504,11 +501,6 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  do {
  set_current_state(TASK_UNINTERRUPTIBLE);
 
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- break;
- }
-
  if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  break;
 
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 4/8] blk-wbt: use wq_has_sleeper() for wq active check

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit b78820937b4762b7d30b807d7156bec1d89e4dd3)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 70075f21482f..53bd504471d2 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -118,7 +118,7 @@ static void rwb_wake_all(struct rq_wb *rwb)
  for (i = 0; i < WBT_NUM_RWQ; i++) {
  struct rq_wait *rqw = &rwb->rq_wait[i];
 
- if (waitqueue_active(&rqw->wait))
+ if (wq_has_sleeper(&rqw->wait))
  wake_up_all(&rqw->wait);
  }
 }
@@ -162,7 +162,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
  if (inflight && inflight >= limit)
  return;
 
- if (waitqueue_active(&rqw->wait)) {
+ if (wq_has_sleeper(&rqw->wait)) {
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
@@ -493,8 +493,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
 
- if (!waitqueue_active(&rqw->wait)
- && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ if (!wq_has_sleeper(&rqw->wait) &&
+    rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
 
  add_wait_queue_exclusive(&rqw->wait, &wait);
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 5/8] blk-wbt: fix has-sleeper queueing check

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit c45e6a037a536530bd25781ac7c989e52deb2a63)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 53bd504471d2..9748a2fc684b 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -492,16 +492,17 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
+ bool has_sleeper;
 
- if (!wq_has_sleeper(&rqw->wait) &&
-    rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ has_sleeper = wq_has_sleeper(&rqw->wait);
+ if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
 
  add_wait_queue_exclusive(&rqw->wait, &wait);
  do {
  set_current_state(TASK_UNINTERRUPTIBLE);
 
- if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  break;
 
  if (lock) {
@@ -510,6 +511,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  spin_lock_irq(lock);
  } else
  io_schedule();
+ has_sleeper = false;
  } while (1);
 
  __set_current_state(TASK_RUNNING);
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 6/8] blk-wbt: abstract out end IO completion handler

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Prep patch for calling the handler from a different context,
no functional changes in this patch.

Tested-by: Agarwal, Anchal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 061a5427530633de93ace4ef001b99961984af62)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 9748a2fc684b..72b1fa8a75c6 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -123,16 +123,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
  }
 }
 
-static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
+ enum wbt_flags wb_acct)
 {
- struct rq_wb *rwb = RQWB(rqos);
- struct rq_wait *rqw;
  int inflight, limit;
 
- if (!(wb_acct & WBT_TRACKED))
- return;
-
- rqw = get_rq_wait(rwb, wb_acct);
  inflight = atomic_dec_return(&rqw->inflight);
 
  /*
@@ -170,6 +165,18 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
  }
 }
 
+static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+{
+ struct rq_wb *rwb = RQWB(rqos);
+ struct rq_wait *rqw;
+
+ if (!(wb_acct & WBT_TRACKED))
+ return;
+
+ rqw = get_rq_wait(rwb, wb_acct);
+ wbt_rqw_done(rwb, rqw, wb_acct);
+}
+
 /*
  * Called on completion of a request. Note that it's also called when
  * a request is merged, when the request gets freed.
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 7/8] blk-wbt: improve waking of tasks

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We have two potential issues:

1) After commit 2887e41b910b, we only wake one process at the time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.

Based on a patch from Jianchao Wang <[hidden email]>.

Tested-by: Agarwal, Anchal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 38cfb5a45ee013bfab5d1ae4c4738815e744b440)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 63 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 72b1fa8a75c6..d6c1d42b85c5 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -161,7 +161,7 @@ static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
- wake_up(&rqw->wait);
+ wake_up_all(&rqw->wait);
  }
 }
 
@@ -488,6 +488,34 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  return limit;
 }
 
+struct wbt_wait_data {
+ struct wait_queue_entry wq;
+ struct task_struct *task;
+ struct rq_wb *rwb;
+ struct rq_wait *rqw;
+ unsigned long rw;
+ bool got_token;
+};
+
+static int wbt_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+     int wake_flags, void *key)
+{
+ struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data,
+ wq);
+
+ /*
+ * If we fail to get a budget, return -1 to interrupt the wake up
+ * loop in __wake_up_common.
+ */
+ if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+ return -1;
+
+ data->got_token = true;
+ list_del_init(&curr->entry);
+ wake_up_process(data->task);
+ return 1;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -498,19 +526,40 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  __acquires(lock)
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
- DECLARE_WAITQUEUE(wait, current);
+ struct wbt_wait_data data = {
+ .wq = {
+ .func = wbt_wake_function,
+ .entry = LIST_HEAD_INIT(data.wq.entry),
+ },
+ .task = current,
+ .rwb = rwb,
+ .rqw = rqw,
+ .rw = rw,
+ };
  bool has_sleeper;
 
  has_sleeper = wq_has_sleeper(&rqw->wait);
  if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
 
- add_wait_queue_exclusive(&rqw->wait, &wait);
+ prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
  do {
- set_current_state(TASK_UNINTERRUPTIBLE);
+ if (data.got_token)
+ break;
 
- if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ if (!has_sleeper &&
+    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
+ finish_wait(&rqw->wait, &data.wq);
+
+ /*
+ * We raced with wbt_wake_function() getting a token,
+ * which means we now have two. Put our local token
+ * and wake anyone else potentially waiting for one.
+ */
+ if (data.got_token)
+ wbt_rqw_done(rwb, rqw, wb_acct);
  break;
+ }
 
  if (lock) {
  spin_unlock_irq(lock);
@@ -518,11 +567,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  spin_lock_irq(lock);
  } else
  io_schedule();
+
  has_sleeper = false;
  } while (1);
 
- __set_current_state(TASK_RUNNING);
- remove_wait_queue(&rqw->wait, &wait);
+ finish_wait(&rqw->wait, &data.wq);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU C][PATCH 8/8] blk-wbt: wake up all when we scale up, not down

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Josef Bacik <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Tetsuo brought to my attention that I screwed up the scale_up/scale_down
helpers when I factored out the rq-qos code.  We need to wake up all the
waiters when we add slots for requests to make, not when we shrink the
slots.  Otherwise we'll end up things waiting forever.  This was a
mistake and simply puts everything back the way it was.

cc: [hidden email]
Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
eported-by: Tetsuo Handa <[hidden email]>
Signed-off-by: Josef Bacik <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 5e65a20341e02df637d1c16cd487858d2c6a876a)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index d6c1d42b85c5..4e30632d3773 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -310,6 +310,7 @@ static void scale_up(struct rq_wb *rwb)
  rq_depth_scale_up(&rwb->rq_depth);
  calc_wb_limits(rwb);
  rwb->unknown_cnt = 0;
+ rwb_wake_all(rwb);
  rwb_trace_step(rwb, "scale up");
 }
 
@@ -318,7 +319,6 @@ static void scale_down(struct rq_wb *rwb, bool hard_throttle)
  rq_depth_scale_down(&rwb->rq_depth, hard_throttle);
  calc_wb_limits(rwb);
  rwb->unknown_cnt = 0;
- rwb_wake_all(rwb);
  rwb_trace_step(rwb, "scale down");
 }
 
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

NACK: [SRU C][PATCH 0/8] blk-wbt: fix for LP#1810998

Stefan Bader-2
In reply to this post by Mauricio Faria de Oliveira-3
On 10.01.19 04:19, Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1810998
>
> [Impact]
>
>  * Users may experience cpu hard lockups when performing
>    rigorous writes to NVMe drives.
>
>  * The fix addresses an scheduling issue in the original
>    implementation of wbt/writeback throttling
>
>  * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
>    contention and thundering herd issue in wbt_wait"),
>    plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
>    waking of tasks").
>
>  * There are additional commits to help with a cleaner
>    backport and future maintenance:
>    - Cosmic: total of 8 clean cherry picks.
>    - Bionic: total of 13 commits, 9 clean cherry-picks
>      and 4 backports, which are just changes to context
>      lines (i.e. refresh) without any functional changes
>      in the backport itself.
>
> [Test Case]
>
>  * This command has been reported to reproduce the problem:
>
>    $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128
>
>  * It generates stack traces as below in the original kernel,
>    and does not generate them in the modified/patched kernel.
>
>  * The user/reporter verified the test kernel with these patches
>    resolved the problem.
>
>  * The developer verified in 2 systems (4-core and 24-core but
>    no NVMe) for regressions, and no error messages were logged
>    to dmesg.
>
> [Regression Potential]
>
>  * The commits have been verified for fixes in later commits in
>    linux-next as of 2019-01-08 and all known fix commits are in.
>
>  * The regression potential is mostly contained in the writeback
>    throttling (block/blk-wbt.*), as almost all of the 13 patches
>    change exclusively that, except for 4 of them (2 of which are
>    sysfs):
>
>    - blk-rq-qos: refactor out common elements of blk-wbt (block/)
>    - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-sysfs.c)
>    - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-{mq-}sysfs.c)
>    - block: pass struct request instead of struct blk_issue_stat to wbt (block/, still mostly blk-wbt.*)
>
> [Other Info]
>
>  * Alternatively, it is probably possible to introduce just the
>    two commits that fix this with some changes to their code in
>    the backport, but since the 'blk-rq-qos: refactor ..' commit
>    may become a dependency for many additional/future fixes, it
>    seemed interesting to pull it in earlier in the 18.04 cycle.
>
>  * The problem has been introduced with the blk-wbt mechanism,
>    in v4.10-rc1, and the fix commits in v4.19-rc1 and -rc2,
>    so only Bionic and Cosmic needs this.
>
> [Stack Traces]
>
> [ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
> ...
> [ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
> ...
> [ 393.628720] Call Trace:
> [ 393.628721] <IRQ>
> [ 393.628724] enqueue_task_fair+0x6c/0x7f0
> [ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
> [ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
> [ 393.628731] activate_task+0x57/0xc0
> [ 393.628735] ? sched_clock+0x9/0x10
> [ 393.628736] ? sched_clock+0x9/0x10
> [ 393.628738] ttwu_do_activate+0x49/0x90
> [ 393.628739] try_to_wake_up+0x1df/0x490
> [ 393.628741] default_wake_function+0x12/0x20
> [ 393.628743] autoremove_wake_function+0x12/0x40
> [ 393.628744] __wake_up_common+0x73/0x130
> [ 393.628745] __wake_up_common_lock+0x80/0xc0
> [ 393.628746] __wake_up+0x13/0x20
> [ 393.628749] __wbt_done.part.21+0xa4/0xb0
> [ 393.628749] wbt_done+0x72/0xa0
> [ 393.628753] blk_mq_free_request+0xca/0x1a0
> [ 393.628755] blk_mq_end_request+0x48/0x90
> [ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
> [ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
> [ 393.628764] __blk_mq_complete_request+0xd2/0x140
> [ 393.628766] blk_mq_complete_request+0x18/0x20
> [ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
> [ 393.628768] nvme_irq+0x23/0x50 [nvme]
> [ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
> [ 393.628773] handle_irq_event_percpu+0x32/0x80
> [ 393.628774] handle_irq_event+0x3b/0x60
> [ 393.628778] handle_edge_irq+0x7c/0x190
> [ 393.628779] handle_irq+0x20/0x30
> [ 393.628783] do_IRQ+0x46/0xd0
> [ 393.628784] common_interrupt+0x84/0x84
> [ 393.628785] </IRQ>
> ...
> [ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
> [ 393.628796] cpuidle_enter+0x17/0x20
> [ 393.628797] call_cpuidle+0x23/0x40
> [ 393.628798] do_idle+0x18c/0x1f0
> [ 393.628799] cpu_startup_entry+0x73/0x80
> [ 393.628802] start_secondary+0x1a6/0x200
> [ 393.628804] secondary_startup_64+0xa5/0xb0
> [ 393.628805] Code: ...
>
> [ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled
>
> [ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
> [ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
> [ 435.617025] Sending NMI from CPU 8 to CPUs 30:
> [ 435.617029] NMI backtrace for cpu 30
> [ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
> ...
> [ 435.617047] Call Trace:
> [ 435.617048] <IRQ>
> [ 435.617051] enqueue_entity+0x9f/0x6b0
> [ 435.617053] enqueue_task_fair+0x6c/0x7f0
> [ 435.617056] activate_task+0x57/0xc0
> [ 435.617059] ? sched_clock+0x9/0x10
> [ 435.617060] ? sched_clock+0x9/0x10
> [ 435.617061] ttwu_do_activate+0x49/0x90
> [ 435.617063] try_to_wake_up+0x1df/0x490
> [ 435.617065] default_wake_function+0x12/0x20
> [ 435.617067] autoremove_wake_function+0x12/0x40
> [ 435.617068] __wake_up_common+0x73/0x130
> [ 435.617069] __wake_up_common_lock+0x80/0xc0
> [ 435.617070] __wake_up+0x13/0x20
> [ 435.617073] __wbt_done.part.21+0xa4/0xb0
> [ 435.617074] wbt_done+0x72/0xa0
> [ 435.617077] blk_mq_free_request+0xca/0x1a0
> [ 435.617079] blk_mq_end_request+0x48/0x90
> [ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
> [ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
> [ 435.617088] __blk_mq_complete_request+0xd2/0x140
> [ 435.617090] blk_mq_complete_request+0x18/0x20
> [ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
> [ 435.617093] nvme_irq+0x23/0x50 [nvme]
> [ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
> [ 435.617097] handle_irq_event_percpu+0x32/0x80
> [ 435.617098] handle_irq_event+0x3b/0x60
> [ 435.617101] handle_edge_irq+0x7c/0x190
> [ 435.617102] handle_irq+0x20/0x30
> [ 435.617106] do_IRQ+0x46/0xd0
> [ 435.617107] common_interrupt+0x84/0x84
> [ 435.617108] </IRQ>
> ...
> [ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
> [ 435.617118] cpuidle_enter+0x17/0x20
> [ 435.617119] call_cpuidle+0x23/0x40
> [ 435.617121] do_idle+0x18c/0x1f0
> [ 435.617122] cpu_startup_entry+0x73/0x80
> [ 435.617125] start_secondary+0x1a6/0x200
> [ 435.617127] secondary_startup_64+0xa5/0xb0
> [ 435.617128] Code: ...
>
> Anchal Agarwal (1):
>   blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
>
> Jens Axboe (5):
>   blk-wbt: move disable check into get_limit()
>   blk-wbt: use wq_has_sleeper() for wq active check
>   blk-wbt: fix has-sleeper queueing check
>   blk-wbt: abstract out end IO completion handler
>   blk-wbt: improve waking of tasks
>
> Josef Bacik (2):
>   blk-rq-qos: refactor out common elements of blk-wbt
>   blk-wbt: wake up all when we scale up, not down
>
>  block/Makefile         |   2 +-
>  block/blk-core.c       |  12 +-
>  block/blk-mq.c         |  12 +-
>  block/blk-rq-qos.c     | 178 +++++++++++++++++
>  block/blk-rq-qos.h     | 106 ++++++++++
>  block/blk-settings.c   |   4 +-
>  block/blk-sysfs.c      |  22 +--
>  block/blk-wbt.c        | 427 +++++++++++++++++++++--------------------
>  block/blk-wbt.h        |  63 +++---
>  include/linux/blkdev.h |   4 +-
>  10 files changed, 550 insertions(+), 280 deletions(-)
>  create mode 100644 block/blk-rq-qos.c
>  create mode 100644 block/blk-rq-qos.h
>
Same for Cosmic, if there is hope for a minimal backport being possible, please.




--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: NACK: [SRU C][PATCH 0/8] blk-wbt: fix for LP#1810998

Mauricio Faria de Oliveira-3
On Thu, Jan 10, 2019 at 1:37 PM Stefan Bader <[hidden email]> wrote:
[snip]
> Same for Cosmic, if there is hope for a minimal backport being possible, please.

Alright.

>
>
>


--
Mauricio Faria de Oliveira

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team