[SRU B][PATCH 00/13] blk-wbt: fix for LP#1810998

Mauricio Faria de Oliveira
BugLink: https://bugs.launchpad.net/bugs/1810998

[Impact]

 * Users may experience CPU hard lockups when performing
   intensive writes to NVMe drives.

 * The fix addresses a scheduling issue in the original
   implementation of wbt/writeback throttling (a quick way to check
   whether a kernel build includes wbt is noted at the end of this
   section).

 * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
   contention and thundering herd issue in wbt_wait"),
   plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
   waking of tasks").

 * There are additional commits to help with a cleaner
   backport and future maintenance:
   - Cosmic: total of 8 clean cherry picks.
   - Bionic: total of 13 commits, 9 clean cherry-picks
     and 4 backports, which are just changes to context
     lines (i.e. refresh) without any functional changes
     in the backport itself.
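
 * As a quick sanity check, whether a kernel build has writeback
   throttling compiled in can be verified against its config file
   (the path below assumes an Ubuntu kernel):

   $ grep BLK_WBT /boot/config-$(uname -r)

   CONFIG_BLK_WBT enables the wbt code changed by this series, and
   CONFIG_BLK_WBT_MQ enables it by default for blk-mq devices (such
   as NVMe).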

[Test Case]

 * This command has been reported to reproduce the problem:

   $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

 * It generates stack traces like those below on the original kernel,
   and does not generate them on the modified/patched kernel (a dmesg
   command to watch for these is noted at the end of this section).

 * The user/reporter verified that a test kernel with these patches
   resolved the problem.

 * The developer tested for regressions on 2 systems (4-core and
   24-core, although without NVMe drives), and no error messages were
   logged to dmesg.
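
 * A simple way to watch for the problem while the reproducer runs is
   to follow the kernel log for the signatures shown in [Stack Traces]
   (assuming a second shell on the system under test):

   $ sudo dmesg --follow | grep -E 'hard LOCKUP|rcu_sched detected stalls'

   On the original kernel these messages show up under load; on the
   patched kernel they should not appear.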

[Regression Potential]

 * The commits have been checked for later fixes in linux-next as of
   2019-01-08, and all known fix commits are included.

 * The regression potential is mostly contained in the writeback
   throttling code (block/blk-wbt.*), as almost all of the 13 patches
   change only that, except for 4 of them (2 of which are sysfs
   changes); a way to re-check the files touched by the series is
   noted after this list:

   - blk-rq-qos: refactor out common elements of blk-wbt (block/)
   - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-sysfs.c)
   - block: properly protect the 'queue' kobj in blk_unregister_queue (blk-{mq-,}sysfs.c)
   - block: pass struct request instead of struct blk_issue_stat to wbt (block/, still mostly blk-wbt.*)
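
 * To re-check locally which files the series touches, the diffstat
   can be regenerated from the patch files (assuming the mails are
   saved as individual *.patch files):

   $ git apply --stat *.patch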

[Other Info]

 * Alternatively, it would probably be possible to introduce just the
   two commits that fix this, with some changes to their code in the
   backport; but since the 'blk-rq-qos: refactor ..' commit may become
   a dependency for many additional/future fixes, it seemed worthwhile
   to pull it in earlier in the 18.04 cycle.

 * The problem was introduced with the blk-wbt mechanism in v4.10-rc1,
   and the fix commits landed in v4.19-rc1 and -rc2, so only Bionic and
   Cosmic need this (whether wbt is active on a given device can be
   checked via sysfs, as noted below).
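
 * For reference, whether wbt is currently active on a queue, and its
   latency target, can be inspected and tuned via sysfs (the device
   name below is just an example):

   $ cat /sys/block/nvme0n1/queue/wbt_lat_usec

   A positive value is the minimum latency target in usec, 0 means wbt
   is disabled on that queue, and writing -1 restores the default.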

[Stack Traces]

[ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
...
[ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 393.628720] Call Trace:
[ 393.628721] <IRQ>
[ 393.628724] enqueue_task_fair+0x6c/0x7f0
[ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628731] activate_task+0x57/0xc0
[ 393.628735] ? sched_clock+0x9/0x10
[ 393.628736] ? sched_clock+0x9/0x10
[ 393.628738] ttwu_do_activate+0x49/0x90
[ 393.628739] try_to_wake_up+0x1df/0x490
[ 393.628741] default_wake_function+0x12/0x20
[ 393.628743] autoremove_wake_function+0x12/0x40
[ 393.628744] __wake_up_common+0x73/0x130
[ 393.628745] __wake_up_common_lock+0x80/0xc0
[ 393.628746] __wake_up+0x13/0x20
[ 393.628749] __wbt_done.part.21+0xa4/0xb0
[ 393.628749] wbt_done+0x72/0xa0
[ 393.628753] blk_mq_free_request+0xca/0x1a0
[ 393.628755] blk_mq_end_request+0x48/0x90
[ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 393.628764] __blk_mq_complete_request+0xd2/0x140
[ 393.628766] blk_mq_complete_request+0x18/0x20
[ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 393.628768] nvme_irq+0x23/0x50 [nvme]
[ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
[ 393.628773] handle_irq_event_percpu+0x32/0x80
[ 393.628774] handle_irq_event+0x3b/0x60
[ 393.628778] handle_edge_irq+0x7c/0x190
[ 393.628779] handle_irq+0x20/0x30
[ 393.628783] do_IRQ+0x46/0xd0
[ 393.628784] common_interrupt+0x84/0x84
[ 393.628785] </IRQ>
...
[ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
[ 393.628796] cpuidle_enter+0x17/0x20
[ 393.628797] call_cpuidle+0x23/0x40
[ 393.628798] do_idle+0x18c/0x1f0
[ 393.628799] cpu_startup_entry+0x73/0x80
[ 393.628802] start_secondary+0x1a6/0x200
[ 393.628804] secondary_startup_64+0xa5/0xb0
[ 393.628805] Code: ...

[ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled

[ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
[ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
[ 435.617025] Sending NMI from CPU 8 to CPUs 30:
[ 435.617029] NMI backtrace for cpu 30
[ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 435.617047] Call Trace:
[ 435.617048] <IRQ>
[ 435.617051] enqueue_entity+0x9f/0x6b0
[ 435.617053] enqueue_task_fair+0x6c/0x7f0
[ 435.617056] activate_task+0x57/0xc0
[ 435.617059] ? sched_clock+0x9/0x10
[ 435.617060] ? sched_clock+0x9/0x10
[ 435.617061] ttwu_do_activate+0x49/0x90
[ 435.617063] try_to_wake_up+0x1df/0x490
[ 435.617065] default_wake_function+0x12/0x20
[ 435.617067] autoremove_wake_function+0x12/0x40
[ 435.617068] __wake_up_common+0x73/0x130
[ 435.617069] __wake_up_common_lock+0x80/0xc0
[ 435.617070] __wake_up+0x13/0x20
[ 435.617073] __wbt_done.part.21+0xa4/0xb0
[ 435.617074] wbt_done+0x72/0xa0
[ 435.617077] blk_mq_free_request+0xca/0x1a0
[ 435.617079] blk_mq_end_request+0x48/0x90
[ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 435.617088] __blk_mq_complete_request+0xd2/0x140
[ 435.617090] blk_mq_complete_request+0x18/0x20
[ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 435.617093] nvme_irq+0x23/0x50 [nvme]
[ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
[ 435.617097] handle_irq_event_percpu+0x32/0x80
[ 435.617098] handle_irq_event+0x3b/0x60
[ 435.617101] handle_edge_irq+0x7c/0x190
[ 435.617102] handle_irq+0x20/0x30
[ 435.617106] do_IRQ+0x46/0xd0
[ 435.617107] common_interrupt+0x84/0x84
[ 435.617108] </IRQ>
...
[ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
[ 435.617118] cpuidle_enter+0x17/0x20
[ 435.617119] call_cpuidle+0x23/0x40
[ 435.617121] do_idle+0x18c/0x1f0
[ 435.617122] cpu_startup_entry+0x73/0x80
[ 435.617125] start_secondary+0x1a6/0x200
[ 435.617127] secondary_startup_64+0xa5/0xb0
[ 435.617128] Code: ...

Anchal Agarwal (1):
  blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Bart Van Assche (1):
  block: Protect less code with sysfs_lock in blk_{un,}register_queue()

Jens Axboe (6):
  blk-wbt: pass in enum wbt_flags to get_rq_wait()
  blk-wbt: move disable check into get_limit()
  blk-wbt: use wq_has_sleeper() for wq active check
  blk-wbt: fix has-sleeper queueing check
  blk-wbt: abstract out end IO completion handler
  blk-wbt: improve waking of tasks

Josef Bacik (2):
  blk-rq-qos: refactor out common elements of blk-wbt
  blk-wbt: wake up all when we scale up, not down

Mike Snitzer (1):
  block: properly protect the 'queue' kobj in blk_unregister_queue

Omar Sandoval (2):
  block: move some wbt helpers to blk-wbt.c
  block: pass struct request instead of struct blk_issue_stat to wbt

 block/Makefile         |   2 +-
 block/blk-core.c       |  14 +-
 block/blk-mq-sysfs.c   |   9 +-
 block/blk-mq.c         |  14 +-
 block/blk-rq-qos.c     | 178 +++++++++++++++
 block/blk-rq-qos.h     | 106 +++++++++
 block/blk-settings.c   |   4 +-
 block/blk-sysfs.c      |  62 +++--
 block/blk-wbt.c        | 499 ++++++++++++++++++++++-------------------
 block/blk-wbt.h        |  96 +++-----
 include/linux/blkdev.h |   4 +-
 11 files changed, 643 insertions(+), 345 deletions(-)
 create mode 100644 block/blk-rq-qos.c
 create mode 100644 block/blk-rq-qos.h

--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 01/13] blk-wbt: pass in enum wbt_flags to get_rq_wait()

Mauricio Faria de Oliveira
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

This is in preparation for having more write queues, in which
case we would have needed to pass in more information than just
a simple 'is_kswapd' boolean.

Reviewed-by: Darrick J. Wong <[hidden email]>
Reviewed-by: Omar Sandoval <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 8bea60901974ad44b06b08d52e1dd421ea8c6e9c)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 25 +++++++++++++++----------
 block/blk-wbt.h |  4 +++-
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index f92fc84b5e2c..4afa0ebb5953 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -101,9 +101,13 @@ static bool wb_recent_wait(struct rq_wb *rwb)
  return time_before(jiffies, wb->dirty_sleep + HZ);
 }
 
-static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
+static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
+  enum wbt_flags wb_acct)
 {
- return &rwb->rq_wait[is_kswapd];
+ if (wb_acct & WBT_KSWAPD)
+ return &rwb->rq_wait[WBT_RWQ_KSWAPD];
+
+ return &rwb->rq_wait[WBT_RWQ_BG];
 }
 
 static void rwb_wake_all(struct rq_wb *rwb)
@@ -126,7 +130,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
  if (!(wb_acct & WBT_TRACKED))
  return;
 
- rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
+ rqw = get_rq_wait(rwb, wb_acct);
  inflight = atomic_dec_return(&rqw->inflight);
 
  /*
@@ -529,11 +533,12 @@ static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
  */
-static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
+       unsigned long rw, spinlock_t *lock)
  __releases(lock)
  __acquires(lock)
 {
- struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
+ struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DEFINE_WAIT(wait);
 
  if (may_queue(rwb, rqw, &wait, rw))
@@ -584,7 +589,7 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
  */
 enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
 {
- unsigned int ret = 0;
+ enum wbt_flags ret = 0;
 
  if (!rwb_enabled(rwb))
  return 0;
@@ -598,14 +603,14 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
  return ret;
  }
 
- __wbt_wait(rwb, bio->bi_opf, lock);
+ if (current_is_kswapd())
+ ret |= WBT_KSWAPD;
+
+ __wbt_wait(rwb, ret, bio->bi_opf, lock);
 
  if (!blk_stat_is_active(rwb->cb))
  rwb_arm_timer(rwb);
 
- if (current_is_kswapd())
- ret |= WBT_KSWAPD;
-
  return ret | WBT_TRACKED;
 }
 
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index a232c98fbf4d..8038b4a0d4ef 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -19,7 +19,9 @@ enum wbt_flags {
 };
 
 enum {
- WBT_NUM_RWQ = 2,
+ WBT_RWQ_BG = 0,
+ WBT_RWQ_KSWAPD,
+ WBT_NUM_RWQ,
 };
 
 /*
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 02/13] block: move some wbt helpers to blk-wbt.c

Mauricio Faria de Oliveira
From: Omar Sandoval <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

A few helpers are only used from blk-wbt.c, so move them there, and put
wbt_track() behind the CONFIG_BLK_WBT typedef. This is in preparation
for changing how the wbt flags are tracked.

Signed-off-by: Omar Sandoval <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 934031a12980511c020acf7d91f9035e34d0b5b8)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 20 ++++++++++++++++++++
 block/blk-wbt.h | 33 ++++++++-------------------------
 2 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4afa0ebb5953..cec4de703a6a 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -29,6 +29,26 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/wbt.h>
 
+static inline void wbt_clear_state(struct blk_issue_stat *stat)
+{
+ stat->stat &= ~BLK_STAT_RES_MASK;
+}
+
+static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
+{
+ return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
+}
+
+static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
+{
+ return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
+}
+
+static inline bool wbt_is_read(struct blk_issue_stat *stat)
+{
+ return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
+}
+
 enum {
  /*
  * Default setting, we'll scale up (to 75% of QD max) or down (min 1)
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 8038b4a0d4ef..6a48459987b1 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -33,31 +33,6 @@ enum {
  WBT_STATE_ON_MANUAL = 2,
 };
 
-static inline void wbt_clear_state(struct blk_issue_stat *stat)
-{
- stat->stat &= ~BLK_STAT_RES_MASK;
-}
-
-static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
-{
- return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
-}
-
-static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
-{
- stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT;
-}
-
-static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
-{
- return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
-}
-
-static inline bool wbt_is_read(struct blk_issue_stat *stat)
-{
- return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
-}
-
 struct rq_wait {
  wait_queue_head_t wait;
  atomic_t inflight;
@@ -111,6 +86,11 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)
 
 #ifdef CONFIG_BLK_WBT
 
+static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags flags)
+{
+ stat->stat |= ((u64)flags) << BLK_STAT_RES_SHIFT;
+}
+
 void __wbt_done(struct rq_wb *, enum wbt_flags);
 void wbt_done(struct rq_wb *, struct blk_issue_stat *);
 enum wbt_flags wbt_wait(struct rq_wb *, struct bio *, spinlock_t *);
@@ -129,6 +109,9 @@ u64 wbt_default_latency_nsec(struct request_queue *);
 
 #else
 
+static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags flags)
+{
+}
 static inline void __wbt_done(struct rq_wb *rwb, enum wbt_flags flags)
 {
 }
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 03/13] block: pass struct request instead of struct blk_issue_stat to wbt

Mauricio Faria de Oliveira
From: Omar Sandoval <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.

Signed-off-by: Omar Sandoval <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit a8a45941706bca05ef9234a17f5e4a50b9835a44)
[mfo: backport:
 - blk-core.c
   - hunk 4: update last context line due to lack of commit e14575b3d45
     ("block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit")
     not required.
 - blk-mq.c
   - hunk 3: update last context line due to lack of commit 1d9bd5161ba
     ("blk-mq: replace timeout synchronization with a RCU and generation
     based scheme"), not required.
   - hunk 4: likewise, and commit 5a61c36398d0 ("blk-mq: remove
     REQ_ATOM_STARTED"), not required.]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-core.c | 10 ++++-----
 block/blk-mq.c   | 10 ++++-----
 block/blk-wbt.c  | 53 ++++++++++++++++++++++++------------------------
 block/blk-wbt.h  | 18 ++++++++--------
 4 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c5d5c8587770..6ed5c0eb29fc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1553,7 +1553,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
  blk_delete_timer(rq);
  blk_clear_rq_complete(rq);
  trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, &rq->issue_stat);
+ wbt_requeue(q->rq_wb, rq);
 
  if (rq->rq_flags & RQF_QUEUED)
  blk_queue_end_tag(q, rq);
@@ -1659,7 +1659,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
  /* this is a bio leak */
  WARN_ON(req->bio != NULL);
 
- wbt_done(q->rq_wb, &req->issue_stat);
+ wbt_done(q->rq_wb, req);
 
  /*
  * Request may not have originated from ll_rw_blk. if not,
@@ -1970,7 +1970,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
  goto out_unlock;
  }
 
- wbt_track(&req->issue_stat, wb_acct);
+ wbt_track(req, wb_acct);
 
  /*
  * After dropping the lock and possibly sleeping here, our request
@@ -2849,7 +2849,7 @@ void blk_start_request(struct request *req)
  if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
  blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
  req->rq_flags |= RQF_STATS;
- wbt_issue(req->q->rq_wb, &req->issue_stat);
+ wbt_issue(req->q->rq_wb, req);
  }
 
  BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));
@@ -3068,7 +3068,7 @@ void blk_finish_request(struct request *req, blk_status_t error)
  blk_account_io_done(req);
 
  if (req->end_io) {
- wbt_done(req->q->rq_wb, &req->issue_stat);
+ wbt_done(req->q->rq_wb, req);
  req->end_io(req, error);
  } else {
  if (blk_bidi_rq(req))
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2f8436314985..7172ae4b9993 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -499,7 +499,7 @@ void blk_mq_free_request(struct request *rq)
  if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
  laptop_io_completion(q->backing_dev_info);
 
- wbt_done(q->rq_wb, &rq->issue_stat);
+ wbt_done(q->rq_wb, rq);
 
  if (blk_rq_rl(rq))
  blk_put_rl(blk_rq_rl(rq));
@@ -520,7 +520,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
  blk_account_io_done(rq);
 
  if (rq->end_io) {
- wbt_done(rq->q->rq_wb, &rq->issue_stat);
+ wbt_done(rq->q->rq_wb, rq);
  rq->end_io(rq, error);
  } else {
  if (unlikely(blk_bidi_rq(rq)))
@@ -614,7 +614,7 @@ void blk_mq_start_request(struct request *rq)
  if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
  blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
  rq->rq_flags |= RQF_STATS;
- wbt_issue(q->rq_wb, &rq->issue_stat);
+ wbt_issue(q->rq_wb, rq);
  }
 
  blk_add_timer(rq);
@@ -673,7 +673,7 @@ static void __blk_mq_requeue_request(struct request *rq)
  blk_mq_put_driver_tag(rq);
 
  trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, &rq->issue_stat);
+ wbt_requeue(q->rq_wb, rq);
 
  if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
  if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1770,7 +1770,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
  return BLK_QC_T_NONE;
  }
 
- wbt_track(&rq->issue_stat, wb_acct);
+ wbt_track(rq, wb_acct);
 
  cookie = request_to_qc_t(data.hctx, rq);
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index cec4de703a6a..efc7f9fa2346 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -29,24 +29,24 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/wbt.h>
 
-static inline void wbt_clear_state(struct blk_issue_stat *stat)
+static inline void wbt_clear_state(struct request *rq)
 {
- stat->stat &= ~BLK_STAT_RES_MASK;
+ rq->issue_stat.stat &= ~BLK_STAT_RES_MASK;
 }
 
-static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
+static inline enum wbt_flags wbt_flags(struct request *rq)
 {
- return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
+ return (rq->issue_stat.stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
 }
 
-static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
+static inline bool wbt_is_tracked(struct request *rq)
 {
- return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
+ return (rq->issue_stat.stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
 }
 
-static inline bool wbt_is_read(struct blk_issue_stat *stat)
+static inline bool wbt_is_read(struct request *rq)
 {
- return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
+ return (rq->issue_stat.stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
 }
 
 enum {
@@ -189,24 +189,24 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
  * Called on completion of a request. Note that it's also called when
  * a request is merged, when the request gets freed.
  */
-void wbt_done(struct rq_wb *rwb, struct blk_issue_stat *stat)
+void wbt_done(struct rq_wb *rwb, struct request *rq)
 {
  if (!rwb)
  return;
 
- if (!wbt_is_tracked(stat)) {
- if (rwb->sync_cookie == stat) {
+ if (!wbt_is_tracked(rq)) {
+ if (rwb->sync_cookie == rq) {
  rwb->sync_issue = 0;
  rwb->sync_cookie = NULL;
  }
 
- if (wbt_is_read(stat))
+ if (wbt_is_read(rq))
  wb_timestamp(rwb, &rwb->last_comp);
  } else {
- WARN_ON_ONCE(stat == rwb->sync_cookie);
- __wbt_done(rwb, wbt_stat_to_mask(stat));
+ WARN_ON_ONCE(rq == rwb->sync_cookie);
+ __wbt_done(rwb, wbt_flags(rq));
  }
- wbt_clear_state(stat);
+ wbt_clear_state(rq);
 }
 
 /*
@@ -634,30 +634,29 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
  return ret | WBT_TRACKED;
 }
 
-void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+void wbt_issue(struct rq_wb *rwb, struct request *rq)
 {
  if (!rwb_enabled(rwb))
  return;
 
  /*
- * Track sync issue, in case it takes a long time to complete. Allows
- * us to react quicker, if a sync IO takes a long time to complete.
- * Note that this is just a hint. 'stat' can go away when the
- * request completes, so it's important we never dereference it. We
- * only use the address to compare with, which is why we store the
- * sync_issue time locally.
+ * Track sync issue, in case it takes a long time to complete. Allows us
+ * to react quicker, if a sync IO takes a long time to complete. Note
+ * that this is just a hint. The request can go away when it completes,
+ * so it's important we never dereference it. We only use the address to
+ * compare with, which is why we store the sync_issue time locally.
  */
- if (wbt_is_read(stat) && !rwb->sync_issue) {
- rwb->sync_cookie = stat;
- rwb->sync_issue = blk_stat_time(stat);
+ if (wbt_is_read(rq) && !rwb->sync_issue) {
+ rwb->sync_cookie = rq;
+ rwb->sync_issue = blk_stat_time(&rq->issue_stat);
  }
 }
 
-void wbt_requeue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+void wbt_requeue(struct rq_wb *rwb, struct request *rq)
 {
  if (!rwb_enabled(rwb))
  return;
- if (stat == rwb->sync_cookie) {
+ if (rq == rwb->sync_cookie) {
  rwb->sync_issue = 0;
  rwb->sync_cookie = NULL;
  }
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 6a48459987b1..5cf2398caf49 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -86,19 +86,19 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)
 
 #ifdef CONFIG_BLK_WBT
 
-static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags flags)
+static inline void wbt_track(struct request *rq, enum wbt_flags flags)
 {
- stat->stat |= ((u64)flags) << BLK_STAT_RES_SHIFT;
+ rq->issue_stat.stat |= ((u64)flags) << BLK_STAT_RES_SHIFT;
 }
 
 void __wbt_done(struct rq_wb *, enum wbt_flags);
-void wbt_done(struct rq_wb *, struct blk_issue_stat *);
+void wbt_done(struct rq_wb *, struct request *);
 enum wbt_flags wbt_wait(struct rq_wb *, struct bio *, spinlock_t *);
 int wbt_init(struct request_queue *);
 void wbt_exit(struct request_queue *);
 void wbt_update_limits(struct rq_wb *);
-void wbt_requeue(struct rq_wb *, struct blk_issue_stat *);
-void wbt_issue(struct rq_wb *, struct blk_issue_stat *);
+void wbt_requeue(struct rq_wb *, struct request *);
+void wbt_issue(struct rq_wb *, struct request *);
 void wbt_disable_default(struct request_queue *);
 void wbt_enable_default(struct request_queue *);
 
@@ -109,13 +109,13 @@ u64 wbt_default_latency_nsec(struct request_queue *);
 
 #else
 
-static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags flags)
+static inline void wbt_track(struct request *rq, enum wbt_flags flags)
 {
 }
 static inline void __wbt_done(struct rq_wb *rwb, enum wbt_flags flags)
 {
 }
-static inline void wbt_done(struct rq_wb *rwb, struct blk_issue_stat *stat)
+static inline void wbt_done(struct rq_wb *rwb, struct request *rq)
 {
 }
 static inline enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio,
@@ -133,10 +133,10 @@ static inline void wbt_exit(struct request_queue *q)
 static inline void wbt_update_limits(struct rq_wb *rwb)
 {
 }
-static inline void wbt_requeue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+static inline void wbt_requeue(struct rq_wb *rwb, struct request *rq)
 {
 }
-static inline void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+static inline void wbt_issue(struct rq_wb *rwb, struct request *rq)
 {
 }
 static inline void wbt_disable_default(struct request_queue *q)
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 04/13] block: properly protect the 'queue' kobj in blk_unregister_queue

Mauricio Faria de Oliveira
From: Mike Snitzer <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

The original commit e9a823fb34a8b (block: fix warning when I/O elevator
is changed as request_queue is being removed) is pretty conflated.
"conflated" because the resource being protected by q->sysfs_lock isn't
the queue_flags (it is the 'queue' kobj).

q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
from racing with blk_unregister_queue():
1) By holding q->sysfs_lock first, __elevator_change() can complete
before a racing blk_unregister_queue().
2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
in case elv_iosched_store() loses the race with blk_unregister_queue(),
it needs a way to know the 'queue' kobj isn't there.

Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
held until after the 'queue' kobj is removed.

To do so blk_mq_unregister_dev() must not also take q->sysfs_lock.  So
rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

Also, blk_unregister_queue() should use q->queue_lock to protect against
any concurrent writes to q->queue_flags -- even though chances are the
queue is being cleaned up so no concurrent writes are likely.

Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
Signed-off-by: Mike Snitzer <[hidden email]>
Reviewed-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 667257e8b2988c0183ba23e2bcd6900e87961606)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-mq-sysfs.c |  9 +--------
 block/blk-sysfs.c    | 13 ++++++++++---
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 79969c3c234f..a54b4b070f1c 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -248,7 +248,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
  return ret;
 }
 
-static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
+void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 {
  struct blk_mq_hw_ctx *hctx;
  int i;
@@ -265,13 +265,6 @@ static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
  q->mq_sysfs_init_done = false;
 }
 
-void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
-{
- mutex_lock(&q->sysfs_lock);
- __blk_mq_unregister_dev(dev, q);
- mutex_unlock(&q->sysfs_lock);
-}
-
 void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
 {
  kobject_init(&hctx->kobj, &blk_mq_hw_ktype);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 870484eaed1f..9272452ff456 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk)
  if (WARN_ON(!q))
  return;
 
+ /*
+ * Protect against the 'queue' kobj being accessed
+ * while/after it is removed.
+ */
  mutex_lock(&q->sysfs_lock);
- queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
- mutex_unlock(&q->sysfs_lock);
 
- wbt_exit(q);
+ spin_lock_irq(q->queue_lock);
+ queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
+ spin_unlock_irq(q->queue_lock);
 
+ wbt_exit(q);
 
  if (q->mq_ops)
  blk_mq_unregister_dev(disk_to_dev(disk), q);
@@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk)
  kobject_del(&q->kobj);
  blk_trace_remove_sysfs(disk_to_dev(disk));
  kobject_put(&disk_to_dev(disk)->kobj);
+
+ mutex_unlock(&q->sysfs_lock);
 }
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 05/13] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

Mauricio Faria de Oliveira
From: Bart Van Assche <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

The __blk_mq_register_dev(), blk_mq_unregister_dev(),
elv_register_queue() and elv_unregister_queue() calls need to be
protected with sysfs_lock but other code in these functions not.
Hence protect only this code with sysfs_lock. This patch fixes a
locking inversion issue in blk_unregister_queue() and also in an
error path of blk_register_queue(): it is not allowed to hold
sysfs_lock around the kobject_del(&q->kobj) call.

Reviewed-by: Christoph Hellwig <[hidden email]>
Signed-off-by: Bart Van Assche <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit 2c2086afc2b8b974fac32cb028e73dc27bfae442)
[mfo: backport:
 - blk-sysfs.c
 - hunk 3: update upper 2 context lines (blk_register_queue()
     is not yet exported, and that is not required here).]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-sysfs.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9272452ff456..c29ec0cfab1d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -853,6 +853,10 @@ struct kobj_type blk_queue_ktype = {
  .release = blk_release_queue,
 };
 
+/**
+ * blk_register_queue - register a block layer queue with sysfs
+ * @disk: Disk of which the request queue should be registered with sysfs.
+ */
 int blk_register_queue(struct gendisk *disk)
 {
  int ret;
@@ -909,11 +913,12 @@ int blk_register_queue(struct gendisk *disk)
  if (q->request_fn || (q->mq_ops && q->elevator)) {
  ret = elv_register_queue(q);
  if (ret) {
+ mutex_unlock(&q->sysfs_lock);
  kobject_uevent(&q->kobj, KOBJ_REMOVE);
  kobject_del(&q->kobj);
  blk_trace_remove_sysfs(dev);
  kobject_put(&dev->kobj);
- goto unlock;
+ return ret;
  }
  }
  ret = 0;
@@ -922,6 +927,13 @@ int blk_register_queue(struct gendisk *disk)
  return ret;
 }
 
+/**
+ * blk_unregister_queue - counterpart of blk_register_queue()
+ * @disk: Disk of which the request queue should be unregistered from sysfs.
+ *
+ * Note: the caller is responsible for guaranteeing that this function is called
+ * after blk_register_queue() has finished.
+ */
 void blk_unregister_queue(struct gendisk *disk)
 {
  struct request_queue *q = disk->queue;
@@ -930,8 +942,9 @@ void blk_unregister_queue(struct gendisk *disk)
  return;
 
  /*
- * Protect against the 'queue' kobj being accessed
- * while/after it is removed.
+ * Since sysfs_remove_dir() prevents adding new directory entries
+ * before removal of existing entries starts, protect against
+ * concurrent elv_iosched_store() calls.
  */
  mutex_lock(&q->sysfs_lock);
 
@@ -939,18 +952,24 @@ void blk_unregister_queue(struct gendisk *disk)
  queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
  spin_unlock_irq(q->queue_lock);
 
- wbt_exit(q);
-
+ /*
+ * Remove the sysfs attributes before unregistering the queue data
+ * structures that can be modified through sysfs.
+ */
  if (q->mq_ops)
  blk_mq_unregister_dev(disk_to_dev(disk), q);
-
- if (q->request_fn || (q->mq_ops && q->elevator))
- elv_unregister_queue(q);
+ mutex_unlock(&q->sysfs_lock);
 
  kobject_uevent(&q->kobj, KOBJ_REMOVE);
  kobject_del(&q->kobj);
  blk_trace_remove_sysfs(disk_to_dev(disk));
- kobject_put(&disk_to_dev(disk)->kobj);
 
+ wbt_exit(q);
+
+ mutex_lock(&q->sysfs_lock);
+ if (q->request_fn || (q->mq_ops && q->elevator))
+ elv_unregister_queue(q);
  mutex_unlock(&q->sysfs_lock);
+
+ kobject_put(&disk_to_dev(disk)->kobj);
 }
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 06/13] blk-rq-qos: refactor out common elements of blk-wbt

Mauricio Faria de Oliveira
From: Josef Bacik <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit a79050434b45959f397042080fd1d70ffa9bd9df)
[mfo: backport:
 - blk-core.c:
   - hunk 4: refresh last context line due to lack of commit c3036021c7bd
     ("block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM"), not required
     nor its dependencies, for this fix.
   - hunk 5: refresh upper 2 context lines and last context line due to
     lack of commit 544ccc8dc904 ("block: get rid of struct blk_issue_stat")
     and commit e14575b3d457 ("block: convert REQ_ATOM_COMPLETE to stealing
     rq->__deadline bit"), not required.
   - hunk 6: refresh top context line due to lack of commit 522a777566f
     ("block: consolidate struct request timestamp fields"), not required.
 - blk-mq.c:
   - hunk 2: refresh top context line due to lack of commit 522a777566f
     ("block: consolidate struct request timestamp fields"), not required.
   - hunk 3: update upper context lines due to lack of commit 544ccc8dc904
     ("block: get rid of struct blk_issue_stat"), not required, and the
     lower context lines due to lack of commit 1d9bd5161ba3 ("blk-mq:
     replace timeout synchronization with a RCU and generation based
     scheme"), not required either for this fix.
   - hunk 4: refresh lower context lines due to lack of commit 1d9bd5161ba3
     ("blk-mq: replace timeout synchronization with a RCU and generation based
     scheme"), plus commit 5a61c36398d06 ("blk-mq: remove REQ_ATOM_STARTED"),
     not required either.
   - hunk 5: refresh upper context lines due to lack of commit 544ccc8dc904
     ("block: get rid of struct blk_issue_stat"), not required.
 - blkdev.h:
   - hunk 2: refresh upper context lines due to lack of commit 97889f9ac24f
     ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"),
     not required.]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/Makefile         |   2 +-
 block/blk-core.c       |  12 +-
 block/blk-mq.c         |  12 +-
 block/blk-rq-qos.c     | 178 ++++++++++++++++++++++
 block/blk-rq-qos.h     | 106 ++++++++++++++
 block/blk-settings.c   |   4 +-
 block/blk-sysfs.c      |  22 ++-
 block/blk-wbt.c        | 326 ++++++++++++++++++-----------------------
 block/blk-wbt.h        |  63 +++-----
 include/linux/blkdev.h |   4 +-
 10 files changed, 478 insertions(+), 251 deletions(-)
 create mode 100644 block/blk-rq-qos.c
 create mode 100644 block/blk-rq-qos.h

diff --git a/block/Makefile b/block/Makefile
index 6a56303b9925..981605042cad 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
  blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
  blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
  genhd.o partition-generic.o ioprio.o \
- badblocks.o partitions/
+ badblocks.o partitions/ blk-rq-qos.o
 
 obj-$(CONFIG_BOUNCE) += bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 6ed5c0eb29fc..5659da721708 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1553,7 +1553,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
  blk_delete_timer(rq);
  blk_clear_rq_complete(rq);
  trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, rq);
+ rq_qos_requeue(q, rq);
 
  if (rq->rq_flags & RQF_QUEUED)
  blk_queue_end_tag(q, rq);
@@ -1659,7 +1659,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
  /* this is a bio leak */
  WARN_ON(req->bio != NULL);
 
- wbt_done(q->rq_wb, req);
+ rq_qos_done(q, req);
 
  /*
  * Request may not have originated from ll_rw_blk. if not,
@@ -1951,7 +1951,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
  }
 
 get_rq:
- wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);
+ wb_acct = rq_qos_throttle(q, bio, q->queue_lock);
 
  /*
  * Grab a free request. This is might sleep but can not fail.
@@ -1961,7 +1961,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
  req = get_request(q, bio->bi_opf, bio, 0);
  if (IS_ERR(req)) {
  blk_queue_exit(q);
- __wbt_done(q->rq_wb, wb_acct);
+ rq_qos_cleanup(q, wb_acct);
  if (PTR_ERR(req) == -ENOMEM)
  bio->bi_status = BLK_STS_RESOURCE;
  else
@@ -2849,7 +2849,7 @@ void blk_start_request(struct request *req)
  if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
  blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
  req->rq_flags |= RQF_STATS;
- wbt_issue(req->q->rq_wb, req);
+ rq_qos_issue(req->q, req);
  }
 
  BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));
@@ -3068,7 +3068,7 @@ void blk_finish_request(struct request *req, blk_status_t error)
  blk_account_io_done(req);
 
  if (req->end_io) {
- wbt_done(req->q->rq_wb, req);
+ rq_qos_done(q, req);
  req->end_io(req, error);
  } else {
  if (blk_bidi_rq(req))
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7172ae4b9993..b741cf6b4c62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -499,7 +499,7 @@ void blk_mq_free_request(struct request *rq)
  if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
  laptop_io_completion(q->backing_dev_info);
 
- wbt_done(q->rq_wb, rq);
+ rq_qos_done(q, rq);
 
  if (blk_rq_rl(rq))
  blk_put_rl(blk_rq_rl(rq));
@@ -520,7 +520,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
  blk_account_io_done(rq);
 
  if (rq->end_io) {
- wbt_done(rq->q->rq_wb, rq);
+ rq_qos_done(rq->q, rq);
  rq->end_io(rq, error);
  } else {
  if (unlikely(blk_bidi_rq(rq)))
@@ -614,7 +614,7 @@ void blk_mq_start_request(struct request *rq)
  if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
  blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
  rq->rq_flags |= RQF_STATS;
- wbt_issue(q->rq_wb, rq);
+ rq_qos_issue(q, rq);
  }
 
  blk_add_timer(rq);
@@ -673,7 +673,7 @@ static void __blk_mq_requeue_request(struct request *rq)
  blk_mq_put_driver_tag(rq);
 
  trace_block_rq_requeue(q, rq);
- wbt_requeue(q->rq_wb, rq);
+ rq_qos_requeue(q, rq);
 
  if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
  if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1758,13 +1758,13 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
  if (blk_mq_sched_bio_merge(q, bio))
  return BLK_QC_T_NONE;
 
- wb_acct = wbt_wait(q->rq_wb, bio, NULL);
+ wb_acct = rq_qos_throttle(q, bio, NULL);
 
  trace_block_getrq(q, bio, bio->bi_opf);
 
  rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
  if (unlikely(!rq)) {
- __wbt_done(q->rq_wb, wb_acct);
+ rq_qos_cleanup(q, wb_acct);
  if (bio->bi_opf & REQ_NOWAIT)
  bio_wouldblock_error(bio);
  return BLK_QC_T_NONE;
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
new file mode 100644
index 000000000000..d2f2af8aa10c
--- /dev/null
+++ b/block/blk-rq-qos.c
@@ -0,0 +1,178 @@
+#include "blk-rq-qos.h"
+
+#include "blk-wbt.h"
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+ int cur = atomic_read(v);
+
+ for (;;) {
+ int old;
+
+ if (cur >= below)
+ return false;
+ old = atomic_cmpxchg(v, cur, cur + 1);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+
+ return true;
+}
+
+bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit)
+{
+ return atomic_inc_below(&rq_wait->inflight, limit);
+}
+
+void rq_qos_cleanup(struct request_queue *q, enum wbt_flags wb_acct)
+{
+ struct rq_qos *rqos;
+
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->cleanup)
+ rqos->ops->cleanup(rqos, wb_acct);
+ }
+}
+
+void rq_qos_done(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->done)
+ rqos->ops->done(rqos, rq);
+ }
+}
+
+void rq_qos_issue(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->issue)
+ rqos->ops->issue(rqos, rq);
+ }
+}
+
+void rq_qos_requeue(struct request_queue *q, struct request *rq)
+{
+ struct rq_qos *rqos;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->requeue)
+ rqos->ops->requeue(rqos, rq);
+ }
+}
+
+enum wbt_flags rq_qos_throttle(struct request_queue *q, struct bio *bio,
+       spinlock_t *lock)
+{
+ struct rq_qos *rqos;
+ enum wbt_flags flags = 0;
+
+ for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->ops->throttle)
+ flags |= rqos->ops->throttle(rqos, bio, lock);
+ }
+ return flags;
+}
+
+/*
+ * Return true, if we can't increase the depth further by scaling
+ */
+bool rq_depth_calc_max_depth(struct rq_depth *rqd)
+{
+ unsigned int depth;
+ bool ret = false;
+
+ /*
+ * For QD=1 devices, this is a special case. It's important for those
+ * to have one request ready when one completes, so force a depth of
+ * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+ * since the device can't have more than that in flight. If we're
+ * scaling down, then keep a setting of 1/1/1.
+ */
+ if (rqd->queue_depth == 1) {
+ if (rqd->scale_step > 0)
+ rqd->max_depth = 1;
+ else {
+ rqd->max_depth = 2;
+ ret = true;
+ }
+ } else {
+ /*
+ * scale_step == 0 is our default state. If we have suffered
+ * latency spikes, step will be > 0, and we shrink the
+ * allowed write depths. If step is < 0, we're only doing
+ * writes, and we allow a temporarily higher depth to
+ * increase performance.
+ */
+ depth = min_t(unsigned int, rqd->default_depth,
+      rqd->queue_depth);
+ if (rqd->scale_step > 0)
+ depth = 1 + ((depth - 1) >> min(31, rqd->scale_step));
+ else if (rqd->scale_step < 0) {
+ unsigned int maxd = 3 * rqd->queue_depth / 4;
+
+ depth = 1 + ((depth - 1) << -rqd->scale_step);
+ if (depth > maxd) {
+ depth = maxd;
+ ret = true;
+ }
+ }
+
+ rqd->max_depth = depth;
+ }
+
+ return ret;
+}
+
+void rq_depth_scale_up(struct rq_depth *rqd)
+{
+ /*
+ * Hit max in previous round, stop here
+ */
+ if (rqd->scaled_max)
+ return;
+
+ rqd->scale_step--;
+
+ rqd->scaled_max = rq_depth_calc_max_depth(rqd);
+}
+
+/*
+ * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
+ * had a latency violation.
+ */
+void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle)
+{
+ /*
+ * Stop scaling down when we've hit the limit. This also prevents
+ * ->scale_step from going to crazy values, if the device can't
+ * keep up.
+ */
+ if (rqd->max_depth == 1)
+ return;
+
+ if (rqd->scale_step < 0 && hard_throttle)
+ rqd->scale_step = 0;
+ else
+ rqd->scale_step++;
+
+ rqd->scaled_max = false;
+ rq_depth_calc_max_depth(rqd);
+}
+
+void rq_qos_exit(struct request_queue *q)
+{
+ while (q->rq_qos) {
+ struct rq_qos *rqos = q->rq_qos;
+ q->rq_qos = rqos->next;
+ rqos->ops->exit(rqos);
+ }
+}
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
new file mode 100644
index 000000000000..f9a39bd6ece3
--- /dev/null
+++ b/block/blk-rq-qos.h
@@ -0,0 +1,106 @@
+#ifndef RQ_QOS_H
+#define RQ_QOS_H
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk_types.h>
+#include <linux/atomic.h>
+#include <linux/wait.h>
+
+enum rq_qos_id {
+ RQ_QOS_WBT,
+ RQ_QOS_CGROUP,
+};
+
+struct rq_wait {
+ wait_queue_head_t wait;
+ atomic_t inflight;
+};
+
+struct rq_qos {
+ struct rq_qos_ops *ops;
+ struct request_queue *q;
+ enum rq_qos_id id;
+ struct rq_qos *next;
+};
+
+struct rq_qos_ops {
+ enum wbt_flags (*throttle)(struct rq_qos *, struct bio *,
+   spinlock_t *);
+ void (*issue)(struct rq_qos *, struct request *);
+ void (*requeue)(struct rq_qos *, struct request *);
+ void (*done)(struct rq_qos *, struct request *);
+ void (*cleanup)(struct rq_qos *, enum wbt_flags);
+ void (*exit)(struct rq_qos *);
+};
+
+struct rq_depth {
+ unsigned int max_depth;
+
+ int scale_step;
+ bool scaled_max;
+
+ unsigned int queue_depth;
+ unsigned int default_depth;
+};
+
+static inline struct rq_qos *rq_qos_id(struct request_queue *q,
+       enum rq_qos_id id)
+{
+ struct rq_qos *rqos;
+ for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
+ if (rqos->id == id)
+ break;
+ }
+ return rqos;
+}
+
+static inline struct rq_qos *wbt_rq_qos(struct request_queue *q)
+{
+ return rq_qos_id(q, RQ_QOS_WBT);
+}
+
+static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
+{
+ return rq_qos_id(q, RQ_QOS_CGROUP);
+}
+
+static inline void rq_wait_init(struct rq_wait *rq_wait)
+{
+ atomic_set(&rq_wait->inflight, 0);
+ init_waitqueue_head(&rq_wait->wait);
+}
+
+static inline void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
+{
+ rqos->next = q->rq_qos;
+ q->rq_qos = rqos;
+}
+
+static inline void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
+{
+ struct rq_qos *cur, *prev = NULL;
+ for (cur = q->rq_qos; cur; cur = cur->next) {
+ if (cur == rqos) {
+ if (prev)
+ prev->next = rqos->next;
+ else
+ q->rq_qos = cur;
+ break;
+ }
+ prev = cur;
+ }
+}
+
+bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit);
+void rq_depth_scale_up(struct rq_depth *rqd);
+void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle);
+bool rq_depth_calc_max_depth(struct rq_depth *rqd);
+
+void rq_qos_cleanup(struct request_queue *, enum wbt_flags);
+void rq_qos_done(struct request_queue *, struct request *);
+void rq_qos_issue(struct request_queue *, struct request *);
+void rq_qos_requeue(struct request_queue *, struct request *);
+enum wbt_flags rq_qos_throttle(struct request_queue *, struct bio *, spinlock_t *);
+void rq_qos_exit(struct request_queue *);
+#endif
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 48ebe6be07b7..93982d81e6d2 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -877,7 +877,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
  q->queue_depth = depth;
- wbt_set_queue_depth(q->rq_wb, depth);
+ wbt_set_queue_depth(q, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
@@ -902,7 +902,7 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
  queue_flag_clear(QUEUE_FLAG_FUA, q);
  spin_unlock_irq(q->queue_lock);
 
- wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+ wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c29ec0cfab1d..d3eb9c734062 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -426,16 +426,16 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 
 static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
 {
- if (!q->rq_wb)
+ if (!wbt_rq_qos(q))
  return -EINVAL;
 
- return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+ return sprintf(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
 }
 
 static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
   size_t count)
 {
- struct rq_wb *rwb;
+ struct rq_qos *rqos;
  ssize_t ret;
  s64 val;
 
@@ -445,23 +445,21 @@ static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
  if (val < -1)
  return -EINVAL;
 
- rwb = q->rq_wb;
- if (!rwb) {
+ rqos = wbt_rq_qos(q);
+ if (!rqos) {
  ret = wbt_init(q);
  if (ret)
  return ret;
  }
 
- rwb = q->rq_wb;
  if (val == -1)
- rwb->min_lat_nsec = wbt_default_latency_nsec(q);
+ val = wbt_default_latency_nsec(q);
  else if (val >= 0)
- rwb->min_lat_nsec = val * 1000ULL;
+ val *= 1000ULL;
 
- if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
- rwb->enable_state = WBT_STATE_ON_MANUAL;
+ wbt_set_min_lat(q, val);
 
- wbt_update_limits(rwb);
+ wbt_update_limits(q);
  return count;
 }
 
@@ -964,7 +962,7 @@ void blk_unregister_queue(struct gendisk *disk)
  kobject_del(&q->kobj);
  blk_trace_remove_sysfs(disk_to_dev(disk));
 
- wbt_exit(q);
+ rq_qos_exit(q);
 
  mutex_lock(&q->sysfs_lock);
  if (q->request_fn || (q->mq_ops && q->elevator))
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index efc7f9fa2346..4a5dc6d8d9ef 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -25,6 +25,7 @@
 #include <linux/swap.h>
 
 #include "blk-wbt.h"
+#include "blk-rq-qos.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/wbt.h>
@@ -78,28 +79,6 @@ static inline bool rwb_enabled(struct rq_wb *rwb)
  return rwb && rwb->wb_normal != 0;
 }
 
-/*
- * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
- * false if 'v' + 1 would be bigger than 'below'.
- */
-static bool atomic_inc_below(atomic_t *v, int below)
-{
- int cur = atomic_read(v);
-
- for (;;) {
- int old;
-
- if (cur >= below)
- return false;
- old = atomic_cmpxchg(v, cur, cur + 1);
- if (old == cur)
- break;
- cur = old;
- }
-
- return true;
-}
-
 static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
 {
  if (rwb_enabled(rwb)) {
@@ -116,7 +95,7 @@ static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
  */
 static bool wb_recent_wait(struct rq_wb *rwb)
 {
- struct bdi_writeback *wb = &rwb->queue->backing_dev_info->wb;
+ struct bdi_writeback *wb = &rwb->rqos.q->backing_dev_info->wb;
 
  return time_before(jiffies, wb->dirty_sleep + HZ);
 }
@@ -142,8 +121,9 @@ static void rwb_wake_all(struct rq_wb *rwb)
  }
 }
 
-void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 {
+ struct rq_wb *rwb = RQWB(rqos);
  struct rq_wait *rqw;
  int inflight, limit;
 
@@ -189,10 +169,9 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
  * Called on completion of a request. Note that it's also called when
  * a request is merged, when the request gets freed.
  */
-void wbt_done(struct rq_wb *rwb, struct request *rq)
+static void wbt_done(struct rq_qos *rqos, struct request *rq)
 {
- if (!rwb)
- return;
+ struct rq_wb *rwb = RQWB(rqos);
 
  if (!wbt_is_tracked(rq)) {
  if (rwb->sync_cookie == rq) {
@@ -204,72 +183,11 @@ void wbt_done(struct rq_wb *rwb, struct request *rq)
  wb_timestamp(rwb, &rwb->last_comp);
  } else {
  WARN_ON_ONCE(rq == rwb->sync_cookie);
- __wbt_done(rwb, wbt_flags(rq));
+ __wbt_done(rqos, wbt_flags(rq));
  }
  wbt_clear_state(rq);
 }
 
-/*
- * Return true, if we can't increase the depth further by scaling
- */
-static bool calc_wb_limits(struct rq_wb *rwb)
-{
- unsigned int depth;
- bool ret = false;
-
- if (!rwb->min_lat_nsec) {
- rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
- return false;
- }
-
- /*
- * For QD=1 devices, this is a special case. It's important for those
- * to have one request ready when one completes, so force a depth of
- * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
- * since the device can't have more than that in flight. If we're
- * scaling down, then keep a setting of 1/1/1.
- */
- if (rwb->queue_depth == 1) {
- if (rwb->scale_step > 0)
- rwb->wb_max = rwb->wb_normal = 1;
- else {
- rwb->wb_max = rwb->wb_normal = 2;
- ret = true;
- }
- rwb->wb_background = 1;
- } else {
- /*
- * scale_step == 0 is our default state. If we have suffered
- * latency spikes, step will be > 0, and we shrink the
- * allowed write depths. If step is < 0, we're only doing
- * writes, and we allow a temporarily higher depth to
- * increase performance.
- */
- depth = min_t(unsigned int, RWB_DEF_DEPTH, rwb->queue_depth);
- if (rwb->scale_step > 0)
- depth = 1 + ((depth - 1) >> min(31, rwb->scale_step));
- else if (rwb->scale_step < 0) {
- unsigned int maxd = 3 * rwb->queue_depth / 4;
-
- depth = 1 + ((depth - 1) << -rwb->scale_step);
- if (depth > maxd) {
- depth = maxd;
- ret = true;
- }
- }
-
- /*
- * Set our max/normal/bg queue depths based on how far
- * we have scaled down (->scale_step).
- */
- rwb->wb_max = depth;
- rwb->wb_normal = (rwb->wb_max + 1) / 2;
- rwb->wb_background = (rwb->wb_max + 3) / 4;
- }
-
- return ret;
-}
-
 static inline bool stat_sample_valid(struct blk_rq_stat *stat)
 {
  /*
@@ -302,7 +220,8 @@ enum {
 
 static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
 {
- struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
+ struct backing_dev_info *bdi = rwb->rqos.q->backing_dev_info;
+ struct rq_depth *rqd = &rwb->rq_depth;
  u64 thislat;
 
  /*
@@ -346,7 +265,7 @@ static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
  return LAT_EXCEEDED;
  }
 
- if (rwb->scale_step)
+ if (rqd->scale_step)
  trace_wbt_stat(bdi, stat);
 
  return LAT_OK;
@@ -354,58 +273,48 @@ static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
 
 static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
 {
- struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
+ struct backing_dev_info *bdi = rwb->rqos.q->backing_dev_info;
+ struct rq_depth *rqd = &rwb->rq_depth;
 
- trace_wbt_step(bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
- rwb->wb_background, rwb->wb_normal, rwb->wb_max);
+ trace_wbt_step(bdi, msg, rqd->scale_step, rwb->cur_win_nsec,
+ rwb->wb_background, rwb->wb_normal, rqd->max_depth);
 }
 
-static void scale_up(struct rq_wb *rwb)
+static void calc_wb_limits(struct rq_wb *rwb)
 {
- /*
- * Hit max in previous round, stop here
- */
- if (rwb->scaled_max)
- return;
+ if (rwb->min_lat_nsec == 0) {
+ rwb->wb_normal = rwb->wb_background = 0;
+ } else if (rwb->rq_depth.max_depth <= 2) {
+ rwb->wb_normal = rwb->rq_depth.max_depth;
+ rwb->wb_background = 1;
+ } else {
+ rwb->wb_normal = (rwb->rq_depth.max_depth + 1) / 2;
+ rwb->wb_background = (rwb->rq_depth.max_depth + 3) / 4;
+ }
+}
 
- rwb->scale_step--;
+static void scale_up(struct rq_wb *rwb)
+{
+ rq_depth_scale_up(&rwb->rq_depth);
+ calc_wb_limits(rwb);
  rwb->unknown_cnt = 0;
-
- rwb->scaled_max = calc_wb_limits(rwb);
-
- rwb_wake_all(rwb);
-
- rwb_trace_step(rwb, "step up");
+ rwb_trace_step(rwb, "scale up");
 }
 
-/*
- * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
- * had a latency violation.
- */
 static void scale_down(struct rq_wb *rwb, bool hard_throttle)
 {
- /*
- * Stop scaling down when we've hit the limit. This also prevents
- * ->scale_step from going to crazy values, if the device can't
- * keep up.
- */
- if (rwb->wb_max == 1)
- return;
-
- if (rwb->scale_step < 0 && hard_throttle)
- rwb->scale_step = 0;
- else
- rwb->scale_step++;
-
- rwb->scaled_max = false;
- rwb->unknown_cnt = 0;
+ rq_depth_scale_down(&rwb->rq_depth, hard_throttle);
  calc_wb_limits(rwb);
- rwb_trace_step(rwb, "step down");
+ rwb->unknown_cnt = 0;
+ rwb_wake_all(rwb);
+ rwb_trace_step(rwb, "scale down");
 }
 
 static void rwb_arm_timer(struct rq_wb *rwb)
 {
- if (rwb->scale_step > 0) {
+ struct rq_depth *rqd = &rwb->rq_depth;
+
+ if (rqd->scale_step > 0) {
  /*
  * We should speed this up, using some variant of a fast
  * integer inverse square root calculation. Since we only do
@@ -413,7 +322,7 @@ static void rwb_arm_timer(struct rq_wb *rwb)
  * though.
  */
  rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
- int_sqrt((rwb->scale_step + 1) << 8));
+ int_sqrt((rqd->scale_step + 1) << 8));
  } else {
  /*
  * For step < 0, we don't want to increase/decrease the
@@ -428,12 +337,13 @@ static void rwb_arm_timer(struct rq_wb *rwb)
 static void wb_timer_fn(struct blk_stat_callback *cb)
 {
  struct rq_wb *rwb = cb->data;
+ struct rq_depth *rqd = &rwb->rq_depth;
  unsigned int inflight = wbt_inflight(rwb);
  int status;
 
  status = latency_exceeded(rwb, cb->stat);
 
- trace_wbt_timer(rwb->queue->backing_dev_info, status, rwb->scale_step,
+ trace_wbt_timer(rwb->rqos.q->backing_dev_info, status, rqd->scale_step,
  inflight);
 
  /*
@@ -464,9 +374,9 @@ static void wb_timer_fn(struct blk_stat_callback *cb)
  * currently don't have a valid read/write sample. For that
  * case, slowly return to center state (step == 0).
  */
- if (rwb->scale_step > 0)
+ if (rqd->scale_step > 0)
  scale_up(rwb);
- else if (rwb->scale_step < 0)
+ else if (rqd->scale_step < 0)
  scale_down(rwb, false);
  break;
  default:
@@ -476,19 +386,50 @@ static void wb_timer_fn(struct blk_stat_callback *cb)
  /*
  * Re-arm timer, if we have IO in flight
  */
- if (rwb->scale_step || inflight)
+ if (rqd->scale_step || inflight)
  rwb_arm_timer(rwb);
 }
 
-void wbt_update_limits(struct rq_wb *rwb)
+static void __wbt_update_limits(struct rq_wb *rwb)
 {
- rwb->scale_step = 0;
- rwb->scaled_max = false;
+ struct rq_depth *rqd = &rwb->rq_depth;
+
+ rqd->scale_step = 0;
+ rqd->scaled_max = false;
+
+ rq_depth_calc_max_depth(rqd);
  calc_wb_limits(rwb);
 
  rwb_wake_all(rwb);
 }
 
+void wbt_update_limits(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return;
+ __wbt_update_limits(RQWB(rqos));
+}
+
+u64 wbt_get_min_lat(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return 0;
+ return RQWB(rqos)->min_lat_nsec;
+}
+
+void wbt_set_min_lat(struct request_queue *q, u64 val)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (!rqos)
+ return;
+ RQWB(rqos)->min_lat_nsec = val;
+ RQWB(rqos)->enable_state = WBT_STATE_ON_MANUAL;
+ __wbt_update_limits(RQWB(rqos));
+}
+
+
 static bool close_io(struct rq_wb *rwb)
 {
  const unsigned long now = jiffies;
@@ -512,7 +453,7 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  * IO for a bit.
  */
  if ((rw & REQ_HIPRIO) || wb_recent_wait(rwb) || current_is_kswapd())
- limit = rwb->wb_max;
+ limit = rwb->rq_depth.max_depth;
  else if ((rw & REQ_BACKGROUND) || close_io(rwb)) {
  /*
  * If less than 100ms since we completed unrelated IO,
@@ -546,7 +487,7 @@ static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
     rqw->wait.head.next != &wait->entry)
  return false;
 
- return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
+ return rq_wait_inc_below(rqw, get_limit(rwb, rw));
 }
 
 /*
@@ -607,8 +548,10 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
  * in an irq held spinlock, if it holds one when calling this function.
  * If we do sleep, we'll release and re-grab it.
  */
-enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
+static enum wbt_flags wbt_wait(struct rq_qos *rqos, struct bio *bio,
+       spinlock_t *lock)
 {
+ struct rq_wb *rwb = RQWB(rqos);
  enum wbt_flags ret = 0;
 
  if (!rwb_enabled(rwb))
@@ -634,8 +577,10 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
  return ret | WBT_TRACKED;
 }
 
-void wbt_issue(struct rq_wb *rwb, struct request *rq)
+void wbt_issue(struct rq_qos *rqos, struct request *rq)
 {
+ struct rq_wb *rwb = RQWB(rqos);
+
  if (!rwb_enabled(rwb))
  return;
 
@@ -652,8 +597,9 @@ void wbt_issue(struct rq_wb *rwb, struct request *rq)
  }
 }
 
-void wbt_requeue(struct rq_wb *rwb, struct request *rq)
+void wbt_requeue(struct rq_qos *rqos, struct request *rq)
 {
+ struct rq_wb *rwb = RQWB(rqos);
  if (!rwb_enabled(rwb))
  return;
  if (rq == rwb->sync_cookie) {
@@ -662,39 +608,30 @@ void wbt_requeue(struct rq_wb *rwb, struct request *rq)
  }
 }
 
-void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
- if (rwb) {
- rwb->queue_depth = depth;
- wbt_update_limits(rwb);
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (rqos) {
+ RQWB(rqos)->rq_depth.queue_depth = depth;
+ __wbt_update_limits(RQWB(rqos));
  }
 }
 
-void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
-{
- if (rwb)
- rwb->wc = write_cache_on;
-}
-
-/*
- * Disable wbt, if enabled by default.
- */
-void wbt_disable_default(struct request_queue *q)
+void wbt_set_write_cache(struct request_queue *q, bool write_cache_on)
 {
- struct rq_wb *rwb = q->rq_wb;
-
- if (rwb && rwb->enable_state == WBT_STATE_ON_DEFAULT)
- wbt_exit(q);
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ if (rqos)
+ RQWB(rqos)->wc = write_cache_on;
 }
-EXPORT_SYMBOL_GPL(wbt_disable_default);
 
 /*
  * Enable wbt if defaults are configured that way
  */
 void wbt_enable_default(struct request_queue *q)
 {
+ struct rq_qos *rqos = wbt_rq_qos(q);
  /* Throttling already enabled? */
- if (q->rq_wb)
+ if (rqos)
  return;
 
  /* Queue not registered? Maybe shutting down... */
@@ -732,6 +669,41 @@ static int wbt_data_dir(const struct request *rq)
  return -1;
 }
 
+static void wbt_exit(struct rq_qos *rqos)
+{
+ struct rq_wb *rwb = RQWB(rqos);
+ struct request_queue *q = rqos->q;
+
+ blk_stat_remove_callback(q, rwb->cb);
+ blk_stat_free_callback(rwb->cb);
+ kfree(rwb);
+}
+
+/*
+ * Disable wbt, if enabled by default.
+ */
+void wbt_disable_default(struct request_queue *q)
+{
+ struct rq_qos *rqos = wbt_rq_qos(q);
+ struct rq_wb *rwb;
+ if (!rqos)
+ return;
+ rwb = RQWB(rqos);
+ if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
+ rwb->wb_normal = 0;
+}
+EXPORT_SYMBOL_GPL(wbt_disable_default);
+
+
+static struct rq_qos_ops wbt_rqos_ops = {
+ .throttle = wbt_wait,
+ .issue = wbt_issue,
+ .requeue = wbt_requeue,
+ .done = wbt_done,
+ .cleanup = __wbt_done,
+ .exit = wbt_exit,
+};
+
 int wbt_init(struct request_queue *q)
 {
  struct rq_wb *rwb;
@@ -749,39 +721,29 @@ int wbt_init(struct request_queue *q)
  return -ENOMEM;
  }
 
- for (i = 0; i < WBT_NUM_RWQ; i++) {
- atomic_set(&rwb->rq_wait[i].inflight, 0);
- init_waitqueue_head(&rwb->rq_wait[i].wait);
- }
+ for (i = 0; i < WBT_NUM_RWQ; i++)
+ rq_wait_init(&rwb->rq_wait[i]);
 
+ rwb->rqos.id = RQ_QOS_WBT;
+ rwb->rqos.ops = &wbt_rqos_ops;
+ rwb->rqos.q = q;
  rwb->last_comp = rwb->last_issue = jiffies;
- rwb->queue = q;
  rwb->win_nsec = RWB_WINDOW_NSEC;
  rwb->enable_state = WBT_STATE_ON_DEFAULT;
- wbt_update_limits(rwb);
+ rwb->wc = 1;
+ rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
+ __wbt_update_limits(rwb);
 
  /*
  * Assign rwb and add the stats callback.
  */
- q->rq_wb = rwb;
+ rq_qos_add(q, &rwb->rqos);
  blk_stat_add_callback(q, rwb->cb);
 
  rwb->min_lat_nsec = wbt_default_latency_nsec(q);
 
- wbt_set_queue_depth(rwb, blk_queue_depth(q));
- wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+ wbt_set_queue_depth(q, blk_queue_depth(q));
+ wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 
  return 0;
 }
-
-void wbt_exit(struct request_queue *q)
-{
- struct rq_wb *rwb = q->rq_wb;
-
- if (rwb) {
- blk_stat_remove_callback(q, rwb->cb);
- blk_stat_free_callback(rwb->cb);
- q->rq_wb = NULL;
- kfree(rwb);
- }
-}
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 5cf2398caf49..eeb031545acb 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -9,6 +9,7 @@
 #include <linux/ktime.h>
 
 #include "blk-stat.h"
+#include "blk-rq-qos.h"
 
 enum wbt_flags {
  WBT_TRACKED = 1, /* write, tracked for throttling */
@@ -33,20 +34,12 @@ enum {
  WBT_STATE_ON_MANUAL = 2,
 };
 
-struct rq_wait {
- wait_queue_head_t wait;
- atomic_t inflight;
-};
-
 struct rq_wb {
  /*
  * Settings that govern how we throttle
  */
  unsigned int wb_background; /* background writeback */
  unsigned int wb_normal; /* normal writeback */
- unsigned int wb_max; /* max throughput writeback */
- int scale_step;
- bool scaled_max;
 
  short enable_state; /* WBT_STATE_* */
 
@@ -65,15 +58,20 @@ struct rq_wb {
  void *sync_cookie;
 
  unsigned int wc;
- unsigned int queue_depth;
 
  unsigned long last_issue; /* last non-throttled issue */
  unsigned long last_comp; /* last non-throttled comp */
  unsigned long min_lat_nsec;
- struct request_queue *queue;
+ struct rq_qos rqos;
  struct rq_wait rq_wait[WBT_NUM_RWQ];
+ struct rq_depth rq_depth;
 };
 
+static inline struct rq_wb *RQWB(struct rq_qos *rqos)
+{
+ return container_of(rqos, struct rq_wb, rqos);
+}
+
 static inline unsigned int wbt_inflight(struct rq_wb *rwb)
 {
  unsigned int i, ret = 0;
@@ -84,6 +82,7 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)
  return ret;
 }
 
+
 #ifdef CONFIG_BLK_WBT
 
 static inline void wbt_track(struct request *rq, enum wbt_flags flags)
@@ -91,19 +90,16 @@ static inline void wbt_track(struct request *rq, enum wbt_flags flags)
  rq->issue_stat.stat |= ((u64)flags) << BLK_STAT_RES_SHIFT;
 }
 
-void __wbt_done(struct rq_wb *, enum wbt_flags);
-void wbt_done(struct rq_wb *, struct request *);
-enum wbt_flags wbt_wait(struct rq_wb *, struct bio *, spinlock_t *);
 int wbt_init(struct request_queue *);
-void wbt_exit(struct request_queue *);
-void wbt_update_limits(struct rq_wb *);
-void wbt_requeue(struct rq_wb *, struct request *);
-void wbt_issue(struct rq_wb *, struct request *);
+void wbt_update_limits(struct request_queue *);
 void wbt_disable_default(struct request_queue *);
 void wbt_enable_default(struct request_queue *);
 
-void wbt_set_queue_depth(struct rq_wb *, unsigned int);
-void wbt_set_write_cache(struct rq_wb *, bool);
+u64 wbt_get_min_lat(struct request_queue *q);
+void wbt_set_min_lat(struct request_queue *q, u64 val);
+
+void wbt_set_queue_depth(struct request_queue *, unsigned int);
+void wbt_set_write_cache(struct request_queue *, bool);
 
 u64 wbt_default_latency_nsec(struct request_queue *);
 
@@ -112,43 +108,30 @@ u64 wbt_default_latency_nsec(struct request_queue *);
 static inline void wbt_track(struct request *rq, enum wbt_flags flags)
 {
 }
-static inline void __wbt_done(struct rq_wb *rwb, enum wbt_flags flags)
-{
-}
-static inline void wbt_done(struct rq_wb *rwb, struct request *rq)
-{
-}
-static inline enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio,
-      spinlock_t *lock)
-{
- return 0;
-}
 static inline int wbt_init(struct request_queue *q)
 {
  return -EINVAL;
 }
-static inline void wbt_exit(struct request_queue *q)
-{
-}
-static inline void wbt_update_limits(struct rq_wb *rwb)
+static inline void wbt_update_limits(struct request_queue *q)
 {
 }
-static inline void wbt_requeue(struct rq_wb *rwb, struct request *rq)
+static inline void wbt_disable_default(struct request_queue *q)
 {
 }
-static inline void wbt_issue(struct rq_wb *rwb, struct request *rq)
+static inline void wbt_enable_default(struct request_queue *q)
 {
 }
-static inline void wbt_disable_default(struct request_queue *q)
+static inline void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 }
-static inline void wbt_enable_default(struct request_queue *q)
+static inline void wbt_set_write_cache(struct request_queue *q, bool wc)
 {
 }
-static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+static inline u64 wbt_get_min_lat(struct request_queue *q)
 {
+ return 0;
 }
-static inline void wbt_set_write_cache(struct rq_wb *rwb, bool wc)
+static inline void wbt_set_min_lat(struct request_queue *q, u64 val)
 {
 }
 static inline u64 wbt_default_latency_nsec(struct request_queue *q)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50c81c2b9dea..ad7625b18fcc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -40,7 +40,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
-struct rq_wb;
+struct rq_qos;
 struct blk_queue_stats;
 struct blk_stat_callback;
 
@@ -415,7 +415,7 @@ struct request_queue {
  atomic_t shared_hctx_restart;
 
  struct blk_queue_stats *stats;
- struct rq_wb *rq_wb;
+ struct rq_qos *rq_qos;
 
  /*
  * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 07/13] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Anchal Agarwal <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled, with the hung task detector also
kicking in.

Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).

[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.948138] Call Trace:
[    0.948139]  <IRQ>
[    0.948142]  do_raw_spin_lock+0xad/0xc0
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.948149]  ? __wake_up_common_lock+0x53/0x90
[    0.948150]  __wake_up_common_lock+0x53/0x90
[    0.948155]  wbt_done+0x7b/0xa0
[    0.948158]  blk_mq_free_request+0xb7/0x110
[    0.948161]  __blk_mq_complete_request+0xcb/0x140
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
[    0.948169]  nvme_irq+0x23/0x50 [nvme]
[    0.948173]  __handle_irq_event_percpu+0x46/0x300
[    0.948176]  handle_irq_event_percpu+0x20/0x50
[    0.948179]  handle_irq_event+0x34/0x60
[    0.948181]  handle_edge_irq+0x77/0x190
[    0.948185]  handle_irq+0xaf/0x120
[    0.948188]  do_IRQ+0x53/0x110
[    0.948191]  common_interrupt+0x87/0x87
[    0.948192]  </IRQ>
....
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.311154] Call Trace:
[    0.311157]  do_raw_spin_lock+0xad/0xc0
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
[    0.311167]  wbt_wait+0x127/0x330
[    0.311169]  ? finish_wait+0x80/0x80
[    0.311172]  ? generic_make_request+0xda/0x3b0
[    0.311174]  blk_mq_make_request+0xd6/0x7b0
[    0.311176]  ? blk_queue_enter+0x24/0x260
[    0.311178]  ? generic_make_request+0xda/0x3b0
[    0.311181]  generic_make_request+0x10c/0x3b0
[    0.311183]  ? submit_bio+0x5c/0x110
[    0.311185]  submit_bio+0x5c/0x110
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
[    0.311229]  ? do_writepages+0x3c/0xd0
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[    0.311240]  do_writepages+0x3c/0xd0
[    0.311243]  ? _raw_spin_unlock+0x24/0x30
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
[    0.311253]  file_write_and_wait_range+0x34/0x90
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
[    0.311267]  do_fsync+0x38/0x60
[    0.311270]  SyS_fsync+0xc/0x10
[    0.311272]  do_syscall_64+0x6f/0x170
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only. However, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:

if (waitqueue_active(&rqw->wait) &&
            rqw->wait.head.next != &wait->entry)
                return false;

The problem with this is that the wait entry in wbt_wait is
defined with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.
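
For reference, a minimal sketch of what the autoremove wakeup function
does (this mirrors the kernel's autoremove_wake_function(); the name and
layout here are illustrative only):

    static int autoremove_wake_sketch(struct wait_queue_entry *wq_entry,
                                      unsigned mode, int sync, void *key)
    {
        int ret = default_wake_function(wq_entry, mode, sync, key);

        /* On a successful wakeup the entry takes itself off the wait
         * queue, so the waiter's "am I next in line?" check above no
         * longer finds it there.
         */
        if (ret)
            list_del_init(&wq_entry->entry);

        return ret;
    }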

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.
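
A minimal sketch of the resulting wait-loop pattern (generic code with a
hypothetical get_token() callback standing in for the wbt accounting; the
actual change is in the diff below):

    static void wait_for_token(wait_queue_head_t *wq,
                               bool (*get_token)(void *arg), void *arg)
    {
        /* default wake function: the entry stays queued when woken */
        DECLARE_WAITQUEUE(wait, current);

        /* add ourselves once, at the tail, in strict FIFO order */
        add_wait_queue_exclusive(wq, &wait);
        do {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (get_token(arg))
                break;
            io_schedule();
        } while (1);

        __set_current_state(TASK_RUNNING);
        /* only now does the entry leave the wait queue */
        remove_wait_queue(wq, &wait);
    }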

A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.

Signed-off-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Frank van der Linden <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 2887e41b910bb14fd847cf01ab7a5993db989d88)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 55 +++++++++++++++++++++----------------------------
 1 file changed, 24 insertions(+), 31 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4a5dc6d8d9ef..52dbec3a116e 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -161,7 +161,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
- wake_up_all(&rqw->wait);
+ wake_up(&rqw->wait);
  }
 }
 
@@ -466,30 +466,6 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  return limit;
 }
 
-static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
-     wait_queue_entry_t *wait, unsigned long rw)
-{
- /*
- * inc it here even if disabled, since we'll dec it at completion.
- * this only happens if the task was sleeping in __wbt_wait(),
- * and someone turned it off at the same time.
- */
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- return true;
- }
-
- /*
- * If the waitqueue is already active and we are not the next
- * in line to be woken up, wait for our turn.
- */
- if (waitqueue_active(&rqw->wait) &&
-    rqw->wait.head.next != &wait->entry)
- return false;
-
- return rq_wait_inc_below(rqw, get_limit(rwb, rw));
-}
-
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -500,16 +476,32 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  __acquires(lock)
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
- DEFINE_WAIT(wait);
+ DECLARE_WAITQUEUE(wait, current);
 
- if (may_queue(rwb, rqw, &wait, rw))
+ /*
+ * inc it here even if disabled, since we'll dec it at completion.
+ * this only happens if the task was sleeping in __wbt_wait(),
+ * and someone turned it off at the same time.
+ */
+ if (!rwb_enabled(rwb)) {
+ atomic_inc(&rqw->inflight);
  return;
+ }
 
+ if (!waitqueue_active(&rqw->wait)
+ && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ return;
+
+ add_wait_queue_exclusive(&rqw->wait, &wait);
  do {
- prepare_to_wait_exclusive(&rqw->wait, &wait,
- TASK_UNINTERRUPTIBLE);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ if (!rwb_enabled(rwb)) {
+ atomic_inc(&rqw->inflight);
+ break;
+ }
 
- if (may_queue(rwb, rqw, &wait, rw))
+ if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  break;
 
  if (lock) {
@@ -520,7 +512,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  io_schedule();
  } while (1);
 
- finish_wait(&rqw->wait, &wait);
+ __set_current_state(TASK_RUNNING);
+ remove_wait_queue(&rqw->wait, &wait);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 08/13] blk-wbt: move disable check into get_limit()

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Check it in one place, instead of in multiple places.

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(backported from commit ffa358dcaae1f2f00926484e712e06daa8953cb4)
[mfo: backport:
 - blk-wbt.c:
   - hunk 1: refresh lower context lines due to lack of commit
     782f569774d7 ("blk-wbt: throttle discards like background
     writes"), not required / introduces a new thing/behavior.]
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 52dbec3a116e..d6a85e2a07c9 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -444,6 +444,13 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 {
  unsigned int limit;
 
+ /*
+ * If we got disabled, just return UINT_MAX. This ensures that
+ * we'll properly inc a new IO, and dec+wakeup at the end.
+ */
+ if (!rwb_enabled(rwb))
+ return UINT_MAX;
+
  /*
  * At this point we know it's a buffered write. If this is
  * kswapd trying to free memory, or REQ_SYNC is set, then
@@ -478,16 +485,6 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
 
- /*
- * inc it here even if disabled, since we'll dec it at completion.
- * this only happens if the task was sleeping in __wbt_wait(),
- * and someone turned it off at the same time.
- */
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- return;
- }
-
  if (!waitqueue_active(&rqw->wait)
  && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
@@ -496,11 +493,6 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  do {
  set_current_state(TASK_UNINTERRUPTIBLE);
 
- if (!rwb_enabled(rwb)) {
- atomic_inc(&rqw->inflight);
- break;
- }
-
  if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  break;
 
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 09/13] blk-wbt: use wq_has_sleeper() for wq active check

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We need the memory barrier before checking the list head, so use
the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().
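
Roughly, the helper pairs a full memory barrier with the waker-side check;
a sketch of the wq_has_sleeper() idea (not the exact kernel source):

    static inline bool wq_has_sleeper_sketch(struct wait_queue_head *wq_head)
    {
        /* Pairs with the barrier implied by set_current_state() on the
         * waiter side, so the waker observes the entry added before it
         * tests the list head.
         */
        smp_mb();
        return waitqueue_active(wq_head);
    }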

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit b78820937b4762b7d30b807d7156bec1d89e4dd3)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index d6a85e2a07c9..5f975288e9fb 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -116,7 +116,7 @@ static void rwb_wake_all(struct rq_wb *rwb)
  for (i = 0; i < WBT_NUM_RWQ; i++) {
  struct rq_wait *rqw = &rwb->rq_wait[i];
 
- if (waitqueue_active(&rqw->wait))
+ if (wq_has_sleeper(&rqw->wait))
  wake_up_all(&rqw->wait);
  }
 }
@@ -157,7 +157,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
  if (inflight && inflight >= limit)
  return;
 
- if (waitqueue_active(&rqw->wait)) {
+ if (wq_has_sleeper(&rqw->wait)) {
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
@@ -485,8 +485,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
 
- if (!waitqueue_active(&rqw->wait)
- && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ if (!wq_has_sleeper(&rqw->wait) &&
+    rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
 
  add_wait_queue_exclusive(&rqw->wait, &wait);
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 10/13] blk-wbt: fix has-sleeper queueing check

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.

Tested-by: Anchal Agarwal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit c45e6a037a536530bd25781ac7c989e52deb2a63)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 5f975288e9fb..2d9e0e8b92e1 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -484,16 +484,17 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
  DECLARE_WAITQUEUE(wait, current);
+ bool has_sleeper;
 
- if (!wq_has_sleeper(&rqw->wait) &&
-    rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ has_sleeper = wq_has_sleeper(&rqw->wait);
+ if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
 
  add_wait_queue_exclusive(&rqw->wait, &wait);
  do {
  set_current_state(TASK_UNINTERRUPTIBLE);
 
- if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  break;
 
  if (lock) {
@@ -502,6 +503,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  spin_lock_irq(lock);
  } else
  io_schedule();
+ has_sleeper = false;
  } while (1);
 
  __set_current_state(TASK_RUNNING);
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 11/13] blk-wbt: abstract out end IO completion handler

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Prep patch for calling the handler from a different context;
no functional changes in this patch.

Tested-by: Agarwal, Anchal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 061a5427530633de93ace4ef001b99961984af62)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 2d9e0e8b92e1..6180dd527666 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -121,16 +121,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
  }
 }
 
-static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
+ enum wbt_flags wb_acct)
 {
- struct rq_wb *rwb = RQWB(rqos);
- struct rq_wait *rqw;
  int inflight, limit;
 
- if (!(wb_acct & WBT_TRACKED))
- return;
-
- rqw = get_rq_wait(rwb, wb_acct);
  inflight = atomic_dec_return(&rqw->inflight);
 
  /*
@@ -165,6 +160,18 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
  }
 }
 
+static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+{
+ struct rq_wb *rwb = RQWB(rqos);
+ struct rq_wait *rqw;
+
+ if (!(wb_acct & WBT_TRACKED))
+ return;
+
+ rqw = get_rq_wait(rwb, wb_acct);
+ wbt_rqw_done(rwb, rqw, wb_acct);
+}
+
 /*
  * Called on completion of a request. Note that it's also called when
  * a request is merged, when the request gets freed.
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 12/13] blk-wbt: improve waking of tasks

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Jens Axboe <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

We have two potential issues:

1) After commit 2887e41b910b, we only wake one process at a time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right number of tasks.
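
The "queueing token" above is the inflight count managed by the rq-qos
helpers; a sketch of the semantics assumed here (hypothetical name, the
real helper is rq_wait_inc_below() added by the earlier refactor):

    /* Atomically take a slot only while the count stays below the limit. */
    static bool inc_below_sketch(atomic_t *inflight, unsigned int limit)
    {
        unsigned int cur = atomic_read(inflight);

        for (;;) {
            unsigned int old;

            if (cur >= limit)
                return false;        /* over the limit: no token */
            old = atomic_cmpxchg(inflight, cur, cur + 1);
            if (old == cur)
                return true;         /* token taken */
            cur = old;               /* lost a race, retry with fresh value */
        }
    }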

Based on a patch from Jianchao Wang <[hidden email]>.

Tested-by: Agarwal, Anchal <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 38cfb5a45ee013bfab5d1ae4c4738815e744b440)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 63 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 6180dd527666..0a68645aae6b 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -156,7 +156,7 @@ static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
  int diff = limit - inflight;
 
  if (!inflight || diff >= rwb->wb_background / 2)
- wake_up(&rqw->wait);
+ wake_up_all(&rqw->wait);
  }
 }
 
@@ -480,6 +480,34 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
  return limit;
 }
 
+struct wbt_wait_data {
+ struct wait_queue_entry wq;
+ struct task_struct *task;
+ struct rq_wb *rwb;
+ struct rq_wait *rqw;
+ unsigned long rw;
+ bool got_token;
+};
+
+static int wbt_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+     int wake_flags, void *key)
+{
+ struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data,
+ wq);
+
+ /*
+ * If we fail to get a budget, return -1 to interrupt the wake up
+ * loop in __wake_up_common.
+ */
+ if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+ return -1;
+
+ data->got_token = true;
+ list_del_init(&curr->entry);
+ wake_up_process(data->task);
+ return 1;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -490,19 +518,40 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  __acquires(lock)
 {
  struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
- DECLARE_WAITQUEUE(wait, current);
+ struct wbt_wait_data data = {
+ .wq = {
+ .func = wbt_wake_function,
+ .entry = LIST_HEAD_INIT(data.wq.entry),
+ },
+ .task = current,
+ .rwb = rwb,
+ .rqw = rqw,
+ .rw = rw,
+ };
  bool has_sleeper;
 
  has_sleeper = wq_has_sleeper(&rqw->wait);
  if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
  return;
 
- add_wait_queue_exclusive(&rqw->wait, &wait);
+ prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
  do {
- set_current_state(TASK_UNINTERRUPTIBLE);
+ if (data.got_token)
+ break;
 
- if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+ if (!has_sleeper &&
+    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
+ finish_wait(&rqw->wait, &data.wq);
+
+ /*
+ * We raced with wbt_wake_function() getting a token,
+ * which means we now have two. Put our local token
+ * and wake anyone else potentially waiting for one.
+ */
+ if (data.got_token)
+ wbt_rqw_done(rwb, rqw, wb_acct);
  break;
+ }
 
  if (lock) {
  spin_unlock_irq(lock);
@@ -510,11 +559,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
  spin_lock_irq(lock);
  } else
  io_schedule();
+
  has_sleeper = false;
  } while (1);
 
- __set_current_state(TASK_RUNNING);
- remove_wait_queue(&rqw->wait, &wait);
+ finish_wait(&rqw->wait, &data.wq);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU B][PATCH 13/13] blk-wbt: wake up all when we scale up, not down

Mauricio Faria de Oliveira-3
In reply to this post by Mauricio Faria de Oliveira-3
From: Josef Bacik <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1810998

Tetsuo brought to my attention that I screwed up the scale_up/scale_down
helpers when I factored out the rq-qos code.  We need to wake up all the
waiters when we add slots for requests to make, not when we shrink the
slots.  Otherwise we'll end up with things waiting forever.  This was a
mistake and simply puts everything back the way it was.

cc: [hidden email]
Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Reported-by: Tetsuo Handa <[hidden email]>
Signed-off-by: Josef Bacik <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 5e65a20341e02df637d1c16cd487858d2c6a876a)
Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
---
 block/blk-wbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 0a68645aae6b..9e9d733bae77 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -305,6 +305,7 @@ static void scale_up(struct rq_wb *rwb)
  rq_depth_scale_up(&rwb->rq_depth);
  calc_wb_limits(rwb);
  rwb->unknown_cnt = 0;
+ rwb_wake_all(rwb);
  rwb_trace_step(rwb, "scale up");
 }
 
@@ -313,7 +314,6 @@ static void scale_down(struct rq_wb *rwb, bool hard_throttle)
  rq_depth_scale_down(&rwb->rq_depth, hard_throttle);
  calc_wb_limits(rwb);
  rwb->unknown_cnt = 0;
- rwb_wake_all(rwb);
  rwb_trace_step(rwb, "scale down");
 }
 
--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

NACK: [SRU B][PATCH 00/13] blk-wbt: fix for LP#1810998

Stefan Bader-2
In reply to this post by Mauricio Faria de Oliveira-3
On 10.01.19 04:12, Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1810998
>
> [Impact]
>
>  * Users may experience cpu hard lockups when performing
>    rigorous writes to NVMe drives.
>
>  * The fix addresses an scheduling issue in the original
>    implementation of wbt/writeback throttling
>
>  * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
>    contention and thundering herd issue in wbt_wait"),
>    plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
>    waking of tasks").
>
>  * There are additional commits to help with a cleaner
>    backport and future maintenance:
>    - Cosmic: total of 8 clean cherry picks.
>    - Bionic: total of 13 commits, 9 clean cherry-picks
>      and 4 backports, which are just changes to context
>      lines (i.e. refresh) without any functional changes
>      in the backport itself.
>
> [Test Case]
>
>  * This command has been reported to reproduce the problem:
>
>    $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128
>
>  * It generates stack traces as below in the original kernel,
>    and does not generate them in the modified/patched kernel.
>
>  * The user/reporter verified the test kernel with these patches
>    resolved the problem.
>
>  * The developer verified in 2 systems (4-core and 24-core but
>    no NVMe) for regressions, and no error messages were logged
>    to dmesg.
>
> [Regression Potential]
>
>  * The commits have been verified for fixes in later commits in
>    linux-next as of 2019-01-08 and all known fix commits are in.
>
>  * The regression potential is mostly contained in the writeback
>    throttling (block/blk-wbt.*), as almost all of the 13 patches
>    change exclusively that, except for 4 of them (2 of which are
>    sysfs):
>
>    - blk-rq-qos: refactor out common elements of blk-wbt (block/)
>    - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-sysfs.c)
>    - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-{mq-}sysfs.c)
>    - block: pass struct request instead of struct blk_issue_stat to wbt (block/, still mostly blk-wbt.*)
>
> [Other Info]
>
>  * Alternatively, it is probably possible to introduce just the
>    two commits that fix this with some changes to their code in
>    the backport, but since the 'blk-rq-qos: refactor ..' commit
>    may become a dependency for many additional/future fixes, it
>    seemed interesting to pull it in earlier in the 18.04 cycle.
>
>  * The problem has been introduced with the blk-wbt mechanism,
>    in v4.10-rc1, and the fix commits in v4.19-rc1 and -rc2,
>    so only Bionic and Cosmic needs this.
>
> [Stack Traces]
>
> [ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
> ...
> [ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
> ...
> [ 393.628720] Call Trace:
> [ 393.628721] <IRQ>
> [ 393.628724] enqueue_task_fair+0x6c/0x7f0
> [ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
> [ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
> [ 393.628731] activate_task+0x57/0xc0
> [ 393.628735] ? sched_clock+0x9/0x10
> [ 393.628736] ? sched_clock+0x9/0x10
> [ 393.628738] ttwu_do_activate+0x49/0x90
> [ 393.628739] try_to_wake_up+0x1df/0x490
> [ 393.628741] default_wake_function+0x12/0x20
> [ 393.628743] autoremove_wake_function+0x12/0x40
> [ 393.628744] __wake_up_common+0x73/0x130
> [ 393.628745] __wake_up_common_lock+0x80/0xc0
> [ 393.628746] __wake_up+0x13/0x20
> [ 393.628749] __wbt_done.part.21+0xa4/0xb0
> [ 393.628749] wbt_done+0x72/0xa0
> [ 393.628753] blk_mq_free_request+0xca/0x1a0
> [ 393.628755] blk_mq_end_request+0x48/0x90
> [ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
> [ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
> [ 393.628764] __blk_mq_complete_request+0xd2/0x140
> [ 393.628766] blk_mq_complete_request+0x18/0x20
> [ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
> [ 393.628768] nvme_irq+0x23/0x50 [nvme]
> [ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
> [ 393.628773] handle_irq_event_percpu+0x32/0x80
> [ 393.628774] handle_irq_event+0x3b/0x60
> [ 393.628778] handle_edge_irq+0x7c/0x190
> [ 393.628779] handle_irq+0x20/0x30
> [ 393.628783] do_IRQ+0x46/0xd0
> [ 393.628784] common_interrupt+0x84/0x84
> [ 393.628785] </IRQ>
> ...
> [ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
> [ 393.628796] cpuidle_enter+0x17/0x20
> [ 393.628797] call_cpuidle+0x23/0x40
> [ 393.628798] do_idle+0x18c/0x1f0
> [ 393.628799] cpu_startup_entry+0x73/0x80
> [ 393.628802] start_secondary+0x1a6/0x200
> [ 393.628804] secondary_startup_64+0xa5/0xb0
> [ 393.628805] Code: ...
>
> [ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled
>
> [ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
> [ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
> [ 435.617025] Sending NMI from CPU 8 to CPUs 30:
> [ 435.617029] NMI backtrace for cpu 30
> [ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
> ...
> [ 435.617047] Call Trace:
> [ 435.617048] <IRQ>
> [ 435.617051] enqueue_entity+0x9f/0x6b0
> [ 435.617053] enqueue_task_fair+0x6c/0x7f0
> [ 435.617056] activate_task+0x57/0xc0
> [ 435.617059] ? sched_clock+0x9/0x10
> [ 435.617060] ? sched_clock+0x9/0x10
> [ 435.617061] ttwu_do_activate+0x49/0x90
> [ 435.617063] try_to_wake_up+0x1df/0x490
> [ 435.617065] default_wake_function+0x12/0x20
> [ 435.617067] autoremove_wake_function+0x12/0x40
> [ 435.617068] __wake_up_common+0x73/0x130
> [ 435.617069] __wake_up_common_lock+0x80/0xc0
> [ 435.617070] __wake_up+0x13/0x20
> [ 435.617073] __wbt_done.part.21+0xa4/0xb0
> [ 435.617074] wbt_done+0x72/0xa0
> [ 435.617077] blk_mq_free_request+0xca/0x1a0
> [ 435.617079] blk_mq_end_request+0x48/0x90
> [ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
> [ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
> [ 435.617088] __blk_mq_complete_request+0xd2/0x140
> [ 435.617090] blk_mq_complete_request+0x18/0x20
> [ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
> [ 435.617093] nvme_irq+0x23/0x50 [nvme]
> [ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
> [ 435.617097] handle_irq_event_percpu+0x32/0x80
> [ 435.617098] handle_irq_event+0x3b/0x60
> [ 435.617101] handle_edge_irq+0x7c/0x190
> [ 435.617102] handle_irq+0x20/0x30
> [ 435.617106] do_IRQ+0x46/0xd0
> [ 435.617107] common_interrupt+0x84/0x84
> [ 435.617108] </IRQ>
> ...
> [ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
> [ 435.617118] cpuidle_enter+0x17/0x20
> [ 435.617119] call_cpuidle+0x23/0x40
> [ 435.617121] do_idle+0x18c/0x1f0
> [ 435.617122] cpu_startup_entry+0x73/0x80
> [ 435.617125] start_secondary+0x1a6/0x200
> [ 435.617127] secondary_startup_64+0xa5/0xb0
> [ 435.617128] Code: ...
>
> Anchal Agarwal (1):
>   blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
>
> Bart Van Assche (1):
>   block: Protect less code with sysfs_lock in blk_{un,}register_queue()
>
> Jens Axboe (6):
>   blk-wbt: pass in enum wbt_flags to get_rq_wait()
>   blk-wbt: move disable check into get_limit()
>   blk-wbt: use wq_has_sleeper() for wq active check
>   blk-wbt: fix has-sleeper queueing check
>   blk-wbt: abstract out end IO completion handler
>   blk-wbt: improve waking of tasks
>
> Josef Bacik (2):
>   blk-rq-qos: refactor out common elements of blk-wbt
>   blk-wbt: wake up all when we scale up, not down
>
> Mike Snitzer (1):
>   block: properly protect the 'queue' kobj in blk_unregister_queue
>
> Omar Sandoval (2):
>   block: move some wbt helpers to blk-wbt.c
>   block: pass struct request instead of struct blk_issue_stat to wbt
>
>  block/Makefile         |   2 +-
>  block/blk-core.c       |  14 +-
>  block/blk-mq-sysfs.c   |   9 +-
>  block/blk-mq.c         |  14 +-
>  block/blk-rq-qos.c     | 178 +++++++++++++++
>  block/blk-rq-qos.h     | 106 +++++++++
>  block/blk-settings.c   |   4 +-
>  block/blk-sysfs.c      |  62 +++--
>  block/blk-wbt.c        | 499 ++++++++++++++++++++++-------------------
>  block/blk-wbt.h        |  96 +++-----
>  include/linux/blkdev.h |   4 +-
>  11 files changed, 643 insertions(+), 345 deletions(-)
>  create mode 100644 block/blk-rq-qos.c
>  create mode 100644 block/blk-rq-qos.h
>
Sorry, but in this case, if there is any chance to make a backport of just the two
changes that actually address the problem without changing half the block layer, I
would prefer changing as little as possible. That also makes future maintenance
simpler, because maintenance is done assuming a 4.15-based kernel.

-Stefan


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team


Re: NACK: [SRU B][PATCH 00/13] blk-wbt: fix for LP#1810998

Mauricio Faria de Oliveira-3
On Thu, Jan 10, 2019 at 1:36 PM Stefan Bader <[hidden email]> wrote:
>
> On 10.01.19 04:12, Mauricio Faria de Oliveira wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1810998
[snip]
> Sorry, but in this case, if there is any chance to make a backport of just the two
> changes that actually address the problem without changing half the block layer, I
> would prefer changing as little as possible. That also makes future maintenance
> simpler, because maintenance is done assuming a 4.15-based kernel.

No worries, that's certainly understandable. I'll work on the simpler
approach. Thanks.

>
> -Stefan
>


--
Mauricio Faria de Oliveira

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team