Yakkety SRU - ppc64_cpu command is hung w HDs, SSDs and NVMe

Yakkety SRU - ppc64_cpu command is hung w HDs, SSDs and NVMe

Tim Gardner
https://bugs.launchpad.net/bugs/1662666

[PATCH 1/3] blk-mq: Avoid memory reclaim when remapping queues
[PATCH 2/3] blk-mq: Fix failed allocation path when mapping queues
[PATCH 3/3] blk-mq: Always schedule hctx->next_cpu

rtg


[PATCH 1/3] blk-mq: Avoid memory reclaim when remapping queues

Tim Gardner
From: Gabriel Krisman Bertazi <[hidden email]>

BugLink: http://bugs.launchpad.net/bugs/1662666

While stressing memory and IO and changing SMT settings at the same
time, we were able to consistently trigger deadlocks in the mm system,
which froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting for the block layer remapping to complete, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event to complete.

The simplest fix for this is to keep allocations in this path from
triggering IO during reclaim, by using GFP_NOIO.  With this patch, we
couldn't hit the issue anymore.
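
For context, a minimal kernel-style sketch of the flag semantics this
fix relies on (the helper below is hypothetical and purely
illustrative, not part of the patch): a GFP_KERNEL allocation may enter
direct reclaim and issue IO, which in this path can end up waiting on
the very block-layer remap that is in progress, while GFP_NOIO still
allows reclaiming clean pages but forbids starting new IO.

#include <linux/blkdev.h>
#include <linux/gfp.h>
#include <linux/slab.h>

/*
 * Hypothetical helper, for illustration only: allocate the per-queue
 * request-pointer array without letting reclaim recurse into the block
 * layer.  GFP_NOIO may still reclaim clean pages, but it will not start
 * new IO, so it cannot block behind the remap we are in the middle of.
 */
static struct request **alloc_tag_array(unsigned int queue_depth, int node)
{
	return kzalloc_node(queue_depth * sizeof(struct request *),
			    GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
}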

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
[c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
[c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
[c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
[c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
[c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
[c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
[c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
[c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
[c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
[c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
[c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
[c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
[c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
[c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
[c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
[c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
[c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
[c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
[c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
[c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
[c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
[c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
[c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
[c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
[c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
[c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
[c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
[c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
[c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
[c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
[c000000f0160be30] [c000000000009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <[hidden email]>
Cc: Brian King <[hidden email]>
Cc: Douglas Miller <[hidden email]>
Cc: [hidden email]
Cc: [hidden email]
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09)
Signed-off-by: Tim Gardner <[hidden email]>
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 494e0d8..0cd0ca7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1533,7 +1533,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
  INIT_LIST_HEAD(&tags->page_list);
 
  tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
- GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
  set->numa_node);
  if (!tags->rqs) {
  blk_mq_free_tags(tags);
@@ -1559,7 +1559,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 
  do {
  page = alloc_pages_node(set->numa_node,
- GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+ GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
  this_order);
  if (page)
  break;
@@ -1580,7 +1580,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
  * Allow kmemleak to scan these pages as they contain pointers
  * to additional allocations like via ops->init_request().
  */
- kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KERNEL);
+ kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
  entries_per_page = order_to_size(this_order) / rq_size;
  to_do = min(entries_per_page, set->queue_depth - i);
  left -= to_do * rq_size;
--
2.7.4



[PATCH 2/3] blk-mq: Fix failed allocation path when mapping queues

Tim Gardner
In reply to this post by Tim Gardner
From: Gabriel Krisman Bertazi <[hidden email]>

BugLink: http://bugs.launchpad.net/bugs/1662666

In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any incoming IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to hctx_0, which
should always keep its tags.  There is a minor performance hit, since
the mapping just got worse after the error path, but this is the
simplest way to handle this error case.  The performance hit will
disappear after another successful remap.

I considered dropping the memory optimization altogether, but it
seemed a bad trade-off just to handle this very specific error case.
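
To make the control flow easier to scan, here is a condensed sketch of
that fallback (paraphrasing the hunk below, not a drop-in replacement
for it); the body runs for each online CPU i while the software queues
are being mapped:

	hctx_idx = q->mq_map[i];
	if (!set->tags[hctx_idx]) {
		/* Tags were freed when this hctx went unmapped; try to
		 * allocate them again now that it is mapped once more. */
		set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);

		/* On failure, fall back to hctx 0, whose tags are never freed. */
		if (!set->tags[hctx_idx])
			q->mq_map[i] = 0;
	}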

This should apply cleanly on top of Jens' for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110]
    pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180
    lr: c000000000575328: __bt_get+0x48/0xd0
    sp: c000000fe99eb390
   msr: 900000010280b033
   dar: 28
 dsisr: 40000000
  current = 0xc000000fe9966800
  paca    = 0xc000000007e80300   softe: 0        irq_happened: 0x01
    pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0
[c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0
[c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100
[c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220
[c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0
[c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540
[c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230
[c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200
[c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0
[c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0
[c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0
[c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180
[c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0
[c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540
[c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130
[c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430
[c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450
[c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730
[c000000fe99ebe30] c000000000009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <[hidden email]>
Cc: Brian King <[hidden email]>
Cc: Douglas Miller <[hidden email]>
Cc: [hidden email]
Cc: [hidden email]
Reviewed-by: Douglas Miller <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit d1b1cea1e58477dad88ff769f54c0d2dfa56d923)
Signed-off-by: Tim Gardner <[hidden email]>
---
 block/blk-mq.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0cd0ca7..b53e2e3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1826,7 +1826,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 static void blk_mq_map_swqueue(struct request_queue *q,
        const struct cpumask *online_mask)
 {
- unsigned int i;
+ unsigned int i, hctx_idx;
  struct blk_mq_hw_ctx *hctx;
  struct blk_mq_ctx *ctx;
  struct blk_mq_tag_set *set = q->tag_set;
@@ -1849,6 +1849,21 @@ static void blk_mq_map_swqueue(struct request_queue *q,
  if (!cpumask_test_cpu(i, online_mask))
  continue;
 
+ hctx_idx = q->mq_map[i];
+ /* unmapped hw queue can be remapped after CPU topo changed */
+ if (!set->tags[hctx_idx]) {
+ set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+ /*
+ * If tags initialization fail for some hctx,
+ * that hctx won't be brought online.  In this
+ * case, remap the current ctx to hctx[0] which
+ * is guaranteed to always have tags allocated
+ */
+ if (!set->tags[hctx_idx])
+ q->mq_map[i] = 0;
+ }
+
  ctx = per_cpu_ptr(q->queue_ctx, i);
  hctx = q->mq_ops->map_queue(q, i);
 
@@ -1867,7 +1882,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
  * disable it and free the request entries.
  */
  if (!hctx->nr_ctx) {
- if (set->tags[i]) {
+ /* Never unmap queue 0.  We need it as a
+ * fallback in case of a new remap fails
+ * allocation
+ */
+ if (i && set->tags[i]) {
  blk_mq_free_rq_map(set, set->tags[i], i);
  set->tags[i] = NULL;
  }
@@ -1875,9 +1894,6 @@ static void blk_mq_map_swqueue(struct request_queue *q,
  continue;
  }
 
- /* unmapped hw queue can be remapped after CPU topo changed */
- if (!set->tags[i])
- set->tags[i] = blk_mq_init_rq_map(set, i);
  hctx->tags = set->tags[i];
  WARN_ON(!hctx->tags);
 
--
2.7.4



[PATCH 3/3] blk-mq: Always schedule hctx->next_cpu

Tim Gardner
In reply to this post by Tim Gardner
From: Gabriel Krisman Bertazi <[hidden email]>

BugLink: http://bugs.launchpad.net/bugs/1662666

Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead.  The problem is
that, in the last batch execution before the round-robin step,
blk_mq_hctx_next_cpu can schedule a dead CPU while also updating
next_cpu to the next alive CPU in the mask, which still triggers the
WARN_ON despite the previous workaround.

The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu.  This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
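
After the change, the function effectively reads as below (reconstructed
from the hunk further down together with its surrounding context, so
treat it as a sketch rather than a verbatim copy): the batch counter
only decides when next_cpu advances, and the work is always scheduled on
hctx->next_cpu, never on a CPU we may have just rotated away from.

static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
	if (hctx->queue->nr_hw_queues == 1)
		return WORK_CPU_UNBOUND;

	if (--hctx->next_cpu_batch <= 0) {
		int next_cpu;

		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
		if (next_cpu >= nr_cpu_ids)
			next_cpu = cpumask_first(hctx->cpumask);

		hctx->next_cpu = next_cpu;
		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
	}

	/* Unlike before, never return the CPU we just rotated past. */
	return hctx->next_cpu;
}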

Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
Signed-off-by: Gabriel Krisman Bertazi <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3)
Signed-off-by: Tim Gardner <[hidden email]>
---
 block/blk-mq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b53e2e3..5497010 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -916,7 +916,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
  return WORK_CPU_UNBOUND;
 
  if (--hctx->next_cpu_batch <= 0) {
- int cpu = hctx->next_cpu, next_cpu;
+ int next_cpu;
 
  next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
  if (next_cpu >= nr_cpu_ids)
@@ -924,8 +924,6 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 
  hctx->next_cpu = next_cpu;
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
-
- return cpu;
  }
 
  return hctx->next_cpu;
--
2.7.4



ACK: Yakkety SRU - ppc64_cpu command is hung w HDs, SSDs and NVMe

Stefan Bader
In reply to this post by Tim Gardner




ACK: Yakkety SRU - ppc64_cpu command is hung w HDs, SSDs and NVMe

Marcelo Henrique Cerri
In reply to this post by Tim Gardner

APPLIED: Yakkety SRU - ppc64_cpu command is hung w HDs, SSDs and NVMe

Thadeu Lima de Souza Cascardo
In reply to this post by Tim Gardner
Applied to yakkety master-next branch.

Cascardo.
