[Bionic][PATCH v2 00/21] blk-mq scheduler fixes


Jose Ricardo Ziviani
From: Jose Ricardo Ziviani <[hidden email]>

Hello team!

Weeks ago I sent a patchset with:
 * genirq/affinity: Spread irq vectors among present CPUs as far as possible
 * blk-mq: simplify queue mapping & schedule with each possisble CPU

Unfortunately they broke some cases, in particular the hpsa driver, so they
had to be reverted. However, the remaining bugs in blk-mq make KVM/QEMU
virtual machines very unstable whenever CPU hotplug/unplug or SMT changes
are performed, and they also impact live migration.

So, this is a new attempt to have the patches included. This new version
includes all related fixes available upstream.

It's based on Ubuntu-4.15.0-21.22 tag.

Thank you!

Jose Ricardo Ziviani

Bart Van Assche (1):
  blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended
    delays

Christoph Hellwig (2):
  genirq/affinity: assign vectors to all possible CPUs
  blk-mq: simplify queue mapping & schedule with each possisble CPU

Jens Axboe (1):
  blk-mq: fix discard merge with scheduler attached

Ming Lei (16):
  genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask
  genirq/affinity: Move actual irq vector spreading into a helper
    function
  genirq/affinity: Allow irq spreading from a given starting point
  genirq/affinity: Spread irq vectors among present CPUs as far as
    possible
  blk-mq: make sure hctx->next_cpu is set correctly
  blk-mq: make sure that correct hctx->next_cpu is set
  blk-mq: don't keep offline CPUs mapped to hctx 0
  blk-mq: avoid to write intermediate result to hctx->next_cpu
  blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
  blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
  nvme: pci: pass max vectors as num_possible_cpus() to
    pci_alloc_irq_vectors
  scsi: hpsa: fix selection of reply queue
  scsi: megaraid_sas: fix selection of reply queue
  scsi: core: introduce force_blk_mq
  scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
  scsi: virtio_scsi: unify scsi_host_template

Thomas Gleixner (1):
  genirq/affinity: Don't return with empty affinity masks on error

 block/blk-core.c                            |   2 +
 block/blk-merge.c                           |  29 +++-
 block/blk-mq-cpumap.c                       |   5 -
 block/blk-mq.c                              |  65 +++++---
 drivers/nvme/host/pci.c                     |   2 +-
 drivers/scsi/hosts.c                        |   1 +
 drivers/scsi/hpsa.c                         |  73 ++++++---
 drivers/scsi/hpsa.h                         |   1 +
 drivers/scsi/megaraid/megaraid_sas.h        |   1 +
 drivers/scsi/megaraid/megaraid_sas_base.c   |  39 ++++-
 drivers/scsi/megaraid/megaraid_sas_fusion.c |  12 +-
 drivers/scsi/virtio_scsi.c                  | 129 ++-------------
 include/scsi/scsi_host.h                    |   3 +
 kernel/irq/affinity.c                       | 166 +++++++++++++-------
 14 files changed, 296 insertions(+), 232 deletions(-)

--
2.17.0



[Bionic][PATCH v2 01/21] genirq/affinity: assign vectors to all possible CPUs

Jose Ricardo Ziviani
From: Christoph Hellwig <[hidden email]>

Currently we assign managed interrupt vectors to all present CPUs.  This
works fine for systems where we only online/offline CPUs.  But in case of
systems that support physical CPU hotplug (or the virtualized version of
it) this means the additional CPUs covered for in the ACPI tables or on
the command line are not catered for.  To fix this we'd either need to
introduce new hotplug CPU states just for this case, or we can start
assigning vectors to possible but not present CPUs.
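
As a stand-alone illustration of the difference (simulated topology, not
the kernel code itself): if the per-node masks are built from present CPUs
only, a node whose CPUs are not yet plugged in ends up empty and gets no
vectors; built from possible CPUs, it stays populated:

#include <stdio.h>

#define NR_CPUS  8
#define NR_NODES 2

int main(void)
{
        /* assumed topology for the sketch: node 0 = CPUs 0-3 (present),
         * node 1 = CPUs 4-7 (possible but not yet plugged in)
         */
        int node_of_cpu[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
        int present[NR_CPUS]     = { 1, 1, 1, 1, 0, 0, 0, 0 };
        unsigned int node_mask_present[NR_NODES]  = { 0 };
        unsigned int node_mask_possible[NR_NODES] = { 0 };

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                node_mask_possible[node_of_cpu[cpu]] |= 1u << cpu;
                if (present[cpu])
                        node_mask_present[node_of_cpu[cpu]] |= 1u << cpu;
        }

        /* node 1 is empty in the present-only view */
        for (int n = 0; n < NR_NODES; n++)
                printf("node %d: present 0x%02x  possible 0x%02x\n",
                       n, node_mask_present[n], node_mask_possible[n]);
        return 0;
}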

Reported-by: Christian Borntraeger <[hidden email]>
Tested-by: Christian Borntraeger <[hidden email]>
Tested-by: Stefan Haberland <[hidden email]>
Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
Cc: [hidden email]
Cc: Thomas Gleixner <[hidden email]>
Signed-off-by: Christoph Hellwig <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 84676c1f21e8ff54befe985f4f14dc1edc10046b)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 kernel/irq/affinity.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e12d35108225..a37a3b4b6342 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
  }
 }
 
-static cpumask_var_t *alloc_node_to_present_cpumask(void)
+static cpumask_var_t *alloc_node_to_possible_cpumask(void)
 {
  cpumask_var_t *masks;
  int node;
@@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_present_cpumask(void)
  return NULL;
 }
 
-static void free_node_to_present_cpumask(cpumask_var_t *masks)
+static void free_node_to_possible_cpumask(cpumask_var_t *masks)
 {
  int node;
 
@@ -71,22 +71,22 @@ static void free_node_to_present_cpumask(cpumask_var_t *masks)
  kfree(masks);
 }
 
-static void build_node_to_present_cpumask(cpumask_var_t *masks)
+static void build_node_to_possible_cpumask(cpumask_var_t *masks)
 {
  int cpu;
 
- for_each_present_cpu(cpu)
+ for_each_possible_cpu(cpu)
  cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }
 
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
  const struct cpumask *mask, nodemask_t *nodemsk)
 {
  int n, nodes = 0;
 
  /* Calculate the number of nodes in the supplied affinity mask */
  for_each_node(n) {
- if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
+ if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
  node_set(n, *nodemsk);
  nodes++;
  }
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  int last_affv = affv + affd->pre_vectors;
  nodemask_t nodemsk = NODE_MASK_NONE;
  struct cpumask *masks;
- cpumask_var_t nmsk, *node_to_present_cpumask;
+ cpumask_var_t nmsk, *node_to_possible_cpumask;
 
  /*
  * If there aren't any vectors left after applying the pre/post
@@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  if (!masks)
  goto out;
 
- node_to_present_cpumask = alloc_node_to_present_cpumask();
- if (!node_to_present_cpumask)
+ node_to_possible_cpumask = alloc_node_to_possible_cpumask();
+ if (!node_to_possible_cpumask)
  goto out;
 
  /* Fill out vectors at the beginning that don't need affinity */
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 
  /* Stabilize the cpumasks */
  get_online_cpus();
- build_node_to_present_cpumask(node_to_present_cpumask);
- nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
+ build_node_to_possible_cpumask(node_to_possible_cpumask);
+ nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
      &nodemsk);
 
  /*
@@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  if (affv <= nodes) {
  for_each_node_mask(n, nodemsk) {
  cpumask_copy(masks + curvec,
-     node_to_present_cpumask[n]);
+     node_to_possible_cpumask[n]);
  if (++curvec == last_affv)
  break;
  }
@@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
  /* Get the cpus on this node which are in the mask */
- cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
+ cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
 
  /* Calculate the number of cpus per vector */
  ncpus = cpumask_weight(nmsk);
@@ -192,7 +192,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  /* Fill out vectors at the end that don't need affinity */
  for (; curvec < nvecs; curvec++)
  cpumask_copy(masks + curvec, irq_default_affinity);
- free_node_to_present_cpumask(node_to_present_cpumask);
+ free_node_to_possible_cpumask(node_to_possible_cpumask);
 out:
  free_cpumask_var(nmsk);
  return masks;
@@ -214,7 +214,7 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
  return 0;
 
  get_online_cpus();
- ret = min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
+ ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
  put_online_cpus();
  return ret;
 }
--
2.17.0



[Bionic][PATCH v2 02/21] genirq/affinity: Don't return with empty affinity masks on error

Jose Ricardo Ziviani
From: Thomas Gleixner <[hidden email]>

When the allocation of node_to_possible_cpumask fails, then
irq_create_affinity_masks() returns with a pointer to the empty affinity
masks array, which will cause malfunction.

Reorder the allocations so the masks array allocation comes last and every
failure path returns NULL.
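
The same ordering rule as a stand-alone sketch (user-space stand-ins for
the kernel allocations, purely illustrative): allocate the object that is
returned last, and let both the success path and the failure paths fall
through the frees of the temporaries:

#include <stdio.h>
#include <stdlib.h>

#define MASK_SZ 64

static void *create_masks(int nvecs)
{
        void *nmsk, *node_map, *masks = NULL;

        nmsk = malloc(MASK_SZ);          /* scratch mask */
        if (!nmsk)
                return NULL;

        node_map = malloc(MASK_SZ);      /* node_to_possible_cpumask stand-in */
        if (!node_map)
                goto out_nmsk;

        masks = calloc(nvecs, MASK_SZ);  /* the returned array: allocated last */
        if (!masks)
                goto out_node_map;

        /* ... spread vectors, filling in masks ... */

out_node_map:
        free(node_map);
out_nmsk:
        free(nmsk);
        return masks;                    /* NULL on any allocation failure */
}

int main(void)
{
        printf("%p\n", create_masks(4));
        return 0;
}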

Fixes: 9a0ef98e186d ("genirq/affinity: Assign vectors to all present CPUs")
Signed-off-by: Thomas Gleixner <[hidden email]>
Cc: Christoph Hellwig <[hidden email]>
Cc: Ming Lei <[hidden email]>
(cherry picked from commit 0211e12dd0a5385ecffd3557bc570dbad7fcf245)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 kernel/irq/affinity.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..e0665549af59 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -108,7 +108,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  int affv = nvecs - affd->pre_vectors - affd->post_vectors;
  int last_affv = affv + affd->pre_vectors;
  nodemask_t nodemsk = NODE_MASK_NONE;
- struct cpumask *masks;
+ struct cpumask *masks = NULL;
  cpumask_var_t nmsk, *node_to_possible_cpumask;
 
  /*
@@ -121,13 +121,13 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
  return NULL;
 
- masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
- if (!masks)
- goto out;
-
  node_to_possible_cpumask = alloc_node_to_possible_cpumask();
  if (!node_to_possible_cpumask)
- goto out;
+ goto outcpumsk;
+
+ masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ goto outnodemsk;
 
  /* Fill out vectors at the beginning that don't need affinity */
  for (curvec = 0; curvec < affd->pre_vectors; curvec++)
@@ -192,8 +192,9 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  /* Fill out vectors at the end that don't need affinity */
  for (; curvec < nvecs; curvec++)
  cpumask_copy(masks + curvec, irq_default_affinity);
+outnodemsk:
  free_node_to_possible_cpumask(node_to_possible_cpumask);
-out:
+outcpumsk:
  free_cpumask_var(nmsk);
  return masks;
 }
--
2.17.0



[Bionic][PATCH v2 03/21] genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

The following patches will introduce two-stage irq spreading for improving
irq spread on all possible CPUs.

No functional change.

Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Thomas Gleixner <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Cc: Jens Axboe <[hidden email]>
Cc: [hidden email]
Cc: Laurence Oberman <[hidden email]>
Cc: Christoph Hellwig <[hidden email]>
Link: https://lkml.kernel.org/r/20180308105358.1506-2-ming.lei@...

(cherry picked from commit 47778f33dcba7feb92031643b37e477892f82b62)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 kernel/irq/affinity.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e0665549af59..272c968d9ef1 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
  }
 }
 
-static cpumask_var_t *alloc_node_to_possible_cpumask(void)
+static cpumask_var_t *alloc_node_to_cpumask(void)
 {
  cpumask_var_t *masks;
  int node;
@@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_possible_cpumask(void)
  return NULL;
 }
 
-static void free_node_to_possible_cpumask(cpumask_var_t *masks)
+static void free_node_to_cpumask(cpumask_var_t *masks)
 {
  int node;
 
@@ -71,7 +71,7 @@ static void free_node_to_possible_cpumask(cpumask_var_t *masks)
  kfree(masks);
 }
 
-static void build_node_to_possible_cpumask(cpumask_var_t *masks)
+static void build_node_to_cpumask(cpumask_var_t *masks)
 {
  int cpu;
 
@@ -79,14 +79,14 @@ static void build_node_to_possible_cpumask(cpumask_var_t *masks)
  cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }
 
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
  const struct cpumask *mask, nodemask_t *nodemsk)
 {
  int n, nodes = 0;
 
  /* Calculate the number of nodes in the supplied affinity mask */
  for_each_node(n) {
- if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
+ if (cpumask_intersects(mask, node_to_cpumask[n])) {
  node_set(n, *nodemsk);
  nodes++;
  }
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  int last_affv = affv + affd->pre_vectors;
  nodemask_t nodemsk = NODE_MASK_NONE;
  struct cpumask *masks = NULL;
- cpumask_var_t nmsk, *node_to_possible_cpumask;
+ cpumask_var_t nmsk, *node_to_cpumask;
 
  /*
  * If there aren't any vectors left after applying the pre/post
@@ -121,8 +121,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
  return NULL;
 
- node_to_possible_cpumask = alloc_node_to_possible_cpumask();
- if (!node_to_possible_cpumask)
+ node_to_cpumask = alloc_node_to_cpumask();
+ if (!node_to_cpumask)
  goto outcpumsk;
 
  masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 
  /* Stabilize the cpumasks */
  get_online_cpus();
- build_node_to_possible_cpumask(node_to_possible_cpumask);
- nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
+ build_node_to_cpumask(node_to_cpumask);
+ nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
      &nodemsk);
 
  /*
@@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  if (affv <= nodes) {
  for_each_node_mask(n, nodemsk) {
  cpumask_copy(masks + curvec,
-     node_to_possible_cpumask[n]);
+     node_to_cpumask[n]);
  if (++curvec == last_affv)
  break;
  }
@@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
  /* Get the cpus on this node which are in the mask */
- cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
+ cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
 
  /* Calculate the number of cpus per vector */
  ncpus = cpumask_weight(nmsk);
@@ -193,7 +193,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  for (; curvec < nvecs; curvec++)
  cpumask_copy(masks + curvec, irq_default_affinity);
 outnodemsk:
- free_node_to_possible_cpumask(node_to_possible_cpumask);
+ free_node_to_cpumask(node_to_cpumask);
 outcpumsk:
  free_cpumask_var(nmsk);
  return masks;
--
2.17.0



[Bionic][PATCH v2 04/21] genirq/affinity: Move actual irq vector spreading into a helper function

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

No functional change, just preparation for converting to 2-stage irq vector
spreading.

Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Thomas Gleixner <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Cc: Jens Axboe <[hidden email]>
Cc: [hidden email]
Cc: Laurence Oberman <[hidden email]>
Cc: Christoph Hellwig <[hidden email]>
Link: https://lkml.kernel.org/r/20180308105358.1506-3-ming.lei@...

(cherry picked from commit b3e6aaa8d94d618e685c4df08bef991a4fb43923)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 kernel/irq/affinity.c | 97 ++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 42 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 272c968d9ef1..a9c36904500c 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,50 +94,19 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
  return nodes;
 }
 
-/**
- * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
- * @nvecs: The total number of vectors
- * @affd: Description of the affinity requirements
- *
- * Returns the masks pointer or NULL if allocation failed.
- */
-struct cpumask *
-irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+    cpumask_var_t *node_to_cpumask,
+    const struct cpumask *cpu_mask,
+    struct cpumask *nmsk,
+    struct cpumask *masks)
 {
- int n, nodes, cpus_per_vec, extra_vecs, curvec;
  int affv = nvecs - affd->pre_vectors - affd->post_vectors;
  int last_affv = affv + affd->pre_vectors;
+ int curvec = affd->pre_vectors;
  nodemask_t nodemsk = NODE_MASK_NONE;
- struct cpumask *masks = NULL;
- cpumask_var_t nmsk, *node_to_cpumask;
+ int n, nodes, cpus_per_vec, extra_vecs;
 
- /*
- * If there aren't any vectors left after applying the pre/post
- * vectors don't bother with assigning affinity.
- */
- if (!affv)
- return NULL;
-
- if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
- return NULL;
-
- node_to_cpumask = alloc_node_to_cpumask();
- if (!node_to_cpumask)
- goto outcpumsk;
-
- masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
- if (!masks)
- goto outnodemsk;
-
- /* Fill out vectors at the beginning that don't need affinity */
- for (curvec = 0; curvec < affd->pre_vectors; curvec++)
- cpumask_copy(masks + curvec, irq_default_affinity);
-
- /* Stabilize the cpumasks */
- get_online_cpus();
- build_node_to_cpumask(node_to_cpumask);
- nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
-     &nodemsk);
+ nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
  /*
  * If the number of nodes in the mask is greater than or equal the
@@ -150,7 +119,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  if (++curvec == last_affv)
  break;
  }
- goto done;
+ goto out;
  }
 
  for_each_node_mask(n, nodemsk) {
@@ -160,7 +129,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
  /* Get the cpus on this node which are in the mask */
- cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
+ cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
 
  /* Calculate the number of cpus per vector */
  ncpus = cpumask_weight(nmsk);
@@ -186,7 +155,51 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  --nodes;
  }
 
-done:
+out:
+ return curvec - affd->pre_vectors;
+}
+
+/**
+ * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
+ * @nvecs: The total number of vectors
+ * @affd: Description of the affinity requirements
+ *
+ * Returns the masks pointer or NULL if allocation failed.
+ */
+struct cpumask *
+irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+{
+ cpumask_var_t nmsk, *node_to_cpumask;
+ struct cpumask *masks = NULL;
+ int curvec;
+
+ /*
+ * If there aren't any vectors left after applying the pre/post
+ * vectors don't bother with assigning affinity.
+ */
+ if (nvecs == affd->pre_vectors + affd->post_vectors)
+ return NULL;
+
+ if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+ return NULL;
+
+ node_to_cpumask = alloc_node_to_cpumask();
+ if (!node_to_cpumask)
+ goto outcpumsk;
+
+ masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ goto outnodemsk;
+
+ /* Fill out vectors at the beginning that don't need affinity */
+ for (curvec = 0; curvec < affd->pre_vectors; curvec++)
+ cpumask_copy(masks + curvec, irq_default_affinity);
+
+ /* Stabilize the cpumasks */
+ get_online_cpus();
+ build_node_to_cpumask(node_to_cpumask);
+ curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
+   cpu_possible_mask, nmsk, masks);
  put_online_cpus();
 
  /* Fill out vectors at the end that don't need affinity */
--
2.17.0



[Bionic][PATCH v2 05/21] genirq/affinity: Allow irq spreading from a given starting point

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

To support two-stage irq vector spreading, it's required to add a starting
point to the spreading function. No functional change, just preparatory
work for the actual two-stage change.

[ tglx: Renamed variables, tidied up the code and massaged changelog ]

Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Thomas Gleixner <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Cc: Jens Axboe <[hidden email]>
Cc: [hidden email]
Cc: Laurence Oberman <[hidden email]>
Cc: Christoph Hellwig <[hidden email]>
Link: https://lkml.kernel.org/r/20180308105358.1506-4-ming.lei@...

(cherry picked from commit 1a2d0914e23aab386f5d5acb689777e24151c2c8)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 kernel/irq/affinity.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a9c36904500c..213695a27ddb 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,17 +94,17 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
  return nodes;
 }
 
-static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+static int irq_build_affinity_masks(const struct irq_affinity *affd,
+    int startvec, int numvecs,
     cpumask_var_t *node_to_cpumask,
     const struct cpumask *cpu_mask,
     struct cpumask *nmsk,
     struct cpumask *masks)
 {
- int affv = nvecs - affd->pre_vectors - affd->post_vectors;
- int last_affv = affv + affd->pre_vectors;
- int curvec = affd->pre_vectors;
+ int n, nodes, cpus_per_vec, extra_vecs, done = 0;
+ int last_affv = affd->pre_vectors + numvecs;
+ int curvec = startvec;
  nodemask_t nodemsk = NODE_MASK_NONE;
- int n, nodes, cpus_per_vec, extra_vecs;
 
  nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
@@ -112,12 +112,13 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
  * If the number of nodes in the mask is greater than or equal the
  * number of vectors we just spread the vectors across the nodes.
  */
- if (affv <= nodes) {
+ if (numvecs <= nodes) {
  for_each_node_mask(n, nodemsk) {
- cpumask_copy(masks + curvec,
-     node_to_cpumask[n]);
+ cpumask_copy(masks + curvec, node_to_cpumask[n]);
+ if (++done == numvecs)
+ break;
  if (++curvec == last_affv)
- break;
+ curvec = affd->pre_vectors;
  }
  goto out;
  }
@@ -126,7 +127,7 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
  int ncpus, v, vecs_to_assign, vecs_per_node;
 
  /* Spread the vectors per node */
- vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
+ vecs_per_node = (numvecs - (curvec - affd->pre_vectors)) / nodes;
 
  /* Get the cpus on this node which are in the mask */
  cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
@@ -150,13 +151,16 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
  irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
  }
 
- if (curvec >= last_affv)
+ done += v;
+ if (done >= numvecs)
  break;
+ if (curvec >= last_affv)
+ curvec = affd->pre_vectors;
  --nodes;
  }
 
 out:
- return curvec - affd->pre_vectors;
+ return done;
 }
 
 /**
@@ -169,9 +173,9 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
+ int curvec, affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
  cpumask_var_t nmsk, *node_to_cpumask;
  struct cpumask *masks = NULL;
- int curvec;
 
  /*
  * If there aren't any vectors left after applying the pre/post
@@ -198,8 +202,9 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  /* Stabilize the cpumasks */
  get_online_cpus();
  build_node_to_cpumask(node_to_cpumask);
- curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
-   cpu_possible_mask, nmsk, masks);
+ curvec += irq_build_affinity_masks(affd, curvec, affvecs,
+   node_to_cpumask, cpu_possible_mask,
+   nmsk, masks);
  put_online_cpus();
 
  /* Fill out vectors at the end that don't need affinity */
--
2.17.0



[Bionic][PATCH v2 06/21] genirq/affinity: Spread irq vectors among present CPUs as far as possible

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
tried to spread the interrupts across all possible CPUs to make sure that
in case of physical hotplug (e.g. virtualization) the CPUs which get
plugged in after the device was initialized are targeted by a hardware
queue and the corresponding interrupt.

This has a downside in cases where the ACPI tables claim that there are
more possible CPUs than present CPUs and the number of interrupts to spread
out is smaller than the number of possible CPUs. These bogus ACPI tables
are unfortunately not uncommon.

In such a case the vector spreading algorithm assigns interrupts to CPUs
which can never be utilized and as a consequence these interrupts are
unused instead of being mapped to present CPUs. As a result the performance
of the device is suboptimal.

To fix this spread the interrupt vectors in two stages:

 1) Spread as many interrupts as possible among the present CPUs

 2) Spread the remaining vectors among non present CPUs

On a 8 core system, where CPU 0-3 are present and CPU 4-7 are not present,
for a device with 4 queues the resulting interrupt affinity is:

  1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
        irq 39, cpu list 0
        irq 40, cpu list 1
        irq 41, cpu list 2
        irq 42, cpu list 3

  2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
        irq 39, cpu list 0-2
        irq 40, cpu list 3-4,6
        irq 41, cpu list 5
        irq 42, cpu list 7

  3) With the refined vector spread applied:
        irq 39, cpu list 0,4
        irq 40, cpu list 1,6
        irq 41, cpu list 2,5
        irq 42, cpu list 3,7

On a 8 core system, where all CPUs are present the resulting interrupt
affinity for the 4 queues is:

        irq 39, cpu list 0,1
        irq 40, cpu list 2,3
        irq 41, cpu list 4,5
        irq 42, cpu list 6,7

This is independent of the number of CPUs which are online at the point of
initialization because in such a system the offline CPUs can be easily
onlined afterwards, while non-present CPUs need to be plugged physically
or virtually, which requires external interaction.

The downside of this approach is that in case of physical hotplug the
interrupt vector spreading might be suboptimal when CPUs 4-7 are physically
plugged. Suboptimal from a NUMA point of view and due to the single target
nature of interrupt affinities the later plugged CPUs might not be targeted
by interrupts at all.

Though, physical hotplug systems are not the common case while the broken
ACPI table disease is widespread. So it's preferred to have as many
interrupts as possible utilized at the point where the device is
initialized.

Block multi-queue devices like NVME create a hardware queue per possible
CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per
possible CPU is still achieved even with physical/virtual hotplug.
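
A minimal user-space simulation of the two-stage idea (assumed topology
taken from the example above: CPUs 0-3 present, CPUs 4-7 possible only,
4 vectors; it ignores the per-NUMA-node grouping the real code does, so
the exact pairs differ slightly from the lists above):

#include <stdio.h>

#define NR_CPUS 8
#define NR_VECS 4

int main(void)
{
        int present[NR_CPUS] = { 1, 1, 1, 1, 0, 0, 0, 0 };
        int vec_of_cpu[NR_CPUS];
        int vec = 0;

        /* stage 1: spread the vectors among present CPUs first */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (present[cpu])
                        vec_of_cpu[cpu] = vec++ % NR_VECS;

        /* stage 2: spread the remaining (possible but not present) CPUs
         * over the already used vectors, continuing the round-robin
         */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (!present[cpu])
                        vec_of_cpu[cpu] = vec++ % NR_VECS;

        for (int v = 0; v < NR_VECS; v++) {
                printf("vector %d: cpus", v);
                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        if (vec_of_cpu[cpu] == v)
                                printf(" %d", cpu);
                printf("\n");
        }
        return 0;
}

The important property is that every present CPU gets a vector in stage
one before any non-present CPU is assigned in stage two.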

[ tglx: Changed from online to present CPUs for the first spreading stage,
  renamed variables for readability sake, added comments and massaged
  changelog ]

Reported-by: Laurence Oberman <[hidden email]>
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Thomas Gleixner <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Cc: Jens Axboe <[hidden email]>
Cc: [hidden email]
Cc: Christoph Hellwig <[hidden email]>
Link: https://lkml.kernel.org/r/20180308105358.1506-5-ming.lei@...

(cherry picked from commit d3056812e7dfe6bf4f8ad9e397a9116dd5d32d15)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 kernel/irq/affinity.c | 43 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 213695a27ddb..f4f29b9d90ee 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -106,6 +106,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd,
  int curvec = startvec;
  nodemask_t nodemsk = NODE_MASK_NONE;
 
+ if (!cpumask_weight(cpu_mask))
+ return 0;
+
  nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
  /*
@@ -173,8 +176,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
- int curvec, affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
- cpumask_var_t nmsk, *node_to_cpumask;
+ int affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
+ int curvec, usedvecs;
+ cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
  struct cpumask *masks = NULL;
 
  /*
@@ -187,9 +191,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
  return NULL;
 
+ if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+ goto outcpumsk;
+
  node_to_cpumask = alloc_node_to_cpumask();
  if (!node_to_cpumask)
- goto outcpumsk;
+ goto outnpresmsk;
 
  masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
  if (!masks)
@@ -202,16 +209,40 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
  /* Stabilize the cpumasks */
  get_online_cpus();
  build_node_to_cpumask(node_to_cpumask);
- curvec += irq_build_affinity_masks(affd, curvec, affvecs,
-   node_to_cpumask, cpu_possible_mask,
-   nmsk, masks);
+
+ /* Spread on present CPUs starting from affd->pre_vectors */
+ usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
+    node_to_cpumask, cpu_present_mask,
+    nmsk, masks);
+
+ /*
+ * Spread on non present CPUs starting from the next vector to be
+ * handled. If the spreading of present CPUs already exhausted the
+ * vector space, assign the non present CPUs to the already spread
+ * out vectors.
+ */
+ if (usedvecs >= affvecs)
+ curvec = affd->pre_vectors;
+ else
+ curvec = affd->pre_vectors + usedvecs;
+ cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+ usedvecs += irq_build_affinity_masks(affd, curvec, affvecs,
+     node_to_cpumask, npresmsk,
+     nmsk, masks);
  put_online_cpus();
 
  /* Fill out vectors at the end that don't need affinity */
+ if (usedvecs >= affvecs)
+ curvec = affd->pre_vectors + affvecs;
+ else
+ curvec = affd->pre_vectors + usedvecs;
  for (; curvec < nvecs; curvec++)
  cpumask_copy(masks + curvec, irq_default_affinity);
+
 outnodemsk:
  free_node_to_cpumask(node_to_cpumask);
+outnpresmsk:
+ free_cpumask_var(npresmsk);
 outcpumsk:
  free_cpumask_var(nmsk);
  return masks;
--
2.17.0



[Bionic][PATCH v2 07/21] blk-mq: simplify queue mapping & schedule with each possisble CPU

Jose Ricardo Ziviani
From: Christoph Hellwig <[hidden email]>

The previous patch assigns interrupt vectors to all possible CPUs, so
now an hctx can be mapped to possible CPUs. This patch applies that fact
to simplify queue mapping and scheduling so that we don't need to handle
CPU hotplug specially for physical CPU plug & unplug. With this
simplification, physical CPU plug & unplug works well, which is a normal
use case for VMs at least.

Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
set hctx->numa_node for the possible CPUs which are mapped to this hctx.
And only choose online CPUs for scheduling.

Reported-by: Christian Borntraeger <[hidden email]>
Tested-by: Christian Borntraeger <[hidden email]>
Tested-by: Stefan Haberland <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Signed-off-by: Christoph Hellwig <[hidden email]>
Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
(merged the three into one because any single one may not work, and fixed
selecting online CPUs for the scheduler)
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 20e4d813931961fe26d26a1e98b3aba6ec00b130)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1a80d8c4f3ec..50ef7dc4c41e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -443,7 +443,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
  blk_queue_exit(q);
  return ERR_PTR(-EXDEV);
  }
- cpu = cpumask_first(alloc_data.hctx->cpumask);
+ cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
  alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
  rq = blk_mq_get_request(q, NULL, op, &alloc_data);
@@ -1263,9 +1263,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
  if (--hctx->next_cpu_batch <= 0) {
  int next_cpu;
 
- next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
+ next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+ cpu_online_mask);
  if (next_cpu >= nr_cpu_ids)
- next_cpu = cpumask_first(hctx->cpumask);
+ next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
 
  hctx->next_cpu = next_cpu;
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
@@ -2137,16 +2138,11 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
  INIT_LIST_HEAD(&__ctx->rq_list);
  __ctx->queue = q;
 
- /* If the cpu isn't present, the cpu is mapped to first hctx */
- if (!cpu_present(i))
- continue;
-
- hctx = blk_mq_map_queue(q, i);
-
  /*
  * Set local node, IFF we have more than one hw queue. If
  * not, we remain on the home node of the device
  */
+ hctx = blk_mq_map_queue(q, i);
  if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
  hctx->numa_node = local_memory_node(cpu_to_node(i));
  }
@@ -2203,7 +2199,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
  *
  * If the cpu isn't present, the cpu is mapped to first hctx.
  */
- for_each_present_cpu(i) {
+ for_each_possible_cpu(i) {
  hctx_idx = q->mq_map[i];
  /* unmapped hw queue can be remapped after CPU topo changed */
  if (!set->tags[hctx_idx] &&
@@ -2257,7 +2253,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
  /*
  * Initialize batch roundrobin counts
  */
- hctx->next_cpu = cpumask_first(hctx->cpumask);
+ hctx->next_cpu = cpumask_first_and(hctx->cpumask,
+ cpu_online_mask);
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
  }
 }
--
2.17.0



[Bionic][PATCH v2 08/21] blk-mq: make sure hctx->next_cpu is set correctly

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

When hctx->next_cpu is set from possible online CPUs, there is a race
in which hctx->next_cpu may end up >= nr_cpu_ids, which eventually
breaks the workqueue.

The race can be triggered in the following two situations:

1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, causing hctx->next_cpu to be
set to a bad value.

2) blk_mq_delay_run_hw_queue() is called from CPU B and finds that the
queue should be run on the other CPU A, then CPU A may become offline at
the same time and all CPUs in hctx->cpumask become offline.

This patch deals with this issue by re-selecting the next CPU, and making
sure it is set correctly.
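
In essence the fix is a retry-and-fall-back selection. A stand-alone
sketch of that logic using plain bitmasks instead of struct cpumask
(function names and masks here are illustrative only, not the kernel
code; the real code also amortizes re-selection with next_cpu_batch):

#include <stdio.h>

#define WORK_CPU_UNBOUND (-1)
#define NR_CPU_IDS       8

/* first set bit >= from in (a & b), or NR_CPU_IDS if none */
static int next_and(unsigned int a, unsigned int b, int from)
{
        for (int cpu = from; cpu < NR_CPU_IDS; cpu++)
                if ((a & b) & (1u << cpu))
                        return cpu;
        return NR_CPU_IDS;
}

static int pick_next_cpu(unsigned int hctx_mask, unsigned int online_mask,
                         int prev)
{
        int tried = 0, cpu;
retry:
        cpu = next_and(hctx_mask, online_mask, prev + 1);
        if (cpu >= NR_CPU_IDS)
                cpu = next_and(hctx_mask, online_mask, 0);
        if (cpu >= NR_CPU_IDS)          /* no online CPU left in the mask */
                cpu = next_and(hctx_mask, ~0u, 0);

        if (!((online_mask >> cpu) & 1)) {
                if (!tried) {
                        tried = 1;
                        goto retry;
                }
                /* run unbound and force a re-selection next time */
                return WORK_CPU_UNBOUND;
        }
        return cpu;
}

int main(void)
{
        /* hctx covers CPUs 2-3, but only CPU 5 is online: unbound */
        printf("%d\n", pick_next_cpu(0x0c, 0x20, 0));
        /* CPU 3 comes back online: CPU 3 is selected */
        printf("%d\n", pick_next_cpu(0x0c, 0x28, 0));
        return 0;
}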

Cc: Christian Borntraeger <[hidden email]>
Cc: Stefan Haberland <[hidden email]>
Cc: Christoph Hellwig <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Reported-by: "jianchao.wang" <[hidden email]>
Tested-by: "jianchao.wang" <[hidden email]>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 7bed45954b95601230ebf387d3e4e20e4a3cc025)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 50ef7dc4c41e..59f4127bdaed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1257,21 +1257,47 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
+ bool tried = false;
+
  if (hctx->queue->nr_hw_queues == 1)
  return WORK_CPU_UNBOUND;
 
  if (--hctx->next_cpu_batch <= 0) {
  int next_cpu;
-
+select_cpu:
  next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
  cpu_online_mask);
  if (next_cpu >= nr_cpu_ids)
  next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
 
- hctx->next_cpu = next_cpu;
+ /*
+ * No online CPU is found, so have to make sure hctx->next_cpu
+ * is set correctly for not breaking workqueue.
+ */
+ if (next_cpu >= nr_cpu_ids)
+ hctx->next_cpu = cpumask_first(hctx->cpumask);
+ else
+ hctx->next_cpu = next_cpu;
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
  }
 
+ /*
+ * Do unbound schedule if we can't find a online CPU for this hctx,
+ * and it should only happen in the path of handling CPU DEAD.
+ */
+ if (!cpu_online(hctx->next_cpu)) {
+ if (!tried) {
+ tried = true;
+ goto select_cpu;
+ }
+
+ /*
+ * Make sure to re-select CPU next time once after CPUs
+ * in hctx->cpumask become online again.
+ */
+ hctx->next_cpu_batch = 1;
+ return WORK_CPU_UNBOUND;
+ }
  return hctx->next_cpu;
 }
 
--
2.17.0



[Bionic][PATCH v2 09/21] blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays

Jose Ricardo Ziviani
From: Bart Van Assche <[hidden email]>

Make sure that calling blk_mq_run_hw_queue() or
blk_mq_kick_requeue_list() triggers a queue run without delay even
if blk_mq_delay_run_hw_queue() has been called recently and if its
delay has not yet expired.

Reviewed-by: Mike Snitzer <[hidden email]>
Signed-off-by: Bart Van Assche <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit ae943d20624de0a6aac7dd0597616dce2c498029)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 59f4127bdaed..f5e7624d6fe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -731,7 +731,7 @@ EXPORT_SYMBOL(blk_mq_add_to_requeue_list);
 
 void blk_mq_kick_requeue_list(struct request_queue *q)
 {
- kblockd_schedule_delayed_work(&q->requeue_work, 0);
+ kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
 }
 EXPORT_SYMBOL(blk_mq_kick_requeue_list);
 
@@ -1321,9 +1321,8 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
  put_cpu();
  }
 
- kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
- &hctx->run_work,
- msecs_to_jiffies(msecs));
+ kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
+    msecs_to_jiffies(msecs));
 }
 
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
--
2.17.0



[Bionic][PATCH v2 10/21] blk-mq: fix discard merge with scheduler attached

Jose Ricardo Ziviani
From: Jens Axboe <[hidden email]>

I ran into an issue on my laptop that triggered a bug on the
discard path:

WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
 Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
 CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
 Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
 RIP: 0010:nvme_setup_cmd+0x3d3/0x430
 RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
 RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
 RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
 RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
 R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
 R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
 FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  nvme_queue_rq+0x40/0xa00
  ? __sbitmap_queue_get+0x24/0x90
  ? blk_mq_get_tag+0xa3/0x250
  ? wait_woken+0x80/0x80
  ? blk_mq_get_driver_tag+0x97/0xf0
  blk_mq_dispatch_rq_list+0x7b/0x4a0
  ? deadline_remove_request+0x49/0xb0
  blk_mq_do_dispatch_sched+0x4f/0xc0
  blk_mq_sched_dispatch_requests+0x106/0x170
  __blk_mq_run_hw_queue+0x53/0xa0
  __blk_mq_delay_run_hw_queue+0x83/0xa0
  blk_mq_run_hw_queue+0x6c/0xd0
  blk_mq_sched_insert_request+0x96/0x140
  __blk_mq_try_issue_directly+0x3d/0x190
  blk_mq_try_issue_directly+0x30/0x70
  blk_mq_make_request+0x1a4/0x6a0
  generic_make_request+0xfd/0x2f0
  ? submit_bio+0x5c/0x110
  submit_bio+0x5c/0x110
  ? __blkdev_issue_discard+0x152/0x200
  submit_bio_wait+0x43/0x60
  ext4_process_freed_data+0x1cd/0x440
  ? account_page_dirtied+0xe2/0x1a0
  ext4_journal_commit_callback+0x4a/0xc0
  jbd2_journal_commit_transaction+0x17e2/0x19e0
  ? kjournald2+0xb0/0x250
  kjournald2+0xb0/0x250
  ? wait_woken+0x80/0x80
  ? commit_timeout+0x10/0x10
  kthread+0x111/0x130
  ? kthread_create_worker_on_cpu+0x50/0x50
  ? do_group_exit+0x3a/0xa0
  ret_from_fork+0x1f/0x30
 Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
 ---[ end trace 50d361cc444506c8 ]---
 print_req_error: I/O error, dev nvme0n1, sector 847167488

Decoding the assembly, the request claims to have 0xffff segments,
while nvme counts two. This turns out to be because we don't check
for a data carrying request on the mq scheduler path, and since
blk_phys_contig_segment() returns true for a non-data request,
we decrement the initial segment count of 0 and end up with
0xffff in the unsigned short.
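
The 0xffff is just unsigned wrap-around: decrementing a zero segment
count held in an unsigned short. A trivial stand-alone demonstration:

#include <stdio.h>

int main(void)
{
        unsigned short nr_phys_segments = 0;    /* discard req, never set to 1 */

        nr_phys_segments--;                     /* bogus decrement during merge */
        printf("0x%x\n", nr_phys_segments);     /* prints 0xffff */
        return 0;
}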

There are a few issues here:

1) We should initialize the segment count for a discard to 1.
2) The discard merging is currently using the data limits for
   segments and sectors.

Fix this up by having attempt_merge() correctly identify the
request, and by initializing the segment count correctly
for discards.

This can only be triggered with mq-deadline on discard capable
devices right now, which isn't a common configuration.

Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 445251d0f4d329aa061f323546cd6388a3bb7ab5)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-core.c  |  2 ++
 block/blk-merge.c | 29 ++++++++++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b725d9e340c2..f6343ce5a80d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3260,6 +3260,8 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 {
  if (bio_has_data(bio))
  rq->nr_phys_segments = bio_phys_segments(q, bio);
+ else if (bio_op(bio) == REQ_OP_DISCARD)
+ rq->nr_phys_segments = 1;
 
  rq->__data_len = bio->bi_iter.bi_size;
  rq->bio = rq->biotail = bio;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index f5dedd57dff6..8d60a5bbcef9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -551,6 +551,24 @@ static bool req_no_special_merge(struct request *req)
  return !q->mq_ops && req->special;
 }
 
+static bool req_attempt_discard_merge(struct request_queue *q, struct request *req,
+ struct request *next)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req);
+
+ if (segments >= queue_max_discard_segments(q))
+ goto no_merge;
+ if (blk_rq_sectors(req) + bio_sectors(next->bio) >
+    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
+ goto no_merge;
+
+ req->nr_phys_segments = segments + blk_rq_nr_discard_segments(next);
+ return true;
+no_merge:
+ req_set_nomerge(q, req);
+ return false;
+}
+
 static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
  struct request *next)
 {
@@ -684,9 +702,13 @@ static struct request *attempt_merge(struct request_queue *q,
  * If we are allowed to merge, then append bio list
  * from next to rq and release next. merge_requests_fn
  * will have updated segment counts, update sector
- * counts here.
+ * counts here. Handle DISCARDs separately, as they
+ * have separate settings.
  */
- if (!ll_merge_requests_fn(q, req, next))
+ if (req_op(req) == REQ_OP_DISCARD) {
+ if (!req_attempt_discard_merge(q, req, next))
+ return NULL;
+ } else if (!ll_merge_requests_fn(q, req, next))
  return NULL;
 
  /*
@@ -716,7 +738,8 @@ static struct request *attempt_merge(struct request_queue *q,
 
  req->__data_len += blk_rq_bytes(next);
 
- elv_merge_requests(q, req, next);
+ if (req_op(req) != REQ_OP_DISCARD)
+ elv_merge_requests(q, req, next);
 
  /*
  * 'next' is going away, so update stats accordingly
--
2.17.0



[Bionic][PATCH v2 11/21] blk-mq: make sure that correct hctx->next_cpu is set

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

From commit 20e4d81393196 ("blk-mq: simplify queue mapping & schedule
with each possisble CPU"), an hctx can be mapped entirely from offline
CPUs, so hctx->next_cpu can end up set to a wrong value.

This patch fixes this issue by making hctx->next_cpu point to the
first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Stefan Haberland <[hidden email]>
Tested-by: Christian Borntraeger <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Reviewed-by: Sagi Grimberg <[hidden email]>
Fixes: 20e4d81393196 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: [hidden email]
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit a1c735fb790745f94a359df45c11df4a69760389)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f5e7624d6fe1..ce5a08e3c0f3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2280,6 +2280,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
  */
  hctx->next_cpu = cpumask_first_and(hctx->cpumask,
  cpu_online_mask);
+ if (hctx->next_cpu >= nr_cpu_ids)
+ hctx->next_cpu = cpumask_first(hctx->cpumask);
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
  }
 }
--
2.17.0



[Bionic][PATCH v2 12/21] blk-mq: don't keep offline CPUs mapped to hctx 0

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

Since commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU"),
blk-mq doesn't remap queues after the CPU topology changes. That means
when some of the offline CPUs become online, they are still mapped to
hctx 0, and hctx 0 may then become the bottleneck of IO dispatch and
completion.

This patch sets up the mapping from the beginning, aligned with the
queue mapping for PCI devices (blk_mq_pci_map_queues()).
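
A stand-alone sketch of the resulting static mapping (assuming 8 possible
CPUs and 4 hardware queues, numbers picked purely for illustration):

#include <stdio.h>

/* stand-in for the default mapping after this patch: no cpu_present()
 * special case, so every possible CPU gets its own queue index
 */
static unsigned int cpu_to_queue_index(unsigned int nr_queues, unsigned int cpu)
{
        return cpu % nr_queues;
}

int main(void)
{
        const unsigned int nr_queues = 4, nr_possible_cpus = 8;

        for (unsigned int cpu = 0; cpu < nr_possible_cpus; cpu++)
                printf("cpu %u -> hctx %u\n", cpu,
                       cpu_to_queue_index(nr_queues, cpu));
        return 0;
}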

Cc: Stefan Haberland <[hidden email]>
Cc: Keith Busch <[hidden email]>
Cc: [hidden email]
Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
Tested-by: Christian Borntraeger <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Reviewed-by: Sagi Grimberg <[hidden email]>
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit bffa9909a6b48d8ca3398dec601bc9162a4020c4)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq-cpumap.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..3eb169f15842 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -16,11 +16,6 @@
 
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
- /*
- * Non present CPU will be mapped to queue index 0.
- */
- if (!cpu_present(cpu))
- return 0;
  return cpu % nr_queues;
 }
 
--
2.17.0



[Bionic][PATCH v2 13/21] blk-mq: avoid to write intermediate result to hctx->next_cpu

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

This patch figures out the final selected CPU first and then writes it
to hctx->next_cpu once, so that other dispatch paths cannot observe an
intermediate next_cpu value.

Cc: Stefan Haberland <[hidden email]>
Tested-by: Christian Borntraeger <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Reviewed-by: Sagi Grimberg <[hidden email]>
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit 476f8c98a9bccccbb97866974ffc80879adf2bbb)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ce5a08e3c0f3..15578008dcf2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1258,26 +1258,24 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
  bool tried = false;
+ int next_cpu = hctx->next_cpu;
 
  if (hctx->queue->nr_hw_queues == 1)
  return WORK_CPU_UNBOUND;
 
  if (--hctx->next_cpu_batch <= 0) {
- int next_cpu;
 select_cpu:
- next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+ next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
  cpu_online_mask);
  if (next_cpu >= nr_cpu_ids)
- next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
+ next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
 
  /*
  * No online CPU is found, so have to make sure hctx->next_cpu
  * is set correctly for not breaking workqueue.
  */
  if (next_cpu >= nr_cpu_ids)
- hctx->next_cpu = cpumask_first(hctx->cpumask);
- else
- hctx->next_cpu = next_cpu;
+ next_cpu = cpumask_first(hctx->cpumask);
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
  }
 
@@ -1285,7 +1283,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
  * Do unbound schedule if we can't find a online CPU for this hctx,
  * and it should only happen in the path of handling CPU DEAD.
  */
- if (!cpu_online(hctx->next_cpu)) {
+ if (!cpu_online(next_cpu)) {
  if (!tried) {
  tried = true;
  goto select_cpu;
@@ -1295,10 +1293,13 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
  * Make sure to re-select CPU next time once after CPUs
  * in hctx->cpumask become online again.
  */
+ hctx->next_cpu = next_cpu;
  hctx->next_cpu_batch = 1;
  return WORK_CPU_UNBOUND;
  }
- return hctx->next_cpu;
+
+ hctx->next_cpu = next_cpu;
+ return next_cpu;
 }
 
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
--
2.17.0



[Bionic][PATCH v2 14/21] blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

This patch introduces the helper blk_mq_hw_queue_first_cpu() for
figuring out the hctx's first CPU, so that code duplication can be
avoided.

Cc: Stefan Haberland <[hidden email]>
Tested-by: Christian Borntraeger <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Reviewed-by: Sagi Grimberg <[hidden email]>
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit f82ddf1923b90f89665d08cf219287c8f9deb739)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 15578008dcf2..14053f521554 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1249,6 +1249,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  }
 }
 
+static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
+{
+ int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+
+ if (cpu >= nr_cpu_ids)
+ cpu = cpumask_first(hctx->cpumask);
+ return cpu;
+}
+
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -1268,14 +1277,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
  next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
  cpu_online_mask);
  if (next_cpu >= nr_cpu_ids)
- next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
-
- /*
- * No online CPU is found, so have to make sure hctx->next_cpu
- * is set correctly for not breaking workqueue.
- */
- if (next_cpu >= nr_cpu_ids)
- next_cpu = cpumask_first(hctx->cpumask);
+ next_cpu = blk_mq_first_mapped_cpu(hctx);
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
  }
 
@@ -2279,10 +2281,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
  /*
  * Initialize batch roundrobin counts
  */
- hctx->next_cpu = cpumask_first_and(hctx->cpumask,
- cpu_online_mask);
- if (hctx->next_cpu >= nr_cpu_ids)
- hctx->next_cpu = cpumask_first(hctx->cpumask);
+ hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
  hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
  }
 }
--
2.17.0



[Bionic][PATCH v2 15/21] blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()

Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

There are several reasons for removing the check:

1) blk_mq_hw_queue_mapped() now always returns true, since every hctx is
mapped to at least one CPU.

2) When no online CPU is mapped to this hctx, no I/O gets queued to it, and
blk_mq_run_hw_queue() only runs the queue if the hctx has I/O queued.

3) When __blk_mq_delay_run_hw_queue() is called from
blk_mq_delay_run_hw_queue(), which runs from blk_mq_dispatch_rq_list() or
scsi_mq_get_budget(), the hctx to be handled has to be mapped anyway; see
the sketch after this list.
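
A minimal sketch of the resulting function entry (matching the hunk below),
assuming the rest of the body is unchanged:

static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
                                        unsigned long msecs)
{
        /* The mapped check is gone: every hctx has at least one CPU mapped. */
        if (unlikely(blk_mq_hctx_stopped(hctx)))
                return;

        /* ... queue the (possibly delayed) run work as before ... */
}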

Cc: Stefan Haberland <[hidden email]>
Tested-by: Christian Borntraeger <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Reviewed-by: Sagi Grimberg <[hidden email]>
Signed-off-by: Ming Lei <[hidden email]>
Signed-off-by: Jens Axboe <[hidden email]>
(cherry picked from commit efea8450c3d2d3918029b36f59ef612be57d91ae)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 block/blk-mq.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 14053f521554..ffa19ff0a574 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1307,9 +1307,6 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
  unsigned long msecs)
 {
- if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
- return;
-
  if (unlikely(blk_mq_hctx_stopped(hctx)))
  return;
 
--
2.17.0



[Bionic][PATCH v2 16/21] nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors

Jose Ricardo Ziviani
In reply to this post by Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
switched irq vector spreading to cover all possible CPUs, so pass
num_possible_cpus() as the maximum number of vectors to be assigned.

For example, in an 8-core system with CPUs 0~3 online and 4~7 offline/not
present, as shown by 'lscpu':

        [ming@box]$lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                4
        On-line CPU(s) list:   0-3
        Thread(s) per core:    1
        Core(s) per socket:    2
        Socket(s):             2
        NUMA node(s):          2
        ...
        NUMA node0 CPU(s):     0-3
        NUMA node1 CPU(s):
        ...

1) before this patch, the allocated vectors and their affinity were:
        irq 47, cpu list 0,4
        irq 48, cpu list 1,6
        irq 49, cpu list 2,5
        irq 50, cpu list 3,7

2) after this patch, the allocated vectors and their affinity are:
        irq 43, cpu list 0
        irq 44, cpu list 1
        irq 45, cpu list 2
        irq 46, cpu list 3
        irq 47, cpu list 4
        irq 48, cpu list 6
        irq 49, cpu list 5
        irq 50, cpu list 7
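
As a rough illustration of the allocation side (a sketch only; the nvme code
itself sizes the queues through nvme_set_queue_count() as in the hunk below,
and the helper name here is hypothetical):

#include <linux/cpumask.h>
#include <linux/pci.h>

/* Request up to one managed MSI-X vector per possible CPU. */
static int example_alloc_queue_vectors(struct pci_dev *pdev)
{
        unsigned int max_vecs = num_possible_cpus();

        /* PCI_IRQ_AFFINITY spreads the vectors across (possible) CPUs. */
        return pci_alloc_irq_vectors(pdev, 1, max_vecs,
                                     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
}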

Cc: Keith Busch <[hidden email]>
Cc: Sagi Grimberg <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Christoph Hellwig <[hidden email]>
Signed-off-by: Ming Lei <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Signed-off-by: Keith Busch <[hidden email]>
(cherry picked from commit 16ccfff2897613007b5eda9e29d65303c6280026)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3a0fcb7a6c76..e57f19a6db4f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1898,7 +1898,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
  int result, nr_io_queues;
  unsigned long size;
 
- nr_io_queues = num_present_cpus();
+ nr_io_queues = num_possible_cpus();
  result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
  if (result < 0)
  return result;
--
2.17.0



[Bionic][PATCH v2 17/21] scsi: hpsa: fix selection of reply queue

Jose Ricardo Ziviani
In reply to this post by Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

Since commit 84676c1f21e8 ("genirq/affinity: assign vectors to all
possible CPUs") we could end up with an MSI-X vector that did not have
any online CPUs mapped. This would lead to I/O hangs since there was no
CPU to receive the completion.

Retrieve IRQ affinity information using pci_irq_get_affinity() and use
this mapping to choose a reply queue.

[mkp: tweaked commit desc]
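
A minimal sketch of that mapping pattern (the hpsa-specific version,
hpsa_setup_reply_map(), is in the hunk below; the generic helper name here
is hypothetical):

#include <linux/cpumask.h>
#include <linux/pci.h>

/*
 * Record, for every possible CPU, which MSI-X vector (reply queue)
 * completions for commands submitted on that CPU should use.
 */
static void example_setup_reply_map(struct pci_dev *pdev,
                                    unsigned int *reply_map,
                                    unsigned int nr_vectors)
{
        const struct cpumask *mask;
        unsigned int queue, cpu;

        for (queue = 0; queue < nr_vectors; queue++) {
                mask = pci_irq_get_affinity(pdev, queue);
                if (!mask)
                        goto fallback;

                for_each_cpu(cpu, mask)
                        reply_map[cpu] = queue;
        }
        return;

fallback:
        /* No affinity information: fall back to reply queue 0. */
        for_each_possible_cpu(cpu)
                reply_map[cpu] = 0;
}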

Cc: Hannes Reinecke <[hidden email]>
Cc: "Martin K. Petersen" <[hidden email]>,
Cc: James Bottomley <[hidden email]>,
Cc: Christoph Hellwig <[hidden email]>,
Cc: Don Brace <[hidden email]>
Cc: Kashyap Desai <[hidden email]>
Cc: Laurence Oberman <[hidden email]>
Cc: Meelis Roos <[hidden email]>
Cc: Artem Bityutskiy <[hidden email]>
Cc: Mike Snitzer <[hidden email]>
Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
Signed-off-by: Ming Lei <[hidden email]>
Tested-by: Laurence Oberman <[hidden email]>
Tested-by: Don Brace <[hidden email]>
Tested-by: Artem Bityutskiy <[hidden email]>
Acked-by: Don Brace <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Signed-off-by: Martin K. Petersen <[hidden email]>
(cherry picked from commit 8b834bff1b73dce46f4e9f5e84af6f73fed8b0ef)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 drivers/scsi/hpsa.c | 73 +++++++++++++++++++++++++++++++++------------
 drivers/scsi/hpsa.h |  1 +
 2 files changed, 55 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index b54d17be6d01..211975cc28c9 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -1045,11 +1045,7 @@ static void set_performant_mode(struct ctlr_info *h, struct CommandList *c,
  c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
  if (unlikely(!h->msix_vectors))
  return;
- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
- c->Header.ReplyQueue =
- raw_smp_processor_id() % h->nreply_queues;
- else
- c->Header.ReplyQueue = reply_queue % h->nreply_queues;
+ c->Header.ReplyQueue = reply_queue;
  }
 }
 
@@ -1063,10 +1059,7 @@ static void set_ioaccel1_performant_mode(struct ctlr_info *h,
  * Tell the controller to post the reply to the queue for this
  * processor.  This seems to give the best I/O throughput.
  */
- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
- cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
- else
- cp->ReplyQueue = reply_queue % h->nreply_queues;
+ cp->ReplyQueue = reply_queue;
  /*
  * Set the bits in the address sent down to include:
  *  - performant mode bit (bit 0)
@@ -1087,10 +1080,7 @@ static void set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
  /* Tell the controller to post the reply to the queue for this
  * processor.  This seems to give the best I/O throughput.
  */
- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
- cp->reply_queue = smp_processor_id() % h->nreply_queues;
- else
- cp->reply_queue = reply_queue % h->nreply_queues;
+ cp->reply_queue = reply_queue;
  /* Set the bits in the address sent down to include:
  *  - performant mode bit not used in ioaccel mode 2
  *  - pull count (bits 0-3)
@@ -1109,10 +1099,7 @@ static void set_ioaccel2_performant_mode(struct ctlr_info *h,
  * Tell the controller to post the reply to the queue for this
  * processor.  This seems to give the best I/O throughput.
  */
- if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
- cp->reply_queue = smp_processor_id() % h->nreply_queues;
- else
- cp->reply_queue = reply_queue % h->nreply_queues;
+ cp->reply_queue = reply_queue;
  /*
  * Set the bits in the address sent down to include:
  *  - performant mode bit not used in ioaccel mode 2
@@ -1157,6 +1144,8 @@ static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
 {
  dial_down_lockup_detection_during_fw_flash(h, c);
  atomic_inc(&h->commands_outstanding);
+
+ reply_queue = h->reply_map[raw_smp_processor_id()];
  switch (c->cmd_type) {
  case CMD_IOACCEL1:
  set_ioaccel1_performant_mode(h, c, reply_queue);
@@ -7384,6 +7373,26 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
  h->msix_vectors = 0;
 }
 
+static void hpsa_setup_reply_map(struct ctlr_info *h)
+{
+ const struct cpumask *mask;
+ unsigned int queue, cpu;
+
+ for (queue = 0; queue < h->msix_vectors; queue++) {
+ mask = pci_irq_get_affinity(h->pdev, queue);
+ if (!mask)
+ goto fallback;
+
+ for_each_cpu(cpu, mask)
+ h->reply_map[cpu] = queue;
+ }
+ return;
+
+fallback:
+ for_each_possible_cpu(cpu)
+ h->reply_map[cpu] = 0;
+}
+
 /* If MSI/MSI-X is supported by the kernel we will try to enable it on
  * controllers that are capable. If not, we use legacy INTx mode.
  */
@@ -7779,6 +7788,10 @@ static int hpsa_pci_init(struct ctlr_info *h)
  err = hpsa_interrupt_mode(h);
  if (err)
  goto clean1;
+
+ /* setup mapping between CPU and reply queue */
+ hpsa_setup_reply_map(h);
+
  err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
  if (err)
  goto clean2; /* intmode+region, pci */
@@ -8490,6 +8503,28 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
  return wq;
 }
 
+static void hpda_free_ctlr_info(struct ctlr_info *h)
+{
+ kfree(h->reply_map);
+ kfree(h);
+}
+
+static struct ctlr_info *hpda_alloc_ctlr_info(void)
+{
+ struct ctlr_info *h;
+
+ h = kzalloc(sizeof(*h), GFP_KERNEL);
+ if (!h)
+ return NULL;
+
+ h->reply_map = kzalloc(sizeof(*h->reply_map) * nr_cpu_ids, GFP_KERNEL);
+ if (!h->reply_map) {
+ kfree(h);
+ return NULL;
+ }
+ return h;
+}
+
 static int hpsa_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
  int dac, rc;
@@ -8527,7 +8562,7 @@ static int hpsa_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
  * the driver.  See comments in hpsa.h for more info.
  */
  BUILD_BUG_ON(sizeof(struct CommandList) % COMMANDLIST_ALIGNMENT);
- h = kzalloc(sizeof(*h), GFP_KERNEL);
+ h = hpda_alloc_ctlr_info();
  if (!h) {
  dev_err(&pdev->dev, "Failed to allocate controller head\n");
  return -ENOMEM;
@@ -8926,7 +8961,7 @@ static void hpsa_remove_one(struct pci_dev *pdev)
  h->lockup_detected = NULL; /* init_one 2 */
  /* (void) pci_disable_pcie_error_reporting(pdev); */ /* init_one 1 */
 
- kfree(h); /* init_one 1 */
+ hpda_free_ctlr_info(h); /* init_one 1 */
 }
 
 static int hpsa_suspend(__attribute__((unused)) struct pci_dev *pdev,
diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
index 018f980a701c..fb9f5e7f8209 100644
--- a/drivers/scsi/hpsa.h
+++ b/drivers/scsi/hpsa.h
@@ -158,6 +158,7 @@ struct bmic_controller_parameters {
 #pragma pack()
 
 struct ctlr_info {
+ unsigned int *reply_map;
  int ctlr;
  char devname[8];
  char    *product_name;
--
2.17.0



[Bionic][PATCH v2 18/21] scsi: megaraid_sas: fix selection of reply queue

Jose Ricardo Ziviani
In reply to this post by Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

Since commit 84676c1f21e8 ("genirq/affinity: assign vectors to all
possible CPUs") we could end up with an MSI-X vector that did not have
any online CPUs mapped. This would lead to I/O hangs since there was no
CPU to receive the completion.

Retrieve IRQ affinity information using pci_irq_get_affinity() and use
this mapping to choose a reply queue.

[mkp: tweaked commit desc]
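
The submission side then indexes the map with the submitting CPU; a minimal
sketch (mirroring the fusion hunks below; the helper name is hypothetical):

#include <linux/smp.h>

/* Pick the reply queue whose MSI-X vector covers the submitting CPU. */
static unsigned int example_pick_reply_queue(const unsigned int *reply_map)
{
        return reply_map[raw_smp_processor_id()];
}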

Cc: Hannes Reinecke <[hidden email]>
Cc: "Martin K. Petersen" <[hidden email]>,
Cc: James Bottomley <[hidden email]>,
Cc: Christoph Hellwig <[hidden email]>,
Cc: Don Brace <[hidden email]>
Cc: Kashyap Desai <[hidden email]>
Cc: Laurence Oberman <[hidden email]>
Cc: Mike Snitzer <[hidden email]>
Cc: Meelis Roos <[hidden email]>
Cc: Artem Bityutskiy <[hidden email]>
Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
Signed-off-by: Ming Lei <[hidden email]>
Acked-by: Kashyap Desai <[hidden email]>
Tested-by: Kashyap Desai <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Tested-by: Artem Bityutskiy <[hidden email]>
Signed-off-by: Martin K. Petersen <[hidden email]>
(cherry picked from commit adbe552349f2d1e48357a00e564d26135e586634)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 drivers/scsi/megaraid/megaraid_sas.h        |  1 +
 drivers/scsi/megaraid/megaraid_sas_base.c   | 39 +++++++++++++++++++--
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 12 +++----
 3 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index f5a36ccb8606..394a8c6b8064 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -2107,6 +2107,7 @@ enum MR_PD_TYPE {
 
 struct megasas_instance {
 
+ unsigned int *reply_map;
  __le32 *producer;
  dma_addr_t producer_h;
  __le32 *consumer;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d4129469a73c..de75c88a2d16 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5140,6 +5140,26 @@ megasas_setup_jbod_map(struct megasas_instance *instance)
  instance->use_seqnum_jbod_fp = false;
 }
 
+static void megasas_setup_reply_map(struct megasas_instance *instance)
+{
+ const struct cpumask *mask;
+ unsigned int queue, cpu;
+
+ for (queue = 0; queue < instance->msix_vectors; queue++) {
+ mask = pci_irq_get_affinity(instance->pdev, queue);
+ if (!mask)
+ goto fallback;
+
+ for_each_cpu(cpu, mask)
+ instance->reply_map[cpu] = queue;
+ }
+ return;
+
+fallback:
+ for_each_possible_cpu(cpu)
+ instance->reply_map[cpu] = cpu % instance->msix_vectors;
+}
+
 /**
  * megasas_init_fw - Initializes the FW
  * @instance: Adapter soft state
@@ -5318,6 +5338,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
  goto fail_setup_irqs;
  }
 
+ megasas_setup_reply_map(instance);
+
  dev_info(&instance->pdev->dev,
  "firmware supports msix\t: (%d)", fw_msix_count);
  dev_info(&instance->pdev->dev,
@@ -6094,20 +6116,29 @@ static inline int megasas_alloc_mfi_ctrl_mem(struct megasas_instance *instance)
  */
 static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
 {
+ instance->reply_map = kzalloc(sizeof(unsigned int) * nr_cpu_ids,
+      GFP_KERNEL);
+ if (!instance->reply_map)
+ return -ENOMEM;
+
  switch (instance->adapter_type) {
  case MFI_SERIES:
  if (megasas_alloc_mfi_ctrl_mem(instance))
- return -ENOMEM;
+ goto fail;
  break;
  case VENTURA_SERIES:
  case THUNDERBOLT_SERIES:
  case INVADER_SERIES:
  if (megasas_alloc_fusion_context(instance))
- return -ENOMEM;
+ goto fail;
  break;
  }
 
  return 0;
+ fail:
+ kfree(instance->reply_map);
+ instance->reply_map = NULL;
+ return -ENOMEM;
 }
 
 /*
@@ -6119,6 +6150,7 @@ static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
  */
 static inline void megasas_free_ctrl_mem(struct megasas_instance *instance)
 {
+ kfree(instance->reply_map);
  if (instance->adapter_type == MFI_SERIES) {
  if (instance->producer)
  pci_free_consistent(instance->pdev, sizeof(u32),
@@ -6512,7 +6544,6 @@ static int megasas_probe_one(struct pci_dev *pdev,
  pci_free_irq_vectors(instance->pdev);
 fail_init_mfi:
  scsi_host_put(host);
-
 fail_alloc_instance:
  pci_disable_device(pdev);
 
@@ -6717,6 +6748,8 @@ megasas_resume(struct pci_dev *pdev)
  if (rval < 0)
  goto fail_reenable_msix;
 
+ megasas_setup_reply_map(instance);
+
  if (instance->adapter_type != MFI_SERIES) {
  megasas_reset_reply_desc(instance);
  if (megasas_ioc_init_fusion(instance)) {
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index be2992509b8c..7ef81d6290b4 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2641,11 +2641,8 @@ megasas_build_ldio_fusion(struct megasas_instance *instance,
  fp_possible = (io_info.fpOkForIo > 0) ? true : false;
  }
 
- /* Use raw_smp_processor_id() for now until cmd->request->cpu is CPU
-   id by default, not CPU group id, otherwise all MSI-X queues won't
-   be utilized */
- cmd->request_desc->SCSIIO.MSIxIndex = instance->msix_vectors ?
- raw_smp_processor_id() % instance->msix_vectors : 0;
+ cmd->request_desc->SCSIIO.MSIxIndex =
+ instance->reply_map[raw_smp_processor_id()];
 
  praid_context = &io_request->RaidContext;
 
@@ -2967,10 +2964,9 @@ megasas_build_syspd_fusion(struct megasas_instance *instance,
  }
 
  cmd->request_desc->SCSIIO.DevHandle = io_request->DevHandle;
- cmd->request_desc->SCSIIO.MSIxIndex =
- instance->msix_vectors ?
- (raw_smp_processor_id() % instance->msix_vectors) : 0;
 
+ cmd->request_desc->SCSIIO.MSIxIndex =
+ instance->reply_map[raw_smp_processor_id()];
 
  if (!fp_possible) {
  /* system pd firmware path */
--
2.17.0



[Bionic][PATCH v2 19/21] scsi: core: introduce force_blk_mq

Jose Ricardo Ziviani
In reply to this post by Jose Ricardo Ziviani
From: Ming Lei <[hidden email]>

This patch introduces 'force_blk_mq' to the scsi_host_template so that
drivers with no desire to support the legacy I/O path can signal
blk-mq-only support.

[mkp: commit desc]
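
A driver opts in by setting the new flag in its host template; a minimal
sketch with a hypothetical driver name:

#include <linux/module.h>
#include <scsi/scsi_host.h>

static struct scsi_host_template example_sht = {
        .module         = THIS_MODULE,
        .name           = "example",
        /* The legacy request path is not supported; force blk-mq. */
        .force_blk_mq   = 1,
};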

Cc: Omar Sandoval <[hidden email]>,
Cc: "Martin K. Petersen" <[hidden email]>,
Cc: James Bottomley <[hidden email]>,
Cc: Christoph Hellwig <[hidden email]>,
Cc: Don Brace <[hidden email]>
Cc: Kashyap Desai <[hidden email]>
Cc: Mike Snitzer <[hidden email]>
Cc: Laurence Oberman <[hidden email]>
Signed-off-by: Ming Lei <[hidden email]>
Reviewed-by: Hannes Reinecke <[hidden email]>
Reviewed-by: Christoph Hellwig <[hidden email]>
Signed-off-by: Martin K. Petersen <[hidden email]>
(cherry picked from commit 2f31115e940c4afd49b99c33123534e2ac924ffb)
Signed-off-by: Jose Ricardo Ziviani <[hidden email]>
---
 drivers/scsi/hosts.c     | 1 +
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index dd9464920456..ef22b275d050 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -474,6 +474,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
  shost->dma_boundary = 0xffffffff;
 
  shost->use_blk_mq = scsi_use_blk_mq;
+ shost->use_blk_mq = scsi_use_blk_mq || shost->hostt->force_blk_mq;
 
  device_initialize(&shost->shost_gendev);
  dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index a8b7bf879ced..9c1e4bad6581 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -452,6 +452,9 @@ struct scsi_host_template {
  /* True if the controller does not support WRITE SAME */
  unsigned no_write_same:1;
 
+ /* True if the low-level driver supports blk-mq only */
+ unsigned force_blk_mq:1;
+
  /*
  * Countdown for host blocking with no commands outstanding.
  */
--
2.17.0

