[SRU Xenial][PATCH 0/2] Improve our SAUCE for virtio-scsi reqs counter (fix CPU soft lockup)


[SRU Xenial][PATCH 0/2] Improve our SAUCE for virtio-scsi reqs counter (fix CPU soft lockup)

Mauricio Faria de Oliveira
BugLink: https://bugs.launchpad.net/bugs/1798110

[Impact]

 * Detaching virtio-scsi disk in Xenial guest can cause
   CPU soft lockup in guest (and take 100% CPU in host).

 * It may prevent further progress on other tasks that
   depend on resources locked earlier in the SCSI target
   removal stack, and/or impact other SCSI functionality.

 * The fix resolves a corner case in the requests counter
   in the virtio SCSI target, which impacts a downstream
   (SAUCE) patch in the virtio-scsi target removal handler
   that depends on the requests counter value to be zero.

[Test Case]

 * See comment #3 in LP #1798110 (this bug) for a synthetic test
   case with GDB+QEMU (too long for this section) and comment #4
   for an organic test case in a cloud instance.

[Regression Potential]

 * It seems low -- this only affects the SCSI command requeue
   path with regard to the requests counter, which only has a
   real chance of causing problems in our downstream patch
   (which now passes this test case).

 * The other, less serious, issue would be decrementing it to a
   negative value, which is not possible with this driver logic
   (see commit message): the reqs counter is always incremented
   before virtscsi_queuecommand() is called, which is where this
   decrement operation is inserted (see the sketch below).
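
 * For reference, a minimal sketch of that calling pattern, modeled
   on the upstream virtscsi_queuecommand_single() (simplified here;
   not the exact Xenial source):

    static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
                                            struct scsi_cmnd *sc)
    {
            struct virtio_scsi *vscsi = shost_priv(sh);
            struct virtio_scsi_target_state *tgt =
                                    scsi_target(sc->device)->hostdata;

            /* The increment always comes first, so a decrement in the
             * callee's error path cannot drive the counter below zero. */
            atomic_inc(&tgt->reqs);
            return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
    }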


Mauricio Faria de Oliveira (2):
  UBUNTU: SAUCE: (no-up) virtio-scsi: Decrement reqs counter before SCSI
    command requeue
  UBUNTU: SAUCE: (no-up) virtio-scsi: (cosmetic) reestablish
    virtscsi_pick_vq_mq() signature

 drivers/scsi/virtio_scsi.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--
2.17.1



[SRU Xenial][PATCH 1/2] UBUNTU: SAUCE: (no-up) virtio-scsi: Decrement reqs counter before SCSI command requeue

Mauricio Faria de Oliveira
BugLink: https://bugs.launchpad.net/bugs/1798110

The 'reqs' counter is incremented in the SCSI .queuecommand() path,
right before virtscsi_queuecommand() is called, in either
 - virtscsi_queuecommand_single(), or
 - virtscsi_queuecommand_multi(), via virtscsi_pick_vq{_mq}().

And it's decremented only in the SCSI command completion callback
(after the command is successfully queued and completed by adapter):
 - virtscsi_complete_cmd().

This allows the counter to be incremented but _not_ decremented
if virtscsi_queuecommand() fails to add/kick the command to the
virtio ring (i.e., virtscsi_kick_cmd() returns neither 0 nor -EIO).

    static int virtscsi_queuecommand(...)
    {
            ...
            ret = virtscsi_kick_cmd(...);
            if (ret == -EIO) {
                    ...
                    virtscsi_complete_cmd(vscsi, cmd);
                    ...
            } else if (ret != 0) {
                    return SCSI_MLQUEUE_HOST_BUSY;
            }
            return 0;
    }

In that case, the return code SCSI_MLQUEUE_HOST_BUSY causes the SCSI
command to be requeued by the SCSI layer, which sends it again later
in the .queuecommand() path -- incrementing the reqs counter _again_.

This may happen many times for the same SCSI command, depending on
the virtio ring condition/implementation, so the reqs counter will
never return to zero (it's decremented only once in the completion
callback). And it may happen for (m)any SCSI commands in this path.
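
Schematically, for a single command that is requeued once (an
illustrative trace, not actual driver output):

    virtscsi_queuecommand_single()  atomic_inc   -> reqs = 1
    virtscsi_kick_cmd() fails, requeued          -> reqs = 1 (no dec)
    virtscsi_queuecommand_single()  atomic_inc   -> reqs = 2
    virtscsi_kick_cmd() succeeds                 -> reqs = 2
    virtscsi_complete_cmd()         atomic_dec   -> reqs = 1 (never 0)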

Unfortunately, that causes a problem with a downstream/SAUCE patch
for Xenial, which uses the reqs counter to sync with the completion
callback: commit f1f609d8015e ("UBUNTU: SAUCE: (no-up) virtio-scsi:
Fix race in target free"), and waits for the value to become zero.
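
As a rough sketch, reconstructed from that commit's description (not
the verbatim patch), the removal path waits like this:

    static void virtscsi_target_destroy(struct scsi_target *starget)
    {
            struct virtio_scsi_target_state *tgt = starget->hostdata;

            /* Wait for in-flight requests to complete; if 'reqs' can
             * never return to zero, this loop spins forever. */
            while (atomic_read(&tgt->reqs))
                    cpu_relax();

            kfree(tgt);
    }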

This problem, combined with that patch, prevents the SCSI target
removal from finishing, eventually causing a CPU soft lockup on
another CPU that is waiting for some kernel resource that remains
locked in the call chain of this CPU.

This has been verified 1) with a synthetic test case with QEMU+GDB
that fakes the number of available elements in the virtio ring one
time (harmless), so as to force the SCSI command to be requeued, and
then uses the QEMU monitor to remove the virtio-scsi target.

_AND_ 2) with the test case reported by the customer (a for-loop on
a cloud instance that repeatedly mounts the virtio-scsi drive, copies
data out of it, unmounts it, then detaches the virtio-scsi drive).
(Here, the problem usually happens in the 1st or 2nd iteration, but
with the patch it has run for 35 iterations without any problems.)

Upstream has done away with the reqs counter (originally used only
to check whether any requests were still active, for queue steering;
not for our sync purposes).  Instead of trying to find an alternative
sync mechanism for now, let's just fix the behavior we know is
incorrect.

Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
Tested-by: Mauricio Faria de Oliveira <[hidden email]>
Tested-by: David Coronel <[hidden email]>
---
 drivers/scsi/virtio_scsi.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index b4a41d581021..572722e86bef 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -571,6 +571,10 @@ static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
 		virtscsi_complete_cmd(vscsi, cmd);
 		spin_unlock_irqrestore(&req_vq->vq_lock, flags);
 	} else if (ret != 0) {
+		/* The SCSI command requeue will increment 'tgt->reqs' again. */
+		struct virtio_scsi_target_state *tgt =
+				scsi_target(sc->device)->hostdata;
+		atomic_dec(&tgt->reqs);
 		return SCSI_MLQUEUE_HOST_BUSY;
 	}
 	return 0;
--
2.17.1



[SRU Xenial][PATCH 2/2] UBUNTU: SAUCE: (no-up) virtio-scsi: (cosmetic) reestablish virtscsi_pick_vq_mq() signature

Mauricio Faria de Oliveira
BugLink: https://bugs.launchpad.net/bugs/1798110

The commit a6c038a533f3 ("UBUNTU: SAUCE: (no-up) virtio-scsi: Increment
reqs counter.") changed the upstream signature of virtscsi_pick_vq_mq()
just to pass the 'struct virtio_scsi_target_state *tgt' pointer, which
can be derived from the 'struct scsi_cmnd *sc' pointer with the
scsi_target() macro (as is indeed done in many places in this file).

This doesn't matter functionally, because the function is static and
thus not called externally, but in order to reestablish the same
signature as upstream (since there's no real need to change it),
revert that part of the commit and use the scsi_target() macro, as
done elsewhere, to get the 'tgt' pointer from 'sc'.

Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
Tested-by: Mauricio Faria de Oliveira <[hidden email]>
Tested-by: David Coronel <[hidden email]>
---
 drivers/scsi/virtio_scsi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 572722e86bef..377ef50d57b4 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -592,10 +592,12 @@ static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
 }
 
 static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
-                                                  struct virtio_scsi_target_state *tgt, struct scsi_cmnd *sc)
+                                                  struct scsi_cmnd *sc)
 {
 	u32 tag = blk_mq_unique_tag(sc->request);
 	u16 hwq = blk_mq_unique_tag_to_hwq(tag);
+	struct virtio_scsi_target_state *tgt =
+				scsi_target(sc->device)->hostdata;
 
 	atomic_inc(&tgt->reqs);
 	return &vscsi->req_vqs[hwq];
@@ -647,7 +649,7 @@ static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
 	struct virtio_scsi_vq *req_vq;
 
 	if (shost_use_blk_mq(sh))
-		req_vq = virtscsi_pick_vq_mq(vscsi, tgt, sc);
+		req_vq = virtscsi_pick_vq_mq(vscsi, sc);
 	else
 		req_vq = virtscsi_pick_vq(vscsi, tgt);
 
--
2.17.1



ACK/cmnt: [SRU Xenial][PATCH 0/2] Improve our SAUCE for virtio-scsi reqs counter (fix CPU soft lockup)

Khaled Elmously
On 2018-10-16 12:38:18 , Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1798110
>
> [snip]
>
> Mauricio Faria de Oliveira (2):
>   UBUNTU: SAUCE: (no-up) virtio-scsi: Decrement reqs counter before SCSI
>     command requeue
>   UBUNTU: SAUCE: (no-up) virtio-scsi: (cosmetic) reestablish
>     virtscsi_pick_vq_mq() signature
>

 - Maybe I'm misunderstanding what you mean by 'no-up'. I assumed it means "no operation", as in, no change in the resulting binary (e.g., whitespace changes, comment changes, casting etc.). Both patches seem to be "op" changes
 - The second patch doesn't actually appear to be related to the 'reqs' counter or any related soft-lockups
 - It seems strange to me that you'd have a loop that repeatedly decrements a counter because you know it will be incremented again when about to retry. I would think it makes more sense to increment the counter only once when queuing the command and decrement only once on success or "giving up" - though I assume you did it this way for a reason?


None of the above points are major enough for me to NAK this patchset, but I wanted to bring them to your attention (in case you missed something).

Acked-by: Khalid Elmously <[hidden email]>



Re: ACK/cmnt: [SRU Xenial][PATCH 0/2] Improve our SAUCE for virtio-scsi reqs counter (fix CPU soft lockup)

Mauricio Faria de Oliveira
On Tue, Oct 16, 2018 at 1:42 PM Khaled Elmously
<[hidden email]> wrote:

>
> On 2018-10-16 12:38:18 , Mauricio Faria de Oliveira wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1798110
> >
> > [snip]

Hi Khalid,

Thanks for reviewing.

>  - Maybe I'm misunderstanding what you mean by 'no-up'. I assumed it means "no operation", as in, no change in the resulting binary (e.g., whitespace changes, comment changes, casting etc.). Both patches seem to be "op" changes

Hm, I meant 'no up-stream' (e.g., downstream/SAUCE patch that is not
expected to go upstream, as described in the StablePatchFormat wiki
page [3]).

>  - The second patch doesn't actually appear to be related to the 'reqs' counter or any related soft-lockups

It's related to the reqs counter "area", and you're absolutely right
that it has no relation to the soft lockup.

I should have mentioned that the 2nd patch was more of a 'while still
here' patch, sorry. It is related to the last applied virtio-scsi
no-up patch, which touches the reqs counter functionality as well.

>  - It seems strange to me that you'd have a loop that repeatedly decrements a counter because you know it will be incremented again when about to retry. I would think it makes more sense to increment the counter only once when queuing the command and decrement only once on success or "giving up" - though I assume you did it this way for a reason?

I didn't follow you on the loop scenario (I don't see a loop unless a
SCSI command repeatedly fails to be kicked and gets requeued/resent
by the SCSI layer - is that it?)

I agree, and indeed, there's a reason -- upstream only used the reqs
counter in the .queuecommand path for queue steering in two cases:
1) single-virtqueue, and 2) multi-virtqueue without blk-mq -- so not
all cases had it. (And the first one makes it hard to move the
increment call around; see below.)

Then we introduced the first reqs-related SAUCE patch [1] (spin-loop
in virtscsi_target_destroy() until reqs is zero), which fixed a race
but introduced a regression for multi-virtqueue with blk-mq (which
didn't increment the counter).

The second reqs-related SAUCE patch [2] fixed that regression for
multi-virtqueue blk-mq, but to do so it ended up adding the increment
in the blk-mq related function as well. (Even where cases already did
atomic_inc(), it would be hard to move: one case uses
atomic_inc_return() and checks the counter value right there, which
complicates moving that around.)
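
For reference, here's a simplified sketch of that steering case
(modeled on the upstream virtscsi_pick_vq(); not the exact Xenial
code), where atomic_inc_return() both counts the request and gates
the re-steering decision:

    static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
                                                   struct virtio_scsi_target_state *tgt)
    {
            struct virtio_scsi_vq *vq;
            unsigned long flags;
            u32 queue_num;

            spin_lock_irqsave(&tgt->tgt_lock, flags);

            if (atomic_inc_return(&tgt->reqs) > 1) {
                    /* Requests in flight: keep using the same virtqueue. */
                    vq = tgt->req_vq;
            } else {
                    /* Target idle: re-steer it to the local CPU's queue. */
                    queue_num = smp_processor_id() % vscsi->num_queues;
                    tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
            }

            spin_unlock_irqrestore(&tgt->tgt_lock, flags);
            return vq;
    }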

The patch [1] introduced another regression (this CPU soft lockup),
which this patch now addresses.

I'd like to investigate later whether there's a way we can accomplish
the sync in commit [1] without resorting to the reqs counter, as it
fixed one issue but introduced two so far. That would allow us to
drop these SAUCE patches.

I'll likely have a chat w/ Jay about it this week, but since the fix
was obvious and conservative, I imagined it would be better to have
it fixed on top for now, and then refactored/re-fixed later.

[1] f1f609d8015e UBUNTU: SAUCE: (no-up) virtio-scsi: Fix race in target free
[2] a6c038a533f3 UBUNTU: SAUCE: (no-up) virtio-scsi: Increment reqs counter.


> None of the above points are major enough for me to NAK this patchset, but I wanted to bring them to your attention (in case you missed something).

Sure, thanks for the careful review.
Hope this helps clarify things!

> Acked-by: Khalid Elmously <[hidden email]>
>

[3] https://wiki.ubuntu.com/Kernel/Dev/StablePatchFormat

--
Mauricio Faria de Oliveira


ACK: [SRU Xenial][PATCH 1/2] UBUNTU: SAUCE: (no-up) virtio-scsi: Decrement reqs counter before SCSI command requeue

Stefan Bader
On 16.10.18 17:38, Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1798110
>
> [snip]
>
> Signed-off-by: Mauricio Faria de Oliveira <[hidden email]>
> Tested-by: Mauricio Faria de Oliveira <[hidden email]>
> Tested-by: David Coronel <[hidden email]>
Acked-by: Stefan Bader <[hidden email]>



NACK: [SRU Xenial][PATCH 2/2] UBUNTU: SAUCE: (no-up) virtio-scsi: (cosmetic) reestablish virtscsi_pick_vq_mq() signature

Stefan Bader
On 16.10.18 17:38, Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1798110
>
> [snip]
>
I was thinking about this one for a while, but in the end I think it
does not qualify for SRU -- even less by piggy-backing on a fix for a
completely different issue. Even individually it does not solve any
bug (not even an issue for external drivers using the interface, as
the function is static). For those reasons I do not think this should
be added. Let's only fix the bug that needs fixing.

-Stefan




Re: NACK: [SRU Xenial][PATCH 2/2] UBUNTU: SAUCE: (no-up) virtio-scsi: (cosmetic) reestablish virtscsi_pick_vq_mq() signature

Mauricio Faria de Oliveira
On Wed, Oct 17, 2018 at 5:20 AM Stefan Bader <[hidden email]> wrote:

>
> On 16.10.18 17:38, Mauricio Faria de Oliveira wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1798110
> >
> > [snip]
>
> I was thinking about this one for a while, but in the end I think it
> does not qualify for SRU -- even less by piggy-backing on a fix for a
> completely different issue. Even individually it does not solve any
> bug (not even an issue for external drivers using the interface, as
> the function is static). For those reasons I do not think this should
> be added. Let's only fix the bug that needs fixing.

Good point. Indeed it's not a fix, so I guess it would only be worth
it if it actually made other patches (i.e., actual fixes) harder to
apply, which is not the case.

Thanks for reviewing.



--
Mauricio Faria de Oliveira


Re: ACK/cmnt: [SRU Xenial][PATCH 0/2] Improve our SAUCE for virtio-scsi reqs counter (fix CPU soft lockup)

Khaled Elmously
On 2018-10-16 15:08:48 , Mauricio Faria de Oliveira wrote:

> On Tue, Oct 16, 2018 at 1:42 PM Khaled Elmously
> <[hidden email]> wrote:
> >
> > On 2018-10-16 12:38:18 , Mauricio Faria de Oliveira wrote:
> > > BugLink: https://bugs.launchpad.net/bugs/1798110
> > >
> > > [snip]
> >
>
> Hi Khalid,
>
> Thanks for reviewing.
>
> >  - Maybe I'm misunderstanding what you mean by 'no-up'. I assumed it means "no operation", as in, no change in the resulting binary (e.g., whitespace changes, comment changes, casting etc.). Both patches seem to be "op" changes
>
> Hm, I meant 'no up-stream' (e.g., downstream/SAUCE patch that is not
> expected to go upstream, as described in the StablePatchFormat wiki
> page [3]).
>

I had no idea that that's what "no-up" means :) Thanks


> >  - The second patch doesn't actually appear to be related to the 'reqs' counter or any related soft-lockups
>
> It's related to the reqs counter "area", and you're absolutely right
> that it has no relation to the soft lockup.
>
> I should have mentioned that the 2nd patch was more of a 'while still
> here' patch, sorry. It is related to the last applied virtio-scsi
> no-up patch, which touches the reqs counter functionality as well.
>
> >  - It seems strange to me that you'd have a loop that repeatedly decrements a counter because you know it will be incremented again when about to retry. I would think it makes more sense to increment the counter only once when queuing the command and decrement only once on success or "giving up" - though I assume you did it this way for a reason?
>
> I didn't follow you on the loop scenario (I don't see a loop unless a
> SCSI command repeatedly fails to be kicked and gets requeued/resent
> by the SCSI layer - is that it?)

Yep, that's what I meant.


>
> [snip]
>
> I'll likely have a chat w/ Jay about it this week, but since the fix
> was obvious and conservative, I imagined it would be better to have
> it fixed on top for now, and then refactored/re-fixed later.

Fair enough - and thanks for clarifying in detail




Re: ACK/cmnt: [SRU Xenial][PATCH 0/2] Improve our SAUCE for virtio-scsi reqs counter (fix CPU soft lockup)

Mauricio Faria de Oliveira
On Wed, Oct 17, 2018 at 11:59 AM Khaled Elmously
<[hidden email]> wrote:
[snip]
> > >  - Maybe I'm misunderstanding what you mean by 'no-up'. I assumed it means "no operation", as in, no change in the resulting binary (e.g., whitespace changes, comment changes, casting etc.). Both patches seem to be "op" changes
> >
> > Hm, I meant 'no up-stream' (e.g., downstream/SAUCE patch that is not
> > expected to go upstream, as described in the StablePatchFormat wiki
> > page [3]).
> >
>
> I had no idea that that's what "no-up" means :) Thanks

Heh, happens to everyone :-) Glad to help!

[snip]
> > >  - It seems strange to me that you'd have a loop that repeatedly decrements a counter because you know it will be incremented again when about to retry. I would think it makes more sense to increment the counter only once when queuing the command and decrement only once on success or "giving up" - though I assume you did it this way for a reason?
> >
> > I didn't follow you on the loop scenario (I don't see a loop unless a
> > SCSI command repeatedly fails to be kicked and gets requeued/resent
> > by the SCSI layer - is that it?)
>
> Yep, that's what I meant.

Okay. So, indeed, that scenario is possible and actually weird, but
fortunately it's not the common case nor a hot path (so not much is
wasted on these atomic inc/dec in the error path).

And yeah, unfortunately we had that 'reason' I mentioned in the
previous email, which didn't easily allow for changes in where the
incrementing is done. (If things were upstream the way they are
downstream for us now, I'd like to have sent something to simplify
those things, as you said.)

[snip]
> > I'll likely have a chat w/ Jay about it this week, but since the fix
> > was obvious and conservative, I imagined it would be better to have
> > it fixed on top for now, and then refactored/re-fixed later.
>
> Fair enough - and thanks for clarifying in detail

Sure, thanks again for your careful review.




--
Mauricio Faria de Oliveira


APPLIED: [SRU Xenial][PATCH 1/2] UBUNTU: SAUCE: (no-up) virtio-scsi: Decrement reqs counter before SCSI command requeue

Khaled Elmously
Only patch 1/2 has been applied.

On 2018-10-16 12:38:19 , Mauricio Faria de Oliveira wrote:

> BugLink: https://bugs.launchpad.net/bugs/1798110
>
> [snip]

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team