[SRU][Bionic][PATCH 1/1] nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event


Khalid Elmously
From: Michal Wnukowski <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1787635

In many architectures loads may be reordered with older stores to
different locations.  In the nvme driver the following two operations
could be reordered:

 - Write shadow doorbell (dbbuf_db) into memory.
 - Read EventIdx (dbbuf_ei) from memory.

This can result in a race condition between the driver and the VM host
processing requests (if the given virtual NVMe controller has support
for the shadow doorbell).  If that occurs, the NVMe controller may
decide to wait for an MMIO doorbell from the guest operating system,
while the guest driver decides not to issue an MMIO doorbell on any of
the subsequent commands.
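
As a rough, self-contained illustration of the pattern (plain C11
atomics rather than the actual driver code; the names only mirror the
shadow doorbell fields), a minimal sketch looks like this:

#include <stdatomic.h>
#include <stdint.h>

/* Stand-ins for the shadow doorbell and EventIdx entries shared with
 * the controller through host memory. */
_Atomic uint16_t shadow_db;
_Atomic uint16_t event_idx;

uint16_t write_db_then_read_ei(uint16_t value)
{
	/* 1. Write the shadow doorbell. */
	atomic_store_explicit(&shadow_db, value, memory_order_relaxed);

	/*
	 * Without a full barrier here the CPU may make the load below
	 * visible before the store above; this is the reordering the
	 * patch prevents with mb().
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* 2. Read EventIdx. */
	return atomic_load_explicit(&event_idx, memory_order_relaxed);
}

The controller (or its emulation in the VM host) has to provide the
mirrored ordering - update EventIdx before reading the shadow doorbell -
otherwise each side may conclude that the other will act and the queue
stalls, which is the hang described above.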

This issue is a purely timing-dependent one, so there is no easy way to
reproduce it. Currently the easiest known approach is to run "Oracle IO
Numbers" (orion), which is shipped with Oracle DB:

orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
        concat -write 40 -duration 120 -matrix row -testname nvme_test

Here nvme_test is a .lun file that contains a list of NVMe block
devices to run the test against. Limiting the number of vCPUs assigned
to a given VM instance seems to increase the chances of this bug
occurring. On a test environment with a VM that had 4 NVMe drives and
1 vCPU assigned, the virtual NVMe controller hang could be observed
within 10-20 minutes. That corresponds to about 400-500k IO operations
processed (or about 100GB of IO reads/writes).
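
For reference, the .lun file (named after the -testname argument, e.g.
nvme_test.lun) is assumed to be just a plain-text list of block
devices, one per line; the device paths below are only an example:

/dev/nvme0n1
/dev/nvme1n1
/dev/nvme2n1
/dev/nvme3n1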

The Orion tool was used for validation and was set to run in a loop for
36 hours (the equivalent of pushing 550M IO operations). No issues were
observed, which suggests that the patch fixes the issue.

Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <[hidden email]>
Reviewed-by: Keith Busch <[hidden email]>
Reviewed-by: Sagi Grimberg <[hidden email]>
[hch: updated changelog and comment a bit]
Signed-off-by: Christoph Hellwig <[hidden email]>
(cherry-picked from 07e860202d187d35ad24f0100caa0d244d4df2af git://git.infradead.org/nvme.git)
Signed-off-by: Khalid Elmously <[hidden email]>
---
 drivers/nvme/host/pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9bd6ab79b851..af1773d5332e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -305,6 +305,14 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
 		old_value = *dbbuf_db;
 		*dbbuf_db = value;
 
+		/*
+		 * Ensure that the doorbell is updated before reading the event
+		 * index from memory.  The controller needs to provide similar
+		 * ordering to ensure the envent index is updated before reading
+		 * the doorbell.
+		 */
+		mb();
+
 		if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
 			return false;
 	}
--
2.17.1
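
For readers unfamiliar with the EventIdx mechanism, the check performed
after the new barrier boils down to wrap-safe 16-bit arithmetic. The
snippet below is a stand-alone paraphrase of nvme_dbbuf_need_event (not
the patched code itself), showing when the driver still has to ring the
MMIO doorbell:

#include <stdint.h>
#include <stdio.h>

/* Returns non-zero when the doorbell moved from 'old' past 'event_idx'
 * on its way to 'new_idx', i.e. the controller asked to be notified. */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

int main(void)
{
	printf("%d\n", need_event(5, 8, 3));  /* 1: EventIdx 5 was crossed  */
	printf("%d\n", need_event(20, 8, 3)); /* 0: EventIdx 20 not reached */
	return 0;
}

Reading a stale EventIdx (the reordering fixed above) can make this
check return 0 even though the controller is waiting for a doorbell, so
the MMIO write is skipped and the queue hangs.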


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

ACK+: [SRU][X,B,C][PATCH 1/1] nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event

Kamal Mostafa
Note that this should be applied to {Xenial, Bionic, Cosmic}.

Acked-by: Kamal Mostafa <[hidden email]>


On Fri, Aug 17, 2018 at 02:27:53PM -0400, Khalid Elmously wrote:

> From: Michal Wnukowski <[hidden email]>
> [...]


ACK/cmnt: [SRU][Bionic][PATCH 1/1] nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event

Kleber Souza
On 08/17/18 20:27, Khalid Elmously wrote:
> From: Michal Wnukowski <[hidden email]>
>
> BugLink: https://bugs.launchpad.net/bugs/1787635

There's a new bug report with a bit more information about the issue:

BugLink: https://bugs.launchpad.net/bugs/1788222

> [...]

As Kamal stated, and unlike what the bug reports indicate, the patch
should be applied to the Xenial, Bionic and Cosmic main kernels.

Acked-by: Kleber Sacilotto de Souza <[hidden email]>




APPLIED: [SRU][Bionic][PATCH 1/1] nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event

Khalid Elmously
...to bionic/master-next - also updated the BugLink as Kleber pointed out.

This patch will need to be backported for xenial as it cannot be cherry-picked - that will have to be done as a follow-up.

On 2018-08-17 14:27:53 , Khalid Elmously wrote:

> From: Michal Wnukowski <[hidden email]>
> [...]


APPLIED[C/Unstable]: [SRU][Bionic][PATCH 1/1] nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event

Seth Forshee
On Fri, Aug 17, 2018 at 02:27:53PM -0400, Khalid Elmously wrote:

> From: Michal Wnukowski <[hidden email]>
> [...]

Applied to cosmic/master-next and unstable/master, thanks!
