[U/D][SRU B/C][PATCH 0/1] Fix bcache + fstrim panic (LP: #1793901)

[U/D][SRU B/C][PATCH 0/1] Fix bcache + fstrim panic (LP: #1793901)

Daniel Axtens
SRU Justification
=================

[Impact]

Some users see panics like the following when performing fstrim on a
bcached volume:

[ 529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 530.183928] #PF error: [normal kernel read fault]
[ 530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
[ 530.750887] Oops: 0000 [#1] SMP PTI
[ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
[ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620
[ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48 8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
[ 532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
[ 533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
[ 533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
[ 533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
[ 534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
[ 534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
[ 534.833319] FS: 00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
[ 535.224098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
[ 535.851759] Call Trace:
[ 535.970308] ? mempool_alloc_slab+0x15/0x20
[ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache]
[ 536.403399] blk_mq_make_request+0x97/0x4f0
[ 536.607036] generic_make_request+0x1e2/0x410
[ 536.819164] submit_bio+0x73/0x150
[ 536.980168] ? submit_bio+0x73/0x150
[ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60
[ 537.391595] ? _cond_resched+0x1a/0x50
[ 537.573774] submit_bio_wait+0x59/0x90
[ 537.756105] blkdev_issue_discard+0x80/0xd0
[ 537.959590] ext4_trim_fs+0x4a9/0x9e0
[ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0
[ 538.324087] ext4_ioctl+0xea4/0x1530
[ 538.497712] ? _copy_to_user+0x2a/0x40
[ 538.679632] do_vfs_ioctl+0xa6/0x600
[ 538.853127] ? __do_sys_newfstat+0x44/0x70
[ 539.051951] ksys_ioctl+0x6d/0x80
[ 539.212785] __x64_sys_ioctl+0x1a/0x20
[ 539.394918] do_syscall_64+0x5a/0x110
[ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[Fix]

Under certain conditions, the test for whether an operation should be
written back to the underlying device was incorrect. Specifically, in
should_writeback(), we were hitting a case where an optimisation for
partial stripe conditions was returning true and so should_writeback()
was returning true early. This caused the code to go down an incorrect
path and create bios that contained NULL pointers.

To fix this issue, make sure that should_writeback() on a discard op
never returns true.
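
For reference, the ordering of the checks looks roughly like the sketch
below. This is a simplified paraphrase of should_writeback() from
drivers/md/bcache/writeback.h in the affected kernels -- not a verbatim
excerpt -- with the new discard guard shown where the patch inserts it:

static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
				    unsigned int cache_mode, bool would_skip)
{
	unsigned int in_use = dc->disk.c->gc_stats.in_use;

	/* Not in writeback mode, detaching, or cache too full: no writeback. */
	if (cache_mode != CACHE_MODE_WRITEBACK ||
	    test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) ||
	    in_use > CUTOFF_WRITEBACK_SYNC)
		return false;

	/* New with this patch: a discard is never written back. */
	if (bio_op(bio) == REQ_OP_DISCARD)
		return false;

	/* Partial-stripe optimisation: writes touching an already-dirty
	 * stripe are forced into writeback mode.  Before the fix, a
	 * discard could take this early "return true". */
	if (dc->partial_stripes_expensive &&
	    bcache_dev_stripe_dirty(dc, bio->bi_iter.bi_sector,
				    bio_sectors(bio)))
		return true;

	/* A bypassed operation (would_skip) is otherwise never written back. */
	if (would_skip)
		return false;

	/* Remaining detail elided: roughly, sync writes and writes while
	 * the cache is not too full get writeback. */
	return op_is_sync(bio->bi_opf) || in_use <= CUTOFF_WRITEBACK;
}

With the guard in place a discard always gets "false" here, so it can no
longer be switched into the writeback path by the partial-stripe early
return.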

[Test Case]

We have observed it on some systems where both:
1) LVM/devmapper is involved (bcache backing device is LVM volume) and
2) writeback cache is involved (bcache cache_mode is writeback)

Not every machine exhibits the bug. On one machine that does exhibit
the bug, we can reliably reproduce it with:

 # echo writeback > /sys/block/bcache0/bcache/cache_mode
 # mount /dev/bcache0 /test
 # for i in {0..10}; do file="$(mktemp /test/zero.XXX)"; dd if=/dev/zero of="$file" bs=1M count=256; sync; rm $file; done; fstrim -v /test

[Regression Potential]

This could affect any device where bcache is used.

In mitigation, however: the patch is simple and limited to discard
operations. It has been accepted upstream [1], the maintainer will be
including it in SUSE kernels [2], and a Gentoo user has independently
validated the upstream patch [3].

[1] https://www.spinics.net/lists/linux-bcache/msg06997.html
[2] https://www.spinics.net/lists/linux-bcache/msg06998.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196103#c3

Daniel Axtens (1):
  UBUNTU: SAUCE: bcache: never writeback a discard operation

 drivers/md/bcache/writeback.h | 3 +++
 1 file changed, 3 insertions(+)

--
2.19.1



[U/D][SRU B/C][PATCH 1/1] UBUNTU: SAUCE: bcache: never writeback a discard operation

Daniel Axtens
BugLink: https://bugs.launchpad.net/bugs/1793901

Some users see panics like the following when performing fstrim on a bcached volume:

[  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  530.183928] #PF error: [normal kernel read fault]
[  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
[  530.750887] Oops: 0000 [#1] SMP PTI
[  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
[  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
[  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48 8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
[  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
[  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
[  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
[  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
[  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
[  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
[  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
[  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
[  535.851759] Call Trace:
[  535.970308]  ? mempool_alloc_slab+0x15/0x20
[  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
[  536.403399]  blk_mq_make_request+0x97/0x4f0
[  536.607036]  generic_make_request+0x1e2/0x410
[  536.819164]  submit_bio+0x73/0x150
[  536.980168]  ? submit_bio+0x73/0x150
[  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
[  537.391595]  ? _cond_resched+0x1a/0x50
[  537.573774]  submit_bio_wait+0x59/0x90
[  537.756105]  blkdev_issue_discard+0x80/0xd0
[  537.959590]  ext4_trim_fs+0x4a9/0x9e0
[  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
[  538.324087]  ext4_ioctl+0xea4/0x1530
[  538.497712]  ? _copy_to_user+0x2a/0x40
[  538.679632]  do_vfs_ioctl+0xa6/0x600
[  538.853127]  ? __do_sys_newfstat+0x44/0x70
[  539.051951]  ksys_ioctl+0x6d/0x80
[  539.212785]  __x64_sys_ioctl+0x1a/0x20
[  539.394918]  do_syscall_64+0x5a/0x110
[  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

We have observed it where both:
1) LVM/devmapper is involved (bcache backing device is LVM volume) and
2) writeback cache is involved (bcache cache_mode is writeback)

On one machine, we can reliably reproduce it with:

 # echo writeback > /sys/block/bcache0/bcache/cache_mode # not sure if this is required
 # mount /dev/bcache0 /test
 # for i in {0..10}; do file="$(mktemp /test/zero.XXX)"; dd if=/dev/zero of="$file" bs=1M count=256; sync; rm $file; done; fstrim -v /test

Observing this with tracepoints on, we see the following writes:

fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
<crash>

Note the final one has different hit/bypass flags.

This is because in should_writeback(), we were hitting a case where
the partial stripe condition was returning true and so
should_writeback() was returning true early.

If that hadn't been the case, it would have hit the would_skip test, and
as would_skip == s->iop.bypass == true, should_writeback() would have
returned false.
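
For context, the call site in cached_dev_write() (drivers/md/bcache/request.c)
looks roughly like the following paraphrased fragment -- simplified, not a
verbatim excerpt. A true return from should_writeback() clears the bypass
flag that check_should_bypass() had set and forces the op into writeback:

	if (should_writeback(dc, s->orig_bio, cache_mode(dc), s->iop.bypass)) {
		/* Before the fix this could fire for a discard, flipping
		 * bypass -> 0 and writeback -> 1, which is the change in
		 * flags visible on the last trace line above. */
		s->iop.bypass = false;
		s->iop.writeback = true;
	}

The discard then goes down the writeback/insert path instead of being
skipped, and that is the incorrect path that ends up building the bios
with NULL pointers seen in the oops above.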

Looking at the git history from 72c270612bd3 ("bcache: Write out full
stripes"), it looks like the idea was to optimise for raid5/6:

       * If a stripe is already dirty, force writes to that stripe to
         writeback mode - to help build up full stripes of dirty data

To fix this issue, make sure that should_writeback() on a discard op
never returns true.

More details of debugging: https://www.spinics.net/lists/linux-bcache/msg06996.html

Previous reports:
 - https://bugzilla.kernel.org/show_bug.cgi?id=201051
 - https://bugzilla.kernel.org/show_bug.cgi?id=196103
 - https://www.spinics.net/lists/linux-bcache/msg06885.html

Cc: Kent Overstreet <[hidden email]>
Fixes: 72c270612bd3 ("bcache: Write out full stripes")
Signed-off-by: Daniel Axtens <[hidden email]>
(cherry-picked from linux-bcache mailing list:
 https://www.spinics.net/lists/linux-bcache/msg06997.html
 Expected to land in v5.1:
 https://www.spinics.net/lists/linux-bcache/msg06998.html)
Signed-off-by: Daniel Axtens <[hidden email]>
---
 drivers/md/bcache/writeback.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index d2b9fdbc8994..e75dc33339f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -63,6 +63,9 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
 	    in_use > CUTOFF_WRITEBACK_SYNC)
 		return false;
 
+	if (bio_op(bio) == REQ_OP_DISCARD)
+		return false;
+
 	if (dc->partial_stripes_expensive &&
 	    bcache_dev_stripe_dirty(dc, bio->bi_iter.bi_sector,
 				    bio_sectors(bio)))
--
2.19.1



APPLIED: [U/D][SRU B/C][PATCH 0/1] Fix bcache + fstrim panic (LP: #1793901)

Seth Forshee
In reply to this post by Daniel Axtens
On Fri, Jan 25, 2019 at 11:03:02AM +1100, Daniel Axtens wrote:


Applied to disco/master-next and unstable/master, thanks!


ACK: [U/D][SRU B/C][PATCH 1/1] UBUNTU: SAUCE: bcache: never writeback a discard operation

Stefan Bader
In reply to this post by Daniel Axtens
On 25.01.19 01:03, Daniel Axtens wrote:

Acked-by: Stefan Bader <[hidden email]>





ACK/cmnt: [U/D][SRU B/C][PATCH 1/1] UBUNTU: SAUCE: bcache: never writeback a discard operation

Kleber Souza
In reply to this post by Daniel Axtens
On 1/25/19 1:03 AM, Daniel Axtens wrote:
> BugLink: https://bugs.launchpad.net/bugs/1793901
I've added the nominations for Bionic and Cosmic which were missing on
the bug report.


Acked-by: Kleber Sacilotto de Souza <[hidden email]>





APPLIED(B,C): [U/D][SRU B/C][PATCH 0/1] Fix bcache + fstrim panic (LP: #1793901)

Khaled Elmously
In reply to this post by Daniel Axtens
On 2019-01-25 11:03:02 , Daniel Axtens wrote:

