[SRU T/X/A/B][C][PATCH 0/1] LP: #1774336 - fix FS-Cache assert

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[SRU T/X/A/B][C][PATCH 0/1] LP: #1774336 - fix FS-Cache assert

Daniel Axtens
From: Daniel Axtens <[hidden email]>

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache use:

[81738.886634] FS-Cache:
[81738.888281] FS-Cache: Assertion failed
[81738.889461] FS-Cache: 6 == 5 is false
[81738.890625] ------------[ cut here ]------------
[81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

6 == 5 represents an operation being DEAD when it was not expected to be.

[Cause]
There is a race in fscache and cachefiles.

One thread is in cachefiles_read_waiter:
 1) object->work_lock is taken.
 2) the operation is added to the to_do list.
 3) the work lock is dropped.
 4) fscache_enqueue_retrieval is called, which takes a reference.

Another thread is in cachefiles_read_copier:
 1) object->work_lock is taken
 2) an item is popped off the to_do list.
 3) object->work_lock is dropped.
 4) some processing is done on the item, and fscache_put_retrieval()
    is called, dropping a reference.

Now if the this process in cachefiles_read_copier takes place
*between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
dropped before it is taken, which leads to the objects reference count
hitting zero, which leads to lifecycle events for the object happening
too soon, leading to the assertion failure later on.

(This is simplified and clarified from the original upstream analysis
for this patch at
https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
and from a similar patch with a different approach to fixing the bug
at
https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)

[Fix]
Move fscache_enqueue_retrieval under the lock in
cachefiles_read_waiter. This means that the object cannot be popped
off the to_do list until it is in a fully consistent state with the
reference taken.

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.
 - The change makes things more conservative (doing more under lock)
   so that's reassuring.
 - There may be performance impacts but none have been observed so far.

Lei Xue (1):
  UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race

 fs/cachefiles/rdwr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
2.17.0


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU T/X/A/B][C][PATCH 1/1] UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race

Daniel Axtens
From: Lei Xue <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1774336

There is a potential race in fscache operation enqueuing for reading and
copying multiple pages from cachefiles to netfs.

If this race occurs, an oops similar to the following is seen:

[585042.202316] FS-Cache:
[585042.202343] FS-Cache: Assertion failed
[585042.202367] FS-Cache: 6 == 5 is false
[585042.202452] ------------[ cut here ]------------
[585042.202480] kernel BUG at fs/fscache/operation.c:494!
...
[585042.209600] Call Trace:
[585042.211233]  [<ffffffffc034c29a>] fscache_op_work_func+0x2a/0x50 [fscache]
[585042.212677]  [<ffffffff81095a70>] process_one_work+0x150/0x3f0
[585042.213550]  [<ffffffff810961ea>] worker_thread+0x11a/0x470
...

The race occurs in the following situation:

One thread is in cachefiles_read_waiter:
 1) object->work_lock is taken.
 2) the operation is added to the to_do list.
 3) the work lock is dropped.
 4) fscache_enqueue_retrieval is called, which takes a reference.

Another thread is in cachefiles_read_copier:
 1) object->work_lock is taken
 2) an item is popped off the to_do list.
 3) object->work_lock is dropped.
 4) some processing is done on the item, and fscache_put_retrieval()
    is called, dropping a reference.

Now if the this process in cachefiles_read_copier takes place
*between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
dropped before it is taken, which leads to the object's reference count
hitting zero, which leads to lifecycle events for the object happening
too soon, leading to the assertion failure later on.

Move fscache_enqueue_retrieval under the lock in
cachefiles_read_waiter. This means that the object cannot be popped
off the to_do list until it is in a fully consistent state with the
reference taken.

Signed-off-by: Lei Xue <[hidden email]>
Reviewed-by: Daniel Axtens <[hidden email]>
[dja: rewrite and expand commit message]
(From https://www.redhat.com/archives/linux-cachefs/2018-February/msg00000.html
 This patch has been sitting on the mailing list for months with no
 response from the maintainer. A similar patch fixing the same issue
 was posted as far back as May 2017, and likewise had no response:
 https://www.redhat.com/archives/linux-cachefs/2017-May/msg00002.html
 I poked the list recently and also got nothing:
 https://www.redhat.com/archives/linux-cachefs/2018-May/msg00000.html
 and the problem was again reported and this patch validated by
 another user:
 https://www.redhat.com/archives/linux-cachefs/2018-May/msg00001.html
 Hence the submission as a sauce patch.)
Signed-off-by: Daniel Axtens <[hidden email]>
---
 fs/cachefiles/rdwr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index c0f3da3926a0..d9bb47d1e16d 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -58,9 +58,9 @@ static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
 
  spin_lock(&object->work_lock);
  list_add_tail(&monitor->op_link, &monitor->op->to_do);
+ fscache_enqueue_retrieval(monitor->op);
  spin_unlock(&object->work_lock);
 
- fscache_enqueue_retrieval(monitor->op);
  return 0;
 }
 
--
2.17.0


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

ACK: [SRU T/X/A/B][C][PATCH 0/1] LP: #1774336 - fix FS-Cache assert

Khaled Elmously
In reply to this post by Daniel Axtens
On 2018-06-04 16:42:15 , Daniel Axtens wrote:

> From: Daniel Axtens <[hidden email]>
>
> == SRU Justification ==
>
> [Impact]
> Oops during heavy NFS + FSCache use:
>
> [81738.886634] FS-Cache:
> [81738.888281] FS-Cache: Assertion failed
> [81738.889461] FS-Cache: 6 == 5 is false
> [81738.890625] ------------[ cut here ]------------
> [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!
>
> 6 == 5 represents an operation being DEAD when it was not expected to be.
>
> [Cause]
> There is a race in fscache and cachefiles.
>
> One thread is in cachefiles_read_waiter:
>  1) object->work_lock is taken.
>  2) the operation is added to the to_do list.
>  3) the work lock is dropped.
>  4) fscache_enqueue_retrieval is called, which takes a reference.
>
> Another thread is in cachefiles_read_copier:
>  1) object->work_lock is taken
>  2) an item is popped off the to_do list.
>  3) object->work_lock is dropped.
>  4) some processing is done on the item, and fscache_put_retrieval()
>     is called, dropping a reference.
>
> Now if the this process in cachefiles_read_copier takes place
> *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
> dropped before it is taken, which leads to the objects reference count
> hitting zero, which leads to lifecycle events for the object happening
> too soon, leading to the assertion failure later on.
>
> (This is simplified and clarified from the original upstream analysis
> for this patch at
> https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
> and from a similar patch with a different approach to fixing the bug
> at
> https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)
>
> [Fix]
> Move fscache_enqueue_retrieval under the lock in
> cachefiles_read_waiter. This means that the object cannot be popped
> off the to_do list until it is in a fully consistent state with the
> reference taken.
>
> [Testcase]
> A user has run ~100 hours of NFS stress tests and not seen this bug recur.
>
> [Regression Potential]
>  - Limited to fscache/cachefiles.
>  - The change makes things more conservative (doing more under lock)
>    so that's reassuring.
>  - There may be performance impacts but none have been observed so far.
>
> Lei Xue (1):
>   UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race
>
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

Acked-by: Khalid Elmously <[hidden email]>
 

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

ACK: [SRU T/X/A/B][C][PATCH 1/1] UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race

Stefan Bader-2
In reply to this post by Daniel Axtens
On 03.06.2018 23:42, Daniel Axtens wrote:

> From: Lei Xue <[hidden email]>
>
> BugLink: https://bugs.launchpad.net/bugs/1774336
>
> There is a potential race in fscache operation enqueuing for reading and
> copying multiple pages from cachefiles to netfs.
>
> If this race occurs, an oops similar to the following is seen:
>
> [585042.202316] FS-Cache:
> [585042.202343] FS-Cache: Assertion failed
> [585042.202367] FS-Cache: 6 == 5 is false
> [585042.202452] ------------[ cut here ]------------
> [585042.202480] kernel BUG at fs/fscache/operation.c:494!
> ...
> [585042.209600] Call Trace:
> [585042.211233]  [<ffffffffc034c29a>] fscache_op_work_func+0x2a/0x50 [fscache]
> [585042.212677]  [<ffffffff81095a70>] process_one_work+0x150/0x3f0
> [585042.213550]  [<ffffffff810961ea>] worker_thread+0x11a/0x470
> ...
>
> The race occurs in the following situation:
>
> One thread is in cachefiles_read_waiter:
>  1) object->work_lock is taken.
>  2) the operation is added to the to_do list.
>  3) the work lock is dropped.
>  4) fscache_enqueue_retrieval is called, which takes a reference.
>
> Another thread is in cachefiles_read_copier:
>  1) object->work_lock is taken
>  2) an item is popped off the to_do list.
>  3) object->work_lock is dropped.
>  4) some processing is done on the item, and fscache_put_retrieval()
>     is called, dropping a reference.
>
> Now if the this process in cachefiles_read_copier takes place
> *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
> dropped before it is taken, which leads to the object's reference count
> hitting zero, which leads to lifecycle events for the object happening
> too soon, leading to the assertion failure later on.
>
> Move fscache_enqueue_retrieval under the lock in
> cachefiles_read_waiter. This means that the object cannot be popped
> off the to_do list until it is in a fully consistent state with the
> reference taken.
>
> Signed-off-by: Lei Xue <[hidden email]>
> Reviewed-by: Daniel Axtens <[hidden email]>
> [dja: rewrite and expand commit message]
> (From https://www.redhat.com/archives/linux-cachefs/2018-February/msg00000.html
>  This patch has been sitting on the mailing list for months with no
>  response from the maintainer. A similar patch fixing the same issue
>  was posted as far back as May 2017, and likewise had no response:
>  https://www.redhat.com/archives/linux-cachefs/2017-May/msg00002.html
>  I poked the list recently and also got nothing:
>  https://www.redhat.com/archives/linux-cachefs/2018-May/msg00000.html
>  and the problem was again reported and this patch validated by
>  another user:
>  https://www.redhat.com/archives/linux-cachefs/2018-May/msg00001.html
>  Hence the submission as a sauce patch.)
> Signed-off-by: Daniel Axtens <[hidden email]>
Acked-by: Stefan Bader <[hidden email]>

> ---
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index c0f3da3926a0..d9bb47d1e16d 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -58,9 +58,9 @@ static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
>  
>   spin_lock(&object->work_lock);
>   list_add_tail(&monitor->op_link, &monitor->op->to_do);
> + fscache_enqueue_retrieval(monitor->op);
>   spin_unlock(&object->work_lock);
>  
> - fscache_enqueue_retrieval(monitor->op);
>   return 0;
>  }
>  
>


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

APPLIED: [SRU T/X/A/B][C][PATCH 0/1] LP: #1774336 - fix FS-Cache assert

Khaled Elmously
In reply to this post by Daniel Axtens
Applied to T/X/A/B


On 2018-06-04 16:42:15 , Daniel Axtens wrote:

> From: Daniel Axtens <[hidden email]>
>
> == SRU Justification ==
>
> [Impact]
> Oops during heavy NFS + FSCache use:
>
> [81738.886634] FS-Cache:
> [81738.888281] FS-Cache: Assertion failed
> [81738.889461] FS-Cache: 6 == 5 is false
> [81738.890625] ------------[ cut here ]------------
> [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!
>
> 6 == 5 represents an operation being DEAD when it was not expected to be.
>
> [Cause]
> There is a race in fscache and cachefiles.
>
> One thread is in cachefiles_read_waiter:
>  1) object->work_lock is taken.
>  2) the operation is added to the to_do list.
>  3) the work lock is dropped.
>  4) fscache_enqueue_retrieval is called, which takes a reference.
>
> Another thread is in cachefiles_read_copier:
>  1) object->work_lock is taken
>  2) an item is popped off the to_do list.
>  3) object->work_lock is dropped.
>  4) some processing is done on the item, and fscache_put_retrieval()
>     is called, dropping a reference.
>
> Now if the this process in cachefiles_read_copier takes place
> *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
> dropped before it is taken, which leads to the objects reference count
> hitting zero, which leads to lifecycle events for the object happening
> too soon, leading to the assertion failure later on.
>
> (This is simplified and clarified from the original upstream analysis
> for this patch at
> https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
> and from a similar patch with a different approach to fixing the bug
> at
> https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)
>
> [Fix]
> Move fscache_enqueue_retrieval under the lock in
> cachefiles_read_waiter. This means that the object cannot be popped
> off the to_do list until it is in a fully consistent state with the
> reference taken.
>
> [Testcase]
> A user has run ~100 hours of NFS stress tests and not seen this bug recur.
>
> [Regression Potential]
>  - Limited to fscache/cachefiles.
>  - The change makes things more conservative (doing more under lock)
>    so that's reassuring.
>  - There may be performance impacts but none have been observed so far.
>
> Lei Xue (1):
>   UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race
>
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --
> 2.17.0
>
>
> --
> kernel-team mailing list
> [hidden email]
> https://lists.ubuntu.com/mailman/listinfo/kernel-team

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team