[SRU][B][PATCH 0/5] x86: mm: fix early boot problem on i386 with KPTI enabled


Andrea Righi
Buglink: https://bugs.launchpad.net/bugs/1827884

[Impact]

Commit d653420532d580156c8486686899ea6a9eeb7bf0 in bionic enabled kernel page
table isolation for x86_32, but also introduced regressions. One of them
("BUG_ON() condition in vmalloc_sync_one()") has been addressed by bug 1830433,
but there are other issues reported on i386.

Specifically, on some i386 systems the kernel fails in the early stage of
boot (black screen and frozen keyboard) with no error reported on the
console.

If the kernel is booted with "mitigations=off" and "nopti" the problem
doesn't happen, which is a clear indication of a kernel page table
isolation issue.
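
To apply the workaround for a single boot (a sketch of the standard GRUB
procedure; the exact menu entry is system-specific): at the GRUB menu press
'e', append "nopti" to the line starting with "linux", and boot with F10.
To make it persistent:

  $ sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&nopti /' /etc/default/grub
  $ sudo update-grub
  $ sudo reboot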

However, users have reported positive results with the following upstream
fixes applied (all clean cherry picks), even with mitigations *and* kernel
page table isolation enabled.

[Test Case]

Unfortunately this problem is not easily reproducible: the kernel simply
fails to boot (black screen and frozen keyboard) after the GRUB prompt, so
we don't have a real test case (other than asking the bug reporters to boot
the kernel and see if it works).
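
Once a fixed kernel boots, one way to confirm that page table isolation is
still enabled (a sketch; the exact dmesg wording is assumed from the
upstream pti code) is:

  $ dmesg | grep -i 'page tables isolation'
  Kernel/User page tables isolation: enabled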

[Fix]

The following upstream fixes seem to resolve (prevent) the problem:

 1d8ca3be86ebc6a38dad8236f45c7a9c61681e78 x86/mm/fault: Allow stack access below %rsp
 aa37c51b9421d66f7931c5fdcb9ce80c450974be x86/mm: Break out user address space handling
 8fed62000039058adfd8b663344e2f448aed1e7a x86/mm: Break out kernel address space handling
 164477c2331be75d9bd57fb76704e676b2bcd1cd x86/mm: Clarify hardware vs. software "error_code"
 0e664eee65337082be49fbbd2ee24aa0d111d0f2 Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"

[Regression Potential]

All fixes are clean upstream cherry picks, tested on the affected platform;
the backport changes are minimal.

----------------------------------------------------------------
Dave Hansen (3):
      x86/mm: Clarify hardware vs. software "error_code"
      x86/mm: Break out kernel address space handling
      x86/mm: Break out user address space handling

Joerg Roedel (1):
      Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"

Waiman Long (1):
      x86/mm/fault: Allow stack access below %rsp

 arch/x86/mm/fault.c         | 205 +++++++++++++++++++++++++++-----------------
 kernel/events/ring_buffer.c |  16 ----
 2 files changed, 126 insertions(+), 95 deletions(-)



[SRU][B][PATCH 1/5] Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"

Andrea Righi
From: Joerg Roedel <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1827884

This reverts commit 77754cfa09a6c528c38cbca9ee4cc4f7cf6ad6f2.

The patch was necessary to silence a WARN_ON_ONCE(in_nmi())
that triggered in the vmalloc_fault() function when PTI was
enabled on x86-32.

Faulting in an NMI handler turned out to be safe and the
warning in vmalloc_fault() is gone now. So the above patch
can be reverted.

Signed-off-by: Joerg Roedel <[hidden email]>
Signed-off-by: Thomas Gleixner <[hidden email]>
Tested-by: David H. Gutteridge <[hidden email]>
Cc: "H . Peter Anvin" <[hidden email]>
Cc: [hidden email]
Cc: Linus Torvalds <[hidden email]>
Cc: Andy Lutomirski <[hidden email]>
Cc: Dave Hansen <[hidden email]>
Cc: Josh Poimboeuf <[hidden email]>
Cc: Juergen Gross <[hidden email]>
Cc: Peter Zijlstra <[hidden email]>
Cc: Borislav Petkov <[hidden email]>
Cc: Jiri Kosina <[hidden email]>
Cc: Boris Ostrovsky <[hidden email]>
Cc: Brian Gerst <[hidden email]>
Cc: David Laight <[hidden email]>
Cc: Denys Vlasenko <[hidden email]>
Cc: Eduardo Valentin <[hidden email]>
Cc: Greg KH <[hidden email]>
Cc: Will Deacon <[hidden email]>
Cc: [hidden email]
Cc: [hidden email]
Cc: [hidden email]
Cc: [hidden email]
Cc: Andrea Arcangeli <[hidden email]>
Cc: Waiman Long <[hidden email]>
Cc: Pavel Machek <[hidden email]>
Cc: Arnaldo Carvalho de Melo <[hidden email]>
Cc: Alexander Shishkin <[hidden email]>
Cc: Jiri Olsa <[hidden email]>
Cc: Namhyung Kim <[hidden email]>
Cc: [hidden email]
Link: https://lkml.kernel.org/r/1532533683-5988-3-git-send-email-joro@...

(cherry picked from commit 0e664eee65337082be49fbbd2ee24aa0d111d0f2)
Signed-off-by: Andrea Righi <[hidden email]>
---
 kernel/events/ring_buffer.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d8f4476d22e..dea024a6e0fb 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -813,13 +813,6 @@ static void rb_free_work(struct work_struct *work)
 
  vfree(base);
  kfree(rb);
-
- /*
- * FIXME: PAE workaround for vmalloc_fault(): Make sure buffer is
- * unmapped in all page-tables.
- */
- if (IS_ENABLED(CONFIG_X86_PAE))
- vmalloc_sync_all();
 }
 
 void rb_free(struct ring_buffer *rb)
@@ -846,15 +839,6 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
  if (!all_buf)
  goto fail_all_buf;
 
- /*
- * FIXME: PAE workaround for vmalloc_fault(): The buffer is
- * accessed in NMI handlers, make sure it is mapped in all
- * page-tables in the system so that we don't fault on the range in
- * an NMI handler.
- */
- if (IS_ENABLED(CONFIG_X86_PAE))
- vmalloc_sync_all();
-
  rb->user_page = all_buf;
  rb->data_pages[0] = all_buf + PAGE_SIZE;
  if (nr_pages) {
--
2.20.1



[SRU][B][PATCH 2/5] x86/mm: Clarify hardware vs. software "error_code"

Andrea Righi
From: Dave Hansen <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1827884

We pass a variable called "error_code" all around the page fault code.
Sounds simple enough, especially since "error_code" looks like it exactly
matches the value that the hardware gives us on the stack to report the
page fault error code (PFEC in SDM parlance).

But, that's not how it works.

For part of the page fault handler, "error_code" does exactly match
PFEC.  But, during later parts, it diverges and starts to mean
something a bit different.

Give it two names for its two jobs.

The place it diverges is also really screwy.  It's only in a spot
where the hardware tells us we have kernel-mode access that occurred
while we were in usermode accessing user-controlled address space.
Add a warning in there.
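
For reference (an illustration, not part of the diff below), the hardware
error code is the bitmask defined by enum x86_pf_error_code in
arch/x86/include/asm/traps.h:

  /* Page fault error code (PFEC) bits, as reported by the CPU: */
  enum x86_pf_error_code {
          X86_PF_PROT  = 1 << 0,  /* 0: no page found, 1: protection fault */
          X86_PF_WRITE = 1 << 1,  /* write access */
          X86_PF_USER  = 1 << 2,  /* user-mode access */
          X86_PF_RSVD  = 1 << 3,  /* reserved bit set in a paging entry */
          X86_PF_INSTR = 1 << 4,  /* instruction fetch */
  };

After this patch, hw_error_code always holds the raw PFEC, while
sw_error_code may additionally have X86_PF_USER set for an implicit
supervisor-mode access that originated in user mode.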

Cc: [hidden email]
Cc: Jann Horn <[hidden email]>
Cc: Sean Christopherson <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Andy Lutomirski <[hidden email]>
Signed-off-by: Dave Hansen <[hidden email]>
Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
Link: http://lkml.kernel.org/r/20180928160220.4A2272C9@...
(cherry picked from commit 164477c2331be75d9bd57fb76704e676b2bcd1cd)
Signed-off-by: Andrea Righi <[hidden email]>
---
 arch/x86/mm/fault.c | 77 ++++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f7c63289b1b6..91f8377e45cc 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1233,9 +1233,10 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
  * routines.
  */
 static noinline void
-__do_page_fault(struct pt_regs *regs, unsigned long error_code,
+__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
  unsigned long address)
 {
+ unsigned long sw_error_code;
  struct vm_area_struct *vma;
  struct task_struct *tsk;
  struct mm_struct *mm;
@@ -1261,17 +1262,17 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  * nothing more.
  *
  * This verifies that the fault happens in kernel space
- * (error_code & 4) == 0, and that the fault was not a
- * protection error (error_code & 9) == 0.
+ * (hw_error_code & 4) == 0, and that the fault was not a
+ * protection error (hw_error_code & 9) == 0.
  */
  if (unlikely(fault_in_kernel_space(address))) {
- if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
+ if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
  if (vmalloc_fault(address) >= 0)
  return;
  }
 
  /* Can handle a stale RO->RW TLB: */
- if (spurious_fault(error_code, address))
+ if (spurious_fault(hw_error_code, address))
  return;
 
  /* kprobes don't want to hook the spurious faults: */
@@ -1281,7 +1282,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  * Don't take the mm semaphore here. If we fixup a prefetch
  * fault we could otherwise deadlock:
  */
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
 
  return;
  }
@@ -1290,11 +1291,11 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  if (unlikely(kprobes_fault(regs)))
  return;
 
- if (unlikely(error_code & X86_PF_RSVD))
- pgtable_bad(regs, error_code, address);
+ if (unlikely(hw_error_code & X86_PF_RSVD))
+ pgtable_bad(regs, hw_error_code, address);
 
- if (unlikely(smap_violation(error_code, regs))) {
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ if (unlikely(smap_violation(hw_error_code, regs))) {
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
  return;
  }
 
@@ -1303,10 +1304,17 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  * in a region with pagefaults disabled then we must not take the fault
  */
  if (unlikely(faulthandler_disabled() || !mm)) {
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
  return;
  }
 
+ /*
+ * hw_error_code is literally the "page fault error code" passed to
+ * the kernel directly from the hardware.  But, we will shortly be
+ * modifying it in software, so give it a new name.
+ */
+ sw_error_code = hw_error_code;
+
  /*
  * It's safe to allow irq's after cr2 has been saved and the
  * vmalloc fault has been handled.
@@ -1316,7 +1324,26 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  */
  if (user_mode(regs)) {
  local_irq_enable();
- error_code |= X86_PF_USER;
+ /*
+ * Up to this point, X86_PF_USER set in hw_error_code
+ * indicated a user-mode access.  But, after this,
+ * X86_PF_USER in sw_error_code will indicate either
+ * that, *or* an implicit kernel(supervisor)-mode access
+ * which originated from user mode.
+ */
+ if (!(hw_error_code & X86_PF_USER)) {
+ /*
+ * The CPU was in user mode, but the CPU says
+ * the fault was not a user-mode access.
+ * Must be an implicit kernel-mode access,
+ * which we do not expect to happen in the
+ * user address space.
+ */
+ pr_warn_once("kernel-mode error from user-mode: %lx\n",
+ hw_error_code);
+
+ sw_error_code |= X86_PF_USER;
+ }
  flags |= FAULT_FLAG_USER;
  } else {
  if (regs->flags & X86_EFLAGS_IF)
@@ -1325,9 +1352,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 
  perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
- if (error_code & X86_PF_WRITE)
+ if (sw_error_code & X86_PF_WRITE)
  flags |= FAULT_FLAG_WRITE;
- if (error_code & X86_PF_INSTR)
+ if (sw_error_code & X86_PF_INSTR)
  flags |= FAULT_FLAG_INSTRUCTION;
 
  /*
@@ -1347,9 +1374,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  * space check, thus avoiding the deadlock:
  */
  if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
- if (!(error_code & X86_PF_USER) &&
+ if (!(sw_error_code & X86_PF_USER) &&
     !search_exception_tables(regs->ip)) {
- bad_area_nosemaphore(regs, error_code, address, NULL);
+ bad_area_nosemaphore(regs, sw_error_code, address, NULL);
  return;
  }
 retry:
@@ -1365,16 +1392,16 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 
  vma = find_vma(mm, address);
  if (unlikely(!vma)) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
  return;
  }
  if (likely(vma->vm_start <= address))
  goto good_area;
  if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
  return;
  }
- if (error_code & X86_PF_USER) {
+ if (sw_error_code & X86_PF_USER) {
  /*
  * Accessing the stack below %sp is always a bug.
  * The large cushion allows instructions like enter
@@ -1382,12 +1409,12 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  * 32 pointers and then decrements %sp by 65535.)
  */
  if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
  return;
  }
  }
  if (unlikely(expand_stack(vma, address))) {
- bad_area(regs, error_code, address);
+ bad_area(regs, sw_error_code, address);
  return;
  }
 
@@ -1396,8 +1423,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  * we can handle it..
  */
 good_area:
- if (unlikely(access_error(error_code, vma))) {
- bad_area_access_error(regs, error_code, address, vma);
+ if (unlikely(access_error(sw_error_code, vma))) {
+ bad_area_access_error(regs, sw_error_code, address, vma);
  return;
  }
 
@@ -1439,13 +1466,13 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  return;
 
  /* Not returning to user mode? Handle exceptions or die: */
- no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+ no_context(regs, sw_error_code, address, SIGBUS, BUS_ADRERR);
  return;
  }
 
  up_read(&mm->mmap_sem);
  if (unlikely(fault & VM_FAULT_ERROR)) {
- mm_fault_error(regs, error_code, address, &pkey, fault);
+ mm_fault_error(regs, sw_error_code, address, &pkey, fault);
  return;
  }
 
--
2.20.1



[SRU][B][PATCH 3/5] x86/mm: Break out kernel address space handling

Andrea Righi
From: Dave Hansen <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1827884

The page fault handler (__do_page_fault()) basically has two sections: one
for handling faults in the kernel portion of the address space and another
for faults in the user portion of the address space.

But, these two parts don't stick out that well.  Let's make that clearer
through code separation and naming: pull kernel fault handling into its
own helper, and reflect that naming by renaming
spurious_fault() -> spurious_kernel_fault().

Also, rewrite the vmalloc() handling comment; it was stale and glossed
over the reserved bit handling.
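
As a reading aid (condensed from the diff below, not additional code), the
top-level handler now dispatches kernel-address faults to the new helper:

  static noinline void
  __do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
                  unsigned long address)
  {
          /* ... */
          /* Was the fault on kernel-controlled part of the address space? */
          if (unlikely(fault_in_kernel_space(address))) {
                  do_kern_addr_fault(regs, hw_error_code, address);
                  return;
          }
          /* ... user address space handling continues here ... */
  }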

Cc: [hidden email]
Cc: Jann Horn <[hidden email]>
Cc: Sean Christopherson <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Andy Lutomirski <[hidden email]>
Signed-off-by: Dave Hansen <[hidden email]>
Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
Link: http://lkml.kernel.org/r/20180928160222.401F4E10@...
(cherry picked from commit 8fed62000039058adfd8b663344e2f448aed1e7a)
Signed-off-by: Andrea Righi <[hidden email]>
---
 arch/x86/mm/fault.c | 101 +++++++++++++++++++++++++++-----------------
 1 file changed, 62 insertions(+), 39 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 91f8377e45cc..50ffd7a09cf5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1057,7 +1057,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
  }
 }
 
-static int spurious_fault_check(unsigned long error_code, pte_t *pte)
+static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
 {
  if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
  return 0;
@@ -1096,7 +1096,7 @@ static int spurious_fault_check(unsigned long error_code, pte_t *pte)
  * (Optional Invalidation).
  */
 static noinline int
-spurious_fault(unsigned long error_code, unsigned long address)
+spurious_kernel_fault(unsigned long error_code, unsigned long address)
 {
  pgd_t *pgd;
  p4d_t *p4d;
@@ -1127,27 +1127,27 @@ spurious_fault(unsigned long error_code, unsigned long address)
  return 0;
 
  if (p4d_large(*p4d))
- return spurious_fault_check(error_code, (pte_t *) p4d);
+ return spurious_kernel_fault_check(error_code, (pte_t *) p4d);
 
  pud = pud_offset(p4d, address);
  if (!pud_present(*pud))
  return 0;
 
  if (pud_large(*pud))
- return spurious_fault_check(error_code, (pte_t *) pud);
+ return spurious_kernel_fault_check(error_code, (pte_t *) pud);
 
  pmd = pmd_offset(pud, address);
  if (!pmd_present(*pmd))
  return 0;
 
  if (pmd_large(*pmd))
- return spurious_fault_check(error_code, (pte_t *) pmd);
+ return spurious_kernel_fault_check(error_code, (pte_t *) pmd);
 
  pte = pte_offset_kernel(pmd, address);
  if (!pte_present(*pte))
  return 0;
 
- ret = spurious_fault_check(error_code, pte);
+ ret = spurious_kernel_fault_check(error_code, pte);
  if (!ret)
  return 0;
 
@@ -1155,12 +1155,12 @@ spurious_fault(unsigned long error_code, unsigned long address)
  * Make sure we have permissions in PMD.
  * If not, then there's a bug in the page tables:
  */
- ret = spurious_fault_check(error_code, (pte_t *) pmd);
+ ret = spurious_kernel_fault_check(error_code, (pte_t *) pmd);
  WARN_ONCE(!ret, "PMD has incorrect permission bits\n");
 
  return ret;
 }
-NOKPROBE_SYMBOL(spurious_fault);
+NOKPROBE_SYMBOL(spurious_kernel_fault);
 
 int show_unhandled_signals = 1;
 
@@ -1227,6 +1227,58 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
  return true;
 }
 
+/*
+ * Called for all faults where 'address' is part of the kernel address
+ * space.  Might get called for faults that originate from *code* that
+ * ran in userspace or the kernel.
+ */
+static void
+do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
+   unsigned long address)
+{
+ /*
+ * We can fault-in kernel-space virtual memory on-demand. The
+ * 'reference' page table is init_mm.pgd.
+ *
+ * NOTE! We MUST NOT take any locks for this case. We may
+ * be in an interrupt or a critical region, and should
+ * only copy the information from the master page table,
+ * nothing more.
+ *
+ * Before doing this on-demand faulting, ensure that the
+ * fault is not any of the following:
+ * 1. A fault on a PTE with a reserved bit set.
+ * 2. A fault caused by a user-mode access.  (Do not demand-
+ *    fault kernel memory due to user-mode accesses).
+ * 3. A fault caused by a page-level protection violation.
+ *    (A demand fault would be on a non-present page which
+ *     would have X86_PF_PROT==0).
+ */
+ if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
+ if (vmalloc_fault(address) >= 0)
+ return;
+ }
+
+ /* Was the fault spurious, caused by lazy TLB invalidation? */
+ if (spurious_kernel_fault(hw_error_code, address))
+ return;
+
+ /* kprobes don't want to hook the spurious faults: */
+ if (kprobes_fault(regs))
+ return;
+
+ /*
+ * Note, despite being a "bad area", there are quite a few
+ * acceptable reasons to get here, such as erratum fixups
+ * and handling kernel code that can fault, like get_user().
+ *
+ * Don't take the mm semaphore here. If we fixup a prefetch
+ * fault we could otherwise deadlock:
+ */
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
+}
+NOKPROBE_SYMBOL(do_kern_addr_fault);
+
 /*
  * This routine handles page faults.  It determines the address,
  * and the problem, and then passes it off to one of the appropriate
@@ -1252,38 +1304,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
  if (unlikely(kmmio_fault(regs, address)))
  return;
 
- /*
- * We fault-in kernel-space virtual memory on-demand. The
- * 'reference' page table is init_mm.pgd.
- *
- * NOTE! We MUST NOT take any locks for this case. We may
- * be in an interrupt or a critical region, and should
- * only copy the information from the master page table,
- * nothing more.
- *
- * This verifies that the fault happens in kernel space
- * (hw_error_code & 4) == 0, and that the fault was not a
- * protection error (hw_error_code & 9) == 0.
- */
+ /* Was the fault on kernel-controlled part of the address space? */
  if (unlikely(fault_in_kernel_space(address))) {
- if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
- if (vmalloc_fault(address) >= 0)
- return;
- }
-
- /* Can handle a stale RO->RW TLB: */
- if (spurious_fault(hw_error_code, address))
- return;
-
- /* kprobes don't want to hook the spurious faults: */
- if (kprobes_fault(regs))
- return;
- /*
- * Don't take the mm semaphore here. If we fixup a prefetch
- * fault we could otherwise deadlock:
- */
- bad_area_nosemaphore(regs, hw_error_code, address, NULL);
-
+ do_kern_addr_fault(regs, hw_error_code, address);
  return;
  }
 
--
2.20.1



[SRU][B][PATCH 4/5] x86/mm: Break out user address space handling

Andrea Righi
From: Dave Hansen <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1827884

The last patch broke out kernel address space handling into its own
helper.  Now, do the same for user address space handling.
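
The end result (condensed from the diff below) is a small top-level
dispatcher:

  static noinline void
  __do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
                  unsigned long address)
  {
          prefetchw(&current->mm->mmap_sem);

          if (unlikely(kmmio_fault(regs, address)))
                  return;

          /* Was the fault on kernel-controlled part of the address space? */
          if (unlikely(fault_in_kernel_space(address)))
                  do_kern_addr_fault(regs, hw_error_code, address);
          else
                  do_user_addr_fault(regs, hw_error_code, address);
  }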

Cc: [hidden email]
Cc: Jann Horn <[hidden email]>
Cc: Sean Christopherson <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Andy Lutomirski <[hidden email]>
Signed-off-by: Dave Hansen <[hidden email]>
Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
Link: http://lkml.kernel.org/r/20180928160223.9C4F6440@...
(cherry picked from commit aa37c51b9421d66f7931c5fdcb9ce80c450974be)
Signed-off-by: Andrea Righi <[hidden email]>
---
 arch/x86/mm/fault.c | 47 +++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 50ffd7a09cf5..fca3c8def63a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -991,6 +991,7 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
  __bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
+/* Handle faults in the kernel portion of the address space */
 static void
 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
   u32 *pkey, unsigned int fault)
@@ -1279,14 +1280,11 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
 
-/*
- * This routine handles page faults.  It determines the address,
- * and the problem, and then passes it off to one of the appropriate
- * routines.
- */
-static noinline void
-__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
- unsigned long address)
+/* Handle faults in the user portion of the address space */
+static inline
+void do_user_addr_fault(struct pt_regs *regs,
+ unsigned long hw_error_code,
+ unsigned long address)
 {
  unsigned long sw_error_code;
  struct vm_area_struct *vma;
@@ -1299,17 +1297,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
  tsk = current;
  mm = tsk->mm;
 
- prefetchw(&mm->mmap_sem);
-
- if (unlikely(kmmio_fault(regs, address)))
- return;
-
- /* Was the fault on kernel-controlled part of the address space? */
- if (unlikely(fault_in_kernel_space(address))) {
- do_kern_addr_fault(regs, hw_error_code, address);
- return;
- }
-
  /* kprobes don't want to hook the spurious faults: */
  if (unlikely(kprobes_fault(regs)))
  return;
@@ -1513,6 +1500,28 @@ __do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
 
  check_v8086_mode(regs, address, tsk);
 }
+NOKPROBE_SYMBOL(do_user_addr_fault);
+
+/*
+ * This routine handles page faults.  It determines the address,
+ * and the problem, and then passes it off to one of the appropriate
+ * routines.
+ */
+static noinline void
+__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
+ unsigned long address)
+{
+ prefetchw(&current->mm->mmap_sem);
+
+ if (unlikely(kmmio_fault(regs, address)))
+ return;
+
+ /* Was the fault on kernel-controlled part of the address space? */
+ if (unlikely(fault_in_kernel_space(address)))
+ do_kern_addr_fault(regs, hw_error_code, address);
+ else
+ do_user_addr_fault(regs, hw_error_code, address);
+}
 NOKPROBE_SYMBOL(__do_page_fault);
 
 static nokprobe_inline void
--
2.20.1



[SRU][B][PATCH 5/5] x86/mm/fault: Allow stack access below %rsp

Andrea Righi
From: Waiman Long <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1827884

The current x86 page fault handler allows stack access below the stack
pointer as long as it is no more than 64k+256 bytes below it. Any access
beyond that limit causes a segmentation fault.

The gcc -fstack-check option generates code that probes the stack on large
stack allocations to check that the stack is accessible. Newer gcc versions
do this while updating %rsp at the same time; older ones like gcc 4 do not.
As a result, an application compiled with an old gcc and the -fstack-check
option may fail to start at all:

  $ cat test.c
  #include <stdio.h>

  int main() {
        char tmp[1024*128];
        printf("### ok\n");
        return 0;
  }

  $ gcc -fstack-check -g -o test test.c

  $ ./test
  Segmentation fault

The old binary worked on older kernels, where expand_stack() happened to be
called before the check, but it does not work on newer kernels. Besides,
the 64k+ limit check is rather crude and will not catch many of the ways a
userspace application can misbehave anyway. The kernel isn't the right
place for this kind of test; we should leave it to userspace
instrumentation tools.

The 64k+ limit check is now removed, letting expand_stack() decide whether
a segmentation fault should happen, e.g. when the RLIMIT_STACK limit is
exceeded.
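
For reference, the cushion being removed works out to
65536 + 32 * sizeof(unsigned long) bytes below %sp: 65536 + 256 = 65792
bytes on x86-64, and 65536 + 128 = 65664 bytes on i386 (where
sizeof(unsigned long) == 4). Any user-mode stack access further below %sp
than that was rejected with bad_area().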

Signed-off-by: Waiman Long <[hidden email]>
Cc: Andy Lutomirski <[hidden email]>
Cc: Borislav Petkov <[hidden email]>
Cc: Brian Gerst <[hidden email]>
Cc: Dave Hansen <[hidden email]>
Cc: Denys Vlasenko <[hidden email]>
Cc: H. Peter Anvin <[hidden email]>
Cc: Linus Torvalds <[hidden email]>
Cc: Peter Zijlstra <[hidden email]>
Cc: Rik van Riel <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Link: http://lkml.kernel.org/r/1541535149-31963-1-git-send-email-longman@...
Signed-off-by: Ingo Molnar <[hidden email]>
(cherry picked from commit 1d8ca3be86ebc6a38dad8236f45c7a9c61681e78)
Signed-off-by: Andrea Righi <[hidden email]>
---
 arch/x86/mm/fault.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fca3c8def63a..6b6c92bbea28 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1411,18 +1411,6 @@ void do_user_addr_fault(struct pt_regs *regs,
  bad_area(regs, sw_error_code, address);
  return;
  }
- if (sw_error_code & X86_PF_USER) {
- /*
- * Accessing the stack below %sp is always a bug.
- * The large cushion allows instructions like enter
- * and pusha to work. ("enter $65535, $31" pushes
- * 32 pointers and then decrements %sp by 65535.)
- */
- if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
- bad_area(regs, sw_error_code, address);
- return;
- }
- }
  if (unlikely(expand_stack(vma, address))) {
  bad_area(regs, sw_error_code, address);
  return;
--
2.20.1



ACK/Cmnt: [SRU][B][PATCH 0/5] x86: mm: fix early boot problem on i386 with KPTI enabled

Stefan Bader-2
On 04.07.19 17:17, Andrea Righi wrote:

> Buglink: https://bugs.launchpad.net/bugs/1827884
> [...]

At first glance it's a lot of change, but quite a bit of it is moving code
into helper functions, which should not add to the risk. Also, the issue is
a severe one, so it is worth taking some risk. And there was positive
testing, and the modified code does not depend on specific hardware (so
there is a higher chance of it being exercised in testing).

Acked-by: Stefan Bader <[hidden email]>


ACK: [SRU][B][PATCH 0/5] x86: mm: fix early boot problem on i386 with KPTI enabled

Connor Kuehl
On 7/4/19 8:17 AM, Andrea Righi wrote:

> Buglink: https://bugs.launchpad.net/bugs/1827884
> [...]

Acked-by: Connor Kuehl <[hidden email]>


APPLIED: [SRU][B][PATCH 0/5] x86: mm: fix early boot problem on i386 with KPTI enabled

Kleber Souza
On 04.07.19 17:17, Andrea Righi wrote:

> Buglink: https://bugs.launchpad.net/bugs/1827884
> [...]

Applied to bionic/master-next branch.

Thanks,
Kleber
