[SRU] [B/C/D] [PATCH 0/4] Fix AMD CPU MCE bug

Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1796443

[Impact]
Affected AMD systems do not boot unless "mce=off" is passed on the kernel command line.

[Fix]
Quote from the commit log:
"Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This will prevent enabling MCA thresholding on this
bank which will prevent the high interrupt rate due to this error."
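
As a rough illustration of the bit arithmetic behind the quoted fix, here is a minimal user-space sketch. It assumes, based on the msr_clear_bit(msrs[i], 62) calls in the patches of this series, that "Counter Present" (CntP) is bit 62 of the MISC register. The helper names are hypothetical and no real MSR is touched.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption taken from the patches: CntP is bit 62 of the MISC register. */
#define MISC_CNTP_BIT 62

/* Return the register value with CntP cleared; all other bits are kept. */
static uint64_t misc_clear_cntp(uint64_t misc)
{
	return misc & ~(UINT64_C(1) << MISC_CNTP_BIT);
}

/* True if the modelled register value still advertises a threshold counter. */
static bool misc_counter_present(uint64_t misc)
{
	return (misc >> MISC_CNTP_BIT) & 1;
}
```

With CntP cleared, the thresholding setup code has no counter to enable on that bank, which is how the quoted commit log describes the effect.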

[Test]
The affected user reported that these commits fix the issue.

[Regression Potential]
Low. These are upstream stable commits, and I see no regressions on my
unaffected AMD systems.

Shirish S (2):
  x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
  x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk

Yazen Ghannam (2):
  x86/MCE: Add an MCE-record filtering function
  x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models

 arch/x86/kernel/cpu/mce/amd.c      | 62 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mce/core.c     | 38 ++++--------------
 arch/x86/kernel/cpu/mce/genpool.c  |  3 ++
 arch/x86/kernel/cpu/mce/internal.h |  9 +++++
 drivers/edac/mce_amd.c             |  4 +-
 5 files changed, 84 insertions(+), 32 deletions(-)

--
2.17.1


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[B/C] [PATCH 1/4] x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models

Kai-Heng Feng
From: Shirish S <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

MC4_MISC thresholding is not supported on any family 0x15 processor,
hence skip the x86_model check when applying the quirk.

 [ bp: massage commit message. ]

Signed-off-by: Shirish S <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: https://lkml.kernel.org/r/1547106849-3476-2-git-send-email-shirish.s@...
(backported from commit c95b323dcd3598dd7ef5005d6723c1ba3b801093)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d355751c8038..7ee9cb37d7c9 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1599,11 +1599,10 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
  mce_flags.overflow_recov = 1;
 
  /*
- * Turn off MC4_MISC thresholding banks on those models since
+ * Turn off MC4_MISC thresholding banks on all models since
  * they're not supported there.
  */
- if (c->x86 == 0x15 &&
-    (c->x86_model >= 0x10 && c->x86_model <= 0x1f)) {
+ if (c->x86 == 0x15) {
  int i;
  u64 hwcr;
  bool need_toggle;
--
2.17.1
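
The effect of the changed condition can be sketched in user space as two predicates, one for the old check and one for the new. The struct cpu_id type and both function names are hypothetical stand-ins for the c->x86 and c->x86_model fields read in the diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two cpuinfo_x86 fields the check reads. */
struct cpu_id {
	unsigned int family; /* c->x86 in the diff */
	unsigned int model;  /* c->x86_model in the diff */
};

/* Old condition: family 0x15, models 0x10-0x1f only. */
static bool quirk_applies_old(const struct cpu_id *c)
{
	return c->family == 0x15 && c->model >= 0x10 && c->model <= 0x1f;
}

/* New condition: every family 0x15 model. */
static bool quirk_applies_new(const struct cpu_id *c)
{
	return c->family == 0x15;
}
```

A family 0x15 part whose model number falls outside 0x10-0x1f (0x60 is used as an arbitrary test value here) is skipped by the old predicate and matched by the new one. Other families are unaffected by either.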



[D] [PATCH 1/4] x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models

Kai-Heng Feng
From: Shirish S <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

MC4_MISC thresholding is not supported on any family 0x15 processor,
hence skip the x86_model check when applying the quirk.

 [ bp: massage commit message. ]

Signed-off-by: Shirish S <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: https://lkml.kernel.org/r/1547106849-3476-2-git-send-email-shirish.s@...
(cherry picked from commit c95b323dcd3598dd7ef5005d6723c1ba3b801093)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mce/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6ce290c506d9..a2dccfb26c04 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1613,11 +1613,10 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
  mce_flags.overflow_recov = 1;
 
  /*
- * Turn off MC4_MISC thresholding banks on those models since
+ * Turn off MC4_MISC thresholding banks on all models since
  * they're not supported there.
  */
- if (c->x86 == 0x15 &&
-    (c->x86_model >= 0x10 && c->x86_model <= 0x1f)) {
+ if (c->x86 == 0x15) {
  int i;
  u64 hwcr;
  bool need_toggle;
--
2.17.1



[B/C] [PATCH 2/4] x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk

Kai-Heng Feng
From: Shirish S <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

The MC4_MISC thresholding quirk needs to be applied during S5 -> S0 and
S3 -> S0 state transitions, which follow different code paths. Carve it
out into a separate function and call it from mce_amd_feature_init(),
where the two code paths of the state transitions converge.

 [ bp: massage commit message and the carved out function. ]

Signed-off-by: Shirish S <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: Kees Cook <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: Yazen Ghannam <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: https://lkml.kernel.org/r/1547651417-23583-3-git-send-email-shirish.s@...
(backported from commit 30aa3d26edb0f3d7992757287eec0ca588a5c259)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mcheck/mce.c     | 29 ----------------------
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 36 ++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7ee9cb37d7c9..7a7b3a95b4ba 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1598,35 +1598,6 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
  if (c->x86 == 0x15 && c->x86_model <= 0xf)
  mce_flags.overflow_recov = 1;
 
- /*
- * Turn off MC4_MISC thresholding banks on all models since
- * they're not supported there.
- */
- if (c->x86 == 0x15) {
- int i;
- u64 hwcr;
- bool need_toggle;
- u32 msrs[] = {
- 0x00000413, /* MC4_MISC0 */
- 0xc0000408, /* MC4_MISC1 */
- };
-
- rdmsrl(MSR_K7_HWCR, hwcr);
-
- /* McStatusWrEn has to be set */
- need_toggle = !(hwcr & BIT(18));
-
- if (need_toggle)
- wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
-
- /* Clear CntP bit safely */
- for (i = 0; i < ARRAY_SIZE(msrs); i++)
- msr_clear_bit(msrs[i], 62);
-
- /* restore old settings */
- if (need_toggle)
- wrmsrl(MSR_K7_HWCR, hwcr);
- }
  }
 
  if (c->x86_vendor == X86_VENDOR_INTEL) {
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index dd33c357548f..8beea51981c4 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -545,6 +545,40 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
  return offset;
 }
 
+/*
+ * Turn off MC4_MISC thresholding banks on all family 0x15 models since
+ * they're not supported there.
+ */
+void disable_err_thresholding(struct cpuinfo_x86 *c)
+{
+ int i;
+ u64 hwcr;
+ bool need_toggle;
+ u32 msrs[] = {
+ 0x00000413, /* MC4_MISC0 */
+ 0xc0000408, /* MC4_MISC1 */
+ };
+
+ if (c->x86 != 0x15)
+ return;
+
+ rdmsrl(MSR_K7_HWCR, hwcr);
+
+ /* McStatusWrEn has to be set */
+ need_toggle = !(hwcr & BIT(18));
+
+ if (need_toggle)
+ wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
+
+ /* Clear CntP bit safely */
+ for (i = 0; i < ARRAY_SIZE(msrs); i++)
+ msr_clear_bit(msrs[i], 62);
+
+ /* restore old settings */
+ if (need_toggle)
+ wrmsrl(MSR_K7_HWCR, hwcr);
+}
+
 /* cpu init entry point, called from mce.c with preempt off */
 void mce_amd_feature_init(struct cpuinfo_x86 *c)
 {
@@ -552,6 +586,8 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
  unsigned int bank, block, cpu = smp_processor_id();
  int offset = -1;
 
+ disable_err_thresholding(c);
+
  for (bank = 0; bank < mca_cfg.banks; ++bank) {
  if (mce_flags.smca)
  smca_configure(bank, cpu);
--
2.17.1
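
The save, toggle, clear and restore sequence inside disable_err_thresholding() can be modelled in user space roughly as follows. This is only a sketch: struct fake_msrs and fake_disable_err_thresholding() are hypothetical, plain memory stands in for MSR_K7_HWCR and the two MC4_MISC registers, and the bit positions (18 for McStatusWrEn, 62 for CntP) are taken from the diff above.

```c
#include <stdbool.h>
#include <stdint.h>

#define HWCR_MCSTATUSWREN (UINT64_C(1) << 18) /* BIT(18) in the patch */
#define MISC_CNTP         (UINT64_C(1) << 62) /* bit 62 in the patch */

/* Hypothetical in-memory stand-ins for MSR_K7_HWCR and MC4_MISC0/1. */
struct fake_msrs {
	uint64_t hwcr;
	uint64_t misc[2];
};

static void fake_disable_err_thresholding(struct fake_msrs *m)
{
	uint64_t saved = m->hwcr;
	/* Only toggle McStatusWrEn if it is not already set. */
	bool need_toggle = !(saved & HWCR_MCSTATUSWREN);
	int i;

	if (need_toggle)
		m->hwcr = saved | HWCR_MCSTATUSWREN;

	/* Clear CntP in every modelled MISC register. */
	for (i = 0; i < 2; i++)
		m->misc[i] &= ~MISC_CNTP;

	/* Restore the original HWCR value if it was changed. */
	if (need_toggle)
		m->hwcr = saved;
}
```

The point of the pattern is that HWCR ends up exactly as it was found, whether or not McStatusWrEn was already set, while CntP is cleared in both registers.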



[D] [PATCH 2/4] x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk

Kai-Heng Feng
From: Shirish S <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

The MC4_MISC thresholding quirk needs to be applied during S5 -> S0 and
S3 -> S0 state transitions, which follow different code paths. Carve it
out into a separate function and call it from mce_amd_feature_init(),
where the two code paths of the state transitions converge.

 [ bp: massage commit message and the carved out function. ]

Signed-off-by: Shirish S <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: Kees Cook <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: Yazen Ghannam <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: https://lkml.kernel.org/r/1547651417-23583-3-git-send-email-shirish.s@...
(cherry picked from commit 30aa3d26edb0f3d7992757287eec0ca588a5c259)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mce/amd.c  | 36 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mce/core.c | 29 ---------------------------
 2 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 89298c83de53..ed3327342b40 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -545,6 +545,40 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
  return offset;
 }
 
+/*
+ * Turn off MC4_MISC thresholding banks on all family 0x15 models since
+ * they're not supported there.
+ */
+void disable_err_thresholding(struct cpuinfo_x86 *c)
+{
+ int i;
+ u64 hwcr;
+ bool need_toggle;
+ u32 msrs[] = {
+ 0x00000413, /* MC4_MISC0 */
+ 0xc0000408, /* MC4_MISC1 */
+ };
+
+ if (c->x86 != 0x15)
+ return;
+
+ rdmsrl(MSR_K7_HWCR, hwcr);
+
+ /* McStatusWrEn has to be set */
+ need_toggle = !(hwcr & BIT(18));
+
+ if (need_toggle)
+ wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
+
+ /* Clear CntP bit safely */
+ for (i = 0; i < ARRAY_SIZE(msrs); i++)
+ msr_clear_bit(msrs[i], 62);
+
+ /* restore old settings */
+ if (need_toggle)
+ wrmsrl(MSR_K7_HWCR, hwcr);
+}
+
 /* cpu init entry point, called from mce.c with preempt off */
 void mce_amd_feature_init(struct cpuinfo_x86 *c)
 {
@@ -552,6 +586,8 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
  unsigned int bank, block, cpu = smp_processor_id();
  int offset = -1;
 
+ disable_err_thresholding(c);
+
  for (bank = 0; bank < mca_cfg.banks; ++bank) {
  if (mce_flags.smca)
  smca_configure(bank, cpu);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index a2dccfb26c04..b7fb541a4873 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1612,35 +1612,6 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
  if (c->x86 == 0x15 && c->x86_model <= 0xf)
  mce_flags.overflow_recov = 1;
 
- /*
- * Turn off MC4_MISC thresholding banks on all models since
- * they're not supported there.
- */
- if (c->x86 == 0x15) {
- int i;
- u64 hwcr;
- bool need_toggle;
- u32 msrs[] = {
- 0x00000413, /* MC4_MISC0 */
- 0xc0000408, /* MC4_MISC1 */
- };
-
- rdmsrl(MSR_K7_HWCR, hwcr);
-
- /* McStatusWrEn has to be set */
- need_toggle = !(hwcr & BIT(18));
-
- if (need_toggle)
- wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
-
- /* Clear CntP bit safely */
- for (i = 0; i < ARRAY_SIZE(msrs); i++)
- msr_clear_bit(msrs[i], 62);
-
- /* restore old settings */
- if (need_toggle)
- wrmsrl(MSR_K7_HWCR, hwcr);
- }
  }
 
  if (c->x86_vendor == X86_VENDOR_INTEL) {
--
2.17.1



[B] [PATCH 3/4] x86/MCE: Add an MCE-record filtering function

Kai-Heng Feng
From: Yazen Ghannam <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

Some systems may report spurious MCA errors. In general, spurious MCA
errors may be disabled by clearing a particular bit in MCA_CTL. However,
clearing a bit in MCA_CTL may not be recommended for some errors, so the
only option is to ignore them.

An MCA error is printed and handled after it has been added to the MCE
event pool. So an MCA error can be ignored by not adding it to that pool
in the first place.

Add such a filtering function.

 [ bp: Move function prototype to the internal header and massage. ]

Signed-off-by: Yazen Ghannam <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: Arnd Bergmann <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: Pu Wen <[hidden email]>
Cc: Qiuxu Zhuo <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: Shirish S <[hidden email]>
Cc: <[hidden email]> # 5.0.x
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: https://lkml.kernel.org/r/20190325163410.171021-1-Yazen.Ghannam@...
(backported from commit 45d4b7b9cb88526f6d5bd4c03efab88d75d10e4f)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 3 +++
 arch/x86/kernel/cpu/mcheck/mce-internal.h | 3 +++
 arch/x86/kernel/cpu/mcheck/mce.c          | 5 +++++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
index 217cd4449bc9..fe1e74d0fb5b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-genpool.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -99,6 +99,9 @@ int mce_gen_pool_add(struct mce *mce)
 {
  struct mce_evt_llist *node;
 
+ if (filter_mce(mce))
+ return -EINVAL;
+
  if (!mce_evt_pool)
  return -EINVAL;
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index e956eb267061..0eb5d12c6bd0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -130,4 +130,7 @@ static inline void mce_unmap_kpfn(unsigned long pfn) {}
 #define mce_unmap_kpfn mce_unmap_kpfn
 #endif
 
+/* Decide whether to add MCE record to MCE event pool or filter it out. */
+extern bool filter_mce(struct mce *m);
+
 #endif /* __X86_MCE_INTERNAL_H__ */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 851b209c2f7d..bd99ba350990 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1766,6 +1766,11 @@ static void __mcheck_cpu_init_timer(void)
  mce_start_timer(t);
 }
 
+bool filter_mce(struct mce *m)
+{
+ return false;
+}
+
 /* Handle unconfigured int18 (should never happen) */
 static void unexpected_machine_check(struct pt_regs *regs, long error_code)
 {
--
2.17.1
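
The idea of filtering at the pool boundary can be sketched with a toy fixed-size pool. Everything here is hypothetical apart from the overall shape: the real mce_gen_pool_add() calls filter_mce() directly, whereas this sketch takes the filter as a function pointer so that a non-trivial filter can be tried, and the reduced struct mce_rec, the pool capacity and the -ENOMEM return for a full pool are choices made for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced record: only the fields this sketch needs. */
struct mce_rec {
	unsigned int bank;
	uint64_t status;
};

#define POOL_CAP 4

struct mce_pool {
	struct mce_rec recs[POOL_CAP];
	size_t count;
};

/* Same shape as filter_mce() in the patch: nothing is filtered yet. */
static bool filter_none(const struct mce_rec *r)
{
	(void)r;
	return false;
}

/* Illustrative filter, not from the patch: drop every record from bank 0. */
static bool filter_bank0(const struct mce_rec *r)
{
	return r->bank == 0;
}

/* A filtered record is rejected before it can enter the pool. */
static int pool_add(struct mce_pool *p, const struct mce_rec *r,
		    bool (*filter)(const struct mce_rec *))
{
	if (filter(r))
		return -EINVAL;
	if (p->count == POOL_CAP)
		return -ENOMEM;
	p->recs[p->count++] = *r;
	return 0;
}
```

Because a record that never enters the pool is never printed or handled later, rejecting it at this single point is enough to ignore it, which is the reasoning given in the commit message.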



[C] [PATCH 3/4] x86/MCE: Add an MCE-record filtering function

Kai-Heng Feng
From: Yazen Ghannam <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

Some systems may report spurious MCA errors. In general, spurious MCA
errors may be disabled by clearing a particular bit in MCA_CTL. However,
clearing a bit in MCA_CTL may not be recommended for some errors, so the
only option is to ignore them.

An MCA error is printed and handled after it has been added to the MCE
event pool. So an MCA error can be ignored by not adding it to that pool
in the first place.

Add such a filtering function.

 [ bp: Move function prototype to the internal header and massage. ]

Signed-off-by: Yazen Ghannam <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: Arnd Bergmann <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: Pu Wen <[hidden email]>
Cc: Qiuxu Zhuo <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: Shirish S <[hidden email]>
Cc: <[hidden email]> # 5.0.x
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: https://lkml.kernel.org/r/20190325163410.171021-1-Yazen.Ghannam@...
(backported from commit 45d4b7b9cb88526f6d5bd4c03efab88d75d10e4f)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 3 +++
 arch/x86/kernel/cpu/mcheck/mce-internal.h | 3 +++
 arch/x86/kernel/cpu/mcheck/mce.c          | 5 +++++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
index 217cd4449bc9..fe1e74d0fb5b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-genpool.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -99,6 +99,9 @@ int mce_gen_pool_add(struct mce *mce)
 {
  struct mce_evt_llist *node;
 
+ if (filter_mce(mce))
+ return -EINVAL;
+
  if (!mce_evt_pool)
  return -EINVAL;
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index ceb67cd5918f..e48fefda6c68 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -170,4 +170,7 @@ struct mca_msr_regs {
 
 extern struct mca_msr_regs msr_ops;
 
+/* Decide whether to add MCE record to MCE event pool or filter it out. */
+extern bool filter_mce(struct mce *m);
+
 #endif /* __X86_MCE_INTERNAL_H__ */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7a7b3a95b4ba..bd0e3c3da8bc 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1752,6 +1752,11 @@ static void __mcheck_cpu_init_timer(void)
  mce_start_timer(t);
 }
 
+bool filter_mce(struct mce *m)
+{
+ return false;
+}
+
 /* Handle unconfigured int18 (should never happen) */
 static void unexpected_machine_check(struct pt_regs *regs, long error_code)
 {
--
2.17.1



[D] [PATCH 3/4] x86/MCE: Add an MCE-record filtering function

Kai-Heng Feng
From: Yazen Ghannam <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

Some systems may report spurious MCA errors. In general, spurious MCA
errors may be disabled by clearing a particular bit in MCA_CTL. However,
clearing a bit in MCA_CTL may not be recommended for some errors, so the
only option is to ignore them.

An MCA error is printed and handled after it has been added to the MCE
event pool. So an MCA error can be ignored by not adding it to that pool
in the first place.

Add such a filtering function.

 [ bp: Move function prototype to the internal header and massage. ]

Signed-off-by: Yazen Ghannam <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: Arnd Bergmann <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: Pu Wen <[hidden email]>
Cc: Qiuxu Zhuo <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: Shirish S <[hidden email]>
Cc: <[hidden email]> # 5.0.x
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: https://lkml.kernel.org/r/20190325163410.171021-1-Yazen.Ghannam@...
(cherry picked from commit 45d4b7b9cb88526f6d5bd4c03efab88d75d10e4f)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mce/core.c     | 5 +++++
 arch/x86/kernel/cpu/mce/genpool.c  | 3 +++
 arch/x86/kernel/cpu/mce/internal.h | 3 +++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index b7fb541a4873..12d61b8f8154 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1771,6 +1771,11 @@ static void __mcheck_cpu_init_timer(void)
  mce_start_timer(t);
 }
 
+bool filter_mce(struct mce *m)
+{
+ return false;
+}
+
 /* Handle unconfigured int18 (should never happen) */
 static void unexpected_machine_check(struct pt_regs *regs, long error_code)
 {
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index 3395549c51d3..64d1d5a00f39 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -99,6 +99,9 @@ int mce_gen_pool_add(struct mce *mce)
 {
  struct mce_evt_llist *node;
 
+ if (filter_mce(mce))
+ return -EINVAL;
+
  if (!mce_evt_pool)
  return -EINVAL;
 
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index af5eab1e65e2..b822a645395d 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -173,4 +173,7 @@ struct mca_msr_regs {
 
 extern struct mca_msr_regs msr_ops;
 
+/* Decide whether to add MCE record to MCE event pool or filter it out. */
+extern bool filter_mce(struct mce *m);
+
 #endif /* __X86_MCE_INTERNAL_H__ */
--
2.17.1



[B] [PATCH 4/4] x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models

Kai-Heng Feng
From: Yazen Ghannam <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

AMD family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
errors under certain conditions. The errors are benign and can safely be
ignored. However, the high error rate may cause the MCA threshold
counter to overflow causing a high rate of thresholding interrupts.

In addition, users may see the errors reported through the AMD MCE
decoder module, even with the interrupt disabled, due to MCA polling.

Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This will prevent enabling MCA thresholding on this
bank which will prevent the high interrupt rate due to this error.

Define an AMD-specific function to filter these errors from the MCE
event pool so that they don't get reported during early boot.

Rename filter function in EDAC/mce_amd to avoid a naming conflict, while
at it.

 [ bp: Move function prototype to the internal header and
   massage/cleanup, fix typos. ]

Reported-by: Rafał Miłecki <[hidden email]>
Signed-off-by: Yazen Ghannam <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: Arnd Bergmann <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: James Morse <[hidden email]>
Cc: Kees Cook <[hidden email]>
Cc: Mauro Carvalho Chehab <[hidden email]>
Cc: Pu Wen <[hidden email]>
Cc: Qiuxu Zhuo <[hidden email]>
Cc: Shirish S <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: linux-edac <[hidden email]>
Cc: x86-ml <[hidden email]>
Cc: <[hidden email]> # 5.0.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
Cc: <[hidden email]> # 5.0.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
Cc: <[hidden email]> # 5.0.x: 9308fd407455: x86/MCE: Group AMD function prototypes in <asm/mce.h>
Cc: <[hidden email]> # 5.0.x
Link: https://lkml.kernel.org/r/20190325163410.171021-2-Yazen.Ghannam@...
(backported from commit 71a84402b93e5fbd8f817f40059c137e10171788)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  6 +++
 arch/x86/kernel/cpu/mcheck/mce.c          |  3 ++
 arch/x86/kernel/cpu/mcheck/mce_amd.c      | 52 +++++++++++++++++------
 drivers/edac/mce_amd.c                    |  4 +-
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 0eb5d12c6bd0..7802ca53d553 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -133,4 +133,10 @@ static inline void mce_unmap_kpfn(unsigned long pfn) {}
 /* Decide whether to add MCE record to MCE event pool or filter it out. */
 extern bool filter_mce(struct mce *m);
 
+#ifdef CONFIG_X86_MCE_AMD
+extern bool amd_filter_mce(struct mce *m);
+#else
+static inline bool amd_filter_mce(struct mce *m) { return false; };
+#endif
+
 #endif /* __X86_MCE_INTERNAL_H__ */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bd99ba350990..e9f08bf9278c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1768,6 +1768,9 @@ static void __mcheck_cpu_init_timer(void)
 
 bool filter_mce(struct mce *m)
 {
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return amd_filter_mce(m);
+
  return false;
 }
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index e9227fa85403..c36333c478f5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -543,33 +543,59 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
  return offset;
 }
 
+bool amd_filter_mce(struct mce *m)
+{
+ enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ u8 xec = (m->status >> 16) & 0x3F;
+
+ /* See Family 17h Models 10h-2Fh Erratum #1114. */
+ if (c->x86 == 0x17 &&
+    c->x86_model >= 0x10 && c->x86_model <= 0x2F &&
+    bank_type == SMCA_IF && xec == 10)
+ return true;
+
+ return false;
+}
+
 /*
- * Turn off MC4_MISC thresholding banks on all family 0x15 models since
- * they're not supported there.
+ * Turn off thresholding banks for the following conditions:
+ * - MC4_MISC thresholding is not supported on Family 0x15.
+ * - Prevent possible spurious interrupts from the IF bank on Family 0x17
+ *   Models 0x10-0x2F due to Erratum #1114.
  */
-void disable_err_thresholding(struct cpuinfo_x86 *c)
+void disable_err_thresholding(struct cpuinfo_x86 *c, unsigned int bank)
 {
- int i;
+ int i, num_msrs;
  u64 hwcr;
  bool need_toggle;
- u32 msrs[] = {
- 0x00000413, /* MC4_MISC0 */
- 0xc0000408, /* MC4_MISC1 */
- };
+ u32 msrs[NR_BLOCKS];
+
+ if (c->x86 == 0x15 && bank == 4) {
+ msrs[0] = 0x00000413; /* MC4_MISC0 */
+ msrs[1] = 0xc0000408; /* MC4_MISC1 */
+ num_msrs = 2;
+ } else if (c->x86 == 0x17 &&
+   (c->x86_model >= 0x10 && c->x86_model <= 0x2F)) {
 
- if (c->x86 != 0x15)
+ if (smca_get_bank_type(bank) != SMCA_IF)
+ return;
+
+ msrs[0] = MSR_AMD64_SMCA_MCx_MISC(bank);
+ num_msrs = 1;
+ } else {
  return;
+ }
 
  rdmsrl(MSR_K7_HWCR, hwcr);
 
  /* McStatusWrEn has to be set */
  need_toggle = !(hwcr & BIT(18));
-
  if (need_toggle)
  wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
 
  /* Clear CntP bit safely */
- for (i = 0; i < ARRAY_SIZE(msrs); i++)
+ for (i = 0; i < num_msrs; i++)
  msr_clear_bit(msrs[i], 62);
 
  /* restore old settings */
@@ -584,12 +610,12 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
  unsigned int bank, block, cpu = smp_processor_id();
  int offset = -1;
 
- disable_err_thresholding(c);
-
  for (bank = 0; bank < mca_cfg.banks; ++bank) {
  if (mce_flags.smca)
  smca_configure(bank, cpu);
 
+ disable_err_thresholding(c, bank);
+
  for (block = 0; block < NR_BLOCKS; ++block) {
  address = get_block_address(cpu, address, low, high, bank, block);
  if (!address)
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 2ab4d61ee47e..f108ac1a0540 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -914,7 +914,7 @@ static inline void amd_decode_err_code(u16 ec)
 /*
  * Filter out unwanted MCE signatures here.
  */
-static bool amd_filter_mce(struct mce *m)
+static bool ignore_mce(struct mce *m)
 {
  /*
  * NB GART TLB error reporting is disabled by default.
@@ -948,7 +948,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  unsigned int fam = x86_family(m->cpuid);
  int ecc;
 
- if (amd_filter_mce(m))
+ if (ignore_mce(m))
  return NOTIFY_STOP;
 
  pr_emerg(HW_ERR "%s\n", decode_error_status(m));
--
2.17.1
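
The matching logic of amd_filter_mce() can be exercised in user space with a small model. The enum bank_type and struct cpu_id below are hypothetical stand-ins (in the patch the bank type comes from smca_get_bank_type() and the CPU data from boot_cpu_data). The shift, the mask and the compared values are copied from the diff above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the SMCA bank type used in the patch. */
enum bank_type { BANK_IF, BANK_OTHER };

/* Hypothetical stand-in for the cpuinfo_x86 fields the check reads. */
struct cpu_id {
	unsigned int family;
	unsigned int model;
};

/* Extended error code: six bits starting at bit 16 of the status value. */
static uint8_t status_xec(uint64_t status)
{
	return (status >> 16) & 0x3F;
}

/* True only for the signature the patch filters (Erratum #1114). */
static bool filter_erratum_1114(const struct cpu_id *c, enum bank_type type,
				uint64_t status)
{
	return c->family == 0x17 &&
	       c->model >= 0x10 && c->model <= 0x2F &&
	       type == BANK_IF && status_xec(status) == 10;
}
```

All four conditions must hold at once, so an error with the same extended error code from another bank type, or from a model outside 0x10-0x2F, is still reported.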



[C] [PATCH 4/4] x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models

Kai-Heng Feng
From: Yazen Ghannam <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

AMD family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
errors under certain conditions. The errors are benign and can safely be
ignored. However, the high error rate may cause the MCA threshold
counter to overflow causing a high rate of thresholding interrupts.

In addition, users may see the errors reported through the AMD MCE
decoder module, even with the interrupt disabled, due to MCA polling.

Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This will prevent enabling MCA thresholding on this
bank which will prevent the high interrupt rate due to this error.

Define an AMD-specific function to filter these errors from the MCE
event pool so that they don't get reported during early boot.

Rename filter function in EDAC/mce_amd to avoid a naming conflict, while
at it.

 [ bp: Move function prototype to the internal header and
   massage/cleanup, fix typos. ]

Reported-by: Rafał Miłecki <[hidden email]>
Signed-off-by: Yazen Ghannam <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: Arnd Bergmann <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: James Morse <[hidden email]>
Cc: Kees Cook <[hidden email]>
Cc: Mauro Carvalho Chehab <[hidden email]>
Cc: Pu Wen <[hidden email]>
Cc: Qiuxu Zhuo <[hidden email]>
Cc: Shirish S <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: linux-edac <[hidden email]>
Cc: x86-ml <[hidden email]>
Cc: <[hidden email]> # 5.0.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
Cc: <[hidden email]> # 5.0.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
Cc: <[hidden email]> # 5.0.x: 9308fd407455: x86/MCE: Group AMD function prototypes in <asm/mce.h>
Cc: <[hidden email]> # 5.0.x
Link: https://lkml.kernel.org/r/20190325163410.171021-2-Yazen.Ghannam@...
(backported from commit 71a84402b93e5fbd8f817f40059c137e10171788)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  6 +++
 arch/x86/kernel/cpu/mcheck/mce.c          |  3 ++
 arch/x86/kernel/cpu/mcheck/mce_amd.c      | 52 +++++++++++++++++------
 drivers/edac/mce_amd.c                    |  4 +-
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index e48fefda6c68..7c7ac2a78082 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -173,4 +173,10 @@ extern struct mca_msr_regs msr_ops;
 /* Decide whether to add MCE record to MCE event pool or filter it out. */
 extern bool filter_mce(struct mce *m);
 
+#ifdef CONFIG_X86_MCE_AMD
+extern bool amd_filter_mce(struct mce *m);
+#else
+static inline bool amd_filter_mce(struct mce *m) { return false; };
+#endif
+
 #endif /* __X86_MCE_INTERNAL_H__ */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bd0e3c3da8bc..5819c3054fbc 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1754,6 +1754,9 @@ static void __mcheck_cpu_init_timer(void)
 
 bool filter_mce(struct mce *m)
 {
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return amd_filter_mce(m);
+
  return false;
 }
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 8beea51981c4..a24cd6e5f183 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -545,33 +545,59 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
  return offset;
 }
 
+bool amd_filter_mce(struct mce *m)
+{
+ enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ u8 xec = (m->status >> 16) & 0x3F;
+
+ /* See Family 17h Models 10h-2Fh Erratum #1114. */
+ if (c->x86 == 0x17 &&
+    c->x86_model >= 0x10 && c->x86_model <= 0x2F &&
+    bank_type == SMCA_IF && xec == 10)
+ return true;
+
+ return false;
+}
+
 /*
- * Turn off MC4_MISC thresholding banks on all family 0x15 models since
- * they're not supported there.
+ * Turn off thresholding banks for the following conditions:
+ * - MC4_MISC thresholding is not supported on Family 0x15.
+ * - Prevent possible spurious interrupts from the IF bank on Family 0x17
+ *   Models 0x10-0x2F due to Erratum #1114.
  */
-void disable_err_thresholding(struct cpuinfo_x86 *c)
+void disable_err_thresholding(struct cpuinfo_x86 *c, unsigned int bank)
 {
- int i;
+ int i, num_msrs;
  u64 hwcr;
  bool need_toggle;
- u32 msrs[] = {
- 0x00000413, /* MC4_MISC0 */
- 0xc0000408, /* MC4_MISC1 */
- };
+ u32 msrs[NR_BLOCKS];
+
+ if (c->x86 == 0x15 && bank == 4) {
+ msrs[0] = 0x00000413; /* MC4_MISC0 */
+ msrs[1] = 0xc0000408; /* MC4_MISC1 */
+ num_msrs = 2;
+ } else if (c->x86 == 0x17 &&
+   (c->x86_model >= 0x10 && c->x86_model <= 0x2F)) {
 
- if (c->x86 != 0x15)
+ if (smca_get_bank_type(bank) != SMCA_IF)
+ return;
+
+ msrs[0] = MSR_AMD64_SMCA_MCx_MISC(bank);
+ num_msrs = 1;
+ } else {
  return;
+ }
 
  rdmsrl(MSR_K7_HWCR, hwcr);
 
  /* McStatusWrEn has to be set */
  need_toggle = !(hwcr & BIT(18));
-
  if (need_toggle)
  wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
 
  /* Clear CntP bit safely */
- for (i = 0; i < ARRAY_SIZE(msrs); i++)
+ for (i = 0; i < num_msrs; i++)
  msr_clear_bit(msrs[i], 62);
 
  /* restore old settings */
@@ -586,12 +612,12 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
  unsigned int bank, block, cpu = smp_processor_id();
  int offset = -1;
 
- disable_err_thresholding(c);
-
  for (bank = 0; bank < mca_cfg.banks; ++bank) {
  if (mce_flags.smca)
  smca_configure(bank, cpu);
 
+ disable_err_thresholding(c, bank);
+
  for (block = 0; block < NR_BLOCKS; ++block) {
  address = get_block_address(address, low, high, bank, block);
  if (!address)
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 2ab4d61ee47e..f108ac1a0540 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -914,7 +914,7 @@ static inline void amd_decode_err_code(u16 ec)
 /*
  * Filter out unwanted MCE signatures here.
  */
-static bool amd_filter_mce(struct mce *m)
+static bool ignore_mce(struct mce *m)
 {
  /*
  * NB GART TLB error reporting is disabled by default.
@@ -948,7 +948,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  unsigned int fam = x86_family(m->cpuid);
  int ecc;
 
- if (amd_filter_mce(m))
+ if (ignore_mce(m))
  return NOTIFY_STOP;
 
  pr_emerg(HW_ERR "%s\n", decode_error_status(m));
--
2.17.1
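
For readers following the filter logic in the patch above, here is a minimal user-space sketch of the check that amd_filter_mce() performs. It is not kernel code: struct sketch_cpu, enum sketch_bank_type, sketch_xec() and sketch_filter_mce() are hypothetical stand-ins for boot_cpu_data, the SMCA bank type lookup and the kernel function. Only the bit arithmetic and the family/model range are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
enum sketch_bank_type { SKETCH_BANK_IF, SKETCH_BANK_OTHER };

struct sketch_cpu {
	unsigned int family;	/* plays the role of c->x86 */
	unsigned int model;	/* plays the role of c->x86_model */
};

/* Extended error code: bits 21:16 of the status value, as in the patch. */
static inline uint8_t sketch_xec(uint64_t status)
{
	return (status >> 16) & 0x3F;
}

/*
 * Return true when the record matches the signature the patch filters:
 * Family 17h, Models 10h-2Fh, Instruction Fetch bank, extended error
 * code 10.
 */
static bool sketch_filter_mce(const struct sketch_cpu *c,
			      enum sketch_bank_type type, uint64_t status)
{
	if (c->family == 0x17 &&
	    c->model >= 0x10 && c->model <= 0x2F &&
	    type == SKETCH_BANK_IF && sketch_xec(status) == 10)
		return true;

	return false;
}
```

All four conditions must hold, so a record with the same error code from another bank type, or from a model outside 10h-2Fh, is still reported.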



[D] [PATCH 4/4] x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Yazen Ghannam <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1796443

AMD family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
errors under certain conditions. The errors are benign and can safely be
ignored. However, the high error rate may cause the MCA threshold
counter to overflow causing a high rate of thresholding interrupts.

In addition, users may see the errors reported through the AMD MCE
decoder module, even with the interrupt disabled, due to MCA polling.

Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This will prevent enabling MCA thresholding on this
bank which will prevent the high interrupt rate due to this error.

Define an AMD-specific function to filter these errors from the MCE
event pool so that they don't get reported during early boot.

Rename filter function in EDAC/mce_amd to avoid a naming conflict, while
at it.

 [ bp: Move function prototype to the internal header and
   massage/cleanup, fix typos. ]

Reported-by: Rafał Miłecki <[hidden email]>
Signed-off-by: Yazen Ghannam <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Cc: Arnd Bergmann <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: James Morse <[hidden email]>
Cc: Kees Cook <[hidden email]>
Cc: Mauro Carvalho Chehab <[hidden email]>
Cc: Pu Wen <[hidden email]>
Cc: Qiuxu Zhuo <[hidden email]>
Cc: Shirish S <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Vishal Verma <[hidden email]>
Cc: linux-edac <[hidden email]>
Cc: x86-ml <[hidden email]>
Cc: <[hidden email]> # 5.0.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
Cc: <[hidden email]> # 5.0.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
Cc: <[hidden email]> # 5.0.x: 9308fd407455: x86/MCE: Group AMD function prototypes in <asm/mce.h>
Cc: <[hidden email]> # 5.0.x
Link: https://lkml.kernel.org/r/20190325163410.171021-2-Yazen.Ghannam@...
(cherry picked from commit 71a84402b93e5fbd8f817f40059c137e10171788)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 arch/x86/kernel/cpu/mce/amd.c      | 52 ++++++++++++++++++++++--------
 arch/x86/kernel/cpu/mce/core.c     |  3 ++
 arch/x86/kernel/cpu/mce/internal.h |  6 ++++
 drivers/edac/mce_amd.c             |  4 +--
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index ed3327342b40..496033b01d26 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -545,33 +545,59 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
  return offset;
 }
 
+bool amd_filter_mce(struct mce *m)
+{
+ enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ u8 xec = (m->status >> 16) & 0x3F;
+
+ /* See Family 17h Models 10h-2Fh Erratum #1114. */
+ if (c->x86 == 0x17 &&
+    c->x86_model >= 0x10 && c->x86_model <= 0x2F &&
+    bank_type == SMCA_IF && xec == 10)
+ return true;
+
+ return false;
+}
+
 /*
- * Turn off MC4_MISC thresholding banks on all family 0x15 models since
- * they're not supported there.
+ * Turn off thresholding banks for the following conditions:
+ * - MC4_MISC thresholding is not supported on Family 0x15.
+ * - Prevent possible spurious interrupts from the IF bank on Family 0x17
+ *   Models 0x10-0x2F due to Erratum #1114.
  */
-void disable_err_thresholding(struct cpuinfo_x86 *c)
+void disable_err_thresholding(struct cpuinfo_x86 *c, unsigned int bank)
 {
- int i;
+ int i, num_msrs;
  u64 hwcr;
  bool need_toggle;
- u32 msrs[] = {
- 0x00000413, /* MC4_MISC0 */
- 0xc0000408, /* MC4_MISC1 */
- };
+ u32 msrs[NR_BLOCKS];
+
+ if (c->x86 == 0x15 && bank == 4) {
+ msrs[0] = 0x00000413; /* MC4_MISC0 */
+ msrs[1] = 0xc0000408; /* MC4_MISC1 */
+ num_msrs = 2;
+ } else if (c->x86 == 0x17 &&
+   (c->x86_model >= 0x10 && c->x86_model <= 0x2F)) {
 
- if (c->x86 != 0x15)
+ if (smca_get_bank_type(bank) != SMCA_IF)
+ return;
+
+ msrs[0] = MSR_AMD64_SMCA_MCx_MISC(bank);
+ num_msrs = 1;
+ } else {
  return;
+ }
 
  rdmsrl(MSR_K7_HWCR, hwcr);
 
  /* McStatusWrEn has to be set */
  need_toggle = !(hwcr & BIT(18));
-
  if (need_toggle)
  wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
 
  /* Clear CntP bit safely */
- for (i = 0; i < ARRAY_SIZE(msrs); i++)
+ for (i = 0; i < num_msrs; i++)
  msr_clear_bit(msrs[i], 62);
 
  /* restore old settings */
@@ -586,12 +612,12 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
  unsigned int bank, block, cpu = smp_processor_id();
  int offset = -1;
 
- disable_err_thresholding(c);
-
  for (bank = 0; bank < mca_cfg.banks; ++bank) {
  if (mce_flags.smca)
  smca_configure(bank, cpu);
 
+ disable_err_thresholding(c, bank);
+
  for (block = 0; block < NR_BLOCKS; ++block) {
  address = get_block_address(address, low, high, bank, block);
  if (!address)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 12d61b8f8154..1a7084ba9a3b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1773,6 +1773,9 @@ static void __mcheck_cpu_init_timer(void)
 
 bool filter_mce(struct mce *m)
 {
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return amd_filter_mce(m);
+
  return false;
 }
 
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index b822a645395d..a34b55baa7aa 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -176,4 +176,10 @@ extern struct mca_msr_regs msr_ops;
 /* Decide whether to add MCE record to MCE event pool or filter it out. */
 extern bool filter_mce(struct mce *m);
 
+#ifdef CONFIG_X86_MCE_AMD
+extern bool amd_filter_mce(struct mce *m);
+#else
+static inline bool amd_filter_mce(struct mce *m) { return false; };
+#endif
+
 #endif /* __X86_MCE_INTERNAL_H__ */
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index c605089d899f..397cd51f033a 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -914,7 +914,7 @@ static inline void amd_decode_err_code(u16 ec)
 /*
  * Filter out unwanted MCE signatures here.
  */
-static bool amd_filter_mce(struct mce *m)
+static bool ignore_mce(struct mce *m)
 {
  /*
  * NB GART TLB error reporting is disabled by default.
@@ -948,7 +948,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  unsigned int fam = x86_family(m->cpuid);
  int ecc;
 
- if (amd_filter_mce(m))
+ if (ignore_mce(m))
  return NOTIFY_STOP;
 
  pr_emerg(HW_ERR "%s\n", decode_error_status(m));
--
2.17.1
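
The register sequence in disable_err_thresholding() above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: struct sketch_regs is a hypothetical simulated register file standing in for rdmsrl()/wrmsrl()/msr_clear_bit(), and the final restore step is assumed from the "restore old settings" comment, since that part of the function lies outside the hunk shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BIT(n)		(1ULL << (n))
#define SKETCH_MCSTATUSWREN	SKETCH_BIT(18)	/* HWCR bit tested by the patch */
#define SKETCH_CNTP		SKETCH_BIT(62)	/* "Counter Present" bit */

/* Hypothetical simulated registers; real code accesses MSRs instead. */
struct sketch_regs {
	uint64_t hwcr;
	uint64_t misc[2];
};

/* Clear CntP in the first num_misc MISC registers, toggling McStatusWrEn. */
static void sketch_clear_cntp(struct sketch_regs *r, int num_misc)
{
	bool need_toggle = !(r->hwcr & SKETCH_MCSTATUSWREN);
	int i;

	assert(num_misc >= 0 && num_misc <= 2);

	/* McStatusWrEn has to be set before the bit can be cleared. */
	if (need_toggle)
		r->hwcr |= SKETCH_MCSTATUSWREN;

	for (i = 0; i < num_misc; i++)
		r->misc[i] &= ~SKETCH_CNTP;

	/* Assumed restore step: put HWCR back the way it was found. */
	if (need_toggle)
		r->hwcr &= ~SKETCH_MCSTATUSWREN;
}
```

Tracking the count explicitly mirrors the switch from ARRAY_SIZE(msrs) to num_msrs in the patch: the family 0x15 case touches two registers while the family 0x17 case touches one, so the loop bound can no longer be the array size.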



ACK/Cmnt: [SRU] [B/D] [PATCH 0/4] Fix AMD CPU MCE bug

Stefan Bader-2
In reply to this post by Kai-Heng Feng
On 03.07.19 09:23, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1796443
>
> [Impact]
> System doesn't boot without "mce=off".
>
> [Fix]
> Quote from the commit log:
> "Clear the "Counter Present" bit in the Instruction Fetch bank's
> MCA_MISC0 register. This will prevent enabling MCA thresholding on this
> bank which will prevent the high interrupt rate due to this error."
>
> [Test]
> The affected user reported these commits fix the issue.
>
> [Regression Potential]
> Low. Upstream stable commits. I don't see any regression on my
> unaffected AMD systems.
>
> Shirish S (2):
>   x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
>   x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
>
> Yazen Ghannam (2):
>   x86/MCE: Add an MCE-record filtering function
>   x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models
>
>  arch/x86/kernel/cpu/mce/amd.c      | 62 ++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/mce/core.c     | 38 ++++--------------
>  arch/x86/kernel/cpu/mce/genpool.c  |  3 ++
>  arch/x86/kernel/cpu/mce/internal.h |  9 +++++
>  drivers/edac/mce_amd.c             |  4 +-
>  5 files changed, 84 insertions(+), 32 deletions(-)
>
Cosmic will be EOL by next cycle.

Acked-by: Stefan Bader <[hidden email]>



ACK: [SRU] [B/C/D] [PATCH 0/4] Fix AMD CPU MCE bug

Kleber Souza
In reply to this post by Kai-Heng Feng
On 03.07.19 09:23, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1796443
>
> [Impact]
> System doesn't boot without "mce=off".
>
> [Fix]
> Quote from the commit log:
> "Clear the "Counter Present" bit in the Instruction Fetch bank's
> MCA_MISC0 register. This will prevent enabling MCA thresholding on this
> bank which will prevent the high interrupt rate due to this error."
>
> [Test]
> The affected user reported these commits fix the issue.
>
> [Regression Potential]
> Low. Upstream stable commits. I don't see any regression on my
> unaffected AMD systems.
>
> Shirish S (2):
>   x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
>   x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
>
> Yazen Ghannam (2):
>   x86/MCE: Add an MCE-record filtering function
>   x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models
>
>  arch/x86/kernel/cpu/mce/amd.c      | 62 ++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/mce/core.c     | 38 ++++--------------
>  arch/x86/kernel/cpu/mce/genpool.c  |  3 ++
>  arch/x86/kernel/cpu/mce/internal.h |  9 +++++
>  drivers/edac/mce_amd.c             |  4 +-
>  5 files changed, 84 insertions(+), 32 deletions(-)
>

Acked-by: Kleber Sacilotto de Souza <[hidden email]>

Thank you,
Kleber


APPLIED[B]/Cmnt: [SRU] [B/C/D] [PATCH 0/4] Fix AMD CPU MCE bug

Kleber Souza
In reply to this post by Kai-Heng Feng
On 03.07.19 09:23, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1796443
>
> [Impact]
> System doesn't boot without "mce=off".
>
> [Fix]
> Quote from the commit log:
> "Clear the "Counter Present" bit in the Instruction Fetch bank's
> MCA_MISC0 register. This will prevent enabling MCA thresholding on this
> bank which will prevent the high interrupt rate due to this error."
>
> [Test]
> The affected user reported these commits fix the issue.
>
> [Regression Potential]
> Low. Upstream stable commits. I don't see any regression on my
> unaffected AMD systems.
>
> Shirish S (2):
>   x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
>   x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
>
> Yazen Ghannam (2):
>   x86/MCE: Add an MCE-record filtering function
>   x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models
>
>  arch/x86/kernel/cpu/mce/amd.c      | 62 ++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/mce/core.c     | 38 ++++--------------
>  arch/x86/kernel/cpu/mce/genpool.c  |  3 ++
>  arch/x86/kernel/cpu/mce/internal.h |  9 +++++
>  drivers/edac/mce_amd.c             |  4 +-
>  5 files changed, 84 insertions(+), 32 deletions(-)
>


Applied to bionic/master-next branch.

Please note that all commits targeted to Disco have already been applied
as part of LP: #1836614 ("Disco update: 5.0.18 upstream stable release")
and Cosmic is reaching EOL.



Thanks,
Kleber
