[SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

[SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168

[Impact]
Surprising thermal shutdown at boot on Intel based mobile workstations.

[Fix]
Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
shutdown the system.

These critial temperatures are for usespace to handle, so let kernel
know it shouldn't handle it.

For Groovy, a patch that removes .notify callback is dropped.

[Test]
Use reboot stress as a reproducer. 5% chance to see a surprising
shutdown at boot.

With the fix applied, the thermal shutdown is no longer reproducible.

[Where problems could occur]
For ACPI based platforms, we still have "acpitz" to protect systems from
overheating. If these acpitz sensors don't work, then the system could
face real overheating issue.

Daniel Lezcano (4):
  thermal/core: Emit a warning if the thermal zone is updated without
    ops
  thermal/core: Add critical and hot ops
  thermal/drivers/acpi: Use hot and critical ops
  thermal/drivers/rcar: Remove notification usage

Kai-Heng Feng (2):
  thermal: int340x: Fix unexpected shutdown at critical temperature
  thermal: intel: pch: Fix unexpected shutdown at critical temperature

 drivers/acpi/thermal.c                        | 30 ++++++------
 .../int340x_thermal/int340x_thermal_zone.c    |  6 +++
 drivers/thermal/intel/intel_pch_thermal.c     |  6 +++
 drivers/thermal/rcar_thermal.c                | 19 --------
 drivers/thermal/thermal_core.c                | 46 ++++++++++++-------
 include/linux/thermal.h                       |  3 ++
 6 files changed, 58 insertions(+), 52 deletions(-)

--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[PATCH 1/6] thermal/core: Emit a warning if the thermal zone is updated without ops

Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

The actual code is silently ignoring a thermal zone update when a
driver is requesting it without a get_temp ops set.

That looks not correct, as the caller should not have called this
function if the thermal zone is unable to read the temperature.

That makes the code less robust as the check won't detect the driver
is inconsistently using the thermal API and that does not help to
improve the framework as these circumvolutions hide the problem at the
source.

In order to detect the situation when it happens, let's add a warning
when the update is requested without the get_temp() ops set.

Any warning emitted will have to be fixed at the source of the
problem: the caller must not call thermal_zone_device_update if there
is not get_temp callback set.

Cc: Thara Gopinath <[hidden email]>
Cc: Amit Kucheria <[hidden email]>
Cc: [hidden email]
Cc: [hidden email]
Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Lukasz Luba <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-1-daniel.lezcano@...
(cherry picked from commit 433178e75834dc35f1ae79b56ec2cf396f2c6f3c)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/thermal_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index b71196eaf90e..749001a88b44 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -467,7 +467,8 @@ void thermal_zone_device_update(struct thermal_zone_device *tz,
  if (atomic_read(&in_suspend))
  return;
 
- if (!tz->ops->get_temp)
+ if (WARN_ONCE(!tz->ops->get_temp, "'%s' must not be called without "
+      "'get_temp' ops set\n", __func__))
  return;
 
  update_temperature(tz);
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[PATCH 2/6] thermal/core: Add critical and hot ops

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

Currently there is no way to the sensors to directly call an ops in
interrupt mode without calling thermal_zone_device_update assuming all
the trip points are defined.

A sensor may want to do something special if a trip point is hot or
critical.

This patch adds the critical and hot ops to the thermal zone device,
so a sensor can directly invoke them or let the thermal framework to
call the sensor specific ones.

Tested-by: Kai-Heng Feng <[hidden email]>
Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Lukasz Luba <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-2-daniel.lezcano@...
(backported from commit d7203eedf4f68e9909fd489453168a9d26bf0c3d)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/thermal_core.c | 43 +++++++++++++++++++++-------------
 include/linux/thermal.h        |  3 +++
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 749001a88b44..aba84e1a4396 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -369,6 +369,25 @@ static void thermal_emergency_poweroff(void)
       msecs_to_jiffies(poweroff_delay_ms));
 }
 
+void thermal_zone_device_critical(struct thermal_zone_device *tz)
+{
+ dev_emerg(&tz->device, "%s: critical temperature reached, "
+  "shutting down\n", tz->type);
+
+ mutex_lock(&poweroff_lock);
+ if (!power_off_triggered) {
+ /*
+ * Queue a backup emergency shutdown in the event of
+ * orderly_poweroff failure
+ */
+ thermal_emergency_poweroff();
+ orderly_poweroff(true);
+ power_off_triggered = true;
+ }
+ mutex_unlock(&poweroff_lock);
+}
+EXPORT_SYMBOL(thermal_zone_device_critical);
+
 static void handle_critical_trips(struct thermal_zone_device *tz,
   int trip, enum thermal_trip_type trip_type)
 {
@@ -385,22 +404,10 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
  if (tz->ops->notify)
  tz->ops->notify(tz, trip, trip_type);
 
- if (trip_type == THERMAL_TRIP_CRITICAL) {
- dev_emerg(&tz->device,
-  "critical temperature reached (%d C), shutting down\n",
-  tz->temperature / 1000);
- mutex_lock(&poweroff_lock);
- if (!power_off_triggered) {
- /*
- * Queue a backup emergency shutdown in the event of
- * orderly_poweroff failure
- */
- thermal_emergency_poweroff();
- orderly_poweroff(true);
- power_off_triggered = true;
- }
- mutex_unlock(&poweroff_lock);
- }
+ if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot)
+ tz->ops->hot(tz);
+ else if (trip_type == THERMAL_TRIP_CRITICAL)
+ tz->ops->critical(tz);
 }
 
 static void handle_thermal_trip(struct thermal_zone_device *tz, int trip)
@@ -1273,6 +1280,10 @@ thermal_zone_device_register(const char *type, int trips, int mask,
 
  tz->id = id;
  strlcpy(tz->type, type, sizeof(tz->type));
+
+ if (!ops->critical)
+ ops->critical = thermal_zone_device_critical;
+
  tz->ops = ops;
  tz->tzp = tzp;
  tz->device.class = &thermal_class;
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 216185bb3014..e7989cec090a 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -92,6 +92,8 @@ struct thermal_zone_device_ops {
   enum thermal_trend *);
  int (*notify) (struct thermal_zone_device *, int,
        enum thermal_trip_type);
+ void (*hot)(struct thermal_zone_device *);
+ void (*critical)(struct thermal_zone_device *);
 };
 
 struct thermal_cooling_device_ops {
@@ -416,6 +418,7 @@ int thermal_zone_get_offset(struct thermal_zone_device *tz);
 
 void thermal_cdev_update(struct thermal_cooling_device *);
 void thermal_notify_framework(struct thermal_zone_device *, int);
+void thermal_zone_device_critical(struct thermal_zone_device *tz);
 #else
 static inline struct thermal_zone_device *thermal_zone_device_register(
  const char *type, int trips, int mask, void *devdata,
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[PATCH 3/6] thermal/drivers/acpi: Use hot and critical ops

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

The acpi driver wants to do a netlink notification in case of a hot or
critical trip point. Implement the corresponding ops to be used for
the thermal zone and use them instead of the notify ops.

Signed-off-by: Daniel Lezcano <[hidden email]>
Acked-by: Srinivas Pandruvada <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-3-daniel.lezcano@...
(cherry picked from commit a73cb2024caa3480263a009dce91fa581b3748bf linux-next)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/acpi/thermal.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 19067a5e5293..3412fee26915 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -725,27 +725,24 @@ static int thermal_get_trend(struct thermal_zone_device *thermal,
  return 0;
 }
 
-
-static int thermal_notify(struct thermal_zone_device *thermal, int trip,
-   enum thermal_trip_type trip_type)
+static void acpi_thermal_zone_device_hot(struct thermal_zone_device *thermal)
 {
- u8 type = 0;
  struct acpi_thermal *tz = thermal->devdata;
 
- if (trip_type == THERMAL_TRIP_CRITICAL)
- type = ACPI_THERMAL_NOTIFY_CRITICAL;
- else if (trip_type == THERMAL_TRIP_HOT)
- type = ACPI_THERMAL_NOTIFY_HOT;
- else
- return 0;
-
  acpi_bus_generate_netlink_event(tz->device->pnp.device_class,
- dev_name(&tz->device->dev), type, 1);
+ dev_name(&tz->device->dev),
+ ACPI_THERMAL_NOTIFY_HOT, 1);
+}
 
- if (trip_type == THERMAL_TRIP_CRITICAL && nocrt)
- return 1;
+static void acpi_thermal_zone_device_critical(struct thermal_zone_device *thermal)
+{
+ struct acpi_thermal *tz = thermal->devdata;
 
- return 0;
+ acpi_bus_generate_netlink_event(tz->device->pnp.device_class,
+ dev_name(&tz->device->dev),
+ ACPI_THERMAL_NOTIFY_CRITICAL, 1);
+
+ thermal_zone_device_critical(thermal);
 }
 
 static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal,
@@ -862,7 +859,8 @@ static struct thermal_zone_device_ops acpi_thermal_zone_ops = {
  .get_trip_temp = thermal_get_trip_temp,
  .get_crit_temp = thermal_get_crit_temp,
  .get_trend = thermal_get_trend,
- .notify = thermal_notify,
+ .hot = acpi_thermal_zone_device_hot,
+ .critical = acpi_thermal_zone_device_critical,
 };
 
 static int acpi_thermal_register_thermal_zone(struct acpi_thermal *tz)
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[PATCH 4/6] thermal/drivers/rcar: Remove notification usage

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

The ops is only showing a trace telling a critical trip point is
crossed. The same information is given by the thermal framework.

This is redundant, remove the code.

Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Niklas Söderlund <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-4-daniel.lezcano@...
(cherry picked from commit 1fa34e49e4b7e66214a1d15261c0224d60366eec)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/rcar_thermal.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/drivers/thermal/rcar_thermal.c b/drivers/thermal/rcar_thermal.c
index 46aeb28b4e90..f490e3668e2a 100644
--- a/drivers/thermal/rcar_thermal.c
+++ b/drivers/thermal/rcar_thermal.c
@@ -323,24 +323,6 @@ static int rcar_thermal_get_trip_temp(struct thermal_zone_device *zone,
  return 0;
 }
 
-static int rcar_thermal_notify(struct thermal_zone_device *zone,
-       int trip, enum thermal_trip_type type)
-{
- struct rcar_thermal_priv *priv = rcar_zone_to_priv(zone);
- struct device *dev = rcar_priv_to_dev(priv);
-
- switch (type) {
- case THERMAL_TRIP_CRITICAL:
- /* FIXME */
- dev_warn(dev, "Thermal reached to critical temperature\n");
- break;
- default:
- break;
- }
-
- return 0;
-}
-
 static const struct thermal_zone_of_device_ops rcar_thermal_zone_of_ops = {
  .get_temp = rcar_thermal_of_get_temp,
 };
@@ -349,7 +331,6 @@ static struct thermal_zone_device_ops rcar_thermal_zone_ops = {
  .get_temp = rcar_thermal_get_temp,
  .get_trip_type = rcar_thermal_get_trip_type,
  .get_trip_temp = rcar_thermal_get_trip_temp,
- .notify = rcar_thermal_notify,
 };
 
 /*
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[PATCH 5/6] thermal: int340x: Fix unexpected shutdown at critical temperature

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168

We are seeing thermal shutdown on Intel based mobile workstations, the
shutdown happens during the first trip handle in
thermal_zone_device_register():
kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down

However, we shouldn't do a thermal shutdown here, since
1) We may want to use a dedicated daemon, Intel's thermald in this case,
to handle thermal shutdown.

2) For ACPI based system, _CRT doesn't mean shutdown unless it's inside
ThermalZone namespace. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
"... If this object it present under a device, the device’s driver
evaluates this object to determine the device’s critical cooling
temperature trip point. This value may then be used by the device’s
driver to program an internal device temperature sensor trip point."

So a "critical trip" here merely means we should take a more aggressive
cooling method.

As int340x device isn't present under ACPI ThermalZone, override the
default .critical callback to prevent surprising thermal shutdown.

Signed-off-by: Kai-Heng Feng <[hidden email]>
Signed-off-by: Daniel Lezcano <[hidden email]>
Link: https://lore.kernel.org/r/20201221172345.36976-1-kai.heng.feng@...
(cherry picked from commit dd47366aaa9b93ac3d97cb4ee7641d38a28a771e linux-next)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 .../thermal/intel/int340x_thermal/int340x_thermal_zone.c    | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
index 432213272f1e..eeb08fab19a7 100644
--- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
+++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
@@ -146,12 +146,18 @@ static int int340x_thermal_get_trip_hyst(struct thermal_zone_device *zone,
  return 0;
 }
 
+static void int340x_thermal_critical(struct thermal_zone_device *zone)
+{
+ dev_dbg(&zone->device, "%s: critical temperature reached\n", zone->type);
+}
+
 static struct thermal_zone_device_ops int340x_thermal_zone_ops = {
  .get_temp       = int340x_thermal_get_zone_temp,
  .get_trip_temp = int340x_thermal_get_trip_temp,
  .get_trip_type = int340x_thermal_get_trip_type,
  .set_trip_temp = int340x_thermal_set_trip_temp,
  .get_trip_hyst =  int340x_thermal_get_trip_hyst,
+ .critical = int340x_thermal_critical,
 };
 
 static int int340x_thermal_get_trip_config(acpi_handle handle, char *name,
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[PATCH 6/6] thermal: intel: pch: Fix unexpected shutdown at critical temperature

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168

Like previous patch, the intel_pch_thermal device is not in ACPI
ThermalZone namespace, so a critical trip doesn't mean shutdown.

Override the default .critical callback to prevent surprising thermal
shutdoown.

Signed-off-by: Kai-Heng Feng <[hidden email]>
Signed-off-by: Daniel Lezcano <[hidden email]>
Link: https://lore.kernel.org/r/20201221172345.36976-2-kai.heng.feng@...
(cherry picked from commit 03671968d0bf2db598f7e3aa98f190b76c1bb4ff linux-next)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/intel/intel_pch_thermal.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c
index 56401fd4708d..7bccf588c15e 100644
--- a/drivers/thermal/intel/intel_pch_thermal.c
+++ b/drivers/thermal/intel/intel_pch_thermal.c
@@ -263,10 +263,16 @@ static int pch_get_trip_temp(struct thermal_zone_device *tzd, int trip, int *tem
  return 0;
 }
 
+static void pch_critical(struct thermal_zone_device *tzd)
+{
+ dev_dbg(&tzd->device, "%s: critical temperature reached\n", tzd->type);
+}
+
 static struct thermal_zone_device_ops tzd_ops = {
  .get_temp = pch_thermal_get_temp,
  .get_trip_type = pch_get_trip_type,
  .get_trip_temp = pch_get_trip_temp,
+ .critical = pch_critical,
 };
 
 enum board_ids {
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

ACK/Cmnt: [SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Stefan Bader-2
In reply to this post by Kai-Heng Feng
On 27.01.21 04:05, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1906168
>
> [Impact]
> Surprising thermal shutdown at boot on Intel based mobile workstations.
>
> [Fix]
> Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> shutdown the system.
>
> These critial temperatures are for usespace to handle, so let kernel
> know it shouldn't handle it.
>
> For Groovy, a patch that removes .notify callback is dropped.
>
> [Test]
> Use reboot stress as a reproducer. 5% chance to see a surprising
> shutdown at boot.
>
> With the fix applied, the thermal shutdown is no longer reproducible.
>
> [Where problems could occur]
> For ACPI based platforms, we still have "acpitz" to protect systems from
> overheating. If these acpitz sensors don't work, then the system could
> face real overheating issue.
>
> Daniel Lezcano (4):
>   thermal/core: Emit a warning if the thermal zone is updated without
>     ops
>   thermal/core: Add critical and hot ops
>   thermal/drivers/acpi: Use hot and critical ops
>   thermal/drivers/rcar: Remove notification usage
>
> Kai-Heng Feng (2):
>   thermal: int340x: Fix unexpected shutdown at critical temperature
>   thermal: intel: pch: Fix unexpected shutdown at critical temperature
>
>  drivers/acpi/thermal.c                        | 30 ++++++------
>  .../int340x_thermal/int340x_thermal_zone.c    |  6 +++
>  drivers/thermal/intel/intel_pch_thermal.c     |  6 +++
>  drivers/thermal/rcar_thermal.c                | 19 --------
>  drivers/thermal/thermal_core.c                | 46 ++++++++++++-------
>  include/linux/thermal.h                       |  3 ++
>  6 files changed, 58 insertions(+), 52 deletions(-)
>
The actual fix seem to be patches #5 and #6. The other patches add some new
hooks which the fixes depend on or rate limit some warning (probably not
strictly needed but makes things more bearable). Just to double check on patch
#4: is that still needed now that notify remains? I don't think it is fatal to
remove the message but I wonder whether there would really be anything else
issuing the warning without adding a critical hook.

Povisional Ack for the set but should clarify the need for patch #4.

Acked-by: Stefan Bader <[hidden email]>

Thanks,
Stefan


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Paolo Pisati-5
In reply to this post by Kai-Heng Feng
On Wed, Jan 27, 2021 at 11:05:11AM +0800, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1906168
>
> [Impact]
> Surprising thermal shutdown at boot on Intel based mobile workstations.
>
> [Fix]
> Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> shutdown the system.
>
> These critial temperatures are for usespace to handle, so let kernel
> know it shouldn't handle it.
>
> For Groovy, a patch that removes .notify callback is dropped.
>
> [Test]
> Use reboot stress as a reproducer. 5% chance to see a surprising
> shutdown at boot.
>
> With the fix applied, the thermal shutdown is no longer reproducible.
>
> [Where problems could occur]
> For ACPI based platforms, we still have "acpitz" to protect systems from
> overheating. If these acpitz sensors don't work, then the system could
> face real overheating issue.
>
> Daniel Lezcano (4):
>   thermal/core: Emit a warning if the thermal zone is updated without
>     ops
>   thermal/core: Add critical and hot ops
>   thermal/drivers/acpi: Use hot and critical ops
>   thermal/drivers/rcar: Remove notification usage
>
> Kai-Heng Feng (2):
>   thermal: int340x: Fix unexpected shutdown at critical temperature
>   thermal: intel: pch: Fix unexpected shutdown at critical temperature

Hi Kai-Heng,

looks like this v2 patchset is identical to v1 but dropped:

commit 2079c0fe7fd0b46bc00e341205aa325bae7406b7
Author: Daniel Lezcano <[hidden email]>
Date:   Thu Jan 21 16:49:00 2021 +0800

    thermal/core: Remove notify ops

and in Hirsute (5.10+) we have the v1 applied - shall we keep v1 applied? Or
shall we drop the above patch?
--
bye,
p.

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

Re: ACK/Cmnt: [SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Kai-Heng Feng
In reply to this post by Stefan Bader-2
On Wed, Jan 27, 2021 at 4:30 PM Stefan Bader <[hidden email]> wrote:

>
> On 27.01.21 04:05, Kai-Heng Feng wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1906168
> >
> > [Impact]
> > Surprising thermal shutdown at boot on Intel based mobile workstations.
> >
> > [Fix]
> > Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> > shutdown the system.
> >
> > These critial temperatures are for usespace to handle, so let kernel
> > know it shouldn't handle it.
> >
> > For Groovy, a patch that removes .notify callback is dropped.
> >
> > [Test]
> > Use reboot stress as a reproducer. 5% chance to see a surprising
> > shutdown at boot.
> >
> > With the fix applied, the thermal shutdown is no longer reproducible.
> >
> > [Where problems could occur]
> > For ACPI based platforms, we still have "acpitz" to protect systems from
> > overheating. If these acpitz sensors don't work, then the system could
> > face real overheating issue.
> >
> > Daniel Lezcano (4):
> >   thermal/core: Emit a warning if the thermal zone is updated without
> >     ops
> >   thermal/core: Add critical and hot ops
> >   thermal/drivers/acpi: Use hot and critical ops
> >   thermal/drivers/rcar: Remove notification usage
> >
> > Kai-Heng Feng (2):
> >   thermal: int340x: Fix unexpected shutdown at critical temperature
> >   thermal: intel: pch: Fix unexpected shutdown at critical temperature
> >
> >  drivers/acpi/thermal.c                        | 30 ++++++------
> >  .../int340x_thermal/int340x_thermal_zone.c    |  6 +++
> >  drivers/thermal/intel/intel_pch_thermal.c     |  6 +++
> >  drivers/thermal/rcar_thermal.c                | 19 --------
> >  drivers/thermal/thermal_core.c                | 46 ++++++++++++-------
> >  include/linux/thermal.h                       |  3 ++
> >  6 files changed, 58 insertions(+), 52 deletions(-)
> >
>
> The actual fix seem to be patches #5 and #6. The other patches add some new
> hooks which the fixes depend on or rate limit some warning (probably not
> strictly needed but makes things more bearable). Just to double check on patch
> #4: is that still needed now that notify remains? I don't think it is fatal to
> remove the message but I wonder whether there would really be anything else
> issuing the warning without adding a critical hook.

The patch does nothing, so I'd like to keep as close to upstream as possible.
So I prefer not to drop it.

Kai-Heng

>
> Povisional Ack for the set but should clarify the need for patch #4.
>
> Acked-by: Stefan Bader <[hidden email]>
>
> Thanks,
> Stefan
>

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

Re: [SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Kai-Heng Feng
In reply to this post by Paolo Pisati-5
On Wed, Jan 27, 2021 at 7:02 PM Paolo Pisati <[hidden email]> wrote:

>
> On Wed, Jan 27, 2021 at 11:05:11AM +0800, Kai-Heng Feng wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1906168
> >
> > [Impact]
> > Surprising thermal shutdown at boot on Intel based mobile workstations.
> >
> > [Fix]
> > Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> > shutdown the system.
> >
> > These critial temperatures are for usespace to handle, so let kernel
> > know it shouldn't handle it.
> >
> > For Groovy, a patch that removes .notify callback is dropped.
> >
> > [Test]
> > Use reboot stress as a reproducer. 5% chance to see a surprising
> > shutdown at boot.
> >
> > With the fix applied, the thermal shutdown is no longer reproducible.
> >
> > [Where problems could occur]
> > For ACPI based platforms, we still have "acpitz" to protect systems from
> > overheating. If these acpitz sensors don't work, then the system could
> > face real overheating issue.
> >
> > Daniel Lezcano (4):
> >   thermal/core: Emit a warning if the thermal zone is updated without
> >     ops
> >   thermal/core: Add critical and hot ops
> >   thermal/drivers/acpi: Use hot and critical ops
> >   thermal/drivers/rcar: Remove notification usage
> >
> > Kai-Heng Feng (2):
> >   thermal: int340x: Fix unexpected shutdown at critical temperature
> >   thermal: intel: pch: Fix unexpected shutdown at critical temperature
>
> Hi Kai-Heng,
>
> looks like this v2 patchset is identical to v1 but dropped:
>
> commit 2079c0fe7fd0b46bc00e341205aa325bae7406b7
> Author: Daniel Lezcano <[hidden email]>
> Date:   Thu Jan 21 16:49:00 2021 +0800
>
>     thermal/core: Remove notify ops
>
> and in Hirsute (5.10+) we have the v1 applied - shall we keep v1 applied? Or
> shall we drop the above patch?

Please keep v1 applied. Since Hirsute isn't a stable kernel yet, I
hope it can be as close as upstream version.

Kai-Heng

> --
> bye,
> p.

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

ACK: [SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Kleber Souza
In reply to this post by Kai-Heng Feng
On 27.01.21 04:05, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1906168
>
> [Impact]
> Surprising thermal shutdown at boot on Intel based mobile workstations.
>
> [Fix]
> Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> shutdown the system.
>
> These critial temperatures are for usespace to handle, so let kernel
> know it shouldn't handle it.
>
> For Groovy, a patch that removes .notify callback is dropped.
>
> [Test]
> Use reboot stress as a reproducer. 5% chance to see a surprising
> shutdown at boot.
>
> With the fix applied, the thermal shutdown is no longer reproducible.
>
> [Where problems could occur]
> For ACPI based platforms, we still have "acpitz" to protect systems from
> overheating. If these acpitz sensors don't work, then the system could
> face real overheating issue.
>
> Daniel Lezcano (4):
>    thermal/core: Emit a warning if the thermal zone is updated without
>      ops
>    thermal/core: Add critical and hot ops
>    thermal/drivers/acpi: Use hot and critical ops
>    thermal/drivers/rcar: Remove notification usage
>
> Kai-Heng Feng (2):
>    thermal: int340x: Fix unexpected shutdown at critical temperature
>    thermal: intel: pch: Fix unexpected shutdown at critical temperature
>
>   drivers/acpi/thermal.c                        | 30 ++++++------
>   .../int340x_thermal/int340x_thermal_zone.c    |  6 +++
>   drivers/thermal/intel/intel_pch_thermal.c     |  6 +++
>   drivers/thermal/rcar_thermal.c                | 19 --------
>   drivers/thermal/thermal_core.c                | 46 ++++++++++++-------
>   include/linux/thermal.h                       |  3 ++
>   6 files changed, 58 insertions(+), 52 deletions(-)
>

Acked-by: Kleber Sacilotto de Souza <[hidden email]>

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

APPLIED: [SRU v2] [G] [PATCH 0/6] Prevent thermal shutdown during boot process

Kelsey Skunberg
In reply to this post by Kai-Heng Feng
Applied to Groovy/master-next. Thank you!

-Kelsey

On 2021-01-27 11:05:11 , Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1906168
>
> [Impact]
> Surprising thermal shutdown at boot on Intel based mobile workstations.
>
> [Fix]
> Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> shutdown the system.
>
> These critial temperatures are for usespace to handle, so let kernel
> know it shouldn't handle it.
>
> For Groovy, a patch that removes .notify callback is dropped.
>
> [Test]
> Use reboot stress as a reproducer. 5% chance to see a surprising
> shutdown at boot.
>
> With the fix applied, the thermal shutdown is no longer reproducible.
>
> [Where problems could occur]
> For ACPI based platforms, we still have "acpitz" to protect systems from
> overheating. If these acpitz sensors don't work, then the system could
> face real overheating issue.
>
> Daniel Lezcano (4):
>   thermal/core: Emit a warning if the thermal zone is updated without
>     ops
>   thermal/core: Add critical and hot ops
>   thermal/drivers/acpi: Use hot and critical ops
>   thermal/drivers/rcar: Remove notification usage
>
> Kai-Heng Feng (2):
>   thermal: int340x: Fix unexpected shutdown at critical temperature
>   thermal: intel: pch: Fix unexpected shutdown at critical temperature
>
>  drivers/acpi/thermal.c                        | 30 ++++++------
>  .../int340x_thermal/int340x_thermal_zone.c    |  6 +++
>  drivers/thermal/intel/intel_pch_thermal.c     |  6 +++
>  drivers/thermal/rcar_thermal.c                | 19 --------
>  drivers/thermal/thermal_core.c                | 46 ++++++++++++-------
>  include/linux/thermal.h                       |  3 ++
>  6 files changed, 58 insertions(+), 52 deletions(-)
>
> --
> 2.29.2
>
>
> --
> kernel-team mailing list
> [hidden email]
> https://lists.ubuntu.com/mailman/listinfo/kernel-team

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team