[SRU] [G/Unstable/OEM-5.10] [PATCH 0/7] Prevent thermal shutdown during boot process

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

[SRU] [G/Unstable/OEM-5.10] [PATCH 0/7] Prevent thermal shutdown during boot process

Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168

[Impact]
Surprising thermal shutdown at boot on Intel based mobile workstations.

[Fix]
Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
shutdown the system.

These critial temperatures are for usespace to handle, so let kernel
know it shouldn't handle it.

[Test]
Use reboot stress as a reproducer. 5% chance to see a surprising
shutdown at boot.

With the fix applied, the thermal shutdown is no longer reproducible.

[Where problems could occur]
For ACPI based platforms, we still have "acpitz" to protect systems from
overheating. If these acpitz sensors don't work, then the system could
face real overheating issue.

Daniel Lezcano (5):
  thermal/core: Emit a warning if the thermal zone is updated without
    ops
  thermal/core: Add critical and hot ops
  thermal/drivers/acpi: Use hot and critical ops
  thermal/drivers/rcar: Remove notification usage
  thermal/core: Remove notify ops

Kai-Heng Feng (2):
  thermal: int340x: Fix unexpected shutdown at critical temperature
  thermal: intel: pch: Fix unexpected shutdown at critical temperature

 drivers/acpi/thermal.c                        | 30 ++++++------
 .../int340x_thermal/int340x_thermal_zone.c    |  6 +++
 drivers/thermal/intel/intel_pch_thermal.c     |  6 +++
 drivers/thermal/rcar_thermal.c                | 19 -------
 drivers/thermal/thermal_core.c                | 49 +++++++++++--------
 include/linux/thermal.h                       |  5 +-
 6 files changed, 58 insertions(+), 57 deletions(-)

--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[G/OEM-5.10] [PATCH 1/7] thermal/core: Emit a warning if the thermal zone is updated without ops

Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

The actual code is silently ignoring a thermal zone update when a
driver is requesting it without a get_temp ops set.

That looks not correct, as the caller should not have called this
function if the thermal zone is unable to read the temperature.

That makes the code less robust as the check won't detect the driver
is inconsistently using the thermal API and that does not help to
improve the framework as these circumvolutions hide the problem at the
source.

In order to detect the situation when it happens, let's add a warning
when the update is requested without the get_temp() ops set.

Any warning emitted will have to be fixed at the source of the
problem: the caller must not call thermal_zone_device_update if there
is not get_temp callback set.

Cc: Thara Gopinath <[hidden email]>
Cc: Amit Kucheria <[hidden email]>
Cc: [hidden email]
Cc: [hidden email]
Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Lukasz Luba <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-1-daniel.lezcano@...
(cherry picked from commit 433178e75834dc35f1ae79b56ec2cf396f2c6f3c)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/thermal_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index b71196eaf90e..749001a88b44 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -467,7 +467,8 @@ void thermal_zone_device_update(struct thermal_zone_device *tz,
  if (atomic_read(&in_suspend))
  return;
 
- if (!tz->ops->get_temp)
+ if (WARN_ONCE(!tz->ops->get_temp, "'%s' must not be called without "
+      "'get_temp' ops set\n", __func__))
  return;
 
  update_temperature(tz);
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[G] [PATCH 2/7] thermal/core: Add critical and hot ops

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

Currently there is no way to the sensors to directly call an ops in
interrupt mode without calling thermal_zone_device_update assuming all
the trip points are defined.

A sensor may want to do something special if a trip point is hot or
critical.

This patch adds the critical and hot ops to the thermal zone device,
so a sensor can directly invoke them or let the thermal framework to
call the sensor specific ones.

Tested-by: Kai-Heng Feng <[hidden email]>
Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Lukasz Luba <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-2-daniel.lezcano@...
(backported from commit d7203eedf4f68e9909fd489453168a9d26bf0c3d)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/thermal_core.c | 43 +++++++++++++++++++++-------------
 include/linux/thermal.h        |  3 +++
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 749001a88b44..aba84e1a4396 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -369,6 +369,25 @@ static void thermal_emergency_poweroff(void)
       msecs_to_jiffies(poweroff_delay_ms));
 }
 
+void thermal_zone_device_critical(struct thermal_zone_device *tz)
+{
+ dev_emerg(&tz->device, "%s: critical temperature reached, "
+  "shutting down\n", tz->type);
+
+ mutex_lock(&poweroff_lock);
+ if (!power_off_triggered) {
+ /*
+ * Queue a backup emergency shutdown in the event of
+ * orderly_poweroff failure
+ */
+ thermal_emergency_poweroff();
+ orderly_poweroff(true);
+ power_off_triggered = true;
+ }
+ mutex_unlock(&poweroff_lock);
+}
+EXPORT_SYMBOL(thermal_zone_device_critical);
+
 static void handle_critical_trips(struct thermal_zone_device *tz,
   int trip, enum thermal_trip_type trip_type)
 {
@@ -385,22 +404,10 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
  if (tz->ops->notify)
  tz->ops->notify(tz, trip, trip_type);
 
- if (trip_type == THERMAL_TRIP_CRITICAL) {
- dev_emerg(&tz->device,
-  "critical temperature reached (%d C), shutting down\n",
-  tz->temperature / 1000);
- mutex_lock(&poweroff_lock);
- if (!power_off_triggered) {
- /*
- * Queue a backup emergency shutdown in the event of
- * orderly_poweroff failure
- */
- thermal_emergency_poweroff();
- orderly_poweroff(true);
- power_off_triggered = true;
- }
- mutex_unlock(&poweroff_lock);
- }
+ if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot)
+ tz->ops->hot(tz);
+ else if (trip_type == THERMAL_TRIP_CRITICAL)
+ tz->ops->critical(tz);
 }
 
 static void handle_thermal_trip(struct thermal_zone_device *tz, int trip)
@@ -1273,6 +1280,10 @@ thermal_zone_device_register(const char *type, int trips, int mask,
 
  tz->id = id;
  strlcpy(tz->type, type, sizeof(tz->type));
+
+ if (!ops->critical)
+ ops->critical = thermal_zone_device_critical;
+
  tz->ops = ops;
  tz->tzp = tzp;
  tz->device.class = &thermal_class;
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 216185bb3014..e7989cec090a 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -92,6 +92,8 @@ struct thermal_zone_device_ops {
   enum thermal_trend *);
  int (*notify) (struct thermal_zone_device *, int,
        enum thermal_trip_type);
+ void (*hot)(struct thermal_zone_device *);
+ void (*critical)(struct thermal_zone_device *);
 };
 
 struct thermal_cooling_device_ops {
@@ -416,6 +418,7 @@ int thermal_zone_get_offset(struct thermal_zone_device *tz);
 
 void thermal_cdev_update(struct thermal_cooling_device *);
 void thermal_notify_framework(struct thermal_zone_device *, int);
+void thermal_zone_device_critical(struct thermal_zone_device *tz);
 #else
 static inline struct thermal_zone_device *thermal_zone_device_register(
  const char *type, int trips, int mask, void *devdata,
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[OEM-5.10] [PATCH] thermal/core: Add critical and hot ops

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

Currently there is no way to the sensors to directly call an ops in
interrupt mode without calling thermal_zone_device_update assuming all
the trip points are defined.

A sensor may want to do something special if a trip point is hot or
critical.

This patch adds the critical and hot ops to the thermal zone device,
so a sensor can directly invoke them or let the thermal framework to
call the sensor specific ones.

Tested-by: Kai-Heng Feng <[hidden email]>
Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Lukasz Luba <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-2-daniel.lezcano@...
(cherry picked from commit d7203eedf4f68e9909fd489453168a9d26bf0c3d)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/thermal_core.c | 43 +++++++++++++++++++++-------------
 include/linux/thermal.h        |  3 +++
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index d819c98ebb9d..5b83fb9d9b7d 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -380,6 +380,25 @@ static void thermal_emergency_poweroff(void)
       msecs_to_jiffies(poweroff_delay_ms));
 }
 
+void thermal_zone_device_critical(struct thermal_zone_device *tz)
+{
+ dev_emerg(&tz->device, "%s: critical temperature reached, "
+  "shutting down\n", tz->type);
+
+ mutex_lock(&poweroff_lock);
+ if (!power_off_triggered) {
+ /*
+ * Queue a backup emergency shutdown in the event of
+ * orderly_poweroff failure
+ */
+ thermal_emergency_poweroff();
+ orderly_poweroff(true);
+ power_off_triggered = true;
+ }
+ mutex_unlock(&poweroff_lock);
+}
+EXPORT_SYMBOL(thermal_zone_device_critical);
+
 static void handle_critical_trips(struct thermal_zone_device *tz,
   int trip, enum thermal_trip_type trip_type)
 {
@@ -396,22 +415,10 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
  if (tz->ops->notify)
  tz->ops->notify(tz, trip, trip_type);
 
- if (trip_type == THERMAL_TRIP_CRITICAL) {
- dev_emerg(&tz->device,
-  "critical temperature reached (%d C), shutting down\n",
-  tz->temperature / 1000);
- mutex_lock(&poweroff_lock);
- if (!power_off_triggered) {
- /*
- * Queue a backup emergency shutdown in the event of
- * orderly_poweroff failure
- */
- thermal_emergency_poweroff();
- orderly_poweroff(true);
- power_off_triggered = true;
- }
- mutex_unlock(&poweroff_lock);
- }
+ if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot)
+ tz->ops->hot(tz);
+ else if (trip_type == THERMAL_TRIP_CRITICAL)
+ tz->ops->critical(tz);
 }
 
 static void handle_thermal_trip(struct thermal_zone_device *tz, int trip)
@@ -1424,6 +1431,10 @@ thermal_zone_device_register(const char *type, int trips, int mask,
 
  tz->id = id;
  strlcpy(tz->type, type, sizeof(tz->type));
+
+ if (!ops->critical)
+ ops->critical = thermal_zone_device_critical;
+
  tz->ops = ops;
  tz->tzp = tzp;
  tz->device.class = &thermal_class;
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index d07ea27e72a9..31b84404f047 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -79,6 +79,8 @@ struct thermal_zone_device_ops {
   enum thermal_trend *);
  int (*notify) (struct thermal_zone_device *, int,
        enum thermal_trip_type);
+ void (*hot)(struct thermal_zone_device *);
+ void (*critical)(struct thermal_zone_device *);
 };
 
 struct thermal_cooling_device_ops {
@@ -399,6 +401,7 @@ void thermal_cdev_update(struct thermal_cooling_device *);
 void thermal_notify_framework(struct thermal_zone_device *, int);
 int thermal_zone_device_enable(struct thermal_zone_device *tz);
 int thermal_zone_device_disable(struct thermal_zone_device *tz);
+void thermal_zone_device_critical(struct thermal_zone_device *tz);
 #else
 static inline struct thermal_zone_device *thermal_zone_device_register(
  const char *type, int trips, int mask, void *devdata,
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[G/Unstable/OEM-5.10] [PATCH 3/7] thermal/drivers/acpi: Use hot and critical ops

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

The acpi driver wants to do a netlink notification in case of a hot or
critical trip point. Implement the corresponding ops to be used for
the thermal zone and use them instead of the notify ops.

Signed-off-by: Daniel Lezcano <[hidden email]>
Acked-by: Srinivas Pandruvada <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-3-daniel.lezcano@...
(cherry picked from commit a73cb2024caa3480263a009dce91fa581b3748bf linux-next)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/acpi/thermal.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 19067a5e5293..3412fee26915 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -725,27 +725,24 @@ static int thermal_get_trend(struct thermal_zone_device *thermal,
  return 0;
 }
 
-
-static int thermal_notify(struct thermal_zone_device *thermal, int trip,
-   enum thermal_trip_type trip_type)
+static void acpi_thermal_zone_device_hot(struct thermal_zone_device *thermal)
 {
- u8 type = 0;
  struct acpi_thermal *tz = thermal->devdata;
 
- if (trip_type == THERMAL_TRIP_CRITICAL)
- type = ACPI_THERMAL_NOTIFY_CRITICAL;
- else if (trip_type == THERMAL_TRIP_HOT)
- type = ACPI_THERMAL_NOTIFY_HOT;
- else
- return 0;
-
  acpi_bus_generate_netlink_event(tz->device->pnp.device_class,
- dev_name(&tz->device->dev), type, 1);
+ dev_name(&tz->device->dev),
+ ACPI_THERMAL_NOTIFY_HOT, 1);
+}
 
- if (trip_type == THERMAL_TRIP_CRITICAL && nocrt)
- return 1;
+static void acpi_thermal_zone_device_critical(struct thermal_zone_device *thermal)
+{
+ struct acpi_thermal *tz = thermal->devdata;
 
- return 0;
+ acpi_bus_generate_netlink_event(tz->device->pnp.device_class,
+ dev_name(&tz->device->dev),
+ ACPI_THERMAL_NOTIFY_CRITICAL, 1);
+
+ thermal_zone_device_critical(thermal);
 }
 
 static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal,
@@ -862,7 +859,8 @@ static struct thermal_zone_device_ops acpi_thermal_zone_ops = {
  .get_trip_temp = thermal_get_trip_temp,
  .get_crit_temp = thermal_get_crit_temp,
  .get_trend = thermal_get_trend,
- .notify = thermal_notify,
+ .hot = acpi_thermal_zone_device_hot,
+ .critical = acpi_thermal_zone_device_critical,
 };
 
 static int acpi_thermal_register_thermal_zone(struct acpi_thermal *tz)
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[G/OEM-5.10] [PATCH 4/7] thermal/drivers/rcar: Remove notification usage

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

The ops is only showing a trace telling a critical trip point is
crossed. The same information is given by the thermal framework.

This is redundant, remove the code.

Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Niklas Söderlund <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-4-daniel.lezcano@...
(cherry picked from commit 1fa34e49e4b7e66214a1d15261c0224d60366eec)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/rcar_thermal.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/drivers/thermal/rcar_thermal.c b/drivers/thermal/rcar_thermal.c
index 46aeb28b4e90..f490e3668e2a 100644
--- a/drivers/thermal/rcar_thermal.c
+++ b/drivers/thermal/rcar_thermal.c
@@ -323,24 +323,6 @@ static int rcar_thermal_get_trip_temp(struct thermal_zone_device *zone,
  return 0;
 }
 
-static int rcar_thermal_notify(struct thermal_zone_device *zone,
-       int trip, enum thermal_trip_type type)
-{
- struct rcar_thermal_priv *priv = rcar_zone_to_priv(zone);
- struct device *dev = rcar_priv_to_dev(priv);
-
- switch (type) {
- case THERMAL_TRIP_CRITICAL:
- /* FIXME */
- dev_warn(dev, "Thermal reached to critical temperature\n");
- break;
- default:
- break;
- }
-
- return 0;
-}
-
 static const struct thermal_zone_of_device_ops rcar_thermal_zone_of_ops = {
  .get_temp = rcar_thermal_of_get_temp,
 };
@@ -349,7 +331,6 @@ static struct thermal_zone_device_ops rcar_thermal_zone_ops = {
  .get_temp = rcar_thermal_get_temp,
  .get_trip_type = rcar_thermal_get_trip_type,
  .get_trip_temp = rcar_thermal_get_trip_temp,
- .notify = rcar_thermal_notify,
 };
 
 /*
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[G/Unstable/OEM-5.10] [PATCH 5/7] thermal/core: Remove notify ops

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>

BugLink: https://bugs.launchpad.net/bugs/1906168

With the removal of the notifys user in a previous patches, the ops is no
longer needed, remove it.

Signed-off-by: Daniel Lezcano <[hidden email]>
Reviewed-by: Lukasz Luba <[hidden email]>
Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@...
(cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/thermal_core.c | 3 ---
 include/linux/thermal.h        | 2 --
 2 files changed, 5 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index aba84e1a4396..b53305b8d643 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
 
  trace_thermal_zone_trip(tz, trip, trip_type);
 
- if (tz->ops->notify)
- tz->ops->notify(tz, trip, trip_type);
-
  if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot)
  tz->ops->hot(tz);
  else if (trip_type == THERMAL_TRIP_CRITICAL)
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index e7989cec090a..5777a9e4ea54 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -90,8 +90,6 @@ struct thermal_zone_device_ops {
  int (*set_emul_temp) (struct thermal_zone_device *, int);
  int (*get_trend) (struct thermal_zone_device *, int,
   enum thermal_trend *);
- int (*notify) (struct thermal_zone_device *, int,
-       enum thermal_trip_type);
  void (*hot)(struct thermal_zone_device *);
  void (*critical)(struct thermal_zone_device *);
 };
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[G/Unstable/OEM-5.10] [PATCH 6/7] thermal: int340x: Fix unexpected shutdown at critical temperature

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168

We are seeing thermal shutdown on Intel based mobile workstations, the
shutdown happens during the first trip handle in
thermal_zone_device_register():
kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down

However, we shouldn't do a thermal shutdown here, since
1) We may want to use a dedicated daemon, Intel's thermald in this case,
to handle thermal shutdown.

2) For ACPI based system, _CRT doesn't mean shutdown unless it's inside
ThermalZone namespace. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
"... If this object it present under a device, the device’s driver
evaluates this object to determine the device’s critical cooling
temperature trip point. This value may then be used by the device’s
driver to program an internal device temperature sensor trip point."

So a "critical trip" here merely means we should take a more aggressive
cooling method.

As int340x device isn't present under ACPI ThermalZone, override the
default .critical callback to prevent surprising thermal shutdown.

Signed-off-by: Kai-Heng Feng <[hidden email]>
Signed-off-by: Daniel Lezcano <[hidden email]>
Link: https://lore.kernel.org/r/20201221172345.36976-1-kai.heng.feng@...
(cherry picked from commit dd47366aaa9b93ac3d97cb4ee7641d38a28a771e linux-next)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 .../thermal/intel/int340x_thermal/int340x_thermal_zone.c    | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
index 432213272f1e..eeb08fab19a7 100644
--- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
+++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
@@ -146,12 +146,18 @@ static int int340x_thermal_get_trip_hyst(struct thermal_zone_device *zone,
  return 0;
 }
 
+static void int340x_thermal_critical(struct thermal_zone_device *zone)
+{
+ dev_dbg(&zone->device, "%s: critical temperature reached\n", zone->type);
+}
+
 static struct thermal_zone_device_ops int340x_thermal_zone_ops = {
  .get_temp       = int340x_thermal_get_zone_temp,
  .get_trip_temp = int340x_thermal_get_trip_temp,
  .get_trip_type = int340x_thermal_get_trip_type,
  .set_trip_temp = int340x_thermal_set_trip_temp,
  .get_trip_hyst =  int340x_thermal_get_trip_hyst,
+ .critical = int340x_thermal_critical,
 };
 
 static int int340x_thermal_get_trip_config(acpi_handle handle, char *name,
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

[G/Unstable/OEM-5.10] [PATCH 7/7] thermal: intel: pch: Fix unexpected shutdown at critical temperature

Kai-Heng Feng
In reply to this post by Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168

Like previous patch, the intel_pch_thermal device is not in ACPI
ThermalZone namespace, so a critical trip doesn't mean shutdown.

Override the default .critical callback to prevent surprising thermal
shutdoown.

Signed-off-by: Kai-Heng Feng <[hidden email]>
Signed-off-by: Daniel Lezcano <[hidden email]>
Link: https://lore.kernel.org/r/20201221172345.36976-2-kai.heng.feng@...
(cherry picked from commit 03671968d0bf2db598f7e3aa98f190b76c1bb4ff linux-next)
Signed-off-by: Kai-Heng Feng <[hidden email]>
---
 drivers/thermal/intel/intel_pch_thermal.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c
index 56401fd4708d..7bccf588c15e 100644
--- a/drivers/thermal/intel/intel_pch_thermal.c
+++ b/drivers/thermal/intel/intel_pch_thermal.c
@@ -263,10 +263,16 @@ static int pch_get_trip_temp(struct thermal_zone_device *tzd, int trip, int *tem
  return 0;
 }
 
+static void pch_critical(struct thermal_zone_device *tzd)
+{
+ dev_dbg(&tzd->device, "%s: critical temperature reached\n", tzd->type);
+}
+
 static struct thermal_zone_device_ops tzd_ops = {
  .get_temp = pch_thermal_get_temp,
  .get_trip_type = pch_get_trip_type,
  .get_trip_temp = pch_get_trip_temp,
+ .critical = pch_critical,
 };
 
 enum board_ids {
--
2.29.2


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

APPLIED H: Re: [SRU] [G/Unstable/OEM-5.10] [PATCH 0/7] Prevent thermal shutdown during boot process

Paolo Pisati-5
In reply to this post by Kai-Heng Feng
On Thu, Jan 21, 2021 at 04:48:54PM +0800, Kai-Heng Feng wrote:
> BugLink: https://bugs.launchpad.net/bugs/1906168

--
bye,
p.

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

APPLIED (unstable/5.11): [SRU] [G/Unstable/OEM-5.10] [PATCH 0/7] Prevent thermal shutdown during boot process

Andrea Righi
In reply to this post by Kai-Heng Feng
On Thu, Jan 21, 2021 at 04:48:54PM +0800, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1906168
>
> [Impact]
> Surprising thermal shutdown at boot on Intel based mobile workstations.
>
> [Fix]
> Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> shutdown the system.
>
> These critial temperatures are for usespace to handle, so let kernel
> know it shouldn't handle it.
>
> [Test]
> Use reboot stress as a reproducer. 5% chance to see a surprising
> shutdown at boot.
>
> With the fix applied, the thermal shutdown is no longer reproducible.
>
> [Where problems could occur]
> For ACPI based platforms, we still have "acpitz" to protect systems from
> overheating. If these acpitz sensors don't work, then the system could
> face real overheating issue.

Applied to 5.11 unstable (only the patches from linux-next that were
missing). Thanks!

-Andrea

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

NACK: [G] [PATCH 5/7] thermal/core: Remove notify ops

Stefan Bader-2
In reply to this post by Kai-Heng Feng
On 21.01.21 09:49, Kai-Heng Feng wrote:

> From: Daniel Lezcano <[hidden email]>
>
> BugLink: https://bugs.launchpad.net/bugs/1906168
>
> With the removal of the notifys user in a previous patches, the ops is no
> longer needed, remove it.
>
> Signed-off-by: Daniel Lezcano <[hidden email]>
> Reviewed-by: Lukasz Luba <[hidden email]>
> Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@...
> (cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next)
> Signed-off-by: Kai-Heng Feng <[hidden email]>
> ---
Removing interfaces like this in a backport sounds rather dangerous. Upstream
usually does this after fixing all drivers which make use of it and do not care
about anything else. For stable releases we have to care about 3rd party drivers
as well as potentially still having usage in the kernel. I suppose the latter
has to some degree be tested by test compiling it but there might be drivers not
enabled. Not that I saw any mention of a test compile.
Instead of removing the call completely, I would use something like this:

        if ((!tz-ops->hot && !tz-ops->critical) && tz->ops->notify)
                ts->ops->notify(tz, trip, trip_type);

>  drivers/thermal/thermal_core.c | 3 ---
>  include/linux/thermal.h        | 2 --
>  2 files changed, 5 deletions(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index aba84e1a4396..b53305b8d643 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
>  
>   trace_thermal_zone_trip(tz, trip, trip_type);
>  
> - if (tz->ops->notify)
> - tz->ops->notify(tz, trip, trip_type);
> -
>   if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot)
>   tz->ops->hot(tz);
>   else if (trip_type == THERMAL_TRIP_CRITICAL)
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index e7989cec090a..5777a9e4ea54 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -90,8 +90,6 @@ struct thermal_zone_device_ops {
>   int (*set_emul_temp) (struct thermal_zone_device *, int);
>   int (*get_trend) (struct thermal_zone_device *, int,
>    enum thermal_trend *);
> - int (*notify) (struct thermal_zone_device *, int,
> -       enum thermal_trip_type);
>   void (*hot)(struct thermal_zone_device *);
>   void (*critical)(struct thermal_zone_device *);
>  };
>


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: NACK: [G] [PATCH 5/7] thermal/core: Remove notify ops

Kai-Heng Feng
On Tue, Jan 26, 2021 at 4:36 PM Stefan Bader <[hidden email]> wrote:

>
> On 21.01.21 09:49, Kai-Heng Feng wrote:
> > From: Daniel Lezcano <[hidden email]>
> >
> > BugLink: https://bugs.launchpad.net/bugs/1906168
> >
> > With the removal of the notifys user in a previous patches, the ops is no
> > longer needed, remove it.
> >
> > Signed-off-by: Daniel Lezcano <[hidden email]>
> > Reviewed-by: Lukasz Luba <[hidden email]>
> > Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@...
> > (cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next)
> > Signed-off-by: Kai-Heng Feng <[hidden email]>
> > ---
>
> Removing interfaces like this in a backport sounds rather dangerous. Upstream
> usually does this after fixing all drivers which make use of it and do not care
> about anything else. For stable releases we have to care about 3rd party drivers
> as well as potentially still having usage in the kernel. I suppose the latter
> has to some degree be tested by test compiling it but there might be drivers not
> enabled. Not that I saw any mention of a test compile.

I didn't think of the case of 3rd party module, thanks for pointing that out.

> Instead of removing the call completely, I would use something like this:
>
>         if ((!tz-ops->hot && !tz-ops->critical) && tz->ops->notify)
>                 ts->ops->notify(tz, trip, trip_type);

Actually, it makes more sense to keep the code as is.
So we can drop this one and merge the other patches.

Kai-Heng

>
> >  drivers/thermal/thermal_core.c | 3 ---
> >  include/linux/thermal.h        | 2 --
> >  2 files changed, 5 deletions(-)
> >
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index aba84e1a4396..b53305b8d643 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
> >
> >       trace_thermal_zone_trip(tz, trip, trip_type);
> >
> > -     if (tz->ops->notify)
> > -             tz->ops->notify(tz, trip, trip_type);
> > -
> >       if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot)
> >               tz->ops->hot(tz);
> >       else if (trip_type == THERMAL_TRIP_CRITICAL)
> > diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> > index e7989cec090a..5777a9e4ea54 100644
> > --- a/include/linux/thermal.h
> > +++ b/include/linux/thermal.h
> > @@ -90,8 +90,6 @@ struct thermal_zone_device_ops {
> >       int (*set_emul_temp) (struct thermal_zone_device *, int);
> >       int (*get_trend) (struct thermal_zone_device *, int,
> >                         enum thermal_trend *);
> > -     int (*notify) (struct thermal_zone_device *, int,
> > -                    enum thermal_trip_type);
> >       void (*hot)(struct thermal_zone_device *);
> >       void (*critical)(struct thermal_zone_device *);
> >  };
> >
>
>

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team
Reply | Threaded
Open this post in threaded view
|

Re: NACK: [G] [PATCH 5/7] thermal/core: Remove notify ops

Stefan Bader-2
On 26.01.21 13:15, Kai-Heng Feng wrote:

> On Tue, Jan 26, 2021 at 4:36 PM Stefan Bader <[hidden email]> wrote:
>>
>> On 21.01.21 09:49, Kai-Heng Feng wrote:
>>> From: Daniel Lezcano <[hidden email]>
>>>
>>> BugLink: https://bugs.launchpad.net/bugs/1906168
>>>
>>> With the removal of the notifys user in a previous patches, the ops is no
>>> longer needed, remove it.
>>>
>>> Signed-off-by: Daniel Lezcano <[hidden email]>
>>> Reviewed-by: Lukasz Luba <[hidden email]>
>>> Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@...
>>> (cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next)
>>> Signed-off-by: Kai-Heng Feng <[hidden email]>
>>> ---
>>
>> Removing interfaces like this in a backport sounds rather dangerous. Upstream
>> usually does this after fixing all drivers which make use of it and do not care
>> about anything else. For stable releases we have to care about 3rd party drivers
>> as well as potentially still having usage in the kernel. I suppose the latter
>> has to some degree be tested by test compiling it but there might be drivers not
>> enabled. Not that I saw any mention of a test compile.
>
> I didn't think of the case of 3rd party module, thanks for pointing that out.
>
>> Instead of removing the call completely, I would use something like this:
>>
>>         if ((!tz-ops->hot && !tz-ops->critical) && tz->ops->notify)
>>                 ts->ops->notify(tz, trip, trip_type);
>
> Actually, it makes more sense to keep the code as is.
> So we can drop this one and merge the other patches.
Hm, if we drop this one but keep the others, won't those process critical events
twice? Oh probably not as they no longer set a notify hook. So yeah that should
be ok. Could you send a v2 set for G only (since the other places picked it up).
I think that would be simpler to follow.

Thanks.

>
> Kai-Heng
>
>>
>>>  drivers/thermal/thermal_core.c | 3 ---
>>>  include/linux/thermal.h        | 2 --
>>>  2 files changed, 5 deletions(-)
>>>
>>> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
>>> index aba84e1a4396..b53305b8d643 100644
>>> --- a/drivers/thermal/thermal_core.c
>>> +++ b/drivers/thermal/thermal_core.c
>>> @@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
>>>
>>>       trace_thermal_zone_trip(tz, trip, trip_type);
>>>
>>> -     if (tz->ops->notify)
>>> -             tz->ops->notify(tz, trip, trip_type);
>>> -
>>>       if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot)
>>>               tz->ops->hot(tz);
>>>       else if (trip_type == THERMAL_TRIP_CRITICAL)
>>> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
>>> index e7989cec090a..5777a9e4ea54 100644
>>> --- a/include/linux/thermal.h
>>> +++ b/include/linux/thermal.h
>>> @@ -90,8 +90,6 @@ struct thermal_zone_device_ops {
>>>       int (*set_emul_temp) (struct thermal_zone_device *, int);
>>>       int (*get_trend) (struct thermal_zone_device *, int,
>>>                         enum thermal_trend *);
>>> -     int (*notify) (struct thermal_zone_device *, int,
>>> -                    enum thermal_trip_type);
>>>       void (*hot)(struct thermal_zone_device *);
>>>       void (*critical)(struct thermal_zone_device *);
>>>  };
>>>
>>
>>


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

APPLIED [OEM-5.10] Re: [SRU] [G/Unstable/OEM-5.10] [PATCH 0/7] Prevent thermal shutdown during boot process

Timo Aaltonen-6
In reply to this post by Kai-Heng Feng
On 21.1.2021 10.48, Kai-Heng Feng wrote:

> BugLink: https://bugs.launchpad.net/bugs/1906168
>
> [Impact]
> Surprising thermal shutdown at boot on Intel based mobile workstations.
>
> [Fix]
> Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't
> shutdown the system.
>
> These critial temperatures are for usespace to handle, so let kernel
> know it shouldn't handle it.
>
> [Test]
> Use reboot stress as a reproducer. 5% chance to see a surprising
> shutdown at boot.
>
> With the fix applied, the thermal shutdown is no longer reproducible.
>
> [Where problems could occur]
> For ACPI based platforms, we still have "acpitz" to protect systems from
> overheating. If these acpitz sensors don't work, then the system could
> face real overheating issue.
>
> Daniel Lezcano (5):
>    thermal/core: Emit a warning if the thermal zone is updated without
>      ops
>    thermal/core: Add critical and hot ops
>    thermal/drivers/acpi: Use hot and critical ops
>    thermal/drivers/rcar: Remove notification usage
>    thermal/core: Remove notify ops
>
> Kai-Heng Feng (2):
>    thermal: int340x: Fix unexpected shutdown at critical temperature
>    thermal: intel: pch: Fix unexpected shutdown at critical temperature
>
>   drivers/acpi/thermal.c                        | 30 ++++++------
>   .../int340x_thermal/int340x_thermal_zone.c    |  6 +++
>   drivers/thermal/intel/intel_pch_thermal.c     |  6 +++
>   drivers/thermal/rcar_thermal.c                | 19 -------
>   drivers/thermal/thermal_core.c                | 49 +++++++++++--------
>   include/linux/thermal.h                       |  5 +-
>   6 files changed, 58 insertions(+), 57 deletions(-)
>

applied to oem-5.10, thanks

--
t

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team