BugLink: https://bugs.launchpad.net/bugs/1906168
[Impact] Surprising thermal shutdown at boot on Intel based mobile workstations. [Fix] Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't shutdown the system. These critial temperatures are for usespace to handle, so let kernel know it shouldn't handle it. [Test] Use reboot stress as a reproducer. 5% chance to see a surprising shutdown at boot. With the fix applied, the thermal shutdown is no longer reproducible. [Where problems could occur] For ACPI based platforms, we still have "acpitz" to protect systems from overheating. If these acpitz sensors don't work, then the system could face real overheating issue. Daniel Lezcano (5): thermal/core: Emit a warning if the thermal zone is updated without ops thermal/core: Add critical and hot ops thermal/drivers/acpi: Use hot and critical ops thermal/drivers/rcar: Remove notification usage thermal/core: Remove notify ops Kai-Heng Feng (2): thermal: int340x: Fix unexpected shutdown at critical temperature thermal: intel: pch: Fix unexpected shutdown at critical temperature drivers/acpi/thermal.c | 30 ++++++------ .../int340x_thermal/int340x_thermal_zone.c | 6 +++ drivers/thermal/intel/intel_pch_thermal.c | 6 +++ drivers/thermal/rcar_thermal.c | 19 ------- drivers/thermal/thermal_core.c | 49 +++++++++++-------- include/linux/thermal.h | 5 +- 6 files changed, 58 insertions(+), 57 deletions(-) -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
From: Daniel Lezcano <[hidden email]>
BugLink: https://bugs.launchpad.net/bugs/1906168 The actual code is silently ignoring a thermal zone update when a driver is requesting it without a get_temp ops set. That looks not correct, as the caller should not have called this function if the thermal zone is unable to read the temperature. That makes the code less robust as the check won't detect the driver is inconsistently using the thermal API and that does not help to improve the framework as these circumvolutions hide the problem at the source. In order to detect the situation when it happens, let's add a warning when the update is requested without the get_temp() ops set. Any warning emitted will have to be fixed at the source of the problem: the caller must not call thermal_zone_device_update if there is not get_temp callback set. Cc: Thara Gopinath <[hidden email]> Cc: Amit Kucheria <[hidden email]> Cc: [hidden email] Cc: [hidden email] Signed-off-by: Daniel Lezcano <[hidden email]> Reviewed-by: Lukasz Luba <[hidden email]> Link: https://lore.kernel.org/r/20201210121514.25760-1-daniel.lezcano@... (cherry picked from commit 433178e75834dc35f1ae79b56ec2cf396f2c6f3c) Signed-off-by: Kai-Heng Feng <[hidden email]> --- drivers/thermal/thermal_core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index b71196eaf90e..749001a88b44 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -467,7 +467,8 @@ void thermal_zone_device_update(struct thermal_zone_device *tz, if (atomic_read(&in_suspend)) return; - if (!tz->ops->get_temp) + if (WARN_ONCE(!tz->ops->get_temp, "'%s' must not be called without " + "'get_temp' ops set\n", __func__)) return; update_temperature(tz); -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>
BugLink: https://bugs.launchpad.net/bugs/1906168 Currently there is no way to the sensors to directly call an ops in interrupt mode without calling thermal_zone_device_update assuming all the trip points are defined. A sensor may want to do something special if a trip point is hot or critical. This patch adds the critical and hot ops to the thermal zone device, so a sensor can directly invoke them or let the thermal framework to call the sensor specific ones. Tested-by: Kai-Heng Feng <[hidden email]> Signed-off-by: Daniel Lezcano <[hidden email]> Reviewed-by: Lukasz Luba <[hidden email]> Link: https://lore.kernel.org/r/20201210121514.25760-2-daniel.lezcano@... (backported from commit d7203eedf4f68e9909fd489453168a9d26bf0c3d) Signed-off-by: Kai-Heng Feng <[hidden email]> --- drivers/thermal/thermal_core.c | 43 +++++++++++++++++++++------------- include/linux/thermal.h | 3 +++ 2 files changed, 30 insertions(+), 16 deletions(-) diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index 749001a88b44..aba84e1a4396 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -369,6 +369,25 @@ static void thermal_emergency_poweroff(void) msecs_to_jiffies(poweroff_delay_ms)); } +void thermal_zone_device_critical(struct thermal_zone_device *tz) +{ + dev_emerg(&tz->device, "%s: critical temperature reached, " + "shutting down\n", tz->type); + + mutex_lock(&poweroff_lock); + if (!power_off_triggered) { + /* + * Queue a backup emergency shutdown in the event of + * orderly_poweroff failure + */ + thermal_emergency_poweroff(); + orderly_poweroff(true); + power_off_triggered = true; + } + mutex_unlock(&poweroff_lock); +} +EXPORT_SYMBOL(thermal_zone_device_critical); + static void handle_critical_trips(struct thermal_zone_device *tz, int trip, enum thermal_trip_type trip_type) { @@ -385,22 +404,10 @@ static void handle_critical_trips(struct thermal_zone_device *tz, if (tz->ops->notify) tz->ops->notify(tz, trip, trip_type); - if (trip_type == THERMAL_TRIP_CRITICAL) { - dev_emerg(&tz->device, - "critical temperature reached (%d C), shutting down\n", - tz->temperature / 1000); - mutex_lock(&poweroff_lock); - if (!power_off_triggered) { - /* - * Queue a backup emergency shutdown in the event of - * orderly_poweroff failure - */ - thermal_emergency_poweroff(); - orderly_poweroff(true); - power_off_triggered = true; - } - mutex_unlock(&poweroff_lock); - } + if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot) + tz->ops->hot(tz); + else if (trip_type == THERMAL_TRIP_CRITICAL) + tz->ops->critical(tz); } static void handle_thermal_trip(struct thermal_zone_device *tz, int trip) @@ -1273,6 +1280,10 @@ thermal_zone_device_register(const char *type, int trips, int mask, tz->id = id; strlcpy(tz->type, type, sizeof(tz->type)); + + if (!ops->critical) + ops->critical = thermal_zone_device_critical; + tz->ops = ops; tz->tzp = tzp; tz->device.class = &thermal_class; diff --git a/include/linux/thermal.h b/include/linux/thermal.h index 216185bb3014..e7989cec090a 100644 --- a/include/linux/thermal.h +++ b/include/linux/thermal.h @@ -92,6 +92,8 @@ struct thermal_zone_device_ops { enum thermal_trend *); int (*notify) (struct thermal_zone_device *, int, enum thermal_trip_type); + void (*hot)(struct thermal_zone_device *); + void (*critical)(struct thermal_zone_device *); }; struct thermal_cooling_device_ops { @@ -416,6 +418,7 @@ int thermal_zone_get_offset(struct thermal_zone_device *tz); void thermal_cdev_update(struct thermal_cooling_device *); void thermal_notify_framework(struct thermal_zone_device *, int); +void thermal_zone_device_critical(struct thermal_zone_device *tz); #else static inline struct thermal_zone_device *thermal_zone_device_register( const char *type, int trips, int mask, void *devdata, -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>
BugLink: https://bugs.launchpad.net/bugs/1906168 Currently there is no way to the sensors to directly call an ops in interrupt mode without calling thermal_zone_device_update assuming all the trip points are defined. A sensor may want to do something special if a trip point is hot or critical. This patch adds the critical and hot ops to the thermal zone device, so a sensor can directly invoke them or let the thermal framework to call the sensor specific ones. Tested-by: Kai-Heng Feng <[hidden email]> Signed-off-by: Daniel Lezcano <[hidden email]> Reviewed-by: Lukasz Luba <[hidden email]> Link: https://lore.kernel.org/r/20201210121514.25760-2-daniel.lezcano@... (cherry picked from commit d7203eedf4f68e9909fd489453168a9d26bf0c3d) Signed-off-by: Kai-Heng Feng <[hidden email]> --- drivers/thermal/thermal_core.c | 43 +++++++++++++++++++++------------- include/linux/thermal.h | 3 +++ 2 files changed, 30 insertions(+), 16 deletions(-) diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index d819c98ebb9d..5b83fb9d9b7d 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -380,6 +380,25 @@ static void thermal_emergency_poweroff(void) msecs_to_jiffies(poweroff_delay_ms)); } +void thermal_zone_device_critical(struct thermal_zone_device *tz) +{ + dev_emerg(&tz->device, "%s: critical temperature reached, " + "shutting down\n", tz->type); + + mutex_lock(&poweroff_lock); + if (!power_off_triggered) { + /* + * Queue a backup emergency shutdown in the event of + * orderly_poweroff failure + */ + thermal_emergency_poweroff(); + orderly_poweroff(true); + power_off_triggered = true; + } + mutex_unlock(&poweroff_lock); +} +EXPORT_SYMBOL(thermal_zone_device_critical); + static void handle_critical_trips(struct thermal_zone_device *tz, int trip, enum thermal_trip_type trip_type) { @@ -396,22 +415,10 @@ static void handle_critical_trips(struct thermal_zone_device *tz, if (tz->ops->notify) tz->ops->notify(tz, trip, trip_type); - if (trip_type == THERMAL_TRIP_CRITICAL) { - dev_emerg(&tz->device, - "critical temperature reached (%d C), shutting down\n", - tz->temperature / 1000); - mutex_lock(&poweroff_lock); - if (!power_off_triggered) { - /* - * Queue a backup emergency shutdown in the event of - * orderly_poweroff failure - */ - thermal_emergency_poweroff(); - orderly_poweroff(true); - power_off_triggered = true; - } - mutex_unlock(&poweroff_lock); - } + if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot) + tz->ops->hot(tz); + else if (trip_type == THERMAL_TRIP_CRITICAL) + tz->ops->critical(tz); } static void handle_thermal_trip(struct thermal_zone_device *tz, int trip) @@ -1424,6 +1431,10 @@ thermal_zone_device_register(const char *type, int trips, int mask, tz->id = id; strlcpy(tz->type, type, sizeof(tz->type)); + + if (!ops->critical) + ops->critical = thermal_zone_device_critical; + tz->ops = ops; tz->tzp = tzp; tz->device.class = &thermal_class; diff --git a/include/linux/thermal.h b/include/linux/thermal.h index d07ea27e72a9..31b84404f047 100644 --- a/include/linux/thermal.h +++ b/include/linux/thermal.h @@ -79,6 +79,8 @@ struct thermal_zone_device_ops { enum thermal_trend *); int (*notify) (struct thermal_zone_device *, int, enum thermal_trip_type); + void (*hot)(struct thermal_zone_device *); + void (*critical)(struct thermal_zone_device *); }; struct thermal_cooling_device_ops { @@ -399,6 +401,7 @@ void thermal_cdev_update(struct thermal_cooling_device *); void thermal_notify_framework(struct thermal_zone_device *, int); int thermal_zone_device_enable(struct thermal_zone_device *tz); int thermal_zone_device_disable(struct thermal_zone_device *tz); +void thermal_zone_device_critical(struct thermal_zone_device *tz); #else static inline struct thermal_zone_device *thermal_zone_device_register( const char *type, int trips, int mask, void *devdata, -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>
BugLink: https://bugs.launchpad.net/bugs/1906168 The acpi driver wants to do a netlink notification in case of a hot or critical trip point. Implement the corresponding ops to be used for the thermal zone and use them instead of the notify ops. Signed-off-by: Daniel Lezcano <[hidden email]> Acked-by: Srinivas Pandruvada <[hidden email]> Link: https://lore.kernel.org/r/20201210121514.25760-3-daniel.lezcano@... (cherry picked from commit a73cb2024caa3480263a009dce91fa581b3748bf linux-next) Signed-off-by: Kai-Heng Feng <[hidden email]> --- drivers/acpi/thermal.c | 30 ++++++++++++++---------------- 1 file changed, 14 insertions(+), 16 deletions(-) diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c index 19067a5e5293..3412fee26915 100644 --- a/drivers/acpi/thermal.c +++ b/drivers/acpi/thermal.c @@ -725,27 +725,24 @@ static int thermal_get_trend(struct thermal_zone_device *thermal, return 0; } - -static int thermal_notify(struct thermal_zone_device *thermal, int trip, - enum thermal_trip_type trip_type) +static void acpi_thermal_zone_device_hot(struct thermal_zone_device *thermal) { - u8 type = 0; struct acpi_thermal *tz = thermal->devdata; - if (trip_type == THERMAL_TRIP_CRITICAL) - type = ACPI_THERMAL_NOTIFY_CRITICAL; - else if (trip_type == THERMAL_TRIP_HOT) - type = ACPI_THERMAL_NOTIFY_HOT; - else - return 0; - acpi_bus_generate_netlink_event(tz->device->pnp.device_class, - dev_name(&tz->device->dev), type, 1); + dev_name(&tz->device->dev), + ACPI_THERMAL_NOTIFY_HOT, 1); +} - if (trip_type == THERMAL_TRIP_CRITICAL && nocrt) - return 1; +static void acpi_thermal_zone_device_critical(struct thermal_zone_device *thermal) +{ + struct acpi_thermal *tz = thermal->devdata; - return 0; + acpi_bus_generate_netlink_event(tz->device->pnp.device_class, + dev_name(&tz->device->dev), + ACPI_THERMAL_NOTIFY_CRITICAL, 1); + + thermal_zone_device_critical(thermal); } static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal, @@ -862,7 +859,8 @@ static struct thermal_zone_device_ops acpi_thermal_zone_ops = { .get_trip_temp = thermal_get_trip_temp, .get_crit_temp = thermal_get_crit_temp, .get_trend = thermal_get_trend, - .notify = thermal_notify, + .hot = acpi_thermal_zone_device_hot, + .critical = acpi_thermal_zone_device_critical, }; static int acpi_thermal_register_thermal_zone(struct acpi_thermal *tz) -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>
BugLink: https://bugs.launchpad.net/bugs/1906168 The ops is only showing a trace telling a critical trip point is crossed. The same information is given by the thermal framework. This is redundant, remove the code. Signed-off-by: Daniel Lezcano <[hidden email]> Reviewed-by: Niklas Söderlund <[hidden email]> Link: https://lore.kernel.org/r/20201210121514.25760-4-daniel.lezcano@... (cherry picked from commit 1fa34e49e4b7e66214a1d15261c0224d60366eec) Signed-off-by: Kai-Heng Feng <[hidden email]> --- drivers/thermal/rcar_thermal.c | 19 ------------------- 1 file changed, 19 deletions(-) diff --git a/drivers/thermal/rcar_thermal.c b/drivers/thermal/rcar_thermal.c index 46aeb28b4e90..f490e3668e2a 100644 --- a/drivers/thermal/rcar_thermal.c +++ b/drivers/thermal/rcar_thermal.c @@ -323,24 +323,6 @@ static int rcar_thermal_get_trip_temp(struct thermal_zone_device *zone, return 0; } -static int rcar_thermal_notify(struct thermal_zone_device *zone, - int trip, enum thermal_trip_type type) -{ - struct rcar_thermal_priv *priv = rcar_zone_to_priv(zone); - struct device *dev = rcar_priv_to_dev(priv); - - switch (type) { - case THERMAL_TRIP_CRITICAL: - /* FIXME */ - dev_warn(dev, "Thermal reached to critical temperature\n"); - break; - default: - break; - } - - return 0; -} - static const struct thermal_zone_of_device_ops rcar_thermal_zone_of_ops = { .get_temp = rcar_thermal_of_get_temp, }; @@ -349,7 +331,6 @@ static struct thermal_zone_device_ops rcar_thermal_zone_ops = { .get_temp = rcar_thermal_get_temp, .get_trip_type = rcar_thermal_get_trip_type, .get_trip_temp = rcar_thermal_get_trip_temp, - .notify = rcar_thermal_notify, }; /* -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
From: Daniel Lezcano <[hidden email]>
BugLink: https://bugs.launchpad.net/bugs/1906168 With the removal of the notifys user in a previous patches, the ops is no longer needed, remove it. Signed-off-by: Daniel Lezcano <[hidden email]> Reviewed-by: Lukasz Luba <[hidden email]> Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@... (cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next) Signed-off-by: Kai-Heng Feng <[hidden email]> --- drivers/thermal/thermal_core.c | 3 --- include/linux/thermal.h | 2 -- 2 files changed, 5 deletions(-) diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index aba84e1a4396..b53305b8d643 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz, trace_thermal_zone_trip(tz, trip, trip_type); - if (tz->ops->notify) - tz->ops->notify(tz, trip, trip_type); - if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot) tz->ops->hot(tz); else if (trip_type == THERMAL_TRIP_CRITICAL) diff --git a/include/linux/thermal.h b/include/linux/thermal.h index e7989cec090a..5777a9e4ea54 100644 --- a/include/linux/thermal.h +++ b/include/linux/thermal.h @@ -90,8 +90,6 @@ struct thermal_zone_device_ops { int (*set_emul_temp) (struct thermal_zone_device *, int); int (*get_trend) (struct thermal_zone_device *, int, enum thermal_trend *); - int (*notify) (struct thermal_zone_device *, int, - enum thermal_trip_type); void (*hot)(struct thermal_zone_device *); void (*critical)(struct thermal_zone_device *); }; -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168
We are seeing thermal shutdown on Intel based mobile workstations, the shutdown happens during the first trip handle in thermal_zone_device_register(): kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down However, we shouldn't do a thermal shutdown here, since 1) We may want to use a dedicated daemon, Intel's thermald in this case, to handle thermal shutdown. 2) For ACPI based system, _CRT doesn't mean shutdown unless it's inside ThermalZone namespace. ACPI Spec, 11.4.4 _CRT (Critical Temperature): "... If this object it present under a device, the device’s driver evaluates this object to determine the device’s critical cooling temperature trip point. This value may then be used by the device’s driver to program an internal device temperature sensor trip point." So a "critical trip" here merely means we should take a more aggressive cooling method. As int340x device isn't present under ACPI ThermalZone, override the default .critical callback to prevent surprising thermal shutdown. Signed-off-by: Kai-Heng Feng <[hidden email]> Signed-off-by: Daniel Lezcano <[hidden email]> Link: https://lore.kernel.org/r/20201221172345.36976-1-kai.heng.feng@... (cherry picked from commit dd47366aaa9b93ac3d97cb4ee7641d38a28a771e linux-next) Signed-off-by: Kai-Heng Feng <[hidden email]> --- .../thermal/intel/int340x_thermal/int340x_thermal_zone.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c index 432213272f1e..eeb08fab19a7 100644 --- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c +++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c @@ -146,12 +146,18 @@ static int int340x_thermal_get_trip_hyst(struct thermal_zone_device *zone, return 0; } +static void int340x_thermal_critical(struct thermal_zone_device *zone) +{ + dev_dbg(&zone->device, "%s: critical temperature reached\n", zone->type); +} + static struct thermal_zone_device_ops int340x_thermal_zone_ops = { .get_temp = int340x_thermal_get_zone_temp, .get_trip_temp = int340x_thermal_get_trip_temp, .get_trip_type = int340x_thermal_get_trip_type, .set_trip_temp = int340x_thermal_set_trip_temp, .get_trip_hyst = int340x_thermal_get_trip_hyst, + .critical = int340x_thermal_critical, }; static int int340x_thermal_get_trip_config(acpi_handle handle, char *name, -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
BugLink: https://bugs.launchpad.net/bugs/1906168
Like previous patch, the intel_pch_thermal device is not in ACPI ThermalZone namespace, so a critical trip doesn't mean shutdown. Override the default .critical callback to prevent surprising thermal shutdoown. Signed-off-by: Kai-Heng Feng <[hidden email]> Signed-off-by: Daniel Lezcano <[hidden email]> Link: https://lore.kernel.org/r/20201221172345.36976-2-kai.heng.feng@... (cherry picked from commit 03671968d0bf2db598f7e3aa98f190b76c1bb4ff linux-next) Signed-off-by: Kai-Heng Feng <[hidden email]> --- drivers/thermal/intel/intel_pch_thermal.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c index 56401fd4708d..7bccf588c15e 100644 --- a/drivers/thermal/intel/intel_pch_thermal.c +++ b/drivers/thermal/intel/intel_pch_thermal.c @@ -263,10 +263,16 @@ static int pch_get_trip_temp(struct thermal_zone_device *tzd, int trip, int *tem return 0; } +static void pch_critical(struct thermal_zone_device *tzd) +{ + dev_dbg(&tzd->device, "%s: critical temperature reached\n", tzd->type); +} + static struct thermal_zone_device_ops tzd_ops = { .get_temp = pch_thermal_get_temp, .get_trip_type = pch_get_trip_type, .get_trip_temp = pch_get_trip_temp, + .critical = pch_critical, }; enum board_ids { -- 2.29.2 -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
On Thu, Jan 21, 2021 at 04:48:54PM +0800, Kai-Heng Feng wrote:
> BugLink: https://bugs.launchpad.net/bugs/1906168 -- bye, p. -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
On Thu, Jan 21, 2021 at 04:48:54PM +0800, Kai-Heng Feng wrote:
> BugLink: https://bugs.launchpad.net/bugs/1906168 > > [Impact] > Surprising thermal shutdown at boot on Intel based mobile workstations. > > [Fix] > Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't > shutdown the system. > > These critial temperatures are for usespace to handle, so let kernel > know it shouldn't handle it. > > [Test] > Use reboot stress as a reproducer. 5% chance to see a surprising > shutdown at boot. > > With the fix applied, the thermal shutdown is no longer reproducible. > > [Where problems could occur] > For ACPI based platforms, we still have "acpitz" to protect systems from > overheating. If these acpitz sensors don't work, then the system could > face real overheating issue. Applied to 5.11 unstable (only the patches from linux-next that were missing). Thanks! -Andrea -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
On 21.01.21 09:49, Kai-Heng Feng wrote:
> From: Daniel Lezcano <[hidden email]> > > BugLink: https://bugs.launchpad.net/bugs/1906168 > > With the removal of the notifys user in a previous patches, the ops is no > longer needed, remove it. > > Signed-off-by: Daniel Lezcano <[hidden email]> > Reviewed-by: Lukasz Luba <[hidden email]> > Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@... > (cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next) > Signed-off-by: Kai-Heng Feng <[hidden email]> > --- usually does this after fixing all drivers which make use of it and do not care about anything else. For stable releases we have to care about 3rd party drivers as well as potentially still having usage in the kernel. I suppose the latter has to some degree be tested by test compiling it but there might be drivers not enabled. Not that I saw any mention of a test compile. Instead of removing the call completely, I would use something like this: if ((!tz-ops->hot && !tz-ops->critical) && tz->ops->notify) ts->ops->notify(tz, trip, trip_type); > drivers/thermal/thermal_core.c | 3 --- > include/linux/thermal.h | 2 -- > 2 files changed, 5 deletions(-) > > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c > index aba84e1a4396..b53305b8d643 100644 > --- a/drivers/thermal/thermal_core.c > +++ b/drivers/thermal/thermal_core.c > @@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz, > > trace_thermal_zone_trip(tz, trip, trip_type); > > - if (tz->ops->notify) > - tz->ops->notify(tz, trip, trip_type); > - > if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot) > tz->ops->hot(tz); > else if (trip_type == THERMAL_TRIP_CRITICAL) > diff --git a/include/linux/thermal.h b/include/linux/thermal.h > index e7989cec090a..5777a9e4ea54 100644 > --- a/include/linux/thermal.h > +++ b/include/linux/thermal.h > @@ -90,8 +90,6 @@ struct thermal_zone_device_ops { > int (*set_emul_temp) (struct thermal_zone_device *, int); > int (*get_trend) (struct thermal_zone_device *, int, > enum thermal_trend *); > - int (*notify) (struct thermal_zone_device *, int, > - enum thermal_trip_type); > void (*hot)(struct thermal_zone_device *); > void (*critical)(struct thermal_zone_device *); > }; > -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
On Tue, Jan 26, 2021 at 4:36 PM Stefan Bader <[hidden email]> wrote:
> > On 21.01.21 09:49, Kai-Heng Feng wrote: > > From: Daniel Lezcano <[hidden email]> > > > > BugLink: https://bugs.launchpad.net/bugs/1906168 > > > > With the removal of the notifys user in a previous patches, the ops is no > > longer needed, remove it. > > > > Signed-off-by: Daniel Lezcano <[hidden email]> > > Reviewed-by: Lukasz Luba <[hidden email]> > > Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@... > > (cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next) > > Signed-off-by: Kai-Heng Feng <[hidden email]> > > --- > > Removing interfaces like this in a backport sounds rather dangerous. Upstream > usually does this after fixing all drivers which make use of it and do not care > about anything else. For stable releases we have to care about 3rd party drivers > as well as potentially still having usage in the kernel. I suppose the latter > has to some degree be tested by test compiling it but there might be drivers not > enabled. Not that I saw any mention of a test compile. I didn't think of the case of 3rd party module, thanks for pointing that out. > Instead of removing the call completely, I would use something like this: > > if ((!tz-ops->hot && !tz-ops->critical) && tz->ops->notify) > ts->ops->notify(tz, trip, trip_type); Actually, it makes more sense to keep the code as is. So we can drop this one and merge the other patches. Kai-Heng > > > drivers/thermal/thermal_core.c | 3 --- > > include/linux/thermal.h | 2 -- > > 2 files changed, 5 deletions(-) > > > > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c > > index aba84e1a4396..b53305b8d643 100644 > > --- a/drivers/thermal/thermal_core.c > > +++ b/drivers/thermal/thermal_core.c > > @@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz, > > > > trace_thermal_zone_trip(tz, trip, trip_type); > > > > - if (tz->ops->notify) > > - tz->ops->notify(tz, trip, trip_type); > > - > > if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot) > > tz->ops->hot(tz); > > else if (trip_type == THERMAL_TRIP_CRITICAL) > > diff --git a/include/linux/thermal.h b/include/linux/thermal.h > > index e7989cec090a..5777a9e4ea54 100644 > > --- a/include/linux/thermal.h > > +++ b/include/linux/thermal.h > > @@ -90,8 +90,6 @@ struct thermal_zone_device_ops { > > int (*set_emul_temp) (struct thermal_zone_device *, int); > > int (*get_trend) (struct thermal_zone_device *, int, > > enum thermal_trend *); > > - int (*notify) (struct thermal_zone_device *, int, > > - enum thermal_trip_type); > > void (*hot)(struct thermal_zone_device *); > > void (*critical)(struct thermal_zone_device *); > > }; > > > > -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
On 26.01.21 13:15, Kai-Heng Feng wrote:
> On Tue, Jan 26, 2021 at 4:36 PM Stefan Bader <[hidden email]> wrote: >> >> On 21.01.21 09:49, Kai-Heng Feng wrote: >>> From: Daniel Lezcano <[hidden email]> >>> >>> BugLink: https://bugs.launchpad.net/bugs/1906168 >>> >>> With the removal of the notifys user in a previous patches, the ops is no >>> longer needed, remove it. >>> >>> Signed-off-by: Daniel Lezcano <[hidden email]> >>> Reviewed-by: Lukasz Luba <[hidden email]> >>> Link: https://lore.kernel.org/r/20201210121514.25760-5-daniel.lezcano@... >>> (cherry picked from commit 04f111130e9afa41c10d7bcec14e00e3be8b6977 linux-next) >>> Signed-off-by: Kai-Heng Feng <[hidden email]> >>> --- >> >> Removing interfaces like this in a backport sounds rather dangerous. Upstream >> usually does this after fixing all drivers which make use of it and do not care >> about anything else. For stable releases we have to care about 3rd party drivers >> as well as potentially still having usage in the kernel. I suppose the latter >> has to some degree be tested by test compiling it but there might be drivers not >> enabled. Not that I saw any mention of a test compile. > > I didn't think of the case of 3rd party module, thanks for pointing that out. > >> Instead of removing the call completely, I would use something like this: >> >> if ((!tz-ops->hot && !tz-ops->critical) && tz->ops->notify) >> ts->ops->notify(tz, trip, trip_type); > > Actually, it makes more sense to keep the code as is. > So we can drop this one and merge the other patches. twice? Oh probably not as they no longer set a notify hook. So yeah that should be ok. Could you send a v2 set for G only (since the other places picked it up). I think that would be simpler to follow. Thanks. > > Kai-Heng > >> >>> drivers/thermal/thermal_core.c | 3 --- >>> include/linux/thermal.h | 2 -- >>> 2 files changed, 5 deletions(-) >>> >>> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c >>> index aba84e1a4396..b53305b8d643 100644 >>> --- a/drivers/thermal/thermal_core.c >>> +++ b/drivers/thermal/thermal_core.c >>> @@ -401,9 +401,6 @@ static void handle_critical_trips(struct thermal_zone_device *tz, >>> >>> trace_thermal_zone_trip(tz, trip, trip_type); >>> >>> - if (tz->ops->notify) >>> - tz->ops->notify(tz, trip, trip_type); >>> - >>> if (trip_type == THERMAL_TRIP_HOT && tz->ops->hot) >>> tz->ops->hot(tz); >>> else if (trip_type == THERMAL_TRIP_CRITICAL) >>> diff --git a/include/linux/thermal.h b/include/linux/thermal.h >>> index e7989cec090a..5777a9e4ea54 100644 >>> --- a/include/linux/thermal.h >>> +++ b/include/linux/thermal.h >>> @@ -90,8 +90,6 @@ struct thermal_zone_device_ops { >>> int (*set_emul_temp) (struct thermal_zone_device *, int); >>> int (*get_trend) (struct thermal_zone_device *, int, >>> enum thermal_trend *); >>> - int (*notify) (struct thermal_zone_device *, int, >>> - enum thermal_trip_type); >>> void (*hot)(struct thermal_zone_device *); >>> void (*critical)(struct thermal_zone_device *); >>> }; >>> >> >> -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
In reply to this post by Kai-Heng Feng
On 21.1.2021 10.48, Kai-Heng Feng wrote:
> BugLink: https://bugs.launchpad.net/bugs/1906168 > > [Impact] > Surprising thermal shutdown at boot on Intel based mobile workstations. > > [Fix] > Since these thermal devcies are not in ACPI ThermalZone, OS shouldn't > shutdown the system. > > These critial temperatures are for usespace to handle, so let kernel > know it shouldn't handle it. > > [Test] > Use reboot stress as a reproducer. 5% chance to see a surprising > shutdown at boot. > > With the fix applied, the thermal shutdown is no longer reproducible. > > [Where problems could occur] > For ACPI based platforms, we still have "acpitz" to protect systems from > overheating. If these acpitz sensors don't work, then the system could > face real overheating issue. > > Daniel Lezcano (5): > thermal/core: Emit a warning if the thermal zone is updated without > ops > thermal/core: Add critical and hot ops > thermal/drivers/acpi: Use hot and critical ops > thermal/drivers/rcar: Remove notification usage > thermal/core: Remove notify ops > > Kai-Heng Feng (2): > thermal: int340x: Fix unexpected shutdown at critical temperature > thermal: intel: pch: Fix unexpected shutdown at critical temperature > > drivers/acpi/thermal.c | 30 ++++++------ > .../int340x_thermal/int340x_thermal_zone.c | 6 +++ > drivers/thermal/intel/intel_pch_thermal.c | 6 +++ > drivers/thermal/rcar_thermal.c | 19 ------- > drivers/thermal/thermal_core.c | 49 +++++++++++-------- > include/linux/thermal.h | 5 +- > 6 files changed, 58 insertions(+), 57 deletions(-) > applied to oem-5.10, thanks -- t -- kernel-team mailing list [hidden email] https://lists.ubuntu.com/mailman/listinfo/kernel-team |
Free forum by Nabble | Edit this page |