REISUB -- now what?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

REISUB -- now what?

Kevin O'Gorman
I've got a big machine running Ubuntu that's I'd prefer kept running all the time.  I don't always get what I want, of course...

Today, for the second time in about three weeks, it completely froze.  Fortunately, I kept a printed copy of http://blog.kember.net/articles/reisub-the-gentle-linux-restart/ in a folder with assorted HOWTO-type articles, and was able to restart it that way.  Which, by the way, proves it was still alive, at least sort of.

Simply put, I held down Alt and SysRq and slowly typed R E I S U B.  Several times, because there's no feedback and a machine with 256GB or RAM can take a while to dump buffers, but I could not actually tell if it was doing anything at all until the boot sequence started.

My question is this: is there any way to investigate what caused this?  I'll try looking in logs, but I am unfamiliar with them and don't really know what to look for.

--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Colin Law
On 16 February 2017 at 02:48, Kevin O'Gorman <[hidden email]> wrote:
I've got a big machine running Ubuntu that's I'd prefer kept running all the time.  I don't always get what I want, of course...

Today, for the second time in about three weeks, it completely froze.  Fortunately, I kept a printed copy of http://blog.kember.net/articles/reisub-the-gentle-linux-restart/ in a folder with assorted HOWTO-type articles, and was able to restart it that way.  Which, by the way, proves it was still alive, at least sort of.

Simply put, I held down Alt and SysRq and slowly typed R E I S U B.  Several times, because there's no feedback and a machine with 256GB or RAM can take a while to dump buffers, but I could not actually tell if it was doing anything at all until the boot sequence started.

My question is this: is there any way to investigate what caused this?  I'll try looking in logs, but I am unfamiliar with them and don't really know what to look for.

The first thing is to look in /var/log/syslog and see what is there at the point of failure.

Colin


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Kevin O'Gorman
The problem is it's kind of overwhelming.  At 4MB the log is over 1000x the normal size.  The previous day's log is 10x normal. The failure happened unattended, so I'm not even sure when it happened.
It would sure help if I knew in advance what the first stages of a reboot looks like.

Just making some guesses, I grepped for "lockup" on unzipped logs: "grep -i lockup saved-*" and got 662 lines
The first looked the most severe:
saved.syslog.1:Feb 15 16:48:51 plato-x kernel: [1377271.765054] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Oddly, it was followed by the rest of those 662 that looked more  less like the next one and the last one:
saved.syslog.1:Feb 15 16:55:57 plato-x kernel: [1377749.441524] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 23s! [systemd-timesyn:1338]
saved.syslog.1:Feb 15 18:18:05 plato-x kernel: [1382677.646422] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [chromium-browse:19520]
with a considerable variety of daemons implicated, none of them the 32 processes that are the real workload, and which run at near 100% CPU according to top(1)

But then, I think this may have happened during the bootup, not the failure.  That's because after silence from 18:18:05 to 18:25:51 there's
Feb 15 18:25:51 plato-x rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1409" x-info="http://www.rsyslog.com"] start
and very soon there's a log entry that looks very much like startup:
Feb 15 18:25:51 plato-x systemd[1]: Activating swap Swap Partition...
It's surrounded by a lot of stuff I don't understand, including many occurrences of the phrase "failed with exit code" which really makes me wonder

Anyway. before the first lockup line, things looked very quiet, as you might expect for an unattended system, except for some oddities that went on for a long time but ended about 10 hours earlier, here's the complete sequence from the last few of 09:33:59 up to the first lockup message:
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-livemail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-mail-ru
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-newsblur
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-pandora
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-qml-autopilot
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** (process:22931): WARNING **: Found newline before all of met
adata, skipping app
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-qq-mail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-reddit
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-tumblr
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-twitter
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-vkcom
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-wordpress
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-wordpress-com
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yahoomail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yahoonews
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandex-music
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandexmail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandexmusic
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandexnews
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-youtube
Feb 15 09:43:34 plato-x systemd[1]: snapd.refresh.timer: Adding 56min 56.771453s random time.
Feb 15 10:10:17 plato-x dhclient[30629]: DHCPREQUEST of 192.168.1.172 on enp6s0 to 192.168.1.1 port 67 (xid=0x6d95b0a3)
Feb 15 10:10:17 plato-x dhclient[30629]: DHCPACK of 192.168.1.172 from 192.168.1.1
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6309]   address 192.168.1.172
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   plen 24 (255.255.255.0)
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   gateway 192.168.1.1
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   server identifier 192.168.1.1
Feb 15 10:10:17 plato-x dbus[1459]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.f
reedesktop.nm-dispatcher.service'
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   lease time 86400
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   hostname 'plato-x'
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6311]   nameserver '192.168.1.1'
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6311] dhcp4 (enp6s0): state changed bound -> bound
Feb 15 10:10:17 plato-x dhclient[30629]: bound to 192.168.1.172 -- renewal in 29017 seconds.
Feb 15 10:10:17 plato-x systemd[1]: Starting Network Manager Script Dispatcher Service...
Feb 15 10:10:17 plato-x dbus[1459]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Feb 15 10:10:17 plato-x nm-dispatcher: req:1 'dhcp4-change' [enp6s0]: new request (1 scripts)
Feb 15 10:10:17 plato-x nm-dispatcher: req:1 'dhcp4-change' [enp6s0]: start running ordered scripts...
Feb 15 10:10:17 plato-x systemd[1]: Started Network Manager Script Dispatcher Service.
Feb 15 10:17:01 plato-x CRON[24916]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 11:17:01 plato-x CRON[27608]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 11:54:34 plato-x systemd[1]: Starting Daily apt activities...
Feb 15 11:56:56 plato-x systemd[1]: snapd.refresh.timer: Adding 5h 21min 51.717429s random time.
Feb 15 11:57:31 plato-x systemd[1]: Started Daily apt activities.
Feb 15 11:57:31 plato-x systemd[1]: apt-daily.timer: Adding 6h 3min 22.557945s random time.
Feb 15 11:57:31 plato-x systemd[1]: apt-daily.timer: Adding 9h 28min 33.356115s random time.
Feb 15 12:17:01 plato-x CRON[30798]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 13:17:01 plato-x CRON[1056]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 13:30:44 plato-x systemd-timesyncd[1338]: Timed out waiting for reply from 91.189.89.199:123 (ntp.ubuntu.com).
Feb 15 13:30:44 plato-x systemd-timesyncd[1338]: Synchronized to time server 91.189.91.157:123 (ntp.ubuntu.com).
Feb 15 14:17:01 plato-x CRON[4111]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 15:17:01 plato-x CRON[6860]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 16:17:01 plato-x CRON[9648]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 16:48:51 plato-x kernel: [1377271.765054] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
So could these com.canonical.Unity.Webapps.Service lines be related to a failure of the user interface, which then went quiet until I got home and around 4 PM noticed the machine was frozen, and all the CPU lockups are attempts to respond to my REISUB?

If so is there anything I can do about this?





On Thu, Feb 16, 2017 at 12:39 AM, Colin Law <[hidden email]> wrote:
On 16 February 2017 at 02:48, Kevin O'Gorman <[hidden email]> wrote:
I've got a big machine running Ubuntu that's I'd prefer kept running all the time.  I don't always get what I want, of course...

Today, for the second time in about three weeks, it completely froze.  Fortunately, I kept a printed copy of http://blog.kember.net/articles/reisub-the-gentle-linux-restart/ in a folder with assorted HOWTO-type articles, and was able to restart it that way.  Which, by the way, proves it was still alive, at least sort of.

Simply put, I held down Alt and SysRq and slowly typed R E I S U B.  Several times, because there's no feedback and a machine with 256GB or RAM can take a while to dump buffers, but I could not actually tell if it was doing anything at all until the boot sequence started.

My question is this: is there any way to investigate what caused this?  I'll try looking in logs, but I am unfamiliar with them and don't really know what to look for.

The first thing is to look in /var/log/syslog and see what is there at the point of failure.

Colin


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users




--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Kevin O'Gorman
Hmm.  Answering my own question: no, the webapps stuff occurs also on other days that do not have a system failure.  I still haven't found anything that sticks out as happening at the start of a freeze-up.  And the first lockup message is so close to the restart that I doubt it's coincidence.   Very hard to say, because when the system has failed, all the screens are blank, and won't come alive until reboot.

About all I can see to do is to stop the screens from blanking and hope there's some clue on the workspace I happen to be on when it fails.

On Thu, Feb 16, 2017 at 8:55 AM, Kevin O'Gorman <[hidden email]> wrote:
The problem is it's kind of overwhelming.  At 4MB the log is over 1000x the normal size.  The previous day's log is 10x normal. The failure happened unattended, so I'm not even sure when it happened.
It would sure help if I knew in advance what the first stages of a reboot looks like.

Just making some guesses, I grepped for "lockup" on unzipped logs: "grep -i lockup saved-*" and got 662 lines
The first looked the most severe:
saved.syslog.1:Feb 15 16:48:51 plato-x kernel: [1377271.765054] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Oddly, it was followed by the rest of those 662 that looked more  less like the next one and the last one:
saved.syslog.1:Feb 15 16:55:57 plato-x kernel: [1377749.441524] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 23s! [systemd-timesyn:1338]
saved.syslog.1:Feb 15 18:18:05 plato-x kernel: [1382677.646422] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [chromium-browse:19520]
with a considerable variety of daemons implicated, none of them the 32 processes that are the real workload, and which run at near 100% CPU according to top(1)

But then, I think this may have happened during the bootup, not the failure.  That's because after silence from 18:18:05 to 18:25:51 there's
Feb 15 18:25:51 plato-x rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1409" x-info="http://www.rsyslog.com"] start
and very soon there's a log entry that looks very much like startup:
Feb 15 18:25:51 plato-x systemd[1]: Activating swap Swap Partition...
It's surrounded by a lot of stuff I don't understand, including many occurrences of the phrase "failed with exit code" which really makes me wonder

Anyway. before the first lockup line, things looked very quiet, as you might expect for an unattended system, except for some oddities that went on for a long time but ended about 10 hours earlier, here's the complete sequence from the last few of 09:33:59 up to the first lockup message:
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-livemail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-mail-ru
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-newsblur
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-pandora
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-qml-autopilot
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** (process:22931): WARNING **: Found newline before all of met
adata, skipping app
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-qq-mail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-reddit
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-tumblr
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-twitter
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-vkcom
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-wordpress
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-wordpress-com
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yahoomail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yahoonews
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandex-music
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandexmail
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandexmusic
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-yandexnews
Feb 15 09:33:59 plato-x com.canonical.Unity.Webapps.Service[2265]: ** Message: Investigating webapp: unity-webapps-youtube
Feb 15 09:43:34 plato-x systemd[1]: snapd.refresh.timer: Adding 56min 56.771453s random time.
Feb 15 10:10:17 plato-x dhclient[30629]: DHCPREQUEST of 192.168.1.172 on enp6s0 to 192.168.1.1 port 67 (xid=0x6d95b0a3)
Feb 15 10:10:17 plato-x dhclient[30629]: DHCPACK of 192.168.1.172 from 192.168.1.1
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6309]   address 192.168.1.172
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   plen 24 (255.255.255.0)
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   gateway 192.168.1.1
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   server identifier 192.168.1.1
Feb 15 10:10:17 plato-x dbus[1459]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.f
reedesktop.nm-dispatcher.service'
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   lease time 86400
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6310]   hostname 'plato-x'
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6311]   nameserver '192.168.1.1'
Feb 15 10:10:17 plato-x NetworkManager[1493]: <info>  [1487182217.6311] dhcp4 (enp6s0): state changed bound -> bound
Feb 15 10:10:17 plato-x dhclient[30629]: bound to 192.168.1.172 -- renewal in 29017 seconds.
Feb 15 10:10:17 plato-x systemd[1]: Starting Network Manager Script Dispatcher Service...
Feb 15 10:10:17 plato-x dbus[1459]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Feb 15 10:10:17 plato-x nm-dispatcher: req:1 'dhcp4-change' [enp6s0]: new request (1 scripts)
Feb 15 10:10:17 plato-x nm-dispatcher: req:1 'dhcp4-change' [enp6s0]: start running ordered scripts...
Feb 15 10:10:17 plato-x systemd[1]: Started Network Manager Script Dispatcher Service.
Feb 15 10:17:01 plato-x CRON[24916]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 11:17:01 plato-x CRON[27608]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 11:54:34 plato-x systemd[1]: Starting Daily apt activities...
Feb 15 11:56:56 plato-x systemd[1]: snapd.refresh.timer: Adding 5h 21min 51.717429s random time.
Feb 15 11:57:31 plato-x systemd[1]: Started Daily apt activities.
Feb 15 11:57:31 plato-x systemd[1]: apt-daily.timer: Adding 6h 3min 22.557945s random time.
Feb 15 11:57:31 plato-x systemd[1]: apt-daily.timer: Adding 9h 28min 33.356115s random time.
Feb 15 12:17:01 plato-x CRON[30798]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 13:17:01 plato-x CRON[1056]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 13:30:44 plato-x systemd-timesyncd[1338]: Timed out waiting for reply from 91.189.89.199:123 (ntp.ubuntu.com).
Feb 15 13:30:44 plato-x systemd-timesyncd[1338]: Synchronized to time server 91.189.91.157:123 (ntp.ubuntu.com).
Feb 15 14:17:01 plato-x CRON[4111]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 15:17:01 plato-x CRON[6860]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 16:17:01 plato-x CRON[9648]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 15 16:48:51 plato-x kernel: [1377271.765054] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
So could these com.canonical.Unity.Webapps.Service lines be related to a failure of the user interface, which then went quiet until I got home and around 4 PM noticed the machine was frozen, and all the CPU lockups are attempts to respond to my REISUB?

If so is there anything I can do about this?





On Thu, Feb 16, 2017 at 12:39 AM, Colin Law <[hidden email]> wrote:
On 16 February 2017 at 02:48, Kevin O'Gorman <[hidden email]> wrote:
I've got a big machine running Ubuntu that's I'd prefer kept running all the time.  I don't always get what I want, of course...

Today, for the second time in about three weeks, it completely froze.  Fortunately, I kept a printed copy of http://blog.kember.net/articles/reisub-the-gentle-linux-restart/ in a folder with assorted HOWTO-type articles, and was able to restart it that way.  Which, by the way, proves it was still alive, at least sort of.

Simply put, I held down Alt and SysRq and slowly typed R E I S U B.  Several times, because there's no feedback and a machine with 256GB or RAM can take a while to dump buffers, but I could not actually tell if it was doing anything at all until the boot sequence started.

My question is this: is there any way to investigate what caused this?  I'll try looking in logs, but I am unfamiliar with them and don't really know what to look for.

The first thing is to look in /var/log/syslog and see what is there at the point of failure.

Colin


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users




--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.




--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Colin Law
On 16 February 2017 at 17:34, Kevin O'Gorman <[hidden email]> wrote:
>
> Hmm.  Answering my own question: no, the webapps stuff occurs also on other days that do not have a system failure.  I still haven't found anything that sticks out as happening at the start of a freeze-up.  And the first lockup message is so close to the restart that I doubt it's coincidence.   Very hard to say, because when the system has failed, all the screens are blank, and won't come alive until reboot.
>
> About all I can see to do is to stop the screens from blanking and hope there's some clue on the workspace I happen to be on when it fails.
>

When it locks up does Ctrl+Alt+F1 give you a terminal?  If so then
Ctrl+Alt+F7 should take you back. You may at least be able to login
and shutdown cleanly.

Have you tried booting into the RAM test and running it overnight (or
whenever) to check it runs ok? This is often a good check on the RAM
PSU etc.  Dependent on how often it fails you may need to run it
longer to get confidence.

Colin

--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Kevin O'Gorman
I have not tried C-A-F1, but I will surely try it if this comes up again.

So far it has been at least 2 weeks between failures.  I have run RAM tests, but I'm unwilling to do it for more than a day.  There's a reason I have this machine, and it's not for running tests.

On Thu, Feb 16, 2017 at 9:44 AM, Colin Law <[hidden email]> wrote:
On 16 February 2017 at 17:34, Kevin O'Gorman <[hidden email]> wrote:
>
> Hmm.  Answering my own question: no, the webapps stuff occurs also on other days that do not have a system failure.  I still haven't found anything that sticks out as happening at the start of a freeze-up.  And the first lockup message is so close to the restart that I doubt it's coincidence.   Very hard to say, because when the system has failed, all the screens are blank, and won't come alive until reboot.
>
> About all I can see to do is to stop the screens from blanking and hope there's some clue on the workspace I happen to be on when it fails.
>

When it locks up does Ctrl+Alt+F1 give you a terminal?  If so then
Ctrl+Alt+F7 should take you back. You may at least be able to login
and shutdown cleanly.

Have you tried booting into the RAM test and running it overnight (or
whenever) to check it runs ok? This is often a good check on the RAM
PSU etc.  Dependent on how often it fails you may need to run it
longer to get confidence.

Colin

--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users



--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

compdoc
On 02/16/2017 12:45 PM, Kevin O'Gorman wrote:

> There's a reason I have this machine, and it's not for running tests.

The reason to run tests is to find the problems. Just run memtest86 for
a few passes. if you get no errors, then look elsewhere.

Hard drives and power supplies are common problems that can cause
freezing, bad fans and clogged heatsinks as well, or bad capacitors on
older motherboards.

The hard drive SMART info can tell you if sectors are failing.

--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Peter Silva
I googled your lockup message.

in which someone fixed it by using the clocksource=tsc
Kernel parameter.

For a receipe on how to apply a kernel parameter

No guarantee this fixes your problem but it's easy to try.



On Thu, Feb 16, 2017 at 9:53 PM, compdoc <[hidden email]> wrote:
On 02/16/2017 12:45 PM, Kevin O'Gorman wrote:

There's a reason I have this machine, and it's not for running tests.

The reason to run tests is to find the problems. Just run memtest86 for a few passes. if you get no errors, then look elsewhere.

Hard drives and power supplies are common problems that can cause freezing, bad fans and clogged heatsinks as well, or bad capacitors on older motherboards.

The hard drive SMART info can tell you if sectors are failing.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Kevin O'Gorman
In reply to this post by compdoc


On Thu, Feb 16, 2017 at 6:53 PM, compdoc <[hidden email]> wrote:
On 02/16/2017 12:45 PM, Kevin O'Gorman wrote:

There's a reason I have this machine, and it's not for running tests.

The reason to run tests is to find the problems. Just run memtest86 for a few passes. if you get no errors, then look elsewhere.
I think mentioned that I've done that.

Hard drives and power supplies are common problems that can cause freezing, bad fans and clogged heatsinks as well, or bad capacitors on older motherboards.
xsensors(1) and hddtemp(8) show no problems

The hard drive SMART info can tell you if sectors are failing.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users



--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: REISUB -- now what?

Kevin O'Gorman
In reply to this post by Peter Silva


On Fri, Feb 17, 2017 at 7:15 PM, Peter Silva <[hidden email]> wrote:
I googled your lockup message.

in which someone fixed it by using the clocksource=tsc
Kernel parameter.

For a receipe on how to apply a kernel parameter

No guarantee this fixes your problem but it's easy to try.

This could be useful.  I added a comment on that bug, and I'll be giving it a try.

Thanks

 



On Thu, Feb 16, 2017 at 9:53 PM, compdoc <[hidden email]> wrote:
On 02/16/2017 12:45 PM, Kevin O'Gorman wrote:

There's a reason I have this machine, and it's not for running tests.

The reason to run tests is to find the problems. Just run memtest86 for a few passes. if you get no errors, then look elsewhere.

Hard drives and power supplies are common problems that can cause freezing, bad fans and clogged heatsinks as well, or bad capacitors on older motherboards.

The hard drive SMART info can tell you if sectors are failing.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users




--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Loading...