Machine check exception, but what kind?

Machine check exception, but what kind?

Kevin O'Gorman
I've been having trouble with two of my personal computers.  One is from System76 and their great support staff suggested I load package mcelog to monitor for machine check exceptions (MCE).  Sounded good to me, so I did it on all my Ubuntu machines (I have 4 if you count laptops).

Lo and behold, one of the other machines glitched last night.  Not the System76 one, but a home-brew I built myself (with a little help from my friends).  It's got a medium-fast Core i7 on an ASUS board. It was a familiar occurrence:
- It had rebooted on its own and when I woke up it was asking me to log in
- On logging in I saw two popup dialogs that said there was an error detected by a system program (but absolutely no other information about it) and wanted permission to report it.  Even when I gave that permission, I did not get a copy or any further information about what happened.
- /var/log/syslog showed the reboot sequence, but nothing particularly helpful about the cause.

Pretty frustrating, but because I had installed mcelog, I also got this:
- /var/log/mcelog contained this:
mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 4
MISC 7fbc6369a0eb ADDR 7fbc6369a0eb
TIME 1492751851 Thu Apr 20 22:17:31 2017
MCG status:
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
Running trigger `unknown-error-trigger'
STATUS be00000000800400 MCGSTATUS 0
MCGCAP c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 1
CPU 3 BANK 3
MISC 7fbc6369a0eb ADDR 7fbc6369a0eb
TIME 1492751851 Thu Apr 20 22:17:31 2017
MCG status:
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
Running trigger `unknown-error-trigger'
STATUS be00000000800400 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60

So it looks like a hardware error.  It even says so, or at least "Hardware event. This is not a software error."

Thing is, the rest of this log is almost entirely opaque to me.  I do understand the timestamp and "Vendor Intel" but that's about it.  I'm wondering what actually happened, and if there's anyone on this list who can explain.  In particular, does that first line, containing "DIMM", suggest that there was a RAM-related problem?
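
For anyone else puzzling over the STATUS line: the flags mcelog prints can be reproduced by hand from that hex value. Here is a minimal sketch in Python, assuming the IA32_MCi_STATUS bit layout described in the machine-check chapter of Intel's Software Developer's Manual (bits 63 down to 57 are flags, bits 15:0 hold the MCA error code). The names FLAGS and decode_status are made up for illustration and are not part of mcelog.

```python
# Flag bits of IA32_MCi_STATUS (assumed layout, taken from the Intel SDM).
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its flags and error codes."""
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# STATUS value copied from the log above (identical in both records).
decoded = decode_status(0xBE00000000800400)
for flag in decoded["flags"]:
    print(flag)
print(f"MCA error code: {decoded['mca_error_code']:#06x}")
print(f"Model-specific code: {decoded['model_specific_code']:#06x}")
```

For the value in this log, the flags that come out are VAL, UC, EN, MISCV, ADDRV and PCC, which lines up with what mcelog lists under "MCi status". The MCA error code comes out as 0x0400, which the SDM gives as the internal timer error code, consistent with the "MCA: Internal Timer error" line.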

I also wanted to alert anyone else who might be having trouble diagnosing a recurring problem.  This package is in the regular repository, but is not installed by default.  I think that's a shame.

--
Kevin O'Gorman
#define QUESTION ((bb) || (!bb))   /* Shakespeare */

Please consider the environment before printing this email.


--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
Re: Machine check exception, but what kind?

Joel Rees
On Sat, Apr 22, 2017 at 1:31 AM, Kevin O'Gorman <[hidden email]> wrote:
> I've been having trouble with two of my personal computers.  One is from
> System76 and their great support staff suggested I load package mcelog to
> monitor for machine check exceptions (MCE).  Sounded good to me, so I did it
> on all my Ubuntu machines (I have 4 if you count laptops).

I assume you have been reading

    https://www.mcelog.org/

I'm seeing a lot of useful information there. Maybe I'll try it out.

If you haven't read the manpage and the FAQ, ...

> Lo and behold, one of the other machines glitched last night.  Not the
> System76 one, but a home-brew I built myself (with a little help from my
> friends).  It's got a medium-fast Core i7 on an ASUS board. It was a
> familiar occurrence:
> - It had rebooted on its own and when I woke up it was asking me to log in
> - On logging in I saw two popup dialogs that said there was an error
> detected by a system program (but absolutely no other information about it)
> and wanted permission to report it.  Even when I gave that permission, I did
> not get a copy or any further information about what happened.

Did you read the page on triggers? (Mentioned also in the FAQ.)

> - /var/log/syslog showed the reboot sequence, but nothing particularly
> helpful about the cause.
>
> Pretty frustrating, but because I had installed mcelog, I also got this:
> - /var/log/mcelog contained this:
> mcelog: failed to prefill DIMM database from DMI data

I saw something about that in the FAQ.

> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 4
> MISC 7fbc6369a0eb ADDR 7fbc6369a0eb
> TIME 1492751851 Thu Apr 20 22:17:31 2017
> MCG status:
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> Running trigger `unknown-error-trigger'
> STATUS be00000000800400 MCGSTATUS 0
> MCGCAP c09 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 60

> Hardware event. This is not a software error.
> MCE 1
> CPU 3 BANK 3
> MISC 7fbc6369a0eb ADDR 7fbc6369a0eb
> TIME 1492751851 Thu Apr 20 22:17:31 2017
> MCG status:
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> Running trigger `unknown-error-trigger'
> STATUS be00000000800400 MCGSTATUS 0
> MCGCAP c09 APICID 6 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 60
>
> So it looks like a hardware error.  It even says so, or at least "Hardware
> event. This is not a software error."

Two, in fact.
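
The count can be checked mechanically. A rough sketch in Python, assuming every record in /var/log/mcelog starts with the "Hardware event." banner as in the excerpt above. The sample text is abridged from the quoted log, and parse_records is a hypothetical helper, not something mcelog ships.

```python
import re

# Abridged copy of the quoted log; only the fields parsed below are kept.
SAMPLE = """\
mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 4
TIME 1492751851 Thu Apr 20 22:17:31 2017
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 0
Hardware event. This is not a software error.
MCE 1
CPU 3 BANK 3
TIME 1492751851 Thu Apr 20 22:17:31 2017
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 0
"""


def parse_records(text):
    """Return one dict per 'Hardware event.' record in mcelog-style text."""
    records = []
    # Text before the first banner (the DIMM/DMI line) is not a record.
    for chunk in text.split("Hardware event.")[1:]:
        cpu = re.search(r"^CPU (\d+) BANK (\d+)", chunk, re.M)
        when = re.search(r"^TIME (\d+)", chunk, re.M)
        mca = re.search(r"^MCA: (.+)$", chunk, re.M)
        status = re.search(r"^STATUS ([0-9a-f]+)", chunk, re.M)
        records.append({
            "cpu": int(cpu.group(1)),
            "bank": int(cpu.group(2)),
            "time": int(when.group(1)),
            "mca": mca.group(1),
            "status": int(status.group(1), 16),
        })
    return records


records = parse_records(SAMPLE)
print(len(records), "records")
for r in records:
    print(r["cpu"], r["bank"], r["mca"], hex(r["status"]))
```

In this abridged sample the two records carry the same TIME and STATUS and differ only in CPU and BANK. The sketch does no error handling, so a record missing one of the fields would make it fail.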

> Thing is the rest of this log is almost entirely opaque to me.  I do
> understand the timestamp and "Vendor Intel" but that's about it.  I'm
> wondering what actually happened, and if there's anyone on this list that
> can explain.  In particular, does that first line, containing "DIMM" suggest
> that there was a RAM memory-related problem?

It does, but did you check the glossary?

> I also wanted to alert anyone else who might be having trouble diagnosing a
> recurring problem.  This package is in the regular repository, but is not
> installed by default.  I think that's a shame.

You might want to look up EDAC. I see it mentioned in the FAQ.


> --
> Kevin O'Gorman
> #define QUESTION ((bb) || (!bb))   /* Shakespeare */
>
> Please consider the environment before printing this email.
>

Happy hunting.

--
Joel Rees

I'm imagining I'm a novelist:
http://joel-rees-economics.blogspot.com/2017/01/soc500-00-00-toc.html
More of my delusions:
http://reiisi.blogspot.jp/p/novels-i-am-writing.html

Re: Machine check exception, but what kind?

J.Witvliet
In reply to this post by Kevin O'Gorman
Run a memory check for at least 24 hours.
HW errors are *not* binary, i.e. simply present or absent. ESD-related problems in particular (you did take precautions, didn't you?) can take a long time to show up, and then only once in a while.

Sent from my iPhone

This message may contain information that is not intended for you. If you are not the addressee or if this message was sent to you by mistake, you are requested to inform the sender and delete the message. The State accepts no liability for damage of any kind resulting from the risks inherent in the electronic transmission of messages.

Re: Machine check exception, but what kind?

Ralf Mardorf-2
On Sun, 23 Apr 2017 09:51:14 +0000, [hidden email] wrote:
>This message may contain information that is not intended for you. If
>you are not the addressee or if this message was sent to you by
>mistake, you are requested to inform the sender and delete the
>message.

Hi,

I just want to inform you that I'm not the addressee. However, I won't
delete this mail; instead I will store multiple copies of it on my
computer and make it public by forwarding it to my neighbours and
everybody else I know. I will also add signatures to office emails
containing links to websites that have already published your mail or
will publish it soon [1].

Regards,
Ralf

PS: Joking apart, is anybody able to log in to the Wiki/Ubuntu One? For
me it doesn't work. I want to correct the FAQ
https://wiki.ubuntu.com/UbuntuUsersListFAQ#FAQ5 , since
http://news.gmane.org/gmane.linux.ubuntu.user is a dead link. I want
to replace the Gmane link with a link to Nabble. I joined the
documentation team, so I've got permission to edit the Wiki, but
logging in doesn't work anymore: it never finishes, it just keeps
waiting and waiting.

[1]
https://lists.ubuntu.com/archives/ubuntu-users/2017-April/290114.html
http://ubuntu.5.x6.nabble.com/ubuntu-users-f1215774.html

