On 13-05-2013 11:14, Denys Vlasenko wrote:
I'm writing to you as the principal developer of EDAC drivers in Linux.
Looks like Red Hat customers would like to get notified about
"certain EDAC messages indicate problems which are almost always relevant
to the end customer. While not usually OS related, RHT can provide value
to our customers by detecting and informing them about the problem.
I would like to see ABRT monitor for fatal/critical EDAC errors
and when detected, collect information and inform the user.
We think the following messages are worth tracking..."
EDAC drivers currently just log detected errors to the kernel log.
I have two discussion topics related to this.
1) Log message format
I looked at current EDAC drivers and they don't seem to have
an easily greppable message signature standardized across them.
No, they don't. In the future, the idea is to collect those errors
via tracing. See:
mce_amd.c uses HW_ERR macro, which expands to "[Hardware Error]:",
but this isn't picked up by other drivers. Is there a plan
to use it in other drivers?
I think that newer kernels use it as well, but we didn't change it
on RHEL5/RHEL6, as this could potentially break existing userspace.
I am also a bit worried that it is too generic.
It is defined in include/linux/printk.h.
I can imagine it being used by e.g. USB stick drivers,
and even though a badly seated USB stick generating errors
is also bad, it is not as bad as defective RAM.
Can we use more specific prefixes? Say:
RAM corruption: <more specific message follows>...
can be used for all DRAM/cache corruptions,
but it clearly excludes filesystem corruptions, which may be
a useful distinction. (Just "Data corruption" would be too generic).
For related EDAC errors on buses etc. we can use e.g.:
PCI bus error: <more specific message follows>...
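
To make the prefix idea concrete, here is a rough sketch of how a userspace
tool might sort kernel log lines by prefix. The "[Hardware Error]:" string is
what HW_ERR expands to; "RAM corruption:" and "PCI bus error:" are only the
prefixes proposed above and are not emitted by any driver today. The category
names and the classify() helper are hypothetical.

```python
# Map a log-line prefix to a coarse category.
# "[Hardware Error]:" is the existing HW_ERR text; the other two entries are
# only the prefixes proposed in this thread (assumption, not kernel output).
PREFIXES = {
    "[Hardware Error]:": "hardware-error",
    "RAM corruption:": "ram",
    "PCI bus error:": "pci",
}


def classify(line):
    """Return (category, message) for the earliest known prefix, else None."""
    best = None
    for prefix, category in PREFIXES.items():
        pos = line.find(prefix)
        # Keep the prefix that appears first in the line, so a syslog
        # preamble or printk timestamp in front of it does not matter.
        if pos != -1 and (best is None or pos < best[0]):
            message = line[pos + len(prefix):].strip()
            best = (pos, category, message)
    if best is None:
        return None
    return best[1], best[2]
```

A tool could feed each line of the kernel log through classify() and alert
only on the "ram" category, which is the distinction a generic
"[Hardware Error]:" prefix cannot give on its own.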
2) Anything better than log scrubbing?
What method would you recommend for userspace tools
to detect EDAC errors?
With what solution the discussion started here:
As I said before, the plan is to use tracing. Kernel 3.10
(or the corresponding backported patches) is needed in order
to fully benefit from it, as several patches to the perf/tracing
infrastructure in kernel 3.10 are required for this to work
properly (e.g. having a separate perf instance for hardware
errors and matching the kernel timestamp with the machine's uptime).
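
As a rough illustration of the tracing approach, a userspace tool could
enable a trace event and read it from the trace pipe instead of scrubbing the
log. This is only a sketch under assumptions: it assumes debugfs is mounted
at /sys/kernel/debug and that memory errors are reported through an event
named ras/mc_event. For simplicity it uses the global trace buffer rather
than the separate instance mentioned above, and it must run as root.

```python
import os

TRACING_DIR = "/sys/kernel/debug/tracing"  # assumed debugfs mount point
EVENT = "ras/mc_event"                     # assumed subsystem/event name


def enable_path(tracing_dir=TRACING_DIR, event=EVENT):
    """Path of the control file that switches the trace event on or off."""
    return os.path.join(tracing_dir, "events", event, "enable")


def is_event_line(line, event=EVENT):
    """True if a trace_pipe line was produced by the given event."""
    name = event.split("/")[-1]
    return " " + name + ": " in line


def watch(tracing_dir=TRACING_DIR, event=EVENT):
    """Enable the event, then yield its lines as they arrive (blocks)."""
    with open(enable_path(tracing_dir, event), "w") as control:
        control.write("1")
    with open(os.path.join(tracing_dir, "trace_pipe")) as pipe:
        for line in pipe:
            if is_event_line(line, event):
                yield line.rstrip("\n")
```

Iterating over watch() blocks until the kernel reports an error, so a daemon
would typically run it in its own thread or process.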