I'm writing to you as the principal developer of EDAC drivers in Linux.
It looks like Red Hat customers would like to be notified about
EDAC errors. The request reads:
"certain EDAC messages indicate problems which are almost always relevant
to the end customer. While not usually OS related, RHT can provide value
to our customers by detecting and informing them about the problem.
I would like to see ABRT monitor for fatal/critical EDAC errors
and when detected, collect information and inform the user.
We think the following messages are worth tracking..."
EDAC drivers currently just log detected errors to the kernel log.
I have two discussion topics related to this.
1) Log message format
I looked at current EDAC drivers and they don't seem to have
an easily greppable message signature standardized across them.
mce_amd.c uses the HW_ERR macro, which expands to "[Hardware Error]:",
but this isn't picked up by other drivers. Is there a plan
to use it in other drivers?
I am also a bit worried that it is too generic.
It is defined in include/linux/printk.h.
I can imagine it being used by e.g. USB stick drivers,
and even though a badly seated USB stick generating errors
is also bad, it is not as bad as defective RAM.
Can we use more specific prefixes? Say:
RAM corruption: <more specific message follows>...
can be used for all DRAM/cache corruptions,
but it clearly excludes filesystem corruptions, which may be
a useful distinction. (Just "Data corruption" would be too generic).
For related EDAC errors on buses etc. we can use e.g.:
PCI bus error: <more specific message follows>...
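
To illustrate why a fixed prefix helps, here is a rough sketch of how a
userspace tool could pick such lines out of the kernel log. It is only an
illustration under assumptions: the two prefixes are the ones proposed
above (no driver emits them today), and the name classify() and the
handling of "<N>" priority tags and "[ seconds ]" timestamps are
hypothetical choices, not an existing ABRT or EDAC interface.

```python
import re

# Prefixes proposed in this mail; hypothetical, not emitted by any driver today.
PREFIXES = ("RAM corruption", "PCI bus error")

# Allow an optional "<N>" priority tag and an optional "[  12.345678] "
# timestamp before the prefix, as commonly seen in kernel log output.
_PATTERN = re.compile(
    r"^(?:<\d+>)?(?:\[\s*\d+\.\d+\]\s*)?("
    + "|".join(re.escape(p) for p in PREFIXES)
    + r"):\s*(.*)$"
)


def classify(line):
    """Return (prefix, detail) if the line starts with a proposed prefix, else None."""
    m = _PATTERN.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)
```

With one anchored prefix per error class, the tool needs no per-driver
knowledge, and a message that merely mentions "corruption" somewhere in
its text (e.g. from a filesystem) is not matched.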
2) Anything better than log scrubbing?
What method would you recommend for userspace tools
to detect EDAC errors?
With what solution the discussion started here: