On Wed, 15 May 2013 14:25:57 +0200,
Denys Vlasenko <dvlasenk(a)redhat.com> wrote:
> On 05/13/2013 05:08 PM, Mauro Carvalho Chehab wrote:
>> On 13-05-2013 11:14, Denys Vlasenko wrote:
>>> I looked at current EDAC drivers and they don't seem to have
>>> an easily greppable message signature standardized across them.
>> No, they don't. In the future, the idea is to collect those errors
>> via tracing. See:
>>> 2) Anything better than log scrubbing?
>>> What method would you recommend for userspace tools
>>> to detect EDAC errors?
>>> What solution did the discussion started here:
>>> end up with?
>> As I said before, the plan is to use tracing. Kernel 3.10
>> (or the corresponding backported patches) is needed in order
>> to fully benefit from it, as there are several patches to the
>> perf/tracing infrastructure in kernel 3.10 that are needed for this
>> to work properly (e.g. having a separate perf instance for hardware
>> errors and matching the kernel timestamp with the machine's uptime).
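[Editorial note: consuming these tracepoints from userspace amounts to reading lines from the trace pipe and parsing the fields. A minimal sketch follows; the sample line and regex assume a hypothetical ras:mc_event layout, since the real field order depends on the kernel version.]

```python
import re

# Hypothetical sample of a ras:mc_event line as read from the trace pipe;
# the actual field layout depends on the kernel version.
SAMPLE = ("rasdaemon-1234 [001] 5504.003: mc_event: "
          "1 Corrected error: on DIMM_A1 (mc:0 location:0:1:0 "
          "address:0x0002a000 grain:12)")

MC_EVENT_RE = re.compile(
    r"mc_event:\s+(?P<count>\d+)\s+(?P<type>Corrected|Uncorrected)\s+error.*"
    r"mc:(?P<mc>\d+)\s+location:(?P<loc>[\d:]+)\s+address:(?P<addr>0x[0-9a-f]+)"
)

def parse_mc_event(line):
    """Extract the fields we care about, or None if the line doesn't match."""
    m = MC_EVENT_RE.search(line)
    if not m:
        return None
    return {
        "count": int(m.group("count")),
        "type": m.group("type"),
        "mc": int(m.group("mc")),
        "location": m.group("loc"),
        "address": int(m.group("addr"), 16),
    }
```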
> I am reading the rasdaemon-0.2.0 sources right now.
> Indeed, this answers my question about data capture.
> You have it there.
> The task of collecting and reporting EDAC errors
> naturally splits into collecting and reporting parts,
> and the collecting part is well covered by now
> (by you and other guys).
Yes. Btw, I just added there some code for getting PCIe AER
errors (still needs testing, so this feature needs to be enabled
explicitly for now).
I'm also intending to add today some code to collect MCE
traces, as we want to be able to replace mcelog by this tool
in the long term.
> As to reporting, rasdaemon-0.2.0 just writes
> captured data to a SQLite database.
> I guess it's sort of placeholder code for now;
> you don't really know yet what to do with the error reports.
Yes. The code that reads from the database has yet to be written.
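[Editorial note: the store-then-read split being discussed might look like this in miniature. The table name and columns below are illustrative assumptions, not rasdaemon's actual schema.]

```python
import sqlite3

def open_db(path=":memory:"):
    # Hypothetical schema; rasdaemon's real tables differ.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS mc_event (
                      id INTEGER PRIMARY KEY AUTOINCREMENT,
                      timestamp TEXT,
                      err_type TEXT,
                      label TEXT,
                      count INTEGER)""")
    return db

def record_event(db, timestamp, err_type, label, count=1):
    """What the collecting daemon does: append every event, drop nothing."""
    db.execute("INSERT INTO mc_event (timestamp, err_type, label, count) "
               "VALUES (?, ?, ?, ?)", (timestamp, err_type, label, count))
    db.commit()

def events_per_label(db):
    """What a future reader could start with: error totals per DIMM label."""
    return dict(db.execute(
        "SELECT label, SUM(count) FROM mc_event GROUP BY label"))
```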
> Good news: *we* know :)
> ABRT is the reporting tool. Our project goal is to report
> errors. We have been working on the reporting logic for some time.
> We just need to connect rasdaemon and ABRT.
Makes sense. I'd also like to have it as a configurable option, as
I expect rasdaemon to also be used by other distributions,
and they might not be using ABRT.
> I will brainstorm a little how we can go about doing that.
Great. Feel free to submit patches to bind it to ABRT.
> I imagine ECC errors, PCI bus errors and the like can manifest
> themselves in a few different scenarios.
> (1) Random corruption (e.g. a heavy ion particle hitting a DRAM cell):
> a single error, which happens rarely and in a random cell, and is corrected.
> (2) Marginal component:
> a single error, which happens periodically but not too often (not many
> times per second), and always in the same hardware piece
> (say, the same DRAM module).
> (3) Failing component or connection:
> many errors, potentially many errors per second, potentially
> a never-ending stream of them.
> While case (1) is straightforward - when it happens, we need
> to let the user know "hey, there was a corrected ECC error in DRAM
> module #2", and that's it - handling the other two cases needs some smarts.
Yes. I think we should split the problem into a few parts:
1) event collection;
2) some rule engine that will analyze the data and apply some
rules that could eventually be customized by the user;
3) the reporting engine.
I'm talking with a guy who is working on (2). Perhaps we'll
integrate his work there.
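[Editorial note: a rule engine of the kind item (2) describes could start as small as the sketch below. The threshold value and rule shape are illustrative assumptions, not anyone's actual design.]

```python
# Toy rule engine: user-tunable rules examine collected events and
# decide whether anything is worth raising a report about.
def over_threshold(events, per_label_limit=5):
    """Rule: flag any component that accumulated too many errors."""
    counts = {}
    for ev in events:
        counts[ev["label"]] = counts.get(ev["label"], 0) + 1
    return [label for label, n in counts.items() if n >= per_label_limit]

# The rule list is the customization point: users could add or replace rules.
RULES = [over_threshold]

def evaluate(events):
    findings = []
    for rule in RULES:
        findings.extend(rule(events))
    return findings
```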
> From our (ABRT project) past experience with problem reporting,
> it is important not to overload users with too many problem reports.
> If we tried to report 200 problems at once, it is almost certain
> that all of them would have a common cause. Actually reporting all 200
> problems would be worse than useless: it would swamp the user
> with mostly irrelevant or redundant information.
Agreed. That's basically why I think that using a local database is
important: it can be used to correlate events and determine the root
cause.
> To that end, we implemented in ABRT several overlapping mechanisms
> which reduce the problem report generation rate.
> For example, we try to coalesce detected problems if they look
> sufficiently alike:
> "12:34 Process /bin/foo was killed by signal 11 (SIGSEGV)"
> "13:56 Process /bin/foo was killed by signal 11 (SIGSEGV)"
> "15:12 Process /bin/foo was killed by signal 11 (SIGSEGV)"
> becomes one line:
> "12:34 Process /bin/foo was killed by signal 11 (SIGSEGV), and this
> happened 3 times since then"
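[Editorial note: the coalescing shown above can be sketched as follows. The function is hypothetical, not ABRT's actual code; only the message format is taken from the example.]

```python
# Identical problem signatures collapse into a single line that keeps
# the first timestamp and an occurrence count.
def coalesce(messages):
    seen = {}    # signature -> [first_timestamp, count]
    order = []   # signatures in order of first appearance
    for ts, sig in messages:
        if sig in seen:
            seen[sig][1] += 1
        else:
            seen[sig] = [ts, 1]
            order.append(sig)
    out = []
    for sig in order:
        ts, n = seen[sig]
        if n == 1:
            out.append("%s %s" % (ts, sig))
        else:
            out.append("%s %s, and this happened %d times since then"
                       % (ts, sig, n))
    return out
```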
> We also have cool-down periods in problem detectors,
> so, for example, if /bin/foo segfaults every 2 seconds
> because it is restarted when it exits, we will NOT
> create a new problem every time it segfaults.
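[Editorial note: a cool-down period like the one described might be implemented as a per-signature timer, as in this sketch; the 60-second window is an arbitrary example value.]

```python
# After accepting a report for a signature, further occurrences of the
# same signature are suppressed until the window elapses.
class CoolDown:
    def __init__(self, window=60.0):
        self.window = window
        self.last = {}   # signature -> time of last *accepted* report

    def should_report(self, signature, now):
        prev = self.last.get(signature)
        if prev is not None and now - prev < self.window:
            return False        # still cooling down: drop this one
        self.last[signature] = now
        return True
```

Note that suppressed occurrences deliberately do not reset the timer, so a steady stream of errors still yields one report per window rather than silence.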
> Scenario (2) "Marginal component" probably needs to use
> some sort of coalescing, and scenario (3) "Failing component"
> would need cool-down.
> As to how to implement this:
> how about launching (fork+execing) a helper tool when rasdaemon
> detects a problem, passing it the information via environment variables
> and/or stdin, and not caring (in rasdaemon) what happens next?
> rasdaemon will need to cap the number of parallel helpers running,
> for fairly obvious reasons.
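[Editorial note: the fork+exec idea with a cap on parallel helpers could be sketched like this. The helper path and environment variable names are made up for illustration.]

```python
import os
import subprocess

MAX_HELPERS = 4
_running = []   # Popen objects for helpers we have launched

def launch_helper(event, helper="/usr/libexec/ras-report-helper"):
    """Hand one event to a helper via the environment; cap concurrency."""
    # Reap helpers that have finished, then enforce the cap.
    _running[:] = [p for p in _running if p.poll() is None]
    if len(_running) >= MAX_HELPERS:
        return None     # too many helpers already; caller may retry later
    env = dict(os.environ)
    env["RAS_EVENT_TYPE"] = event["type"]      # hypothetical variable names
    env["RAS_EVENT_LABEL"] = event["label"]
    p = subprocess.Popen([helper], env=env)
    _running.append(p)
    return p
```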
IMHO, it would be better for rasdaemon to call a "rasanalysis" tool
(either by forking and execing it, if it is not running yet, or by
sending it a HUP signal). The rasanalysis tool will then just use the
SQL database to correlate the data, produce an error report if needed,
do whatever post-processing the error requires, and then exit.
Perhaps we can add an event sequence number to the database tables,
so the analysis tool knows where it needs to start its analysis.
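[Editorial note: the sequence-number handoff could work roughly as below; the table, columns, and sample data are assumptions for illustration.]

```python
import sqlite3

def setup_demo_db():
    """Build a tiny in-memory stand-in for the collector's database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE mc_event "
               "(id INTEGER PRIMARY KEY, err_type TEXT, label TEXT)")
    db.executemany("INSERT INTO mc_event VALUES (?, ?, ?)",
                   [(1, "Corrected", "DIMM_A1"),
                    (2, "Corrected", "DIMM_A1"),
                    (3, "Uncorrected", "DIMM_B2")])
    return db

def fetch_new_events(db, last_seen_id):
    """The analysis tool resumes from the last row id it processed."""
    rows = db.execute(
        "SELECT id, err_type, label FROM mc_event WHERE id > ? ORDER BY id",
        (last_seen_id,)).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last
```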
> rasdaemon will need to implement cool-down for scenario (3).
> That is, if a large number of errors starts to come down from the kernel,
> rasdaemon should ideally collect a few of them (on the order of 10),
> then launch *one* helper and pass this data down to it, then
> ignore all further errors of this type for the next N minutes.
I don't think that the collecting daemon should ignore the errors. It
should keep storing them in the database.
The analysis tool needs to check if the error is happening at the
same DIMM in order to know if the error should be counted as "one
more error of the same type".
Also, on newer kernels/hardware, it is possible to stop using the
affected page, marking it internally as "bad". After marking it as
bad, the system can be used normally.
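[Editorial note: on kernels that support page soft-offlining, retiring a page comes down to writing its physical address to a sysfs file. This sketch assumes the /sys/devices/system/memory/soft_offline_page interface and a 4 KB page size; actually performing the write requires root and a capable kernel, so only the address math is exercised here.]

```python
PAGE_SIZE = 4096
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"

def page_align(phys_addr, page_size=PAGE_SIZE):
    """Round a physical address down to the start of its page."""
    return phys_addr & ~(page_size - 1)

def soft_offline(phys_addr, path=SOFT_OFFLINE):
    """Ask the kernel to retire the page containing phys_addr."""
    addr = page_align(phys_addr)
    with open(path, "w") as f:   # needs root; may fail on older kernels
        f.write("0x%x" % addr)
```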
Imagine a system with a very large amount of memory, on the order of
terabytes. On such systems there will be a lot of memory channels and
DIMMs, and the chance that memory errors will happen increases a lot.
Eventually, stopping such a big machine just because a few KB of memory
are bad, in order to replace an entire big-memory DIMM, may not be the
right thing to do; or the replacement could be delayed to a scheduled
maintenance window, if the memory pages with errors are not used.
So, checking whether the errors keep happening at the same memory page
could show that a particular part of the DIMM has problems
and can be disabled.
In any case, keeping everything in the SQL database helps the
user to later dig into what is happening on the system, e.g. how
many errors happened per type/page, etc.
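[Editorial note: the kind of after-the-fact digging the database enables could look like this; the schema and sample data are again assumptions.]

```python
import sqlite3

def demo_page_db():
    """Tiny stand-in database with a per-event physical page column."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE mc_event "
               "(id INTEGER PRIMARY KEY, err_type TEXT, page INTEGER)")
    db.executemany("INSERT INTO mc_event (err_type, page) VALUES (?, ?)",
                   [("Corrected", 0x2a000),
                    ("Corrected", 0x2a000),
                    ("Corrected", 0x7f000)])
    return db

def errors_per_page(db):
    """How many errors each page accumulated, worst offender first."""
    return db.execute(
        "SELECT page, COUNT(*) AS n FROM mc_event "
        "GROUP BY page ORDER BY n DESC").fetchall()
```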
> No special handling is needed in rasdaemon for scenario (2).
> Since a marginal component generates errors infrequently,
> running one helper per error is OK. We will implement problem
> data coalescing on the ABRT side - see my example above about
> SIGSEGVing /bin/foo.
> What do you think?