On 05/13/2013 05:08 PM, Mauro Carvalho Chehab wrote:
On 13-05-2013 11:14, Denys Vlasenko wrote:
> I looked at current EDAC drivers and they don't seem to have
> an easily greppable message signature standardized across them.
No, they don't. In the future, the idea is to collect those errors
via tracing. See:
> 2) Anything better than log scrubbing?
> What method would you recommend for userspace tools
> to detect EDAC errors?
> What solution did the discussion started here:
> end up with?
As I said before, the plan is to use tracing. Kernel 3.10
(or the corresponding backported patches) is needed in order
to fully benefit from it, as several patches in the perf/tracing
infrastructure in kernel 3.10 are needed for this to work
properly (e.g. having a separate perf instance for hardware
errors, and matching the kernel timestamp to the machine's uptime).
I am reading rasdaemon-0.2.0 sources right now.
Indeed this answers my question about data capture.
You have it there.
The task of collecting and reporting EDAC errors
naturally splits into collecting and reporting parts,
and the collecting part is well covered by now
(by you and other guys).
As to reporting, rasdaemon-0.2.0 just writes
captured data to a SQLite database.
I guess it's sort of placeholder code for now;
you don't really know yet what to do with the error reports.
Good news: *we* know :)
ABRT is the reporting tool. Our project goal is to report
errors. We were working on the reporting logic for some time.
We just need to connect rasdaemon and ABRT.
I will brainstorm a little how we can go about doing that.
I imagine ECC errors, PCI bus errors and the like can manifest
themselves in a few different scenarios.
(1) Random corruption (e.g. heavy ion particle hitting DRAM cell):
Single error, happens rarely and in a random cell, is corrected.
(2) Marginal component.
Single error, happens periodically but not too often (not many
times per second), and happens in the same hw piece
(say, same DRAM module).
(3) Failing component or connection.
Many errors, potentially many errors per second, potentially
never-ending stream of them.
Case (1) is straightforward: when it happens, we need
to let the user know "hey, there was a corrected ECC error in DRAM module #2",
and that's it. Handling the other two cases needs some smarts.
From our (ABRT project) past experience with problem reporting,
it is important to not overload users with too many problem reports.
If we were to report 200 problems at once, it is almost certain
that all of them share a common cause. Actually reporting all 200
problems would be worse than useless: it would swamp the user
with mostly irrelevant or redundant information.
To that end, we implemented in ABRT several overlapping mechanisms
which reduce problem report generation rate.
For example, we try to coalesce detected problems if they look
sufficiently alike:
"12:34 Process /bin/foo was killed by signal 11 (SIGSEGV)"
"13:56 Process /bin/foo was killed by signal 11 (SIGSEGV)"
"15:12 Process /bin/foo was killed by signal 11 (SIGSEGV)"
become one report:
"12:34 Process /bin/foo was killed by signal 11 (SIGSEGV), and this happened 3
times since then"
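The coalescing step can be sketched in a few lines of C. This is a
minimal illustration, assuming a hypothetical coalesce_report()
function and a fixed-size in-memory table - not ABRT's actual data
structures:

```c
#include <string.h>
#include <stdio.h>

#define MAX_SLOTS 64

struct problem_slot {
    char signature[128];   /* e.g. "/bin/foo:SIGSEGV" */
    unsigned count;        /* occurrences seen so far */
};

static struct problem_slot slots[MAX_SLOTS];
static unsigned nslots;

/* Returns how many times this signature has been seen, including
   the current occurrence. Only the first occurrence would trigger
   a full report; later ones just bump the counter. */
unsigned coalesce_report(const char *signature)
{
    for (unsigned i = 0; i < nslots; i++) {
        if (strcmp(slots[i].signature, signature) == 0)
            return ++slots[i].count;
    }
    if (nslots < MAX_SLOTS) {
        snprintf(slots[nslots].signature,
                 sizeof slots[0].signature, "%s", signature);
        slots[nslots].count = 1;
        return slots[nslots++].count;
    }
    return 0; /* table full: caller falls back to reporting directly */
}
```

The key design point is that the signature is coarse on purpose
(binary plus signal, not timestamps), so repeats of the same crash
collapse into one entry.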
We also have cool-down periods in problem detectors,
so that, for example, if /bin/foo segfaults every 2 seconds
because it is restarted whenever it exits, we will NOT
create a new problem report every time it segfaults.
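A cool-down check of this kind can be as small as one function.
A sketch, with a hypothetical should_report() and an assumed
20-second window (neither is ABRT's real code):

```c
#include <time.h>
#include <stdbool.h>

/* Suppress a new problem report if the previous one for the same
   signature was created less than COOLDOWN_SECS ago. */
#define COOLDOWN_SECS 20

bool should_report(time_t *last_report, time_t now)
{
    if (*last_report != 0 && now - *last_report < COOLDOWN_SECS)
        return false;       /* still cooling down: drop the event */
    *last_report = now;     /* accept, and restart the window */
    return true;
}
```

The caller keeps one last_report timestamp per problem signature;
a /bin/foo that segfaults every 2 seconds then produces one report
per window instead of one per crash.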
Scenario (2) "Marginal component" probably needs to use
some sort of coalescing, and scenario (3) "Failing component"
would need cool-down.
As to how to implement this: how about launching (fork+execing)
a helper tool when rasdaemon detects a problem, passing it the
information via environment variables and/or stdin, and not caring
(in rasdaemon) what happens next?
rasdaemon will need to cap the number of parallel helpers running
for fairly obvious reasons.
rasdaemon will need to implement cool-down for scenario (3).
That is, if a large number of errors start to come down from kernel,
rasdaemon should ideally collect a few of them (on the order of 10),
then launch *one* helper and pass it down this data, then
ignore all further errors of this type for the next N minutes.
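The collect-then-ignore logic above could look roughly like this.
burst_event(), the 10-error sample size and the 10-minute ignore
window are all illustrative placeholders:

```c
#include <stdbool.h>
#include <time.h>

#define BURST_SAMPLE 10          /* buffer this many errors first */
#define IGNORE_SECS  (10 * 60)   /* then go quiet for N minutes */

struct burst_state {
    unsigned buffered;           /* errors collected so far */
    time_t   ignore_until;       /* drop everything before this time */
};

/* Call once per incoming error of a given type. Returns true exactly
   when the caller should launch the single helper with the sample. */
bool burst_event(struct burst_state *s, time_t now)
{
    if (now < s->ignore_until)
        return false;            /* inside the ignore window */
    if (++s->buffered < BURST_SAMPLE)
        return false;            /* still sampling the burst */
    s->buffered = 0;
    s->ignore_until = now + IGNORE_SECS;
    return true;                 /* fire one helper, then go quiet */
}
```

One burst_state per error type gives exactly one helper invocation
per burst, however many errors the kernel streams up.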
No special handling is needed in rasdaemon for scenario (2).
Since a marginal component generates errors infrequently,
running one helper per error is OK. We will implement problem
data coalescing on the ABRT side - see my SIGSEGV coalescing
example above.
What do you think?