On 10/01/2013 10:15 PM, Petr Holasek wrote:
On Tue, 01 Oct 2013, Denys Vlasenko wrote:
> On 09/27/2013 09:29 AM, Jiri Moskovcak wrote:
>> On 09/27/2013 08:46 AM, Junliang Li wrote:
>>> Fedora 19 has a tool named rasdaemon. From its maintainer, Mauro, I know
>>> that rasdaemon and ABRT will work together. But I don't know much about
>>> that. Would anyone introduce something about rasdaemon and ABRT?
>> Denys is responsible for rasdaemon&ABRT integration, so I'm adding him to
> IIUC rasdaemon does not send its data yet to abrt.
> rasdaemon developers work on the way to prevent
> floods of error reports: it's semi-trivial to generate
> a single report about an isolated ECC error on PCIe bus;
> but what if there are thousands of them per second?
> We (abrt team) provided documentation necessary
> to use abrt's "create problem data" API.
> We are ready to aid rasdaemon people if they have
> questions or proposals for changes in abrt.
> Some of them (Petr Holasek) are colocated with
> abrt team and can just walk over and talk with us.
To be honest, I still haven't found time to dig into the implementation of the
abrt hook for rasdaemon, and we are still waiting for the Intel guys who are
implementing code to reduce floods of errors in some reasonable manner.
How about reporting the first detected error to abrt right away, then,
if more errors happen, holding off for a few seconds and
batch-reporting them as one problem ("1234 PCIe parity errors happened
at 12:34 during 4 seconds on the device FOO" would be a nice way to report
such a problem)?
Increase the cooldown period if errors keep coming, with a cap.
We have something like this elsewhere in abrt:
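    unsigned cooldown_sec = 5;        /* initial cooldown, in seconds */
    /* each time errors keep coming: */
    cooldown_sec *= cooldown_sec;     /* squares: 5 -> 25 -> 625 */
    if (cooldown_sec > 15 * 60)
        cooldown_sec = 15 * 60;       /* cap at 15 minutes */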
With a formula like the one above, the cooldown rises quickly (5, 25, 625
seconds, then the 15-minute cap), resulting in just a few problem reports
even under a constant flood of error events; yet it does not grow to
astronomical values. "Collect PCIe errors for the next 27 hours and report
them as one" is obviously a bad idea too.
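
To make the scheme above a bit more concrete, here is a rough sketch in C of
how the "report first error at once, then batch with a growing cooldown"
logic might look. This is not actual abrt or rasdaemon code: the struct and
all function names (flood_filter, flood_filter_on_error, flood_filter_on_tick,
next_cooldown) are made up for illustration, and I assume the caller supplies
timestamps in seconds and calls the tick function periodically. Each function
returns how many errors should be reported to abrt right now.

```c
#include <assert.h>
#include <stddef.h>

#define COOLDOWN_INITIAL_SEC 5
#define COOLDOWN_MAX_SEC     (15 * 60)

/* Hypothetical per-device state; not a real abrt or rasdaemon type. */
struct flood_filter {
    unsigned cooldown_sec;  /* length of the current collection window */
    long     window_start;  /* when the current window opened */
    unsigned pending;       /* errors collected in the current window */
    int      active;        /* nonzero while a window is open */
};

void flood_filter_init(struct flood_filter *f)
{
    assert(f != NULL);
    f->cooldown_sec = COOLDOWN_INITIAL_SEC;
    f->window_start = 0;
    f->pending = 0;
    f->active = 0;
}

/* Square the cooldown and cap it at 15 minutes, as in the snippet above. */
unsigned next_cooldown(unsigned sec)
{
    sec *= sec;
    if (sec > COOLDOWN_MAX_SEC)
        sec = COOLDOWN_MAX_SEC;
    return sec;
}

/*
 * Call for every detected error.
 * Returns 1 if this error should be reported immediately (first error
 * after a quiet period), 0 if it was queued for a later batch report.
 */
unsigned flood_filter_on_error(struct flood_filter *f, long now)
{
    if (!f->active) {
        f->active = 1;
        f->window_start = now;
        f->pending = 0;
        return 1;
    }
    f->pending++;
    return 0;
}

/*
 * Call periodically. When the current window has expired:
 *  - if errors were queued, return their count (one batch report),
 *    grow the cooldown and open a new window;
 *  - if the window was quiet, reset to the initial cooldown.
 * Returns 0 when there is nothing to report.
 */
unsigned flood_filter_on_tick(struct flood_filter *f, long now)
{
    unsigned n;

    if (!f->active || now - f->window_start < (long)f->cooldown_sec)
        return 0;

    n = f->pending;
    if (n > 0) {
        f->cooldown_sec = next_cooldown(f->cooldown_sec);
        f->window_start = now;
        f->pending = 0;
    } else {
        f->cooldown_sec = COOLDOWN_INITIAL_SEC;
        f->active = 0;
    }
    return n;
}
```

A nonzero return from flood_filter_on_tick() would translate into one problem
report of the "N errors during M seconds on device FOO" kind. Resetting the
cooldown after a quiet window is my own assumption; a slower decay may well
be a better choice in practice.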