On Wed, 2013-10-02 at 12:32 +0200, Denys Vlasenko wrote:
On 10/01/2013 10:15 PM, Petr Holasek wrote:
> On Tue, 01 Oct 2013, Denys Vlasenko wrote:
>> On 09/27/2013 09:29 AM, Jiri Moskovcak wrote:
>>> On 09/27/2013 08:46 AM, Junliang Li wrote:
>>>> Fedora 19 has a tool named rasdaemon. From its maintainer, Mauro, I
>>>> understand that rasdaemon and ABRT will work together. But I don't know
>>>> much about that. Could anyone explain a bit about rasdaemon and ABRT?
>>> Denys is responsible for rasdaemon & ABRT integration, so I'm adding
>>> him to the loop.
>> IIUC rasdaemon does not yet send its data to abrt.
>> The rasdaemon developers are working on a way to prevent
>> floods of error reports: it's semi-trivial to generate
>> a single report about an isolated ECC error on a PCIe bus,
>> but what if there are thousands of them per second?
>> We (the abrt team) provided the documentation necessary
>> to use abrt's "create problem data" API.
>> We are ready to help the rasdaemon people if they have
>> questions or proposals for changes in abrt.
>> Some of them (Petr Holasek) are colocated with
>> the abrt team and can just walk over and talk with us.
> Hello all,
> to be honest, I still haven't found time to dig into the implementation of
> an abrt hook for rasdaemon, and we are also still waiting for the Intel guys
> who are implementing code to reduce floods of errors in some reasonable manner.
How about reporting the first detected error to abrt right away, then,
if more errors happen, holding off for a few seconds and
batch-reporting them as one problem ("1234 PCIe parity errors happened
at 12:34 during 4 seconds on the device FOO" would be a nice way to report
such a problem).
Increase the cooldown period if errors keep coming, with a cap.
We have something like this elsewhere in abrt:
	unsigned cooldown_sec = 5;
	cooldown_sec *= cooldown_sec;
	if (cooldown_sec > 15 * 60)
		cooldown_sec = 15 * 60;
With formulas like the one above, the cooldown rises quickly, resulting in
just a few problem reports even with a constant flood of error events;
yet it does not grow to astronomical values: "collect PCIe errors
for the next 27 hours and report them as one"
is obviously a bad idea too.
A cooldown period is a good idea. Letting sysadmins customize their report
threshold in rasdaemon would be OK. Maybe we just need to add a plugin to
rasdaemon that customizes the threshold and works as the abrt hook.
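
As a rough illustration of what "customizable" could mean here, the sketch below reads the two cooldown parameters from "key=value" settings. Everything in it is an assumption made for illustration: rasdaemon has no such plugin today as far as this thread says, and struct flood_config, flood_config_apply() and the key names cooldown_start_sec and cooldown_cap_sec are invented. The defaults are the 5 seconds and 15 minutes from the abrt snippet quoted earlier.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tunables for the flood-control hook. */
struct flood_config {
	unsigned cooldown_start_sec; /* first batching window */
	unsigned cooldown_cap_sec;   /* upper bound for the window */
};

/* Defaults taken from the abrt snippet: 5 seconds, capped at 15 minutes. */
static const struct flood_config FLOOD_CONFIG_DEFAULT = { 5, 15 * 60 };

/* Return true if the key part of `line` (klen bytes) equals `name`. */
static int key_is(const char *line, size_t klen, const char *name)
{
	return klen == strlen(name) && strncmp(line, name, klen) == 0;
}

/*
 * Apply one "key=value" setting to cfg.
 * Returns 0 on success, -1 on an unknown key or an invalid value.
 * Values must be whole seconds between 1 and 86400 (one day).
 */
static int flood_config_apply(struct flood_config *cfg, const char *line)
{
	assert(cfg != NULL && line != NULL);

	const char *eq = strchr(line, '=');
	if (eq == NULL)
		return -1;

	char *end;
	unsigned long v = strtoul(eq + 1, &end, 10);
	if (end == eq + 1 || *end != '\0' || v == 0 || v > 24UL * 60 * 60)
		return -1;

	size_t klen = (size_t)(eq - line);
	if (key_is(line, klen, "cooldown_start_sec"))
		cfg->cooldown_start_sec = (unsigned)v;
	else if (key_is(line, klen, "cooldown_cap_sec"))
		cfg->cooldown_cap_sec = (unsigned)v;
	else
		return -1;
	return 0;
}
```

A caller would start from FLOOD_CONFIG_DEFAULT and apply each line of the sysadmin's settings in turn, ignoring or logging the lines that return -1. The sketch does not check that the start value stays below the cap; a real plugin would probably want to.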