Let's separate the two questions. First, what information do you want to
extract and combine to form the "crash identity"? Second, how do we go
about getting that data?
The obvious starting place is a raw backtrace, that is, just an ordered
list of PCs. We canonicalize each PC by identifying its module and making
it relative to that module's load address, like so:
build ID af2dede56c346a99c6f4b7f011574416f3afef7b + 0xba130
build ID 0948352b72da50b6935f3530d05835b7c2257db2 + 0x51e6
build ID 0948352b72da50b6935f3530d05835b7c2257db2 + 0x1df9
build ID af2dede56c346a99c6f4b7f011574416f3afef7b + 0x376
build ID 0948352b72da50b6935f3530d05835b7c2257db2 + 0x1709
This canonicalizes each PC to one particular file, independent of the
changing load addresses of DSOs and PIEs. On the back end (with
debuginfo), this is enough to reconstruct a list of rpms, the files in
them, and source locations in those files.
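To make that step concrete, here is a minimal sketch of the canonicalization
in Python, assuming we have already pulled two things out of the core dump:
the raw backtrace PCs, and a load map of (start, end, build ID) for each
module mapped into the process. The names and structure are hypothetical,
not a proposed interface.

    from typing import List, Optional, Tuple

    Module = Tuple[int, int, str]       # (load_start, load_end, build_id)
    CanonicalPC = Tuple[str, int]       # (build_id, offset from the load base)

    def canonicalize(pcs: List[int],
                     modules: List[Module]) -> List[Optional[CanonicalPC]]:
        out: List[Optional[CanonicalPC]] = []
        for pc in pcs:
            hit = None
            for start, end, build_id in modules:
                if start <= pc < end:
                    # The offset is relative to the module's load base, so it
                    # is the same no matter where the DSO/PIE landed this run.
                    hit = (build_id, pc - start)
                    break
            out.append(hit)             # None if the PC fell outside any module
        return out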
For purposes of "crash identity", you might actually want a looser binding,
or a tighter one. I'll explain what I mean.
That example backtrace is from the core dump I made by running "cat" and
hitting ^\, so it goes from the "cat" binary's own code down into the read
call in libc. (The source-level info for those frames is not available at
crash time without debuginfo; I bring it up only to illustrate the example.)
In an x86-32 example, the backtrace would start in the vDSO, which for our
purposes appears like any other DSO; it has a build ID that maps back to the
particular kernel build.
Now let's pretend this is some application crash: for this example, say
"cat" had a bug so that it crashed by passing bad pointers into the libc
call (or, say, it called "abort" instead of "read" here). So the most
salient "identity" of this crash is that it went from "cat"'s own code into
that particular libc call.
If one user has upgraded glibc and another has not, you might want to say
that the two crashes are "the same crash" anyway. Even more so for a
backtrace that includes the vDSO, where one user might be running the i586
vs i686 vs PAE kernel variant, or a different kernel upgrade version--but the crash
really has nothing to do with the kernel variant. (It's almost always the
case that a kernel change does not change the vDSO at all, though each vDSO
build ID is new for each kernel build. It's quite likely that a given
glibc upgrade won't change any of the functions' internal code offsets at
all for functions likely to appear in an application backtrace.)
So that would be a "looser" approach. That is, don't constrain each PC to
a specific build ID (i.e. an exact match on the binary loaded).
Even looser would be to use only the symbol each PC falls in. That is,
don't constrain each PC to its code offset at all, just the proximate
symbol name.
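As a sketch of that spectrum, with hypothetical frame fields, and taking the
middle "looser" level to mean dropping the exact build ID while keeping the
symbol-relative location:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Frame:
        build_id: str                  # build ID of the module the PC fell in
        module: str                    # e.g. the DSO or executable file name
        offset: int                    # PC offset from the module load base
        symbol: Optional[str] = None   # proximate symbol name, if resolvable
        sym_offset: int = 0            # offset of the PC within that symbol

    def tight_key(f: Frame):
        # Exact match on the binary loaded and the precise code offset.
        return (f.build_id, f.offset)

    def looser_key(f: Frame):
        # One reading of "looser": the same place in the code, but not
        # pinned to one exact build of the module.
        return (f.module, f.symbol, f.sym_offset)

    def loosest_key(f: Frame):
        # Just the proximate symbol name, ignoring code offsets entirely.
        return (f.module, f.symbol)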
In the contrary direction, you might want to collect all the build IDs
loaded in the process, not just the ones whose PCs appear in the backtrace.
For example, say there is a crash in firefox, where all the PCs fall either
into the firefox binary or in the libc binary. But the most relevant fact
about that crash might actually be that the "foobar" plugin is loaded--say
that crash never happens when "foobar" is not loaded, though none of the
"foobar" DSO's code appears in the actual list of backtrace PCs. So that
would be a "tighter" approach. That is helpful in the "foobar"
example. OTOH, it splays into many different "crash identities" for what's
really the same crash, not only from unimportant kernel variant or upgrade
drift differences, but potentially from factors like /etc/nsswitch.conf
settings differences that often have nothing at all to do with the crash.
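A sketch of that tighter end, folding the full set of loaded modules into
the identity rather than only the modules hit by backtrace PCs (the SHA-1
encoding here is just one arbitrary way to make a compact key, not a
proposed format):

    import hashlib
    from typing import Iterable, List, Tuple

    def tight_identity(canonical_pcs: List[Tuple[str, int]],
                       loaded_build_ids: Iterable[str]) -> str:
        h = hashlib.sha1()
        for build_id, offset in canonical_pcs:
            h.update(f"{build_id}+{offset:#x};".encode())
        # Sort so load order doesn't perturb the identity; including every
        # loaded module is what lets "the foobar plugin was loaded" show up
        # as a distinguishing feature.
        for build_id in sorted(set(loaded_build_ids)):
            h.update(f"mod:{build_id};".encode())
        return h.hexdigest()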
I've only talked about the first of the two questions I posed at the top,
and now I'm going to split that one into two and will have covered only
parts of each of those halves by the end of this message.
First half: is this enough info to start with (the raw backtrace)? The other
thing readily available to extract without fancy knowledge (debuginfo) is
the crash-time registers and some related magic (see eu-readelf -n corefile
for what is in there). Except for the crash signal info, this doesn't seem
all that likely to help any more than just the PCs. It seems far more
likely to leak user data (or at least be feared to do so), and certainly to
have garbage/flutter that fails to match across "identical" crashes. Is
there anything else that might go into it?
The new second half of the first question goes to what kind of answers we
want for that first half and for the looser/tighter spectrum of questions,
and it's why, instead of a solid answer to the first question, I've just
delivered more questions. That is, what exactly is the purpose of the
stated goal:
"We want to be able to recognize duplicate crashes just from core dumps."
Why exactly, and what does "duplicate" mean exactly? The point I'm getting
at is the split between the up-front extraction, the "crash identity"
distillation, and the full "recognize duplicates" picture.
One answer is that the purpose is to reduce the number of times we submit a
crash report to whatever the central tracking facility is (or queue a report
to be submitted, or to a local server, whatever), possibly implying also the
number of times a user might be asked to decide about submitting, etc. For
this sort of goal, it's a "winnowing" just for overhead/load/annoyance
reduction purposes. Then I'd say it follows that we should err on the side
of including more distinguishing features in the distillation done at the
time of up-front collection (i.e. locally at crash time).
If the same user does the same thing twice in a row it will be recognized
locally as the same "crash identity" and not impose any repetition of
reporting/whatnot. If the user upgrades something that's not really
related, or fiddles a configuration that is not really related, another
instance of "really the same crash" might be filed separately--an annoyance
scaled to the number of upgrades or configuration changes or plugin
combinations or whatever that the user covers.
From the back-end service perspective, erring on this side still
winnows thousands down to dozens of unique crash identities up front before
hammering the service. In the back end, the field is open and the resources
available (i.e. debuginfo et al) to do all manner of smarts in further
distilling several unique crash identities with a "tight" algorithm into a
single "crash profile". By mapping to source line info, it can even call
two crashes from different arches "the same" because they crash on the same
line, though in two different arch builds with entirely different build IDs
and code offsets. More generally, back-end tools can grow arbitrarily fancy
in analyzing and presenting a spectrum of similarity culled from many "base"
crash identities. (Not to mention the endless variety of fun analyses like
"rpm name + source line implicated in most crashes this week", "this app
crashed most often when the moon was full", feeding stats into maintainers'
Fedora pages or ohloh metrics, etc., etc.)
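For instance, the back-end source-line mapping might look roughly like this,
assuming a Fedora-style /usr/lib/debug/.build-id layout with the matching
debuginfo packages installed and eu-addr2line from elfutils available; the
function and its error handling are illustrative only:

    import subprocess

    def source_line(build_id: str, offset: int) -> str:
        # Fedora debuginfo packages install a symlink farm keyed by build ID;
        # the ".debug" entry points at the file holding the DWARF data.
        debug_file = (f"/usr/lib/debug/.build-id/"
                      f"{build_id[:2]}/{build_id[2:]}.debug")
        # Note: for DSOs/PIEs the module-relative offset matches the vaddr
        # eu-addr2line expects; a fixed-address executable would need its
        # offset rebased to the link-time address first.
        result = subprocess.run(["eu-addr2line", "-e", debug_file, hex(offset)],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()   # e.g. "some_file.c:123"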
On the local up-front collection side, there is also a space to work in
between the tightly-construed "base" crash identity and some "buckets of
sameness" determined locally (without much overhead or fancy business). It
seems right to submit the details for each different base crash identity to
the back-end service, since it has the most sophistication on judging the
similarity of crashes. But there can be some local, simpler heuristics on
lumping together multiple crashes experienced on this host for purposes of
policy and user interaction.
In a desktop interaction example, say that in the first instance, an
application crash asks the user, "Report this crash?" If a second crash
with the same base crash identity happens, clearly it will say, "This is
like your crash on <date:time> and was already reported." But, if a third
crash happens with a second base crash identity, it might apply some local
(cheap) heuristics and decide to say, "This crash appears similar to the
one I reported for you on <date:time>, reporting follow-up details too."
For example, the local heuristics might apply the "loosest" of the examples
above (just file and function names in the backtrace, regardless of
versions, other plugins loaded, etc.) and judge it a "similar enough" crash
if a previous different base crash identity's backtrace matched on this
loose standard. Then automatically apply the policy choice made for the
first "similar enough" crash to new crashes. When it sends the follow-up
report to the back-end service for a new base crash identity, it can
include, "and btw, I'm reporting this because I deemed this similar to crash
id 123abc that you already know about" (and perhaps, "heuristic module
foobar version n with settings blah blah made this judgment").
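A minimal sketch of that local lumping, assuming the loosest
(module, symbol) standard from above and a simple remembered decision per
loose key (all names here are hypothetical):

    from typing import Callable, Dict, List, Tuple

    LooseKey = Tuple[Tuple[str, str], ...]   # tuple of (module, symbol) pairs

    class LocalCrashPolicy:
        def __init__(self) -> None:
            self._decisions: Dict[LooseKey, str] = {}   # e.g. "report" / "ignore"

        def handle(self, frames: List[Tuple[str, str]],
                   ask_user: Callable[[], str]) -> str:
            key: LooseKey = tuple(frames)
            if key in self._decisions:
                # "Similar enough" to a crash already handled on this host:
                # reuse the earlier choice instead of asking again, and the
                # follow-up report can say which prior crash justified it.
                return self._decisions[key]
            choice = ask_user()          # first time: ask, then remember
            self._decisions[key] = choice
            return choice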
Or the local policy/user configuration might only care about really broad
heuristics of similarity, like "firefox crashed" or just "anything crashed
at all". Seen one, seen 'em all: just report the new catastrophe like you
did the last one and don't ask me any fancy questions!
So, on the assumption that the motivating purpose is something along these
lines, my inclination is to err on the side of being overly specific in the
distillation of base crash identity in up-front collection (and, by
extension, in the reporting of unique base crash identities to the back-end
service), and then err on the side of loose heuristics applied locally to
decide which base crash identities are similar enough to treat as one
(i.e. report all or none, minimize repeated interaction on "similars").
I think the set of module build IDs plus the canonicalized backtrace PCs
is about the right balance for base crash identity. I leave it open what
local heuristics of similarity to implement and how those might relate to
policy/interaction choices for handling categories of crashes. Is any of
this near the right direction for what you want?