Hi all,
Has anyone on the desktop team given much attention to a crasher I see daily (often multiple times) and my inbox fills with new comments every other day?
https://bugzilla.redhat.com/show_bug.cgi?id=1510059
If not I will push it upstream.
Thanks, Michael
On Thu, 2017-12-14 at 11:00 -0600, Michael Cronenworth wrote:
Hi all,
Has anyone on the desktop team given much attention to a crasher I see daily (often multiple times) and my inbox fills with new comments every other day?
https://bugzilla.redhat.com/show_bug.cgi?id=1510059
If not I will push it upstream.
I think this is another case of false duping. If you look at the "dupes" - especially I find it useful to look at the somewhat-wrongly- named var_log_messages log, which is actually a record of system log messages (whether actually from /var/log/messages or from the journal) around the time of the crash - you'll see that they really have nothing in common.
What I think is going on in several bugs like this is that there are many ways to produce a very similar traceback, because of this `_g_log_abort` function in glib. What that basically is - I went and looked at it - is a sort of 'special' logging function which, when called, is intended to log a message and then terminate the app. I think in many cases the same 3 or 4 frames appear in the trace around the `_g_log_abort` call - in this case it's these frames that seem to appear in all the tracebacks:
#0 raise at ../sysdeps/unix/sysv/linux/raise.c:51 #1 dump_gjs_stack_on_signal_handler at ../src/main.c:372 #3 _g_log_abort at gmessages.c:554
but after that, the traces may differ. There appear to be a few patterns for the next two frames, like:
#4 g_log_default_handler at gmessages.c:3051 #5 default_log_handler at ../src/main.c:315
or:
#4 g_log_writer_default at gmessages.c:2628 #5 g_log_structured_array at gmessages.c:1941
but we're still basically just in "log something" codepaths. Once you follow the traces back past the "log something" frames to the frames which indicate what was actually *going on*, they're often different. So in your report we have:
#9 xkl_process_error at xklavier_evt.c:541 #14 XPending at Pending.c:55 #15 gdk_check_xpending at gdkeventsource.c:269 #16 gdk_event_source_check at gdkeventsource.c:306
(xklavier is to do with keyboards, IIRC). But in the first few "dupes", we see:
#8 clutter_box_layout_allocate at clutter-box-layout.c:1010 #9 st_widget_allocate at ../src/st/st-widget.c:426 #10 st_box_layout_allocate at ../src/st/st-box-layout.c:288 #11 clutter_actor_allocate_internal at clutter-actor.c:9972 #12 clutter_actor_set_animatable_property at clutter-actor.c:14986
#8 x_io_error at wayland/meta-xwayland.c:418 #10 _XReadEvents at xcb_io.c:400 #11 XIfEvent at IfEvent.c:68 #12 meta_display_get_current_time_roundtrip at core/display.c:1332 #13 meta_wayland_surface_destroy_window at wayland/meta-wayland- surface.c:430
#8 x_io_error at wayland/meta-xwayland.c:418 #11 XPending at Pending.c:55 #12 gdk_check_xpending at gdkeventsource.c:269 #13 gdk_event_source_prepare at gdkeventsource.c:287 #14 g_main_context_prepare at gmain.c:3450
#8 object_instance_finalize at gi/object.cpp:1490 #9 js::Class::doFinalize at /usr/src/debug/mozjs52-52.4.0- 1.fc27.x86_64/dist/include/js/Class.h:816 #10 JSObject::finalize at /usr/src/debug/mozjs52-52.4.0- 1.fc27.x86_64/jsobjinlines.h:87 #11 js::gc::Arena::finalize<JSObject> at /usr/src/debug/mozjs52- 52.4.0-1.fc27.x86_64/jsgc.cpp:457 #12 FinalizeTypedArenas<JSObject>(js::FreeOp *, js::gc::Arena **, js::gc::SortedArenaList &, enum class AllocKind, SliceBudget &, js::gc::ArenaLists::KeepArenasEnum) at /usr/src/debug/mozjs52-52.4.0- 1.fc27.x86_64/jsgc.cpp:515
#8 g_malloc0 at gmem.c:129 #9 blur_pixels at ../src/st/st-private.c:280 #10 _st_create_shadow_pipeline at ../src/st/st-private.c:372 #11 _st_create_shadow_pipeline_from_actor at ../src/st/st- private.c:434 #12 st_icon_update_shadow_pipeline at ../src/st/st-icon.c:278
#8 clutter_box_layout_allocate at clutter-box-layout.c:1010 #9 st_widget_allocate at ../src/st/st-widget.c:426 #10 st_box_layout_allocate at ../src/st/st-box-layout.c:288 #11 clutter_actor_allocate_internal at clutter-actor.c:9972 #12 clutter_actor_set_animatable_property at clutter-actor.c:14986
#8 x_io_error at wayland/meta-xwayland.c:418 #11 XPending at Pending.c:55 #12 gdk_check_xpending at gdkeventsource.c:269 #13 gdk_event_source_prepare at gdkeventsource.c:287 #14 g_main_context_prepare at gmain.c:3450
#8 g_malloc0 at gmem.c:129 #9 blur_pixels at ../src/st/st-private.c:280 #10 _st_create_shadow_pipeline at ../src/st/st-private.c:372 #11 _st_create_shadow_pipeline_from_actor at ../src/st/st- private.c:434 #12 st_icon_update_shadow_pipeline at ../src/st/st-icon.c:278
#8 x_io_error at wayland/meta-xwayland.c:418 #10 _XReadEvents at xcb_io.c:400 #11 XIfEvent at IfEvent.c:68 #12 meta_display_get_current_time_roundtrip at core/display.c:1332 #13 meta_wayland_surface_destroy_window at wayland/meta-wayland-surface.c:430
There are a few cases there which look like dupes of each other, but there are at least 6 very *different* cases in there. They really don't look like they're dupes of each other at all.
It would probably be a good idea to get some abrt/libreport and glib folks together to figure out how abrt/libreport can handle these `_g_log_abort` cases (and any other similar cases) better.
For now, I'll try and go through all the "dupes" of this bug and clean them up.
Hi,
On Thu, Dec 14, 2017 at 4:22 PM Adam Williamson adamwill@fedoraproject.org wrote:
#9 xkl_process_error at xklavier_evt.c:541 #14 XPending at Pending.c:55 #15 gdk_check_xpending at gdkeventsource.c:269 #16 gdk_event_source_check at gdkeventsource.c:306
(xklavier is to do with keyboards, IIRC). But in the first few "dupes",
xklavier does deal with keyboards, but this crash may be completely unrelated to xklavier unfortunately. Because of the asynchronous nature of X errors, the error could be for some completely unrelated reason, and xklavier's error handler was just the first to catch it.
#8 x_io_error at wayland/meta-xwayland.c:418 #10 _XReadEvents at xcb_io.c:400 #11 XIfEvent at IfEvent.c:68 #12 meta_display_get_current_time_roundtrip at core/display.c:1332 #13 meta_wayland_surface_destroy_window at wayland/meta-wayland- surface.c:430
#8 x_io_error at wayland/meta-xwayland.c:418 #11 XPending at Pending.c:55 #12 gdk_check_xpending at gdkeventsource.c:269 #13 gdk_event_source_prepare at gdkeventsource.c:287 #14 g_main_context_prepare at gmain.c:3450
These two mean Xwayland (or Xorg) tanked. x_io_error means disconnected from X server.
#8 object_instance_finalize at gi/object.cpp:1490 #9 js::Class::doFinalize at /usr/src/debug/mozjs52-52.4.0- 1.fc27.x86_64/dist/include/js/Class.h:816 #10 JSObject::finalize at /usr/src/debug/mozjs52-52.4.0- 1.fc27.x86_64/jsobjinlines.h:87 #11 js::gc::Arena::finalize<JSObject> at /usr/src/debug/mozjs52- 52.4.0-1.fc27.x86_64/jsgc.cpp:457 #12 FinalizeTypedArenas<JSObject>(js::FreeOp *, js::gc::Arena **, js::gc::SortedArenaList &, enum class AllocKind, SliceBudget &, js::gc::ArenaLists::KeepArenasEnum) at /usr/src/debug/mozjs52-52.4.0- 1.fc27.x86_64/jsgc.cpp:515
#8 g_malloc0 at gmem.c:129 #9 blur_pixels at ../src/st/st-private.c:280 #10 _st_create_shadow_pipeline at ../src/st/st-private.c:372 #11 _st_create_shadow_pipeline_from_actor at ../src/st/st- private.c:434 #12 st_icon_update_shadow_pipeline at ../src/st/st-icon.c:278
These two could both be heap corruption of some sort. valgrind might help with them. Though the first one, might also be some sort of gjs bug.
#8 x_io_error at wayland/meta-xwayland.c:418 #11 XPending at Pending.c:55 #12 gdk_check_xpending at gdkeventsource.c:269 #13 gdk_event_source_prepare at gdkeventsource.c:287 #14 g_main_context_prepare at gmain.c:3450
another X server crash
#8 g_malloc0 at gmem.c:129 #9 blur_pixels at ../src/st/st-private.c:280 #10 _st_create_shadow_pipeline at ../src/st/st-private.c:372 #11 _st_create_shadow_pipeline_from_actor at ../src/st/st- private.c:434 #12 st_icon_update_shadow_pipeline at ../src/st/st-icon.c:278
another heap corruption bug.
#8 x_io_error at wayland/meta-xwayland.c:418 #10 _XReadEvents at xcb_io.c:400 #11 XIfEvent at IfEvent.c:68 #12 meta_display_get_current_time_roundtrip at core/display.c:1332 #13 meta_wayland_surface_destroy_window at wayland/meta-wayland-surface.c:430
another X server crash
--Ray
On Thu, 2017-12-14 at 13:22 -0800, Adam Williamson wrote:
What I think is going on in several bugs like this is that there are many ways to produce a very similar traceback, because of this `_g_log_abort` function in glib. What that basically is - I went and looked at it - is a sort of 'special' logging function which, when called, is intended to log a message and then terminate the app. I think in many cases the same 3 or 4 frames appear in the trace around the `_g_log_abort` call - in this case it's these frames that seem to appear in all the tracebacks:
It would probably be a good idea to get some abrt/libreport and glib folks together to figure out how abrt/libreport can handle these `_g_log_abort` cases (and any other similar cases) better.
For now, I'll try and go through all the "dupes" of this bug and clean them up.
So I've now gone through all the dupes, tried to sort them into groups, and posted a little summary:
https://bugzilla.redhat.com/show_bug.cgi?id=1510059#c89
A couple of bugs cropped up several times. One is https://bugzilla.redhat.com/show_bug.cgi?id=1526164 , identified by an attempt to allocate (exactly) 18446744072098939136 bytes. There's a matching upstream bug report with a patch currently under review: http://bugzilla.gnome.org/show_bug.cgi?id=788908 It'd be great to get that patch reviewed and then backport it for F27 (and F26, if relevant, I'm not 100% sure).
The other one which just kept coming up is this: https://bugzilla.redhat.com/show_bug.cgi?id=1514220 I provided some references to external reports in comment #16, including a Freedesktop.org Wayland issue which suggests the underlying problem is that creating and destroying temporary global objects is inherently racy (AIUI). I found 8 reports of that one, suggesting it could really do with fixing.
Aside from that the most commonly encountered ones seem to be a crash triggered by a bug in the Multi Monitors extension, and one which is *possibly* caused by trying to run the game Kerbal Space Program. There are also several bugs which all ultimately seem to wind up crashing due to "Creating pipes for GWakeup: Too many open files" - like https://bugzilla.redhat.com/show_bug.cgi?id=1516553 - but that one kinda feels like I don't have the whole story figured out yet.
Thanks for any attention you can give to these, folks!
On Thu, 2017-12-14 at 19:41 -0800, Adam Williamson wrote:
On Thu, 2017-12-14 at 13:22 -0800, Adam Williamson wrote:
What I think is going on in several bugs like this is that there are many ways to produce a very similar traceback, because of this `_g_log_abort` function in glib. What that basically is - I went and looked at it - is a sort of 'special' logging function which, when called, is intended to log a message and then terminate the app. I think in many cases the same 3 or 4 frames appear in the trace around the `_g_log_abort` call - in this case it's these frames that seem to appear in all the tracebacks:
It would probably be a good idea to get some abrt/libreport and glib folks together to figure out how abrt/libreport can handle these `_g_log_abort` cases (and any other similar cases) better.
For now, I'll try and go through all the "dupes" of this bug and clean them up.
So I've now gone through all the dupes, tried to sort them into groups, and posted a little summary:
https://bugzilla.redhat.com/show_bug.cgi?id=1510059#c89
A couple of bugs cropped up several times. One is https://bugzilla.redhat.com/show_bug.cgi?id=1526164 , identified by an attempt to allocate (exactly) 18446744072098939136 bytes. There's a matching upstream bug report with a patch currently under review: http://bugzilla.gnome.org/show_bug.cgi?id=788908 It'd be great to get that patch reviewed and then backport it for F27 (and F26, if relevant, I'm not 100% sure).
The other one which just kept coming up is this: https://bugzilla.redhat.com/show_bug.cgi?id=1514220 I provided some references to external reports in comment #16, including a Freedesktop.org Wayland issue which suggests the underlying problem is that creating and destroying temporary global objects is inherently racy (AIUI). I found 8 reports of that one, suggesting it could really do with fixing.
Aside from that the most commonly encountered ones seem to be a crash triggered by a bug in the Multi Monitors extension, and one which is *possibly* caused by trying to run the game Kerbal Space Program. There are also several bugs which all ultimately seem to wind up crashing due to "Creating pipes for GWakeup: Too many open files" - like https://bugzilla.redhat.com/show_bug.cgi?id=1516553 - but that one kinda feels like I don't have the whole story figured out yet.
Thanks for any attention you can give to these, folks!
Sorry, totally forgot the second part of the post:
Meanwhile, I recalled we actually have a RHBZ and a github report for the abrt false-dupe side of things already. I have updated those too:
https://bugzilla.redhat.com/show_bug.cgi?id=1509086 https://github.com/abrt/satyr/pull/264
They've made a change upstream to sort of blacklist (AIUI) _g_log_abort and another function in the commonly-encountered trace, and sent it to Rawhide, but not sent it out as an update for F26 and F27 yet; I asked if they can do that. I also suggested another variant path to _g_log_abort which may indicate another function or two to add to the blacklist.
On 12/14/2017 09:46 PM, Adam Williamson wrote:
Sorry, totally forgot the second part of the post:
Meanwhile, I recalled we actually have a RHBZ and a github report for the abrt false-dupe side of things already. I have updated those too:
https://bugzilla.redhat.com/show_bug.cgi?id=1509086 https://github.com/abrt/satyr/pull/264
They've made a change upstream to sort of blacklist (AIUI) _g_log_abort and another function in the commonly-encountered trace, and sent it to Rawhide, but not sent it out as an update for F26 and F27 yet; I asked if they can do that. I also suggested another variant path to _g_log_abort which may indicate another function or two to add to the blacklist.
Thanks, Adam, for your due diligence and sorting out the mess.
On Fri, Dec 15, 2017 at 9:55 AM, Michael Cronenworth mike@cchtml.com wrote:
On 12/14/2017 09:46 PM, Adam Williamson wrote:
Sorry, totally forgot the second part of the post:
Meanwhile, I recalled we actually have a RHBZ and a github report for the abrt false-dupe side of things already. I have updated those too:
https://bugzilla.redhat.com/show_bug.cgi?id=1509086 https://github.com/abrt/satyr/pull/264
They've made a change upstream to sort of blacklist (AIUI) _g_log_abort and another function in the commonly-encountered trace, and sent it to Rawhide, but not sent it out as an update for F26 and F27 yet; I asked if they can do that. I also suggested another variant path to _g_log_abort which may indicate another function or two to add to the blacklist.
Thanks, Adam, for your due diligence and sorting out the mess.
Yes, indeed. Many thanks for doing this great work, Adam!
I am a little concerned about an uptick in overall gnome-shell crashes as seen at https://retrace.fedoraproject.org/faf/summary/?opsysreleases=123&opsysre... whether or not any one specific problem is to blame.
Note that in the mirror statistics, all three of these releases are about equal, with F27 just overtaking F25 but not F26 yet. But, there's a _ton_ more reports from F27, and more than either when those releases were fresh or when they were at their peak.
On Fri, 2017-12-15 at 08:55 -0600, Michael Cronenworth wrote:
On 12/14/2017 09:46 PM, Adam Williamson wrote:
Sorry, totally forgot the second part of the post:
Meanwhile, I recalled we actually have a RHBZ and a github report for the abrt false-dupe side of things already. I have updated those too:
https://bugzilla.redhat.com/show_bug.cgi?id=1509086 https://github.com/abrt/satyr/pull/264
They've made a change upstream to sort of blacklist (AIUI) _g_log_abort and another function in the commonly-encountered trace, and sent it to Rawhide, but not sent it out as an update for F26 and F27 yet; I asked if they can do that. I also suggested another variant path to _g_log_abort which may indicate another function or two to add to the blacklist.
Thanks, Adam, for your due diligence and sorting out the mess.
No problem. Since I just *love* work, today I had the great idea of looking through all open gnome-shell bugs for other ones reported by abrt which reference _g_log_abort - and surprise surprise, I found several others like this. I think abrt considers the distribution version and the signal to be key attributes of a crash - so it won't consider two bugs as dupes if they're for different releases or with different signals, even if it thinks the traceback is the same. So we get a few of these bugs per release and per signal. So I'm working gradually through all the ones I've found. One big one was this one, for F26:
https://bugzilla.redhat.com/show_bug.cgi?id=1402492
which had 81 dupes. I've got several more to look at next week.
The dupes turned out to be a mix of one-offs, Xwayland crashes without sufficient data to identify them, and a few common cases.
1. One that cropped up a *hell* of a lot was this error:
"Finalizing proxy for an object that's scheduled to be unrooted: Gio.Subprocess"
I came across a few of these in the F25 and F27 bugs, but there were a *lot* of cases reported against F26 around July to September last year. For a while I wasn't duping these bugs as I wasn't sure they were dupes, but there's *so* many with the same error it seems like there probably is some common cause, so I started making them dupes of https://bugzilla.redhat.com/show_bug.cgi?id=1466948 after a while. Even though the reports tailed off after September I'm not sure that means the bug went away - it might just mean that all the people who encountered it and are willing to file bug reports had already filed their report, so they wouldn't create any more new ones. There *are* some more recent reports, like these:
https://bugzilla.redhat.com/show_bug.cgi?id=1514732 https://bugzilla.redhat.com/show_bug.cgi?id=1517479 https://bugzilla.redhat.com/show_bug.cgi?id=1522921 https://bugzilla.redhat.com/show_bug.cgi?id=1523323 https://bugzilla.redhat.com/show_bug.cgi?id=1524066
it'd be great if anyone could figure out what the hell's going on there.
2. There were also quite a lot more cases of "invalid global wl_output", which I duped against https://bugzilla.redhat.com/show_bug.cgi?id=1514220 - it'd be good if someone could apply the proposed fix for that to F26 as well as F27, since it's clearly affecting F26 users too.
3. There were several cases of Xwayland crashes caused by two variants on an intel driver error "intel_do_flush_locked failed", which I duped against https://bugzilla.redhat.com/show_bug.cgi?id=1525693 and https://bugzilla.redhat.com/show_bug.cgi?id=1515473 .
4. There were several cases of errors along the lines of "Settings schema (SOMESCHEMA) does not contain a key named (SOMEKEY)". I *think* these are mostly misbehaving extensions, but I did wonder if an error like that really has to cause the entire shell to crash? https://bugzilla.redhat.com/show_bug.cgi?id=1493377 is one example of this type of error, there were several others for different extensions.
5. There were quite a few bugs which were basically cases where the Shell or Xwayland crashed due to exhaustion of open file descriptors. These seem to fall into a few different buckets. There's one which Ray believes is caused by the Shell leaking timerfds on background change: that's https://bugzilla.redhat.com/show_bug.cgi?id=1516553 , and he's sent out a proposed fix. There were a couple of others that seem related to backgrounds, but had large numbers of fds open that were actually related to wallpaper files and directories; I'm starting to suspect these are caused by the "Random Walls" extension - https://github.com/rodakorn/randwall - and am working with the reporters to confirm that. There were a couple of reports where most of the open FDs were for /sys/class/power_supply ; we tentatively suspect that may be a leak in upower and have assigned https://bugzilla.redhat.com/show_bug.cgi?id=1513492 to track that. And there were a few reports where most of the open FDs were some sort of pipe: https://bugzilla.redhat.com/show_bug.cgi?id=1497104 , https://bugzilla.redhat.com/show_bug.cgi?id=1499405 and https://bugzilla.redhat.com/show_bug.cgi?id=1497857 . Finally, in https://bugzilla.redhat.com/show_bug.cgi?id=1487826 , most of the open FDs appear to be some kind of *socket*.
6. The parent bugs for both F25 and F26 turned out to be interesting cases in their own right. The error in the F25 parent, https://bugzilla.redhat.com/show_bug.cgi?id=1370073 , is "toggling down object GSettings that's already queued to toggle up", which I came across a few more reports of for F26 later: #1484728 , #1492312 and #1502171 . kparal also apparently hit that one, per https://bugzilla.redhat.com/show_bug.cgi?id=1402492#c21 .
7. The error in the F26 parent, https://bugzilla.redhat.com/show_bug.cgi?id=1402492 , is "Error in freeze/thaw accounting" (an error condition in mutter), and again I found another case of that later, https://bugzilla.redhat.com/show_bug.cgi?id=1497447 . Also the first commenter really *did* have the same bug as the OP there (piio and Igor Gnatenko).
I'll carry on working through all of these tomorrow :) Thanks folks!
desktop@lists.fedoraproject.org