https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.
== Summary ==
Fedora will add -fno-omit-frame-pointer to the default C/C++ compilation flags, which will improve the effectiveness of profiling and debugging tools.
== Owner ==
* Name: [[User:daandemeyer| Daan De Meyer]], [[User:Dcavalca| Davide Cavalca]], [[ Andrii Nakryiko]]
* Email: daandemeyer@fb.com, dcavalca@fb.com, andriin@fb.com
== Detailed Description ==
Credit to Mirek Klimos (myreggg@gmail.com), whose internal note on stack trace unwinding formed the basis for this change proposal.
Any performance or efficiency work relies on accurate profiling data. Sampling profilers probe the target program's call stack at regular intervals and store the stack traces. If we collect enough of them, we can closely approximate the real cost of a library or function with minimal runtime overhead.
A stack trace captures what is running on a thread. It should start with clone (if the thread was created via the clone syscall) or with _start (if it is the main thread of the process). The last function in the stack trace is the code the CPU is currently executing. If a stack trace starts with [unknown] or any other symbol, it is incomplete.
=== Unwinding ===
How does the profiler get the list of function names? There are two parts to it:
# Unwinding the stack - getting a list of virtual addresses pointing to the executable code.
# Symbolization - translating virtual addresses into human-readable information, like the function name, inlined functions at the address, or the file name and line number.
Unwinding is what we're interested in for the purpose of this proposal. The important things are:
* Data on the stack is split into frames, each frame belonging to one function.
* Right before each function call, the return address is put on the stack. This is the instruction address in the caller to which we will eventually return, and that's what we care about.
* One register, called the "frame pointer" or "base pointer" register (RBP on x86-64), is traditionally used to point to the beginning of the current frame. Every function should back up RBP onto the stack and set it properly at the very beginning.
The “frame pointer” part is achieved by adding push %rbp, mov %rsp,%rbp to the beginning of every function and by adding pop %rbp before returning. Using this knowledge, stack unwinding boils down to traversing a linked list:
https://i.imgur.com/P6pFdPD.png
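As a rough, editorial illustration of that linked-list walk (not part of the proposal), the C sketch below follows the saved RBP values of its own stack. It assumes x86-64 and GCC or Clang, and only works for code built with frame pointers; the function names are made up for the example.

<pre>
/* walk.c - build with:
 *   gcc -O2 -fno-omit-frame-pointer -fno-optimize-sibling-calls walk.c -o walk
 * (-fno-optimize-sibling-calls keeps the intermediate frames on the stack)
 */
#include <stdio.h>

/* With frame pointers, each frame starts with:
 *   [rbp]     saved RBP of the caller (the "next" pointer of the list)
 *   [rbp + 8] return address into the caller
 */
struct frame {
    struct frame *next;
    void         *ret;
};

__attribute__((noinline)) static void print_frames(void)
{
    /* __builtin_frame_address(0) is this function's RBP value. */
    struct frame *fp = (struct frame *)__builtin_frame_address(0);
    int depth = 0;

    /* A real unwinder validates that fp stays inside the thread's stack;
     * if any caller was built without frame pointers the chain is broken
     * or incomplete, which is exactly the problem this Change addresses. */
    while (fp && fp->ret && depth++ < 16) {
        printf("return address: %p\n", fp->ret);
        fp = fp->next;          /* follow the chain towards _start/clone */
    }
}

__attribute__((noinline)) static void leaf(void)  { print_frames(); }
__attribute__((noinline)) static void child(void) { leaf(); }

int main(void)
{
    child();
    return 0;
}
</pre>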
=== Where’s the catch? ===
The frame pointer register is not necessary to run a compiled binary. It makes it easy to unwind the stack, and some debugging tools rely on frame pointers, but the compiler knows how much data it put on the stack, so it can generate code that doesn't need the RBP. Not using the frame pointer register can make a program more efficient:
* We don't need to back up the value of the register onto the stack, which saves 3 instructions per function.
* We can treat RBP as a general-purpose register and use it for something else.
Whether the compiler sets up the frame pointer is controlled by the -fomit-frame-pointer flag, and at the optimization levels used for package builds the default is to omit it, meaning we can't use this method of stack unwinding by default.
To make it possible to rely on the frame pointer being available, we'll add -fno-omit-frame-pointer to the default C/C++ compilation flags. This instructs the compiler to always maintain the frame pointer, which in turn allows profiling tools to provide accurate performance data that can drive performance improvements in core libraries and executables.
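As a concrete illustration (an editorial sketch, assuming x86-64 and GCC; the exact instruction sequences vary), the effect of the flag can be seen by compiling a small non-leaf function both ways and comparing the generated assembly:

<pre>
/* frame.c - compare the generated code:
 *   gcc -O2 -S -fno-omit-frame-pointer frame.c -o with-fp.s
 *   gcc -O2 -S -fomit-frame-pointer    frame.c -o without-fp.s
 *
 * With -fno-omit-frame-pointer, callee() begins with roughly
 *   pushq %rbp
 *   movq  %rsp, %rbp
 * and restores %rbp with popq before returning. With -fomit-frame-pointer
 * those instructions disappear and %rbp is free for other uses.
 */
long helper(long x);   /* defined elsewhere; keeps callee() from being a leaf */

long callee(long x)
{
    return helper(x * 3) + 1;
}
</pre>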
== Feedback ==
=== Potential performance impact ===
* Meta builds all its libraries and executables with -fno-omit-frame-pointer by default. Internal benchmarks did not show a significant impact on performance when omitting the frame pointer for two of our most performance-intensive applications.
* Firefox recently landed a change to preserve the frame pointer in all jitted code (https://bugzilla.mozilla.org/show_bug.cgi?id=1426134). No significant decrease in performance was observed.
* Kernel 4.8 frame pointer benchmarks by Suse showed 5%-10% regressions in some benchmarks (https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u).
Should individual libraries or executables see a significant performance degradation caused by including the frame pointer everywhere, those packages can opt out on an individual basis as described in https://docs.fedoraproject.org/en-US/packaging-guidelines/#_compiler_flags.
=== Alternatives to frame pointers ===
There are a few alternative ways to unwind stacks instead of using the frame pointer:
* [https://dwarfstd.org DWARF] data - The compiler can emit extra information that allows us to find the beginning of the frame without the frame pointer, which means we can walk the stack exactly as before. The problem is that the stack needs to be unwound in kernel space, and the kernel does not implement a DWARF unwinder. Given that the kernel implemented its own format (ORC) instead of using DWARF, it's unlikely that we'll see a DWARF unwinder in the kernel any time soon. The perf tool allows you to use the DWARF data with --call-graph=dwarf, but this means that it copies the full stack on every event and unwinds in user space, which has very high overhead (see the sketch after this list).
* [https://www.kernel.org/doc/html/v5.3/x86/orc-unwinder.html ORC] (undwarf) - Problems with unwinding in the kernel led to the creation of another format with the same purpose as DWARF, just much simpler. This can only be used to unwind kernel stack traces; it doesn't help us with userspace stacks. More information on ORC can be found [https://lwn.net/Articles/728339 here].
* [https://lwn.net/Articles/680985 LBR] - Newer Intel CPUs have a feature that gives you source and target addresses for the last 16 (or 32, in newer CPUs) branches with no overhead. It can be configured to record only function calls and to be used as a stack, which means it can be used to get the stack trace. Sadly, you only get the last X calls, not the full stack trace, so the data can be very incomplete. On top of that, many Fedora users might still be using CPUs without LBR support, which means we wouldn't be able to assume working profilers on a Fedora system by default.
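The following is an editorial sketch of that user-space unwinding approach, assuming libunwind is installed (perf's own unwinder is separate code): it walks the calling thread's stack using DWARF/CFI data instead of frame pointers. perf --call-graph=dwarf does something comparable, except that the kernel first has to copy a snapshot of the sampled user stack into the perf buffer, and the unwinding happens later in user space, which is where the overhead comes from.

<pre>
/* unwind.c - build with: gcc unwind.c -lunwind -o unwind */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

void show_backtrace(void)
{
    unw_context_t ctx;
    unw_cursor_t cursor;
    unw_word_t ip, off;
    char name[128];

    unw_getcontext(&ctx);            /* capture the current register state */
    unw_init_local(&cursor, &ctx);   /* unwind the calling thread only */

    while (unw_step(&cursor) > 0) {  /* step one frame towards _start */
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        if (unw_get_proc_name(&cursor, name, sizeof(name), &off) == 0)
            printf("%#lx  %s+%#lx\n", (unsigned long)ip, name, (unsigned long)off);
        else
            printf("%#lx  [unknown]\n", (unsigned long)ip);
    }
}

int main(void)
{
    show_backtrace();
    return 0;
}
</pre>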
To summarize, if we want complete stacks with reasonably low overhead (which we do, there's no other way to get accurate profiling data from running services), frame pointers are currently the best option.
== Benefit to Fedora ==
Implementing this change will provide profiling tools with easy access to stacktraces of installed libraries and executables which will lead to more accurate profiling data in general. This in turn can be used to implement optimizations to core libraries and executables which will improve the overall performance of Fedora itself and the wider Linux ecosystem.
Various debugging tools can also make use of the frame pointer to access the current stack trace, although tools like gdb can already do this to some degree via embedded DWARF debugging info.
== Scope ==
* Proposal owners: Put up a PR to change the rpm macros so that packages are built with -fno-omit-frame-pointer by default.
* Other developers: Review and merge the PR implementing the Change.
* Release engineering: [https://pagure.io/releng/issues #Releng issue number]. A mass rebuild is required.
* Policies and guidelines: N/A (not needed for this Change)
* Trademark approval: N/A (not needed for this Change)
* Alignment with Objectives: N/A
== Upgrade/compatibility impact ==
This should not impact upgrades in any way.
== How To Test ==
# Build the package with the updated rpm macros.
# Profile the binary with `perf record -g <binary>`.
# Inspect the perf data with `perf report -g 'graph,0.5,caller'`.
# When expanding hot functions in the perf report, perf should show the full call graph of the hot function (at least for all functions that are part of the binary compiled with -fno-omit-frame-pointer). A small stand-alone test program is sketched below.
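The sketch below is an editorial example only (the file and function names are made up); built with the flags above, `perf report` should show main -> hot_path -> accumulate as a complete chain.

<pre>
/* hot.c - tiny CPU-bound workload for testing call-graph capture:
 *   gcc -O2 -fno-omit-frame-pointer -o hot hot.c
 *   perf record -g ./hot
 *   perf report -g 'graph,0.5,caller'
 */
#include <stdio.h>

__attribute__((noinline)) static unsigned long accumulate(unsigned long n)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i * i;                 /* burn CPU so the sampler sees this frame */
    return sum;
}

__attribute__((noinline)) static unsigned long hot_path(void)
{
    unsigned long total = 0;
    for (int i = 0; i < 2000; i++)
        total += accumulate(1000000);
    return total;
}

int main(void)
{
    printf("%lu\n", hot_path());      /* print the result so nothing is optimized away */
    return 0;
}
</pre>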
== User Experience ==
Fedora users will be more likely to have a streamlined experience when trying to debug/profile system executables/libraries. Tools such as perf will work out of the box instead of requiring users to provide extra options (e.g. --call-graph=dwarf/LBR) or requiring users to recompile all relevant packages with -fno-omit-frame-pointer.
== Dependencies ==
The rpm macros for Fedora need to be adjusted to include -fno-omit-frame-pointer in the default C/C++ compilation flags.
== Contingency Plan ==
* Contingency mechanism: The new version can be released without every package being rebuilt with -fno-omit-frame-pointer. Profiling will only work perfectly once all packages have been rebuilt, but there will be no regression in behavior if not all packages have been rebuilt by the time of the release. If the Change is found to introduce unacceptable regressions, the PR implementing it can be reverted and affected packages can be rebuilt.
* Contingency deadline: Final freeze
* Blocks release? No
== Documentation ==
* Original proposal for in-kernel DWARF unwinder (rejected): https://lkml.org/lkml/2017/5/5/571
== Release Notes ==
Packages are now compiled with frame pointers included by default. This will enable a variety of profiling and debugging tools to show more information out of the box.
https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer [...] === Unwinding === [...]
- Kernel 4.8 frame pointer benchmarks by Suse showed 5%-10% regressions in some benchmarks (https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u)
Regressions of such magnitude can veto such changes, especially when they hit everyone, not just those who are highly dependent on the profiling tools the proposal is concerned about.
=== Alternatives to frame pointers ===
There are a few alternative ways to unwind stacks instead of using the frame pointer:
- [https://dwarfstd.org DWARF] data - The compiler can emit extra
[...] The problem is that we need to unwind the stack in kernel space
(Are you referring to a novel kernel-resident tool?)
To summarize, if we want complete stacks with reasonably low overhead (which we do, there's no other way to get accurate profiling data from running services), frame pointers are currently the best option.
The proposal doesn't characterize the "reasonably low overhead" that this operation targets. That makes it hard to judge the tradeoffs.
[...] Fedora users will be more likely to have a streamlined experience when trying to debug/profile system executables/libraries. Tools such as perf will work out of the box instead of requiring to users to provide extra options (e.g. --call-graph=dwarf/LBR) [...]
If typing that option were a hardship, it could be made default on Fedora. With broad debuginfod auto-downloading capability, maybe it's worth considering.
- FChE
On Thu, Jun 16, 2022 at 07:13:55PM -0400, Frank Ch. Eigler wrote:
Regressions of such magnitude can veto such changes, especially when they hit everyone, not just those who are highly dependent on the profiling tools the proposal is concerned about.
I can only concur. Say what you want about Phoronix benchmarks, but they consistently benchmark different distributions, and Fedora is consistently lagging behind. The latest article is at https://www.phoronix.com/scan.php?page=article&item=h1-2022-linux
Slowing Fedora even further is really undesirable.
On 6/17/22 01:41, Tomasz Torcz wrote:
I can only concur. Say what you want about Phoronix benchmarks, but they consistently benchmark different distributions, and Fedora is consistently lagging behind. The latest article is at https://www.phoronix.com/scan.php?page=article&item=h1-2022-linux
Slowing Fedora even further is really undesirable.
What would it take to make Fedora faster?
On 6/17/22 12:15, Demi Marie Obenour wrote:
What would it take to make Fedora faster?
If you go back a few years, it was pretty common for Fedora to perform below average in these benchmarks, and that really isn't the case any more. The top performing systems in these benchmarks are Clear Linux and RHEL rebuilds (which, in this context, I think we can probably just treat as a proxy for RHEL performance.) Clear and RHEL (rebuilds) probably get most of their advantages from building for an x86_64-v2 microarchitecture, which Fedora discussed and rejected last year (after discussing and rejecting a proposal to build for x86_64-v3 two years before that.) If you exclude Clear and RHEL (rebuilds), Fedora's showing in the Phoronix benchmarks above is, subjectively, pretty good. So, I don't think that Tomasz's claim that Fedora is consistently lagging behind is true for the last couple of years, though it had been in the past. (It is also probably true that when Fedora Workstation is tested instead of Fedora Server, btrfs impacts some benchmarks.)
To Demi's question, though, I would venture a guess that building glibc (and possibly some other libraries) for more modern microarchitectures and shipping that support in hwcaps would probably be a big step forward, at the cost of some disk space. It was mentioned in Neal's x86_64-v2 thread, but that discussion didn't seem to go anywhere. Building the whole OS for a more modern microarchitecture would probably also help, at the cost of compatibility with older hardware, and that doesn't seem like a trade-off Fedora is willing to consider today.
https://docs.01.org/clearlinux/latest/reference/system-requirements.html#id1
https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-li...
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...
On 6/19/22 16:36, Gordon Messmer wrote:
To Demi's question, though, I would venture a guess that building glibc (and possibly some other libraries) for more modern microarchitectures and shipping that support in hwcaps would probably be a big step forward, at the cost of some disk space. It was mentioned in Neal's x86_64-v2 thread, but that discussion didn't seem to go anywhere. Building the whole OS for a more modern microarchitecture would probably also help, at the cost of compatibility with older hardware, and that doesn't seem like a trade-off Fedora is willing to consider today.
Are there any cases where there is no trade-off? For instance, anything that requires KVM or Xen (except Xen PV) requires hardware virtualization support in the CPU, so it can be built under the assumption that such support is present. Therefore, it only needs to support CPUs that are new enough to have such support. Another option would be to exclude CPUs that are no longer receiving microcode updates for security vulnerabilities; this is especially true for virtualization packages.
On Sun, Jun 19, 2022 at 01:36:22PM -0700, Gordon Messmer wrote:
If you go back a few years, it was pretty common for Fedora to perform below average in these benchmarks, and that really isn't the case any more. The top performing systems in these benchmarks are Clear Linux and RHEL rebuilds (which, in this context, I think we can probably just treat as a proxy for RHEL performance.) Clear and RHEL (rebuilds) probably get most of their advantages from building for an x86_64-v2 microarchitecture
Actually, in the cases in the past where I looked at Phoronix benchmarks, Clear got most of its performance advantage from defaulting to "Performance" setting of the CPU, while almost everyone else defaults to "Balanced". "Performance" makes sense pretty much only if you're benchmarking or showing off to friends; otherwise "Balanced" is a much more reasonable way to use the hardware. But anyway, we now have a drop-down menu for this at least under GNOME, just click "Performance" and get the same boost :)
And I'd take the results for RHEL + downstreams with a grain of salt too. In particular, CentOS Stream and AlmaLinux get opposite places in various benchmarks, which doesn't fit the hypothesis above well…
To Demi's question, though, I would venture a guess that building glibc (and possibly some other libraries) for more modern microarchitectures and shipping that support in hwcaps would probably be a big step forward, at the cost of some disk space. It was mentioned in Neal's x86_64-v2 thread, but that discussion didn't seem to go anywhere. Building the whole OS for a more modern microarchitecture would probably also help, at the cost of compatibility with older hardware, and that doesn't seem like a trade-off Fedora is willing to consider today.
I'd like to see benchmarks before accepting this as a fact.
Zbyszek
On 6/20/22 13:17, Zbigniew Jędrzejewski-Szmek wrote:
Actually, in the cases in the past where I looked at Phoronix benchmarks, Clear got most of its performance advantage from defaulting to "Performance" setting of the CPU, while almost everyone else defaults to "Balanced".
I'd be surprised if that were true, because a system in "balanced" mode should scale its CPU frequency up when the system is busy. For systems that aren't constantly busy, the "balanced" mode will usually result in some jitter in response time depending on what state a given core is in when the request is made, while a "performance" mode system should be more predictable. But for a heavily loaded system, there shouldn't be much difference.
But we don't have to speculate about that too much. Phoronix publishes its benchmark suite as a container image and since we're interested in how the power management affects performance in this case rather than how the supporting OS configuration affects the tools, we can use that container image.
$ podman run --rm -it phoronix/pts
Within the container, I can run "benchmark system/compress-zstd" to run a processor-focused benchmark where Clear and Fedora differed significantly under a "balanced" config and a "performance" config and observe the results. In Phoronix's test, the result for "Zstd Compression Compression Level: 3, Long Mode - Compression Speed" under Clear Linux was 1192.4 MB/s, and under Fedora it was 292.7 MB/s. On my Fedora system, the results are statistically indistinguishable when I run that benchmark under "balanced" and under "performance." I get 410.5 MB/s under a balanced profile and 410.7 under performance.
That does not prove that the CPU micro-architecture is the most significant difference between the two, but it does strongly suggest that the CPU power settings are not a significant factor in these differences.
And I'd take the results for RHEL + downstreams with a grain of salt too. In particular, CentOS Stream and AlmaLinux get opposite places in various benchmarks, which doesn't fit the hypothesis above well…
Alma is release 8 and CentOS Stream is release 9. I am not surprised that there are some benchmarks where they differ. I would, however, be fairly surprised if benchmarks showed significant differences between RHEL, Alma, and CentOS Stream 8, or RHEL, Alma, and CentOS Stream 9.
I'd like to see benchmarks before accepting this as a fact.
Aren't we looking at them? :)
On Mon, 20 Jun 2022 at 19:31, Gordon Messmer gordon.messmer@gmail.com wrote:
Alma is release 8 and CentOS Stream is release 9. I am not surprised that there are some benchmarks where they differ. I would, however, be fairly surprised if benchmarks showed significant differences between RHEL, Alma, and CentOS Stream 8, or RHEL, Alma, and CentOS Stream 9.
I'd like to see benchmarks before accepting this as a fact.
Aren't we looking at them? :)
I expect the words are: I would like to see benchmarks we can agree are useful, done by a 'trusted' third party rather than that site (or at least one that does apples-to-apples comparisons of Alma 9 vs CS9 vs Fedora 34, Alma 8 vs CS8 vs Fedora 29, etc.).
Clear Linux also seems to carry various out of band patches to the kernel, systemd and other places to speed things up so I expect it is more than just throwing the CPU into performance mode.
On 6/21/22 05:10, Stephen Smoogen wrote:
I am expecting the words are: I would like to see benchmarks we can agree are useful being done by a 'trusted' third party versus a site (or at least does Apples to Apples comparisons of Alma 9 vs CS9 vs Fedora 34 and Alma8 vs CS8 vs Fedora 29 etc ).
I'm being a *little* glib in reply to a comment that attributes the performance difference between Clear and Fedora as merely the power management configuration without presenting evidence, but demands benchmarks before considering that they might be the result of the targeted CPU microarchitecture. There's a certain humorous irony, I think, in demanding benchmarks in the context of a discussion about benchmarks that include Fedora Server 36 and CS 9.
To be less glib: Of course I think that benchmarks are warranted and interesting. In particular, while we do have a distro (CS 9) entirely built for x86_64-v2, the Fedora community has been clear that they are not willing to consider that step yet. We don't have an example of a distro that uses glibc hwcaps for more targeted optimization, and I think we would naturally want that before discussing whether Fedora should ship additional optimized libraries. But while there is an open question of whether hwcaps would deliver the benefits of targeting a newer microarchitecture, I think we should acknowledge that there is relatively good evidence that those benefits exist.
Clear Linux also seems to carry various out of band patches to the kernel, systemd and other places to speed things up so I expect it is more than just throwing the CPU into performance mode.
https://github.com/clearlinux-pkgs/linux
https://docs.01.org/clearlinux/latest/guides/clear/performance.html
I've read their kernel patch set in the past, and at the time the vast majority of patches weren't performance focused at all, they looked like they were just cherry-picked fixes that weren't yet in a current release. They also describe patches to glibc, llvm, and gcc that upstream developers haven't accepted yet.
But there's a reason that I think most of that is irrelevant. Clear Linux and RHEL 9 are both built for a CPU microarchitecture that includes SSE2, and both of those (if we accept CS 9 as a substitute for RHEL in this context) have relatively similar results in the benchmarks on Phoronix. CentOS Stream doesn't have most of the other patches present in Clear Linux, which suggests that the CPU microarchitecture is the most likely factor in their performance advantage. I think that suggests that we should try to find out how much of that advantage we can deliver in Fedora with optimized libraries and glibc hwcaps.
On Sun, Jun 19, 2022 at 01:36:22PM -0700, Gordon Messmer wrote:
can probably just treat as a proxy for RHEL performance.) Clear and RHEL (rebuilds) probably get most of their advantages from building for an x86_64-v2 microarchitecture, which Fedora discussed and rejected last year (after discussing and rejecting a proposal to build for x86_64-v3 two years before that.) If you exclude Clear
Phoronix credits this to those distros shipping with P-state Performance by default. In order to figure out what's really the best there as a default for users (rather than benchmarks), I think we'd need to do some significant testing with real-world workloads for latency, throughput, and power consumption. (It might be something where we'd want a different default for Fedora Workstation than for Server or IoT...)
On Tue, Jun 21, 2022 at 3:11 PM Matthew Miller mattdm@fedoraproject.org wrote:
Phoronix credits this to those distros shipping with P-state Performance by default. In order to figure out what's really the best there as a default for users (rather than benchmarks), I think we'd need to do some significant testing with real-world workloads for latency, throughput, and power consumption. (It might be something where we'd want a different default for Fedora Workstation than for Server or IoT...)
P-state can help in some areas, though the new AMD P-state bits have so far shown to be both slower and worse on power usage. I do expect that will change soon enough. These are things we do keep an eye on to some extent, though I don't have an army of people dedicated to benchmarking rounds with various kernels. It might be an interesting experiment at some point. Overall Fedora does willingly sacrifice a little bit of performance for power savings, and (unfortunately) for security, though almost all of these settings can be overridden with the kernel command line. Again, it might be worth a write-up at some point of what kind of performance gain on average one can expect by changing those defaults with command line options, and what you are giving up by doing so. "This feature adds an average 5% overall performance on CPU family X, but sacrifices 11% of battery life", that type of thing. But it would take a reasonable effort to coordinate that, verify it across CPU families, etc. I say unfortunately on the security part because those mitigations didn't give us much choice. They were not some well-designed system to improve security, with options put forth and time to fine-tune like several of our security features were. They were harsh workarounds for shortcomings of hardware. They have been optimized over time, but there is only so much that can be done. Does every user need all of these turned on? Probably not, but it is better to default to safety until a user can determine whether their particular environment needs a particular mitigation or not.
On 6/21/22 13:10, Matthew Miller wrote:
Phoronix credits this to those distros shipping with P-state Performance by default.
Yes, but I doubt that for several reasons: First, it's a claim without evidence. That setting isn't the only difference between any two systems tested. Second, the claim doesn't make any *sense*. Systems with intel_pstate balanced aren't supposed to be noticeably slower for sustained CPU intensive workloads. The intel_pstate driver is supposed to scale the frequency up under load in the "balanced" configuration, delivering performance when it is needed and power saving when it isn't. Third, I can run their tests on my own system in an intel_pstate performance mode and an intel_pstate balanced mode, and the test results are nearly identical, which is the expected outcome.
If no one else does any testing sooner, I'll try to build an x86_64-v2 glibc and run some basic benchmarks when I take some time off of my day job at the beginning of July.
On 6/21/22 21:40, Gordon Messmer wrote:
I did some work this week to see if I could learn anything from Phoronix's article [1], and came up pretty much dry. I cannot replicate any of the differences that I would expect to be able to. More than anything else, their results look like evidence of a bug in the Xeon Platinum 8380.
In retrospect, the first thing that should have stood out to me when I looked at this ~ 3 weeks ago (but which I missed) was that if I pull the phoronix/pts container image and "run pts/compress-zstd" with "Compression Level: 3, Long Mode", I get better results on my XPS 13 laptop than they did on their Xeon. And, while cpubenchmark.net does suggest that my i5 CPU [2] has better single-core test results than the Xeon [3], the zstd test should not be limited to a single core. On my laptop, top reported the zstd process typically using ~400% CPU time.
The first thing I tried to reproduce was a difference between "performance" and "powersave" settings in the intel_pstate cpufreq driver. I used the zstd compression test on my only Intel CPU, which is in my XPS 13 laptop. In the default Fedora WS configuration, scaling_driver is intel_pstate, scaling_governor is powersave, and (I believe) energy_performance_preference is balance_performance. In that configuration, typical values for scaling_cur_freq were significantly lower than typical values after changing energy_performance_preference to performance, and scaling_governor to performance. So on this laptop, I'm confident that the governor and EPP settings are behaving as expected. But zstd benchmark results are essentially indistinguishable when running in one mode vs the other, because the powersave mode for the intel_pstate driver will scale CPU speed up on demand.
In addition, phoronix has benchmarked Intel systems in the past [4] to determine the effective difference between the intel_pstate powersave and performance modes, and found minimal differences on an Intel i9 CPU.
I also tested the svt-av1 benchmark on this system in both modes, as this was another CPU bound test that Phoronix reported as a significant difference and attributed to the P-State governor setting. Again, I saw no significant difference between performance and powersave results.
All of this suggests that the Xeon was simply not scaling up for these tests. Given its large number of cores, perhaps the benchmarks weren't putting enough load on the system to trigger scaling up. Or (as a matter of *wild* speculation) maybe it was scaling up some cores, but Linux was shuffling tasks between cores and "missing" the fast ones. Whatever the case, the big differences between distributions reported by Phoronix are probably limited to this class of CPUs.
If this is *normal* behavior for those CPUs, then maybe the Fedora Server group would want to change the default governor, or emphasize the importance of the CPU governor selection in their documentation.
I also ran benchmarks on CentOS Stream 9 and Fedora Server 36, each installed in a VM under CentOS Stream 9 libvirt, running on a host with an AMD Ryzen 5 CPU [5], with the CPU configuration copied to the guests. As VMs, these would not apply any cpufreq management of their own, and if there were any differences resulting from the CPU architecture target, they should be apparent in these tests. Test results for these VMs were 20-40% better than the Xeon's best results, but results under the CentOS Stream 9 VM were essentially the same as results under the Fedora Server 36 VM. It's probably still interesting to run the full suite and see if any other tests do have significant differences, and I'll try to do that later.
I think that's enough to convince me that I was wrong to doubt that the intel_pstate configuration was the reason that these results differed, although I still believe that if that is the case, then the CPU's internal pstate selection is broken.
1: https://www.phoronix.com/scan.php?page=article&item=h1-2022-linux&nu...
2: https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i5-1135G7+%40+2.40GHz&am...
3: https://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+Platinum+8380+%40+2.30GH...
4: https://www.phoronix.com/scan.php?page=article&item=linux50-pstate-cpufr...
5: https://www.cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+5+5600X&id=3859
On 09/07/2022 22:32, Gordon Messmer wrote:
I cannot replicate any of the differences that I would expect to be able to. More than anything else, their results look like evidence of a bug in the Xeon Platinum 8380.
Have you rebuilt all system packages with -fno-omit-frame-pointer or just tested packages?
On 7/10/22 04:38, Vitaly Zaitsev via devel wrote:
Have you rebuilt all system packages with -fno-omit-frame-pointer or just tested packages?
No. Early in the thread, Tomasz Torcz posted a link to a Phoronix article as evidence that Fedora's performance was behind other distributions. At one point, I expressed doubt that intel_pstate in performance mode was the reason for some significant differences, so I ran some tests to try to either prove or disprove that explanation. Having done so, I don't think the results that Phoronix published are useful as evidence of anything.
On 10/07/2022 19:36, Gordon Messmer wrote:
No. Early in the thread, Tomasz Torcz posted a link to a Phoronix article as evidence that Fedora's performance was behind other distributions.
All packages in the dependency tree must be rebuilt with -fno-omit-frame-pointer flag, otherwise the tests will show the weather on Mars.
On 7/10/22 10:55, Vitaly Zaitsev via devel wrote:
All packages in the dependency tree must be rebuilt with -fno-omit-frame-pointer flag, otherwise the tests will show the weather on Mars.
Right, I understand that. I was not attempting to determine the effect of the -fno-omit-frame-pointer flag, and neither was Tomasz. I believe Tomasz's point was that Fedora should not consider any changes that would negatively impact performance, because existing benchmarks indicate that Fedora is slower than other distributions. I think that the benchmarks that Tomasz used as evidence do not support that point of view.
On 7/10/2022 11:36 AM, Gordon Messmer wrote:
No. Early in the thread, Tomasz Torcz posted a link to a Phoronix article as evidence that Fedora's performance was behind other distributions. At one point, I expressed doubt that intel_pstate in performance mode was the reason for some significant differences, so I ran some tests to try to either prove or disprove that explanation. Having done so, I don't think the results that Phoronix published are useful as evidence of anything.
Phoronix is notorious for poor benchmarking methodology and trying to draw any conclusions from their data is pointless as a result.
Jeff
On So, 2022-07-10 at 10:36 -0700, Gordon Messmer wrote:
No. Early in the thread, Tomasz Torcz posted a link to a Phoronix article as evidence that Fedora's performance was behind other distributions. At one point, I expressed doubt that intel_pstate in performance mode was the reason for some significant differences, so I ran some tests to try to either prove or disprove that explanation. Having done so, I don't think the results that Phoronix published are useful as evidence of anything.
it is well known(TM) that ceph performance suffers when not running with "performance" governor.
see e.g. this blog (CTRL-F "powersave"):
https://croit.io/blog/ceph-performance-test-and-optimization
where IOPS roughly quadruple when the power-saving features are disabled.
see also the holy bible of ceph performance tuning:
https://yourcmc.ru/wiki/Ceph_performance#CPUs
but the internet is full of benchmarks and real workload scenarios which show that the CPU performance governor has a real impact on highly concurrent workloads.
this might not apply to workstations, but it indeed does apply to datacenter workloads, like SDS.
On Fri, Jun 17, 2022 at 12:14 AM Frank Ch. Eigler fche@redhat.com wrote:
- Kernel 4.8 frame pointer benchmarks by Suse showed 5%-10% regressions in some benchmarks (https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u)
Regressions of such magnitude can veto such changes, especially when they hit everyone, not just those who are highly dependent on the profiling tools the proposal is concerned about.
The 4.8 kernel is comparatively ancient now and a lot has changed in this area of the kernel. I think we need benchmarks on a more modern kernel with gcc12 to see if any of this has since been mitigated before we rule out the change in its entirety.
Regressions of such magnitude can veto such changes, especially when they hit everyone, not just those who are highly dependent on the profiling tools the proposal is concerned about.
The kernel benchmarks were added as an example of openly available data we could find on the potential impact of frame pointers. Note that the email from Mel Gorman is all we have to go on. Unfortunately the original data from the benchmarks is gone so I can't try to reproduce them. I've emailed Mel to see if he still has the benchmarks stored somewhere so we can perhaps try to reproduce the results.
I've added a clarification to the change proposal that we don't intend to actually compile the kernel with frame pointers, since the kernel is already built with ORC support and this works well so there's nothing to really be gained by building the kernel with frame pointers. That means we won't see the kernel regressions that were reported by the Suse benchmarks.
Unfortunately, there are no readily available benchmarks that I've been able to find that would show the exact impact of frame pointers on common Fedora workflows. The Phoronix benchmark suite could be used, but that would imply doing a mass rebuild with frame pointers before we could actually run it and measure the impact.
Also, as mentioned in the proposal, all our internal services at Meta are built with frame pointers enabled. We did canaries a few years ago on some of our most CPU intensive services to see if it would make sense to build them without frame pointers, and found that there were no significant enough wins to be had to justify the loss in continuous profiling data caused by building without frame pointers
(Are you referring to a novel kernel-resident tool?)
Unfortunately, no, there's no in-kernel DWARF unwinder due to the complexity involved. Instead, the kernel uses ORC and has an unwinder for that. Adding ORC support to all of Linux userspace so that we can unwind it in the kernel isn't likely to happen, since all tooling would have to be changed to support ORC.
The proposal doesn't characterize the "reasonably low overhead" that this operation targets. That makes it hard to judge the tradeoffs.
Characterizing the impact would mean rebuilding most of the distro with frame pointers and running a comprehensive benchmark suite on it. Doing this will be a rather involved process. If you know of any other representative benchmark suites that we could run that wouldn't require rebuilding most of the distro, we could look into running these with and without frame pointers to measure the impact.
If typing that option were a hardship, it could be made default on Fedora. With broad debuginfod auto-downloading capability, maybe it's worth considering.
The issue with DWARF isn't that we have to add an extra option to perf, it's that without an in kernel DWARF unwinder (which is very unlikely to ever happen as discussed above), it's expensive to use DWARF for stacktrace unwinding, as we have to copy the entire stack and unwind it in user space, which adds substantial overhead. This means we can't use it for continuous profiling.
On 6/17/22 15:10, Daan De Meyer via devel wrote:
The kernel benchmarks were added as an example of openly available data we could find on the potential impact of frame pointers. Note that the email from Mel Gorman is all we have to go on. Unfortunately the original data from the benchmarks is gone so I can't try to reproduce them. I've emailed Mel to see if he still has the benchmarks stored somewhere so we can perhaps try to reproduce the results.
I've added a clarification to the change proposal that we don't intend to actually compile the kernel with frame pointers, since the kernel is already built with ORC support and this works well so there's nothing to really be gained by building the kernel with frame pointers. That means we won't see the kernel regressions that were reported by the Suse benchmarks.
Unfortunately, there are no readily available benchmarks that I've been able to find that would show the exact impact of frame pointers on common Fedora workflows. The Phoronix benchmark suite could be used, but that would imply doing a mass rebuild with frame pointers before we could actually run it and measure the impact.
I don’t think it is a good idea to do something that would regress performance until Fedora is competitive when it comes to real-world performance (boot time, latency in GUI applications, etc). Synthetic benchmarks are less important.
Also, as mentioned in the proposal, all our internal services at Meta are built with frame pointers enabled. We did canaries a few years ago on some of our most CPU intensive services to see if it would make sense to build them without frame pointers, and found that there were no significant enough wins to be had to justify the loss in continuous profiling data caused by building without frame pointers
Can you provide actual numbers here?
(Are you referring to a novel kernel-resident tool?)
Unfortunately, no, there's no in-kernel DWARF unwinder due to the complexity involved. Instead, the kernel uses ORC and has an unwinder for that. Adding ORC support to all of Linux userspace so that we can unwind it in the kernel isn't likely to happen, since all tooling would have to be changed to support ORC.
How difficult would it be to do exactly this?
The proposal doesn't characterize the "reasonably low overhead" that this operation targets. That makes it hard to judge the tradeoffs.
Characterizing the impact would mean rebuilding most of the distro with frame pointers and running a comprehensive benchmark suite on it. Doing this will be a rather involved process. If you know of any other representative benchmark suites that we could run that wouldn't require rebuilding most of the distro, we could look into running these with and without frame pointers to measure the impact.
If typing that option were a hardship, it could be made default on Fedora. With broad debuginfod auto-downloading capability, maybe it's worth considering.
The issue with DWARF isn't that we have to add an extra option to perf, it's that without an in kernel DWARF unwinder (which is very unlikely to ever happen as discussed above), it's expensive to use DWARF for stacktrace unwinding, as we have to copy the entire stack and unwind it in user space, which adds substantial overhead. This means we can't use it for continuous profiling.
Would it be possible for the kernel to somehow drop into ring 3, use whatever library userspace would use to unwind the stack, and then use some magic syscall to return to kernel space? Or is the problem that perf runs in NMI context and so very much cannot do anything fancy?
On Fri, Jun 17, 2022 at 07:10:01PM -0000, Daan De Meyer via devel wrote:
The kernel benchmarks were added as an example of openly available data we could find on the potential impact of frame pointers. Note that the email from Mel Gorman is all we have to go on. Unfortunately the original data from the benchmarks is gone so I can't try to reproduce them. I've emailed Mel to see if he still has the benchmarks stored somewhere so we can perhaps try to reproduce the results.
I've added a clarification to the change proposal that we don't intend to actually compile the kernel with frame pointers
OK, so the (old!) kernel benchmarks are now of little relevance.
I think that this really needs to be benchmarked. Without that the discussion will just go on in circles. There are two nice options for this: either copr, where you first build the redhat-rpm-config with the adjusted options and then rebuild some subset of rawhide to be able to do benchmarks, or just the same thing locally on a server with a bunch of CPUs. One architecture should be enough. Mini-rebuilds like that take a few days, so it's entirely possible to do.
Zbyszek
That makes total sense. We're looking at doing exactly this. Are there any particular benchmarks the community would be interested in? Our idea was to take the Phoronix test suite and run some of the relevant suites from that with and without the updated packages and compare the results. On a quick initial look, compilation and compression suites look like relevant benchmarks for this use case. If there are any other benchmarks that would be interesting, I can take a look at running those as well.
Cheers,
Daan
* Zbigniew Jędrzejewski-Szmek:
I think that this really needs to be benchmarked. Without that the discussion will just go on in circles. There are two nice options for this: either copr, where you first build the redhat-rpm-config with the adjusted options and then rebuild some subset of rawhide to be able to do benchmarks, or just the same thing locally on a server with a bunch of CPUs. One architecture should be enough.
If that one architecture is x86_64. Note that x86_64 needs -mno-omit-leaf-frame-pointer, not just -fno-omit-frame-pointer.
i686 will have additional build failures and very different performance characteristics. ppc64le already uses a backchain, s390x has a different backchain convention (and may not be affected by -fno-omit-frame-pointer). aarch64 needs -mno-omit-leaf-frame-pointer as well.
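To illustrate the leaf-function point with a rough editorial sketch (assuming x86-64 and GCC; the defaults depend on target and tuning):

/* leaf.c - shows why -mno-omit-leaf-frame-pointer matters:
 *   gcc -O2 -S -fno-omit-frame-pointer                              leaf.c
 *   gcc -O2 -S -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer leaf.c
 * Depending on the target defaults, the first command may still emit
 * leaf_fn() without the push %rbp / mov %rsp,%rbp prologue; a sample taken
 * while leaf_fn() is running then skips caller() in a frame-pointer walk.
 * The second command keeps the prologue in leaf functions as well.
 */
__attribute__((noinline)) long leaf_fn(long x)
{
    return x * x + 7;     /* calls nothing: a leaf function */
}

long caller(long x)
{
    return leaf_fn(x) + leaf_fn(x + 1);
}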
Mini-rebuilds like that take a few days, so it's entirely possible to do.
I pushed a change to rawhide glibc which makes it inherit the -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer from redhat-rpm-config. I think both flags are required if you want to use the frame pointers for accurate unwinding.
GCC (for libgcc and libstdc++) will need very different changes. I don't think it inherits the build flags, either.
Thanks, Florian
Unfortunately, no, there's no in-kernel DWARF unwinder due to the complexity involved. Instead, the kernel uses ORC and has an unwinder for that. Adding ORC support to all of Linux userspace so that we can unwind it in the kernel isn't likely to happen, since all tooling would have to be changed to support ORC.
A huge task, most definitely. Setting that aside though, which cases (if any) would that solution *not* handle? Is the level of effort required the primary blocker to such a solution, or are there technical ones as well?
It's not even a question worth asking because it's both impractical and unlikely to actually fix the situation we have.
Ben Cotton wrote (on behalf of Daan De Meyer, Davide Cavalca, Andrii Nakryiko):
Fedora will add -fno-omit-frame-pointer to the default C/C++ compilation flags, which will improve the effectiveness of profiling and debugging tools.
[…]
Any performance or efficiency work relies on accurate profiling data.
So, you propose to destroy programs' performance in order to allow people to more easily improve programs' performance? That strikes me as a particularly bad idea.
The profiling data collected with -fno-omit-frame-pointer will also not be accurate for when the software is compiled with the default (under -O2 and above) -fomit-frame-pointer. (In particular, the cost of shorter functions will be significantly higher (in relation to the total) with the frame pointer than with omitted frame pointer.) So it can actually lead to the software getting tuned only for the Fedora build configuration.
In addition, the frame pointer also increases code size (though, to be fair, the asynchronous unwind tables that we usually want for code compiled with -fomit-frame-pointer also take space).
Sampling profilers probe the target program's call stack at regular intervals and store the stack traces. If we collect enough of them, we can closely approximate the real cost of a library or function with minimal runtime overhead.
Sampling profilers are fast, but will never reach the accuracy of Valgrind (which, as far as I know, can deal with DWARF unwinding just fine).
Kevin Kofler
On 6/17/22 06:55, Kevin Kofler via devel wrote:
So, you propose to destroy programs' performance in order to allow people to more easily improve programs' performance? That strikes me as a particularly bad idea.
A much better solution would be to enhance perf to run the unwinder in user mode. The whole reason that this is even a tradeoff is that perf does unwinding in kernel mode, and a kernel-mode DWARF parser is a terrible idea for security reasons.
The profiling data collected with -fno-omit-frame-pointer will also not be accurate for when the software is compiled with the default (under -O2 and above) -fomit-frame-pointer. (In particular, the cost of shorter functions will be significantly higher (in relation to the total) with the frame pointer than with omitted frame pointer.) So it can actually lead to the software getting tuned only for the Fedora build configuration.
In addition, the frame pointer also increases code size (though, to be fair, the asynchronous unwind tables that we usually want for code compiled with -fomit-frame-pointer also take space).
Sampling profilers probe the target program's call stack at regular intervals and store the stack traces. If we collect enough of them, we can closely approximate the real cost of a library or function with minimal runtime overhead.
Sampling profilers are fast, but will never reach the accuracy of Valgrind (which, as far as I know, can deal with DWARF unwinding just fine).
Valgrind is not helpful for profiling production workloads. It is too slow and will not provide an accurate indication of where the time is being spent. That requires a sampling profiler.
Demi Marie Obenour wrote:
Valgrind is not helpful for profiling production workloads. It is too slow and will not provide an accurate indication of where the time is being spent. That requires a sampling profiler.
IMHO, Valgrind (with the Callgrind or Cachegrind profiles) has a pretty good cost model. Is it slow? Yes, definitely. (Count up to a factor 50 slowdown for CPU-bound code.) Does it tell you where the bottlenecks are? In my experience, it does. I have even run entire JVMs through Valgrind Callgrind in order to find bottlenecks in the native C/C++ JNIs. (It will not help with the Java code, of course. You need a Java profiler for that.) It has always found the problem spots, where fixing them made the program faster. So, while I can understand the "too slow" part, I cannot agree with your "will not provide an accurate indication of where the time is being spent" assertion. It is quite the opposite: sampling will necessarily be less accurate because it can only take snapshots at certain intervals whereas Valgrind monitors the entire program execution at all times.
Kevin Kofler
Kevin Kofler via devel wrote:
PS: Now, what I have NOT done with Valgrind, and what I agree Valgrind is not designed for, is what Meta is apparently doing with Perf, i.e., running their production server services with profiling always enabled. I normally run CLI tools or unit tests in Valgrind, which is what it works with best. If I have to profile or debug a server in Valgrind, I need to run a dedicated local instance for the test. But usually, it is better to just write a non-server test program that simulates the workload without a need for a client.
What Meta is doing can indeed only reasonably be done with a sampling profiler, with additional restrictions (in particular, they state that Perf's support to drop back to userspace for DWARF unwinding is too slow for them), but is that a common enough use case that it warrants globally degrading Fedora performance for all users, most of whom do not use Perf this way (if they even use it at all)?
I see -fno-omit-frame-pointer as a crude workaround for lack of proper unwinding support, not something you want to ever use in production.
Kevin Kofler
[...] What Meta is doing can indeed only reasonably be done with a sampling profiler, with additional restrictions (in particular, they state that Perf's support to drop back to userspace for DWARF unwinding is too slow for them) [...]
(Note FWIW that systemtap's kernel runtime does DWARF-based unwinding for the kernel and user-designated userspace executables and shared-libraries on demand, and it's not particularly slow at it.)
- FChE
On 6/18/22 08:57, Frank Ch. Eigler wrote:
[...] What Meta is doing can indeed only reasonably be done with a sampling profiler, with additional restrictions (in particular, they state that Perf's support to drop back to userspace for DWARF unwinding is too slow for them) [...]
(Note FWIW that systemtap's kernel runtime does DWARF-based unwinding for the kernel and user-designated userspace executables and shared-libraries on demand, and it's not particularly slow at it.)
How does it do that? Does it have a kernel-mode DWARF unwinder? Is this trick something that perf could use as well?
demiobenour wrote:
(Note FWIW that systemtap's kernel runtime does DWARF-based unwinding for the kernel and user-designated userspace executables and shared-libraries on demand, and it's not particularly slow at it.)
How does it do that? Does it have a kernel-mode DWARF unwinder?
Yes.
# stap -e 'probe timer.profile { if(user_mode()) { print_ubacktrace() println() } }' \
    -d /lib64/libc.so.6 -d /path/to/other/library --ldd
see also, e.g.: https://sourceware.org/systemtap/man/stap.1.html https://sourceware.org/systemtap/tapsets/API-print-backtrace-fileline.html https://sourceware.org/systemtap/tapsets/API-print-ubacktrace-brief.html
Is this trick something that perf could use as well?
Yes, but this approach is unlikely to be adopted there.
- FChE
On 6/18/22 13:22, Frank Ch. Eigler wrote:
demiobenour wrote:
(Note FWIW that systemtap's kernel runtime does DWARF-based unwinding for the kernel and user-designated userspace executables and shared-libraries on demand, and it's not particularly slow at it.)
How does it do that? Does it have a kernel-mode DWARF unwinder?
Yes.
# stap -e 'probe timer.profile { if(user_mode()) { print_ubacktrace() println() } }' \ -d /lib64/libc.so.6 -d /path/to/other/library --ldd
see also, e.g.: https://sourceware.org/systemtap/man/stap.1.html https://sourceware.org/systemtap/tapsets/API-print-backtrace-fileline.html https://sourceware.org/systemtap/tapsets/API-print-ubacktrace-brief.html
Is this trick something that perf could use as well?
Yes, but this approach is unlikely to be adopted there.
Why is this?
* Ben Cotton:
- Meta builds all its libraries and executables with
-fno-omit-frame-pointer by default. Internal benchmarks did not show significant impact on performance when omitting the frame pointer for two of our most performance intensive applications.
They probably saw *significant* (in the statistics sense) performance regressions, but deemed them acceptable.
- Firefox recently landed a change to preserve the frame pointer in
all jitted code (https://bugzilla.mozilla.org/show_bug.cgi?id=1426134). No significant decrease in performance was observed.
That could be because they have to do stack walking as part of regular operation. So I'm not sure if this is an appropriate comparison. It's also possible that the JIT compiler had issues that prevented it from taking full advantage of a larger register file.
What you see on the Mozilla ticket is stuff that broke with the slightly smaller register set. That is going to bite some packages on i686. (It's not going to impact x86-64, I think.)
- [https://lwn.net/Articles/680985 LBR] - New Intel CPUs have a
feature that gives you source and target addresses for the last 16 (or 32, in newer CPUs) branches with no overhead. It can be configured to record only function calls and to be used as a stack, which means it can be used to get the stack trace. Sadly, you only get the last X calls, and not the full stack trace, so the data can be very incomplete. On top of that, many Fedora users might still be using CPUs without LBR support which means we wouldn't be able to assume working profilers on a Fedora system by default.
Do you really need more than five or so *physical* stack frames during profiling, to figure out what is going on? Graphs generated from DWARF unwinding typically will show logical stack frames from inlining, too, and appear much deeper.
Thanks, Florian
Do you really need more than five or so *physical* stack frames during profiling, to figure out what is going on? Graphs generated from DWARF unwinding typically will show logical stack frames from inlining, too, and appear much deeper.
Yes, more than a couple of times per year I encounter cases that require 10 to 12 physical stack frames in order to understand what is happening. The current (deepest) level can be more than 5 or so frames from the nearest level that has meaning for the application programmer. This happens particularly often in Boost, other C++ object-oriented programming (std::string and relatives have several physical levels that tend to obscure things), and Python (or any interpreter with a deeply recursive implementation).
Hi,
On June 16, 2022 8:53:59 PM UTC, Ben Cotton bcotton@redhat.com wrote:
https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.
== Summary ==
Fedora will add -fno-omit-frame-pointer to the default C/C++ compilation flags, which will improve the effectiveness of profiling and debugging tools.
== Owner ==
- Name: [[User:daandemeyer| Daan De Meyer]], [[User:Dcavalca| Davide
Cavalca]], [[ Andrii Nakryiko]]
- Email: daandemeyer@fb.com, dcavalca@fb.com, andriin@fb.com
== Detailed Description ==
Credits to Mirek Klimos, whose internal note on stacktrace unwinding formed the basis for this change proposal (myreggg@gmail.com).
Any performance or efficiency work relies on accurate profiling data. Sampling profilers probe the target program's call stack at regular intervals and store the stack traces. If we collect enough of them, we can closely approximate the real cost of a library or function with minimal runtime overhead.
Stack trace capture what’s running on a thread. It should start with clone - if the thread was created via clone syscall - or with _start - if it’s the main thread of the process. The last function in the stack trace is code that CPU is currently executing. If a stack starts with [unknown] or any other symbol, it means it's not complete.
=== Unwinding ===
How does the profiler get the list of function names? There are two parts of it:
# Unwinding the stack - getting a list of virtual addresses pointing to the executable code # Symbolization - translating virtual addresses into human-readable information, like function name, inlined functions at the address, or file name and line number.
Unwinding is what we're interested in for the purpose of this proposal. The important things are:
- Data on stack is split into frames, each frame belonging to one function.
- Right before each function call, the return address is put on the
stack. This is the instruction address in the caller to which we will eventually return — and that's what we care about.
- One register, called the "frame pointer" or "base pointer" register
(RBP), is traditionally used to point to the beginning of the current frame. Every function should back up RBP onto the stack and set it properly at the very beginning.
The “frame pointer” part is achieved by adding push %rbp, mov %rsp,%rbp to the beginning of every function and by adding pop %rbp before returning. Using this knowledge, stack unwinding boils down to traversing a linked list:
As you specifically use x86_64 assembly as an example here: have you looked at the impact this will have on other architectures like ARM or RISC-V?
Cheers,
Dan
=== Where’s the catch? ===
The frame pointer register is not necessary to run a compiled binary. It makes it easy to unwind the stack, and some debugging tools rely on frame pointers, but the compiler knows how much data it put on the stack, so it can generate code that doesn't need the RBP. Not using the frame pointer register can make a program more efficient:
- We don’t need to back up the value of the register onto the stack,
which saves 3 instructions per function.
- We can treat the RBP as a general-purpose register and use it for
something else.
Whether the compiler sets frame pointer or not is controlled by the -fomit-frame-pointer flag and the default is "omit", meaning we can’t use this method of stack unwinding by default.
To make it possible to rely on the frame pointer being available, we'll add -fno-omit-frame-pointer to the default C/C++ compilation flags. This will instruct the compiler to make sure the frame pointer is always available. This will in turn allow profiling tools to provide accurate performance data which can drive performance improvements in core libraries and executables.
== Feedback ==
=== Potential performance impact ===
- Meta builds all its libraries and executables with
-fno-omit-frame-pointer by default. Internal benchmarks did not show significant impact on performance when omitting the frame pointer for two of our most performance intensive applications.
- Firefox recently landed a change to preserve the frame pointer in
all jitted code (https://bugzilla.mozilla.org/show_bug.cgi?id=1426134). No significant decrease in performance was observed.
- Kernel 4.8 frame pointer benchmarks by Suse showed 5%-10%
regressions in some benchmarks (https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u)
Should individual libraries or executables notice a significant performance degradation caused by including the frame pointer everywhere, these packages can opt out on an individual basis as described in https://docs.fedoraproject.org/en-US/packaging-guidelines/#_compiler_flags.
=== Alternatives to frame pointers ===
There are a few alternative ways to unwind stacks instead of using the frame pointer:
- [https://dwarfstd.org DWARF] data - The compiler can emit extra
information that allows us to find the beginning of the frame without the frame pointer, which means we can walk the stack exactly as before. The problem is that the stack needs to be unwound in kernel space, and the kernel has no DWARF unwinder. Given that the kernel implemented its own format (ORC) instead of using DWARF, it's unlikely that we'll see a DWARF unwinder in the kernel any time soon. The perf tool allows you to use the DWARF data with --call-graph=dwarf, but this means that it copies the full stack on every event and unwinds in user space. This has very high overhead (see the sketch below).
- ORC (undwarf) - problems with unwinding in the kernel led to the creation of another format with the same purpose as DWARF, just much simpler. This can only be used to unwind kernel stack traces; it doesn't help us with userspace stacks. More information on ORC can be found [https://lwn.net/Articles/728339 here].
- [https://lwn.net/Articles/680985 LBR] - New Intel CPUs have a
feature that gives you source and target addresses for the last 16 (or 32, in newer CPUs) branches with no overhead. It can be configured to record only function calls and to be used as a stack, which means it can be used to get the stack trace. Sadly, you only get the last X calls, and not the full stack trace, so the data can be very incomplete. On top of that, many Fedora users might still be using CPUs without LBR support which means we wouldn't be able to assume working profilers on a Fedora system by default.
To summarize, if we want complete stacks with reasonably low overhead (which we do, there's no other way to get accurate profiling data from running services), frame pointers are currently the best option.
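To make the DWARF tradeoff above concrete, here is a minimal sketch (not perf's actual code; the helper names and the 8 KiB figure are just for illustration) of the two sampling configurations at the perf_event_open(2) level. With frame pointers the kernel hands back a ready-made callchain; the DWARF approach has to copy user registers and a slice of the user stack into the ring buffer on every sample:

/* Sketch of the two sampling configurations, using the perf_event_open(2)
 * API directly.  Field names are from <linux/perf_event.h>; error handling
 * is omitted for brevity. */
#include <linux/perf_event.h>
#include <string.h>

static void fp_callchain_config(struct perf_event_attr *attr)
{
    /* Frame-pointer unwinding: the kernel walks the %rbp chain of the
     * sampled task and emits just an array of return addresses.       */
    memset(attr, 0, sizeof(*attr));
    attr->size        = sizeof(*attr);
    attr->type        = PERF_TYPE_SOFTWARE;
    attr->config      = PERF_COUNT_SW_CPU_CLOCK;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
}

static void dwarf_config(struct perf_event_attr *attr)
{
    /* "--call-graph=dwarf" style: copy the user registers and a chunk of
     * the user stack into the ring buffer on every sample, then unwind
     * later in userspace with the DWARF/.eh_frame data.                */
    memset(attr, 0, sizeof(*attr));
    attr->size              = sizeof(*attr);
    attr->type              = PERF_TYPE_SOFTWARE;
    attr->config            = PERF_COUNT_SW_CPU_CLOCK;
    attr->sample_type       = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_USER |
                              PERF_SAMPLE_STACK_USER;
    attr->sample_regs_user  = 0xff;   /* arch-specific register mask (placeholder) */
    attr->sample_stack_user = 8192;   /* bytes of user stack copied per sample */
}

perf record -g requests the first configuration by default; --call-graph=dwarf corresponds to the second, which is why its per-sample cost and capture size are so much higher.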
== Benefit to Fedora ==
Implementing this change will provide profiling tools with easy access to stacktraces of installed libraries and executables which will lead to more accurate profiling data in general. This in turn can be used to implement optimizations to core libraries and executables which will improve the overall performance of Fedora itself and the wider Linux ecosystem.
Various debugging tools can also make use of the frame pointer to access the current stacktrace, although tools like gdb can already do this to some degree via embedded DWARF debugging info.
== Scope ==
- Proposal owners: Put up a PR to change the rpm macros to build
packages with -fno-omit-frame-pointer by default.
- Other developers: Review and merge the PR implementing the Change.
- Release engineering: [https://pagure.io/releng/issues #Releng issue number]. A mass rebuild is required.
- Policies and guidelines: N/A (not needed for this Change)
- Trademark approval: N/A (not needed for this Change)
- Alignment with Objectives: N/A
== Upgrade/compatibility impact ==
This should not impact upgrades in any way.
== How To Test ==
# Build the package with the updated rpm macros
# Profile the binary with `perf record -g <binary>`
# Inspect the perf data with `perf report -g 'graph,0.5,caller'`
# When expanding hot functions in the perf report, perf should show the full call graph of the hot function (at least for all functions that are part of the binary compiled with -fno-omit-frame-pointer)
== User Experience ==
Fedora users will be more likely to have a streamlined experience when trying to debug/profile system executables/libraries. Tools such as perf will work out of the box instead of requiring users to provide extra options (e.g. --call-graph=dwarf/LBR) or requiring users to recompile all relevant packages with -fno-omit-frame-pointer.
== Dependencies ==
The rpm macros for Fedora need to be adjusted to include -fno-omit-frame-pointer in the default C/C++ compilation flags.
== Contingency Plan ==
- Contingency mechanism: The new version can be released without every
package being rebuilt with -fno-omit-frame-pointer. Profiling will only work perfectly once all packages have been rebuilt, but there will be no regression in behavior if not all packages have been rebuilt by the time of the release. If the Change is found to introduce unacceptable regressions, the PR implementing it can be reverted and affected packages can be rebuilt.
- Contingency deadline: Final freeze
- Blocks release? No
== Documentation ==
- Original proposal for in-kernel DWARF unwinder (rejected):
https://lkml.org/lkml/2017/5/5/571
== Release Notes ==
Packages are now compiled with frame pointers included by default. This will enable a variety of profiling and debugging tools to show more information out of the box.
On 16/06/2022 22:53, Ben Cotton wrote:
- Kernel 4.8 frame pointer benchmarks by Suse showed 5%-10%
regressions in some benchmarks
I'm not satisfied with the 10% performance loss.
Please do more testing on large projects like Chromium (you must also rebuild all dependent packages with -fno-omit-frame-pointer to get the maximum impact).
* Ben Cotton:
=== Where’s the catch? ===
The frame pointer register is not necessary to run a compiled binary. It makes it easy to unwind the stack, and some debugging tools rely on frame pointers, but the compiler knows how much data it put on the stack, so it can generate code that doesn't need the RBP. Not using the frame pointer register can make a program more efficient:
- We don’t need to back up the value of the register onto the stack,
which saves 3 instructions per function.
- We can treat the RBP as a general-purpose register and use it for
something else.
Whether the compiler sets frame pointer or not is controlled by the -fomit-frame-pointer flag and the default is "omit", meaning we can’t use this method of stack unwinding by default.
To make it possible to rely on the frame pointer being available, we'll add -fno-omit-frame-pointer to the default C/C++ compilation flags. This will instruct the compiler to make sure the frame pointer is always available. This will in turn allow profiling tools to provide accurate performance data which can drive performance improvements in core libraries and executables.
I think this paints an incomplete picture. Many programs spend a noticeable fraction of their time in the glibc string functions (particularly memcpy and memset, maybe also memmove, strcpy, strlen, memcmp, and strcmp). These string functions are implemented in hand-tuned assembler and do not set up a frame pointer. I assume this means that a backchain-based unwinder will pick up %rbp in these functions and use that to find the caller's frame and the address of its caller, which is *not* the caller of the string function, but the next caller after that. This means that profiles generated this way will lack the immediate callers of the string functions, which I expect will be rather confusing. Given how often string functions show up in profiles, I think this is hardly acceptable.
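To illustrate the mechanism described here, a minimal, purely illustrative frame-pointer ("backchain") walker is sketched below. It assumes every function on the stack ran the push %rbp; mov %rsp,%rbp prologue; a hand-written assembly routine that skips that prologue leaves %rbp pointing at an older frame, so its immediate caller silently disappears from the walk:

/* Minimal frame-pointer walker for x86-64.  Illustrative only; build
 * everything with -fno-omit-frame-pointer for it to see the full stack. */
#include <stdio.h>

struct frame {
    struct frame *prev;   /* saved %rbp of the caller                      */
    void         *ret;    /* return address pushed by the call instruction */
};

static void walk_stack(void)
{
    struct frame *fp = __builtin_frame_address(0);   /* current %rbp */

    for (int depth = 0; fp != NULL && depth < 64; depth++) {
        printf("#%d  %p\n", depth, fp->ret);
        fp = fp->prev;
    }
}

int main(void)
{
    walk_stack();
    return 0;
}

The kernel's frame-pointer callchain mode in perf does essentially this walk per sample, which is why it is so much cheaper than copying the stack out for DWARF unwinding.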
I do not want to maintain a fork of glibc which adds frame pointers to the string functions because there are so many variants of them (making achieving decent test coverage difficult), and the upstream change rate in this area is pretty high. The risk of semantic (not textual) merge conflicts is also high because we might not notice if an early return instruction is introduced.
I really dislike this proposal and want to record my objection. Instead, I recommend to use better profilers (or at least profilers less political about DWARF) and CPUs with matching hardware support.
DWARF-based unwinding does not have to be extremely slow. There is a widespread belief that it has to be that way because of some magic DWARF properties. It's perhaps not as fast as it could be. But repeating this claim like a mantra merely dissuades people from looking at performance improvements. For example, we made a few simple changes in glibc 2.35 and GCC 12 to make in-process unwinding efficient with many shared objects and in multi-threaded processes. I do wonder if we could have arrived there many, many years ago if it weren't for the “DWARF is slow” meme. (And now that's done, there are other straightforward implementation issues in the libgcc in-process unwinder that could be improved.)
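For reference, the libgcc in-process unwinder mentioned here is reachable from C via <unwind.h>; a minimal sketch (illustrative only, error handling omitted) that needs no frame pointers because it consumes the .eh_frame data:

#include <stdio.h>
#include <unwind.h>

static _Unwind_Reason_Code frame_cb(struct _Unwind_Context *ctx, void *arg)
{
    int *depth = arg;
    printf("#%d  %p\n", (*depth)++, (void *)_Unwind_GetIP(ctx));
    return _URC_NO_REASON;
}

void print_backtrace(void)
{
    int depth = 0;
    _Unwind_Backtrace(frame_cb, &depth);   /* walks CFI/.eh_frame, not %rbp */
}

This is the in-process path that the glibc 2.35 and GCC 12 changes sped up; the contested case in this thread is the out-of-process one, where a profiler has to unwind other tasks' stacks from samples.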
Thanks, Florian
Given the recent benchmarks from Phoronix (https://www.phoronix.com/scan.php?page=article&item=fedora-frame-pointer...) on the proposal that showed some surprising results, we went and tried to reproduce some of the benchmarks to make sure they were actually making sense.
The first one we looked at is the redis benchmark from https://www.phoronix.com/scan.php?page=article&item=fedora-frame-pointer.... We were unable to reproduce the results from the Phoronix article.
Redis GET: https://user-images.githubusercontent.com/9395011/176536797-7424d40f-7140-46... Redis SET: https://user-images.githubusercontent.com/9395011/176536624-eeb5f85c-a63b-49...
Instead, we only saw differences from 0%-2% between Redis compiled with frame pointers and Redis compiled without frame pointers. These benchmarks were done using the phoronix-test-suite in exactly the same way as documented in the phoronix article.
The other one we looked at is the Botan AES-256 benchmark (https://www.phoronix.com/scan.php?page=article&item=fedora-frame-pointer...). Initially, we were able to reproduce the results of this benchmark when setting CXXFLAGS="-fno-omit-frame-pointer". However, what we found is that, due to the way Botan's custom build system works, when the CXXFLAGS environment variable is set to enable -fno-omit-frame-pointer, the botan binary is built in debug mode without optimizations, whereas when CXXFLAGS is unset, the botan binary is built in release mode (-O3). This explains the huge difference in performance in the Botan AES-256 benchmark.
When making sure both binaries are built in release mode by setting CXXFLAGS="-O2" and CXXFLAGS="-O2 -fno-omit-frame-pointer" respectively, we get the following results:
Without frame pointers:
AES-256 encrypt buffer size 1024 bytes: 5410.085 MiB/sec 0.42 cycles/byte (2705.04 MiB in 500.00 ms)
AES-256 decrypt buffer size 1024 bytes: 5407.610 MiB/sec 0.42 cycles/byte (2703.81 MiB in 500.00 ms)
With frame pointers:
AES-256 encrypt buffer size 1024 bytes: 5359.241 MiB/sec 0.42 cycles/byte (2679.62 MiB in 500.00 ms)
AES-256 decrypt buffer size 1024 bytes: 5404.226 MiB/sec 0.42 cycles/byte (2702.11 MiB in 500.00 ms)
This shows a smaller than 1% slowdown between the binary built with frame pointers and the binary built without frame pointers.
Supposedly, the Phoronix benchmark was also built with "-O2" in both configurations, but given that building Botan in debug mode produced results very similar to those in the Phoronix article, we assume that's what happened with the Botan AES benchmark.
We haven't yet dug deeper into the other benchmarks, but we expect that the benchmarks showing significant differences might suffer from similar issues, where the huge differences are not caused by the inclusion of frame pointers but by other unrelated problems, such as the Botan case where setting CXXFLAGS causes binaries to be built in debug mode unless an explicit optimization level is set.
These benchmarks were done on an Amazon EC2 instance running Fedora 36 Cloud edition. The full details as reported by phoronix-test-suite can be found here: https://user-images.githubusercontent.com/9395011/176538700-c82974fa-fbb5-41...
Daan De Meyer via devel wrote:
Which shows a smaller than 1% slowdown between the binary built with frame pointers and the binary built without frame pointers.
Still 1% too many just to work around broken debugging tools when DWARF unwinding has been available for years and is already supported by many tools. (GCC would not default to -fomit-frame-pointer on -O2 otherwise. It does not do that on platforms where frame pointers are really needed for debugging.)
And what is the impact on code size? In my experience, -fomit-frame-pointer also generates smaller code than -fno-omit-frame-pointer, so I would like to see the sizes in your test cases.
I am still strongly opposed to degrading performance and size for all users just to help the handful users of poorly-designed profiling tools.
Kevin Kofler
On 6/30/22 04:54, Kevin Kofler via devel wrote:
I am still strongly opposed to degrading performance and size for all users just to help the handful users of poorly-designed profiling tools.
I agree.
My career experience has been that having an extra register available for compiler scheduling and for reducing register pressure has a real and tangible performance benefit.
I have had to use frame pointers, but only for deeply embedded projects where the cost tradeoffs are different and a smaller constrained unwinder was needed.
I would not recommend the use of -fno-omit-frame-pointer in Fedora.
I have had to use frame pointers, but only for deeply embedded projects where the cost tradeoffs are different and a smaller constrained unwinder was needed.
As mentioned in the change proposal, when using sampling profilers that rely on fast access to the stacktrace, there is currently no viable alternative to frame pointers. DWARF unwinding in the absence of frame pointers is too slow because of the complexity of the DWARF format and the need to copy the stack to userspace and unwind it there, due to the lack of an in-kernel DWARF unwinder.
Looking at the future, we will be following up on the alternative approaches such as CTF Frame which will hopefully provide us with a sufficiently fast way to unwind the stack in the kernel itself without requiring frame pointers. Until such an alternative is available, we see no option but to use frame pointers in order to do reliable and fast profiling.
Cheers,
Daan
Daan De Meyer via devel wrote:
As mentioned in the change proposal, when using sampling profilers that rely on fast access to the stacktrace, there is currently no viable alternative to frame pointers. DWARF unwinding in absence of frame pointers is too slow because of the complexity of the DWARF format and the necessity to copy the stack to userspace and do unwinding there due to the lack of an in kernel DWARF unwinder.
And as already pointed out several times in this thread, that is a very niche use case for which you cannot hold the entire Fedora community hostage. This is not Facebook Linux.
I do not know anybody personally who runs 24/7 profiling in production as you apparently do.
Kevin Kofler
On July 4, 2022 2:54:11 PM UTC, Kevin Kofler via devel devel@lists.fedoraproject.org wrote:
Daan De Meyer via devel wrote:
As mentioned in the change proposal, when using sampling profilers that rely on fast access to the stacktrace, there is currently no viable alternative to frame pointers. DWARF unwinding in absence of frame pointers is too slow because of the complexity of the DWARF format and the necessity to copy the stack to userspace and do unwinding there due to the lack of an in kernel DWARF unwinder.
And as already pointed out several times in this thread, that is a very niche use case for which you cannot hold the entire Fedora community hostage. This is not Facebook Linux.
I do not know anybody personally who runs 24/7 profiling in production as you apparently do.
I think Kevin has a point here. This change will degrade the performance of everyone at all times so that a minority can run their very niche use case. I would thus suggest that, if you really need this in production, you rebuild just the packages you need with the additional flags in koji, copr or obs and use those instead of making everyone pay the price.
If it turns out that the performance hit is really negligible (especially on register starved architectures), then I'd be more than happy to revisit the topic. There's really no need for duplicate work if it has no negative impact on the majority of our users.
Cheers,
Dan
On Thu, Jun 30, 2022 at 4:55 AM Kevin Kofler via devel < devel@lists.fedoraproject.org> wrote:
Daan De Meyer via devel wrote:
Which shows a smaller than 1% slowdown between the binary built with
frame
pointers and the binary built without frame pointers.
I am still strongly opposed to degrading performance and size for all users just to help the handful users of poorly-designed profiling tools.
I am coming a bit late to this discussion, but I would like to inject the viewpoint that 'performance' (however defined) isn't the only criterion by which we should judge what Fedora produces. At least for Fedora Workstation, being a useful system for developers with working debugging and profiling tools should have some weight too.
And I doubt that you'd be able to notice a 'smaller than 1% slowdown' on your system.
Matthias Clasen wrote:
I am coming a bit late to this discussion, but I would like to inject the viewpoint that 'performance' (however defined) isn't the only criterion by which we should just judge what Fedora produces. At least for Fedora Workstation, being a useful system for developers with working debugging and profiling tools should have some weight too.
But we *have* "working debugging and profiling tools" with -fomit-frame- pointer. (In fact, as I already mentioned, this is the criterion for GCC to enable it by default under -O2 at all.)
What we have is *one* profiling tool (perf) in a very specific configuration (continuous profiling in production) that cannot deal with it in a way that the users consider acceptable. (As I understand it, perf *can* call back into user space to do DWARF unwinding, it is just that doing that all the time, on a production machine, has too high overhead to be useful.) That does not mean that we are stuck with no "working debugging and profiling tools". There would have been a huge outcry years ago when GCC made this change if that were the case.
What I see here is a single Fedora-using corporation attempting to use their lobbying power as a huge corporation to force a change on all Fedora users that ultimately benefits only that one corporation at everyone else's expense.
Kevin Kofler
On Tue, Jul 05, 2022 at 03:15:02PM -0400, Matthias Clasen wrote:
On Thu, Jun 30, 2022 at 4:55 AM Kevin Kofler via devel < devel@lists.fedoraproject.org> wrote:
Daan De Meyer via devel wrote:
Which shows a smaller than 1% slowdown between the binary built with
frame
pointers and the binary built without frame pointers.
I am still strongly opposed to degrading performance and size for all users just to help the handful users of poorly-designed profiling tools.
I am coming a bit late to this discussion, but I would like to inject the viewpoint that 'performance' (however defined) isn't the only criterion by which we should just judge what Fedora produces. At least for Fedora Workstation, being a useful system for developers with working debugging and profiling tools should have some weight too.
And I doubt that you'd be able to notice a 'smaller than 1% slowdown' on your system.
Maybe not, but even ~1% is still an unacceptable slowdown. It would take about a year for the compiler to catch up.
Marek
On Tue, Jul 5, 2022 at 3:40 PM Marek Polacek polacek@redhat.com wrote:
Maybe not, but even ~1% is still an unacceptable slowdown. It would take about a year for the compiler to catch up.
(Un)acceptable for whom? And why would it be unacceptable? You just said compilers will make up for it quickly, not to mention hardware continuously getting faster too...
I haven't seen any convincing arguments as to why such a small drop would be the end of the world.
And I don't think Fedora is or should be used in high-speed trading or similar environments where every microsecond matters.
* Matthias Clasen:
not to mention hardware continuously getting faster too...
The proposal is about enabling this feature for older hardware. Recent x86-64 CPUs can maintain an array of return addresses in hardware (so basically the backtrace is available directly). Kernel patches to enable this feature exist, but have not been merged yet. Fedora userspace has been prepared for this since around Fedora 28 or so, and we should be able to turn this on with just a glibc update once the kernel feature lands upstream.
Thanks, Florian
On Tue, Jul 05, 2022 at 03:47:26PM -0400, Matthias Clasen wrote:
On Tue, Jul 5, 2022 at 3:40 PM Marek Polacek polacek@redhat.com wrote:
Maybe not, but even ~1% is still an unacceptable slowdown. It would take about a year for the compiler to catch up.
(Un)acceptable for whom?
GCC maintainers in Fedora, at least.
And why would it be unacceptable?
Because it's too much.
You just said compilers will make up for it quickly, not to mention hardware continuously getting faster too...
Dozens of developers working a whole release (if not more) is not quick.
I haven't seen any convincing arguments as to why such a small drop would be the end of the world.
And likewise, I haven't seen how this proposal would be helpful to the majority of users, never mind that it'd likely break programs using inline assembly that use %rbp. But others have already raised similar points in this thread.
And I don't think Fedora is or should be used in high-speed trading or similar environments where every microsecond matters.
I think you may be underestimating how much even 1% matters.
Marek
On 7/6/2022 7:26 AM, Marek Polacek wrote:
On Tue, Jul 05, 2022 at 03:47:26PM -0400, Matthias Clasen wrote:
On Tue, Jul 5, 2022 at 3:40 PM Marek Polacek polacek@redhat.com wrote:
Maybe not, but even ~1% is still an unacceptable slowdown. It would take about a year for the compiler to catch up.
(Un)acceptable for whom?
GCC maintainers in Fedora, at least.
And why would it be unacceptable?
Because it's too much.
You just said compilers will make up for it quickly, not to mention hardware continuously getting faster too...
Dozens of developers working a whole release (if not more) is not quick.
I haven't seen any convincing arguments as to why such a small drop would be the end of the world.
And likewise, I haven't seen how this proposal would be helpful to the majority of users, nevermind that it'd likely break programs using inline assembly that use %rbp. But others have already raised similar points in this thread.
And I don't think Fedora is or should be used in high-speed trading or similar environments where every microsecond matters.
I think you may be underestimating how much even 1% matters.
Amen. A 1% hit is very significant -- as Marek indicated, that's roughly a year of work for the GCC community to recover.
Sometimes we're willing to take a 1% hit, sometimes not. It's a question of balancing the performance hit against the benefits of whatever change is being considered.
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
Jeff
On Wed, Jul 6 2022 at 08:06:45 AM -0600, Jeff Law jeffreyalaw@gmail.com wrote:
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
But all Fedora users benefit from performance improvements implemented as a result of profiling.
On 7/6/2022 8:20 AM, Michael Catanzaro wrote:
On Wed, Jul 6 2022 at 08:06:45 AM -0600, Jeff Law jeffreyalaw@gmail.com wrote:
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
But all Fedora users benefit from performance improvements implemented as a result of profiling.
Yes, but, IMHO, you need to find another way to do the profiling you need to get those improvements.
jeff
On Wed, 6 Jul 2022 at 15:57, Jeff Law jeffreyalaw@gmail.com wrote:
On 7/6/2022 8:20 AM, Michael Catanzaro wrote:
On Wed, Jul 6 2022 at 08:06:45 AM -0600, Jeff Law jeffreyalaw@gmail.com wrote:
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
But all Fedora users benefit from performance improvements implemented as a result of profiling.
Yes, but, IMHO, you need to find another way to do the profiling you need to get those improvements.
Right. Developers already need to rebuild packages locally. The suggestion that developers of Fedora packages can't improve those packages without making system-wide profiling work on every Fedora user's machine seems like nonsense.
Let's say you profile something and find a performance problem. You tweak the code to improve performance. How do you verify if it improves performance? Do you push the change to rawhide, wait for koji and bodhi to do their thing, then re-profile with the new packages from the updates-testing repo? And if that didn't work, push another change to rawhide, wait for koji and bodhi, and re-profile? Of course not, that would be ridiculous.
You build locally and profile using your locally built packages. So you're already doing rebuilds, and you're already profiling custom builds anyway. So you can add frame pointers back into your local builds that are used for profiling. It's not essential for frame pointers to be present in the official koji builds for you to do that. It might be a minor convenience because it reduces the number of system libs that you need to rebuild locally, but it's just a convenience, not a necessity.
Maybe we could make it easier to do local mock builds (or copr builds, or koji scratch builds) with changes to RPM_OPT_FLAGS to simplify rebuilding to get frame pointers. But pushing that to every user just so a handful of people don't have a few extra steps is the wrong trade-off.
On Wed, Jul 6 2022 at 04:20:50 PM +0100, Jonathan Wakely jwakely@redhat.com wrote:
You build locally and profile using your locally built packages.
Problem is that in order to get good profiling results today, you need to rebuild all dependencies with frame pointers enabled. And that is not realistic. Nobody does that.
Developers normally only build what they are working on, not 100 dependencies including mesa, glibc, etc.
Michael
On 7/6/22 08:42, Michael Catanzaro wrote:
On Wed, Jul 6 2022 at 04:20:50 PM +0100, Jonathan Wakely jwakely@redhat.com wrote:
You build locally and profile using your locally built packages.
Problem is that in order to get good profiling results today, you need to rebuild all dependencies with frame pointers enabled. And that is not realistic. Nobody does that.
Developers normally only build what they are working on, not 100 dependencies including mesa, glibc, etc.
With the current profiling methods, are you able to at least narrow down which libraries applications spend the most time in? Or do you really need detailed profile information for every single library in order to determine where the problem is?
-Tom
Michael
On Wed, Jul 6 2022 at 10:05:17 AM -0700, Tom Stellard tstellar@redhat.com wrote:
With the current profiling methods, are you able to at least narrow down which libraries applications spend the most time in? Or do you really need detailed profile information for every single library in order to determine where the problem is?
Honestly, I'm bad at profiling, so I'm not the best person to answer this. :/
I can point you to documentation for sysprof:
https://gitlab.gnome.org/GNOME/sysprof#debugging-symbols
which says that every library should be built with -fno-omit-frame-pointer.
Michael
Michael Catanzaro wrote:
I can point you to documentation for sysprof:
https://gitlab.gnome.org/GNOME/sysprof#debugging-symbols
which says that every library should be built with -fno-omit-frame-pointer.
And why is that? Do they not use libunwind, or GDB, or any other sane (reliable and portable) way to produce backtraces? Why not?
Kevin Kofler
Sysprof has modular data collection backends, and not everything requires linking against libunwind.
For those not familiar with Sysprof, or profiling the desktop at large, generally a single program is not the problem. The performance problems often exist across a number of processes. That can be anything from a library used by multiple applications which cumulatively waste resources, IPC across programs, thundering herds when files on disk change, GPU usage, CPU frequency scaling, memory bandwidth, RAPL, etc.
So Sysprof has a binary logging format that is straight-forward, efficient, and allows us to record many different types of information within a single file. That file format is used by a number of tools in the stack from GLib, Pango, Gtk, Mutter, GNOME Shell, GJS, various libraries, and applications on top of it. It can capture counters, stack traces, file contents, marks, logs, and a multitude of other data frames.
These capture files can also be muxed together at any point.
Some of the modular data collectors require libunwind, many do not. For example, the memprof collector records the backtraces from malloc/free/etc. But the GJS data-collector can use SpiderMonkey's internal APIs to get backtraces from a SIGPROF sigaction. The most used collector, however, is the perf collector which is just reading from a perf fd mmap'd into a ring buffer.
The perf collector doesn't record the whole stack because the amount of time it takes to decode a 30 second system-wide capture with DWARF/etc is so slow practically nobody would use it.
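(For readers who have not used the raw interface, a rough sketch, not Sysprof's actual code, of what "reading from a perf fd mmap'd into a ring buffer" amounts to. Names come from perf_event_open(2); the event type, sampling frequency and buffer size here are arbitrary choices, and error handling is trimmed.)

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RING_PAGES 8                 /* data area must be 2^n pages */

int open_sampler(pid_t pid, int cpu)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size        = sizeof(attr);
    attr.type        = PERF_TYPE_SOFTWARE;
    attr.config      = PERF_COUNT_SW_CPU_CLOCK;
    attr.freq        = 1;
    attr.sample_freq = 997;                               /* ~1 kHz */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                       PERF_SAMPLE_CALLCHAIN;             /* frame-pointer stacks */

    int fd = syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
    if (fd < 0)
        return -1;

    /* One metadata page plus the data pages; the kernel appends sample
     * records (including the callchain) here, and the collector reads
     * them out without a per-sample syscall.                           */
    long page = sysconf(_SC_PAGESIZE);
    void *ring = mmap(NULL, (1 + RING_PAGES) * page,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        close(fd);
        return -1;
    }
    return fd;
}

The callchains in those records are the kernel's frame-pointer walk of the sampled task, which is exactly what breaks down when libraries are built with -fomit-frame-pointer.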
The best profiler is the one people will use.
We have an in-tree parser for ELF that allows us to avoid a lot of extraneous code when extracting symbols. Partially because libunwind is incredibly slow (by profiler requirements), and partially because historically we never had to stash stack frames for contextual unwinding.
Could we write a new data collection module that does DWARF unwinding and stashes some 8kb of stack? Sure. Would people use it? Probably not, because again, it's so slow that people will start profiling by intuition again which is probably the worst of all options.
Can we write an eBPF kernel module to decode symbols there? Maybe? Can I? Probably not.
Personally, I think some libraries should not be compiled with -fno-omit-frame-pointer. However, I think that number is much smaller than the opposite. Encryption, graphics drivers, etc all seem like good candidates here to be explicit about performance requirements.
Sysprof's modular *and* system-wide profiling played a significant role in how GNOME Shell got faster over the past years. All of a sudden its developers had a tool which could coalesce stack traces, counters, marks, logs, display timing information and GL command state both from apps and compositors, and track event propagation across processes.
To my knowledge, we don't have this tooling anywhere else on Fedora. The sad part is, those who want to casually drive by and fix performance start with "recompile the stack with jhbuild" or I guess RPMs/koji if you're into that sort of thing.
Christian Hergert wrote:
For those not familiar with Sysprof, or profiling the desktop at large, generally a single program is not the problem. The performance problems often exist across a number of processes.
[…]
The most used collector, however, is the perf collector which is just reading from a perf fd mmap'd into a ring buffer.
So it looks like what you folks are doing is actually very similar to what Facebook is doing. That is interesting, and explains why some GNOME developers are jumping on the bandwagon of this Change proposal.
The perf collector doesn't record the whole stack because the amount of time it takes to decode a 30 second system-wide capture with DWARF/etc is so slow practically nobody would use it.
So this is basically the same issue that Facebook is having.
Could we write a new data collection module that does DWARF unwinding and stashes some 8kb of stack? Sure. Would people use it? Probably not, because again, it's so slow that people will start profiling by intuition again which is probably the worst of all options.
Does profiling individual applications fall under "profiling by intuition" for you? Because that is what I would expect developers to go back to if systemwide profiling stops being viable.
To my knowledge, we don't have this tooling anywhere else on Fedora. The sad part is, those who want to casually drive by and fix performance start with "recompile the stack with jhbuild" or I guess RPMs/koji if you're into that sort of thing.
And is it such a problem to require the handful developers who need to do systemwide profiling to do that, instead of slowing down the production installation for all users?
Kevin Kofler
So it looks like what you folks are doing is actually very similar to what Facebook is doing. That is interesting, and explains why some GNOME developers are jumping on the bandwagon of this Change proposal.
To be fair, we've been complaining about it internally in GNOME for probably around a decade. So it's not a new thing from our standpoint. What is new is that it appears others came to the similar-in-spirit solutions.
There are certainly places where it falls down still. Things like libffi, libc, encryption, hand-rolled assembler, etc as you mention. But since we still capture those stack traces as counting against the proper pid, it's often enough to allow you to dive deeper or see collateral damage. You might need to sort your callgraph a bit differently, but it's certainly possible in sysprof given the flexibility of the callgraph display.
Does profiling individual applications file under "profiling by intuition" for you? Because that is what I would expect developers to go back to if systemwide profiling stops being viable.
I would say yes because you have to be intuitive about which application(s) or libraries matter.
When sysprof is working correctly, you can click record and have a decent understanding of where things are going wrong. That's a tough thing to replicate in a handful of terminals simultaneously displaying information which likely exacerbate desktop workloads on their own.
And is it such a problem to require the handful developers who need to do systemwide profiling to do that, instead of slowing down the production installation for all users?
I think our goal should be to make it so easy that it's not just a handful of people doing system-wide profiling like it is today.
Using sysprof (or similar tool) as the first step in triage makes a lot of sense to me because it gives upstream a way to capture correlating information and visualize it in a useful manner. Despite being the author of the modern incarnation of Sysprof, I'm not against using pretty much anything else that works.
But here we are at an existential choice of what Fedora is. Are we for developers creating the platform(s)? Are we an optimized end-user distribution? If so where do the developers go that need to build these systems? Because it's clear to me that the status quo is often getting in the way for some of us.
On Fri, Jul 8, 2022 at 9:20 PM Christian Hergert chergert@redhat.com wrote:
So it looks like what you folks are doing is actually very similar to what Facebook is doing. That is interesting, and explains why some GNOME developers are jumping on the bandwagon of this Change proposal.
To be fair, we've been complaining about it internally in GNOME for probably around a decade. So it's not a new thing from our standpoint. What is new is that it appears others came to the similar-in-spirit solutions.
Coming at this problem from a different angle, just a hypothetical: Would it be acceptable to add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to the default compiler flags on RHEL? If the answer is yes, then I think we might talk about it for Fedora, as well.
If the answer is no (because people from CERN would come banging at Red Hat's doors, or something like that), then I don't think we should do it in Fedora, either. Because I don't think we should treat Fedora first and foremost as a development target, at the cost of making it less appealing to actual users. All the profiling and performance optimizations that might (or might not) be found by developing on Fedora won't help us if our users move to some other distro that benefitted from those performance optimizations, but of course didn't add those compiler flags, and didn't make packages from their own distro perform worse to enable this kind of work.
Fabio
* Fabio Valentini:
On Fri, Jul 8, 2022 at 9:20 PM Christian Hergert chergert@redhat.com wrote:
So it looks like what you folks are doing is actually very similar to what Facebook is doing. That is interesting, and explains why some GNOME developers are jumping on the bandwagon of this Change proposal.
To be fair, we've been complaining about it internally in GNOME for probably around a decade. So it's not a new thing from our standpoint. What is new is that it appears others came to the similar-in-spirit solutions.
Coming at this problem from a different angle, just a hypothetical: Would it be acceptable to add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to the default compiler flags on RHEL?
It seems unlikely. Given typical hardware replacement cycles, hardware-assisted backtrace without severe limitations will be widely available soon, and it should have even less overhead than frame-pointer traversal.
Thanks, Florian
PS: One more question:
Christian Hergert wrote:
We have an in-tree parser for ELF that allows us to avoid a lot of extraneous code when extracting symbols. Partially because libunwind is incredibly slow (by profiler requirements), and partially because historically we never had to stash stack frames for contextual unwinding.
Frank Ch. Eigler mentions that elfutils has a more modern unwinding library. Could that perhaps solve your performance issues with libunwind?
Kevin Kofler
Frank Ch. Eigler mentions that elfutils has a more modern unwinding library. Could that perhaps solve your performance issues with libunwind?
I don't think so. The problem is two-fold.
First, we have to capture enough of the stack to do offline unwinding. I think the default many people use here is about 8kb of stack. While the instruction pointer array might fit in a couple of cachelines, you now have an additional few pages to copy as well. And you probably want those pages aligned in your capture format. So now you need to interleave multiple types of data frames while padding for alignment.
Now do that a few thousand times a second.
The overhead here can be so great that it obscures what you're trying to find. Furthermore, there's a good chance that you'll cause CPU packages to spin up to a higher frequency, thus hiding the exact performance issues you want to find or reduce to avoid that.
Now, say you've done the work and captured stacks (what has now turned from a few-MB recording into a few-GB recording), you need to decode them. We keep many lookaside-maps/interval-trees in Sysprof to keep this overhead low, but now you have to reference .eh/DWARF data. This is the slowest part of the whole process. What currently takes a second or two could easily take you 10 minutes.
Now I understand not everyone has ADHD like me, but I wont even remember what I was doing 10 minutes later.
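(Back-of-the-envelope, using the figures above: at roughly 4,000 samples per second, an 8 KiB stack copy per sample is about 30 MiB/s of raw capture data, so on the order of a gigabyte for a 30-second system-wide recording, which is consistent with the jump from a few-MB recording to a few-GB recording described above.)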
On 7/8/22 15:29, Christian Hergert wrote:
Frank Ch. Eigler mentions that elfutils has a more modern unwinding library. Could that perhaps solve your performance issues with libunwind?
I don't think so. The problem is two-fold.
First, we have to capture enough of the stack to do offline unwinding. I think the default many people do here is about 8kb of stack. While the instruction pointer array might fit in a couple cachelines, you now have an additional few pages to copy as well. And you probably want those pages aligned in your capture format. So no you need to interleave multiple types of data frames while padding for alignment.
Now do that a few thousand times a second.
The overhead here can be so great that it obscures what you're trying to find. Furthermore, it's a good chance that you'll cause CPU packages to spin up to a higher frequency, thusly hiding the exact performance issues you want to find or reduce to avoid that.
Now, say you've done the work and captured stacks (what has now turned from a few MB recording to a few GB recording) you need to decode them. We keep many lookaside-maps/interval-trees in Sysprof to keep this overhead low, but now you have to reference .eh/DWARF data. This is the slowest part of the whole process. What currently takes a second or two could take you easily 10 minutes.
That is the problem right here: .eh_frame-based unwinding is too slow, so it has to be done offline in userspace. What about instead adding ORC information to userspace? That would be much faster to use.
That is the problem right here: .eh_frame-based unwinding is too slow, so it has to be done offline in userspace. What about instead adding ORC information to userspace? That would be much faster to use.
I'm not familiar with ORC, but there are a few things that initially come to mind in looking towards such a solution.
First, are there any examples of perf being able to reference ORC data coming from user-space or is it currently limited to PERF_CONTEXT_KERNEL? For system-wide profiling, we still require that the kernel can do high-velocity unwinding across address contexts.
My (limited) understanding of ORC is that the result produced by objtool gets you a series of unwind tables, but those tables require further processing by the kernel at boot.
Again, I have limited understanding, but wouldn't something need to be processed as part of spawning and loading executable pages? There are both .orc_unwind and .orc_unwind_ip sections, both of which need to be sorted. I don't know what layer would be responsible for that, or how it adapts to dlopen(), double-mapping pages like libffi, etc... but I'm sure people will have opinions about it.
I don't know if this is limited to generating ORC data from DWARF, but the orc-unwinder documentation also refers to difficulty when dealing with inline assembly. That would perhaps mean that this could end up being a lot of work and still not fix the minor-annoyance of strlen/etc not showing up correctly.
There is also a risk that ORC data cannot represent the ever-increasing optimizations from GCC.
On 7/8/22 20:18, Christian Hergert wrote:
That is the problem right here: .eh_frame-based unwinding is too slow, so it has to be done offline in userspace. What about instead adding ORC information to userspace? That would be much faster to use.
I'm not familiar with ORC, but there are a few things that initially come to mind in looking towards such a solution.
First, are there any examples of perf being able to reference ORC data coming from user-space or is it currently limited to PERF_CONTEXT_KERNEL? For system-wide profiling, we still require that the kernel can do high-velocity unwinding across address contexts.
Why does the unwinding need to happen in the kernel? The kernel can already asynchronously invoke userspace code in the form of signal handlers. Is the problem that it is necessary to collect profiling information in the middle of a system call, where another syscall would see inconsistent (and potentially exploitable) kernel state?
My (limited) understanding of ORC is that the result produced by objtool gets you a series of unwind tables, but those tables require further processing by the kernel at boot.
Again, I have limited understanding, but wouldn't something need to be processed as part of spawning and loading executable pages? There are both .orc_unwind and .orc_unwind_ip sections, both of which need to be sorted. I don't know what layer would be responsible for that, or how it adapts to dlopen(), double-mapping pages like libffi, etc... but I'm sure people will have opinions about it.
Ouch. That is a serious problem for a number of reasons, not least of which is security. Having the kernel parse even more complex untrusted input in C is a horrible idea.
I can think of at least two better options:
1. Wait for Rust support to be merged, and write the unwinder in Rust.
2. Implement the unwinder as an eBPF program.
I strongly prefer the latter approach. I believe the unwinder executes in NMI context, meaning that it must not block and must finish executing in a bounded amount of time. Furthermore, any oops becomes an immediate kernel panic. The eBPF verifier can trivially guarantee that the unwinder satisfies the properties needed here. For security reasons, submitting eBPF programs is a privileged operation, but some programs could be compiled into the kernel and thus considered trusted. Such programs could be used without any special privileges.
The key advantage of this approach is that privileged user-mode profiling tools, such as sysprof, can submit their own eBPF unwinders. This means that the kernel does not need to support whatever unwind info format userspace uses. One could use DWARF, ORC, or any other format one wishes.
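(For context, a sketch of the closest thing that exists today, assuming libbpf and a reasonably recent kernel: a BPF program of type perf_event calling the bpf_get_stack() helper. Note that this still uses the kernel's existing frame-pointer walker for user stacks; the format-aware eBPF unwinder proposed above, consuming DWARF or ORC, would be new work.)

/* BPF side of a sampling profiler attached to a perf event.  Illustrative
 * sketch only; bpf_get_stack(..., BPF_F_USER_STACK) relies on the kernel's
 * frame-pointer-based user unwinder, not a custom DWARF/ORC unwinder.   */
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

#define MAX_STACK_DEPTH 32

SEC("perf_event")
int sample_stack(struct bpf_perf_event_data *ctx)
{
    __u64 ips[MAX_STACK_DEPTH];

    /* Fill ips[] with user-space return addresses of the current task. */
    long len = bpf_get_stack(ctx, ips, sizeof(ips), BPF_F_USER_STACK);

    /* A real collector would push ips[0 .. len/8) into a ring buffer or map. */
    (void)len;
    return 0;
}

char LICENSE[] SEC("license") = "GPL";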
Christian, would this be sufficient for your needs?
Why does the unwinding need to happen in the kernel? The kernel can already asynchronously invoke userspace code in the form of signal handlers. Is the problem that it is necessary to collect profiling information in the middle of a system call, where another syscall would see inconsistent (and potentially exploitable) kernel state?
One of the primary values of system-wide profiling is being able to see how a particular library call might have caused undesirable code paths due to a syscall, and where/what was reached (given a high enough sampling rate).
Does that need to happen in kernel space? I don't know, per se, other than that perf needs to be able to do that work, as it is what gives us the array of instruction pointers back. There was some chatter a number of years ago in perf about how to handle ORC from user-space, and if I'm summing this up correctly, it was basically...
- When sampling in PERF_CONTEXT_KERNEL, stop unwinding at the syscall boundary
- Append stacktrace samples to perf buffer ring
- Upon rescheduling, backtrace a single time into user-space, and expect the consumer to know that N previous samples with matching task-id all have the user-space backtrace.
That's a pretty significant behavior change, and all tools would need surgery to support it. I have no idea if that is palatable to either side of the debate, but it was the one possible direction I saw.
It does have a number of pros, in that you can save a lot of unwinding time on syscall-heavy workloads by doing user-space unwinding once, and from scheduler task queues (so you can take faults), and can avoid the NMI context being the cost-center for accounting. But the cons are significant in that the behavior change is expansive, affects all tooling, and will require ORC data across the platform.
Ouch. That is a serious problem for a number of reasons, not least of which is security. Having the kernel parse even more complex untrusted input in C is a horrible idea.
It might seem that way from the description I gave, but we're just talking about an array of intptr_t or similar. There is no dereferencing or state machines like you have with DWARF. Runtime resolution is also essentially bsearch() on interval arrays. I really don't think it's the sort of thing that requires Rust.
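To make the bsearch() point concrete, a lookup over a sorted interval array is just the following kind of routine (the entry layout is a hypothetical illustration, not the kernel's actual struct orc_entry):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical layout: one unwind entry per instruction-address range,
     * with the table of range start addresses sorted ascending (which is
     * what the boot-time sort of the ORC sections buys the kernel). */
    struct entry {
        int16_t sp_offset;
        int16_t bp_offset;
    };

    /* Return the entry covering ip: the one with the largest start <= ip. */
    static const struct entry *lookup(const uint64_t *ip_starts,
                                      const struct entry *entries,
                                      size_t n, uint64_t ip)
    {
        size_t lo = 0, hi = n;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (ip_starts[mid] <= ip)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo ? &entries[lo - 1] : NULL;   /* NULL: no covering entry */
    }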
As for eBPF, we'd still probably be in NMI context with this route, and would fail if we had to page in ORC tables. So that means we'd either have to take a per-task memory overhead to maintain the mutated form (probably unreasonable) or find a way for that to be done from the task's space when returning from the syscall.
Christian, would this be sufficient for your needs?
I don't think so without significant work. The best case I see here is for perf to support user-space unwinding within the task, be it ORC or DWARF, with that unwinder not necessarily having to be in-tree with the kernel, because we know they won't accept a DWARF unwinder again.
It does pose some questions on what would happen with carefully crafted DWARF data.
On 7/9/22 12:05, Christian Hergert wrote:
Why does the unwinding need to happen in the kernel? The kernel can already asynchronously invoke userspace code in the form of signal handlers. Is the problem that it is necessary to collect profiling information in the middle of a system call, where another syscall would see inconsistent (and potentially exploitable) kernel state?
One of the primary values of system-wide profiling is being able to see how a particular library call might have caused undesirable code paths due to a syscall, and where/what was reached (given a high enough sampling rate).
Does that need to happen in kernel space? I don't know, per se, other than that perf needs to be able to do that work, as it is what gives us the array of instruction pointers back. There was some chatter a number of years ago in perf about how to handle ORC from user-space, and if I'm summing this up correctly, it was basically...
- When sampling in PERF_CONTEXT_KERNEL, stop unwinding at the syscall boundary
- Append stacktrace samples to perf buffer ring
- Upon rescheduling, backtrace a single time into user-space, and expect the consumer to know that N previous samples with matching task-id all have the user-space backtrace.
That's a pretty significant behavior change, and all tools would need surgery to support it. I have no idea if that is palatable to either side of the debate, but it was the one possible direction I saw.
It does have a number of pros, in that you can save a lot of unwinding time on syscall-heavy workloads by doing user-space unwinding once, and from scheduler task queues (so you can take faults), and can avoid the NMI context being the cost-center for accounting. But the cons are significant in that the behavior change is expansive, affects all tooling, and will require ORC data across the platform.
This (or a variant of it) is the only reasonable solution I know of. The current situation is not acceptable, and a system-wide slowdown from -fno-omit-frame-pointer is also not acceptable. A solution like you suggested here will be much more work, but it will also be a much better product.
Ouch. That is a serious problem for a number of reasons, not least of which is security. Having the kernel parse even more complex untrusted input in C is a horrible idea.
It might seem that way from the description I gave, but we're just talking about an array of intptr_t or similar. There is no dereferencing or state machines like you have with DWARF. Runtime resolution is also essentially bsearch() on interval arrays. I really don't think it's the sort of thing that requires Rust.
bsearch() itself assumes that the input is trusted, but it should be possible to have a variant that does not make that assumption. Similarly, it should be possible to ensure that all user pointer accesses are guarded by checks to ensure they will not fault and actually point to userspace memory.
As for eBPF, we'd still probably be in NMI context with this route, and would fail if we had to page in ORC tables. So that means we'd either have to take a per-task memory overhead to maintain the mutated form (probably unreasonable) or find a way for that to be done from the task's space when returning from the syscall.
The latter is definitely the better option. The NMI handler needs to be simpler, not more complex. One option would be to replace the normal syscall return with a return to a userspace trampoline. The trampoline would write the userspace backtrace to a kernel-provided buffer and then jump to the original return address.
Some programs (such as LVM) would need to be able to opt out of such profiling. LVM has critical sections where it is not safe to perform any I/O, as the device that backs the root filesystem might be suspended. Such a program would only be able to participate in unwinding if mlockall() was used.
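A minimal sketch of that opt-in, assuming nothing beyond the standard mlockall() call:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Pin current and future mappings so that walking this process's
         * stack or unwind tables can never trigger I/O via a page fault. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* ...now safe to enter I/O-sensitive critical sections while profiled... */
        return 0;
    }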
Christian, would this be sufficient for your needs?
I don't think so without significant work. The best case I see here is for perf to support user-space unwinding within the task, be it ORC or DWARF, with that unwinder not necessarily having to be in-tree with the kernel, because we know they won't accept a DWARF unwinder again.
Would it be sufficient with that significant work?
It does pose some questions on what would happen with carefully crafted DWARF data.
Processing untrusted DWARF data is probably a bad idea.
I strongly prefer the latter approach. I believe the unwinder executes in NMI context, meaning that it must not block and must finish executing in a bounded amount of time. Furthermore, any oops becomes an immediate kernel panic. The eBPF verifier can trivially guarantee that the unwinder satisfies the properties needed here. For security reasons, submitting eBPF programs is a privileged operation, but some programs could be compiled into the kernel and thus considered trusted. Such programs could be used without any special privileges.
The key advantage of this approach is that privileged user-mode profiling tools, such as sysprof, can submit their own eBPF unwinders. This means that the kernel does not need to support whatever unwind info format userspace uses. One could use DWARF, ORC, or any other format one wishes.
BPF programs do not have access to arbitrary ELF sections AFAIK. Every eBPF unwinder that I've found is implemented via preprocessing the unwind format in userspace and storing that in BPF maps so that it can be accessed from the BPF program.
Effectively, this means that every program that wants to do unwinding in BPF has to do this preprocessing and store all the required information in BPF maps. When you don't know which program you're going to be requesting a stacktrace for, userspace has to provide this information for every program that might run on the system. While this might work for dedicated long-running system profiling daemons, it is not an option for software such as perf or bpftrace, since it would drastically increase their startup time as well as their overall resource usage.
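As a rough illustration of that preprocessing cost (the map layout and row contents are hypothetical), the profiler has to walk every loaded module, convert its unwind data into rows, and push them into a map before any sample can be unwound:

    #include <bpf/bpf.h>     /* bpf_map_update_elem() syscall wrapper */
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical row format; a real tool would derive this from DWARF,
     * .eh_frame or ORC-style data for the module in question. */
    struct unwind_row { int32_t ra_off; int32_t sp_adj; };

    /* Push one module's precomputed rows into the unwinder's BPF map.
     * A system-wide profiler has to repeat this for every DSO of every
     * process it might sample, which is where the startup time and
     * memory cost come from. */
    static int preload_module(int map_fd, const uint64_t *addrs,
                              const struct unwind_row *rows, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (bpf_map_update_elem(map_fd, &addrs[i], &rows[i], BPF_ANY) != 0)
                return -1;
        return 0;
    }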
Cheers,
Daan
* Demi Marie Obenour:
That is the problem right here: .eh_frame-based unwinding is too slow, so it has to be done offline in userspace. What about instead adding ORC information to userspace? That would be much faster to use.
I'm not sure ORC covers all the registers that could be used to store the previous frame pointer. According to the kernel developers, ORC unwind data is 50% larger than DWARF data:
| The ORC data format does have a few downsides compared to DWARF. ORC | unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel) | than DWARF-based eh_frame tables.
(Documentation/x86/orc-unwinder.rst)
We would have to have both in parallel, so this is going to drive up installation sizes somewhat. (The document also contains some concerning statistics for enabling frame pointers in the kernel, by the way.)
It's also not clear that all of the ORC performance tricks can be applied to untrusted userspace ORC that can be unloaded at any time (or if it might not be easier after all to do this on DWARF data).
Thanks, Florian
On Thu, Jul 07, 2022 at 08:13:52PM -0000, Christian Hergert wrote:
Sysprof has modular data collection backends, and not everything requires linking against libunwind.
For those not familiar with Sysprof, or profiling the desktop at large, generally a single program is not the problem. The performance problems often exist across a number of processes. That can be anything from a library used by multiple applications which cumulatively waste resources, IPC across programs, thundering herds when files on disk change, GPU usage, CPU frequency scaling, memory bandwidth, RAPL, etc.
That problem's not unique to the desktop space; it applies to any non-trivial usage of the OS, whether down at the base infrastructure level or over at server applications too. IOW, don't think of 'sysprof' as only a tool for desktop developers; what you describe is broadly applicable to any and all. Profiling on Linux is indeed an exercise in frustration much of the time, and recently I find myself turning to sysprof more than other options for analysing problems around the virt stack.
With regards, Daniel
On 7/7/22 16:13, Christian Hergert wrote:
Sysprof has modular data collection backends, and not everything requires linking against libunwind.
For those not familiar with Sysprof, or profiling the desktop at large, generally a single program is not the problem. The performance problems often exist across a number of processes. That can be anything from a library used by multiple applications which cumulatively waste resources, IPC across programs, thundering herds when files on disk change, GPU usage, CPU frequency scaling, memory bandwidth, RAPL, etc.
So Sysprof has a binary logging format that is straightforward, efficient, and allows us to record many different types of information within a single file. That file format is used by a number of tools in the stack: GLib, Pango, Gtk, Mutter, GNOME Shell, GJS, various libraries, and applications on top of them. It can capture counters, stack traces, file contents, marks, logs, and a multitude of other data frames.
These capture files can also be muxed together at any point.
Some of the modular data collectors require libunwind; many do not. For example, the memprof collector records the backtraces from malloc/free/etc. But the GJS data-collector can use SpiderMonkey's internal APIs to get backtraces from a SIGPROF sigaction. The most used collector, however, is the perf collector, which just reads from a perf fd mmap'd into a ring buffer.
The perf collector doesn't record the whole stack because decoding a 30-second system-wide capture with DWARF etc. is so slow that practically nobody would use it.
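For readers unfamiliar with that collector style, the setup looks roughly like this (a sketch, not Sysprof's actual code; error handling and ring-buffer parsing trimmed). Note that the user-space half of PERF_SAMPLE_CALLCHAIN is exactly the part that relies on frame pointers being present.

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_SOFTWARE;
        attr.config = PERF_COUNT_SW_CPU_CLOCK;
        attr.freq = 1;
        attr.sample_freq = 997;                   /* ~1 kHz sampling */
        /* The kernel fills the user-space callchain by walking frame pointers. */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_CALLCHAIN;
        attr.exclude_kernel = 1;

        /* Sample the calling thread on any CPU. */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0)
            return 1;

        /* One metadata page plus 2^4 data pages for the sample ring buffer. */
        size_t len = (1 + 16) * (size_t)sysconf(_SC_PAGESIZE);
        void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED)
            return 1;

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ...read PERF_RECORD_SAMPLE records (arrays of IPs) from "ring"... */
        return 0;
    }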
The best profiler is the one people will use.
We have an in-tree parser for ELF that allows us to avoid a lot of extraneous code when extracting symbols. Partially because libunwind is incredibly slow (by profiler requirements), and partially because historically we never had to stash stack frames for contextual unwinding.
Could we write a new data collection module that does DWARF unwinding and stashes some 8kb of stack? Sure. Would people use it? Probably not, because again, it's so slow that people will start profiling by intuition again which is probably the worst of all options.
Of course stashing the stack is not a good option. I just don’t think frame pointers are a good solution either. The correct solution (albeit the most difficult one) is to find a way to perform efficient profiling without frame pointers. I do not have the resources to write such a solution, but I am almost certain that Meta does.
Can we write an eBPF kernel module to decode symbols there? Maybe? Can I? Probably not.
Somebody else could, though. And it would not make the people who do not do system-wide profiling pay the price that frame pointers exact. Windows can do profiling without having to use frame pointers. There is no reason that Linux cannot as well.
Personally, I think some libraries should not be compiled with -fno-omit-frame-pointer. However, I think that number is much smaller than the number that should be. Encryption, graphics drivers, etc. all seem like good candidates here to be explicit about performance requirements.
Many encryption libraries will generally not have a frame pointer because much of the actual encryption code is hand-written assembler. glibc string functions do not maintain a frame pointer either.
Michael Catanzaro mcatanzaro@gnome.org writes:
I can point you to documentation for sysprof: https://gitlab.gnome.org/GNOME/sysprof#debugging-symbols which says that every library should be built with -fno-omit-frame-pointer.
Given that sysprof is a userspace program, it's not in a giant rush, so it should be capable of doing full DWARF unwinding. The Fedora copy of sysprof is not linked against libunwind, or more modern libraries like elfutils, but only against glibc's little emergency backtrace() function. If sysprof learned to speak elfutils (as the eu-stack program demonstrates for unwinding), it could also benefit from debuginfod DWARF auto-downloading as needed.
- FChE
Michael Catanzaro wrote:
Problem is that in order to get good profiling results today, you need to rebuild all dependencies with frame pointers enabled. And that is not realistic. Nobody does that.
Actually, the Facebook developers, the ones who are proposing this very Change, claim that they do exactly that.
Kevin Kofler
On 7/6/22 08:20, Jonathan Wakely wrote:
On Wed, 6 Jul 2022 at 15:57, Jeff Law jeffreyalaw@gmail.com wrote:
On 7/6/2022 8:20 AM, Michael Catanzaro wrote:
On Wed, Jul 6 2022 at 08:06:45 AM -0600, Jeff Law jeffreyalaw@gmail.com wrote:
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
But all Fedora users benefit from performance improvements implemented as a result of profiling.
Yes, but, IMHO, you need to find another way to do the profiling you need to get those improvements.
Right. Developers already need to rebuild packages locally. The suggestion that developers of Fedora packages can't improve those packages without making system-wide profiling work on every Fedora user's machine seems like nonsense.
Let's say you profile something and find a performance problem. You tweak the code to improve performance. How do you verify if it improves performance? Do you push the change to rawhide, wait for koji and bodhi to do their thing, then re-profile with the new packages from the updates-testing repo? And if that didn't work, push another change to rawhide, wait for koji and bodhi, and re-profile? Of course not, that would be ridiculous.
You build locally and profile using your locally built packages. So you're already doing rebuilds, and you're already profiling custom builds anyway. So you can add frame pointers back in to your local builds that are used for profiling. It's not essential for frame pointers to be present in the official koji builds for you to do that. It might be a minor convenience because it reduces the number of system libs that you need to rebuild locally, but it's just a convenience, not a necessity.
Maybe we could make it easier to do local mock builds (or copr builds, or koji scratch builds) with changes to RPM_OPT_FLAGS to simplify rebuilding to get frame pointers. But pushing that to every user just so a handful of people don't have a few extra steps is the wrong trade-off.
I mentioned this in the FESCO meeting, but this proposal[1] is designed to make these kind of optflag changes easier to do.
-Tom
[1] https://fedoraproject.org/wiki/Changes/RPMMacrosForBuildFlags
* Michael Catanzaro:
On Wed, Jul 6 2022 at 08:06:45 AM -0600, Jeff Law jeffreyalaw@gmail.com wrote:
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
But all Fedora users benefit from performance improvements implemented as a result of profiling.
I think we have no evidence that you could not get the same results using Fedora's current profiling tools. If GNOME's sysprof does not work with Fedora, fix it or use something else. Do not change how Fedora is built. It's not really going to work anyway because typical workloads spend 5% to 10% of their time in glibc's string functions. Those functions won't have frame pointers without some non-trivial development work (and also an ongoing maintenance cost). If you change compiler flags only, you still won't get accurate backtraces in many cases.
I had some interactions with Red Hat's performance teams over the years, and to my knowledge, the lack of frame pointers has never come up.
Thanks, Florian
On 7/6/2022 1:05 PM, Florian Weimer wrote:
- Michael Catanzaro:
On Wed, Jul 6 2022 at 08:06:45 AM -0600, Jeff Law jeffreyalaw@gmail.com wrote:
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
But all Fedora users benefit from performance improvements implemented as a result of profiling.
I think we have no evidence that you could not get the same results using Fedora's current profiling tools. If GNOME's sysprof does not work with Fedora, fix it or use something else. Do not change how Fedora is built. It's not really going to work anyway because typical workloads spend 5% to 10% of their time in glibc's string functions. Those functions won't have frame pointers without some non-trivial development work (and also an ongoing maintenance cost). If you change compiler flags only, you still won't get accurate backtraces in many cases.
Yea, it's not a complete solution.
I had some interactions with Red Hat's performance teams over the years, and to my knowledge, the lack of frame pointers has never come up.
I had some discussions in this space with Peter Z (IIRC) when he was still with Red Hat. We ran into a brick wall with the kernel's insistence on no DWARF unwinding and the performance impact of keeping frame pointers around. This was the oprofile era IIRC, but the same principles apply.
Jeff
On Wed, Jul 6, 2022 at 3:06 PM Florian Weimer fweimer@redhat.com wrote:
If GNOME's sysprof does not work with Fedora, fix it or use something else. Do not change how Fedora is built.
The result of that attitude is that performance work in the desktop space is happening on GNOME OS images, or in Flatpak runtimes instead of on Fedora. Which is a bit sad for Fedora as a supposedly developer-friendly environment.
On Thu, Jul 7, 2022 at 7:43 AM Matthias Clasen mclasen@redhat.com wrote:
On Wed, Jul 6, 2022 at 3:06 PM Florian Weimer fweimer@redhat.com wrote:
If GNOME's sysprof does not work with Fedora, fix it or use something else. Do not change how Fedora is built.
The result of that attitude is that performance work in the desktop space is happening on GNOME OS images, or in Flatpak runtimes instead of on Fedora. Which is a bit sad for Fedora as a supposedly developer-friendly environment.
I agree, this is a completely unacceptable statement to make. The problem isn't sysprof, the problem is that profiling is garbage on Linux by default. And while most developers may not bother to profile right now, we don't know whether they would if profiling tools *worked*.
-- 真実はいつも一つ!/ Always, there's only one truth!
ngompa13 wrote:
[...] I agree, this is a completely unacceptable statement to make. The problem isn't sysprof, the problem is that profiling is garbage on Linux by default.
That's an overstatement.
And while most developers may not bother to profile right now, we don't know whether they would if profiling tools *worked*.
You said the problem isn't sysprof. Userspace profiling tools can fully unwind with DWARF / .eh_frame if they make the effort. Several do.
- FChE
On 7/7/22 10:46, Frank Ch. Eigler wrote:
ngompa13 wrote:
[...] I agree, this is a completely unacceptable statement to make. The problem isn't sysprof, the problem is that profiling is garbage on Linux by default.
That's an overstatement.
And while most developers may not bother to profile right now, we don't know whether they would if profiling tools *worked*.
You said the problem isn't sysprof. Userspace profiling tools can fully unwind with DWARF / .eh_frame if they make the effort. Several do.
The problem is that doing so has unacceptable performance. I believe perf actually winds up doing so offline because doing so online is too slow. That is why a different solution is needed, one that allows the unwinding to happen efficiently without requiring frame pointers.
* Matthias Clasen:
On Wed, Jul 6, 2022 at 3:06 PM Florian Weimer fweimer@redhat.com wrote:
If GNOME's sysprof does not work with Fedora, fix it or use something else. Do not change how Fedora is built.
The result of that attitude is that performance work in the desktop space is happening on GNOME OS images, or in Flatpak runtimes instead of on Fedora. Which is a bit sad for Fedora as a supposedly developer-friendly environment.
My comment was specifically about sysprof. I've been told that the GNOME developers will not even consider anything else. This means that we need to fix sysprof. If we do that, it will be possible to use GNOME OS for profiling on older CPUs, and hardware-assisted backtraces on newer CPUs on Fedora (at least Skylake and Zen 3, especially once we've got userspace SHSTK support).
Even if this proposal is not accepted, I think we can collaborate on a couple of things:
* Enhance sysprof with LBR and SHSTK support.
* Enable userspace backtrace generation from BPF without frame pointers (possibly by using LBR and SHSTK at first).
* Investigate use of the Systemtap and elfutils unwinders in these tools.
* Speed up decoding of DWARF data structures using the BMI instruction sets (which only operate on scalar registers and should therefore be usable even within the kernel). According to https://lore.kernel.org/all/c54327dc-75c9-db48-f7c1-59f9fcfca26f@suse.cz/ that's a major source of DWARF processing overhead, and I don't think it has to be.
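To give a sense of what that decoding looks like today, here is the classic scalar LEB128 loop that DWARF consumers spend much of their time in (a generic sketch, not taken from any particular unwinder); each input byte contributes seven payload bits, which is the kind of bit manipulation that BMI-style instructions could in principle batch up:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one unsigned LEB128 value, the variable-length integer
     * encoding used throughout DWARF; returns the number of bytes read.
     * No bounds checking, purely to show the shape of the hot loop. */
    static size_t decode_uleb128(const uint8_t *p, uint64_t *out)
    {
        uint64_t value = 0;
        unsigned shift = 0;
        size_t i = 0;

        do {
            value |= (uint64_t)(p[i] & 0x7f) << shift;
            shift += 7;
        } while (p[i++] & 0x80);

        *out = value;
        return i;
    }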
I'll try to get confirmation that it is technically feasible in principle to use SHSTK to get arbitrarily deep backtraces from kernel space for userspace applications.
If we can get SHSTK to work, the value of the DWARF integration and performance work will diminish fairly quickly because most developers will soon have CPUs with fairly deep (32 entry) LBR buffers, SHSTK support, or both.
Thanks, Florian
If we can get SHSTK to work, the value of the DWARF integration and performance work will diminish fairly quickly because most developers will soon have CPUs with fairly deep (32 entry) LBR buffers, SHSTK support, or both.
This seems like a fairly bold assumption. I also want to add that as discussed in the proposal, we want to enable profiling not just on our laptops, but across our entire fleet that's running various generations of hardware. We can't simply replace all of our hardware just to get shadow stack support unfortunately. So we can't rely on new hardware features to get stacktraces.
Of course, if shadow stack support lands upstream, is found to be reliable, and is fully supported by all the hardware running in our fleet, we'd definitely look into using it instead of frame pointers. But it's going to take many years before we can rely on it across all our hardware.
Aside from our use case, I don't think developers are constantly replacing their hardware either. I'd guess that with this approach we'd have many years of developers debugging why they're not getting full stacktraces only to find out their hardware doesn't support shadow stacks.
So to summarize: while we're anxiously waiting for one of the mentioned alternatives to become viable, at the moment we think all of them result in a degraded profiling experience compared to frame pointers, either due to being slow, being a prototype that is not available upstream, or requiring new hardware support.
Cheers,
Daan
My comment was specifically about sysprof. I've been told that the GNOME developers will not even consider anything else. This means that we need to fix sysprof. If we do that, it will be possible to use GNOME OS for profiling on older CPUs, and hardware-assisted backtraces on newer CPUs on Fedora (at least Skylake and Zen 3, especially once we've got userspace SHSTK support).
As the maintainer of sysprof, trust me when I say that I'd be thrilled if GNOME used something else so it's one less thing to maintain. But the conjoining of data really is the value proposition which has kept it alive.
- Enhance sysprof with LBR and SHSTK support.
Agreed that regardless of this proposal, I'd be happy to see support for additional data recording features in Sysprof. I've filed https://gitlab.gnome.org/GNOME/sysprof/-/issues/77 to track it and would appreciate any comments you can leave about how/what that would entail.
* Jeff Law:
If I'm understanding things correctly, the original proposal is trying to make a very special case of profiling work better -- a case that 99.9% of Fedora users do not need or care about. That seems like a particularly bad cost/benefit for this proposal.
It became clear during yesterday's meeting that the actual goal is to enable userspace backtraces that can be analyzed by BPF (in the kernel), so it's not really about profiling. Instead it's about enhancing the capabilities of BPF. Nobody mentioned this explicitly, but I expect one could enhance osquery with this and push out BPF-based behavioral analysis using osquery. There is already some BPF support in osquery:
Process and socket auditing with osquery https://osquery.readthedocs.io/en/latest/deployment/process-auditing/
There really should be a more direct way to reach that goal, without having to rebuild the entire distribution with different compiler flags.
The core issue here is that kernel people boycott both x86 hardware shadow stacks *and* DWARF, which means that the most obvious approaches are not available to us.
Thanks, Florian
On Wed, Jul 6 2022 at 09:26:44 AM -0400, Marek Polacek polacek@redhat.com wrote:
I think you may be underestimating how much even 1% matters.
For Fedora Workstation, the primary concern should be to make sure sysprof works nicely. That's our profiling tool, and it currently doesn't work well at all with Fedora binaries due to lack of frame pointers. The best way to improve the performance of Fedora applications is currently to profile upstream binaries instead, which is awkward and disappointing.
I don't understand the technical details here and I will not take a position on what flags we use, but I'm concerned the goals here seem misplaced. We should surely accept a much bigger performance hit than 1% if it improves developer experience and facilitates profiling, since profiling allows us to make performance improvements that are orders of magnitude larger than 1%. Please find some way to make sysprof work well. Currently that requires frame pointers.
Michael
Marek Polacek wrote:
On Tue, Jul 05, 2022 at 03:47:26PM -0400, Matthias Clasen wrote:
(Un)acceptable for whom?
GCC maintainers in Fedora, at least.
What I do not understand is why a Change that wants to change the default GCC flags is even under discussion at all without the buy-in from the GCC maintainers. This is a GCC Change and as such IMHO GCC maintainers should be the only ones allowed to propose it (or at least, the GCC maintainers' approval ought to be a mandatory precondition for proposing it). Without you GCC maintainers' buy-in, it should just be summarily rejected.
Kevin Kofler
I don’t think we should be gatekeeping who can propose or discuss a Change. I do think that the opinions of the upstream and downstream GCC maintainers should be weighed quite heavily when considering a change to the default compiler flags.
As a FESCo member, I’m waiting to see if the Change owners can present broadly convincing benchmarks to characterize the full performance impact of the proposal. I’m also paying close attention to the (so far mostly negative) input from the GCC team and others that I recognize as experts on toolchains, glibc, and optimization.
– Ben Beasley
On Tue, Jul 5, 2022 at 9:16 PM Matthias Clasen mclasen@redhat.com wrote:
On Thu, Jun 30, 2022 at 4:55 AM Kevin Kofler via devel devel@lists.fedoraproject.org wrote:
Daan De Meyer via devel wrote:
Which shows a smaller than 1% slowdown between the binary built with frame pointers and the binary built without frame pointers.
I am still strongly opposed to degrading performance and size for all users just to help the handful users of poorly-designed profiling tools.
I am coming a bit late to this discussion, but I would like to inject the viewpoint that 'performance' (however defined) isn't the only criterion by which we should judge what Fedora produces. At least for Fedora Workstation, being a useful system for developers, with working debugging and profiling tools, should have some weight too.
And I doubt that you'd be able to notice a 'smaller than 1% slowdown' on your system.
I see one big problem with this argument: it is coloured by the point of view of a (C) developer. While software developers know how to (and are able to, if they need to) recompile libraries to make them better suited for profiling, normal users don't know how to do that (if they even have beefy enough hardware to make this possible for them at all). And if we say this argument is valid, then should we also build all our packages with ASAN / TSAN / etc. instrumentation as well? It would make developing and debugging safety-critical software with AddressSanitizer etc. *much* easier on Fedora, even if it will make everything run "a bit" slower. (Yes, I know that this is a "slippery-slope argument", but I liked the comparison of this proposed Change with using ASAN / TSAN / etc. instrumentation just too much to not include it here.)
I don't want Fedora to become a less good "general-purpose linux distribution" just to make it a better environment for a small share (those who need to do profiling against *system libraries*) of a small share of the target audience (developers). We can probably count the number of people who want to profile their software running on top of Fedora libraries on one (or at most, two) hands, but we have hundreds of thousands of users who would be negatively impacted by this change - even if by a little bit, and of course, those changes add up over time, as well.
Workstation is not the only edition of Fedora, and developers are not the only target audience. Developers who run profiling tools against libraries from Fedora packages are even fewer people. I assume that making binaries bigger (?) and making them run less efficiently would have a bigger impact for some other target audiences, especially the IoT space, where CPUs are less powerful (and have fewer available registers). All editions and spins build on top of the same packages from the Fedora repositories, and making them less fit for purpose for everybody "just" to make sampling-based profiling work better for a few people does not seem like a worthwhile trade-off to me.
Fabio
Fabio Valentini wrote:
And if we say this argument is valid, then should we also build all our packages with ASAN / TSAN / etc. instrumentation, as well?
And ASAN would actually have tangible benefits for end users, namely preventing some memory bug exploits, whereas frame pointers only slow things down and are of no use whatsoever for non-developer users (and even most developers).
Kevin Kofler
Kevin Kofler via devel devel@lists.fedoraproject.org writes:
Fabio Valentini wrote:
And if we say this argument is valid, then should we also build all our packages with ASAN / TSAN / etc. instrumentation, as well?
And ASAN would actually have tangible benefits for end users, namely preventing some memory bug exploits, whereas frame pointers only slow things down and are of no use whatsoever for non-developer users (and even most developers).
Please never run ASAN in production workloads: https://www.openwall.com/lists/oss-security/2016/02/17/9
tl;dr; you'll create a local root exploit.
Cheers,
Dan
Dan Čermák wrote:
Please never run ASAN in production workloads: https://www.openwall.com/lists/oss-security/2016/02/17/9
tl;dr; you'll create a local root exploit.
Oh, the joys of automagically added insecure environment variable handlers… Good to know!
Kevin Kofler
On 05/07/2022 21:15, Matthias Clasen wrote:
And I doubt that you'd be able to notice a 'smaller than 1% slowdown' on your system.
4% slowdown is unacceptable.
At least for Fedora Workstation, being a useful system for developers with working debugging and profiling tools should have some weight too.
Debugging works well on Fedora without this flag.
On 7/6/22 08:30, Vitaly Zaitsev via devel wrote:
On 05/07/2022 21:15, Matthias Clasen wrote:
And I doubt that you'd be able to notice a 'smaller than 1% slowdown' on your system.
4% slowdown is unacceptable.
At least for Fedora Workstation, being a useful system for developers with working debugging and profiling tools should have some weight too.
Debugging works well on Fedora without this flag.
I agree that it works well (at least better than with most competing OSes), but that does not mean it works perfectly or that it's not improvable. I must say that overall debugging capabilities have decreased over the last 20 years. More and more, one needs to rely on "debugging by printf" because gdb and similar tools are confused, notably because local variables are "optimised out" (which I suspect is related to the frame pointer). This is the price to pay for better optimisation and performance, but it leads to more time spent finding the location of a bug.
So yes, it works well, but it could be better... Now, at what cost to the overall community, that is another story.
Theo.
Theodore Papadopoulo wrote:
gdb and similar tools are confused, notably because local variables are "optimised out" (which I suspect is related to the frame pointer).
It is not.
This is related to compiling with any optimization at all, and you definitely do not want production binaries compiled without optimization (with -O0); they would be extremely slow and large.
Local variables are allocated to registers for as long as they are needed, and then evicted (overwritten by other data that needs the register) when the compiler finds they are "dead", i.e., no longer referenced in the rest of the function. Only when no registers are left are the variables "spilled" to the stack (which is the only place where the frame pointer makes a difference, and only in the way the variables are accessed, i.e., they can be accessed relative to %rbp rather than %rsp). And some variables can be optimized out entirely, e.g., by inlining them into the computation of some other variable.
If anything, enabling the frame pointer will make it *more* likely that variables get optimized out because there will be one fewer register to use to hold their contents.
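A tiny example of that effect (hypothetical, for illustration only): built with optimization and debug info, the local below typically lives only in a register and is dead after its last use, so gdb reports it as <optimized out> whether or not %rbp is reserved as a frame pointer.

    /* Built with e.g. gcc -O2 -g: "sum" is normally register-allocated and
     * folded into the return value, so after its last use the debugger has
     * nowhere to read it from, independently of -fno-omit-frame-pointer. */
    int accumulate(const int *v, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += v[i];
        return sum * 2;
    }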
Kevin Kofler
Similarly, for the sysbench RAM test, which was the other test in the phoronix benchmark showing substantial regressions when compiled with frame pointers, we were unable to reproduce the results. Our results are as follows:
https://user-images.githubusercontent.com/9395011/177169145-d19bab77-cd97-44...
While our results also show a difference in performance, the sysbench benchmark is also somewhat noisy as shown by the standard deviation.
Daan De Meyer via devel wrote:
Our results are as follows:
https://user-images.githubusercontent.com/9395011/177169145-d19bab77-cd97-44...
This is a 4% slowdown, on a RAM-bound (not even CPU-bound) benchmark!
I do not see at all how this is even considered to possibly be acceptable, all the more for a feature intended to help improve performance (because that is really the only ultimate aim of improving profiling support, or what else would you want to do profiling for?).
No amount of profiling will allow recovering those 4% lost performance on any real-world software.
Kevin Kofler
* Daan De Meyer via devel:
Instead, we only saw differences from 0%-2% between Redis compiled with frame pointers and Redis compiled without frame pointers. These benchmarks were done using the phoronix-test-suite in exactly the same way as documented in the phoronix article.
Did you actually enable frame pointers everywhere, or did you just use -fno-omit-frame-pointer?
Merely building with -fno-omit-frame-pointer results in incomplete backchain-based backtraces on x86, so I don't see the point of that. You get worse performance, but backtracing without DWARF still doesn't quite work.
Thanks, Florian
The goal was to try and reproduce the phoronix benchmark results so this is without any system dependencies rebuilt with frame pointers, same as the phoronix benchmark.
Cheers,
Daan
* Daan De Meyer via devel:
The goal was to try and reproduce the phoronix benchmark results so this is without any system dependencies rebuilt with frame pointers, same as the phoronix benchmark.
But is this build configuration what you are proposing for Fedora?
Thanks, Florian
The proposed configuration is to add "-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer" to the default compilation flags. Are you alluding to inline assembly that won't have frame pointers set up correctly even with these two options enabled?
Cheers,
Daan
On Tue, Jul 5, 2022 at 1:17 PM Daan De Meyer via devel devel@lists.fedoraproject.org wrote:
The proposed configuration is to add "-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer" to the default compilation flags. Are you alluding to inline assembly that won't have frame pointers set up correctly even with these two options enabled?
No - I think the problem is that adding those flags to the default build configuration will affect the whole system - all executables and shared libraries, not only "leaf" binaries. And that makes your benchmarks (and those run by phoronix) basically useless for what we're discussing here, because they only measure performance impact when compiling the executables but not *the whole world* with those flags, including all shared libraries that are used by executables you're benchmarking.
So applying this change globally would, I assume, add to (or even multiply) the negative effect wrt/ performance, so the effect will likely be (much?) bigger than the few percent that were mentioned in this thread?
Fabio
On Tue, Jul 5 2022 at 01:42:05 PM +0200, Fabio Valentini decathorpe@gmail.com wrote:
No - I think the problem is that adding those flags to the default build configuration will affect the whole system - all executables and shared libraries, not only "leaf" binaries. And that makes your benchmarks (and those run by phoronix) basically useless for what we're discussing here, because they only measure performance impact when compiling the executables but not *the whole world* with those flags, including all shared libraries that are used by executables you're benchmarking.
So applying this change globally would, I assume, add to (or even multiply) the negative effect wrt/ performance, so the effect will likely be (much?) bigger than the few percent that were mentioned in this thread?
This is a good point. I suppose we need further investigation here to understand the true performance impact.
That said, this point cuts both ways. Recompiling the entire distro in order to add frame pointers is a significant effort, and a very high burden to ask of anyone looking to do performance work.
Michael
* Daan De Meyer via devel:
The proposed configuration is to add "-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer" to the default compilation flags. Are you alluding to inline assembly that won't have frame pointers set up correctly even with these two options enabled?
The change proposal does not mention -mno-omit-leaf-frame-pointer, and it's unclear if it has been used for any performance testing so far.
(I finally pushed the glibc changes to inherit those options today, so it should be easier now to get a custom glibc build, but of course the glibc string functions still do not use frame pointers.)
Thanks, Florian
On 7/5/22 06:06, Florian Weimer wrote:
- Daan De Meyer via devel:
Instead, we only saw differences from 0%-2% between Redis compiled with frame pointers and Redis compiled without frame pointers. These benchmarks were done using the phoronix-test-suite in exactly the same way as documented in the phoronix article.
Did you actually enable frame pointers everywhere, or did you just use -fno-omit-frame-pointer?
Merely building with -fno-omit-frame-pointer results in incomplete backchain-based backtraces on x86, so I don't see the point of that. You get worse performance, but backtracing without DWARF still doesn't quite work.
Also, frame pointers do not work at all for some packages, such as LuaJIT. LuaJIT’s assembler code does not maintain a frame pointer, so frame pointer-based unwinding past it simply cannot work. A much better solution is to figure out how to safely unwind the stack in the kernel without frame pointers. SystemTap is apparently able to do this, and eBPF might be an option too with sufficient kernel changes. Another approach would be to change userspace build tools to generate ORC unwind info and have the kernel use that. That said, eBPF is my preferred approach: the verifier should prevent an eBPF-based unwinder from doing anything nasty, which is especially important when processing untrusted unwind information from userspace. Alternatively, once Rust support lands in the Linux kernel, an unwinder could be written in Rust.
I've just updated the proposal with an extended description describing the use cases enabled by frame pointers in more details. More specifically, on top of describing the profiling use case in much more detail, I've also added a section on BPF debugging tooling, such as bcc and bpftrace, which will also benefit from much more reliable stacktraces if this proposal is implemented.
I've also clarified that we'll also add -mno-omit-leaf-frame-pointer to the compiler options. I think this is already implied by -fno-omit-frame-pointer for GCC (maybe a GCC expert can correct me if I'm wrong), but it's better to be explicit.
Finally, I've added a description of shadow stacks to the alternatives section, which are a new hardware feature that might be an option for unwinding in the far future, but at the moment lacks widespread support and isn't available in the kernel so it's not an option just yet.
Cheers,
Daan