Reputation: 2530
I have been using oprofile to try to discover why my program was spending so much time in the kernel. I now have the symbols from the kernel, but apparently no links between my program and the kernel that would tell me which parts of my program are responsible for all that time.
samples  %        image name                app name                  symbol name
-------------------------------------------------------------------------------
  201     0.8911  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  _raw_spin_lock_irq
  746     3.3073  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  rb_get_reader_page
 5000    22.1671  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  default_spin_lock_flags
16575    73.4838  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  _raw_spin_lock
22469    11.1862  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  __ticket_spin_lock
22469    99.6010  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  __ticket_spin_lock [self]
   26     0.1153  vmlinux-3.0.0-30-generic  vmlinux-3.0.0-30-generic  ret_from_intr
Where do I go from here? How do I discover the places in my program that are causing __ticket_spin_lock?
Upvotes: 3
Views: 2177
Reputation: 886
I agree with Mike's answer: a callgraph is not the right way to inspect the source of the problem. What you really want is to look at the callchains of the hottest samples.
If you don't want to inspect the raw samples collected by oprofile "by hand", you could rerun your application under the record command of perf with the -g option in order to collect the stack traces. You can then display the samples annotated with their callchains using the report command of perf. Since perf does not aggregate the callchains of the individual samples into a global callgraph, you avoid some of the issues outlined in Mike's post.
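For example, a sketch of that workflow (`./myapp` is a placeholder for your program; seeing kernel symbols may require root or a relaxed perf_event_paranoid setting):

```shell
# Record samples with call chains; -g asks perf to capture the stack
# at each sample. ./myapp is a placeholder for your program.
perf record -g ./myapp

# Browse the hottest symbols; each entry can be expanded into the
# individual callchains that led to it.
perf report
```

Call-graph recording generally needs frame pointers or debug info in the binary to unwind the stacks accurately.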
Upvotes: 2
Reputation: 40679
Oprofile takes stack samples. What you need to do is not look at summaries of them, but actually examine the raw samples. If you are spending, say, 30% of time in the kernel, and you look at 10 stack samples chosen at random, you can expect about 3 of them, more or less, to show you the full chain of calls by which you got into the kernel.
That way you will see things the summaries or the call graph won't show you.
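One low-tech way to collect such raw samples by hand (a sketch of the random-pausing idea, assuming gdb is available and `$PID` is a placeholder for your program's process id):

```shell
# Interrupt the running process a handful of times and dump every
# thread's full stack; $PID is a placeholder for the target process.
for i in $(seq 1 10); do
    gdb -p "$PID" -batch -ex "thread apply all bt" 2>/dev/null
    sleep 1
done
```

Reading through those ten stacks by eye is usually enough to see what the process is really waiting on.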
IN CASE IT ISN'T CLEAR: since __ticket_spin_lock is on the stack 99.6% of the time, each stack sample you look at has a 99.6% chance of showing you how you got into that routine.
Then, if it turns out you don't really need to be doing that, you have possibly a 250x speedup.
That's like going from four minutes down to one second. Screw the "correct" or "automated" approach - get the results.
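The 250x figure is just the usual speedup arithmetic: if a fraction f of the total time disappears, the program runs 1/(1-f) times faster. A quick sketch:

```python
def speedup(fraction_removed):
    # Removing a fraction f of the running time leaves (1 - f) of it,
    # so the program runs 1 / (1 - f) times faster.
    return 1.0 / (1.0 - fraction_removed)

print(round(speedup(0.996)))   # 99.6% of time in __ticket_spin_lock -> ~250x
print(240 / speedup(0.996))    # four minutes (240 s) shrinks to about a second
```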
ADDED: The thing about profilers is they are popular and some have very nice UIs, but sadly, I'm afraid, it's a case of "the emperor's new clothes". If such a tool doesn't find much to fix, you're going to like it, because it says (probably falsely) that your code, as written, is near-optimal.
There are lots of postings recommending this or that profiler, but I can't point to any claim of saving more than some percent of time, like 40%, using a profiler. Maybe there are some.
I have never heard of a profiler being used first to get a speedup, and then being used again to get a second speedup, and so on. That's how you get real speedup - multiple optimizations. Something that was just a small performance problem at the beginning is no longer small after you've removed a larger one. This picture shows how, by removing six problems, the speedup is nearly three orders of magnitude. You can't necessarily do that, but isn't it worth trying?
APOLOGIES for further editing. I just wanted to show how easy it is to fool a call graph.
The red lines represent call stack samples. Here A1 spends all its time calling C2, and A2 spends all its time calling C1. Then suppose you keep the same behavior, but you put in a "dispatch" routine B.
Now the call graph loses the information that A1 spends all its time in C2, and A2 in C1.
You can easily extend this example to multiple levels.
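A minimal sketch of that loss of information, using made-up stack samples and a toy edge-count "call graph" (the names A1, A2, B, C1, C2 are from the example above):

```python
from collections import Counter

def call_graph(stack_samples):
    # Aggregate caller -> callee edge counts over all samples,
    # the way a summarizing call-graph view does.
    edges = Counter()
    for stack in stack_samples:
        edges.update(zip(stack, stack[1:]))
    return edges

# Direct calls: the pairing A1->C2 / A2->C1 is plain in the edges.
direct = [("A1", "C2")] * 50 + [("A2", "C1")] * 50
# Same behavior, but routed through a dispatch routine B.
dispatched = [("A1", "B", "C2")] * 50 + [("A2", "B", "C1")] * 50

print(call_graph(direct))      # A1->C2 and A2->C1 each counted 50 times
print(call_graph(dispatched))  # only A1->B, A2->B, B->C2, B->C1 survive;
                               # the graph can no longer say where A1's time went
```

The raw samples still hold the full chains, which is why examining them directly recovers what the aggregated graph threw away.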
You can say a call tree would have seen that.
Well, here's how you can fool a call tree. A spends all its time in calls to C.
Now if instead A calls B1, B2, ... Bn, and those call C, the "hot path" from A to C is broken up into pieces, so the relationship between A and C is hidden.
There are many other perfectly ordinary programming practices that will confuse these tools, especially when the samples are 10-30 levels deep and the functions are all little, but the relationships cannot hide from a programmer carefully examining a moderate number of samples.
Upvotes: 5