DIY Ruby CPU profiling - Part IV

Emil Soman  - July 18, 2015  |   , ,

In Part I we saw how CPU profilers work, in Part II we learned how to measure execution time of programs using CPU time and Wall time and in Part III we learned how to use Ruby’s TracePoint API to trace method calls and returns in Ruby VM. We also started building our DiyProf gem. It’s recommended to read those first before you continue. In this part we’re going to create a simple call graph visualization using the data we collected using the TracePoint API.

Part IV. Call Graphs

In Part III, we used Ruby’s TracePoint API to print method call and return traces. It’d be more useful if we can parse these traces and create a call graph out of it. A call graph is basically a visual representation of the method calls, in the form of a tree. It looks like this :

Call tree

This graph tells us that outer_method calls 3 methods from inside it: inner_method{1,2,3} and inner_method_3 calls another method named inner_most_method.

Graphviz and the DOT language

Graphviz is a graph visualization tool which can take a graph description in text format and generate diagrams as PDF and other formats. For describing graphs in text format, Graphviz uses a language called DOT, which is a simple declarative way of describing nodes and connections between them.

Before proceeding, make sure you’ve installed graphviz and the dot command is available in your $PATH. Follow their official documentation for installation steps.

Now let’s take an example.

digraph graph_name {
  a -> b -> c;
  b -> d;
}

Save the above lines as sample.dot and run the following command:

dot -Tpdf sample.dot -o graph.pdf

The above command will generate a directed graph that looks like this:

Directed graph example

There’s another way to represent graphs, by separating nodes and connections like so:

digraph graph_name {
  <nodes>
  <connections>
}

So the following DOT notation represents the same graph:

digraph graph_name {
  a;
  b;
  c;
  d;
  a -> b;
  b -> c;
  b -> d;
}

It’s also possible to add additional labels to nodes and connections.

digraph graph_name {
  a [label="a \n foo: bar"];
  b [label="b \n foo: baz"]
  a -> b [label="xyz"];
}

Above DOT code represents the following graph:

Graph with labels

Call Graph with Graphviz

We’re going to create a reporter class for our DiyProf gem which will keep collecting all method calls and create a DOT file that represents the call graph.

We’ll add a record method under the DotReporter class which pushes method infos into a stack when calls happen and pop them out of the stack when the methods return. When the method info is popped from the stack, the time it spent in the stack is recorded(which is the execution time) and it adds the method’s call info(method name and time) to a call tree. See the code below:

module DiyProf
  # Struct that will hold a method name and a timestamp
  CallInfo = Struct.new(:name, :time)

  class DotReporter
    def initialize
      # A stack for pushing/popping methods when methods get called/returned
      @call_stack = []
      # Nodes for all methods
      @methods = {}
      # Connections between the nodes
      @calls = {}
    end

    def record(event, method_name, time)
      case event
      when :call
        @call_stack << CallInfo.new(method_name, time)
      when :return
        # Return cannot be the first event in the call stack
        return if @call_stack.empty?

        method = @call_stack.pop
        # Set execution time of method in call info
        method.time = time - method.time
        add_method_to_call_tree(method)
      end
    end
  end
end

add_method_to_call_tree adds nodes and connections for the added method. Here’s the code that does that:

# Struct that holds the time spent in a method
# and the total number of times a method was called
MethodInfo = Struct.new(:count, :total_time, :self_time)

def add_method_to_call_tree(method)
  # Add method as a node to the call graph
  @methods[method.name] ||= MethodInfo.new(0, 0, 0)
  # Update total time(spent inside the method and methods called inside this method)
  @methods[method.name].total_time += method.time
  # Update self time(spent inside the method and not methods called inside this method)
  # This will be subtracted when children are added to the graph
  @methods[method.name].self_time += method.time
  # Update total no of times the method was called
  @methods[method.name].count += 1

  # If the method has a parent in the call stack
  # Add a connection from the parent node to this method
  if parent = @call_stack.last
    @calls[parent.name] ||= {}
    @calls[parent.name][method.name] ||= 0
    @calls[parent.name][method.name] += 1

    # Take away self time of parent
    @methods[parent.name] ||= MethodInfo.new(0, 0, 0)
    @methods[parent.name].self_time -= method.time
  end
end

Now that we have the nodes and connections in @methods and @calls, it’ll be trivial to generate a DOT file that looks like this:

digraph G {
  <graph_nodes>
  <graph_links>
}

Here’s the code that will walk through @methods and @calls and prepare such a DOT file:

def dot_notation
  dot = %Q(
    digraph G {
      #{graph_nodes}
      #{graph_links}
    }
  )
end

def graph_nodes
  nodes = ""
  @methods.each do |name, method_info|
    nodes << "#{name} [label=\"#{name}\\ncalls: #{method_info.count}\\ntotal time: #{method_info.total_time}\\nself time: #{method_info.self_time}\"];\n"
  end
  nodes
end

def graph_links
  links = ""
  @calls.each do |parent, children|
    children.each do |child, count|
      links << "#{parent} -> #{child} [label=\" #{count}\"];\n"
    end
  end
  links
end

Check out the lib/diy_prof/dot_reporter.rb file on the github repo to see how all of this will fit together. We also need some changes in the Tracer class we wrote in Part III to make use of a custom reporter class:

module DiyProf
  class Tracer
    # We'll need to initialize the Tracer with DotReporter.new
    def initialize(reporter)
      @reporter = reporter
      @tracepoints = [:call, :return].collect do |event|
        TracePoint.new(event) do |trace|
          # Use the reporter instead of just printing the traces
          reporter.record(event, trace.method_id, cpu_time)
        end
      end
    end

    ...

    # New method added to collect the report after profiling
    def result
      @reporter.result
    end
  end
end

Now that our tracer is able to generate a DOT formatted report, we can use it on our sample program we wrote in the previous parts:

def main
  3.times do
    find_many_square_roots
    find_many_squares
  end
end

def find_many_square_roots
  5000.times{|i| Math.sqrt(i)}
end

def find_many_squares
  5000.times{|i| i**2 }
end

DiyProf.start_profiling
main
result = DiyProf.stop_profiling

require 'tempfile'
f = Tempfile.open('dot')
f.write result
f.close
system("dot -Tpdf #{f.path} -o call_graph.pdf")

We should get a call graph that looks like this:

Demo

Congratulations if you got this far! Check out the github repo which has the complete code if you got stuck somewhere.

Bonus!

Wouldn’t it be cool if we could vary the size of the nodes based on self or total time of the method? The DOT syntax accepts a bunch of attributes for nodes and edges. Looks like we can use size and fontsize attributes to vary the size of the nodes and the text inside the nodes. I’ll leave it to you to think of how to use these to make the most expensive methods stand out in the call graph.

Recap

We’ve learned about Graphviz and the DOT language and how to use them to create call graph diagrams out of the data collected by Ruby Tracepoint API. In Part V, we’ll start building a sampling profiler for Ruby, in particular we’ll learn some basics about interrupts and signal handling in C. Thanks for reading! If you would like to get updates about subsequent blog posts in this DIY CPU profiler series, do follows us on twitter @codemancershq.

DIY Ruby CPU profiling - Part III

Emil Soman  - April 15, 2015  |   , ,

In Part I we saw how instrumentation profilers work, and in Part II we learned how to measure execution time of programs using CPU time and Wall time. It’s recommended to read those first before you continue. In this part we’ll apply what we have learned so far and build a very basic instrumentation CPU profiler.

Part III. DIY Instrumentation CPU Profiler

Instrumentation in Ruby

In Part I, we learned that instrumentation profiling tools make use of hooks which can either be installed by the profiling tools or provided by the language itself to do their job. Luckily, Ruby already has these hooks baked into it and gives us the nifty TracePoint class to make use of these hooks.

TracePoint API

Ruby VM emits a bunch of events at various points while executing your code. Ruby allows you to enable callbacks for these events using the TracePoint class. The API documentation of TracePoint has a list of all the events you could listen to, but we’re interested in only two of those for now:

  1. :call , which is triggered when a ruby method gets called
  2. :return, which is triggered when a ruby method returns.

We can use the TracePoint.new method to create a tracepoint object and pass it an event type to listen to and a block of code which should get executed when the event gets triggered. Here’s the code to create a tracepoint for :call and :return events:

trace = TracePoint.new(:call, :return) do |trace|
  puts "Method: #{trace.method_id}, event: #{trace.event}"
end

The argument trace which is passed into the callback code block gives you access to a bunch of tracepoint attributes which can tell you more about the event that got triggered. In our example, we’re going to need the method_id attribute which gives us the method name and the event attribute which gives us the name of the event which gets triggered. The callback code block above will print the method name and the type of the triggered event (call/return), every time a Ruby method gets called or returned.

After you create the tracepoint, you need to enable it:

trace.enable

Once the tracepoint is enabled, we can create a method and call it, and see if the callback that we registered gets executed:

def hello
end

hello

#=> Method: hello, event: call
#=> Method: hello, event: return

Great, our callback got executed twice as expected.

diy_prof gem

It’s finally time to start putting together all that we’ve learned and build a RubyGem. We’re going to call this gem diy_prof. The source for this gem is on GitHub and you can find it here.

Now let’s build the gem from scratch using bundler:

bundle gem diy_prof

This creates a skeleton project which we’ll build upon. Next, we’ll create a TimeHelpers module which will contain the methods cpu_time and wall_time which we wrote in Part II:

# lib/diy_prof/time_helpers.rb

module DiyProf::TimeHelpers
  # These methods make use of `clock_gettime` method introduced in Ruby 2.1
  # to measure CPU time and Wall clock time.

  def cpu_time
    Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID, :microsecond)
  end

  def wall_time
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)
  end
end

We also need to require this file in our gem’s main file lib/diy_prof.rb:

# lib/diy_prof.rb

require 'diy_prof/version'
require 'diy_prof/time_helpers'
# Rest of the original file ..

Next, we’ll use the TracePoint API in our gem to print the time (using our TimeHelpers) when a :call or :return event occurs.

# lib/diy_prof/tracer.rb


module DiyProf
  class Tracer
    include TimeHelpers

    def initialize(clock_type: :cpu)
      @tracepoint = TracePoint.new(:call, :return) do |trace|
        time = clock_type == :wall ? wall_time : cpu_time
        printf("%-20s:%-20s%-20s\n", time, trace.event, trace.method_id)
      end
    end
    def enable
      @tracepoint.enable
    end

    def disable
      @tracepoint.disable
    end
  end
end

Also, we’ll require this file from our gem’s main file too:

# lib/diy_prof.rb

require 'diy_prof/version'
require 'diy_prof/time_helpers'
require 'diy_prof/tracer'
# Rest of the original file ..

Now let’s write an example script to test what we’ve built so far. We’ll use the same example we saw in Part I:

# examples/demo.rb

$:<< File.join(File.dirname(__FILE__), "../lib")
require 'diy_prof'

### Begin sample program ###
def main
  3.times do
    find_many_square_roots
    find_many_squares
  end
end

def find_many_square_roots
  5000.times{|i| Math.sqrt(i)}
end

def find_many_squares
  5000.times{|i| i**2 }
end
### End sample program ###

tracer = DiyProf::Tracer.new(clock_type: :cpu)
tracer.enable
main
tracer.disable

If we run the above Ruby program, we should get the following output:

70508               :return              enable
70587               :call                main
70604               :call                find_many_square_roots
73225               :return              find_many_square_roots
73256               :call                find_many_squares
74527               :return              find_many_squares
74562               :call                find_many_square_roots
77042               :return              find_many_square_roots
77064               :call                find_many_squares
77653               :return              find_many_squares
77690               :call                find_many_square_roots
79681               :return              find_many_square_roots
79698               :call                find_many_squares
80288               :return              find_many_squares
80306               :return              main
80326               :call                disable

The first column of the output is the CPU time of the triggered events, the second column is the name of the event that got triggered(call/return) and the third column is the name of the method which got called or returned. This is similar to the output we saw in Part I where we learned how instrumentation profilers work. As you can see, we can easily recreate a call stack using this output because we know which method called which other methods. We also have the time of all call and return events, so we can also tell how much time was spent in each method. It wouldn’t be too difficult to write a script that parses this output and prints a list of methods ordered by their execution them. But that’s too boring, we can build something way more interesting than that. Watch out for the next part to learn what else we can do with this data.

Recap

We’ve learned about Ruby’s TracePoint API and how to use it to listen to method calls and returns. We’ve also taken the code we’ve written so far and put everything together to build a gem which allows you to print the execution trace of your Ruby program. In Part 4 we’ll learn how to make sense of the data that our gem collects and build some cool call graphs out of it. Thanks for reading! If you would like to get updates about subsequent blog posts in this DIY CPU profiler series, do follows us on twitter @codemancershq.

DIY Ruby CPU profiling - Part II

Emil Soman  - March 12, 2015  |   , ,

In Part I we learned what CPU profiling means and also about the two modes of CPU profiling. In this part we’re going to explore CPU time and Wall time, the units used to measure execution cost. We’ll also write some code to get these measurements as the first step towards building our CPU profiler.

Part II. CPU time and Wall time

Wall time

Wall time is just the regular real world time that elapses between a method call and return. So if you were to measure the “Wall clock time” taken for a method to run, it would be theoretically possible to do so with a stopwatch. Just start the stopwatch when the method starts, and stop when the method returns. This is also called real time.

One important point about wall time is that it’s unpredictable and you may get different results every time you try to measure the same piece of code. This is because wall time is affected by the number of processes running in the background. When the CPU has to work on a bunch of processes at the same time, the operating system does a scheduling of the processes which are running at the time and tries to give a fair share of CPU to each of them. This means the total time spent by the CPU is divided into many slices and our method gets only some of these slices and not all of them. So while the wall clock ticks away, our process may be sitting idle and giving way for other processes which are running in parallel. This means the time spent on other processes will add to our Wall time too!

CPU time

CPU time is the time for which the CPU is dedicated to run the method. CPU time is measured in terms of CPU cycles(or ticks) which are used to execute the method. We can convert this into time if we know the frequency of the CPU in units of cycles per second aka Hertz. So if the CPU took x ticks to execute a method, and the frequency of the CPU is y hertz, the time taken by the CPU to execute the method = x/y seconds. Sometimes the OS does this conversion for us so we don’t have to do this calculation by ourselves.

CPU time will not be equal to Wall time. The difference depends on the type of instructions in our method. We can categorize the instructions into broadly 2 types: CPU bound and I/O bound. When I/O instructions are being executed, the CPU becomes idle and can move on to process other CPU bound instructions. So if our method has a time consuming I/O instruction, the CPU stops spending time on our method and moves on to something else until the I/O operation is completed. During this time the Wall time keeps ticking but the CPU time stops and lags behind Wall time.

Let’s say a very slow running method took 5 minutes on your clock to finish running. If you were to ask how much time was spent on the method, your wall clock would say “It took 5 minutes to run this method”, but the CPU would say “I spent 3 minutes of my time on this method”. So who are you going to listen to? Which time more accurately measures the cost of executing the method?

The answer is, it depends™. It depends on the kind of method you want to measure. If the method spends most of its time doing I/O operations, or it doesn’t deal with CPU bound instructions directly, the cost of execution depicted by CPU time is going to be grossly inaccurate. For these types of methods, it makes more sense to use Wall time as the measurement. For all other cases, it’s safe to stick with CPU time.

Measuring CPU time and Wall time

Since we’re going to write a CPU profiler, we’ll need a way to measure CPU time and wall time. Let’s take a look at the code in Ruby’s Benchmark module which already measures CPU time and Wall time.

def measure(label = "") # :yield:
  t0, r0 = Process.times, Process.clock_gettime(BENCHMARK_CLOCK)
  yield
  t1, r1 = Process.times, Process.clock_gettime(BENCHMARK_CLOCK)
  Benchmark::Tms.new(t1.utime  - t0.utime,
                     t1.stime  - t0.stime,
                     t1.cutime - t0.cutime,
                     t1.cstime - t0.cstime,
                     r1 - r0,
                     label)
end

So Ruby uses 2 methods from the Process class to measure time :

  1. times to measure CPU time
  2. clock_gettime to measure real time aka Wall time

But unfortunately the resolution of time returned by the times method is 1 second which means if we use times to measure CPU time in our profiler, we’ll only be able to profile methods that take a few seconds to complete. clock_gettime looks interesting, though.

clock_gettime

Process::clock_gettime is a method added in Ruby 2.1 and it uses POSIX clock_gettime() function and falls back to OS specific emulations to get the value of time in case clock_gettime is not available in the OS or when the type of clock we are measuring with clock_gettime is not implemented in the OS. This function accepts a clock_id and the time resolution as arguments. There are a bunch of clock_ids you could pass in to pick the kind of clock to use, but the ones we’re interested in are:

  1. CLOCK_MONOTONIC: This clock measures the elapsed wall clock time since an arbitrary point in the past and is not affected by changes in the system clock. Perfect for measuring Wall time.
  2. CLOCK_PROCESS_CPUTIME_ID: This clock measures per-process CPU time, ie the time consumed by all threads in the process. We can use this to measure CPU time.

Let’s make use of this and write some code:

module DiyProf
  # These methods make use of `clock_gettime` method introduced in Ruby 2.1
  # to measure CPU time and Wall clock time.

  def self.cpu_time
    Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID, :microsecond)
  end

  def self.wall_time
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)
  end
end

We could use these methods to benchmark code:

puts "****CPU Bound****"
c1, w1 = DiyProf::cpu_time, DiyProf::wall_time
10000.times do |i|
  Math.sqrt(i)
end
c2, w2 = DiyProf::cpu_time, DiyProf::wall_time
puts "CPU time\t=\t#{c2-c1}\nWall time\t=\t#{w2-w1}"

puts "\n****IO Bound****"
require 'tempfile'

c1, w1 = DiyProf::cpu_time, DiyProf::wall_time
1000.times do |i|
  Tempfile.create('file') do |f|
    f.puts(i)
  end
end
c2, w2 = DiyProf::cpu_time, DiyProf::wall_time
puts "CPU time\t=\t#{c2-c1}\nWall time\t=\t#{w2-w1}"

Running this code would give an output similar to this:

****CPU Bound****
CPU time	=	5038
Wall time	=	5142

****IO Bound****
CPU time	=	337898
Wall time	=	475864

This clearly shows that on a single CPU core, CPU time and Wall time are nearly equal when running purely CPU bound instructions whereas CPU time is always less that Wall time when running I/O bound instructions.

Recap

We’ve learned what CPU time and Wall time mean, their differences, and when to use which. We also wrote some Ruby code to measure CPU time and Wall time which would help us measure time in the CPU profiler we’re building. In part 3 we’ll take a look at Ruby’s tracepoint API and use it to build an instrumentation profiler. Thanks for reading! If you would like to get updates about subsequent blog posts in this DIY CPU profiler series, do follows us on twitter @codemancershq.

DIY Ruby CPU profiling - Part I

Emil Soman  - March 6, 2015  |   , ,

At Codemancers, we’re building Rbkit, a fresh code profiler for the Ruby language with tonnes of cool features. I’m currently working on implementing a CPU profiler inside rbkit gem which would help rbkit UI to reconstruct the call graph of the profiled ruby process and draw useful visualizations on the screen. I learned a bunch of new things along the way and I’d love to share them with you in this series of blog posts.

We’re going to start from the fundamentals and step by step, we’re going to write a rudimentary CPU profiler for Ruby ourselves! By the time we finish we would learn :

Part I. An Introduction to CPU Profiling

By doing a CPU profile of your program, you can find out how expensive your program is with respect to CPU usage. In order to profile your program, you’ll need to use a profiling tool and follow the following steps :

  1. Start CPU profiling
  2. Execute the code you want to profile
  3. Stop CPU profiling and get profiling result
  4. Analyze result

By analyzing the profiling result, you can find the bottlenecks which slow down your whole program.

Profiling modes

CPU profiling is done in broadly 2 ways:

1. Instrumentation

In this mode, the profiling tool makes use of some hooks, either provided by the interpreter or inserted into the program, to understand the call graph and measure the execution time of each method in the call graph.

For example, consider the following piece of Ruby code :

def main
  3.times do
    find_many_square_roots
    find_many_squares
  end
end

def find_many_square_roots
  5000.times{|i| Math.sqrt(i)}
end

def find_many_squares
  5000.times{|i| i**2 }
end

main

I’ve inserted some comments to help understand how the hooks would get executed if the Ruby interpreter gave us method call and return hooks:

def main
  # method call hook gets executed
  3.times do
    find_many_square_roots
    find_many_squares
  end
  # method end hook gets executed
end

def find_many_square_roots
  # method call hook gets executed
  5000.times{|i| Math.sqrt(i)}
  # method end hook gets executed
end

def find_many_squares
  # method call hook gets executed
  5000.times{|i| i**2 }
  # method end hook gets executed
end

main

Now if we could print the current time and the name of the current method inside these hooks, we’d get an output which looks somewhat like this :

sec:00 usec:201007	called  	main
sec:00 usec:201108	called  	find_many_square_roots
sec:00 usec:692123	returned	find_many_square_roots
sec:00 usec:692178	called  	find_many_squares
sec:00 usec:846540	returned	find_many_squares
sec:00 usec:846594	called  	find_many_square_roots
sec:01 usec:336166	returned	find_many_square_roots
sec:01 usec:336215	called  	find_many_squares
sec:01 usec:484880	returned	find_many_squares
sec:01 usec:484945	called  	find_many_square_roots
sec:01 usec:959254	returned	find_many_square_roots
sec:01 usec:959315	called  	find_many_squares
sec:02 usec:106474	returned	find_many_squares
sec:02 usec:106526	returned	main

As you can see, this output can tell us how much time was spent inside each method. It also tells us how many times each method was called. This is roughly how instrumentation profiling works.

Pros :

Cons :

2. Sampling

In this mode of profiling, the profiler interrupts the program execution once in every x unit of time, and takes a peek into the call stack and records what it sees(called a “sample”). Once the program finishes running, the profiler collects all the samples and finds out the number of times each method appears across all the samples.

Hard to visualize? Let’s look at the same example code and see how different the output would be, if we used a sampling profiler.

Output from a sampling profiler would look like this :

Call stack at 0.5sec: main/find_many_square_roots
Call stack at 1.0sec: main/find_many_square_roots
Call stack at 1.5sec: main/find_many_square_roots
Call stack at 2.0sec: main/find_many_squares

In this example, the process was interrupted every 0.5 second and the call stack was recorded. Thus we got 4 samples over the lifetime of the program and out of those 4 samples, find_many_square_roots is present in 3 and find_many_squares is present in only one sample. From this sampling, we say that find_many_square_roots took 75% of the CPU where as find_many_squares took only 25%. That’s roughly how sampling profilers work.

Pros :

Cons :

Recap

We just looked into what CPU profiling means and the 2 common strategies of CPU profiling. In part 2, we’ll explore the 2 units of measuring CPU usage - CPU time and Wall time. We’ll also get our hands dirty and write some code to get these measurements. Thanks for reading!