DIY Ruby CPU profiling - Part II

Emil Soman - March 12, 2015

In Part I we learned what CPU profiling means and also about the two modes of CPU profiling. In this part we’re going to explore CPU time and Wall time, the units used to measure execution cost. We’ll also write some code to get these measurements as the first step towards building our CPU profiler.

Part II. CPU time and Wall time

Wall time

Wall time is just the regular real world time that elapses between a method call and return. So if you were to measure the “Wall clock time” taken for a method to run, it would be theoretically possible to do so with a stopwatch. Just start the stopwatch when the method starts, and stop when the method returns. This is also called real time.
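In Ruby, that stopwatch can be as simple as two calls to Time.now. A minimal sketch (slow_method is just a stand-in):

def slow_method
  sleep 1 # stand-in for real work
end

t0 = Time.now
slow_method
elapsed = Time.now - t0 # elapsed real world time, in seconds
puts "Wall time: #{elapsed}s"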

One important point about wall time is that it’s unpredictable: you may get different results every time you measure the same piece of code. This is because wall time is affected by the other processes running in the background. When the CPU has to work on a bunch of processes at the same time, the operating system schedules them and tries to give each one a fair share of the CPU. This means the total time spent by the CPU is divided into many slices, and our method gets only some of these slices, not all of them. So while the wall clock ticks away, our process may be sitting idle, giving way to other processes running in parallel. Time spent on those other processes adds to our Wall time too!

CPU time

CPU time is the time for which the CPU is dedicated to running the method. CPU time is measured in terms of the CPU cycles (or ticks) used to execute the method. We can convert this into time if we know the frequency of the CPU in cycles per second, aka Hertz. So if the CPU took x ticks to execute a method, and the frequency of the CPU is y Hertz, the time taken by the CPU to execute the method is x/y seconds. Sometimes the OS does this conversion for us, so we don’t have to do the calculation ourselves.
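For example, with made-up numbers:

x = 6_000_000_000.0 # CPU cycles (ticks) spent on the method; made-up value
y = 3_000_000_000.0 # CPU frequency in Hertz, i.e. a 3 GHz CPU; made-up value
puts x / y          #=> 2.0 seconds of CPU time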

CPU time will generally not be equal to Wall time. The difference depends on the type of instructions in our method. We can broadly categorize instructions into two types: CPU bound and I/O bound. While I/O instructions are being executed, the CPU becomes idle and can move on to process other CPU bound instructions. So if our method has a time consuming I/O instruction, the CPU stops spending time on our method and moves on to something else until the I/O operation is completed. During this time the Wall time keeps ticking, but the CPU time stops and lags behind Wall time.

Let’s say a very slow running method took 5 minutes on your clock to finish running. If you were to ask how much time was spent on the method, your wall clock would say “It took 5 minutes to run this method”, but the CPU would say “I spent 3 minutes of my time on this method”. So who are you going to listen to? Which time more accurately measures the cost of executing the method?

The answer is, it depends™. It depends on the kind of method you want to measure. If the method spends most of its time doing I/O operations, or it doesn’t deal with CPU bound instructions directly, the cost of execution depicted by CPU time is going to be grossly inaccurate. For these types of methods, it makes more sense to use Wall time as the measurement. For all other cases, it’s safe to stick with CPU time.

Measuring CPU time and Wall time

Since we’re going to write a CPU profiler, we’ll need a way to measure CPU time and wall time. Let’s take a look at the code in Ruby’s Benchmark module which already measures CPU time and Wall time.

def measure(label = "") # :yield:
  t0, r0 = Process.times, Process.clock_gettime(BENCHMARK_CLOCK)
  yield
  t1, r1 = Process.times, Process.clock_gettime(BENCHMARK_CLOCK)
  Benchmark::Tms.new(t1.utime  - t0.utime,
                     t1.stime  - t0.stime,
                     t1.cutime - t0.cutime,
                     t1.cstime - t0.cstime,
                     r1 - r0,
                     label)
end

So Ruby uses two methods from Process to measure time:

  1. times to measure CPU time
  2. clock_gettime to measure real time aka Wall time

But unfortunately, the resolution of the time returned by the times method is 1 second, which means that if we used times to measure CPU time in our profiler, we’d only be able to profile methods that take at least a few seconds to complete.
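For reference, this is the shape of what times returns; the field values below are illustrative:

t = Process.times
#=> #<struct Process::Tms utime=1.0, stime=0.0, cutime=0.0, cstime=0.0>
# utime/stime are the user and system CPU time of this process;
# cutime/cstime are the same measurements for its child processes

clock_gettime looks interesting, though.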

clock_gettime

Process::clock_gettime is a method added in Ruby 2.1. It uses the POSIX clock_gettime() function and falls back to OS specific emulations when clock_gettime is not available in the OS, or when the type of clock we want is not implemented there. The method accepts a clock_id and the unit of the returned time as arguments. There are a bunch of clock_ids you could pass in to pick the kind of clock to use, but the ones we’re interested in are:

  1. CLOCK_MONOTONIC: This clock measures the elapsed wall clock time since an arbitrary point in the past and is not affected by changes in the system clock. Perfect for measuring Wall time.
  2. CLOCK_PROCESS_CPUTIME_ID: This clock measures per-process CPU time, i.e. the time consumed by all threads in the process. We can use this to measure CPU time.

Let’s make use of this and write some code:

module DiyProf
  # These methods make use of `clock_gettime` method introduced in Ruby 2.1
  # to measure CPU time and Wall clock time.

  def self.cpu_time
    Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID, :microsecond)
  end

  def self.wall_time
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)
  end
end

We could use these methods to benchmark code:

puts "****CPU Bound****"
c1, w1 = DiyProf::cpu_time, DiyProf::wall_time
10000.times do |i|
  Math.sqrt(i)
end
c2, w2 = DiyProf::cpu_time, DiyProf::wall_time
puts "CPU time\t=\t#{c2-c1}\nWall time\t=\t#{w2-w1}"

puts "\n****IO Bound****"
require 'tempfile'

c1, w1 = DiyProf::cpu_time, DiyProf::wall_time
1000.times do |i|
  Tempfile.create('file') do |f|
    f.puts(i)
  end
end
c2, w2 = DiyProf::cpu_time, DiyProf::wall_time
puts "CPU time\t=\t#{c2-c1}\nWall time\t=\t#{w2-w1}"

Running this code would give an output similar to this:

****CPU Bound****
CPU time	=	5038
Wall time	=	5142

****IO Bound****
CPU time	=	337898
Wall time	=	475864

This clearly shows that on a single CPU core, CPU time and Wall time are nearly equal when running purely CPU bound instructions, whereas CPU time is always less than Wall time when running I/O bound instructions.

Recap

We’ve learned what CPU time and Wall time mean, their differences, and when to use which. We also wrote some Ruby code to measure CPU time and Wall time, which will help us measure time in the CPU profiler we’re building. In Part III we’ll take a look at Ruby’s TracePoint API and use it to build an instrumentation profiler. Thanks for reading! If you would like to get updates about subsequent blog posts in this DIY CPU profiler series, do follow us on twitter @codemancershq.

DIY Ruby CPU profiling - Part I

Emil Soman - March 6, 2015

At Codemancers, we’re building Rbkit, a fresh code profiler for the Ruby language with tonnes of cool features. I’m currently working on implementing a CPU profiler inside the rbkit gem, which will help the rbkit UI reconstruct the call graph of the profiled Ruby process and draw useful visualizations on the screen. I learned a bunch of new things along the way and I’d love to share them with you in this series of blog posts.

We’re going to start from the fundamentals and, step by step, write a rudimentary CPU profiler for Ruby ourselves!

Part I. An Introduction to CPU Profiling

By doing a CPU profile of your program, you can find out how expensive your program is with respect to CPU usage. In order to profile your program, you’ll need to use a profiling tool and follow these steps:

  1. Start CPU profiling
  2. Execute the code you want to profile
  3. Stop CPU profiling and get profiling result
  4. Analyze result

By analyzing the profiling result, you can find the bottlenecks which slow down your whole program.

Profiling modes

CPU profiling is done broadly in two ways:

1. Instrumentation

In this mode, the profiling tool makes use of some hooks, either provided by the interpreter or inserted into the program, to understand the call graph and measure the execution time of each method in the call graph.

For example, consider the following piece of Ruby code:

def main
  3.times do
    find_many_square_roots
    find_many_squares
  end
end

def find_many_square_roots
  5000.times{|i| Math.sqrt(i)}
end

def find_many_squares
  5000.times{|i| i**2 }
end

main

I’ve inserted some comments to show where the hooks would get executed if the Ruby interpreter gave us method call and return hooks:

def main
  # method call hook gets executed
  3.times do
    find_many_square_roots
    find_many_squares
  end
  # method end hook gets executed
end

def find_many_square_roots
  # method call hook gets executed
  5000.times{|i| Math.sqrt(i)}
  # method end hook gets executed
end

def find_many_squares
  # method call hook gets executed
  5000.times{|i| i**2 }
  # method end hook gets executed
end

main

Now if we could print the current time and the name of the current method inside these hooks, we’d get an output which looks somewhat like this:

sec:00 usec:201007	called  	main
sec:00 usec:201108	called  	find_many_square_roots
sec:00 usec:692123	returned	find_many_square_roots
sec:00 usec:692178	called  	find_many_squares
sec:00 usec:846540	returned	find_many_squares
sec:00 usec:846594	called  	find_many_square_roots
sec:01 usec:336166	returned	find_many_square_roots
sec:01 usec:336215	called  	find_many_squares
sec:01 usec:484880	returned	find_many_squares
sec:01 usec:484945	called  	find_many_square_roots
sec:01 usec:959254	returned	find_many_square_roots
sec:01 usec:959315	called  	find_many_squares
sec:02 usec:106474	returned	find_many_squares
sec:02 usec:106526	returned	main

As you can see, this output can tell us how much time was spent inside each method. It also tells us how many times each method was called. This is roughly how instrumentation profiling works.
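In fact, turning that call/return log into per-method totals needs nothing more than a stack of start times. A minimal sketch (the events array mimics the hook output above):

# Aggregate call/return events into total time per method.
# Each event is [type, method_name, time_in_microseconds].
def total_times(events)
  totals = Hash.new(0)
  stack  = []
  events.each do |type, name, usec|
    if type == :call
      stack.push([name, usec])
    else # :return
      called, started_at = stack.pop
      totals[called] += usec - started_at
    end
  end
  totals
end

events = [
  [:call,   :find_many_square_roots, 201_108],
  [:return, :find_many_square_roots, 692_123],
  [:call,   :find_many_squares,      692_178],
  [:return, :find_many_squares,      846_540]
]
p total_times(events)
#=> {:find_many_square_roots=>491015, :find_many_squares=>154362}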

Pros:

Instrumentation gives exact call counts and per-method timings, since every call and return is recorded.

Cons:

The hooks add overhead to every single method call, which slows the program down and can skew the measurements.

2. Sampling

In this mode of profiling, the profiler interrupts the program execution once every x units of time, takes a peek into the call stack, and records what it sees (called a “sample”). Once the program finishes running, the profiler collects all the samples and counts the number of times each method appears across them.

Hard to visualize? Let’s look at the same example code and see how different the output would be if we used a sampling profiler.

The output from a sampling profiler would look like this:

Call stack at 0.5sec: main/find_many_square_roots
Call stack at 1.0sec: main/find_many_square_roots
Call stack at 1.5sec: main/find_many_square_roots
Call stack at 2.0sec: main/find_many_squares

In this example, the process was interrupted every 0.5 seconds and the call stack was recorded. Thus we got 4 samples over the lifetime of the program, and out of those 4 samples, find_many_square_roots is present in 3 whereas find_many_squares is present in only one. From this sampling, we say that find_many_square_roots took 75% of the CPU whereas find_many_squares took only 25%. That’s roughly how sampling profilers work.
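Aggregating the samples is equally simple; a minimal sketch using the four samples above:

# Count how often each method appears at the top of the sampled
# call stacks and convert the counts into percentages.
samples = [
  "main/find_many_square_roots",
  "main/find_many_square_roots",
  "main/find_many_square_roots",
  "main/find_many_squares"
]

counts = Hash.new(0)
samples.each { |stack| counts[stack.split("/").last] += 1 }

counts.each do |name, count|
  puts "#{name}: #{100.0 * count / samples.size}%"
end
#=> find_many_square_roots: 75.0%
#=> find_many_squares: 25.0%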

Pros:

Sampling adds very little overhead, since the program is only interrupted once every sampling interval.

Cons:

The results are statistical approximations; methods that finish quickly may never show up in any sample.

Recap

We just looked into what CPU profiling means and the two common strategies of CPU profiling. In Part II, we’ll explore the two units of measuring CPU usage, CPU time and Wall time. We’ll also get our hands dirty and write some code to get these measurements. Thanks for reading!

Cake walk: Using bower with rails

Yuva - December 10, 2014

Traditionally, in order to use any javascript library (js lib), Rails users will do either of these two things:

  1. Copy the js lib sources into the app and serve them through the asset pipeline.
  2. Use a gem which wraps the js lib and provides it through the asset pipeline.

The first option is problematic because users have to keep track of changes to the js lib and update it whenever required. The second option delegates this responsibility to the gem author; users just bump up the gem version and, most of the time, assume that the latest gem has the latest js lib. Both approaches have problems, because every time the js lib author improves the lib, either users have to copy the sources again or gem authors have to make a new release.

Of late, creating js libs and distributing them through bower has gained a lot of traction. There are different ways to use bower with Rails. A popular way is to use the bower rails gem. This blog post will not use that gem; instead, it explores sprockets’ inbuilt support for bower.

Sprockets - Bower support

Sprockets has support for bower. It doesn’t do package management on its own, but it can understand the bower package structure and pick up js and css files from a bower package. Let’s go through this simple example:

Setting up bower json file

Packages installed from bower need to be specified in a bower.json file. Run the bower init command at the root of the rails app to generate this file. This file should be checked into version control so that other devs also know about the dependencies.

> bower init
? name: rails app
? version: 0.0.0
? description:
? main file:
? what types of modules does this package expose?:
? keywords:
? authors: Yuva <yuva@codemancers.com>
? license: MIT
? homepage: rails-app.dev
? set currently installed components as dependencies?: No
? add commonly ignored files to ignore list?: No
? would you like to mark this package as private which prevents
  it from being accidentally published to the registry?: Yes

{
  name: 'rails app',
  version: '0.0.0',
  homepage: 'rails-app.dev',
  authors: [
    'Yuva <yuva@codemancers.com>'
  ],
  license: 'MIT',
  private: true
}

? Looks good?: Yes

One of the important things to note here is to mark this package as private so that it’s not published by mistake. The generated bower.json file can be further edited, and unnecessary fields like homepage and authors can be removed.

{
  "name": "rails app",
  "version": "0.0.0",
  "private": true
}

Note: This is a one time process.

Setting up .bowerrc file

Since rails automatically picks up assets from fixed locations, bower can be instructed to install packages into one of these predefined locations. Create a .bowerrc file like this:

{
  "directory": "vendor/assets/javascripts"
}

Since bower brings in third party js libs, it’s recommended to put them under the vendor folder. Note: This is a one time process.

Installing Faker js lib and using it

Use bower install to install the above said lib. Since .bowerrc sets the directory to vendor/assets/javascripts, Faker will be installed under this directory. Use the --save option with bower to update bower.json.

> bower install Faker --save

That’s it! The Faker lib is installed. Add an entry in application.js and use the lib.

//= require Faker

Note: Make sure that it’s just Faker and not Faker.js. Why there is no extension will be explained later in the blog post.

What just happened? How did it work?

The vendor/assets/javascripts folder has a folder called Faker, but that folder does not have any file called Faker.js. Inspecting any page from the rails app, the script tab looks like this:

(Image: Firefox inspector script tab)

Looking at the source code of Faker, there is a file called faker.js under the build/build folder. How did the rails app know the location of this file, even though application.js does not have an explicit path? This is where sprockets’ support for bower kicks in:

The bower.json for Faker has an explicit path to faker.js, which sprockets picks up:

{
  "name": "faker",
  "main": "./build/build/faker.js",
  "version": "2.0.0",
  # ...
}

Bonus: Digging into sprockets source code

First, sprockets populates the asset search paths. When sprockets sees require Faker in application.js, it checks for an extension, and since there is no extension, i.e. no .js, the asset search paths will be populated with 3 paths:

  1. Faker/.bower.json
  2. Faker/bower.json
  3. Faker/component.json (the deprecated bower configuration file)

Gist of the source:

  def search_paths
    paths = [pathname.to_s]

    # optimization: bower.json can only be nested one level deep
    if !path_without_extension.to_s.index('/')
      paths << path_without_extension.join(".bower.json").to_s
      paths << path_without_extension.join("bower.json").to_s
      # DEPRECATED bower configuration file
      paths << path_without_extension.join("component.json").to_s
    end

    paths
  end

Source code is here: populating asset paths

Second, while resolving require Faker, if a bower.json file is found, sprockets parses the json file and fetches the main entry. Gist of the source:

  def resolve(logical_path, options = {})
    args = attributes_for(logical_path).search_paths + [options]
    # try each candidate search path in turn
    @trail.find(*args) do |path|
      pathname = Pathname.new(path)
      if %w( .bower.json bower.json component.json ).include?(pathname.basename.to_s)
        bower = json_decode(pathname.read)
        # yield the file the bower manifest's "main" entry points to
        yield pathname.dirname.join(bower['main'])
      end
    end
  end

Source code is here: resolving bower json

An Introduction to JSON Schema

Kashyap - April 5, 2014

JSON, or JavaScript Object Notation, has become the most widely used serialization and transport mechanism for information across various web services. From its initial conception, the format garnered swift and wide appreciation for being really simple and non-verbose.

Let’s say you want to consume the following JSON object via an API:

{
  "id": 3232,
  "name": "Kashyap",
  "email": "kashyap@example.com",
  "contact": {
    "id": 123,
    "address1": "Shire",
    "zipcode": "LSQ424"
  }
}

Now, let’s assume that you want to ensure that email and contact.zipcode are present in the JSON before consuming this data. If that data is not present, you shouldn’t be using it. The typical way is to check for the presence of those fields, but this whack-a-mole quickly gets tiresome.
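To see why it gets tiresome, here is roughly what the manual checking looks like (a sketch; every consumer ends up rewriting checks like these for every field it cares about):

require 'json'

json_string = '{"id": 3232, "name": "Kashyap"}' # email and contact missing
data = JSON.parse(json_string)

errors = []
errors << "email is missing" unless data["email"]
errors << "contact.zipcode is missing" unless data["contact"] && data["contact"]["zipcode"]
p errors #=> ["email is missing", "contact.zipcode is missing"]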

Similarly, let’s say you are an API provider and you want to let your API users know the basic structure to which the data is going to conform, so that they can automatically test the validity of the data.

If you have ever had to deal with the above two problems, you should be using JSON schemas.

What’s a Schema?

A schema is defined on Wikipedia as a way to define the structure, content, and to some extent, the semantics of XML documents, which is probably the simplest way one could explain it. For every element, or node, in a document, a rule is given to which it needs to conform. Having constraints defined at this level makes it unnecessary to handle the edge cases in the application logic. This is a pretty powerful tool. It was missing from the original JSON specification, but efforts were made to design one later on.

Why do we need a Schema?

If you’re familiar with HTML, the doctype declaration on the first line is a schema declaration. (Specific to HTML 4 and below.)

HTML 4 Transitional DOCTYPE declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

This line declares that the rest of the document conforms to the directives specified at the url http://www.w3.org/TR/html4/loose.dtd. That means if you declare the document as strict, the usage of any unknown elements like <sp></sp> will cause the page to display nothing. In other words, if you make a typo or forget to close a tag somewhere, the page will not get rendered and your users will end up with a blank page.

At first glance, this looks like a pain, and it is, actually. That’s part of the reason why this was abandoned altogether in the newer version of HTML. However, HTML is not really a good use case for a schema. Having a well-defined schema upfront helps in validating user input at the language/protocol level rather than at the application’s implementation level. Let’s see how defining a schema makes it easy to handle user input errors.

JSON Schema

The JSON Schema specification is divided into three parts:

  1. JSON Schema Core: The JSON Schema Core specification is where the terminology for a schema is defined. Technically, this is simply the JSON spec, with the only addition being the definition of a new media type, application/schema+json. Oh! A more important contribution of this document is the $schema keyword, which is used to identify the version of the schema and the location of a resource that defines it. This is analogous to the DOCTYPE declaration in HTML 4.01 and other older HTML versions.

    The versions of the schema track changes in the keywords and the general structure of a schema document. The resource of a schema is usually a webpage which provides a JSON object that defines a specification. Confused? Open up the url http://www.w3.org/TR/html4/loose.dtd in a browser and go through the contents. This is the specification of the HTML 4.01 Transitional (loose) DTD. Tags like ENTITY, ELEMENT and ATTLIST are used to define the accepted elements, entities and attributes for a valid HTML document.

    Similarly, the JSON Schema Core resource URL serves a schema document which defines the superset of constraints for all JSON schemas.

  2. JSON Schema Validation: The JSON Schema Validation specification is the document that defines the valid ways to specify validation constraints. It also defines the set of keywords that can be used to specify validations for a JSON API, for example, multipleOf, maxLength and minLength (we’ll try a couple of these in the sketch right after this list). In the examples that follow, we will be using some of these keywords.

  3. JSON Hyper-Schema: This is another extension of the JSON Schema spec, wherein the hyperlink and hypermedia-related keywords are defined. For example, consider the case of a globally available avatar (or Gravatar). Every Gravatar is composed of three different components:

    1. A Picture ID,
    2. A Link to the picture,
    3. Details of the User (name and email ID).

    When we query the API provided by Gravatar, we get a response typically having this data encoded as JSON. This JSON response will not contain the entire image, but a link to the image. Let’s look at a JSON representation of a fake profile I’ve set up on Gravatar:

    {
      "entry":[{
        "id":"61443191",
        "hash":"756b5a91c931f6177e2ca3f3687298db",
        "requestHash":"756b5a91c931f6177e2ca3f3687298db",
        "profileUrl":"http:\/\/gravatar.com\/jsonguerilla",
        "preferredUsername":"jsonguerilla",
        "thumbnailUrl":"http:\/\/1.gravatar.com\/avatar\/756b5a91c931f6177e2ca3f3687298db",
        "photos":[{
          "value":"http:\/\/1.gravatar.com\/avatar\/756b5a91c931f6177e2ca3f3687298db",
          "type":"thumbnail"
        }],
        "name":{
          "givenName":"JSON",
          "familyName":"Schema",
          "formatted":"JSON Schema Blogpost"
        },
        "displayName":"jsonguerilla",
        "urls":[]
      }]
    }

    In this JSON response, the images are represented by hyperlinks, but they are encoded as strings. Although this example is for a JSON object returned from a server, this is how traditional APIs handle input as well. This is because JSON natively does not provide a way to represent hyperlinks; they are only strings.

    JSON Hyper-Schema attempts to specify a more semantic way of representing hyperlinks and images. It does this by defining keywords (as JSON properties) such as links, rel and href. Note that this specification does not try to redefine these words in general (they are already defined in the HTTP protocol), but it tries to normalize the way those keywords are used in JSON.
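As promised in point 2 above, here is a quick taste of the validation keywords, using the json-schema gem we will meet later in this post (the schema and data here are made up):

require 'json-schema'

schema = {
  "type" => "object",
  "properties" => {
    "age"      => { "type" => "integer", "minimum" => 0 },
    "username" => { "type" => "string", "minLength" => 3, "maxLength" => 20 }
  }
}

JSON::Validator.validate(schema, { "age" => 25, "username" => "kashyap" }) #=> true
JSON::Validator.validate(schema, { "age" => -1, "username" => "ab" })      #=> false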

Drafts

The schema specification is still under development, and the progress can be tracked by comparing the versions, known as “drafts”. Currently, the schema is in its 4th version. Validation keywords can be dropped or added between versions. This article, like many others over the interwebs, refers to the 4th version of the draft.

Usage

Let’s build a basic JSON API that accepts the following data with some constraints:

  1. A post ID. This is a number and is a required parameter.
  2. Some free-form text with an attribute of body. This is a required parameter.
  3. A list of tags with an attribute of ‘tags’. Our paranoid API cannot accept more than 6 tags though. This is a required parameter.
  4. An optional list of hyperlinks with an attribute of ‘references’.

Let’s face it, almost every app you’ve ever written must have had some constraint or the other. We end up repeating the same verification logic every time. Let’s see how we can simplify that.

We will be using Sinatra for building the API. This is the basic structure of our app.rb:

require 'sinatra'
require 'sinatra/json'
require 'json-schema'

post '/' do
end

The Gemfile:

gem 'sinatra'
gem 'sinatra-contrib'
gem 'json-schema'

We will be using the JSON-Schema gem for the app. Let’s look at the schema that we will define in a schema.json file:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": [ "id", "body", "tags" ],
  "properties": {
    "id": {
      "type": "integer"
    },

    "body": {
      "type": "string"
    },

    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "maxItems": 6
    },

    "references": {
      "type": "array",
      "items": {
        "type": "string",
        "format": "uri"
      }
    }
  }
}

  1. The properties attribute holds the main chunk of the schema definition. This is the attribute under which each individual API attribute is described, in the form of a schema of its own.
  2. The required attribute takes in a list of strings naming the API parameters that are required. If any of these parameters is missing from the JSON input to our app, an error will be logged and the input won’t be considered valid.
  3. The type keyword specifies the schema type for that particular block. So, at the first level, we say it’s an object (analogous to a Ruby Hash). For the body, tags and references, the types are string, array and array respectively.
  4. In case an API parameter accepts an array, the items inside that array can be described by a schema definition of their own. This is done by using an items attribute and defining how each of the items in the array should be validated.
  5. The format attribute selects one of the built-in validation formats in the JSON Schema specification. This alleviates the pain of writing regexes for validating common items like uri, ipv4, ipv6, email, date-time and hostname. That’s right, no more copy-pasting URI validation regexes from StackOverflow.
  6. The $schema attribute is a non-mandatory attribute that specifies the version of the schema being used. For our example, we will be using draft 4 of the JSON Schema spec.

To use this schema in our app, we will create a helper method that validates the input against the schema we just defined. The json-schema gem provides three methods for validation: a validate method that returns either true or false, a validate! method that raises an exception when validation of an attribute fails, and a fully_validate method that builds up an array of errors, similar to the errors list a Rails ActiveRecord model gives you after a failed save.
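The three variants side by side (a sketch, assuming the schema.json from above is on disk and the input is missing its required fields):

require 'json-schema'

data = { "body" => "Hello, Universe" } # "id" and "tags" are missing

JSON::Validator.validate('schema.json', data)
#=> false

JSON::Validator.fully_validate('schema.json', data)
#=> ["The property '#/' did not contain a required property of 'id' ...", ...]

begin
  JSON::Validator.validate!('schema.json', data)
rescue JSON::Schema::ValidationError => e
  puts e.message
end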

We will be using the JSON::Validator.fully_validate method in our app and return a nicely formatted JSON response to the user if the validation fails.

helpers do
  def validate(json_string_or_hash)
    JSON::Validator.fully_validate('schema.json', json_string_or_hash)
  end
end

Now, we can use this helper inside routes to check the validity of the input JSON like so:

post '/' do
  input = JSON.load(request.body.read)
  errors = validate(input)

  if errors.empty?
    json({ message: "The blog post has been saved!" })
  else
    status 400
    json({ errors: errors })
  end
end

If the input is valid, the errors object will be empty. Otherwise, it will hold a list of errors. This object will be returned as a JSON response with the appropriate HTTP status code. For example, if we run this app and send in a request with a missing id parameter, the response will be something similar to the following:

[
  "The property '#/' did not contain a required property of 'id' in
  schema schema.json#"
]

Let’s say we send in a request with id as a string. The errors object will hold the following:

[
  "The property '#/id' of type String did not match the following type:
  integer in schema schema.json#"
]

One last example: let’s try sending a references parameter with a malformed URI. We will send the following request:

{
  "id": 1,
  "body": "Hello, Universe",
  "tags": ["start", "first"],
  "references": [ "data:image/svg+xml;base64 C==" ]
}

(This input is in the file not_working_wrong_uri.txt)

curl \
  -d @not_working_wrong_uri.txt \
  -H 'Content-Type: application/json' \
  http://localhost:4567

The output of this would be:

[
  "The property '#/references/0' must be a valid URI in schema
  schema.json#"
]

Thus, with a really simple validation library and a standard that library implementers in different languages follow, we can achieve input validation with a really simple setup. One great advantage of following a schema standard is that we can rely on the same basic behaviour no matter which language implements the schema. For example, we can use the same schema.json description with a JavaScript library to validate user input in the front-end of the API we’ve just built.

Summary

The full app and some sample input files are present in this repo. The json-schema gem is not yet official and might have some unfinished components; for example, the format validations of hostname and email for a string type have not been implemented yet, and the JSON Schema specification itself is under constant revision. But that doesn’t mean it’s not ready for usage. A few of our developers use the gem in one of our projects and are pretty happy with it. Try out the gem and go through the specification to get an idea of why this would be beneficial.

More Reading

  1. Understanding JSON Schema
  2. JSON Schema Documentation
  3. This excellent article by David Walsh
  4. JSON Schema Example: This example uses more keywords that weren’t discussed in this post. For example, title and description.

Form object validations in Rails 4

Yuva - March 22, 2014

Of late, at Codemancers, we have been using form objects to decouple forms in views. This also helps in cleaning up how the data filled in by the end user is consumed and persisted in the backend. So far, the results have been good.

What are form objects

This blog post assumes that you are already familiar with form objects. Railscasts has a nice screencast about form objects. Do check it out if you haven’t already.

Use case

Let’s say there is an organization which has several employees. We’re tasked with building a Rails app that provides an interface where an admin can select one or more employees and send them emails. A typical implementation of that interface might look like this:

(Image: Employee email form)

After selecting employees and filling in the subject and body, clicking “Send” should make the backend send emails to the selected employees. This is done by passing the array of selected employee ids, the subject and the body to the backend. The POST parameters for that request look like this:

{
  "utf8"=>"",
  "email_form"=>{"employee_ids"=>[""], "subject"=>"", "body"=>""},
  "commit"=>"Send emails to employees"
}

Mass mailer form

We will create an EmployeeMassMailerForm to encapsulate the validations and perform the actual action of sending the emails. This form should accept the params sent by the view, perform validations, like checking whether all the employee ids belong to the organization, and then send the emails.

class Organization < ActiveRecord::Base
  def get_employees(ids)
    employees.where(id: ids)
  end
end

class EmployeeMassMailerForm
  include ActiveModel::Model

  attr_accessor :organization, :employee_ids, :subject, :body

  validates :organization, :employee_ids, :subject, :body, presence: true
  validate  :employee_ids_should_belong_to_organization

  def perform
    return false unless valid?

    @employees = organization.get_employees(employee_ids)
    @employees.each { |e| schedule_email_for(e) }
    true
  end

  private

  def employee_ids_should_belong_to_organization
    if organization.get_employees(employee_ids).length != employee_ids.length
      errors.add(:employee_ids, :invalid)
    end
  end

  def schedule_email_for(e)
    Mailer.send_email(e, subject, body).deliver
  end
end

With Rails 4, ActiveModel ships with the Model module, which helps in assigning attributes, just like you can with an ActiveRecord class, along with helpers for validations. It is no longer necessary to use other libraries for form objects; just include ActiveModel::Model in a PORO class and you are good to go.
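As the smallest possible illustration of what ActiveModel::Model buys you, here is a standalone sketch (the class and fields are made up):

require 'active_model'

class ContactForm
  include ActiveModel::Model

  attr_accessor :email, :message
  validates :email, :message, presence: true
end

form = ContactForm.new(email: 'a@b.com') # hash-based attribute assignment for free
form.valid?                              #=> false
form.errors.full_messages                #=> ["Message can't be blank"]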

Testing using rspec and shoulda

All the form objects can be broken down into 2 main sections:

  1. Validations
  2. Performing actions

Testing validations

Adding validations to forms and models is pretty straightforward. Except for database-related validations like uniqueness, all the ActiveRecord validations can be used on form objects. These validations also make it easy to display validation errors in the view.

At Codemancers, we mostly use rspec and shoulda for testing. Validations on forms can be tested like this:

describe EmployeeMassMailerForm do
  describe 'Validations' do
    it { should validate_presence_of(:organization) }
    it { should validate_presence_of(:employee_ids) }
    it { should validate_presence_of(:subject)      }
    it { should validate_presence_of(:body)         }

    context 'when employee ids belong to organization' do
      it 'validates form successfully' do
        employee_ids = [1, 2]
        organization = mock_model(Organization, get_employees: employee_ids)

        form = described_class.new(organization: organization, subject: 'Test',
                                   employee_ids: employee_ids, body: 'Test')
        expect(form).to be_valid
      end
    end

    context 'when one or more employee ids do not belong to the organization' do
      it 'fails to validate the form' do
        organization = mock_model(Organization, get_employees: [])

        form = described_class.new(organization: organization, subject: 'Test',
                                   employee_ids: [1, 2, 3], body: 'Test')
        expect(form).to be_invalid
      end
    end
  end
end

You can notice here that while validating employee ids, we use stubs and mock models so that the tests never hit the database. Testing a valid form is a bit hard, because one has to heavily stub and mock models until the form becomes valid. But testing an invalid form is easy, and often easier to maintain. Notice that in the invalid case we do not care what get_employees returns; we hard coded it to an empty array, whose length (0) will never match the number of employee ids. Always try to put as many validations as possible on the form object, so that as few exceptions as possible are raised while performing actions.

Testing actions performed by form

Once all the validations pass, the form object will go ahead and perform the action it is supposed to do. It can be anything from sending emails to persisting objects to the database. Let’s see how we can test the action performed by the above form.

describe EmployeeMassMailerForm do
  describe '#perform' do
    let(:organization) do
      employees = [stub(email: 'a@b.com'), stub(email: 'b@c.com')]
      mock_model(Organization, get_employees: employees)
    end

    let(:form) do
      described_class.new(organization: organization, subject: 'Test',
                          employee_ids: [1, 2], body: 'Test')
    end

    before(:each) do
      described_class.any_instance.should_receive(:valid?).and_return(true)
      Mailer.deliveries.clear
    end

    it 'sends emails to all employees' do
      form.perform
      expect(Mailer.deliveries.length).to eq 2
    end

    it 'returns true' do
      expect(form.perform).to be_true
    end
  end
end

The trick here is to hard-code valid? to return true in the before block. Since we have already tested the validations, we can safely stub the return value of valid? to be true. This saves a bunch of db calls and mocks.

I hope you enjoyed this article. If you want to keep up to date with the latest stuff we are building or blogging about, follow us on twitter @codemancershq.