An Introduction to JSON Schema

Kashyap  - April 5, 2014  |   , ,

JSON, or JavaScript Object Notation has become the most widely used serialization and transport mechanism for information across various web-services. From it’s initial conception, the format garnered swift and wide appreciation for being really simple and non-verbose.

Lets say you want to consume the following JSON object via an API:

{
  id: 3232,
  name: "Kashyap",
  email: "kashyap@example.com"
  contact: {
    id: 123,
    address1: "Shire",
    zipcode: LSQ424
  }
}

Now, let’s assume that you want to ensure that before consuming this data, email and contact.zipcode must be present in the JSON. If that data is not present, you shouldn’t be using it. The typical way is to check for presence of those fields but this whack-a-mole quickly gets tiresome.

Similarly, lets say you are an API provider and you want to let your API users know the basic structure to which data is going to conform to, so that your API users can automatically test validity of data.

If you ever had to deal with above two problems, you should be using JSON schemas.

What’s a Schema?

A schema is defined in Wikipedia as a way to define the structure, content, and to some extent, the semantics of XML documents; which probably is the simplest way one could explain it. For every element — or node — in a document, a rule is given to which it needs to conform. Having constraints defined at this level will make it unnecessary to handle the edge cases in the application logic. This is a pretty powerful tool. This was missing from the original JSON specification but efforts were made to design one later on.

Why do we need a Schema?

If you’re familiar with HTML, the doctype declaration on the first line is a schema declaration. (Specific to HTML 4 and below.)

HTML 4 Transitional DOCTYPE declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

This line declares that the rest of the document conforms to the directives specified at the url http://www.w3.org/TR/html4/loose.dtd. That means, if you declare the document as strict, then the usage of any new elements like <sp></sp> will cause the page to display nothing. In other words, if you make a typo or forget to close a tag somewhere, then the page will not get rendered and your users will end up with a blank page.

At first glance, this looks like a pain — and it is, actually. That’s part of the reason why this was abandoned altogether in the newer version of HTML. However, HTML is not really a good use case for a schema. Having a well-defined schema upfront helps in validating user input at the language/protocol level than at the application’s implementation level. Let’s see how defining a schema makes it easy to handle user input errors.

JSON Schema

The JSON Schema specification is divided into three parts:

  1. JSON Schema Core: The JSON Schema Core specification is where the terminology for a schema is defined. Technically, this is simply the JSON spec with the only addition being definition of a new media type of application/schema+json. Oh! a more important contribution of this document is the $schema keyword which is used to identify the version of the schema and the location of a resource that defines a schema. This is analogous to the DOCTYPE declaration in the HTML 4.01 and other older HTML versions.

    The versions of the schema separate changes in the keywords and the general structure of a schema document. The resource of a schema is usually a webpage which provides a JSON object that defines a specification. Confused? Go open up the url http://www.w3.org/TR/html4/loose.dtd which I’m linking to here in a browser and go through the contents. This is the specification of HTML 4.01 Loose API. Tags like ENTITY, ELEMENT, ATTLIST are used to define the accepted elements, entities and attributes for a valid HTML document.

    Similarly, the JSON Schema Core resource URL (downloads the schema document) defines a superset of constraints.

  2. JSON Schema Validation: The JSON Schema Validation specification is the document that defines the valid ways to define validation constraints. This document also defines a set of keywords that can be used to specify validations for a JSON API. For example, keywords like multipleOf, maxLength, minLength etc. are defined in this specification. In the examples that follow, we will be using some of these keywords.

  3. JSON Hyper-Schema: This is another extension of the JSON Schema spec, where-in, the hyperlink and hypermedia-related keywords are defined. For example, consider the case of a globally available avatar (or, Gravatar). Every Gravatar is composed of three different components:

    1. A Picture ID,
    2. A Link to the picture,
    3. Details of the User (name and email ID).

    When we query the API provided by Gravatar, we get a reponse typically having this data encoded as JSON. This JSON response will not download the entire image but will have a link to the image. Let’s look at a JSON representation of a fake profile I’ve setup on Gravatar:

    {
      "entry":[{
        "id":"61443191",
        "hash":"756b5a91c931f6177e2ca3f3687298db",
        "requestHash":"756b5a91c931f6177e2ca3f3687298db",
        "profileUrl":"http:\/\/gravatar.com\/jsonguerilla",
        "preferredUsername":"jsonguerilla",
        "thumbnailUrl":"http:\/\/1.gravatar.com\/avatar\/756b5a91c931f6177e2ca3f3687298db",
        "photos":[{
          "value":"http:\/\/1.gravatar.com\/avatar\/756b5a91c931f6177e2ca3f3687298db",
          "type":"thumbnail"
        }],
        "name":{
          "givenName":"JSON",
          "familyName":"Schema",
          "formatted":"JSON Schema Blogpost"
        },
        "displayName":"jsonguerilla",
        "urls":[]
      }]
    }

    In this JSON response, the images are represented by hyperlinks but they are encoded as strings. Although this example is for a JSON object returned from a server, this is how traditional APIs handle input as well. This is due to the fact that JSON natively does not provide a way to handle hyperlinks; they are only Strings.

    JSON hyperschema attempts to specify a way to have a more semantic way of representing hyperlinks and images. It does this by defining keywords (as JSON properties) such as links, rel, href. Note that this specification does not try to re-define these words in general (as they are defined in HTTP protocol already) but it tries to normalize the way those keywords are used in JSON.

Drafts

The schema is still under development and the progress can be tracked by comparing the versions known as “drafts”. Currently, the schema is in the 4th version. The validation keywords can be dropped or added between versions. This article — and many more over the interwebs — refer to the 4th version of the draft.

Usage

Let’s build a basic JSON API that accepts the following data with some constraints:

  1. A post ID. This is a number and is a required parameter.
  2. Some free-form text with an attribute of body. This is a required parameter.
  3. A list of tags with an attribute of ‘tags’. Our paranoid API cannot accept more than 6 tags though. This is a required parameter.
  4. An optional list of hyperlinks with an attribute of ‘references’

Let’s face it, almost every app you might’ve ever written must’ve had some or the other constraints. We end up repeating the same verification logic everytime. Let’s see how we can simplify that.

We will be using Sinatra for building the API. This is the basic structure of our app.rb:

require 'sinatra'
require 'sinatra/json'
require 'json-schema'

post '/' do
end

The Gemfile:

gem 'sinatra'
gem 'sinatra-contrib'
gem 'json-schema'

We will be using the JSON-Schema gem for the app. Let’s look at the schema that we will define in a schema.json file:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": [ "id", "body", "tags" ],
  "properties": {
    "id": {
      "type": "integer"
    },

    "body": {
      "type": "string"
    },

    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "maxItems": 6
    },

    "references": {
      "type": "array",
      "items": {
        "type": "string",
        "format": "uri"
      }
    }
  }
}
  1. The properties attribute holds the main chunk of the schema definition. This is the attribute under which each of the individual API attribute is explained in the form of a schema of it’s own.
  2. The required attribute takes in a list of strings that mention which of the API parameters are required. If any of these parameters is missing from the JSON input to our app, an error will be logged and the input won’t get validated.
  3. The type keyword specifies the schema type for that particular block. So, at the first level, we say it’s an object (analogous to a Ruby Hash). For the body, tags and references, the types are string, array and array respectively.
  4. In case an API parameter can accept an array, the items inside that array can be explained by a schema definition of their own. This is done by using an items attribute and defining how each of the item in the array should be validated.
  5. The format attribute is a built-in format for validation in the JSON Schema specification. This alleviates the pain of adding regex for validating common items like uri, ip4, ip6, email, date-time and hostname. That’s right, no more copy-pasting URI validation regexes from StackOverflow.
  6. The $schema attribute is a non-mandatory attribute that specifies the type of the schema being used. For our example, we will be using the draft#4 of the JSON Schema spec.

To use this schema in our app, we will create a helper method that uses validates the input with the schema we just defined. The json-schema gem provides three methods for validation — a validate method that returns either true or false, a validate! that raises an exception when validation of an attribute fails and a fully_validate method that builds up an array of errors similar to what Rails’ ActiveRecord#save method provides.

We will be using the JSON::Validator.fully_validate method in our app and return a nicely formatted JSON response to the user if the validation fails.

helpers do
  def validate(json_string_or_hash)
    JSON::Validator.fully_validate('schema.json', json_string_or_hash)
  end
end

Now, we can use this helper inside routes to check the validity of the input JSON like so:

post '/' do
  input = JSON.load(request.body.read)
  errors = validate(input)

  if errors.empty?
    json({ message: "The blog post has been saved!" })
  else
    status 400
    json({ errors: a })
  end
end

If the input is valid, the errors object will be empty. Otherwise, it will hold a list of errors. This object will be returned as a JSON response with the appropriate HTTP status code. For example, if we run this app and send in a request with a missing id parameter, the response will be something similar to the following:

[
  "The property '#/' did not contain a required property of 'id' in
  schema schema.json#"
]

Let’s say if we send in a request with id having a string parameter. The errors object will hold the following:

[
  "The property '#/id' of type String did not match the following type:
  integer in schema schema.json#"
]

Last example. Let’s try sending a references parameter with a malformed URI. We will send the following request:

{
  "id": 1,
  "body": "Hello, Universe",
  "tags": ["start", "first"],
  "references": [ "data:image/svg+xml;base64 C==" ]
}

(This input is in the file not_working_wrong_uri.txt)

curl \
  -d @not_working_wrong_uri.txt
  -H 'Content-Type: application/json' \
  http://localhost:4567

The output of this would be:

[
  "The property '#/references/0' must be a valid URI in schema
  schema.json#"
]

Thus, with a really simple validation library and a standard that library implementers in different languages use, we can achieve input validation with a really simple setup. One really great advantage of following a schema standard is that we can be sure about the basic implementation no matter what the language which might implment the schema. For example, we can use the same schema.json description with a JavaScript library for validating the user input — for example, in the front-end of the API we’ve just built.

Summary

The full app, some sample input files are present in this repo. The json-schema gem is not yet official and might have some unfinished components — For example, the format validations of hostname and email for a string type have not been implemented yet — and the JSON Schema specification itself is under constant revisions. But that doesn’t mean it’s not ready for usage. Few of our developers use the gem in one of our projects and are pretty happy with it. Try out the gem and go through the specfication to gain an idea of why this would be beneficial yourself.

More Reading

  1. Understanding JSON Schema
  2. JSON Schema Documentation
  3. This excellent article by David Walsh
  4. JSON Schema Example: This example uses more keywords that weren’t discussed in this post. For example, title and description.