How to Use JSON Schema to Validate JSON Documents in Python | by Lynn Kwong | Jul, 2022

Learn a standard way to ensure your data quality

A JSON document can contain any number of key/value pairs. The key must be a string, but the value can be any supported type, such as a string, number, or boolean. The value can even be a complex type like an array or a nested object. This makes JSON documents very flexible but also very unstructured, which complicates data processing, because data teams often get their data through APIs whose responses are normally in JSON format. Having a consistent data format makes data pipelines more robust. With uniform input data, you don’t need to worry about unexpected data types or spend too much time on data cleansing. You can thus focus more on data analysis and work more efficiently.

In this post, we will introduce how to use JSON schema to validate JSON documents. The essential concepts, as well as basic and advanced use cases, will be introduced with simple code snippets that are easy to follow.

What is JSON schema?

A JSON Schema is a JSON document defining the schema of some JSON data. Well, honestly, this explanation is pretty strange and elusive but will get much clearer once we see the code later. For now, we need to understand two points:

  • A JSON schema itself is a valid JSON document with key/value pairs. Each key has a special meaning and is used to define the schema of some JSON data.
  • A schema is similar to the table definition in a SQL database and defines the data types of the fields in a JSON. It also defines which fields are required and which are optional.

Let’s get started with a simple JSON schema:
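As a minimal sketch consistent with the description below, such a schema could be written as a Python dictionary. The second property, age, and its number type are assumptions made for illustration; the text only states that there are two properties and that name is required.

```python
import json

# A minimal schema sketch: an object with two properties, of which only "name" is required.
# The "age" property and its "number" type are assumptions made for illustration.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name"],
}

# A JSON schema is itself a valid JSON document, so it serializes like any other.
schema_as_json = json.dumps(schema, indent=2)
print(schema_as_json)
```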

This JSON schema specifies that the target JSON is an object with two properties (which are also commonly referred to as keys/fields and will be used accordingly when appropriate), and the name property is required. Let’s dive a bit deeper into each validation keyword:

  • The type keyword specifies that the target JSON is an object. It can also be an array, which is normally an array of objects for API responses. We will see how to define the schema of an array field later. However, the top-level type is almost always object.
  • The properties keyword specifies the schema for each field of the JSON object. Each field of the target JSON is specified as a key/value pair, with the key being the actual field name and the value being a schema for that field, which at a minimum states its type. The type keyword for each field has the same meaning as the top-level one. As you may imagine, the type here can also be object. In this case, the corresponding field would be a nested object, as will be demonstrated later.
  • The required keyword is an array containing the properties that are required to be present. If any property specified here is missing, a ValidationError will be raised.

Besides the essential validation keywords, namely type, properties, and required specified above, there are other schema keywords that can be seen in online documentation and also in the JSON schemas generated automatically by some tools.

There are two schema keywords, namely $schema and $id. $schema declares the “draft” that the schema is written against. If $schema is not specified, the jsonschema library used in this post falls back to the latest draft it supports, which is normally desired. You may get lost easily if you dive too much into the drafts as a beginner. We normally don’t need to touch the $schema field and will introduce a bit of it at the end of this post. On the other hand, $id defines a URI for the schema, which makes the current schema accessible externally by other schemas. If $id is not specified, then the current schema can only be used locally, which is also normally desired, at least for small projects. However, for bigger projects, your institution may have an in-house system for how to store the schemas and how to reference them. In this case, you can set the $id keyword accordingly.

There are two annotation keywords, namely title and description, which specify the title and description for the JSON schema, respectively. They can be used for documentation and can make your schema easier to read and understand. They will also be displayed nicely by some graphical tools. For simplicity, they will not be specified in this post, but you should normally add them to your project for best practice.
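To give a rough idea of where these keywords sit, the sketch below adds $schema, $id, title, and description to a small schema. The $id URI, the title, and the description text are hypothetical placeholders, and the property names are assumed for illustration. It uses the jsonschema library, which is introduced in the next section.

```python
from jsonschema import validate

person_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/schemas/person.json",  # hypothetical URI
    "title": "Person",  # hypothetical title
    "description": "A person with a required name and an optional age.",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name"],
}

# title and description are annotations only and do not change the validation result.
# validate() returns None when the instance conforms to the schema.
result = validate(instance={"name": "John", "age": 30}, schema=person_schema)
```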

How to validate a JSON document against a schema in Python?

In Python, we can use the jsonschema library to validate a JSON instance (also referred to as a JSON document when that is unambiguous) against a schema. It can be installed with pip by running pip install jsonschema.

Let’s validate some JSON instances against the JSON schema defined above. Note that technically JSON is a string, but the validator expects the underlying Python data (dictionaries, lists, strings, numbers, and so on), which is what json.loads() returns and is more convenient to work with anyway.
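A minimal sketch of such a validation run is shown below. The schema repeats the simple example with an assumed age property, and is_valid is a hypothetical helper written for this illustration, not part of the jsonschema library.

```python
import json

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},  # assumed second property
    },
    "required": ["name"],
}


def is_valid(instance):
    """Hypothetical helper: True if the instance conforms to the schema."""
    try:
        validate(instance=instance, schema=schema)
        return True
    except ValidationError as err:
        print(f"Invalid: {err.message}")
        return False


# A raw JSON string must be parsed into Python data before validation.
parsed = json.loads('{"name": "John", "age": 30}')

print(is_valid(parsed))                           # conforms to the schema
print(is_valid({"name": "John", "age": "30"}))    # age has the wrong type
print(is_valid({"age": 30}))                      # required name is missing
print(is_valid({"name": "John", "city": "Oslo"}))  # extra fields are allowed by default
```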

It shows that the schema defined can be used to validate the JSON instances as expected. Incorrect data types or missing required fields trigger a ValidationError. However, it should be noted that by default additional fields are allowed, which may or may not be what you want. If you want a strict schema that only allows fields defined by the properties keyword, you can set additionalProperties to False (false in raw JSON):
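A sketch of a strict schema under the same assumptions as before (the age property and the is_valid helper are illustrative, not taken from the original snippet):

```python
from jsonschema import ValidationError, validate

strict_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},  # assumed second property
    },
    "required": ["name"],
    # Reject any field that is not listed under "properties".
    "additionalProperties": False,
}


def is_valid(instance):
    """Hypothetical helper: True if the instance conforms to strict_schema."""
    try:
        validate(instance=instance, schema=strict_schema)
        return True
    except ValidationError as err:
        print(f"Invalid: {err.message}")
        return False


print(is_valid({"name": "John", "age": 30}))       # only known fields
print(is_valid({"name": "John", "city": "Oslo"}))  # "city" is not defined in the schema
```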

How to define the schema for an array field?

Even though it’s not so common to have an array as the top-level field, it’s very common to have it as a property. Let’s add an array property to our schema defined above. We need to set the type to be array and specify the type for each item with the items keyword:
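As a sketch, the array property could be called scores, the name used later in this post. The number item type and the is_valid helper are assumptions for illustration.

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "scores": {
            "type": "array",
            # Every element of the array must match this subschema.
            "items": {"type": "number"},
        },
    },
    "required": ["name"],
}


def is_valid(instance):
    """Hypothetical helper: True if the instance conforms to the schema."""
    try:
        validate(instance=instance, schema=schema)
        return True
    except ValidationError as err:
        print(f"Invalid: {err.message}")
        return False


print(is_valid({"name": "John", "scores": [90, 85.5]}))  # all items are numbers
print(is_valid({"name": "John", "scores": [90, "85"]}))  # one item is a string
print(is_valid({"name": "John", "scores": []}))          # an empty array
```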

As we see, the type of the array elements is checked correctly. However, empty arrays are allowed by default. To change this behavior, we can set minItems to 1, or to whatever minimum makes sense for your case.
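A short sketch of the minItems constraint, again assuming a scores array of numbers and a hypothetical is_valid helper:

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "scores": {
            "type": "array",
            "items": {"type": "number"},
            "minItems": 1,  # reject empty arrays
        },
    },
}


def is_valid(instance):
    """Hypothetical helper: True if the instance conforms to the schema."""
    try:
        validate(instance=instance, schema=schema)
        return True
    except ValidationError as err:
        print(f"Invalid: {err.message}")
        return False


print(is_valid({"scores": [90]}))  # one item satisfies minItems
print(is_valid({"scores": []}))    # an empty array violates minItems
```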

How to define the schema for a nested object field?

As mentioned above, the type keyword of a property has the same meaning and syntax as the top-level one. Therefore, if the type of a property is object, then this property is a nested object. Let’s add an address property to our JSON data which will be a nested object:
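A sketch of a nested address object follows. The fields inside address (street and city) and which of them is required are assumptions for illustration, as is the is_valid helper.

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {
            # The nested object uses the same keywords as the top level.
            "type": "object",
            "properties": {
                "street": {"type": "string"},  # assumed field
                "city": {"type": "string"},    # assumed field
            },
            "required": ["city"],
        },
    },
    "required": ["name"],
}


def is_valid(instance):
    """Hypothetical helper: True if the instance conforms to the schema."""
    try:
        validate(instance=instance, schema=schema)
        return True
    except ValidationError as err:
        print(f"Invalid: {err.message}")
        return False


print(is_valid({"name": "John", "address": {"street": "Main St 1", "city": "Oslo"}}))
print(is_valid({"name": "John", "address": {"street": "Main St 1"}}))  # city is missing
print(is_valid({"name": "John", "address": "Main St 1, Oslo"}))        # not an object
```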

As we see, the nested object field has exactly the same schema definition syntax as the top-level one. Therefore, it’s fairly straightforward to define the schemas for nested objects.

Use $defs to avoid code duplication.

What if the address field needs to be used at multiple places in the same schema? If we copy the field definition wherever it’s needed, there would be code repetition, which programmers dislike because it’s not DRY. In a JSON schema, we can use the $defs keyword to define small subschemas which can be referenced at other places with the $ref keyword to avoid code duplication. Let’s refactor our schema above with $defs to potentially avoid code duplication:
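A sketch of such a refactoring is shown below. To make the reuse visible, it references the same subschema from two properties, home_address and work_address, which are hypothetical names, as are the fields inside the address subschema and the is_valid helper.

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # Both properties point at the same subschema instead of repeating it.
        "home_address": {"$ref": "#/$defs/address"},
        "work_address": {"$ref": "#/$defs/address"},
    },
    "required": ["name"],
    "$defs": {
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},  # assumed field
                "city": {"type": "string"},    # assumed field
            },
            "required": ["city"],
        },
    },
}


def is_valid(instance):
    """Hypothetical helper: True if the instance conforms to the schema."""
    try:
        validate(instance=instance, schema=schema)
        return True
    except ValidationError as err:
        print(f"Invalid: {err.message}")
        return False


print(is_valid({"name": "John", "home_address": {"city": "Oslo"}}))
print(is_valid({"name": "John", "work_address": {"street": "Main St 1"}}))  # city missing
print(is_valid({"name": "John", "home_address": {"city": 123}}))            # wrong type
```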

As we see, the new schema using $defs to define a subschema works in the same way as before. However, it has the advantage that code duplication can be avoided if the address field needs to be used at different places of the same schema.

How to set the schema for a tuple field?

Finally, what if we want the scores field to be a tuple with a fixed number of elements? Unfortunately, there is no tuple type in JSON schema, so we need to define a tuple by means of an array. The general logic is that an array has items (items) and optionally has some positionally defined items that come before the normal items (prefixItems). For a tuple, we define only prefixItems and set items to false so that no extra elements are allowed, which caps the number of elements (shorter arrays are still accepted unless minItems is also set). And importantly, the type for each tuple element must be defined explicitly.

If you want to define the schema for a tuple field, you need some knowledge of drafts in JSON schema, which is a bit more advanced. A draft is a standard or specification for the JSON schema and defines how the schema should be parsed by a validator. There are several drafts available and the latest one is 2020-12. A list of drafts is available on the official JSON Schema website.

Normally, we don’t need to worry about the $schema field and the draft to be used. However, when we need to define a tuple field, it is something that we should pay attention to.

If the jsonschema library installed is the latest version (v4.9.0 at the time of writing), then the latest draft (2020-12) will be used. If this is the version that you want, you don’t need to specify the draft by the $schema keyword. However, it’s seen as a good practice to always specify the version of the draft in your JSON schema for clarity. It’s omitted at the beginning of this post for simplicity so you won’t get overwhelmed, but it’s recommended to have it in practice.

On the other hand, if you want to use a different draft version rather than the latest one, you would need to specify the $schema keyword with the draft version explicitly. Otherwise, it won’t work properly.

Let’s define the schema for the scores field with drafts 2020-12 and 2019-09, respectively, and demonstrate how to use the $schema keyword and how to define a tuple field accordingly:

As we see, the schema definition for the tuple field with draft 2020-12 is more intuitive using the prefixItems and items keywords and is therefore recommended. For a more detailed explanation of the changes from 2019-09 to 2020-12 regarding the tuple field definition, please check this release note.

Besides, it should be noted that even if we want the scores field to be a tuple, it must be passed to the validator as an array (list in Python) rather than a Python tuple. Otherwise, validation fails, because by default the jsonschema library only treats Python lists as JSON arrays.
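A small sketch of this behavior, using a deliberately minimal schema and a hypothetical is_valid helper:

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {"scores": {"type": "array"}},
}


def is_valid(instance):
    """Hypothetical helper: True if the instance conforms to the schema."""
    try:
        validate(instance=instance, schema=schema)
        return True
    except ValidationError as err:
        print(f"Invalid: {err.message}")
        return False


scores = (90, 85, 77)  # a Python tuple

list_ok = is_valid({"scores": list(scores)})  # converted to a list first
tuple_ok = is_valid({"scores": scores})       # the tuple is not seen as an array

print(list_ok, tuple_ok)
```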

In this post, we have introduced what JSON schema is and how to use it to validate different data types in a JSON document. We have covered the fundamentals for basic data types like strings and numbers, as well as complex ones like arrays and nested objects. Besides, we have learned how to avoid code duplication with the $defs keyword, which is used to define subschemas and can be handy for complex schemas. Last but not least, we have introduced the basics of drafts and now know how to define the schema of a tuple field with different drafts.
