Why read this article?
Tone:
Consider a typical web app consisting of a front-end, an API, and a back-end. If you wanted to add a new input field to an existing page in your app and store that field in the database, how many places in the codebase would you have to change? How many different systems would have to be re-deployed? How many times would you have to re-type the name of that new variable and specify its type?
Is there another way?
What if you could define the schema of the data in a single place, and any change to the schema would percolate throughout the system? That's the promise of unifying the schema.
In this article, we will define data schemas broadly and consider their usage in software systems. Next we will talk about an emerging architecture that "flattens the stack" and show examples of how that architecture enables the schema to be defined once and then used throughout the system. Finally, we'll discuss some of the trade-offs of this architecture and describe some ongoing research to make it feasible.
When you hear the word "schema", what do you first think of? Probably a database schema. With a database, we define the shape of the data that will be stored in the database. For relational databases, this is usually done with SQL, and we define a table at a time, like so:
CREATE TABLE Contact (
    Name VARCHAR(255),
    Email VARCHAR(255),
    Color VARCHAR(50), -- Color in hexadecimal, e.g., #FF00CC
    Emoji VARCHAR(8) CHARACTER SET utf8mb4 -- A single valid emoji
);
Rather than thinking of schema as simply a specification for the database, consider this expanded definition:
A schema is an explicit, formal definition of the shape of some data: the names and types of the fields and what it means for each field to contain valid data of that type.
When we consider schema as defining the shape of data, we see that schema shows up in many places in software systems. In fact, schema shows up anywhere there is data, whether stored at rest or passing between two systems. Types, APIs, database schemas, even function calls and user interfaces are rife with schema. Schema is everywhere.
A common architectural pattern is to place an API layer between the database and other systems that want access to the database. Whether the approach is REST or GraphQL, the general pattern is the same: there are named endpoints, and each endpoint has fields with types and constraints on them. The API layer may also validate the fields against parts of the schema that go beyond their types.
example?
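As a minimal sketch of schema living in the API layer, the TypeScript below checks the shape of an untrusted request body before it would reach the database. The names `ContactPayload` and `parseContactPayload` are hypothetical, and the field names are assumed to mirror the Contact table above; a real REST or GraphQL framework would supply its own mechanism for this.

```typescript
// Hypothetical request body for a "create contact" endpoint.
// The field names are assumed to mirror the Contact table.
interface ContactPayload {
  name: string;
  email: string;
  color: string;
  emoji: string;
}

type ParseResult =
  | { ok: true; value: ContactPayload }
  | { ok: false; errors: string[] };

const CONTACT_FIELDS = ["name", "email", "color", "emoji"] as const;

// Checks the shape of an untrusted body: every field must be present
// and must be a string. Constraints beyond the type are not checked here.
function parseContactPayload(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, errors: ["body must be a JSON object"] };
  }
  const record = body as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of CONTACT_FIELDS) {
    if (typeof record[field] !== "string") {
      errors.push(`${field} must be a string`);
    }
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    value: {
      name: record["name"] as string,
      email: record["email"] as string,
      color: record["color"] as string,
      emoji: record["emoji"] as string,
    },
  };
}
```

Note that the four field names now appear in the interface, in the list of fields to check, and in the returned value: three repetitions inside a single layer.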
In statically typed languages like TypeScript or Java, declaring the type of a variable describes its schema.
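For illustration, here is how the Contact table might be restated as a TypeScript type. The lowercase field names and the sample values are assumptions made for this sketch.

```typescript
// The Contact table restated as a TypeScript type (illustrative only).
interface Contact {
  name: string;
  email: string;
  color: string; // hexadecimal color, e.g. "#FF00CC"
  emoji: string; // intended to hold a single emoji
}

const sampleContact: Contact = {
  name: "Ada",
  email: "ada@example.com",
  color: "#FF00CC",
  emoji: "🎉",
};

// The compiler rejects values that do not match the declared shape, e.g.:
// const broken: Contact = { name: "Ada" }; // error: email, color, emoji missing
```

The type captures the names and primitive types of the fields, but the constraints in the comments (a hexadecimal color, a single emoji) remain outside what the compiler checks.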
The arguments and return types of a function also express schema. Frequently, a set of values expresses an implicit aggregate type that is not formally represented elsewhere.
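The sketch below contrasts the two cases: a function whose loose parameters together imply a contact, and the same function with the aggregate named explicitly. Both function names are hypothetical.

```typescript
// Implicit schema: four loose parameters that together describe a contact,
// although no type says so.
function formatContactLoose(
  contactName: string,
  email: string,
  color: string,
  emoji: string,
): string {
  return `${emoji} ${contactName} <${email}> (${color})`;
}

// Explicit schema: the same aggregate, named once and reusable elsewhere.
interface Contact {
  name: string;
  email: string;
  color: string;
  emoji: string;
}

function formatContact(contact: Contact): string {
  return formatContactLoose(contact.name, contact.email, contact.color, contact.emoji);
}
```

With the loose version, swapping two string arguments by mistake still type-checks; the named aggregate makes that class of error harder to write.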
The user interface is also riddled with schema, from the display of information to forms used for inputting and updating information.
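To make the point concrete, here is a rough sketch of a contact form described as data and rendered to HTML. The `FormField` shape, the choice of which fields are required, and the function names are all assumptions for illustration; the labels are trusted constants, so no HTML escaping is done.

```typescript
// Hypothetical description of one form input.
interface FormField {
  name: string;
  label: string;
  inputType: "text" | "email" | "color";
  required: boolean;
}

// The Contact fields again, this time as form metadata.
const contactFormFields: FormField[] = [
  { name: "name", label: "Name", inputType: "text", required: true },
  { name: "email", label: "Email", inputType: "email", required: true },
  { name: "color", label: "Color", inputType: "color", required: false },
  { name: "emoji", label: "Emoji", inputType: "text", required: false },
];

// Renders a single labeled input element as an HTML string.
function renderField(field: FormField): string {
  const required = field.required ? " required" : "";
  return `<label>${field.label} <input name="${field.name}" type="${field.inputType}"${required}></label>`;
}

// Renders the whole form, one field per line.
function renderContactForm(): string {
  return `<form>\n${contactFormFields.map(renderField).join("\n")}\n</form>`;
}
```

The input types and `required` attributes restate, in yet another notation, constraints that the database and API layers also need to know.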
It should come as no surprise that we discover schema littered throughout software systems. Software is about data: displaying, transforming, moving, and storing it. What should be surprising is that there are so many diverse ways of representing the schema of the data.
Let's look at a simple software system and examine the different places that schema is defined.
Consider the architecture of a typical web application: a front-end, an API, a back-end, and a database. There are typically many other components, but these form the core of the architecture. In a typical client-server application, the schema is defined separately in each of those individual systems.
In most software systems, there exists no unified way to fully describe the schema. Consider the case of the Contact table above. The emoji field is specified as a VARCHAR(8) [1]. Any character could be stored in that column, including characters that are not emojis. Where do we define the set of acceptable emojis? How about email? We need some email validation regular expression. And color? Another regular expression.
With database schemas, the typical solution is to use an ORM like ActiveRecord or Sequelize and define validations on a field.
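ORM validation syntax is library-specific, so the sketch below uses plain TypeScript rather than the ActiveRecord or Sequelize API. It shows the general idea of attaching one validation rule to each field. The regular expressions are deliberately simple, and the emoji allowlist is a hypothetical stand-in for "the set of acceptable emojis".

```typescript
// Deliberately simple patterns for illustration, not production-grade validation.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const HEX_COLOR_PATTERN = /^#[0-9A-Fa-f]{6}$/;
// Hypothetical allowlist of acceptable emojis.
const ALLOWED_EMOJIS = new Set<string>(["🎉", "😀", "🚀"]);

type ContactField = "name" | "email" | "color" | "emoji";
type Validator = (value: string) => boolean;

// One validation rule per Contact field.
const contactValidators: Record<ContactField, Validator> = {
  name: (value) => value.trim().length > 0,
  email: (value) => EMAIL_PATTERN.test(value),
  color: (value) => HEX_COLOR_PATTERN.test(value),
  emoji: (value) => ALLOWED_EMOJIS.has(value),
};

// Returns the names of the fields whose values fail validation.
function invalidFields(contact: Record<ContactField, string>): ContactField[] {
  const fields = Object.keys(contactValidators) as ContactField[];
  return fields.filter((field) => !contactValidators[field](contact[field]));
}
```

For example, `invalidFields({ name: "Ada", email: "not-an-email", color: "FF00CC", emoji: "a" })` reports the `email`, `color`, and `emoji` fields. These rules sit beside the database schema rather than inside it, so the front-end and API still need their own copies.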
Let's consider our opening challenge: adding a field to the Contact record. Where would we have to make changes?
Walk through all of them.
What about a different architecture? What if we colocated the database with the front-end?
Describe the different architecture.
[1] Emojis encoded in UTF-8 vary in byte length: a single code point takes up to 4 bytes, but newer emoji sequences and variations can extend to 8 bytes or more.