Schemas exist to facilitate communication and aid comprehension. A schema defines expectations for what kinds of content (below I will refer to these kinds as "types") will be sent or received, and it helps with the interpretation of the message.
Schemas exist in human life wherever communication takes place; they existed even before writing. Even when communication is oral, there are certain expectations about the "type" of response one might give to a question. Consider the following exchange:

"What time does the train leave?"

"Chicago."

The question expects a time; the answer supplies a place. Because the response violates the expected "type," the exchange breaks down.
Indeed, humor often leverages exactly this kind of "schema violation" to surprise the hearer as the core of the joke. See Abbott and Costello's famous "Who's on First?" sketch.
After the invention of writing, schemas became even more important, because the reader consumes written content asynchronously: the writer isn't necessarily around to clarify the statement in the case of a schema violation.
Over time, we developed conventions for specifying the "type" a writer was using when they wrote down a message. Tables and spreadsheets are one such convention: each column has a specific "type."
(picture of a ledger)
Consider the following:
(picture of a check with the incorrect "type" in all of the fields)
When the bank receives the above check, it will be unable to deposit it: the check fails "schema validation." The problem is not whether a specific account has enough money in it; rather, the message is nonsensical because the types of the fields do not conform to the expected schema.
What are these "kinds" of data, anyway? In computing, we often refer to them as "types." There are only a few basic types: text, numbers, booleans, and dates.
Looking more carefully at the basic types, we discover that things are much more interesting. Text alone can describe the name of a city or a person, a brand of automobile, a type of cloud, a command, and so many other things.
We build up complex types from these basic building blocks: cities and states, company names, categories of things, even parts of speech. Though software most often defines data in terms of the basic types, the interesting part of a schema consists of these higher-level types.
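As a minimal sketch in TypeScript (the names here are illustrative), a higher-level type is just a basic type given a more specific meaning:

// Basic types given domain-specific meaning.
type City = string;    // text that names a city
type ZipCode = string; // text constrained to a postal-code pattern

// A complex type assembled from the basic building blocks.
interface Address {
  city: City;
  state: string;
  zip: ZipCode;
}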
So a schema simply defines the type of response that's expected when communicating a message to someone, whether orally or in writing.
Schemas are everywhere in computing systems. Schemas are the types in our programming languages that let us communicate with the compiler and with other software developers. Schemas are the descriptions of our databases that define how data is represented. Schemas are the public interfaces of our classes, declaring which functions exist and the types of arguments those functions expect. Schemas are our REST APIs, our GraphQL APIs, our tRPC routers. Any process-to-process communication that passes data back and forth between different computing systems is described by one of a variety of schema languages. Even event systems within a single program leverage schemas.
Our user interfaces are also rich in schema. Think about forms, or any data we present on the screen: there are certain schemas we expect or accept for input, and certain schemas we expect for output. Just look at a tweet and break down its different elements.
Schemas are everywhere in software, and we have a variety of ways of making explicit the types of things.
We've established that schemas are everywhere in software. Let's focus on schemas in one specific domain: web and mobile applications.
The default architecture for web and mobile systems today is client-server, sometimes called a layered or "n-tier" architecture.
With a client-server architecture, we disperse the schema across the system in order to avoid over-coupling and let the independent subsystems evolve more quickly. In practice, though, the same schema ends up restated in every tier, which makes the system difficult to change and reason about. Consider a small change to an app, say adding a single field to a record, and count all the places you'd have to make that change.
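Here is a hypothetical sketch of that duplication in TypeScript; every name below is illustrative:

// The same "Contact" shape, restated in three tiers.

// 1. Server: the ORM model (mirroring a CREATE TABLE statement elsewhere).
interface ContactRow {
  id: number;
  name: string;
  email: string;
}

// 2. API: the wire format returned by GET /contacts/:id.
interface ContactDto {
  id: number;
  name: string;
  email: string;
}

// 3. Client: form state, with its own hand-rolled validation.
interface ContactForm {
  name: string;
  email: string;
}
const validateContactForm = (form: ContactForm): boolean =>
  form.name.length > 0 && form.email.includes("@");

// Adding one "phone" field means touching the database migration, the ORM
// model, the API type, the client type, and the form validation.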
There have been some efforts toward streamlining the schema in client-server systems. Prisma defines the data model once in a schema file and generates both database migrations and a type-safe TypeScript client from it. Rails' ActiveRecord reads the schema from the database itself, so models pick up their attributes without redeclaring them. RedwoodJS generates TypeScript types from a shared GraphQL schema so that the API side and the web side stay in agreement.
Local-first software is a set of ideals and does not prescribe a specific architecture to meet those ideals. However, the ideals imply a few characteristics of the architecture that are trending towards a common pattern. For the purposes of this article, we'll focus on one aspect of that pattern: every client application has a local copy of the data for that application on the device and that local data store gets synchronized with the data stores on other devices. The synchronization of the data store happens either via peer-to-peer networking or via a sync server.
In this architecture, the n-tier structure of client-server has been flattened: there are only clients, and the front-end code, the data store, and the "ORM" all live on a single system.
We still need schema in a local-first architecture.
Wherever communication happens between two systems, there is schema involved, and ideally it is the same schema on both sides.
Schema serves several purposes: it defines the shape of the data, it validates data at runtime, and it documents expectations for every person and program that reads or writes the data.
How do we bridge from "local-first apps need schema" to "the magic scenario"?
Imagine a version of Composer that is just ECHO and Shell: enough to connect devices, invite others, and load dynamic plugins.
What if React is the last web framework?
We have implemented a local-first architecture in DXOS with a unified schema using the Effect Schema library. Below we describe how we implemented it and some of the benefits and challenges we have experienced.
The primary role of schema is to define the shape of data.
import * as S from "@effect/schema/Schema";
import * as E from "@dxos/echo-schema";

export const Contact = S.struct({
  name: S.string,
  email: S.string,
  emoji: S.string,
  color: S.string,
}).pipe(
  // Give the schema a unique name and version
  E.echoObject("dxos.types.contact", "0.1.0")
);

// E.Schema adds ECHO-specific fields to the type
export type Contact = E.Schema.To<typeof Contact>;
Code like the following will now produce compile-time type errors:
const contact: Contact = {
name: 1234,
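// ❌ Error: Type 'number' is not assignable to type 'string'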
}
// Create a contact object and persist in ECHO
const contact = space.db.add(Contact, {
  name: "John Doe",
  email: "john@doe.org",
  color: "#FF0000",
  emoji: "👍",
});
// Directly mutate, triggering synchronization with other devices
contact.email = "johndoe@altavista.com";
// ContactCard re-renders when contact is changed
const ContactCard = ({ contact }: { contact: Contact }) =>
  <span>{contact.name}</span>;
const EMAIL_REGEX = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$/;

export const Contact = S.struct({
  // ... omitted
  email: S.string.pipe(S.pattern(EMAIL_REGEX)),
  // ... omitted
}).pipe(
  E.echoObject("dxos.types.contact", "0.1.0")
);
// 🔥 throws a ParseError
contact.email = "NOT_AN_EMAIL";
/* The @effect/schema ParseError:
Parsing failed:
{ email: "NOT_AN_EMAIL" }
└─ ["email"]
   └─ does not match pattern
*/
These parse errors can be used to generate human-readable error messages for user interfaces and API integrations.
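For instance, a minimal sketch of surfacing the failure as a form error (setFieldError here is a hypothetical UI helper, not a DXOS API):

// Hypothetical UI helper; substitute your form library's equivalent.
declare function setFieldError(field: string, message: string): void;

try {
  contact.email = "NOT_AN_EMAIL";
} catch (err) {
  // Map the ParseError shown above to a friendlier, field-level message.
  setFieldError("email", "Please enter a valid email address.");
}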
We serialize the schema to JSON Schema and save it to the data store. This enables any reader of the data store to inspect the shape of the data and maintain schema integrity through the validations.
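As a sketch of the serialization step (the JSONSchema module's exact API has varied across @effect/schema versions):

import * as JSONSchema from "@effect/schema/JSONSchema";

// Convert the Effect Schema definition into a plain JSON Schema document.
const contactJsonSchema = JSONSchema.make(Contact);

// This serialized form is what gets persisted alongside the data, so any
// reader of the store can discover the shape of a Contact.
console.log(JSON.stringify(contactJsonSchema, null, 2));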
When you have a single unified schema across the entire architecture, new scenarios are enabled.
By serializing the schema to the database, we enable cross-app interop scenarios where two different applications can read/write from the same data store simultaneously while maintaining schema integrity.
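For example, a second application that shares nothing with the first except the Contact schema can query the same space by type (a sketch following the query pattern in the DXOS examples; treat the import path and method names as illustrative):

import { Filter } from "@dxos/echo-schema";

// Query every object in the shared store that conforms to the Contact schema.
const contacts = space.db.query(Filter.schema(Contact)).objects;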
Composer's drag-and-drop functionality is schema-aware and uses an inversion-of-control model that lets plugins expose their capabilities to other plugins.
We can write LLM prompts that request responses in the shape of the schema. For example:
While the exact prompt we ship is still being refined, here is an illustrative sketch of the technique, reusing the contactJsonSchema serialization from above:
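// Illustrative only: embed the serialized JSON Schema in the prompt so the
// model replies with data that conforms to the Contact schema.
const prompt = `
Extract the contact from the text below.
Reply ONLY with a JSON object that conforms to this JSON Schema:

${JSON.stringify(contactJsonSchema, null, 2)}

Text: "Met John Doe yesterday; his email is john@doe.org."
`;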
If you have any experience with schemas in software systems, you have probably been wondering how we handle schema change. The short answer is that we don't yet: at this time, we do not perform automatic data migrations when a schema changes. We see the work above as the foundation on which a robust schema migration system can be built.
As mentioned above, we currently serialize schemas to JSON Schema and store them in the data store. However, JSON Schema cannot represent filters (Effect Schema validations) that contain arbitrary code, so some schema validity will not be maintained across systems. We intend to investigate distributing schemas as JavaScript packages and dynamically loading those packages from a schema repository at runtime, which would let arbitrary validation code travel with the schema.
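A rough sketch of where we want to get to (the package name and registration call here are hypothetical):

// Hypothetical registration call; the real mechanism is future work.
declare function registerSchema(schema: unknown): void;

// Hypothetical: a schema published as a versioned JavaScript package and
// loaded at runtime, validation filters included.
const { Contact } = await import("@example-org/schema-contact");
registerSchema(Contact);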
Feedback from Jonathan Edwards:
First of all, I’m not clear exactly what is the software stack you are proposing. Is there a relational database still involved or some new kind of database? If relational, then you need to explain how things will work better than all the prior attempts to use Object Relational Mappings as in Java and Rails. If not relational then exactly what is it and how is it better than things like Mongo and Firebase? People have tried and failed to do this before, and while it is certainly worth trying again you need to explain what is different this time. The best way to do this is to identify at least one of the failures of past approaches and show a new solution. That is what is needed to publish a research paper.
Sometimes there aren’t clear-cut new solutions and instead just a lot of little improvements that make life easier. That is generally referred to as “design”, and is harder to argue for in a paper. E.G. Rails is very nicely designed, but hard to boil down to a paper - demos work better for that.
It doesn't matter what kind of database it is, relational or key-value. The software stack colocates the database with the client and keeps the database synchronized across clients using an event-based synchronization protocol.