
You Should Take Event Sourcing More Seriously

I built an entire startup on top of an event-sourced data architecture. In the end we didn’t win in the market, but I walked away from the experience thoroughly convinced by the power of the model. In this post I want to sketch out the essential elements of the pattern, and what makes it so compelling; in particular, I want to make an argument that more teams should take it seriously, even in simplified form.

I used it because we needed to support both historical state and branch-like planning—think Git, but for organizational structures. Event sourcing made that possible by allowing us to treat certain kinds of events as speculative. Having seen teams struggle with event-oriented architectures, I frankly did everything in my power to avoid it, and first tried to model the problem with a conventional relational schema. I walked away frustrated and committed to giving event sourcing a try.

What ultimately surprised me, and won me over, was how many other things it made easy: implementing people analytics, creating embeddings for vector databases and RAG applications, integrating with third-party systems. Stuff that typically takes months to build and manage just emerged naturally from the model.

I meet a lot of engineers who may be put off or intimidated by evented data architectures. This is a pity, because even if you don’t use such models explicitly, you likely live in a world enabled by them. I’ll start by explaining the essential elements and some of the pitfalls, and close by making a case for their long-term impact.

What event sourcing is (and is not)

Event sourcing is a data architecture in which the canonical source of truth about business data is a log of events that carry information about what has happened—facts. Some common examples of event-sourced architectures are Git repositories (each commit is an event) and accounting ledgers (each transaction is an event). You can restore the state of your Git filesystem, or read your bank account balance, by replaying every event related to that entity since the dawn of time.

Event sourcing inverts the relationship between your database and its history. In a typical stateful database, you read and write values to a table, and under the hood the database keeps a record of all changes—the transaction log. In event sourcing, the transaction log of events is the primary source of truth, and everything else is simply a derived cache designed to solve specific problems. You can blow away the cache and recreate it on demand, or create new representations of the data as needed.

It is this ability to change your mind about how business data is processed and stored that makes event sourcing so powerful. While it places long-term constraints on some architectural decisions, like the nature and shape of events, it provides an enormous amount of flexibility and optionality for others. On balance, this is a tradeoff often worth making.

Some technical background

Event sourcing is built on events—atomic records of things that have happened. There are lots of ways to think about events, how to structure them, and what kind of data they should carry; a complete discussion is outside the scope of this article. For now, just understand an event in simple terms as a thing with a unique identifier, a timestamp, a type, and some arbitrary data:

interface Event<T extends string, P> {
  id: string
  timestamp: Date
  type: T
  payload: P
}

The entity generated by replaying those events is called an aggregate: you build it by combining each event into a unified whole (aggregating them). You build aggregates by writing reducers; in TypeScript terms:

type Reducer<S, E extends Event<string, unknown>> = (state: S, event: E) => S

That is, given some state (maybe empty or defaulted) and an event, produce a new state. The way you produce an aggregate is by reducing over the history of all events:

const aggregate: S = events.reduce<S>(
  (state: S, event: E) => { ... },
  EMPTY_STATE
);

Event sourcing is not the same thing as an event-driven architecture. You don’t need Kafka to use event sourcing; you can build a whole application simply by writing and reading events from a single database table. (If you’ve done some frontend work in the past and this is starting to look suspiciously similar to useReducer or Redux, you’re not wrong.) Git, again, by way of example, is an event-sourced architecture that does not inherently rely on something like a message queue or event bus. Certain more powerful architectures are enabled by an event-driven approach, but they can be difficult to build and maintain and aren’t strictly necessary to derive a lot of the more important benefits.

A concrete example

I’ll use a simple and concrete example to demonstrate how you’d use this technique (and why) to solve a problem. The classic example is a bank account, but I’ll use something a little closer to my heart: an organizational model.

type ID = string

// First, define some basic events related to the employee lifecycle.
type EventType =
  | 'EMPLOYEE_HIRED'
  | 'EMPLOYEE_LEFT'
  | 'EMPLOYEE_CHANGED_MANAGERS'
  | 'EMPLOYEE_CHANGED_JOBS'

type EventEmployeeHired = Event<
  'EMPLOYEE_HIRED',
  { employeeId: ID; managerId: ID; titleId: ID }
>
type EventEmployeeLeft = Event<
  'EMPLOYEE_LEFT',
  { employeeId: ID; effectiveDate: Date }
>
type EventEmployeeChangedManagers = Event<
  'EMPLOYEE_CHANGED_MANAGERS',
  { employeeId: ID; managerId: ID }
>
type EventEmployeeChangedJobs = Event<
  'EMPLOYEE_CHANGED_JOBS',
  { employeeId: ID; titleId: ID }
>

// An EmployeeEvent is the union of all event types related to employees.
type EmployeeEvent =
  | EventEmployeeHired
  | EventEmployeeLeft
  | EventEmployeeChangedManagers
  | EventEmployeeChangedJobs

// Next, define a simple Employee datatype. This will be our aggregate.
interface Employee {
  id: ID
  titleId: ID
  managerId: ID
  startDate: Date
  state: 'PENDING' | 'ACTIVE' | 'EXITED'
  endDate?: Date
}

// The EmployeeReducer is a function that accepts an employee aggregate and
// corresponding event, and defines a concrete transformation for each
// different kind of event.
const EmployeeReducer: Reducer<Employee, EmployeeEvent> = (emp, evt) => {
  switch (evt.type) {
    case 'EMPLOYEE_HIRED':
      return {
        ...emp,
        id: evt.payload.employeeId,
        managerId: evt.payload.managerId,
        titleId: evt.payload.titleId,
        startDate: evt.timestamp,
        state: 'ACTIVE',
      }
    // ... handle the remaining event types similarly
    default:
      return emp
  }
}

// Define a base empty state (sometimes called S_0, the zero state).
const EMPTY_EMPLOYEE: Employee = {
  id: '',
  titleId: '',
  managerId: '',
  state: 'PENDING',
  startDate: new Date(),
}

// With these elements in place, we can retrieve events, reduce them, and
// store an aggregate. (In a real system you'd filter to a single employee's
// events before reducing; that's elided here for brevity.)
const events = await db.events.selectAll({
  type: [
    'EMPLOYEE_HIRED',
    'EMPLOYEE_LEFT',
    'EMPLOYEE_CHANGED_MANAGERS',
    'EMPLOYEE_CHANGED_JOBS',
  ],
})

const employee = events.reduce(EmployeeReducer, EMPTY_EMPLOYEE)

await db.employees.insert(employee)

This is enough to build an org chart, and maybe render some fancy colors for job titles. It’s obviously not comprehensive and doesn’t capture everything you might want to know about an employee relationship. Its data model also has some notable shortcomings that we’ll discuss shortly. For now, there are a few things I’d like to note about this data model:

  • While we’ve only defined a single reducer here, one that generates an Employee aggregate, you can add as many as you want. Want to store timeseries data, embeddings for a vector-based RAG app, or feed those events into other tables? Just write another reducer (see the sketch after this list).
  • Want to redefine what it means for an employee to be active later? Change the reducer and rebuild the aggregates. The event defines what happened, and the reducer defines what that means for the system’s current state. That’s a powerful decoupling.
  • On that note: at any time, all the aggregates can be rebuilt using the exact same technique. Retrieve the events, reduce the events, store the aggregate. The events are the source of truth, not the employees table. This has some operational complexity that I’m glossing over here, but the tradeoff—flexibility in how you build and evolve the model—is often worth it.
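
To illustrate the first point, here’s a minimal sketch of a second reducer over the same events, one that derives a running headcount time series instead of an Employee record (the HeadcountPoint shape is made up for this example):

// A second reducer over the same EmployeeEvent stream, producing a running
// headcount time series instead of an Employee aggregate.
interface HeadcountPoint {
  timestamp: Date
  headcount: number
}

const HeadcountReducer: Reducer<HeadcountPoint[], EmployeeEvent> = (series, evt) => {
  const current = series.length > 0 ? series[series.length - 1].headcount : 0
  switch (evt.type) {
    case 'EMPLOYEE_HIRED':
      return [...series, { timestamp: evt.timestamp, headcount: current + 1 }]
    case 'EMPLOYEE_LEFT':
      return [...series, { timestamp: evt.timestamp, headcount: current - 1 }]
    default:
      return series
  }
}

// Same events, different derived view.
const headcount = events.reduce<HeadcountPoint[]>(HeadcountReducer, [])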

The difference between this approach and using a technique like temporal tables in a conventional RDBMS is that while temporality helps you capture historical state, you can’t decide after the fact to change how that state was defined and written. You can’t retroactively decide that you got it wrong when it came to determining which event made the employee ACTIVE and rebuild your table. You get what you get, just with more history.

Evolving the model

There are some notable ways you might want to change this approach as your application grows (a sketch of the evolved event shapes follows this list):

  • What if the employee isn’t actually meant to start on the day the EMPLOYEE_HIRED event was recorded? We’ve tied the reducer implementation directly to the timestamp of the event. We might want to decide that EMPLOYEE_HIRED represents when the offer was accepted, and capture an effective_date for the offer as part of the payload.
  • A future effective date is a fact in the sense that it was agreed upon and intended to occur, but speculative in the sense that it hasn’t actually happened yet in the real world. What if the real world gets in the way? We might want to introduce a new event, EMPLOYEE_STARTED, which represents the authoritative day they started at the organization.
  • What if we decided we wanted to capture more metadata about the employee? We might want to extend the EMPLOYEE_HIRED event with new data, like their salary.
  • What if we decided, with the benefit of hindsight, that managerId wasn’t useful or necessary to capture up-front, and wanted to make that field optional or remove it entirely? We might not want to encourage new events to populate it, and could reduce the maintenance burden of tracking it.
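
To make these ideas concrete, here’s one way the evolved event shapes might look; this is a sketch of possible definitions (the effectiveDate, salary field, and optional managerId are illustrative choices, not the only ones):

// EMPLOYEE_HIRED now represents an accepted offer: the intended start date
// and salary travel in the payload, and managerId becomes optional.
type EventEmployeeHiredV2 = Event<
  'EMPLOYEE_HIRED',
  {
    employeeId: ID
    titleId: ID
    managerId?: ID
    effectiveDate: Date
    salary?: number
  }
>

// A new event records the day the employee actually started.
type EventEmployeeStarted = Event<
  'EMPLOYEE_STARTED',
  { employeeId: ID; effectiveDate: Date }
>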

Effective dates

The idea that the time at which an event was generated and stored might differ from when the event took place in the real world is called bitemporality: splitting our understanding of the timestamp associated with an event into two (or sometimes more!) dimensions.

Storing both dimensions is useful because they track different things: the recorded timestamp tells us when an event was introduced into the datastore, while the effective date tells us when the event took (or was intended to take) effect in the domain. The latter lets us model our domain more accurately, or in some cases go “back in time” to fix past issues.

If we did want to capture an effective_date, note that it may not follow our strict definition of bitemporality above. An effective_date in this context is speculative: it records the idea that we expect the employee to start at some future date, and our expectation is true as of today, but it hasn’t happened yet. It is simply part of our domain-level understanding of what has occurred.

Introducing new events

An example of where we might want to use bitemporality is to go back in time and introduce a new event to “fix” an employee’s start date. Let’s say we do introduce that event, EMPLOYEE_STARTED. How do we extend our reducer to understand it?

One thing we can do is introduce these events after the fact with a later timestamp but an earlier effective date, and then sort our retrieved events by effective date, not timestamp. When we rebuild the aggregate, the employee’s start date will be reflected correctly by the date in EMPLOYEE_STARTED, not the incorrect one in EMPLOYEE_HIRED.
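
Here’s a minimal sketch of both pieces, assuming the EventEmployeeStarted shape from the sketch above and that every retrieved payload now carries an effectiveDate (eventsForEmployee is a hypothetical, already-fetched history for one employee):

// Extend the reducer: EMPLOYEE_STARTED becomes authoritative for the start
// date; every other event is handled exactly as before.
const EmployeeReducerV2: Reducer<Employee, EmployeeEvent | EventEmployeeStarted> = (
  emp,
  evt
) => {
  switch (evt.type) {
    case 'EMPLOYEE_STARTED':
      return { ...emp, startDate: evt.payload.effectiveDate, state: 'ACTIVE' }
    default:
      return EmployeeReducer(emp, evt)
  }
}

// A hypothetical, already-fetched history for one employee, in which every
// payload carries an effectiveDate (see the evolved shapes above).
declare const eventsForEmployee: Array<
  (EmployeeEvent | EventEmployeeStarted) & { payload: { effectiveDate: Date } }
>

// Sort by effective date rather than recorded timestamp, then rebuild.
const sorted = [...eventsForEmployee].sort(
  (a, b) => a.payload.effectiveDate.getTime() - b.payload.effectiveDate.getTime()
)

const corrected = sorted.reduce(EmployeeReducerV2, EMPTY_EMPLOYEE)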

Versioning & Backwards Compatibility

What happens when our understanding of an event changes? Events are meant to be immutable; we need to support old ones even if our understanding of the domain changes. Introducing new fields is easy—just tack them on—but larger, more structural changes (or a fundamental shift in one’s understanding of how to model a domain) can make this much harder.

It’s an interesting question and whole books have been written about it, but suffice it to say that there are three common approaches:

  1. Attach explicit versions to events and migrate them on demand.
  2. Don’t ever remove or change old fields on existing events, just introduce new ones (this is the Protobuf approach).
  3. Abandon the idea of past event immutability and use conventional database migrations to keep them tidy.

All of them are workable; commit to a model and try it out.
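
As a small taste of the first approach, here’s a hedged sketch in which payloads carry an explicit version and an “upcaster” migrates old shapes to the current one at read time (the version-1 title field and the lookupTitleId helper are hypothetical):

// Version 1 of the hire payload stored a free-text title; version 2 stores
// a titleId. Old events stay untouched in storage and are upcast when read.
type HiredPayloadV1 = { version: 1; employeeId: ID; title: string }
type HiredPayloadV2 = { version: 2; employeeId: ID; titleId: ID }

// A hypothetical lookup from legacy title strings to title IDs.
declare function lookupTitleId(title: string): ID

// Upcast an old payload to the current shape; reducers then only ever need
// to understand the latest version.
const upcastHiredPayload = (p: HiredPayloadV1 | HiredPayloadV2): HiredPayloadV2 =>
  p.version === 1
    ? { version: 2, employeeId: p.employeeId, titleId: lookupTitleId(p.title) }
    : p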

This seems complicated; is it worth it?

Is writing any software ever worth it? Unclear.

I like to use Hickey’s Razor for questions like this. Is event sourcing complicated? I don’t think so: it’s just a table of things that happened. You can describe the idea in a few sentences, and model it with just as few types. Echoes of it exist in other common architecture patterns. Some of the most important tools of the modern software engineering workflow, like version control systems, are built on its principles. The techniques for mitigating its weak points, like schema versioning and migrations, can be tedious but are likewise not at root very complex. It is, in the Hickeyan sense, a decomplected architecture.

But event-sourcing can be hard to implement, for at least three reasons.

First, the mechanics of ensuring that events are fanned out, read, and written appropriately are not trivial. Complex domains use a lot of different kinds of events, depending on how fine-grained you want to get with them, and all of that takes a lot of effort to steward and maintain. For this reason, event sourcing is more often used in domains like accounting, where events are high-cardinality but relatively simple, and less often in conventional SaaS applications, which tend to be lower traffic but much broader in scope.

The second is that it is a fundamentally different paradigm from stateful database writes and reads. The engineering that makes a conventional RDBMS seem simple is enormously difficult, but such systems are nonetheless a common and well-understood way to store and retrieve data. Event sourcing inverts the relationship between data logging and state persistence, and challenges the way most engineers have been taught to approach persistence problems from the very earliest parts of their careers.

The third is that event sourcing is often coupled with CQRS, an architectural model that intentionally separates the act of writing data from the act of reading it back out to clients: write events, wait, then read aggregates. This implies the need for an explicitly eventually consistent architectural model, in which clients don’t receive immediate confirmation about updated entity state and have to resync state later. If this seems like a wild or impractical idea, I get it.

The good news is that you don’t need to do any of this to gain a lot of value from the pattern.

Avoiding the pitfalls

First, if you want to keep things simple, lean into event logs but punt on tools like Kafka. An event log can just be a database table with some stuff in it; that’s what we used. You can synchronously write both events and aggregates if you like, and emulate many aspects of conventional transactional flows while still retaining full data history. Doing so has enormous upsides and buys you the opportunity to revisit certain decisions later, like how widely the events are propagated and how they’re processed. Having the full event history means that you can afford to make some mistakes. You can always bring in something heavier later if you have to.
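
For instance, here’s a rough sketch of what that synchronous flow can look like, reusing the hypothetical db client from the example above (the transaction, selectAll, and upsert calls are stand-ins, not any particular driver’s API):

// Append the event and refresh the derived aggregate in one transaction.
// The events table remains the source of truth; the employees table is
// just a cache that can be rebuilt at any time.
async function recordEmployeeEvent(event: EmployeeEvent): Promise<void> {
  await db.transaction(async (tx) => {
    // 1. Write the event itself.
    await tx.events.insert(event)

    // 2. Rebuild and upsert the affected aggregate synchronously, so readers
    //    see fresh state without a message broker or async projection.
    const history = await tx.events.selectAll({
      employeeId: event.payload.employeeId,
    })
    const employee = history.reduce(EmployeeReducer, EMPTY_EMPLOYEE)
    await tx.employees.upsert(employee)
  })
}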

Second, spend some time up-front to try to model out your domain as best you can with events. This is hard; advocates of the technique sometimes suggest event storming, a DDD technique, as a way of shaking these out, but my experience has been that your understanding of what works and what doesn’t will necessarily evolve. You’re not going to get all of these right; expect change. Nonetheless, try to identify major inflection points in the data flow and ensure they get an event assigned.

Third, try to find a balance between domain completeness and maintainability. We made the mistake of using extremely fine-grained events and wound up with a couple hundred different types; while this enabled some marvelous reporting tools, the maintenance burden was too high for a team our size. The conclusion I walked away with is that sometimes it’s okay to elide or combine events and find other ways to capture context. No one wants to maintain that much code.

Making an argument for change

I’ve seen teams take months or years to structure their data in a way that lets them solve novel problems, and twist themselves into pretzels over the difficulty of reliably propagating data across systems. These are problems I don’t worry about as much anymore. Event sourcing helps. Write events, reduce them, rebuild as needed.

It’s a model that invites you to think long-term about your data—and gives you the tools to change your mind without paying a high cost. That’s rare, and worth taking seriously. You can revise your domain model without starting over. You can revisit decisions with better insight. And you can start solving problems that used to seem way too hard, like planning and version control.

It’s not for every team, but it works. Consider trying it for your next build.

Photo by Daniele Levis Pelusi on Unsplash
