Test behaviour, not implementation

Most testing arguments get lost in theoretical minutiae and miss the point: validating correctness. Testing units at every level makes tests brittle and code resistant to change. Testing full vertical slices is too coarse - it’s hard to cover invariants and edge cases.

What you want is something in the middle. Validate the edges with real components - make real HTTP calls, send actual messages on the queue, run real queries against a real DB, and check that the domain gets invoked correctly. Then validate your domain at its interface: mock the edges, use those mocks to simulate the failures and odd responses the edges can produce, and make sure the domain behaves as expected.

Tests should let me refactor without fear. To test the core, mock the edges and run the core; to test the edges, run them for real and mock nothing.

What are we actually trying to test?

A test is doing two jobs at once. It documents correctness and makes it safe to change the code later. A suite that only validates correctness at the unit level is brittle and fails to catch what happens across units, where the real bugs hide. A suite that only validates high-level behaviour fails to pin down all the invariants, and subtle bugs slip in as the code base evolves.

The two jobs pull in opposite directions. To prove correctness you want to pin down behaviour exactly. To preserve refactoring freedom you want to pin down as little as possible beyond the observable behaviour. Whether your tests succeed at both depends almost entirely on what you decide to call “the unit.”

The key to tests you can refactor through is picking the right unit. That means building the application with clear seams - the places where one part of the system stops and another starts. Find the seams and you find the units. Get the seams wrong and no amount of testing discipline will save you when the business changes its mind.

Mock at the seam, not inside the unit

Take a service that places an order. It looks up the customer, checks their credit limit, saves the order. The repositories it talks to are the seams. The credit-limit check is part of the unit.

A good test mocks the two repositories and asserts on what happens at the boundary:

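// Stub the seam: the customer has a credit limit of 100.
customers.findById.mockResolvedValue({ creditLimit: 100 })

// Assert on observable behaviour: the order is rejected and nothing is saved.
await expect(service.place({ total: 150 })).rejects.toThrow(OverLimit)
expect(repo.save).not.toHaveBeenCalled()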

The test doesn’t know how the rejection is implemented. Extract the check into a value object, inline it back, move it into a domain method. It stays green.
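To make the "stays green" claim concrete, here is a minimal sketch of one seam test run against two implementations of the service. It uses hand-rolled stubs and Node's built-in assert module instead of vi.fn() so that it runs without a test runner. The type names, the customerId field, the CreditLimit value object and the two service class names are assumptions for illustration; only place, findById, save, creditLimit and OverLimit come from the text above.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical shapes, assumed for illustration.
type Customer = { creditLimit: number };
type Order = { customerId: string; total: number };

interface CustomerRepository {
  findById(id: string): Promise<Customer>;
}
interface OrderRepository {
  save(order: Order): Promise<void>;
}
interface PlacesOrders {
  place(order: Order): Promise<void>;
}

class OverLimit extends Error {}

// Version 1: the credit-limit check is written inline.
class InlineOrderService implements PlacesOrders {
  private readonly repo: OrderRepository;
  private readonly customers: CustomerRepository;

  constructor(repo: OrderRepository, customers: CustomerRepository) {
    this.repo = repo;
    this.customers = customers;
  }

  async place(order: Order): Promise<void> {
    const customer = await this.customers.findById(order.customerId);
    if (order.total > customer.creditLimit) throw new OverLimit();
    await this.repo.save(order);
  }
}

// Version 2: the same behaviour after extracting the check into a value object.
class CreditLimit {
  private readonly amount: number;

  constructor(amount: number) {
    this.amount = amount;
  }

  covers(total: number): boolean {
    return total <= this.amount;
  }
}

class ExtractedOrderService implements PlacesOrders {
  private readonly repo: OrderRepository;
  private readonly customers: CustomerRepository;

  constructor(repo: OrderRepository, customers: CustomerRepository) {
    this.repo = repo;
    this.customers = customers;
  }

  async place(order: Order): Promise<void> {
    const customer = await this.customers.findById(order.customerId);
    const limit = new CreditLimit(customer.creditLimit);
    if (!limit.covers(order.total)) throw new OverLimit();
    await this.repo.save(order);
  }
}

type ServiceFactory = (
  repo: OrderRepository,
  customers: CustomerRepository,
) => PlacesOrders;

// The seam test: it stubs the two repositories and observes only the boundary.
async function rejectsOrderOverLimit(makeService: ServiceFactory): Promise<void> {
  const saved: Order[] = [];
  const repo: OrderRepository = {
    save: async (order) => {
      saved.push(order);
    },
  };
  const customers: CustomerRepository = {
    findById: async () => ({ creditLimit: 100 }),
  };

  const service = makeService(repo, customers);

  await assert.rejects(service.place({ customerId: "c1", total: 150 }), OverLimit);
  assert.equal(saved.length, 0); // nothing crossed the persistence seam
}
```

Calling rejectsOrderOverLimit with a factory for InlineOrderService and again with a factory for ExtractedOrderService should pass both times, because nothing in the test refers to where the check lives.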

A bad test reaches inside and mocks the credit-limit check itself:

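// Inject the credit-limit check as a mock, so the test controls an internal.
const limitCheck = vi.fn().mockReturnValue(false)
const service = new OrderService(repo, customers, limitCheck)

// Assert on how the service calls its internals rather than on what it does.
await service.place({ total: 150 }).catch(() => {})
expect(limitCheck).toHaveBeenCalledWith(100, 150)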

Same input, same outcome. But now the test has opinions about how the service is built. Refactor the internal structure and it breaks even though the behaviour didn’t change.
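The breakage can be reproduced in a stripped-down model. The sketch below reduces the service to a single function that receives the limit directly, which is a simplification and not the OrderService from the text. The names placeWithInjectedCheck, placeWithInlinedCheck and recordingCheck are hypothetical, and recordingCheck is a minimal stand-in for vi.fn().

```typescript
import { strict as assert } from "node:assert";

type LimitCheck = (limit: number, total: number) => boolean;
type Place = (limit: number, total: number, check: LimitCheck) => Promise<void>;

class OverLimit extends Error {}

// Before the refactor: the check is an injected collaborator.
const placeWithInjectedCheck: Place = async (limit, total, check) => {
  if (!check(limit, total)) throw new OverLimit();
};

// After the refactor: the check is inlined and the collaborator is never called.
const placeWithInlinedCheck: Place = async (limit, total) => {
  if (total > limit) throw new OverLimit();
};

// Minimal stand-in for vi.fn(): records the arguments of every call.
function recordingCheck(result: boolean) {
  const calls: Array<[number, number]> = [];
  const check: LimitCheck = (limit, total) => {
    calls.push([limit, total]);
    return result;
  };
  return { check, calls };
}

// Glass-box test: asserts on the internal call, as in the bad test above.
async function glassBoxTest(place: Place): Promise<void> {
  const { check, calls } = recordingCheck(false);
  await place(100, 150, check).catch(() => {});
  assert.deepEqual(calls, [[100, 150]]);
}

// Black-box test: asserts only on the observable outcome.
async function blackBoxTest(place: Place): Promise<void> {
  const { check } = recordingCheck(false);
  await assert.rejects(place(100, 150, check), OverLimit);
}
```

Both functions reject an order of 150 against a limit of 100, so blackBoxTest passes for each. glassBoxTest passes only for placeWithInjectedCheck; against the inlined version the recorded call list is empty and the assertion fails, even though the behaviour is identical.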

Same library. Same syntax. Opposite effect on whether you can refactor.

Freeman and Pryce called this out in Growing Object-Oriented Software, Guided by Tests: mock peers, not internals. The brittle tests people complain about come from mocking internals, not from the school the tests are written in. The tooling makes both equally easy; the discipline isn’t. (Fowler’s Mocks Aren’t Stubs is the canonical map of the classical/mockist split if you want the full taxonomy.)

The seam test treats the service as a black box. The internal test treats it as a glass box. Black box tests survive refactors; glass box tests just document the implementation.

Test the real thing at the edge

The other seam is between your code and what it talks to - the database, the queue, the HTTP service, the file system. The adapters that cross this seam need different testing.

Mocking these is a trap. A mocked repository returns whatever you tell it to. It tells you nothing about whether the SQL parses, whether the migration ran, or whether the JSON column round-trips your domain object. You can have a fully green test suite and a repository that throws on every real call.

So at the edge, run real things. Testcontainers spins up a real Postgres so your migrations run and your adapter is exercised against actual SQL. For HTTP, run a real server or use something like MSW or WireMock that intercepts at the network layer rather than mocking the client. Queues need a real broker or a faithful emulator - LocalStack for SQS, an embedded ActiveMQ, a Kafka container, whichever you actually ship. Redis you just run.

At the edge, the adapter is what you’re testing, and what it wraps is the part most likely to break. Mocking the wrapped thing throws away the only useful signal.
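For the HTTP case, "run a real server" can be as small as the sketch below, which uses only Node's standard library. It assumes Node 18.2 or later for the global fetch and for server.closeAllConnections. The adapter fetchCreditLimit, the /customers/:id route and the JSON shape are hypothetical and stand in for whatever your adapter actually calls.

```typescript
import { strict as assert } from "node:assert";
import { createServer } from "node:http";
import type { AddressInfo } from "node:net";

// Hypothetical adapter under test: makes a real HTTP request and maps the response.
async function fetchCreditLimit(baseUrl: string, customerId: string): Promise<number | null> {
  const res = await fetch(`${baseUrl}/customers/${encodeURIComponent(customerId)}`);
  if (res.status === 404) return null;
  if (!res.ok) throw new Error(`Unexpected status ${res.status}`);
  const body = (await res.json()) as { creditLimit: number };
  return body.creditLimit;
}

// Starts a real HTTP server on an ephemeral port, runs the callback against it,
// and shuts the server down afterwards.
async function withStubServer<T>(run: (baseUrl: string) => Promise<T>): Promise<T> {
  const server = createServer((req, res) => {
    if (req.url === "/customers/c1") {
      res.writeHead(200, { "content-type": "application/json" });
      res.end(JSON.stringify({ creditLimit: 100 }));
    } else {
      res.statusCode = 404;
      res.end();
    }
  });

  await new Promise<void>((resolve) => server.listen(0, "127.0.0.1", resolve));
  const { port } = server.address() as AddressInfo;

  try {
    return await run(`http://127.0.0.1:${port}`);
  } finally {
    await new Promise<void>((resolve) => {
      server.close(() => resolve());
      server.closeAllConnections(); // drop keep-alive sockets so close() returns promptly
    });
  }
}

// The edge test: nothing is mocked, the request really crosses a socket.
async function adapterMapsResponses(): Promise<void> {
  await withStubServer(async (baseUrl) => {
    assert.equal(await fetchCreditLimit(baseUrl, "c1"), 100);
    assert.equal(await fetchCreditLimit(baseUrl, "missing"), null);
  });
}
```

Because the request goes through the real client, URL building, status handling and JSON parsing are all exercised, which a mocked client would skip.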

These tests are slower. Mine usually run in seconds, not milliseconds. That’s fine. They run on CI before merge. Locally I run them when I’m working on the adapter; the rest of the time I rely on the fast seam tests. The trade-off is worth it because a mocked DB test that passes against broken SQL is worse than no test. It gives you false confidence.

Real-edge tests also validate the contract between your adapter and the rest of your code. If the service-level tests mock findById to return null, but the real repository actually throws RowNotFound, your service test is green and prod crashes. Running the adapter against a real database catches that drift before it ships.
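One way to pin that contract down is to write the assumption the service-level mock encodes as a check that takes any repository, then point it at the real adapter in the edge suite. The sketch below is an illustration under assumptions: missingCustomerIsNull and the two stand-in repositories are hypothetical, and the stand-ins exist only to keep the example self-contained. In a real suite the argument would be the adapter wired to a real database.

```typescript
import { strict as assert } from "node:assert";

type Customer = { id: string; creditLimit: number };

// The port as the service sees it: a missing customer is null, not an exception.
interface CustomerRepository {
  findById(id: string): Promise<Customer | null>;
}

class RowNotFound extends Error {}

// Contract check: the same assumption the service-level mock makes.
async function missingCustomerIsNull(repo: CustomerRepository): Promise<void> {
  const found = await repo.findById("no-such-customer");
  assert.equal(found, null);
}

// Stand-in for an adapter whose behaviour has drifted from the mock.
const driftedRepository: CustomerRepository = {
  findById: async (id) => {
    throw new RowNotFound(`no row for customer ${id}`);
  },
};

// Stand-in for an adapter that matches what the service tests assume.
const conformingRepository: CustomerRepository = {
  findById: async () => null,
};
```

missingCustomerIsNull(conformingRepository) resolves, while missingCustomerIsNull(driftedRepository) rejects with RowNotFound, which is the drift surfacing in a test instead of in production.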

Where this lands

Pick the seams first. The labels come last, if at all. When someone asks if I’m London or Chicago I usually say “depends on the layer” and watch them lose interest, which is fine - the question was never that interesting. The interesting question is where the seams go in your codebase and whether your tests let you move them.

If you can’t refactor without a wave of red, the tests aren’t doing their job, no matter what school they belong to.