Deterministic Simulation Testing for Distributed Systems: The Good, the Bad, and the Ugly

During my PhD, I built Fray, a deterministic simulation testing framework for concurrent programs written in Java. Fray was quite successful and has found many bugs in mature concurrent programs. As an academic, a natural question for me was: can we use a similar idea to test distributed systems? This led to the last chapter of my PhD thesis: Diorama, a deterministic simulation testing framework for distributed systems.

The Core Idea: Reduce a Distributed System to a Concurrent Program

The core idea of Diorama is simple: if we can reduce a distributed system to a concurrent program, we can use Fray’s deterministic simulation testing framework to test it. The power of this idea is that it reduces all “events” in a distributed system to “scheduling decisions,” allowing a centralized scheduler to control everything — when a network message is delivered, how fast a clock progresses, and in what order nodes observe each other’s actions.

Concretely, Diorama replaces the network layer of a distributed system with an in-memory implementation, replaces the physical clock with a logical clock, and isolates each node’s state through standalone class loaders.
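To make the reduction concrete, here is a minimal sketch of what "message delivery as a scheduling decision" can look like. It is not Diorama's actual implementation: the class InMemoryNetworkSketch, its Message type, and the methods send, deliverAll, and run are all hypothetical names chosen for illustration, and a seeded java.util.Random stands in for a real scheduler such as Fray's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: none of these names come from Diorama or Fray.
final class InMemoryNetworkSketch {
    // A message "in flight": it stays pending until the scheduler picks it.
    static final class Message {
        final int from;
        final int to;
        final String payload;

        Message(int from, int to, String payload) {
            this.from = from;
            this.to = to;
            this.payload = payload;
        }
    }

    private final List<Message> pending = new ArrayList<>();
    private final Random scheduler; // the seed fully determines the delivery order

    InMemoryNetworkSketch(long seed) {
        this.scheduler = new Random(seed);
    }

    // Sending never touches a real socket; it only records the message.
    void send(int from, int to, String payload) {
        pending.add(new Message(from, to, payload));
    }

    // Delivery is a scheduling decision: any pending message may go next.
    List<String> deliverAll() {
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            Message m = pending.remove(scheduler.nextInt(pending.size()));
            order.add(m.from + "->" + m.to + ":" + m.payload);
        }
        return order;
    }

    // Runs the same three-message workload under the given seed.
    static List<String> run(long seed) {
        InMemoryNetworkSketch net = new InMemoryNetworkSketch(seed);
        net.send(1, 2, "vote");
        net.send(2, 3, "vote");
        net.send(3, 1, "heartbeat");
        return net.deliverAll();
    }

    public static void main(String[] args) {
        System.out.println(run(1L));
        System.out.println(run(2L));
    }
}
```

Replaying the same seed reproduces the same delivery order, which is what makes a failing interleaving debuggable. Different seeds may explore different orderings.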

The Good: Simplicity

Diorama is designed to be simple and easy to use. It does not require any intrusive code changes to the distributed system and provides a clean interface to launch nodes, inspect states, and verify behavior.

To test a distributed system with Diorama, you implement a ServerInstance interface for each node (both servers and clients):

public interface ServerInstance {
    void run(String[] args) throws Exception;
    void stop();
}

Then you write a standard JUnit test that launches your nodes through Diorama, waits for the workload to finish, and verifies results. You can run this test as a normal JUnit test with @Test, or as a controlled concurrency test with @FrayTest — the latter lets Fray’s scheduler systematically explore different message orderings and timing behaviors.

For example, here is a simplified ServerInstance for ZooKeeper:

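import org.apache.zookeeper.server.quorum.QuorumPeerConfig;
import org.apache.zookeeper.server.quorum.QuorumPeerMain;

public class ZookeeperServer implements ServerInstance {
    private final QuorumPeerMain quorumPeer = new QuorumPeerMain();

    @Override
    public void run(String[] args) throws Exception {
        // args[0] is the path to this node's ZooKeeper config file
        QuorumPeerConfig config = new QuorumPeerConfig();
        config.parse(args[0]);
        quorumPeer.runFromConfig(config); // blocks until the peer shuts down
    }

    @Override
    public void stop() {
        quorumPeer.close();
    }
}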

And here is a test that launches a 3-node ZooKeeper cluster, kills the leader, restarts it, and verifies that the cluster re-converges with a single leader:

@ExtendWith(FrayTestExtension.class)
public class ZookeeperIntegrationTest {
    @FrayTest(...)
    void testLeaderRestart() throws Exception {
        // Launch 3 Zookeeper servers
        for (int i = 1; i <= 3; i++) {
            launchServer("ZookeeperServer", new String[]{configPath(i)}, ...);
        }
        Thread.sleep(3000L); // lets the cluster elect a leader

        // Kill the leader and restart it
        int leaderId = findLeader();
        stopServer(leaderId);
        launchServer(leaderId);

        // Verify: cluster must converge to 1 leader + 2 followers
        waitAndVerifyModes(3, m ->
            countMode(m, "leader") == 1 && countMode(m, "follower") == 2
        );
    }
}

Here, stopServer calls the stop method defined in ZookeeperServer, and waitAndVerifyModes spawns a ZooKeeper client that sends the stat command to each server to count the number of leaders and followers (the same way you would interact with a real ZooKeeper cluster).

Because the test logic is just straightforward Java with no framework-specific DSL, AI agents can generate test scenarios with simple prompting. The diorama-examples repo contains three example systems tested with Diorama. The JRaft and ScaleCube examples were generated entirely by AI agents.

The Bad: The Applicability Dilemma

In the Fray paper, I argued that mocking the concurrency library is not a good approach because it hurts applicability. However, in Diorama, we explicitly replace the network layer of a distributed system with an in-memory implementation, and this hurts applicability in the same way.

In our implementation, we replaced java.net.Socket and java.net.ServerSocket with in-memory implementations. This only works when: 1) the application uses physical network sockets directly; 2) the application does not use a network library such as Netty; 3) the application does not try to subclass these classes.

The Ugly

Luckily, we can still find many applications that use these classes directly, and we can still test them with Diorama. But once you write test scenarios and launch them with Diorama, you are going to face the ugly part of distributed system testing.

Timed Operations

In the paper A Note on Distributed Computing, the authors argue that “there are fundamental differences between the interactions of distributed objects and the interactions of non-distributed objects. Further, work in distributed object-oriented systems that is based on a model that ignores or denies these differences is doomed to failure.” Among the fundamental differences they identify are latency (distributed object access is much slower than local object access), partial failure (one node can fail while the others keep running, and the survivors cannot always tell what failed), and concurrency (operations on different nodes run truly asynchronously, with no single global clock to order them).

As a result, one of the most significant differences I have observed in distributed systems compared to concurrent programs is the pervasive use of timed operations.

For example, network messages are usually sent with a timeout to ensure the system can still make progress even if the recipient is unavailable. Different nodes need to send heartbeats at a certain frequency to ensure others are alive. Requests usually come with retries to handle transient failures.

Unfortunately, time means nothing in “testing”. Timed operations such as sleep, timed wait, and timed try lock are semantically equivalent to a yield (I am simplifying a bit here, since wait and lock may actually consume the notify or acquire the lock). So in testing, if we really want to model the program behavior precisely, we should replace these operations with yield calls and let a scheduler explore all possible execution paths. But if you do so in a distributed system, you may find that the test barely makes progress: different nodes may DDoS each other through heartbeat messages because the rate limiter is usually implemented with sleep; requests sent to other machines constantly fail because the sender does not wait long enough; or some client gets stuck in a retry loop forever.

To solve this issue, we introduced a virtual clock with ordering semantics among timed operations, so that a thread that calls sleep(100) is unblocked before a thread that calls sleep(500) at the same moment. This solution is effective and helped us avoid many false positives.
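A minimal sketch of such ordering semantics, under the assumption that sleepers are kept in a priority queue keyed by their virtual wake-up time. This is not Diorama's implementation: VirtualClockSketch, Sleeper, sleep, wakeNext, and demo are hypothetical names, and the sketch tracks named sleepers rather than real threads.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a virtual clock; not Diorama's actual implementation.
final class VirtualClockSketch {
    private static final class Sleeper {
        final String name;
        final long wakeAt; // virtual time at which the sleeper may resume
        final long seq;    // insertion order, breaks ties deterministically

        Sleeper(String name, long wakeAt, long seq) {
            this.name = name;
            this.wakeAt = wakeAt;
            this.seq = seq;
        }
    }

    private long now = 0;
    private long nextSeq = 0;
    private final PriorityQueue<Sleeper> sleepers = new PriorityQueue<>(
            Comparator.comparingLong((Sleeper s) -> s.wakeAt)
                      .thenComparingLong(s -> s.seq));

    // Models Thread.sleep(millis): record a deadline instead of blocking.
    void sleep(String name, long millis) {
        sleepers.add(new Sleeper(name, now + millis, nextSeq++));
    }

    // Wakes the sleeper with the earliest deadline and jumps the clock to it.
    // Returns null when nobody is sleeping.
    String wakeNext() {
        Sleeper s = sleepers.poll();
        if (s == null) {
            return null;
        }
        now = Math.max(now, s.wakeAt);
        return s.name;
    }

    long now() {
        return now;
    }

    // The longer sleep is registered first, yet the shorter one wakes first.
    static List<String> demo() {
        VirtualClockSketch clock = new VirtualClockSketch();
        clock.sleep("slow", 500);
        clock.sleep("fast", 100);
        List<String> order = new ArrayList<>();
        String name;
        while ((name = clock.wakeNext()) != null) {
            order.add(name);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [fast, slow]
    }
}
```

Because the clock jumps straight to the next deadline, no real time passes while every node is asleep, and the relative order of timeouts is preserved.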

However, implementing a virtual clock correctly is harder than it sounds. Many distributed systems have implicit assumptions about how long operations take relative to their timeouts. Replacing physical time with a virtual clock can violate these assumptions. For example, one natural implementation is to advance the virtual clock by a fixed amount per instruction executed. But this can cause the clock to “tick” much faster than real time relative to actual work done, making timeouts expire prematurely and causing spurious request failures. Getting the virtual clock to respect the system’s timing expectations requires careful tuning (sometimes done by coding agents).
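The arithmetic behind premature timeouts can be sketched in a few lines. All numbers here are made up for illustration (the instruction count, the tick sizes, and the one-second timeout), and TickRateSketch with its methods is a hypothetical name, not part of Diorama.

```java
// Hypothetical sketch: shows how a fixed tick per instruction interacts with a timeout.
final class TickRateSketch {
    // Virtual time consumed by a piece of work under a fixed tick per instruction.
    static long virtualNanos(long instructions, long nanosPerInstruction) {
        return Math.multiplyExact(instructions, nanosPerInstruction);
    }

    // True if the work takes longer, in virtual time, than the timeout allows.
    static boolean timesOut(long instructions, long nanosPerInstruction, long timeoutNanos) {
        return virtualNanos(instructions, nanosPerInstruction) > timeoutNanos;
    }

    public static void main(String[] args) {
        long work = 5_000_000L;        // instructions to handle one request (made up)
        long timeout = 1_000_000_000L; // a one-second request timeout

        // At 1 microsecond per instruction the request "takes" 5 virtual seconds.
        System.out.println(timesOut(work, 1_000L, timeout)); // prints true

        // At 1 nanosecond per instruction it takes 5 virtual milliseconds.
        System.out.println(timesOut(work, 1L, timeout));     // prints false
    }
}
```

The same request either always fails or always succeeds depending only on the chosen tick size, which is why the tick has to be tuned against the timeouts the system actually uses.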

On the other hand, this solution comes at the cost of missing opportunities to find real bugs, because many concurrency bugs in distributed systems are caused by abnormal timing behavior. A promising next step might be to introduce timing anomalies selectively: for instance, occasionally letting a shorter sleep be delayed past a longer one, or injecting a timeout event at a specific operation. This would preserve progress in the common case while still exploring the timing-related edge cases where real bugs hide.
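One way such selective anomalies could look, as a rough sketch of an idea that is not implemented in Diorama: wake the earliest sleeper by default, but with a small probability wake a later one instead. The names TimingAnomalySketch, Sleeper, wakeOrder, and the anomalyRate parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of selective timing anomalies; not an existing Diorama feature.
final class TimingAnomalySketch {
    static final class Sleeper {
        final String name;
        final long wakeAt; // virtual wake-up deadline

        Sleeper(String name, long wakeAt) {
            this.name = name;
            this.wakeAt = wakeAt;
        }
    }

    // Returns the order in which sleepers are woken. Normally the earliest
    // deadline goes first. With probability anomalyRate, a later sleeper is
    // woken first, which models the earliest one being delayed.
    static List<String> wakeOrder(List<Sleeper> sleepers, double anomalyRate, long seed) {
        Random random = new Random(seed);
        List<Sleeper> waiting = new ArrayList<>(sleepers);
        waiting.sort(Comparator.comparingLong((Sleeper s) -> s.wakeAt));

        List<String> order = new ArrayList<>();
        while (!waiting.isEmpty()) {
            int index = 0; // default: earliest deadline first
            if (waiting.size() > 1 && random.nextDouble() < anomalyRate) {
                index = 1 + random.nextInt(waiting.size() - 1); // skip the earliest
            }
            order.add(waiting.remove(index).name);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Sleeper> sleepers = new ArrayList<>();
        sleepers.add(new Sleeper("slow", 500));
        sleepers.add(new Sleeper("fast", 100));

        System.out.println(wakeOrder(sleepers, 0.0, 1L)); // prints [fast, slow]
        System.out.println(wakeOrder(sleepers, 1.0, 1L)); // prints [slow, fast]
    }
}
```

A rate of 0.0 reproduces the strict ordering, and intermediate rates mix mostly ordered wake-ups with occasional inversions. Keeping the rate low would preserve progress while still reaching the inverted cases.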

Shutting Down a Distributed System

Diorama relies on Fray to perform concurrency testing. Fray runs the test multiple times with different schedules. To improve performance, Fray reuses the same JVM for all test runs in a loop.

At each iteration, Fray expects the test to clean up all the resources it uses: closing open connections, releasing locks, and terminating all threads. Unfortunately, this is not always the case for real-world distributed systems. Many distributed systems do not provide a clean shutdown mechanism. They assume the way to stop the system is to kill the process. As a result, you may find code like the following:

public class ConnectionEventExecutor {
    Logger          logger   = BoltLoggerFactory.getLogger("CommonDefault");
    ExecutorService executor = new ThreadPoolExecutor(1, 1, 60L, TimeUnit.SECONDS,
                                 new LinkedBlockingQueue<Runnable>(10000),
                                 new NamedThreadFactory("Bolt-conn-event-executor", true));

    public void onEvent(Runnable runnable) {
        try {
            executor.execute(runnable);
        } catch (Throwable t) {
            logger.error("Exception caught when execute connection event!", t);
        }
    }
}

This class creates a ThreadPoolExecutor without ever providing a way to shut it down. When the test finishes and tries to start the next iteration, the leftover threads from the previous run are still alive. They may hold references to stale state, interfere with newly created nodes, or cause Fray’s scheduler to hang waiting for threads that will never terminate.
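The leftover threads are easy to observe with plain JDK APIs. The sketch below is not part of Fray or Diorama. It snapshots the live threads before an iteration using Thread.getAllStackTraces() and reports what is still alive afterwards. ThreadLeakSketch, liveThreads, leakedSince, and the "leaky-worker" thread name are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: detects threads that outlive a test iteration.
final class ThreadLeakSketch {
    // Snapshot of the threads currently alive in this JVM.
    static Set<Thread> liveThreads() {
        return new HashSet<>(Thread.getAllStackTraces().keySet());
    }

    // Threads that are alive now but were not in the earlier snapshot.
    static Set<Thread> leakedSince(Set<Thread> before) {
        Set<Thread> leaked = liveThreads();
        leaked.removeAll(before);
        return leaked;
    }

    // Simulates one iteration that creates an executor and never shuts it down.
    static Set<String> demo() {
        Set<Thread> before = liveThreads();

        ExecutorService executor = Executors.newFixedThreadPool(1, r -> {
            Thread t = new Thread(r, "leaky-worker");
            t.setDaemon(true);
            return t;
        });
        executor.execute(() -> { }); // starts the worker thread

        // The "iteration" ends here without executor.shutdown().
        Set<String> names = new TreeSet<>();
        for (Thread t : leakedSince(before)) {
            names.add(t.getName());
        }

        executor.shutdownNow(); // cleanup for this demo only
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Threads still alive after the iteration: " + demo());
    }
}
```

The worker stays parked on the executor's queue after its task finishes, so it shows up in the second snapshot. A check like this only detects the leak. It does not terminate the thread.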

To solve this issue, Fray introduces abortThreadExecutionAfterMainExit, which throws exceptions in application threads when the main test thread exits. But there is a way to sidestep the issue entirely. The reason Fray runs tests in a loop within a single JVM is to avoid the bootstrap overhead of starting a new process each time. If instead we had an efficient way to snapshot the JVM state at an arbitrary point (similar to what CRaC does for coordinated checkpoint/restore), we could restore from a clean snapshot at the start of each iteration. This would eliminate the need for graceful shutdown altogether.

The Future

Most distributed systems are designed without simulation testing in mind — largely because no mature simulation testing frameworks exist for them yet. It is both a cultural and technical chicken-and-egg problem: developers do not design for testability because the tooling is not there, and the tooling struggles because existing systems were not designed for it. Today, simulation testing remains a niche technique practiced by a handful of teams (FoundationDB, TigerBeetle, Antithesis) that built their systems around it from day one.

My hope is that as more simulation testing frameworks emerge, developers will begin to internalize testability as a design constraint. Diorama is a small step in this direction. It is far from perfect, but I hope it demonstrates that simulation testing does not have to be an all-or-nothing proposition reserved for greenfield projects.
