Discovering a JDK Race Condition, and Debugging it in 30 Minutes with Fray

I’ve been adding more integration tests for Fray recently. To ensure Fray can handle different scenarios, I wrote many creative test cases. Many of them passed as expected, while some failures led to epic fixes in Fray. Then something unexpected happened: Fray threw a deadlock exception while testing the following seemingly innocent code:

private void test() {
    ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
    // Shutdown thread.
    new Thread(() -> {
        executor.shutdown();
    }).start();
    try {
        ScheduledFuture<?> future = executor.schedule(() -> {
            Thread.yield();
        }, 10, TimeUnit.MILLISECONDS);
        try {
            future.get();
            Thread.yield();
        } catch (Throwable e) {}
    } catch (RejectedExecutionException e) {}
}

This code creates a ScheduledThreadPoolExecutor, schedules a task, and shuts down the executor in another thread. Initially, I suspected a bug in Fray, but after investigation, I discovered that the deadlock was actually caused by a bug in the JDK itself.

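Debugging this issue was straightforward thanks to Fray’s deterministic replay and schedule visualization. To understand the deadlock, let’s first take a look at the implementation of ScheduledThreadPoolExecutor. The listing below is condensed, with the relevant helper methods (delayedExecute, addWorker, tryTerminate) inlined: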

 1  public class ScheduledThreadPoolExecutor extends ThreadPoolExecutor implements ScheduledExecutorService {
 2      public ScheduledFuture<?> schedule(Runnable command, long delay, TimeUnit unit) {
 3          // ...
 4          RunnableScheduledFuture<?> task = decorateTask(...);
 5          // delayedExecute method
 6          super.getQueue().add(task);
 7
 8          // addWorker method
 9          for (int c = ctl.get();;) {
10              if (runStateAtLeast(c, SHUTDOWN)
11                  && (runStateAtLeast(c, STOP)
12                      || firstTask != null
13                      || workQueue.isEmpty()))
14                  return ...;
15              // add task to a worker thread
16          }
17          return task;
18      }
19
20      public void shutdown() {
21          // tryTerminate method
22          int c = ctl.get();
23          if (isRunning(c) ||
24              runStateAtLeast(c, TIDYING) ||
25              (runStateLessThan(c, STOP) && ! workQueue.isEmpty()))
26              return;
27          if (workerCountOf(c) != 0) { // Eligible to terminate
28              interruptIdleWorkers(ONLY_ONE);
29              return;
30          }
31
32          if (ctl.compareAndSet(c, ctlOf(TIDYING, 0))) {
33              //...
34              ctl.set(ctlOf(TERMINATED, 0));
35          }
36      }
37  }

Bug Behavior

The ScheduledThreadPoolExecutor schedules tasks by adding them to a work queue and executing them in worker threads. Depending on the executor’s state, users would expect the following behavior:

State	`ScheduledThreadPoolExecutor.schedule`	`FutureTask.get`
RUNNING	returns task	blocks until task completes
SHUTDOWN	throws `RejectedExecutionException`	throws `CancellationException`

However, Fray revealed that when the ScheduledThreadPoolExecutor is in the SHUTDOWN state, the FutureTask.get method may block indefinitely waiting for the task to complete.
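To make "blocks indefinitely" concrete, here is a minimal sketch of what the caller sees and how a bounded wait turns the hang into an observable outcome. It is illustrative only: OrphanedFutureDemo and waitBounded are hypothetical names, and a FutureTask that no thread ever runs stands in for the task the executor never executes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of Fray or the JDK.
class OrphanedFutureDemo {
    // Waits at most timeoutMillis and reports what happened,
    // instead of blocking forever like an untimed get().
    static String waitBounded(FutureTask<?> future, long timeoutMillis) {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (TimeoutException e) {
            return "timed out";
        } catch (ExecutionException e) {
            return "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // Never handed to any thread, so an untimed get() would never return.
        FutureTask<Void> orphan = new FutureTask<>(() -> null);
        System.out.println("orphan: " + waitBounded(orphan, 100));

        // Run on the current thread, so get() returns immediately.
        FutureTask<Void> finished = new FutureTask<>(() -> null);
        finished.run();
        System.out.println("finished: " + waitBounded(finished, 100));
    }
}
```

A timeout does not fix the race; it only keeps a test or caller from hanging while the stranded task is investigated.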

How the Bug Occurs

The bug manifests when Fray interleaves the schedule method and the shutdown method. Starting at line 9 of the listing (the inlined addWorker logic), the executor tries to add a new worker to execute tasks. It first checks whether the executor is in a state that allows the task to run. If the executor is shutting down and the condition at lines 10-13 holds, the schedule method assumes that the shutdown process will terminate the task, so it simply returns (line 14) without creating a new worker thread. However, this assumption breaks when the shutdown method has already moved the executor past the point where it would terminate the task, leaving the task in limbo.

The following stack traces illustrate the problematic interleaving. The main thread (test worker) is about to create a new worker for the task, while the shutdown thread (Thread-3) is executing tryTerminate and setting the state to TIDYING.

Next, Fray yields to the main thread, which performs the condition check at line 10 of the listing. The executor is now in the TIDYING state, so both runStateAtLeast(c, SHUTDOWN) and runStateAtLeast(c, STOP) return true. The schedule method then returns the task without starting a worker to run it.
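The reason both checks succeed in TIDYING is that the run states are ordered integers. The sketch below mirrors the bit layout that, to my understanding of the OpenJDK source, ThreadPoolExecutor uses for ctl (run state in the top three bits, worker count in the rest). Treat the constants as an assumption about one JDK implementation; CtlStateSketch and addWorkerGivesUp are hypothetical names for illustration.

```java
// Hypothetical sketch of the ctl encoding; it mirrors, but is not, the JDK source.
class CtlStateSketch {
    static final int COUNT_BITS = Integer.SIZE - 3;
    static final int COUNT_MASK = (1 << COUNT_BITS) - 1;

    // Run states live in the high bits and are numerically ordered.
    static final int RUNNING    = -1 << COUNT_BITS;
    static final int SHUTDOWN   =  0 << COUNT_BITS;
    static final int STOP       =  1 << COUNT_BITS;
    static final int TIDYING    =  2 << COUNT_BITS;
    static final int TERMINATED =  3 << COUNT_BITS;

    static int ctlOf(int runState, int workerCount) { return runState | workerCount; }
    static int workerCountOf(int c) { return c & COUNT_MASK; }
    static boolean runStateAtLeast(int c, int s) { return c >= s; }
    static boolean isRunning(int c) { return c < SHUTDOWN; }

    // The early-return condition from lines 10-13 of the listing.
    static boolean addWorkerGivesUp(int c, boolean hasFirstTask, boolean queueEmpty) {
        return runStateAtLeast(c, SHUTDOWN)
                && (runStateAtLeast(c, STOP) || hasFirstTask || queueEmpty);
    }

    public static void main(String[] args) {
        int shutdown = ctlOf(SHUTDOWN, 0);
        int tidying = ctlOf(TIDYING, 0);
        // SHUTDOWN with a queued task: a worker is still added.
        System.out.println("SHUTDOWN gives up: " + addWorkerGivesUp(shutdown, false, false));
        // TIDYING with a queued task: no worker is added.
        System.out.println("TIDYING gives up:  " + addWorkerGivesUp(tidying, false, false));
    }
}
```

Under this encoding, a queued task is still served while the state is SHUTDOWN, but once the state reaches TIDYING the check gives up regardless of what is in the queue.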

Meanwhile, in the shutdown thread, execution reaches ctl.set(ctlOf(TERMINATED, 0)) at line 34 of the listing. The ctl is set to TERMINATED, and the shutdown thread completes. At this point, no thread will execute or cancel the task, so the main thread's call to future.get() blocks forever.

More Details

While the bug is conceptually simple, reaching this state is not straightforward because ScheduledThreadPoolExecutor and ThreadPoolExecutor are designed to prevent such situations. For example, in the tryTerminate logic (lines 23-29), ThreadPoolExecutor checks whether the work queue is empty and whether workers still need to be interrupted before setting the state to TIDYING. However, Fray demonstrates that this check can be bypassed if the execution of super.getQueue().add(task) in the main thread is paused until the shutdown thread has passed the check and reached the ctl.compareAndSet(c, ctlOf(TIDYING, 0)) statement. This is a classic check-then-act race condition.
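The interleaving can be walked through in a toy, single-threaded model. This is a sketch under heavy assumptions: InterleavingModel and all of its methods are hypothetical, an enum replaces the real ctl word, and each method stands for one step of the condensed listing rather than actual JDK code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the race; NOT the JDK implementation.
class InterleavingModel {
    enum State { RUNNING, SHUTDOWN, STOP, TIDYING, TERMINATED }

    State state = State.RUNNING;
    int workers = 0;
    final Queue<String> workQueue = new ArrayDeque<>();

    boolean atLeast(State s) { return state.compareTo(s) >= 0; }

    // Shutdown thread: enter SHUTDOWN, then evaluate the guard (lines 23-30).
    // Returns true if the thread decides it may move on to TIDYING.
    boolean shutdownPassesGuard() {
        state = State.SHUTDOWN;
        boolean mustReturn = state == State.RUNNING
                || atLeast(State.TIDYING)
                || (!atLeast(State.STOP) && !workQueue.isEmpty());
        return !mustReturn && workers == 0;
    }

    // In the real code this is a compare-and-set on ctl. Enqueueing a task
    // does not change ctl, so the CAS still succeeds after the enqueue.
    void shutdownSetsTidying() { state = State.TIDYING; }
    void shutdownSetsTerminated() { state = State.TERMINATED; }

    // Main thread: line 6 of the listing.
    void mainEnqueues(String task) { workQueue.add(task); }

    // Main thread: lines 9-16. Returns true if a worker was started.
    boolean mainTriesToAddWorker() {
        if (atLeast(State.SHUTDOWN) && (atLeast(State.STOP) || workQueue.isEmpty())) {
            return false;
        }
        workers++;
        return true;
    }

    // A task is queued, nobody will run it, and the pool is terminated.
    boolean taskIsStranded() {
        return state == State.TERMINATED && workers == 0 && !workQueue.isEmpty();
    }

    // The order Fray found: the guard runs before the enqueue.
    static InterleavingModel racyOrder() {
        InterleavingModel m = new InterleavingModel();
        boolean proceed = m.shutdownPassesGuard(); // queue is still empty
        m.mainEnqueues("task");
        if (proceed) m.shutdownSetsTidying();
        m.mainTriesToAddWorker();                  // sees TIDYING and gives up
        if (proceed) m.shutdownSetsTerminated();
        return m;
    }

    // The common order: the enqueue happens before the guard.
    static InterleavingModel benignOrder() {
        InterleavingModel m = new InterleavingModel();
        m.mainEnqueues("task");
        boolean proceed = m.shutdownPassesGuard(); // queue non-empty, guard bails out
        m.mainTriesToAddWorker();                  // SHUTDOWN with work queued: worker starts
        if (proceed) { m.shutdownSetsTidying(); m.shutdownSetsTerminated(); }
        return m;
    }

    public static void main(String[] args) {
        System.out.println("racy order strands task:   " + racyOrder().taskIsStranded());
        System.out.println("benign order strands task: " + benignOrder().taskIsStranded());
    }
}
```

The only difference between the two orders is whether the enqueue lands before or after the guard. That is why the window is so narrow in practice.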

Debugging with Fray

Imagine discovering this bug in your codebase. You observe a thread blocked on FutureTask.get, but you don’t understand why. You cannot reproduce the bug because when you rerun the test, the deadlock disappears. Adding logging makes it disappear. Using a debugger makes it disappear. This is the notorious “Heisenbug” phenomenon common in concurrent programming.
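For contrast, this is roughly what hunting the bug without a controlled scheduler looks like: rerun the scenario many times with a bounded wait and count the hangs. It is a sketch, not a reliable reproducer. StressLoopSketch, runOnce, and countHangs are hypothetical names, the timeout and iteration count are arbitrary, and the loop may well report zero hangs on any given machine or JDK build.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stress harness using only standard JDK classes.
class StressLoopSketch {
    // Runs the scenario once; returns true if get() did not return in time.
    static boolean runOnce(long timeoutMillis) throws InterruptedException {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        Thread shutdownThread = new Thread(() -> executor.shutdown());
        shutdownThread.start();
        boolean hung = false;
        try {
            ScheduledFuture<?> future = executor.schedule(() -> {
                Thread.yield();
            }, 10, TimeUnit.MILLISECONDS);
            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                hung = true; // the task was accepted but never ran
            } catch (ExecutionException | CancellationException e) {
                // The task failed or was cancelled; not the hang we are after.
            }
        } catch (RejectedExecutionException e) {
            // Shutdown won the race cleanly and the task was rejected.
        } finally {
            shutdownThread.join();
            executor.shutdownNow(); // release any leftover worker
        }
        return hung;
    }

    // Repeats the scenario and counts how many runs hung.
    static int countHangs(int iterations, long timeoutMillis) {
        int hangs = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                if (runOnce(timeoutMillis)) hangs++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hangs;
    }

    public static void main(String[] args) {
        int iterations = 200;
        int hangs = countHangs(iterations, 500);
        System.out.println(hangs + " of " + iterations + " runs hung");
    }
}
```

Because the outcome depends on the OS scheduler, a clean run proves nothing, which is the gap a deterministic scheduler like Fray is meant to close.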

This time, with Fray, you get a deterministic replay file that allows you to replay the execution step by step. You can observe the exact thread interleaving that triggers the bug and visualize the thread scheduling to understand the root cause.

Try it Yourself!

To experience this yourself, clone the JDK bug repository and open it with IntelliJ IDEA.

Then download the Fray plugin.

In IntelliJ IDEA, open the ScheduledThreadPoolExecutorTest class and navigate to the testWithFray method. Click the Run icon (▶️) next to the testWithFray method, select the first Run 'ScheduledThreadPoolExecutorTest.testWithFray()' button, and then select frayTest. If Fray finds the bug, it will display messages in the Run tool window.

Look for output such as "2025-06-07 13:58:06 [INFO]: The recording is saved to PATH_TO_REPLAY_FILE" and note the path. You will need it to replay the bug.

Replay and Understand the Bug

Copy the path to the replay file, navigate to the replayWithFray method, and paste the path into the replay field in the @ConcurrencyTest annotation. Click the Run icon (▶️) next to the replayWithFray method, select Replay (Fray) 'ScheduledThreadPoolExecutorTest.replayWithFray()', and then select frayTest.

Fray will replay the bug and pause at each context switch point (e.g., when the main thread is paused and yields to the shutdown thread). You can click the “Next step” button to step through the replay and observe how the bug unfolds. The Fray debugger also visualizes the thread timeline and highlights the currently executing lines in the editor.

Note that Fray is designed for application concurrency testing, so it hides highlights in JDK methods by default—you may only see highlights in the test code itself. However, since this capability proves valuable for testing the JDK, we plan to add a feature to show highlights in JDK methods in future releases.

Reporting the Bug

When submitting this bug report, I created a patch for the JDK that adds sleep statements to trigger the bug. However, the JDK team didn’t include this patch in the public bug report. Instead, the final report only described how to reproduce the bug using Fray.

Happy debugging.
