Working with multiple threads is one of the most complex problems we may encounter in our daily work. When faced with multithreading, most people immediately reach for blocking approaches. In Java, that means the synchronized keyword or some less painful mechanism like ReentrantLock. Locks are not the only option, though: Lock-Free programming is another way.
In this text, I will show problems, techniques, and best practices related to Lock-Free programming. I will also provide a real-life example of how to implement a Lock-Free stack, and share common patterns for moving from Lock-Free to Wait-Free.
Here you can find only the most interesting code samples. The full source code is available in my GitHub repository.
What is Lock-Free Programming?
Guarantee: Lock-free algorithms guarantee system-wide progress. Within a finite number of steps taken by all threads, at least one thread completes its operation.
Typical techniques: atomic primitives such as CAS, LL/SC, and FAA.
Pros
- Scales well under contention because there is no kernel blocking or context-switch overhead.
- Eliminates deadlocks and livelocks.
Cons
- Harder to design and reason about.
- Higher cost under very low contention compared with a simple mutex.
- Starvation is still possible—one unlucky thread might repeatedly fail while others succeed, but the program as a whole keeps moving forward.
What is Wait-Free Programming?
Guarantee: A Wait-Free algorithm provides a per-thread progress bound. Every thread finishes its operation in a finite number of its own steps, regardless of what the other threads do.
Typical techniques: per-thread operation descriptors and bounded helping functions.
Pros
- All the benefits of Lock-Free programming
- Real-time friendly
- Eliminates the starvation problem
Cons
- Significantly more memory and bookkeeping.
- Often slower in the uncontended “happy path”
- Harder (sometimes impossible) to create for complex data structures.
Blocking vs Lock-Free vs Wait-Free – The Progress Guarantees
Deadlock
Deadlock is probably the most common error in multithreaded programming. Two (or more) threads each hold resources the others need and wait forever for those resources to be released. None of the threads can move forward with its processing, so we end up at a stand-still: a Deadlock.
Thread A: lock(m1); lock(m2);
Thread B: lock(m2); lock(m1);
If A acquires m1 and B acquires m2, both block forever on the second lock().
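The classic cure is a global lock order: if every thread acquires m1 before m2, the circular wait cannot form. A minimal sketch (class and lock names are my own):

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    static final ReentrantLock m1 = new ReentrantLock();
    static final ReentrantLock m2 = new ReentrantLock();

    // Every thread takes m1 first and m2 second, so no lock cycle is possible.
    static void doWork() {
        m1.lock();
        try {
            m2.lock();
            try {
                // critical section
            } finally {
                m2.unlock();
            }
        } finally {
            m1.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(LockOrdering::doWork);
        Thread b = new Thread(LockOrdering::doWork);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("done");
    }
}
```

With the mixed ordering from the pseudocode above, the same two threads could block forever; with one agreed order they always terminate.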
Livelock
A livelock is motion without progress. Like two people in a hallway, sidestepping left and right in perfect sync and never getting past each other. Threads are active—spinning, retrying, sending messages, but the system never completes a useful operation. Usually caused by over-reactive collision‐avoidance or continual retries without an eventual terminal state.
Starvation
Starvation means a particular thread never gets the resources or CPU time it needs, even though the system as a whole keeps making progress. If other threads hit the same data structure with high frequency, one unlucky thread may have to retry many times because its compare-and-swap keeps failing. This high contention can cause the thread to starve.
Problem | Lock-Free Programming | Wait-Free Programming |
---|---|---|
Deadlock | Impossible. No thread ever waits while holding a lock, so circular-wait conditions can’t arise. | Impossible. Same lock-free property plus bounded completion time for every thread. |
Livelock | Prevented. At least one operation must complete in a finite number of steps, so the system can’t get stuck in endless mutual retries. | Prevented. Every operation finishes in a bounded number of steps, so the whole system and each thread move forward. |
Starvation | Possible. System makes progress, but an unlucky thread may be perpetually overtaken by others. | Impossible. Each thread completes within a fixed bound, so no one can be starved. |
Atomic Primitives That Power Non-Blocking Algorithms
Compare-and-Swap (CAS)
Compare-and-Swap (CAS) is probably the single most famous atomic instruction. In one atomic step, the processor:
- reads a memory word,
- checks whether that word still holds an expected value,
- replaces it with a new value only if the comparison succeeds.
That simple “check-and-set” sequence lets a thread act as if it briefly owned the variable without ever locking it.
Most, if not all, non-blocking data structures rely on this instruction mixed with a loop to work correctly. You will see one of them below, in the form of a LockFreeStack. However, there is a catch or two hidden here. First, as contention rises, the loop takes more and more attempts to complete. The other is known as the A-B-A problem.
In Java, all the Atomic classes expose compareAndSet (and, since Java 9, compareAndExchange) methods. Note that the volatile keyword alone only guarantees visibility; it does not provide the atomic read-modify-write that CAS requires.
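As a small illustration (class name and values are mine), the canonical CAS retry loop on an AtomicInteger:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasLoop {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(20);

        // Classic CAS retry loop: snapshot, compute, attempt to swap.
        int oldVal, newVal;
        do {
            oldVal = counter.get();      // read the current value
            newVal = oldVal * 2;         // compute the replacement
            // compareAndSet succeeds only if counter still holds oldVal
        } while (!counter.compareAndSet(oldVal, newVal));

        System.out.println(counter.get()); // prints 40
    }
}
```

If another thread changes the counter between the get and the compareAndSet, the CAS fails and the loop simply re-snapshots and retries.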
Load-Link / Store-Conditional (LL/SC)
One of the ways to address the A-B-A problem from above is the LL/SC instruction pair. It consists of two steps, atomic with respect to other threads, effectively creating a "split CAS."
Step | What happens |
---|---|
LL | Load the word at addr and place the CPU into a reservation state for that cache line |
SC | If the reservation is still valid (no intervening write by another core), stores the new value; otherwise does nothing and returns false |
The main difference between CAS and LL/SC is the reservation mechanism, which lets it dodge the A-B-A problem. Unfortunately, LL/SC is not a silver bullet and has drawbacks. The biggest one is that the reservation can be lost for reasons other than contention. In such a case we lose all the benefits of using this primitive.
In Java, the closest thing to LL/SC is AtomicStampedReference.
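A brief sketch (my own example) of how AtomicStampedReference pairs the reference with a version stamp, so a stale snapshot fails even when the value has cycled back:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class StampedDemo {
    public static void main(String[] args) {
        AtomicStampedReference<String> top =
                new AtomicStampedReference<>("A", 0);

        int[] stampHolder = new int[1];
        String current = top.get(stampHolder);   // read value + stamp together

        // Another thread changes A -> B -> A, bumping the stamp each time.
        top.compareAndSet("A", "B", 0, 1);
        top.compareAndSet("B", "A", 1, 2);

        // CAS with the stale stamp (0) fails even though the value is "A" again.
        boolean swapped = top.compareAndSet(
                current, "C", stampHolder[0], stampHolder[0] + 1);
        System.out.println(swapped);             // prints false
    }
}
```

The comparison checks both reference identity and the stamp, so an A-to-B-to-A sequence that advances the stamp is detected.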
Fetch-and-Add (FAA)
Fetch-and-Add (FAA), or AtomicAdd, performs the following operations in a single atomic step:
- Reads the current value at addr.
- Adds a delta k.
- Stores the result back.
- Returns the old value (some variants return the new one).
There are other variants of this primitive, like Fetch-and-Subtract, and FAA is closely related to primitives such as Fetch-and-Store and Test-and-Set.
The main disadvantage of this primitive is what the name suggests: FAA only supports arithmetic transforms of the form old + k. To do anything more complex, you will still need CAS or LL/SC.
In Java, all the numeric Atomic classes expose getAndIncrement, incrementAndGet, and getAndAdd methods.
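For example (harness mine), a shared counter where every increment is a single FAA with no retry loop:

```java
import java.util.concurrent.atomic.AtomicLong;

public class FaaCounter {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong hits = new AtomicLong();

        // Each increment is one fetch-and-add instruction; no CAS loop needed.
        Runnable worker = () -> {
            for (int i = 0; i < 100_000; i++) {
                hits.getAndIncrement();   // returns the old value
            }
        };
        Thread a = new Thread(worker), b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();

        System.out.println(hits.get());   // prints 200000
    }
}
```

Because no increment can fail, no increment can be retried, which is why FAA counters behave so well under contention.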
Double-Width CAS
This is an extension of classic CAS. Double Width CAS (or DCAS) extends the regular CAS idea to two adjacent machine words treated as a single 128-bit (or larger) unit. Both sub-words must match their expected values for the store to occur.
In pseudocode:
cas2(expected_lo, expected_hi, new_lo, new_hi);
It allows you to store a pointer in the low word and a tag in the high word. Both words must match, so resurrecting an old pointer with the same address but a different tag fails, which solves the A-B-A problem.
DCAS lets some lock-free data structures swap pairs (pointer/counter or value/status) atomically. The main drawback is a heavier micro-architectural cost than single-word CAS (it locks a 16-byte cache line).
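Java has no DCAS instruction, but a common substitute (sketched here with names of my choosing) is to box the pointer and the tag into one immutable object and swing a single AtomicReference to it:

```java
import java.util.concurrent.atomic.AtomicReference;

public class PairCas {
    // Immutable pair: the "pointer" (value) in one field, the tag in the other.
    record Tagged<T>(T value, long tag) {}

    public static void main(String[] args) {
        AtomicReference<Tagged<String>> slot =
                new AtomicReference<>(new Tagged<>("A", 0));

        Tagged<String> seen = slot.get();

        // A-B-A elsewhere: the value returns to "A" but the tag has advanced.
        slot.set(new Tagged<>("B", 1));
        slot.set(new Tagged<>("A", 2));

        // The CAS compares object identity of the whole pair, and the current
        // pair is a different object than the one we snapshotted, so it fails.
        boolean ok = slot.compareAndSet(seen, new Tagged<>("C", seen.tag() + 1));
        System.out.println(ok);   // prints false
    }
}
```

This trades the 16-byte hardware CAS for an extra allocation per update, which is usually an acceptable cost on the JVM.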
Primitives Summary
Primitive | General form | Typical uses | Common pitfalls |
---|---|---|---|
CAS | old ⇒ new | Universal lock-free building block | ABA problem, retry cost under contention |
DW-CAS | (old_lo, old_hi) ⇒ (new_lo, new_hi) | Pointer + tag pairs, ABA defense | Heavier locks 16-byte cache line |
LL/SC | LL; …; SC split | Portable lock-free ops when CAS not present | Reservation loss → more retries |
FAA | new = old + k | Counters, ticket locks, ref-counts | Cache-line ping-pong, limited to arithmetic ops |
Besides these four primitives, there are more complex ones like Atomic Exchange / Test-and-Set (TAS) and Fetch-and-Store (FAS).
Lock-Free Programming In Java – Lock-Free Stack
Base
The simplest Lock-Free stack has a single atomic top pointer; the push and pop CAS loops operate on that pointer. It is a singly linked stack, so the Node class only contains a next reference.
import java.util.concurrent.atomic.AtomicReference;
private record Node<E>(E value, Node<E> next) {
}
private final AtomicReference<Node<E>> top = new AtomicReference<>();
Lock-Free Push
Here we have the lock-free push. You can see the CAS loop wrapped around the top pointer.
/* ---------------- push ----------------------- */
public void push(E item) {
Node<E> newNode;
Node<E> oldTop;
do {
oldTop = top.get();
newNode = new Node<>(item, oldTop);
} while (!top.compareAndSet(oldTop, newNode));
}
What’s happening:
- Read the current head pointer.
- Build the replacement node – safe because nothing else can see newNode yet.
- CAS attempt cas(oldTop, newNode).
- If no one has changed top in the meantime, the CAS succeeds, and the loop exits.
- If another thread slipped in, the CAS fails, we loop to re-snapshot the new top.
Lock-Free Pop
public E pop() {
Node<E> oldTop;
Node<E> newTop;
do {
oldTop = top.get();
if (oldTop == null) return null;
newTop = oldTop.next();
} while (!top.compareAndSet(oldTop, newTop));
return oldTop.value();
}
What’s happening:
- Read the current head pointer.
- Null-check: nothing to pop → return immediately, no CAS needed.
- Assign a new top.
- CAS from oldTop ➜ newTop
- If no one has changed top in the meantime, the CAS succeeds, we atomically detach oldTop.
- If another thread slipped in, the CAS fails, we loop to re-snapshot the new top.
- Return — safe because oldTop is now private to this thread, GC will eventually reclaim it.
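Assembling the snippets above into one class, a quick sanity check (the test harness is my own) that concurrent pushes are all recovered by pops:

```java
import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStack<E> {
    private record Node<E>(E value, Node<E> next) {}

    private final AtomicReference<Node<E>> top = new AtomicReference<>();

    public void push(E item) {
        Node<E> oldTop, newNode;
        do {
            oldTop = top.get();
            newNode = new Node<>(item, oldTop);
        } while (!top.compareAndSet(oldTop, newNode));
    }

    public E pop() {
        Node<E> oldTop, newTop;
        do {
            oldTop = top.get();
            if (oldTop == null) return null;
            newTop = oldTop.next();
        } while (!top.compareAndSet(oldTop, newTop));
        return oldTop.value();
    }

    public static void main(String[] args) throws InterruptedException {
        LockFreeStack<Integer> stack = new LockFreeStack<>();
        Runnable pusher = () -> { for (int i = 0; i < 10_000; i++) stack.push(i); };
        Thread a = new Thread(pusher), b = new Thread(pusher);
        a.start(); b.start();
        a.join(); b.join();

        // Two threads pushed 10,000 items each; every one must come back out.
        int count = 0;
        while (stack.pop() != null) count++;
        System.out.println(count);   // prints 20000
    }
}
```

No element is lost or duplicated because each push and pop either installs its node atomically or retries from a fresh snapshot.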
From Lock-Free Programming to Wait-Free Programming
I won’t be presenting a complete Wait-Free implementation. It is long and complex, and explaining it correctly would probably double the size of this blog post. However, there are a few good sources on building Wait-Free data structures:
- Queues
- Stack
- general
There are a couple of common points shared between all Wait-Free implementations:
- Op Log — Each thread writes a small record describing what it wants to do.
- Per-thread slot reservation → each thread grabs a unique index with FAA and writes its op log there.
- Phase (ticket) numbers → Every request gets a monotonically increasing “phase”.
- Bounded Help loops → A helper will finish at most k foreign operations before returning to its own.
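The slot-reservation bullet alone can be sketched like this (names and sizes are mine; a real wait-free structure also needs phases and bounded helping). FAA hands each thread a unique slot, so publishing an op record never retries:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

public class OpLog {
    // A minimal operation descriptor: who wants to do what.
    record Op(long threadId, String what) {}

    static final int CAPACITY = 64;
    static final AtomicReferenceArray<Op> log = new AtomicReferenceArray<>(CAPACITY);
    static final AtomicInteger nextSlot = new AtomicInteger();

    // FAA gives every caller a unique index, so the write cannot conflict.
    static int announce(String what) {
        int slot = nextSlot.getAndIncrement();   // fetch-and-add reservation
        log.set(slot, new Op(Thread.currentThread().getId(), what));
        return slot;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable r = () -> announce("push");
        Thread a = new Thread(r), b = new Thread(r);
        a.start(); b.start();
        a.join(); b.join();

        System.out.println(nextSlot.get());      // prints 2: two unique slots
    }
}
```

Helpers can then scan the log and complete pending operations on other threads' behalf, which is what turns this bookkeeping into a wait-free guarantee.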
Summary
As you can see, neither Lock-Free nor Wait-Free programming is simple. However, with a couple of good tools and techniques, both become much easier to approach.
Key takeaways:
- Definitions
- Lock-free programming → from all active threads, at least one makes progress in a finite number of steps.
- Wait-free programming → every operation by every thread completes in a bounded number of its own steps.
- Atomic Primitives
- CAS
- LL/SC
- FAA
- Double-width CAS
- Best practices in implementing Lock-Free and Wait-Free structures:
- Design first, code later → Draw the state diagram and identify every location that can change concurrently.
- One atomic word to one logical invariant → Keep each CAS/LL-SC operating on the entire state you need to test.
- Use DCAS → rather than splitting an invariant across two variables.
- Write the fast path first → Attempt a cheap single-CAS path first; after N failures publish a descriptor and switch to the helping (“slow”) path.
- Bounded helping (Wait-Free) → ensure a helper can finish at most k foreign operations before returning to its own.
Thank you for your time.