JEP-171: Fence Intrisics Keep Memory In Line

Doug Lea just posted a new Java enhancement proposal with JEP-171 - Fence Intrinsics. This enhancement is all about exposing memory fence controls into Java code so that the java.util.concurrent APIs can more accurately and efficiently control memory ordering and bounding.

This is an interesting JEP as it proposes no Java consumer-oriented APIs; the only changes would be on the sun.misc.Unsafe class - which is already a fascinating pivot point for many of the advanced concurrency features of Java. Here is a Stack Overflow article that recaps many of them better than I could: “Interesting Uses of sun.misc.Unsafe”. This class (as you can guess from the package) is not intended for regular Java devs to access; it’s an implementation-specific class, and is really only meant for internal use by class library devs on the JDK. (That said, many folks have taken this class by the horns to wrangle the most out of the JVM anyway.)

What is proposed instead is to make ordering fences with memory a first-class citizen in the implementation layer of the JDK so that the various core APIs for concurrency can leverage fencing without having to resort on side-effects of other lower-level intrinsics (something that is done regularly today).

You may be asking why memory fencing is important. Modern CPUs can easily re-order memory accessing and storing when it can see via the upcoming instruction set that re-ordering will not impact the overall outcome of the program as considered by a single thread in the CPU. When you start throwing multiple threads or CPUs at a problem, out-of-order operations on memory that would otherwise go un-noticed could instead cause all kinds of data corruption and confusion. That’s why effectively all of the core synchronization and atomic operations in Java today implicitly carry a memory fence along with them; it’s part of the larger equation of protecting memory access.

This JEP includes three operations in the proposal:

  • Unsafe.loadFence() - Prevent reordering of load operations before this call with loads and stores after this call.
  • Unsafe.storeFence() - Prevent reordering of store operations before this call with loads and stores after this call.
  • Unsafe.fullFence() - Prevent reordering of all memory operations before this call with loads and stores after this call.

You can read more about memory fences in the Wikipedia Memory Barrier article.

It’s worth noting that the JEP does consider potentially surfacing memory fence operations to full devs at some point in the future given that sun.misc.Unsafe is already platform specific (making it risky for external libs to access), and may become impossible to access given the efforts of Jigsaw:

Adding these methods at the VM level permits use by JDK libraries in support of JDK 8 features, while also opening up the possibility of later exporting the base functionality via new java.util.concurrent APIs. This may become essential to allow people developing non-JDK low-level libraries if upcoming modularity support makes these methods impossible for others to access.

JEP-155: New Concurrency Utils Part 1 - Atomically Delicious

Concurrency geeks rejoice! Doug Lea and the JSR166 exper group are a perpetual fountain of classes, constructs, and frameworks for multi-threading, and they are at it once again; working to funnel more amazing roflscale concurrency primitives and not-so-primitives into Java 8 - this time in the form of JEP 155: Concurrency Updates.

There are a number of enhancements being proposed in this JEP, so I’ll probably tackle it in a few blog posts. For today, we’ll talk numbers!

Atomic-ish Numbers

It’s very very common to use atomic values in metrics gathering capacities in Java application. Applications everywhere (if they know what is good for them) are doing things like tracking total execution time, # of executions, and max execution time for as many data points as they can stomach.

Compared to the “olden days” of Java, atomics (particularly java.util.concurrent.atomic.AtomicLong) are a huge boon for this as they get rid of the overhead caused by contention when you’re dealing with synchronization. A particular code-flow will be much more concurrent when updating an atomic value than trying to fence through a synchronization boundary in heavy traffic.

However, all is not well in atomic land. While they are much better than everything that has come before them, they can still be quite disruptive. Updating an atomic value is, by definition, still fully shared between threads. While it’s a narrow operation, the guarantee is that after the operation is done, all CPU active on the system will see the same value in that memory location. To do this, the CPU core doing the work has to ensure that the value of that atomic is pushed back out to main memory; not just retained in the CPU cache. And in so doing, it has to keep trying to update the core memory until the incremental is what is expected (via a CAS - compare and swap operation). This CAS and memory fence ensures that staleness does not happen due to the CPU keeping data in local cache only, and not dumping out to slow system memory constantly.

However, this inherently means that, for this value, we are writing out to slow system memory constantly. So it’s not free. Additionally, what often can happen is that the atomic changing invalidates a line of cache shared by multiple values. This is fairly common because the contiguous cache blocks the CPU is designed to invalidate are generally large enough to hold multiple numeric values. This means that the CPU is forced to (unintentionally) re-fence multiple atomics; often atomics that are allocated together for the same data structure.

This lack of cache coherence and affinity can be a very real, and very hidden cost with all fence operations. Martin Thompson and Michael Barker (the wizards behind the much discussed LMAX Architecture and the disruptor stream processing framework) have a talk that covers this problem (and many others) in detail titled Lock-Free Algorithms.

Thompson and Barker show exactly how much this cache issue can impact performance, and how with a little adjusting, the atomic classes can be designed with what they call “mechanical sympathy” in mind; in other words, built to avoid sharing cache regions with other atomic values. So instead of this:

cache line 1 . . . . <atomic int val A> . . . . . <atomic int val B> . . .
cache line 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

You ideally would get this:

cache line 1 . . . . <atomic int val A> . . . . . . . . . . . . . . . . .
cache line 2 . . <atomic int val B> . . . . . . . . . . . . . . . . . . .

Now the two values are isolated from each other, reducing cache damaging significantly.

While I’m listing interesting related reads, another article just popped up in the past couple days that shows some very real metrics for the problems you can hit with Atomics in terms of program throughput. Marc Brooker just posted how atomics and volatiles in Java work on x86, and it shows exactly what a lack of mechanical sympathy can mean to your cache hit performance. The perf stat output he shows with the significant increase of LLC-load-misses is exactly what this padded cache is meant to mitigate as much as possible - of course, it can only do so much; the entire idea of an atomic implies there is some shared memory involvement, which is simply going to be slower. The goal is to narrow that slowness to specific values, as much as possible.

So mechanical sympathy changes help, but only get you so far. The work in JEP-155 doesn’t stop by exploring this idea. There is another issue with atomics as it pertains to being used for metrics-like monitoring in large scale Java systems. Their atomic requirement may actually be too stringent. Often times when you’re looking at something like an “average” execution time or “max” execution time, it’s acceptable to say the value can be eventually consistent across multiple updates. If there is a little margin for error (like you might miss one recent update on a read), it’s not going to impact your analysis in any real way.

Knowing this, the JSR166 folks have taken one of the ideas that are common in highly concurrent programming, and applied them to these types: striping, or segmenting.

If you’ve ever looked into the internals of the ConcurrentHashMap class, you may be familiar with the constructs it uses to minimize contention and avoid locking. Internally, the CHM uses a series of segments which represent sub-sections of the map, and can be worked with independently. By using multiple segments, the data structure is able to immediately divide the potential contention, at the expense of a little more memory used.

(Incidentally, the fastest JDBC connection pool I’ve ever benchmarked at scale uses this same mechanic to avoid pool contention - see BoneCP.)

This same idea is in place for a number of new atomic classes. There is an abstract Striped64 super-class that holds a collection of Cell objects, which are really just Atomics with special padding to help with mechanical sympathy. The cells are created lazily as CAS updates fail due to contention. What this means is that a single instance of this may have several internal atomics representing the value in the aggregate. Since this class doesn’t just sit on the CAS wall until it updates successfully, it doesn’t have the same repeated crawling down into shared memory from the CPU cache. Instead it will retry a set number of times, and if it can’t get the first cell to update, it creates another one. The next one is a fully separate atomic in memory, and is bound by its own memory fencing and contention.

What’s interesting about this is since the atomics are intentionally written to avoid cache collisions, each one can be independently represented on a CPU, and since the class will try to balance the cells to individual threads, it will avoid causing the painful contentious updates altogether, making the most valuable use of the atomics. Each thread can say “If I created a new cell because a previous cell was contentious, I’ll keep using that new cell” - this means that CAS collisions can theoretically be eliminated over time, with the expense of potentially creating a larger memory footprint for a single number.

Since the cells are all independently update-able and create-able, and based on multiple iterative passes, the process of calculating a value over them involves iterating each cell and getting a snapshot value from them independently. Because of that, it’s possible that during the iteration, some of the cells may be slightly out of date or non existent on my local thread.

Additionally, since this is all based on combining the values out of multiple cells together, it immediately implies that there are only certain accumulative operations where this segmentation can be used.

There are several new classes that extend Striped64 to provide relaxed-consistency classes for cases where you want to accumulate values in this way:

  • LongAdder; DoubleAdder - Classes that allow threads to accumulate values into a long or double in a thread-safe, and extremely low-contention path. Example Usage: Accumulating total execution time for a particular call so you can calculate average execution time.
  • LongMaxUpdater; DoubleMaxUpdater - Classes that allow threads to calculate the “max” value seen up to a point, again in a very low-contention implementation. Example Usage: Tracking the max execution time for a particular method call.
  • LongAdderTable - A handy extrapolation of long adder into a hash table form. This is an optimized map of LongAdders (although it doesn’t implement the Map interface for a number of reasons) that is specifically designed to be as efficient as possible in the use of long adder values per key, and allows for decrementing, incrementing, and adding by key. It also will create the adders for you automatically simply as an operation of you incrementing. Example Usage: Tracking several call execution times for an application all in one data structure.

Let’s look at how these classes work internally to bring home the idea of cells for contention management. If we look at the LongAdder class as an example - let’s say we have a simple counter value we’re tracking for some method call, and it looks like this at one point:

*-- Cell[0] = 51352

Our thread comes in to increment the first cell and we keep getting a CAS collision. So, based on the collision heuristics, we decide to create a new cell and increment there:

*-- Cell[0] = 51494
*-- Cell[1] = 1

Note that Cell[0] has incremented several times outside of our control (other threads), and we also created Cell[1] to hold our new value.

To calculate the sum total for this long adder, we simply sum all of the cells together, returning 51495.

The internal documentation on the Striped64 draft class has phenomenal notes on the algorithm).

While discussing atomic numbers, it should also be noted that there are implementations of AtomicDouble and AtomicDoubleArray laid out in the extras package of this draft work as well. These would ideally bring true atomic floating point to Java for the first time, but as indicated on the draft group site, the extras folder is probably not going to make it into Java 8 directly. They do show how these can be implemented, however, which is a feat in and of itself.

What’s Next?

There’s a good bit more in the JEP-155 bundle, including stamped locks and some nifty fork join improvements. Those are for another time, however.

Value Objects Proposal in Java with JEP 169

There have been a lot of interesting JEPs proposed recently for Java. With the pending onslaught of Lambdas, and with them, an entirely new programming model, there is a whole new set of enhancements and opportunities for both the language and the JVM to expand.

One of the recently announced JEPs is JEP 169 - Value Objects. Value objects are a number of tools for expicitly declaring that a Java object represents a raw set of vectorize-able values, rather than a composite set of memory values linked via pointers, and dynamically allocated on the heap.

As Java programmers, most of us are comfortable with (even if not entirely happy with) the distinction between primitives and objects. It’s one of those ugly, very un-academic pragmatics that you deal with because of the benefits it brings. Unfortunately, as the aforementioned JEP mentions, it’s also an ugly dividing line. You find yourself choosing between very low-level and hard-to-manage data structures and value types, or choosing more manageable high-level types, sacrificing the performance gains you would see otherwise:

In modern JVMs, object allocation is inexpensive, with a cost comparable to out-of-line procedure calling. But even this cost is often a painful overhead when compared to individual operations on primitive values. Thus, Java programmers face a binary choice between existing primitive types (which avoid allocation) and other types (which allow data abstraction and other benefits of classes). When they need to define small composite values such as complex numbers, pixels, or pairs of return values, neither approach serves. This dilemma often has no good solution, and the workarounds distort Java programs and APIs. Consider, for example, the lack of a good complex number type for those who program numeric algorithms in Java.

The general idea is allow programmers to give the JVM more knowledge about objects that are already designed to be immutable and simply represent a value, so that it can be treated in a more efficient way; much like primitives.

The example that is used repeatedly in the JEP is a class that represents a Complex number; call it java.lang.Complex. As you may or may not know, Java today doesn’t have a complex number value type. There are a number of open-source and academic examples of creating one, with the commonality between them being that they are written as a Java object. Usually, they are represented as a couple double values internally, and they have a numbef of methods for doing things like adding, subtracting, dividing, calculating cosine, and so forth. Apache-Commons Math has a perfect example.

However, if you think about it, there is really no reason this Complex value has a full heap allocated object. It’s really just a couple primitives that are bound together inextricably. But the only way to do this with pure primitives would be to juggle multiple primitive values manually, and that’s not exactly a pretty API: hard to explain to implementors, and very hard to add functionality around. Sadly, a lot of APIs that have to deal with quickly shuffling numeric values (such as rendering APIs) fall into the trap of having to deal with buckets o’ numbers.

This JEP is an attempt at tackling this problem. Once an object has been declared a value object, it can now be scalarized by the VM into a compact binary representation that can be propagated on the stack, and can even be interned in memory. In other words, the VM can treat the Complex class exactly like two double values held in a single memory vector; but the developer code can still handle it like an object, with all of the encapsulation and behavioral binding that implies.

Because the value object is being treated like a primitive, it now has no surrogate identity (no pointer location), and as such, it can’t be used for operations like synchronization and reference equality checks.

Boxing Benefits

One of the interesting notes about this proposal is the fact that it changes the cost and complexity of boxing and unboxing of values from primitive types and back. Consider this contrived example:

 1public void someMethod() {
 2	int val1 = 5;
 3	int val2 = 10;
 4	int result = sumofsquares(a, b));
 5	// ...
 8public Integer sumofsquares(Integer a, Integer b) {
 9	return square(a) + square(b);
12public Integer square(Integer val) {
13	int val = val.intValue();
14	return val * val;

In this case, we have two primitive values in someMethod(), but immediately pass them into a method that expects boxed types. As such, the compiler is going to secretly inject code like this: sumofsquares(Integer.valueOf(val1), Integer.valueOf(val2)).

What this new value objects support would allow the compiler and runtime to do is defer any of boxing all the way until some complex behavior is invoked on the type - in this case that would be the intValue() invocation within the square method. However, since the compiler can determine it’s simply calling a method that returns the value component contained there-in, it can actually completely eliminate the boxing altogether.

The method tree then turns around and returns a boxed value again, but the compiler and runtime can again eliminate any boxing, as the method tree never deals with them in a complex way.

Therefore, somewhat surprisingly, the boxing to a reference style java.lang.Integer above can in fact be eliminated completely. The other exciting benefit is that your own value types can be treated this same way, even though (unlike int and Integer) there is no native primitive value directly representing your type. This opens a whole new opportunity for expressive and clean APIs that have to deal with primitives, closing many of the difficult “performance vs. readable” problems Java developers have faced.

Class vs Object

Several months ago I watched a presentation from Brian Goetz (can’t find the specific talk) where he discussed the potential for value objects in Java’s future, and at the time the thought was that it could be a new keyword or annotation on the type that indicated it was a value type. This proposal, however, discusses using a per-instance locking mechanism similar to:

1Integer x = new Integer(123456);
2// Not yet a value object...
3x = x.lockPermanently();
4// Now a value object...

There are some benefits to this manual locking:

  • You can use some objects as reference types; if you’re synchronizing on Integer somewhere in your code (Why?) this would allow you to continue to do so.
  • Initialization of objects that will eventually be value types can be done in multiple steps.
  • Even if your code (or some library you use) doesn’t mark the object as a value type explicitly, the JIT can still analyze the usage of the object, and transliterate it to a value type internally as a form of runtime optimization.

This idea of value objects becomes a huge benefit when exploring expressive sophisticated functional programming models. Since one of the main tenants of functional programming is immutable types - if Java can in turn represent those immutable types as primitive values, it helps ensure that functional programming models don’t pay a penalty for their expressivity and immutability.