Compact Strings in Java 9


JEP 254: Compact Strings

The goal of this JEP (JDK Enhancement Proposal) is to adopt a more space-efficient internal representation for strings: improve the space efficiency of the String class and related classes while maintaining performance in most scenarios and preserving full compatibility for all related Java and native interfaces.


Strings are used extensively in almost every Java application, so any optimization of String affects nearly all Java applications.

String instances are stored on the heap, and they account for a large share of it. A commonly cited estimate is that Strings consume around 25% of heap memory. Halving the size of String data would therefore reduce memory consumption significantly and also lower garbage collection overhead, which makes a more compact representation worthwhile.

Before Java 9, a String was internally represented by two objects: the String object itself and a char array holding its data. Each char occupies 2 bytes because Java uses UTF-16 internally, which lets a String represent any Unicode character, including special characters.

The problem is that the vast majority of strings in applications contain only characters that fit in a single byte using ISO-8859-1/Latin-1. If a String contains an English word, for example, the upper 8 bits of every char are 0, because an ASCII character fits in a single byte. So there is scope to reduce memory consumption and improve performance.
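To make the difference concrete, the sketch below encodes the same text both ways with the standard charsets and compares the byte counts. The class and method names are illustrative only and are not part of the JDK. The numbers cover the encoded character data, not the total size of the objects on the heap.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (not a JDK class): compares the number of bytes
// needed for a string's character data under Latin-1 and UTF-16.
final class EncodingSizeDemo {

    // True if every char fits in a single byte (value <= 0xFF).
    static boolean fitsInLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16BE avoids the 2-byte byte-order mark that UTF_16 prepends.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        System.out.println(fitsInLatin1(text)); // true
        System.out.println(latin1Bytes(text));  // 12
        System.out.println(utf16Bytes(text));   // 24
    }
}
```

For text like this, half of the bytes in the UTF-16 form are zero, which is the waste Compact Strings targets.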


Java 6 Compressed Strings

The string memory consumption issue is not new. It has been discussed for quite some time already. In fact, in Java 6, a new feature was introduced to address this issue – Compressed Strings.

The idea was to declare the internal storage field as an Object instead of a char[] array. When a string needed two bytes per character, a char[] array was assigned to that field; when one byte per character was sufficient, a byte[] array was used instead.

This was an optional, experimental feature that could be enabled on demand with the -XX:+UseCompressedStrings flag. The option was eventually removed in JDK 7, mainly because it had some unintended performance consequences.


Java 9 Compact Strings

Java 9 has brought the concept of compact Strings back. While the implementation of Compressed Strings was flawed in many ways, the main idea was still valid. In Java 9, a new feature was introduced as a replacement of Compressed Strings – Compact Strings.

Instead of a char[] array, a String is now backed by a byte[] array. Depending on which characters it contains, the String is encoded as either Latin-1 or UTF-16, that is, with either one or two bytes per character. Whenever we create a String whose characters can all be represented in a single byte (the LATIN-1 representation), the byte array holds one byte per character.

Otherwise, if any character needs more than 8 bits, all the characters are stored using two bytes each (the UTF-16 representation). So, whenever possible, a String uses a single byte per character.

Now, the question is: how do the String operations work, and how do they distinguish between the LATIN-1 and UTF-16 representations? To handle this, another change was made to the internal implementation of String: a final field, coder, records which encoding is in use.
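The layout described so far can be sketched as a small class. This is a simplified illustration under our own assumptions, not the actual java.lang.String source: the class name CompactTextSketch, its methods, and the byte order chosen for the two-byte case are all hypothetical.

```java
// Hypothetical sketch of a byte[]-plus-coder layout; NOT the real
// java.lang.String implementation, only an illustration of the idea.
final class CompactTextSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value; // character data, 1 or 2 bytes per char
    private final byte coder;   // LATIN1 or UTF16

    CompactTextSketch(String s) {
        // Check whether every char fits in a single byte.
        boolean latin1 = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                latin1 = false;
                break;
            }
        }
        if (latin1) {
            value = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                value[i] = (byte) s.charAt(i);
            }
            coder = LATIN1;
        } else {
            value = new byte[s.length() * 2];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8); // high byte first (arbitrary choice for this sketch)
                value[2 * i + 1] = (byte) c;    // low byte
            }
            coder = UTF16;
        }
    }

    byte coder() {
        return coder;
    }

    int storedBytes() {
        return value.length;
    }

    // Every read operation branches on the coder to decode correctly.
    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }
}
```

In this sketch, new CompactTextSketch("abc") stores 3 bytes with coder 0, while a single character above 0xFF anywhere in the input switches the whole text to two bytes per character.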

In Java 9 String class implementation, the length is calculated as:

public int length() {
    // coder() yields 0 for LATIN-1 and 1 for UTF-16
    return value.length >> coder();
}
  • If the String contains only LATIN-1 characters, the value of the coder is 0, so the length of the String is the same as the length of the byte array.
  • Otherwise, if the String is in the UTF-16 representation, the value of the coder is 1, and hence the length is half the size of the byte array.
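The arithmetic behind the shift is easy to check by hand. The sketch below applies the same expression to byte arrays produced by the standard charsets; the helper class LengthShiftDemo is hypothetical and only mirrors the calculation, it does not read String's private fields.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that reproduces the length calculation on plain arrays.
final class LengthShiftDemo {

    // coder 0 leaves the byte count unchanged; coder 1 halves it.
    static int length(byte[] value, int coder) {
        return value.length >> coder;
    }

    public static void main(String[] args) {
        // 7 Latin-1 characters -> 7 bytes
        byte[] latin1 = "compact".getBytes(StandardCharsets.ISO_8859_1);
        // 3 characters outside Latin-1 -> 6 bytes in UTF-16
        byte[] utf16 = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(length(latin1, 0)); // 7
        System.out.println(length(utf16, 1));  // 3
    }
}
```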

Note: All the changes made for Compact Strings are in the internal implementation of the String class and are fully transparent to developers using String. String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer were updated to use the same representation, as were the HotSpot VM's intrinsic string operations.

Kill-Switch For Compact String Feature

The Compact Strings feature is enabled by default in Java 9. If we are sure that, at runtime, our application will mostly generate Strings that can only be represented in UTF-16, we may want to disable the feature. Doing so avoids the overhead incurred during String construction when the optimistic conversion to the 1-byte (LATIN-1) representation is attempted and fails. To disable the feature, we can use the following switch:

-XX:-CompactStrings
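To confirm which setting a running JVM actually uses, one option is to read the flag through the HotSpot diagnostic MXBean. This is a sketch that assumes a HotSpot-based JVM on Java 9 or later; the class name CompactStringsFlagCheck and the "unknown" fallback are our own choices, and on other JVMs the lookup may not be available at all.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: reports the CompactStrings VM option of the current JVM.
final class CompactStringsFlagCheck {

    // Returns "true" or "false" as reported by the VM, or "unknown" if the
    // option cannot be read (for example on a JVM that lacks this flag).
    static String compactStringsSetting() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unknown";
            }
            return bean.getVMOption("CompactStrings").getValue();
        } catch (RuntimeException e) {
            // getVMOption throws IllegalArgumentException for unknown options.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("CompactStrings = " + compactStringsSetting());
    }
}
```

Running this class once with default settings and once with -XX:-CompactStrings on the java command line should show the value change.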


Conclusion

Because no public interfaces were changed for the Compact Strings feature, it is backward compatible. Compact Strings roughly halves the memory used by the character data of Strings that fit in Latin-1, which is the common case, and it can be disabled with an -XX flag if required.