Confusing Java Strings
I do not think it means what you think it means
In this article, I would like to show you a couple of confusing things in connection with Java String
s and give you a few suggestions to avoid issues with them. I also prepared a GitHub repo for you where you can find some code that you can use to try the examples out on your own: github.com/jonatan-ivanov/java-strings-demo.
Quiz
In order to demonstrate this, let me invite you for a little quiz:
What do you think, what is the length
of the following String
s in Java?
- Java
- ζεζ¬’θΆ
- πππ
- π©β€β
- π©βπ»β€οΈπ΅
(solution)
By now, you might get why “Confusing Java Strings” is the title of this article. In the rest of the article, I’m going to explain why you might got unexpected results in the quiz and give you a few suggestions to avoid issues.
Facts and Terminology
To understand the weirdness in String
s, you need to be familiar with some Encoding/Unicode terms.
As you might know, Java uses UTF-16 to encode Unicode text. Unicode is a standard to represent text while UTF-16 is a way to encode Unicode characters. That’s why the size of the Java char
type is 2 bytes (2x8 = 16 bits).
There are two important Unicode terms here, you need to know about: Code Point and Code Unit.
- Code Point is a unique integer value that identifies a character
- Code Unit is a bit sequence used to encode a character (Code Point)
UTF-16
As I mentioned above, UTF-16 is a way to encode Unicode characters. Not the only way but that is what Java uses.
Unicode Code Points are logically divided into 17 planes (groups). The first plane, the Basic Multilingual Plane (BMP) contains the “classic” characters (from U+0000 to U+FFFF). The other planes contain the “supplementary” characters (from U+10000 to U+10FFFF).
Characters (Code Points) from the first plane are encoded in one 16-bit Code Unit with the same value. Supplementary characters (Code Points) are encoded in two Code Units (see Wikipedia - UTF-16 for more information).
The key thing here is that one or more Code Units may be required to encode a Code Point (character).
Example
Character: A
Unicode Code Point: U+0041
(see: codepoints.net/U+0041)
UTF-16 Code Unit(s): \u0041
Character: πΈ
Unicode Code Point: U+1D538
(see: codepoints.net/U+1D538)
UTF-16 Code Unit(s): \uD835\uDD38
As you can see here A is encoded by one Code Unit while πΈ is encoded by two.
String::length
Let’s take a look at the Javadoc of the length()
method of the String
class; it says the followings:
public int length()
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
So if you have one supplementary character that consists of two Code Units, the length
of that single character is two.
Let that sink in: this means that the char
type (as well as the Character
class) in Java is not what we usually mean by a character.
Code Points vs. Code Units Example
If we go back to our quiz, we can explain some of the anomalies:
Java -> U+004A U+0061 U+0076 U+0061 // 4 Code Points
Java -> \u004A \u0061 \u0076 \u0061 // 4 Code Units, length: 4
Likewise:
ζεζ¬’θΆ -> U+6211 U+559C U+6B22 U+8336 // 4 Code Points
ζεζ¬’θΆ -> \u6211 \u559C \u6B22 \u8336 // 4 Code Units, length: 4
But π, π, and π are supplementary characters:
πππ -> U+1D552 U+1D553 U+1D554 // 3 Code Points
πππ -> \uD835\uDD52 \uD835\uDD53 \uD835\uDD54 // 6 Code Units, length: 6
~Solution
If you really need to, you can count the Code Points to get the number of characters, not the number of Code Units:
String str = "πππ";
str.codePointCount(0, str.length()) // evaluates to 3
Consequences
The rest of the quiz is a bit more tricky but before I go there, let me mention a couple of things that are implied from the fact that Java is using UTF-16 under the hood; let me use the "πΈ"
String as an example:
- As you know by now this single character is represented by two
char
(orCharacter
) values (Code Units) and thelength
of thisString
is two - The
toCharArray()
method returns achar
array (char[]
) that has two elements (0xD835
and0xDD38
respectively) - Both
charAt(0)
andcharAt(1)
return something (noStringIndexOutOfBoundsException
) but these values alone are not valid characters - If you do any character manipulation, you need to consider this case and handle these characters that consist of two
char
s (surrogates) - Therefore, most of the character manipulation code we ever wrote is probably broken :)
- And you probably do not want to do any character manipulation
String::reverse
In Java, the String
class does not have a reverse
method so sometimes you can bump into methods like this:
String reverse(String original) {
String reversed = "";
for (int i = original.length() - 1; 0 <= i; i--) {
reversed += original.charAt(i);
}
return reversed;
}
By now, I think you might have a good idea what’s wrong with this method; let’s see it in action:
String str = "πΈBC"; // Three characters, but four chars
System.out.println(reverse(str)); // prints CB??
The tricky part of the "πΈBC"
String in the example is the πΈ
character that consists of two Code Units: \uD835\uDD38
.
If you execute the reverse
method above, it will produce a String
like this: "CB\uDD38\uD835"
.C
and B
are ok but \uDD38\uD835
is not valid, that’s why you see ??
instead. The method should not have reversed it, the valid result would be "CB\uD835\uDD38"
(CBπΈ
).
You can also get other, quite unexpected results:
System.out.println(reverse("πππ")); // prints ?ππ?
Can you tell why we got this result?
(Hint: look into the Code Units of πππ
above and check what you get if you read them backwards.)
Solution
Usually, not writing code to solve problems is a good idea:new StringBuilder(original).reverse().toString()
.
Emojis
The first emoji sequence (π©β€β) in the quiz does not have anything tricky that you haven’t known by now: the first emoji in it consists of two Code Units the other two consist of 1-1:
π© β€ β
U+1F469 U+2764 U+2615 // 3 Code Points
\uD83D\uDC69 \u2764 \u2615 // 4 Code Units, length: 4
The second sequence (π©βπ»β€οΈπ΅) has two tricks:
π©βπ» is actually two emojis joined together with a special Zero Width Joiner (ZWJ) character.
π© ZWJ π» // 3 characters
U+1F469 U+200D U+1F4BB // 3 Code Points
\D83D\uDC69 \u200D \uD83D\uDCBB // 5 Code Units, length: 5
β€οΈ is actually a β€ plus a variation selector that makes it red.
β€ mod // 2 characters
U+2764 U+FE0F // 2 Code Points
\u2764 \uFE0F // 2 Code Units, length: 2
π΅ is “just” a supplementary character:
U+1F375 // 1 Code Point
\uD83C\uDF75 // 2 Code Units, length: 2
So the length
of π©βπ»β€οΈπ΅ is π©βπ»(5) + β€οΈ(2) + π΅(2) = 9.
String::substring
If substring
cuts into the “wrong” place, we might get an invalid character or a new (different) character or both:
System.out.println("πππ".substring(0, 5)); // prints ππ? (invalid character)
System.out.println("abcπ©βπ»".substring(0, 5)); // prints abcπ© (new character)
System.out.println("aπ©βπ»".substring(0, 5)); // prints aπ©β? (both)
Can you tell why this happened?
(Hint: look into the Code Units above.)
Takeaway
As you have seen above, Strings are trickier than they seem first so try to avoid String manipulation as much as you can. :)