Compile-time guarantees for string content
The language and Kernel library are moving towards better support for Unicode, e.g. with loops and operators. What is going on with strings?
Manifest strings
In EiffelStudio 19.05 and earlier, the following code would not compile although it looks innocent:
inherit
LOCALIZED_PRINTER
...
localized_print ("[
“Poets have been mysteriously silent on the subject of cheese.”
]")
Why? The compiler would report an error that a manifest string of type STRING_8 cannot contain characters with code points outside the range 0–255. But the text seems to use only ASCII characters that fit into this range, no? Well, no. The leading and trailing double quotes follow the rules of English typography: they are two different characters (“ and ”), and neither of them is the ASCII double quote ". There is a straightforward "fix": the string needs to be prefixed with the explicit type STRING_32:
localized_print ({STRING_32} "[
“Poets have been mysteriously silent on the subject of cheese.”
]")
Now the code compiles and runs as expected. But why do we need the prefix in the first place? Why do manifest integers and reals of different sizes need no prefix, while the code still compiles and works? Because the compiler is smart enough to figure out the type of a numeric constant by looking at its value. Can the same be done with strings? Yes!
If a string has characters beyond the range 0–255, EiffelStudio 20.05 and above automatically associates it with type STRING_32. As a result, the initial code snippet compiles and works unchanged. This removes the need to prefix manifest strings with an explicit type in most cases when non-ASCII characters are used, and makes code much cleaner.
Direct Unicode output
The code above still looks a bit too heavy: to output Unicode to the console, the class needs to inherit from LOCALIZED_PRINTER and to call the Unicode-aware procedure localized_print. EiffelStudio 20.05 adds a new feature put_string_32 to the standard output classes. As a result, Unicode strings can be output without any additional boilerplate code:
io.put_string_32 ("[
“Poets have been mysteriously silent on the subject of cheese.”
]")
There are other features to be added or modified to make Unicode input/output even more seamless, but this is the first important step in this direction.
Forthcoming improvements in string concatenation
Suppose the citation above is used in a slightly more involved context, with an attribution phrase before it:
io.put_string_32 (attribution + "[
“Poets have been mysteriously silent on the subject of cheese.”
]")
The value of the first operand of the string concatenation operator + (the entity attribution) is picked at random from a pool of suitable phrases such as "Classic said: ", "We know: ", etc. Does everything look right? Well, it depends. It does not work right if the type of the entity attribution is STRING_8. In that case one would see:
Classic said: Poets have been mysteriously silent on the subject of cheese.
Where are our nice double quotes around the citation? It turns out that the concatenation operator + in EiffelStudio 19.05 and earlier accepts any type of string, but returns an object of the same type as the first operand. And what is the type of the first operand? STRING_8! So such a concatenation causes information loss.
Turning assertions on reveals a precondition violation in the function plus: its argument should be composed only of characters with codes in the range 0–255. So the issue is expected.
However, being used to compile-time mechanisms that prevent run-time issues (such as void safety or data race freedom), one would not expect to deal with information loss or assertion violations caused by mixing different string types at run time. Indeed, the issue can be avoided at compile time. How? Here are the key elements:
- Forbid automatic string conversion causing information loss.
- Make sure string concatenation produces an object of type STRING_32 as soon as either of its operands is 32-bit.
In order to avoid massive breaking changes to the string classes, in EiffelStudio 20.05 all features that convert 32-bit strings to 8-bit ones are marked as obsolete. The second point is addressed by marking the universal feature plus as obsolete and adding features with the same name for every sized string variant. (This is a breaking change for code that concatenates STRING_8 with STRING_GENERAL and its variants, but it should not cause too many compilation errors.)
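One way such code might be updated is sketched below. It widens the 8-bit operand explicitly, so that the concatenation is performed on 32-bit strings and the quotes cannot be lost. The class name CITATION_DEMO and the fixed attribution text are hypothetical and serve only as illustration; as_string_32 is the EiffelBase query that returns a 32-bit version of a string, and put_string_32 is assumed to be available as described above for EiffelStudio 20.05.

```eiffel
class
	CITATION_DEMO
		-- Hypothetical demo class, not part of any library.

create
	make

feature {NONE} -- Initialization

	make
			-- Print an attributed citation, keeping the typographic quotes.
		local
			attribution: STRING_8
			citation: STRING_32
		do
				-- Assumption: a fixed 8-bit phrase stands in for the randomly chosen one.
			attribution := "Classic said: "
			citation := {STRING_32} "“Poets have been mysteriously silent on the subject of cheese.”"
				-- Widening the first operand makes `+` return a STRING_32.
			io.put_string_32 (attribution.as_string_32 + citation)
			io.put_new_line
		end

end
```

The explicit call to as_string_32 is a stopgap: it keeps the result 32-bit even where the concatenation operator returns the type of its first operand.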
As soon as code is updated to avoid these errors and warnings, no information loss will be possible, while the ability to mix different string types is preserved. The obsolete features will be removed in a couple of releases, bringing us closer to Unicode-compatible, user-friendly string handling. When the transition is over, the code at the beginning of this section will print the phrase in quotes as expected, regardless of the string type of the entity attribution:
Classic said: “Poets have been mysteriously silent on the subject of cheese.”