UTF-8 Unicode in Eiffel for .NET
We have an Eiffel for .NET DLL which is called by a VB.NET application. We need to internationalize this application. "That should be easy", I thought, "because Eiffel's STRING is basically just a wrapper for the .NET SYSTEM_STRING."
The Eiffel STRING class is not natively Unicode, but it can encode UTF-8 as a sequence of bytes. It seems that STRING is heavily byte-oriented, even in .NET. I guess this is just something the Eiffel Software team hasn't got around to doing properly yet in EiffelStudio 5.7. How could I convert this sequence of bytes, supplied by STRING, into a true Unicode SYSTEM_STRING?
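To see what "UTF-8 as a sequence of bytes" means in practice, here is a quick sketch in Python (not Eiffel; just illustrating the representation):

```python
# An 8-bit string holding UTF-8 stores each character as one to four bytes.
text = "سلام"                    # "hello" in Farsi: four characters
utf8_bytes = text.encode("utf-8")

print(len(text))                 # 4 characters
print(len(utf8_bytes))           # 8 bytes: each Arabic-script letter takes two
```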
The answer lay in SYSTEM_STRING_FACTORY. This is a helper class for STRING; it converts to and from the .NET SYSTEM_STRING. The first step was to copy SYSTEM_STRING_FACTORY from the base.kernel.dotnet cluster to an override cluster of my own making.
Then I edited my override version of the class. The original version of the from_string_to_system_string function converts the bytes in the Eiffel STRING l_str8 to the .NET Result with this one line of code:
create Result.make (l_str8.area.native_array, 0, a_str.count)
Now I really don't understand why this doesn't work. The native_array is declared as NATIVE_ARRAY [CHARACTER], and CHARACTER is simply a .NET character. Whatever the reason, the fix was to replace the line above with:
create utf8.make_from_utf8 (l_str8)
Note that utf8 is declared as UC_UTF8_STRING. Then I looped through utf8, copying each Unicode character to l_str via a .NET string builder (the same technique that SYSTEM_STRING_FACTORY uses to copy 32-bit strings, by the way). The full source for my version of SYSTEM_STRING_FACTORY is attached.
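For readers outside Eiffel, the bug and the fix can be sketched in Python (the sample string is my own; it is not taken from SYSTEM_STRING_FACTORY itself):

```python
# The one-line byte copy promotes each UTF-8 byte to its own 16-bit
# character instead of decoding multi-byte sequences.
utf8_bytes = "سلام".encode("utf-8")           # 8 bytes of UTF-8

# Naive copy: one character per byte -- produces 8 mojibake characters
naive = "".join(chr(b) for b in utf8_bytes)

# Correct conversion: decode the whole byte sequence as UTF-8
correct = utf8_bytes.decode("utf-8")

print(len(naive), len(correct))               # 8 vs 4
print(correct == "سلام")                      # True
```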
This single change was sufficient to allow VB client classes to display the strings that our Eiffel libraries create. No change was required to any VB code. Each of the hundreds of places in our VB code that called STRING.to_cil now automatically did the conversion with the help of my version of SYSTEM_STRING_FACTORY.from_string_to_system_string, and so our application displayed proper Farsi text.
But this wasn't enough to handle converting the other way. Passing VB strings to our Eiffel libraries still didn't work: I had to fix the creation routine STRING.make_from_cil. This was achieved by modifying the SYSTEM_STRING_FACTORY.read_system_string_into command, which converts the .NET a_str to the Eiffel STRING l_str8 with this line of code:
a_str.copy_to (0, l_str8.area.native_array, 0, a_str.length)
Once again, I'm not sure why this doesn't work; but it doesn't. To fix it, I replaced the above line with two loops:
from
	i := 0
	nb := a_str.length
	create utf8.make_empty
until
	i = nb
loop
	utf8.append_character (a_str.chars (i))
	i := i + 1
end
from
	i := 1
	nb := utf8.byte_count
	l_str8 ?= a_result
	l_str8.wipe_out
until
	i > nb
loop
	l_str8.append_character (utf8.byte_item (i))
	i := i + 1
end
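The same two-pass conversion can be sketched in Python, under the assumption that the goal is simply: UTF-16 string in, UTF-8 byte sequence out:

```python
# Pass 1: walk the source string character by character, building a
# UTF-8 buffer (the role of utf8 in the Eiffel code above).
source = "سلام"                  # stands in for the .NET a_str
utf8_buffer = bytearray()
for ch in source:
    utf8_buffer += ch.encode("utf-8")

# Pass 2: copy the buffer's bytes, one per 8-bit character, into the
# result (the role of l_str8 above).
result = bytes(utf8_buffer)

print(len(source), len(result))  # 4 characters in, 8 bytes out
```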
So now make_from_cil and to_cil handle UTF-8 properly. In each case, they copy the contents of the input string to an instance of UC_UTF8_STRING, which is then copied to the result. This implementation is no doubt inefficient, but it does seem to be working OK. It requires minimal change on the Eiffel side (only one class, SYSTEM_STRING_FACTORY, has been overridden); and it's completely transparent, as far as I can tell, on the VB side.
Before arriving at this approach, I had several false starts. I tried doing all of the conversion on the VB side, but that required modifying hundreds of lines of code. I tried mapping the Eiffel STRING class to STRING_32, but that didn't even compile (and it probably wouldn't have worked anyway); and I tried making VB work directly with a .NET-enabled override of UC_UTF8_STRING, but that blew up at run time with type-cast errors in STRING.is_equal (CAT-calls, I think).
So all in all, I'm happy that it seems to be working; but I'm unhappy that it took a lot of work to figure out how to do it. I'm looking forward to the day when Unicode in Eiffel is as easy as it is in C# and VB.
Attached is my override of SYSTEM_STRING_FACTORY.
Note (May 17, 2007): I've written a follow-up to this at UTF-8 in .NET, revisited, including a new override of SYSTEM_STRING_FACTORY.
Limitations
As far as I can tell, this implies that all your Eiffel strings are UTF-8; otherwise it might not work for characters above 128. But if you get your data from UTF-8, wouldn't it be better to generate STRING_32 instead when reading the data? Once done, the STRING_32 would convert nicely to the .NET System.String.
Limitations - Yes, UTF-8
Yes, the assumption here is that the strings are all UTF-8. (I tried to make that clear, especially in the title). This assumption is sound for our purposes.
For this reason, the official EiffelStudio version of SYSTEM_STRING_FACTORY probably should not adopt my "fix". Other good reasons for EiffelStudio to come up with a better fix than this are that my implementation is inefficient, and that it would create a dependency of the base library on the gobo library. I don't mind my own project having a dependency on Gobo - the project already uses Gobo - but this is not OK in general.
I agree with your idea of generating STRING_32 when reading the data. I was thinking along those lines when I attempted (unsuccessfully) to map STRING to STRING_32. I modified my project's config to use base as a cluster rather than a library; then I copied all of the mappings from base.ecf to my config, editing STRING to map it to STRING_32 rather than STRING_8. But I quickly abandoned that route, because it wouldn't even compile. Some line in a library (something like true_string: STRING is "True") couldn't convert a STRING_8 to a STRING_32. I could have left the mapping alone, I suppose, and done a global search and replace of STRING with STRING_32 in our code; but that is very invasive, and I'd be surprised if it worked, given that the libraries we call would still be producing STRING_8 objects.
I really don't like this STRING_8/STRING_32 idea. I programmed in C# for two years, developing an internationalised application, and I was barely conscious of the fact that my strings and characters were Unicode. It just worked. I acknowledge that Eiffel is contending with a backward-compatibility problem here, but I would be much happier if I could just flip a switch in the config file so that all of my STRING objects instantly became Unicode.
Flip of a coin
That will soon be possible, once all our legacy code that only handles STRING_8 has been adapted to work with STRING_32 as well.
Multiple string types
I don't think there was a backwards-compatibility problem - at least, not one that needed the STRING_8/STRING_32 separation as a solution. A much simpler fix of adding a query would have done the trick.
Solution?
The issue is that legacy code wrapping C interfaces required 8-bit strings, and making strings Unicode would have broken those APIs. Therefore the separation was, and still is, needed.
What do you mean by a query?
My solution
Well, two queries actually.
maximum_code: INTEGER is
		-- Maximum value of `code' permitted by `character_set'
	do
		-- 255 for ISO-8859-x, 1114111 for Unicode
		-- compiler can optimize this as a builtin query
		-- in the case that only 1 character set is used in
		-- a system. Otherwise it would be an attribute
	ensure
		positive_code: Result > 0
	end
and
character_set: STRING is
		-- Name of character set used in `Current'
	do
		-- same considerations as for `maximum_code'
	ensure
		result_not_empty: Result /= Void and then not Result.is_empty
		result_is_ascii: -- whatever
	end
With these two queries, all incompatibilities can be coped with (including your C-string stuff - the compiler can include transcoding when necessary).
Note that the latter query was also needed for ETL2, for supporting multiple or alternate encodings, such as ISO-8859-2.
Real issue
What I meant is that it is not an issue of encoding or character set, but an issue with the memory representation of Eiffel strings. Indeed, the existing legacy C wrapper code uses this representation directly, so changing STRING to support Unicode would break those programs. This is one of the reasons why EiffelBase introduced C_STRING, so that wrappers work regardless of the memory representation of Eiffel strings.
STRING_8/STRING_32
I would also really like to know why we should have STRING_8 etc. It makes no sense to me at all. Currently the compiler does tricks to convert all STRINGs in our code to STRING_8, but certain things like checking generated_type.is_equal ("STRING") break; anywhere dynamic_type from INTERNAL is used with STRING objects might or might not work. And I don't see why issues of UTF encoding (not the same thing as Unicode per se) should be exposed at a developer-visible level.
What we need is to be able to say, at the beginning of an application, set_unicode_encoding_utf8 or set_unicode_encoding_utf16, and everything just works. The default should be whichever makes sense in your linguistic culture (UTF-8 for all European languages).
- thomas
Unicode encoding
There are two Unicode encodings to be considered - the Unicode Encoding Form, and the Unicode Encoding Scheme.
The former is one of UTF-8, UTF-16, and UTF-32. This is what is used internally within the program. It is a classic time/space trade-off. ISE uses UTF-32, which wastes memory to save computing time. Either UTF-16 or UTF-8 would be slower, although UTF-16 is rarely significantly slower. I don't see any linguistic cultural differences affecting this issue.
The Unicode Encoding Schemes are byte serializations of the encoding forms. The full list is UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The natural default tends to map to your computer hardware + O/S, except that there is a disk-space consideration here: the UTF-32 schemes use more disk space. Which is the most economical DOES depend upon the linguistic culture - in Europe UTF-8 is cheapest; in East Asia, the UTF-16 schemes are cheapest. I don't know what {STRING_32}.out produces with ISE 5.7.
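The disk-space point is easy to check; here is a Python sketch with two sample strings (my own examples, chosen to illustrate the two cultures):

```python
# Compare serialized sizes for a European and an East Asian sample.
european = "Größenordnung"       # German: mostly one byte each in UTF-8
east_asian = "文字符号化方式"    # Japanese: three bytes each in UTF-8

for label, s in [("European", european), ("East Asian", east_asian)]:
    print(label,
          "UTF-8:", len(s.encode("utf-8")),
          "UTF-16LE:", len(s.encode("utf-16-le")),
          "UTF-32LE:", len(s.encode("utf-32-le")))
# UTF-8 wins for the European sample; UTF-16 wins for the East Asian one;
# UTF-32 is largest for both.
```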
So there are two possible sets of set_unicode_encoding features.
Colin Adams
The form doesn't matter, since it is most likely hidden from the user's point of view (the user manipulates a sequence of characters and nothing more). Nevertheless, it might be better if it is kept 32-bit, since that is faster.
Regarding the encoding scheme, it cannot be set at the application level, since different libraries might choose different encodings, or you might need to read several encodings. So it has to be configurable, and this should be outside the STRING class.
Serializing
I agree that the encoding form doesn't matter, and that the compiler should be free to choose whichever it likes (and UTF-32 would be my choice too).
But for serializing, STRING_GENERAL should have the following routines (bodies omitted):
to_utf8: !STRING_8 is
		-- Serialization of `Current' as bytes of UTF-8 representation.
	do
	ensure
		not_shorter: Result.count >= count
	end

to_utf_16_be: !STRING_8 is
		-- Serialization of `Current' as bytes of UTF-16BE representation.
	do
	ensure
		not_shorter: Result.count >= 2 * count
	end

to_utf_16_le: !STRING_8 is
		-- Serialization of `Current' as bytes of UTF-16LE representation.
	do
	ensure
		not_shorter: Result.count >= 2 * count
	end

to_utf_32_be: !STRING_8 is
		-- Serialization of `Current' as bytes of UTF-32BE representation.
	do
	ensure
		four_times_longer: Result.count = 4 * count
	end

to_utf_32_le: !STRING_8 is
		-- Serialization of `Current' as bytes of UTF-32LE representation.
	do
	ensure
		four_times_longer: Result.count = 4 * count
	end
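These postconditions can be tried out in Python, assuming count means the number of code points (a character outside the BMP becomes a surrogate pair in UTF-16, which is why the UTF-16 contracts say not_shorter rather than an equality):

```python
# Check the proposed postconditions against Python's encoders.
for s in ["Größe", "سلام", "文字", "a\U0001F600"]:   # last sample leaves the BMP
    count = len(s)                                   # number of code points
    assert len(s.encode("utf-8")) >= count           # not_shorter
    assert len(s.encode("utf-16-be")) >= 2 * count   # not_shorter
    assert len(s.encode("utf-16-le")) >= 2 * count   # not_shorter
    assert len(s.encode("utf-32-be")) == 4 * count   # four_times_longer
    assert len(s.encode("utf-32-le")) == 4 * count   # four_times_longer
print("all postconditions hold")
```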
The question remains what {STRING_32}.out should produce. Perhaps it should be platform-specific (and maybe finer-grained than just Windows v. POSIX - different Windows configurations may have different natural defaults - I'm not sure about this).
Colin Adams
I believe it does not matter whether or not those routines are in STRING_GENERAL. It might be better to have them outside: you may want to serialize the data into something other than a string, and to reduce code duplication it makes more sense outside.
For {STRING_32}.out, I don't think it is a major issue. The default implementation of `out' is compiler-defined at the moment to be STRING. In the future we may want to change this to be STRING_32, but for the time being, being a truncated version of the STRING_32 representation is fine with me, since `out' has different semantics depending on the Eiffel class. In my opinion, using `out' for encoding would be really wrong.