ZSTRING released in Eiffel-Loop 1.3.1
Preface
Since 2012 I have been interested in the idea of creating a memory efficient alternative to STRING_32 that does not sacrifice performance. In 2013 I published this article which describes and benchmarks the class EL_ASTRING (aka ASTRING). While ASTRING peformed well against STRING_32 and the Gobo UC_UTF8_STRING in terms of both memory consumption and performance, it had one fundamental problem which I felt would prevent it from ever gaining wide acceptance. This was that each ASTRING instance was limited to a maximum of 26 unicode characters that could be used to augment the base latin character set.
In the fall of 2015 I came up with a better algorithm which did not suffer the limitations of ASTRING. Although the new design was somewhat less memory efficient for storing characters wider than 8-bits, it performed well overall. I decided to call it ZSTRING. In November 2015 I announced work in progress on ZSTRING in this article.
I am happy to announce that this project is now completed and since January I have migrated all my projects to use ZSTRING. ZSTRING is now available to download as part of the latest 1.3.1 release of Eiffel-Loop.
Finnian Reilly
What is ZSTRING?
ZSTRING is a memory efficient alternative to the standard STRING_32 class. If your application processes text that is mainly written in a language that is encodeable by a latin character set, you can expect up to a 75% saving in memory. There is a small trade off in performance, but not by much as the benchmarks show. Actually there is not always a trade off. Some ZSTRING string operations are slightly faster than their STRING_32 equivalent.
How is ZSTRING different from STRING_8?
With STRING_8 you are limited to characters from the ISO-8859-1 character set. A ZSTRING instance can be initialized from any STRING_32 string and you are guaranteed to always get back the same STRING_32 string with a call to the function {ZSTRING}.as_string_32. You can also convert it to UTF-8 with a call to {ZSTRING}.to_utf_8.
How ZSTRING Works
ZSTRING uses a combination of two different character arrays to represent the text. The first array is defined as:
When initialized from a STRING_32, area holds every character that is encodeable using a system defined latin character-set (ISO-8859-15 by default). Any unicode character that is not encodeable as latin, is denoted by the control code 26 'SUB' character. SUB stands for substitution.
The unencoded array
The second array is defined as:
This array is used to store any characters that are not encodeable by the latin codec. The array is structured as a sequence of substrings. The first two cells of each substring record the indices defining the substring interval. As an example to illustrate, consider the string:
"#61 ZhÅng Fú ä¸åš Inner Truth"
It contains a total of 27 characters. 24 of these are encodeable as Latin-15 and 3 are not. The three that are not encodeable are stored in the unencoded_area array as shown in the following diagram:
The olive colored cells store the interval indices of the 2 substrings "Å" and "ä¸åš".
The Default Codec
By default, ZSTRING uses the codec class EL_ISO_8859_15_ZCODEC to convert unicode strings. However you can override this choice by calling the routine:
For a desktop application you could set the system codec according to the user locale. The following class offers a convenient way to create codec instances from an integer arguement:
The ZSTRING Test Suite
ZSTRING has a comprehensive test suite that verifies the results of all string operations against STRING_32.
See ZSTRING_TEST_SET also TEST_STRING_CONSTANTS
ZSTRING for Asian text
If your application is mainly processing Asian text, it is obvious that the current implementation of ZSTRING will incur a significant penalty in terms of both memory and runtime performance. The solution to this problem is to use a different source compatible implementation of ZSTRING adapted for Asian text that can be set with an ECF option.
A 2-byte Solution
The solution is in inherent in the fact that most Asian characters will fit into 2 bytes. By changing the storage arrays to the following definition it should be possible to create an efficient ZSTRING for Asian text
Ideally area should be defined as CHARACTER_16, but as this is not part of standard Eiffel, NATURAL_16 can be used as a workaround.
Perhaps in the future some Asian coder will be interested to take on the challenge of implementing this.
A 4-byte Solution
Of course an easier option is to create a source compatible version of ZSTRING that uses the same implementation as STRING_32.
Benchmark Notes
You can find a set of benchmarks at the following URL comparing ZSTRING and STRING_32 in terms of memory consumption and runtime performance.
http://www.eiffel-loop.com/benchmarks/ZSTRING-benchmarks-latin-15.html
Benchmark Source Code links on Github
Pure Latin-15 encoding
Mixed Latin-15 and Unicode encoding
MIXED_ENCODING_STRING_BENCHMARK
MIXED_ENCODING_ZSTRING_BENCHMARK
MIXED_ENCODING_STRING_32_BENCHMARK
The Test Data
The test data used in these benchmarks is shown at the bottom of the benchmark page. The data is composed of 64 rows and 4 columns referenced as A, B, C, D. The Input column indicates which combination of columns have been selected for each test. For example "$A $D"
means that each input string is a string from column A joined with a string from column D. Each runtime performance test is repeated for each of the 64 rows.
The columns have different storage characteristics as follows:
Columns A and D consist solely of characters that can be encoded as Latin-15. Column B is composed of characters, some of which will need to be stored in the unencoded_area array. Column C is composed of characters, all of which will need to be stored in the unencoded_area array.
Memory Consumption
Table 1 shows memory consumption under optimal conditions, where all the text data can be encoded as Latin-1.
Table 2 shows memory consumption under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array.
Runtime Performance
Runtime performance is benchmarked using the 25 most common string operations. The performance figures are the average of 1000 runs for each benchmark test operation. The results are sorted in order of descending ZSTRING performance rank. Table 3 shows runtime performance under optimal conditions, where all the text data can be encoded as Latin-1. Table 4 shows performance under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array. Table 4 has more input permutations than table 3.
I Ching Hexagram Test Strings
Table 5 at the bottom of the page shows the data used for the benchmarks. The data are names and descriptions of the 64 I Ching hexagrams divided into 4 columns as follows: A. Hexagram number B. Chinese names as pinyin phonetic spelling C. Chinese names as chinese characters D. English description
ZSTRING argument options
Most ZSTRING routines have two alternate forms as follows:
- taking arguments of type EL_READABLE_ZSTRING
- taking arguments of type READABLE_STRING_GENERAL
For example:
Extra ZSTRING routines
ZSTRING has a number of extra routines not found in the standard Eiffel string classes.
This routine is inspired by the Python string formatting or interpolation operator '%'. It is not quite as sophisticated as the Python operator as it cannot do numeric formatting, but nevertheless is very useful. It means any ZSTRING can be used as an interpolation template with the addition of '%S' markers (AKA '#').
This routine is inspired by the Python method string.translate(s, table[, deletechars]). ZSTRING has a number of different variations of this routine one of which is able to do character deletion.
This returns a list of substring indices as an arrayed list.
The class EL_SEQUENTIAL_INTERVALS is a high performance alternative to using ARRAYED_LIST [INTEGER_INTERVAL] both in terms of memory consumption and runtime performance.
Returns an escaped copy of current string
Unescapes the current string
Source Code Links
class EL_ZSTRING aka ZSTRING
class EL_READABLE_ZSTRING
class EL_UNENCODED_CHARACTERS
Are you aware of Gobo's UC_STRING? That's basically UTF-8 (there is also a UTF-16 variant), and that has been a very stable solution for me for many years.
gobo class UC_UTF8_STRING
hi Berend
I published this article in 2013 which benchmarks UC_UTF8_STRING against STRING_32 and ASTRING (a precursor to ZSTRING). These benchmarks show that UC_UTF8_STRING is extremely slow for some commonly used operations. I don't recall now the UTF-16 variant or perhaps it's something new.
UC_UTF8_STRING Benchmark Screenshot
UC_UTF8_STRING is broken
Besides being very slow, UC_UTF8_STRING is also broken, as this test shows. (I am using the version distributed with EiffelStudio 15.01.9)
UC_UTF8_STRING fixed
Thank you for reporting this problem. It's now fixed in the Git repository of Gobo in Github.