Introduction
I am currently working on new kind of string class: ZSTRING. Z-strings are essentially a compressed STRING_32 offering similar performance characteristics but with a much smaller memory footprint. Large strings composed mainly of characters found in the Latin-15 character-set use up to 75% less memory. Some preliminary benchmarks can be found here: http://www.eiffel-loop.com/benchmarks/zstring.html
The first benchmark set shows the best case scenario and the second the worst case. The letters A, B, C, D are codes for the test strings shown in a table at the bottom of the benchmark page. Each benchmark is the average of 100 runs with a full garbage collect at the end of each.
Z-strings have a precursor called EL_ASTRING (alias ASTRING) which I wrote about some years ago in this article.
However Z-strings use a very different kind of compression algorithm that removes one of the fundamental limitations of A-strings which is they were limited to a maximum of 26 unicode overflows per string. By overflow is meant characters which are not mappable to latin-15. With Z-string you can include an unlimited amount of Unicode overflow. The Z-string code is also a lot simpler.
The Algorithm
Z-string works by using a hybrid of Latin and Unicode character sets. This means that for languages which work well with 8 bit character-sets, the strings are stored as 8 bit character sequences. Z-strings are mainly initialized from Unicode or UTF-8 strings. Any character which is not translatable with the current system codec is put into an overflow array unencoded, which is composed of a sequence of substrings. Each substring is a start_index followed by an end_index followed by the unicode substring characters. By default unencoded is assigned the once singleton: Empty_unencoded.
unencoded: SPECIAL [NATURAL]
You can choose any Latin or Windows codec to be your system codec depending on the selected language locale of the user machine. Latin-15 is fine for most Western-European languages.
Asian Languages
However looking at the benchmarks it becomes obvious that Z-strings will inefficient for handling Asian text which is mostly composed of double byte characters. The solution I envisage for this is to have a second Z-string implemenation that is available as an ECF compile option. The Asian implemenation of ZSTRING will inherit from a class EL_ZSTRING_16 which operate on double byte character sequences. Any odd characters requiring more than 2 bytes can be stored in the unencoded overflow array. So even with Asian languages, I would expect to see a substantially smaller memory footprint compared to STRING_32.
Test Suite
I have been able to reuse an extensive set of tests developed for EL_ASTRING that prove equivalent results to STRING_32 routines.
Extra Routines
There are also a number of fancy routines in EL_ASTRING that are not found in the STRING_32. Later I will add these to ZSTRING. The code will be the same.
The most useful of these is an operator which every Python programmer will miss if he switches to Eiffel.
template_demo
local
template: ASTRING; count: INTEGER
do
count := 2
template := "The value of $S is $S"
log_or_io.put_string (template #$ ["count", count])
log_or_io.put_new_line
end
This is equivalent to the Python code
def template_demo ():
count = 2
template = "The value of %s is %s"
print template % ("count", count)
Program output:
The value of count is 2
There is also a useful routine translate, which is similar to the Python String translate routine. It is used to either substitute or delete selected characters from a string.
There are about 10 other useful routines not found in STRING_32
Code EL_ZSTRING (alias ZSTRING)
EL_ZSTRING inherits from class EL_ZSTRING_8 which contains routines copied from STRING_8. EL_ZSTRING wouldn't work if I just inherited from STRING_8 due to various contract violations that are not possible to override. Since the class will be used a lot, it will be conveniently aliased as ZSTRING in the ECF.
Note that this is work in progress and there are still many routines to be implemented.
class
EL_ZSTRING
inherit
STRING_GENERAL
rename
append as append_unicode_general,
append_substring as append_substring_general,
code as raw_code,
prepend as prepend_string_general,
prepend_substring as prepend_substring_general,
same_caseless_characters as same_caseless_characters_general,
starts_with as starts_with_general,
ends_with as ends_with_general,
is_case_insensitive_equal as is_case_insensitive_equal_general
undefine
has, index_of, last_index_of, occurrences, out
redefine
-- Access
hash_code, out, as_string_32, to_string_32,
-- Element change
append_unicode_general, copy, prepend_string_general,
-- Comparison
is_equal, same_characters,
-- Conversion
split
select
has
end
EL_ZSTRING_8
rename
append as append_8,
ends_with as ends_with_8,
fill_character as fill_character_8,
has as has_8,
hash_code as hash_code_8,
item as item_8,
index_of as index_of_8,
keep_head as keep_head_8,
keep_tail as keep_tail_8,
linear_representation as linear_representation_8,
make as make_8,
occurrences as occurrences_8,
same_characters as same_characters_8,
same_string as same_string_8,
share as share_8,
string as string_8,
substring as substring_8,
String_searcher as String_8_searcher,
wipe_out as wipe_out_8
export
{EL_ZSTRING, EL_ZSTRING_SEARCHER} area_lower
undefine
copy, is_equal
redefine
changeable_comparison_criterion, make_from_string
end
EL_UNENCODED_CHARACTERS
rename
append as append_unencoded,
grow as unencoded_grow,
hash_code as unencoded_hash_code,
has as unencoded_has,
make as make_unencoded,
make_from_other as make_unencoded_from_other,
overlaps as overlaps_unencoded,
put_code as put_unencoded_code,
shifted as shifted_unencoded,
substring as unencoded_substring,
same_string as same_unencoded_string
export
{EL_ZSTRING} all
{EL_ZSTRING_SEARCHER, EL_ZSTRING} unencoded_code, set_unencoded
{STRING_HANDLER} unencoded, has_unencoded
undefine
is_equal, copy, out
redefine
set_unencoded
end
READABLE_INDEXABLE [CHARACTER_32]
undefine
out, copy, is_equal
end
INDEXABLE [CHARACTER_32, INTEGER]
undefine
copy, is_equal, out
redefine
prune_all,
changeable_comparison_criterion
select
linear_representation
end
RESIZABLE [CHARACTER_32]
undefine
copy, is_equal, out
redefine
changeable_comparison_criterion
end
EL_SHARED_ZCODEC
export
{NONE} all
undefine
is_equal, copy, out
end
EL_SHARED_ONCE_STRINGS
export
{NONE} all
undefine
is_equal, copy, out
end
EL_MODULE_UTF
export
{NONE} all
undefine
is_equal, copy, out
end
create
make, make_empty, make_from_string, make_from_unicode, make_from_latin_1, make_from_utf_8, make_shared,
make_from_other, make_filled, make_from_components, make_from_latin1_c
-- make_from_string_view
convert
make_from_unicode ({STRING_32}), make_from_latin_1 ({STRING_8}),
-- make_from_string_view ({EL_STRING_VIEW}),
to_unicode: {STRING_32}
feature {NONE} -- Initialization
make (n: INTEGER)
-- Allocate space for at least `n' characters.
do
make_8 (n)
make_unencoded
end
make_filled (uc: CHARACTER_32; n: INTEGER)
-- Create string of length `n' filled with `c'.
require
valid_count: n >= 0
do
make (n)
fill_character (uc)
ensure
count_set: count = n
area_allocated: capacity >= n
filled: occurrences (uc) = count
end
make_from_components (s: STRING; a_unencoded: like unencoded)
do
make_filled ('%U', s.count)
area.copy_data (s.area, 0, 0, s.count)
unencoded := a_unencoded
end
make_from_latin1_c (latin1_ptr: POINTER)
do
make_from_latin_1 (create {STRING}.make_from_c (latin1_ptr))
end
make_from_other (other: EL_ZSTRING)
do
area := other.area.twin
count := other.count
make_unencoded_from_other (other)
end
feature {NONE} -- Initialization
make_from_string (s: STRING)
-- initialize with string that has the same encoding as codec
do
Precursor (s)
make_unencoded
end
make_from_unicode, make_from_latin_1 (a_unicode: READABLE_STRING_GENERAL)
do
make_filled ('%U', a_unicode.count)
encode (a_unicode, 0)
end
make_from_utf_8 (utf_8: READABLE_STRING_8)
local
l_unicode: STRING_32
do
l_unicode := empty_once_string_32
UTF.utf_8_string_8_into_string_32 (utf_8, l_unicode)
make_from_unicode (l_unicode)
end
-- make_from_string_view (matched_text: EL_STRING_VIEW)
-- do
-- make_from_other (matched_text.to_string)
-- end
make_shared (other: like Current)
do
share (other)
end
feature -- Access
fuzzy_index (other: READABLE_STRING_GENERAL; start: INTEGER; fuzz: INTEGER): INTEGER
-- <Precursor>
do
-- Result := string_searcher.fuzzy_index (Current, other, start, count, fuzz)
end
hash_code: INTEGER
-- Hash code value
do
Result := internal_hash_code
if Result = 0 then
Result := unencoded_hash_code (hash_code_8)
internal_hash_code := Result
end
end
index_of (uc: CHARACTER_32; start_index: INTEGER): INTEGER
local
c: CHARACTER; uc_code: NATURAL; i, l_count: INTEGER; l_area: like area
do
c := codec.encoded_character (uc.natural_32_code)
if c = Unencoded_character then
uc_code := uc.natural_32_code
l_area := area; l_count := count
from i := start_index until Result > 0 or else i > l_count loop
if l_area [i - 1] = c and then uc_code = unencoded_code (i) then
Result := i
end
i := i + 1
end
else
Result := index_of_8 (c, start_index)
end
ensure then
valid_result: Result = 0 or (start_index <= Result and Result <= count)
zero_if_absent: (Result = 0) = not substring (start_index, count).has (uc)
found_if_present: substring (start_index, count).has (uc) implies item (Result) = uc
none_before: substring (start_index, count).has (uc) implies
not substring (start_index, Result - 1).has (uc)
end
item alias "[]", at alias "@" (i: INTEGER): CHARACTER_32 assign put
-- Unicode character at position `i'
local
c: CHARACTER
do
c := item_8 (i)
if c = Unencoded_character then
Result := unencoded_code (i).to_character_32
else
Result := codec.as_unicode (c)
end
end
last_index_of (uc: CHARACTER_32; start_index_from_end: INTEGER): INTEGER
local
c: CHARACTER; uc_code: NATURAL; i, l_count: INTEGER; l_area: like area
do
c := codec.encoded_character (uc.natural_32_code)
if c = Unencoded_character then
uc_code := uc.natural_32_code
l_area := area; l_count := count
from i := start_index_from_end until Result > 0 or else i = 0 loop
if l_area [i - 1] = c and then uc_code = unencoded_code (i) then
Result := i
end
i := i - 1
end
else
Result := last_index_of (c, start_index_from_end)
end
end
raw_code (i: INTEGER): NATURAL_32
-- code which can be latin or unicode depending on presence of unencoded characters
local
c: CHARACTER
do
c := area [i-1]
if c = Unencoded_character then
Result := unencoded_code (i)
else
Result := c.natural_32_code
end
end
share (other: like Current)
do
share_8 (other)
unencoded := other.unencoded
end
substring_index (other: READABLE_STRING_GENERAL; start_index: INTEGER): INTEGER
do
Result := String_searcher.substring_index (Current, adapted_general (other), start_index, count)
end
substring_index_in_bounds (other: READABLE_STRING_GENERAL; start_pos, end_pos: INTEGER): INTEGER
-- <Precursor>
do
-- Result := string_searcher.substring_index (Current, other, start_pos, end_pos)
end
unicode (i: INTEGER): NATURAL_32
local
c: CHARACTER
do
c := area [i - 1]
if c = Unencoded_character then
Result := unencoded_code (i)
else
Result := codec.as_unicode (c).natural_32_code
end
end
unicode_item (i: INTEGER): CHARACTER_32
do
Result := unicode (i).to_character_32
end
feature -- Measurement
index_set: INTEGER_INTERVAL
-- Range of acceptable indexes
do
create Result.make (1, count)
ensure then
index_set_not_void: Result /= Void
index_set_count: Result.count = count
end
occurrences (uc: CHARACTER_32): INTEGER
local
c: CHARACTER; i, nb: INTEGER; l_area: like area; uc_code: NATURAL
do
c := codec.encoded_character (uc.natural_32_code)
if c = Unencoded_character then
uc_code := uc.natural_32_code
l_area := area; nb := count
from i := 0 until i = nb loop
if l_area [i] = c and then uc_code = unencoded_code (i + 1) then
Result := Result + 1
end
i := i + 1
end
else
Result := occurrences_8 (c)
end
end
feature -- Status query
changeable_comparison_criterion: BOOLEAN = False
ends_with (str: READABLE_STRING_GENERAL): BOOLEAN
local
i, offset, l_count: INTEGER
do
if attached {like Current} str as zstring then
Result := ends_with_8 (zstring)
if Result and then zstring.has_unencoded then
-- match unencoded
l_count := zstring.count; offset := count - l_count
from i := 1 until i = l_count or else not Result loop
Result := Result and zstring.unencoded_code (i) = unencoded_code (i + offset)
i := i + 1
end
end
else
Result := ends_with (adapted_general (str))
end
end
extendible: BOOLEAN = True
-- May new items be added? (Answer: yes.)
has (uc: CHARACTER_32): BOOLEAN
-- Does string include `uc'?
local
c: CHARACTER
do
c := codec.encoded_character (uc.natural_32_code)
if c = Unencoded_character then
Result := unencoded_has (uc.natural_32_code)
else
Result := has_8 (c)
end
end
is_string_8: BOOLEAN = False
-- <Precursor>
is_string_32: BOOLEAN = True
-- <Precursor>
is_substring_whitespace (start_index, end_index: INTEGER): BOOLEAN
do
end
is_valid_as_string_8: BOOLEAN
do
Result := not has_unencoded
end
prunable: BOOLEAN
-- May items be removed? (Answer: yes.)
do
Result := True
end
valid_code (code: NATURAL_32): BOOLEAN
-- Is `code' a valid code for a CHARACTER_32?
do
Result := True
end
feature -- Element change
append (s: like Current)
local
old_count: INTEGER
do
if s.has_unencoded then
old_count := count
append_8 (s)
append_unencoded (s.shifted_unencoded (old_count))
else
append_8 (s)
end
end
append_character, extend (uc: CHARACTER_32)
do
append_unicode (uc.natural_32_code)
end
append_string (s: READABLE_STRING_GENERAL)
do
if attached {like Current} s as zstring then
append (zstring)
else
append_unicode_general (s)
end
end
append_unicode (c: like unicode)
-- Append `c' at end.
-- It would be nice to make this routine over ride 'append_code' but unfortunately
-- the post condition links it to 'code' and for performance reasons it is undesirable to have
-- code return unicode.
local
l_count: INTEGER
do
l_count := count + 1
if l_count > capacity then
resize (l_count)
end
set_count (l_count)
put_unicode (c, l_count)
ensure then
item_inserted: unicode (count) = c
new_count: count = old count + 1
stable_before: elks_checking implies substring (1, count - 1) ~ (old twin)
end
append_unicode_general (s: READABLE_STRING_GENERAL)
local
old_count: INTEGER
do
old_count := count
grow (old_count + s.count)
set_count (old_count + s.count)
encode (s, old_count)
reset_hash
ensure then
valid_unencoded_count: unencoded_count = occurrences_8 (Unencoded_character)
unicode_agrees: to_unicode ~ old (to_unicode + s)
end
fill_character (uc: CHARACTER_32)
local
c: CHARACTER
do
c := codec.encoded_character (uc.natural_32_code)
if c = Unencoded_character then
else
fill_character_8 (c)
end
end
left_adjust
do
end
prepend_string_general (s: READABLE_STRING_GENERAL)
do
end
put (c: CHARACTER_32; i: INTEGER)
-- Replace character at position `i' by `c'.
do
-- area.put (c, i - 1)
internal_hash_code := 0
ensure then
stable_count: count = old count
stable_before_i: elks_checking implies substring (1, i - 1) ~ (old substring (1, i - 1))
stable_after_i: elks_checking implies substring (i + 1, count) ~ (old substring (i + 1, count))
end
put_code (code: NATURAL_32; i: INTEGER)
-- Replace character at position `i' by character of code `v'.
do
end
put_unicode (a_code: NATURAL_32; i: INTEGER)
-- put unicode at i th position
require -- from STRING_GENERAL
valid_index: valid_index (i)
local
c: CHARACTER; l_area: like area
do
l_area := area
c := codec.encoded_character (a_code)
l_area [i - 1] := c
if c = Unencoded_character then
put_unencoded_code (a_code, i)
end
internal_hash_code := 0
ensure
inserted: unicode (i) = a_code
stable_count: count = old count
stable_before_i: Elks_checking implies substring (1, i - 1) ~ (old substring (1, i - 1))
stable_after_i: Elks_checking implies substring (i + 1, count) ~ (old substring (i + 1, count))
end
right_adjust
do
end
feature {EL_ZSTRING} -- Element change
set_unencoded (a_unencoded: like unencoded)
do
unencoded := a_unencoded
ensure then
valid_unencoded: is_unencoded_valid
end
feature -- Removal
keep_head (n: INTEGER)
-- Remove all characters except for the first `n';
-- do nothing if `n' >= `count'.
local
old_count: INTEGER
do
old_count := count
keep_head_8 (n)
if n < old_count and then has_unencoded then
set_unencoded (unencoded_substring (1, n).unencoded)
end
end
keep_tail (n: INTEGER)
-- Remove all characters except for the last `n';
-- do nothing if `n' >= `count'.
local
old_count: INTEGER
do
old_count := count
keep_tail_8 (n)
if n < old_count and then has_unencoded then
set_unencoded (unencoded_substring (old_count - n + 1, old_count).unencoded)
end
end
prune (uc: CHARACTER_32)
do
end
prune_all (uc: CHARACTER_32)
do
end
remove (i: INTEGER)
do
end
remove_tail (n: INTEGER)
-- Remove last `n' characters;
-- if `n' > `count', remove all.
require
n_non_negative: n >= 0
local
l_count: INTEGER
do
l_count := count
if n > l_count then
count := 0
internal_hash_code := 0
else
keep_head (l_count - n)
end
ensure
removed: elks_checking implies Current ~ (old substring (1, count - n.min (count)))
end
wipe_out
do
wipe_out_8
unencoded := Empty_unencoded
end
feature -- Conversion
as_lower: like Current
-- New object with all letters in lower case.
do
Result := twin
Result.to_lower
end
as_upper: like Current
-- New object with all letters in upper case
do
Result := twin
Result.to_upper
end
as_string_32, to_string_32, to_unicode: STRING_32
-- UCS-4
do
create Result.make_filled ('%/000/', count)
codec.decode (count, area, Result.area)
if has_unencoded then
write (Result)
end
end
linear_representation: LINEAR [CHARACTER_32]
do
end
plus alias "+" (s: READABLE_STRING_GENERAL): like Current
-- <Precursor>
do
Result := new_string (count + s.count)
Result.append (Current)
Result.append_unicode_general (s)
end
split (a_separator: CHARACTER_32): LIST [like Current]
-- Split on `a_separator'.
local
l_list: ARRAYED_LIST [like Current]; part: like Current; i, j, l_count: INTEGER
separator: CHARACTER; is_unencoded_separator: BOOLEAN
do
separator := codec.encoded_character (a_separator.natural_32_code)
is_unencoded_separator := separator = Unencoded_character
l_count := count
-- Worse case allocation: every character is a separator
if is_unencoded_separator then
create l_list.make (occurrences (a_separator) + 1)
else
create l_list.make (occurrences_8 (separator) + 1)
end
if l_count > 0 then
from i := 1 until i > l_count loop
if is_unencoded_separator then
j := index_of (a_separator, i)
else
j := index_of_8 (separator, i)
end
if j = 0 then
-- No separator was found, we will
-- simply create a list with a copy of
-- Current in it.
j := l_count + 1
end
part := substring (i, j - 1)
l_list.extend (part)
i := j + 1
end
if j = l_count then
check
last_character_is_a_separator: item (j) = a_separator
end
-- A separator was found at the end of the string
l_list.extend (new_string (0))
end
else
-- Extend empty string, since Current is empty.
l_list.extend (new_string (0))
end
Result := l_list
check
l_list.count = occurrences (a_separator) + 1
end
end
to_upper
do
end
to_lower
do
end
feature -- Duplication
copy (other: like Current)
do
end
substring (start_index, end_index: INTEGER): like Current
-- Copy of substring containing all characters at indices
-- between `start_index' and `end_index'
do
-- Cannot use Precursor because post-conditions will become invalid
if (1 <= start_index) and (start_index <= end_index) and (end_index <= count) then
Result := new_string (end_index - start_index + 1)
Result.area.copy_data (area, start_index - 1, 0, end_index - start_index + 1)
Result.set_count (end_index - start_index + 1)
else
Result := new_string (0)
end
if has_unencoded then
Result.set_unencoded (unencoded_substring (start_index, end_index).unencoded)
end
ensure then
valid_unencoded_count: Result.unencoded_count = Result.occurrences_8 (Unencoded_character)
end
feature -- Comparison
is_equal (other: like Current): BOOLEAN
local
l_count: INTEGER; l_hash, l_other_hash: like internal_hash_code
do
if other = Current then
Result := True
else
l_count := count
if l_count = other.count then
-- Let's compare the content if and only if the hash_code are the same or not yet computed.
l_hash := internal_hash_code
l_other_hash := other.internal_hash_code
if l_hash = 0 or else l_other_hash = 0 or else l_hash = l_other_hash then
Result := same_string (other)
end
end
end
end
is_less alias "<" (other: like Current): BOOLEAN
-- Is string lexicographically lower than `other'?
local
other_count, l_count: INTEGER
do
if other /= Current then
other_count := other.count
l_count := count
if other_count = l_count then
Result := order_comparison (other_count, other) > 0
else
if l_count < other_count then
Result := order_comparison (l_count, other) >= 0
else
Result := order_comparison (other_count, other) > 0
end
end
end
end
same_characters (other: READABLE_STRING_GENERAL; start_pos, end_pos, index_pos: INTEGER): BOOLEAN
local
l_has_unencoded, other_has_unencoded: BOOLEAN
do
if attached {like Current} other as z_other then
if has_unencoded or else z_other.has_unencoded then
Result := same_unencoded_string (z_other) and then same_characters_8 (z_other, start_pos, end_pos, index_pos)
else
Result := same_characters_8 (z_other, start_pos, end_pos, index_pos)
end
else
Result := Precursor (other, start_pos, end_pos, index_pos)
end
end
feature {EL_ZSTRING} -- Contract Support
is_unencoded_valid: BOOLEAN
local
i, j, lower, upper, l_count, sum_count, array_count: INTEGER
l_unencoded: like unencoded; l_area: like area
do
l_area := area; l_unencoded := unencoded; array_count := l_unencoded.count
Result := True
from i := 0 until not Result or else i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
l_count := upper - lower + 1
from j := lower until not Result or else j > upper loop
Result := Result and l_area [j - 1] = Unencoded_character
j := j + 1
end
sum_count := sum_count + l_count
i := i + l_count + 2
end
Result := Result and occurrences_8 (Unencoded_character) = sum_count
end
feature {STRING_HANDLER} -- Implementation
frozen set_count (number: INTEGER)
-- Set `count' to `number' of characters.
do
count := number
reset_hash
end
feature {NONE} -- Implementation
adapted_general (str: READABLE_STRING_GENERAL): like Current
do
if attached {like Current} str as zstr then
Result := zstr
else
create Result.make_from_unicode (str)
end
end
encode (a_unicode: READABLE_STRING_GENERAL; area_offset: INTEGER)
require
valid_area_offset: a_unicode.count > 0 implies area.valid_index (a_unicode.count + area_offset - 1)
local
l_unencoded: like Once_extendible_unencoded
do
l_unencoded := Once_extendible_unencoded; l_unencoded.wipe_out
codec.encode (a_unicode, area, area_offset, l_unencoded)
if l_unencoded.unencoded.count > 0 then
if has_unencoded then
append_unencoded (l_unencoded)
else
unencoded := l_unencoded.unencoded_copy
end
end
end
new_string (n: INTEGER): like Current
-- New instance of current with space for at least `n' characters.
do
create Result.make (n)
end
order_comparison (n: INTEGER; other: like Current): INTEGER
-- Compare `n' characters from `area' starting at `area_lower' with
-- `n' characters from and `other' starting at `other.area_lower'.
-- 0 if equal, < 0 if `Current' < `other', > 0 if `Current' > `other'
require
other_not_void: other /= Void
n_non_negative: n >= 0
n_valid: n <= (area.upper - other.area_lower + 1) and n <= (other.area.upper - area_lower + 1)
local
i, j, nb, index_other, index: INTEGER; l_code, other_code: NATURAL; c, other_c: CHARACTER
other_area, l_area: like area
do
l_area := area; other_area := other.area; index := area_lower; index_other := other.area_lower
from i := index_other; nb := i + n; j := index until i = nb loop
c := l_area [i]; other_c := other_area [i]
if c = Unencoded_character then
-- Do Unicode comparison
l_code := unencoded_code (i + 1)
if other_c = Unencoded_character then
other_code := other.unencoded_code (i + 1)
else
other_code := codec.as_unicode (other_c).natural_32_code
end
else
if other_c = Unencoded_character then
-- Do Unicode comparison
l_code := codec.as_unicode (c).natural_32_code
other_code := other.unencoded_code (i + 1)
else
l_code := c.natural_32_code
other_code := other_c.natural_32_code
end
end
if l_code /= other_code then
if l_code < other_code then
Result := (other_code - l_code).to_integer_32
else
Result := -(l_code - other_code).to_integer_32
end
i := nb - 1 -- Jump out of loop
end
i := i + 1; j := j + 1
end
end
reset_hash
do
internal_hash_code := 0
end
feature {NONE} -- Constants
Codec: EL_ZCODEC
once
Result := system_codec
end
Once_like_current: EL_ZSTRING
once
create Result.make (0)
end
Once_extendible_unencoded: EL_EXTENDABLE_UNENCODED_CHARACTERS
do
create Result.make
end
String_searcher: EL_ZSTRING_SEARCHER
once
create Result.make
end
Unencoded_character: CHARACTER = '%/026/'
end
Code EL_UNENCODED_CHARACTERS
note
description: "[
Representation of consecutive substrings in a STRING_32 string that could not be encoded using
a latin character set. The substring are held in the array unecoded: SPECIAL [CHARACTER_32]
Each substring is prececded by two 32 bit characters representing the lower and upper index.
]"
author: "Finnian Reilly"
class
EL_UNENCODED_CHARACTERS
create
make, make_from_other
feature {NONE} -- Initialization
make
do
unencoded := Empty_unencoded
end
make_from_other (other: EL_UNENCODED_CHARACTERS)
do
if other.has_unencoded then
unencoded := other.unencoded.twin
else
make
end
end
feature -- Access
hash_code (seed: INTEGER): INTEGER
-- Hash code value
local
i, offset, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded
do
Result := seed
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
from i := i + 2; offset := 0 until offset = count loop
-- The magic number `8388593' below is the greatest prime lower than
-- 2^23 so that this magic number shifted to the left does not exceed 2^31.
Result := ((Result \\ 8388593) |<< 8) + l_unencoded.item (i + offset).to_integer_32
offset := offset + 1
end
i := i + count
end
end
first_lower: INTEGER
local
l_unencoded: like unencoded
do
l_unencoded := unencoded
if l_unencoded.count > 0 then
Result := l_unencoded.item (0).to_integer_32
end
end
last_upper: INTEGER
-- count when substrings are expanded to original source string
local
i, lower, upper, array_count: INTEGER; l_unencoded: like unencoded
do
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
i := i + upper - lower + 3
end
Result := upper
end
unencoded_count: INTEGER
-- count when substrings are expanded to original source string
local
i, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded
do
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
Result := Result + count
i := i + count + 2
end
end
unencoded_code (index: INTEGER): NATURAL
local
i, lower, upper, array_count: INTEGER; l_unencoded: like unencoded
do
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until Result > 0 or else i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
i := i + 2
if lower <= index and then index <= upper then
Result := l_unencoded [i + index - lower]
end
i := i + upper - lower + 1
end
end
intervals: ARRAYED_LIST [INTEGER_INTERVAL]
local
i, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded
do
create Result.make (3)
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
Result.extend (lower |..| upper)
i := i + count + 2
end
end
feature -- Access attributes
unencoded: SPECIAL [NATURAL]
feature -- Status query
has (unicode: NATURAL): BOOLEAN
local
i, j, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded
do
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until Result or else i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
from j := 1 until Result or else j > count loop
if l_unencoded [i + 1 + j] = unicode then
Result := True
end
j := j + 1
end
i := i + count + 2
end
end
has_unencoded: BOOLEAN
do
Result := unencoded.count > 0
end
overlaps (start_index, end_index: INTEGER): BOOLEAN
do
Result := not (end_index < first_lower) and then not (start_index > last_upper)
end
feature -- Comparison
same_string (other: EL_UNENCODED_CHARACTERS): BOOLEAN
local
l_unencoded: like unencoded
do
l_unencoded := unencoded
if l_unencoded.count = other.unencoded.count then
Result := l_unencoded.same_items (other.unencoded, 0, 0, l_unencoded.count)
end
end
feature -- Element change
append (other: EL_UNENCODED_CHARACTERS)
require
other_has_unencoded: other.has_unencoded
local
i, lower, upper, count, array_count: INTEGER; l_unencoded, other_unencoded: like unencoded
do
if has_unencoded then
other_unencoded := other.unencoded
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
i := i + count + 2
end
if upper + 1 = other.first_lower then
-- merge intervals
l_unencoded := resized (l_unencoded, array_count + other_unencoded.count - 2)
l_unencoded.copy_data (other_unencoded, 2, i, other_unencoded.count - 2)
l_unencoded.put (other_unencoded [1], i - count - 1)
else
l_unencoded := resized (l_unencoded, array_count + other_unencoded.count)
l_unencoded.copy_data (other_unencoded, 0, i, other_unencoded.count)
end
else
unencoded := other.unencoded
end
ensure
valid_count: unencoded_count = old unencoded_count + other.unencoded_count
end
append_interval (a_unencoded: like unencoded; source_index, lower, upper: INTEGER)
local
old_count, count: INTEGER; l_unencoded: like unencoded
do
l_unencoded := unencoded; old_count := unencoded.count; count := upper - lower + 1
l_unencoded := resized (l_unencoded, old_count + count + 2)
l_unencoded.put (lower.to_natural_32, old_count)
l_unencoded.put (upper.to_natural_32, old_count + 1)
l_unencoded.copy_data (a_unencoded, source_index, old_count + 2, count)
ensure
count_increased_by_count: unencoded_count = old unencoded_count + upper - lower + 1
end
put_code (code: NATURAL; index: INTEGER)
local
i, previous_i, previous_upper, lower, upper, count: INTEGER; found: BOOLEAN
l_unencoded: like unencoded
do
l_unencoded := unencoded
from i := 0 until found or i = l_unencoded.count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
if lower <= index and then index <= upper then
-- update interval
l_unencoded.put (code, i + 2 + index - lower)
found := True
elseif upper + 1 = index then
-- extend interval right
count := count + 1
l_unencoded := extended (i + count + 1, 0, 0, code)
l_unencoded.put (index.to_natural_32, i + 1)
found := True
elseif lower - 1 = index then
-- extend interval left
count := count + 1
lower := lower - 1
l_unencoded := extended (i + 2, 0, 0, code)
l_unencoded.put (index.to_natural_32, i)
found := True
elseif previous_upper < index and then index < lower then
-- insert a new interval
l_unencoded := extended (previous_i, index, index, code)
i := i + 3
found := True
end
if previous_upper > 0 and then previous_upper + 1 = lower then
-- Merge interval with previous
l_unencoded.overlapping_move (i + 2, i, l_unencoded.count - i - 2)
l_unencoded.remove_tail (2)
i := previous_i; upper := previous_upper + count
lower := l_unencoded.item (i).to_integer_32
l_unencoded.put (upper.to_natural_32, i + 1)
count := upper - lower + 1
end
previous_i := i; previous_upper := upper
i := i + count + 2
end
if not found then
-- append a new interval
l_unencoded := extended (l_unencoded.count, index, index, code)
end
end
shift (n: INTEGER)
-- shift intervals right by n characters. n < 0 shifts to the left.
local
i, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded
do
if n /= 0 then
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
l_unencoded.put ((lower + n).to_natural_32, i)
l_unencoded.put ((upper + n).to_natural_32, i + 1)
i := i + count + 2
end
end
end
set_unencoded (a_unencoded: like unencoded)
do
unencoded := a_unencoded
end
feature -- Basic operations
write (expanded_str: STRING_32)
-- write substrings into expanded string 'str'
require
string_big_enough: last_upper <= expanded_str.count
local
i, j, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded
do
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
from j := 1 until j > count loop
expanded_str.put_code (l_unencoded.item (i + j + 1), lower + j - 1)
j := j + 1
end
i := i + count + 2
end
end
feature -- Duplication
substring (start_index, end_index: INTEGER): EL_UNENCODED_CHARACTERS
local
i, lower, upper, count, array_count: INTEGER
l_unencoded: like unencoded
do
create Result.make
l_unencoded := unencoded; array_count := l_unencoded.count
from i := 0 until i = array_count loop
lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32
count := upper - lower + 1
if lower <= start_index and then end_index <= upper then
-- Append full interval
Result.append_interval (l_unencoded, i + 2 + (start_index - lower), start_index, end_index)
elseif start_index <= lower and then upper <= end_index then
-- Append full interval
Result.append_interval (l_unencoded, i + 2, lower, upper)
elseif lower <= end_index and then end_index <= upper then
-- Append left section
Result.append_interval (l_unencoded, i + 2, lower, end_index)
elseif lower <= start_index and then start_index <= upper then
-- Append right section
Result.append_interval (l_unencoded, i + 2 + (start_index - lower), start_index, upper)
end
i := i + count + 2
end
Result.shift (-(start_index - 1))
end
shifted (n: INTEGER): EL_UNENCODED_CHARACTERS
do
create Result.make_from_other (Current)
Result.shift (n)
end
feature {NONE} -- Implementation
resized (old_unencoded: like unencoded; new_count: INTEGER): like unencoded
local
i, delta: INTEGER
do
Result := old_unencoded
if new_count > Result.capacity then
grow (new_count)
Result := unencoded
end
delta := new_count - Result.count
from i := 1 until i > delta loop
Result.extend (0)
i := i + 1
end
end
grow (new_count: INTEGER)
local
l_unencoded: like unencoded
do
l_unencoded := unencoded
if new_count > l_unencoded.capacity then
create unencoded.make_empty (new_count + l_unencoded.capacity // 2)
unencoded.insert_data (l_unencoded, 0, 0, l_unencoded.count)
end
ensure
big_enough: new_count <= unencoded.capacity
end
extended (destination_index, lower, upper: INTEGER; code: NATURAL): like unencoded
local
insert: like unencoded
do
Result := unencoded; insert := Unencoded_insert; insert.wipe_out
if lower > 0 then
insert.extend (lower.to_natural_32)
end
if upper > 0 then
insert.extend (upper.to_natural_32)
end
insert.extend (code)
if Result.count + insert.count > Result.capacity then
grow (unencoded.count + insert.count)
Result := unencoded
end
Result.insert_data (insert, 0, destination_index, insert.count)
end
feature {NONE} -- Constants
Empty_unencoded: SPECIAL [NATURAL]
once
create Result.make_empty (0)
end
Unencoded_insert: SPECIAL [NATURAL]
once
create Result.make_empty (3)
end
Minimum_capacity: INTEGER = 3
end
Code EL_ZCODEC
deferred class
EL_ZCODEC
inherit
EL_MODULE_UTF
STRING_HANDLER
feature {NONE} -- Initialization
make
do
create latin_characters.make_filled ('%U', 1)
unicode_table := create_unicode_table
initialize_latin_sets
end
feature -- Access
id: INTEGER
deferred
end
name: STRING
do
create Result.make_from_string (type)
Result.append_character ('-')
Result.append_integer (id)
end
type: STRING
deferred
end
feature -- Basic operations
decode (a_count: INTEGER; latin_chars_in: SPECIAL [CHARACTER]; unicode_chars_out: SPECIAL [CHARACTER_32])
-- Replace Ctrl characters used as place holders for foreign characters with original unicode characters.
-- If 'a_decode' is true encode output as unicode
-- Result is count of unencodeable Unicode characters
require
enough_latin_chars_in: latin_chars_in.count > a_count
unicode_chars_out_big_enough: unicode_chars_out.count > a_count
local
i, code: INTEGER; c: CHARACTER; l_unicodes: like unicode_table
do
l_unicodes := unicode_table
from i := 0 until i = a_count loop
c := latin_chars_in [i]; code := c.code
if c /= Unencoded_character then
unicode_chars_out [i] := l_unicodes.item (code)
end
i := i + 1
end
end
encode (
unicode_in: READABLE_STRING_GENERAL; latin_chars_out: SPECIAL [CHARACTER]; out_offset: INTEGER;
unencoded_characters: EL_EXTENDABLE_UNENCODED_CHARACTERS
)
-- encode unicode characters as latin
-- Set unencodeable characters as the Substitute character (26) and record location in unencoded_intervals
require
latin_chars_out_big_enough: latin_chars_out.count >= unicode_in.count
valid_offset_and_count: unicode_in.count > 0 implies latin_chars_out.valid_index (unicode_in.count + out_offset - 1)
local
i, count, unicode: INTEGER; uc: CHARACTER_32; c: CHARACTER; l_unicodes: like unicode_table
do
l_unicodes := unicode_table; count := unicode_in.count
from i := 1 until i > count loop
uc := unicode_in [i]; unicode := uc.code
if unicode <= 255 and then l_unicodes [unicode] = uc then
latin_chars_out [i + out_offset - 1] := uc.to_character_8
else
c := latin_character (uc, unicode)
if c.code = 0 then
latin_chars_out [i + out_offset - 1] := Unencoded_character
unencoded_characters.extend (uc, i + out_offset)
else
latin_chars_out [i + out_offset - 1] := c
end
end
i := i + 1
end
end
feature -- Conversion
as_unicode (c: CHARACTER): CHARACTER_32
-- lookup for character which there is not a place holder
do
Result := unicode_table [c.code]
end
encoded_character (unicode: NATURAL_32): CHARACTER
local
uc: CHARACTER_32; index: INTEGER
do
uc := unicode.to_character_32; index := unicode.to_integer_32
if unicode <= 255 and then unicode_table [index] = uc then
Result := uc.to_character_8
else
Result := latin_character (uc, index)
if Result.code = 0 then
Result := Unencoded_character
end
end
end
latin_character (uc: CHARACTER_32; unicode: INTEGER): CHARACTER
-- unicode to latin translation
-- Returns '%U' if translation is the same as ISO-8859-1 or else not in current set
deferred
ensure
valid_latin: Result /= '%U' implies unicode_table [Result.code] = uc
end
feature {EL_ZSTRING} -- Implementation
create_unicode_table: SPECIAL [CHARACTER_32]
-- map latin to unicode
deferred
end
initialize_latin_sets
deferred
end
latin_set_from_array (array: ARRAY [INTEGER]): SPECIAL [CHARACTER]
do
create Result.make_empty (array.count)
across array as c loop
Result.extend (c.item.to_character_8)
end
end
single_byte_unicode_chars: SPECIAL [CHARACTER_32]
local
i: INTEGER
do
create Result.make_filled ('%U', 256)
from i := 0 until i > 255 loop
Result [i] := i.to_character_32
i := i + 1
end
end
feature {NONE} -- Internal attributes
latin_characters: SPECIAL [CHARACTER]
unicode_table: like create_unicode_table
-- map latin to unicode
feature {NONE} -- Constants
Unencoded_character: CHARACTER = '%/026/'
end
Code STRING_BENCHMARK
deferred class
STRING_BENCHMARK
inherit
MODULE_HEXAGRAM
EL_MODULE_EIFFEL
MEMORY
feature {NONE} -- Initialization
make (a_number_of_runs: like number_of_runs)
do
number_of_runs := a_number_of_runs
create string_list.make (64)
create tests.make (23)
end
feature -- Access
tests: ARRAYED_LIST [
TUPLE [routines: STRING_32; input_format: STRING; output_count: INTEGER; average_time: DOUBLE; storage_size: INTEGER]
]
feature -- Factory
new_string (unicode: READABLE_STRING_GENERAL): STRING_GENERAL
deferred
end
new_string_with_count (n: INTEGER): STRING_GENERAL
deferred
end
feature -- Basic operations
do_test (
routines: STRING_32; input_format: STRING; is_string_32_input: BOOLEAN
procedure: PROCEDURE [STRING_BENCHMARK, TUPLE [like input_strings]]
)
require
valid_input_format: across input_format as c all c.item.is_alpha implies c.item.is_upper end
local
timer: EL_EXECUTION_TIMER; i, l_storage_bytes, output_count: INTEGER; args: TUPLE [like input_strings]
l_input_format: STRING
do
create args
args.put (input_strings (input_format, is_string_32_input), 1)
full_collect
create timer.make
from i := 1 until i > number_of_runs loop
procedure.call (args)
if i + 1 > number_of_runs then
across string_list as string loop
l_storage_bytes := l_storage_bytes + storage_bytes (string.item)
end
end
output_count := string_list.count
string_list.wipe_out; full_collect
i := i + 1
end
timer.stop
if input_format.has ('$') then
create l_input_format.make (input_format.count + 2)
l_input_format.append_character ('"')
l_input_format.append (input_format)
l_input_format.append_character ('"')
else
create l_input_format.make (input_format.count + 2)
l_input_format.append ("<< ")
l_input_format.append (input_format)
l_input_format.append (" >>")
end
tests.extend ([routines, l_input_format, output_count, timer.elapsed_millisecs / number_of_runs, l_storage_bytes])
end
feature -- Benchmark tests
join_all (a_input_strings: like input_strings)
local
target: like new_string_with_count
do
target := new_string_with_count (joined_count (Hexagram.Chinese_characters, False))
across a_input_strings as strings loop
append (target, strings.item.first)
end
string_list.extend (target)
end
join (a_input_strings: like input_strings)
local
target: like new_string
string_general: READABLE_STRING_GENERAL
do
across a_input_strings as array loop
target := new_string_with_count (joined_count (array.item, True))
across array.item as string loop
if string.cursor_index = 1 then
append (target, space)
end
append (target, string.item)
end
string_list.extend (target)
end
end
find_word (word: like new_string; a_input_strings: like input_strings)
local
index: INTEGER
do
across a_input_strings as strings loop
if attached {like new_string} strings.item.first as string then
index := string.substring_index (word, 1)
end
end
end
sort (a_input_strings: like input_strings)
local
sortable: EL_SORTABLE_ARRAYED_LIST [like new_string]
do
create sortable.make (a_input_strings.count)
across a_input_strings as strings loop
if attached {like new_string} strings.item.first as string then
sortable.extend (string)
end
end
sortable.sort
string_list := sortable
end
split (a_input_strings: like input_strings)
do
across a_input_strings as strings loop
if attached {like new_string} strings.item.first as string then
across string.split (' ') as word loop
string_list.extend (word.item)
end
end
end
end
feature {NONE} -- Implementation
append (target: like new_string; s: READABLE_STRING_GENERAL)
deferred
end
to_string_32 (string: like new_string): STRING_32
deferred
end
input_strings (format: STRING; is_string_32_input: BOOLEAN): ARRAYED_LIST [ARRAYED_LIST [READABLE_STRING_GENERAL]]
local
columns: like input_columns; parts_32, parts: ARRAYED_LIST [STRING_GENERAL]
i: INTEGER
do
columns := input_columns (format)
create Result.make (64)
across Hexagram.string_arrays as array loop
create parts_32.make (columns.count)
from i := 1 until i > columns.upper loop
parts_32.extend (array.item [columns [i]])
i := i + 1
end
if format.has ('$') then
create parts.make (1)
parts.extend (new_string_general (joined_count (parts_32, True), is_string_32_input))
across parts_32 as part loop
if part.cursor_index > 1 then
parts.last.append (space)
end
if is_string_32_input then
parts.last.append (part.item)
else
append (parts.last, part.item)
end
end
else
create parts.make (parts_32.count)
across parts_32 as part_32 loop
if is_string_32_input then
parts.extend (part_32.item)
else
parts.extend (new_string (part_32.item))
end
end
end
Result.extend (parts)
end
end
new_string_general (n: INTEGER; is_string_32_input: BOOLEAN): STRING_GENERAL
do
if is_string_32_input then
create {STRING_32} Result.make (n)
else
Result := new_string_with_count (n)
end
end
input_columns (format: STRING): ARRAY [INTEGER]
local
c: CHARACTER; l_array: ARRAYED_LIST [CHARACTER]; i: INTEGER
do
create l_array.make (4)
if format.has ('$') then
across format.split (' ') as str loop
l_array.extend (str.item [str.item.count])
end
else
across format.split (',') as str loop
l_array.extend (str.item [1])
end
end
create Result.make (1, l_array.count)
from i := 1 until i > l_array.count loop
c := l_array [i]
Result [i] := c.natural_32_code.to_integer_32 - {ASCII}.upper_a + 1
i := i + 1
end
end
joined_count (parts: LIST [READABLE_STRING_GENERAL]; with_separators: BOOLEAN): INTEGER
do
if with_separators then
Result := parts.count - 1
end
across parts as part loop
Result := Result + part.item.count
end
end
storage_bytes (s: like new_string): INTEGER
deferred
end
feature {NONE} -- Internal attributes
number_of_runs: INTEGER
string_list: ARRAYED_LIST [like new_string]
feature {NONE} -- Constants
Space: STRING = " "
end
Bug in blogging software
For Eiffel Software: please note that due a strange bug in this blogging software I had to change a line of Eiffel code so the characters "[code]" do not appear in it. They cause the rest of the article to become garbled.