Let's Talk About Strings

by Simon Hudon (modified: 2008 Oct 15)

strings
aliasing

It seems to me that the recent debate on immutable strings is misplaced. It seems related to the fact that:

strings are not duplicated for efficiency reasons.
since strings are mutable, it creates more aliasing than what we would care to have.

I have seen mentioned nowhere that in very few occasions we are interested in a string's identify i.e. we want to share it to allow an elegant change propagation using aliasing.

A debate similar in content seamed to have taken place when Eiffel Software decided to make expanded deprecated as an entity attribute. The point was that making a type of objects expanded was an important design decision and, therefore, it should be used in the body of the class affected.

I have not seen the terminology used anywhere else but I'd dare call expanded classes pure abstract data type. The reason for this is that whether we implement commands as functions or as procedures with those does not change the semantic since they don't have any aliasing properties.

By opposition, I would put a greater emphasis on the abstract machine nature of the other classes. In those cases, object identity has a most important role to play.

Coming back to strings, I think the nature of strings as a pure abstract data type is predominant compared to its role as a machine since its identity is rarely needed. I would go further and add that it is so rare that we care about a strings identity that we could create a class STRING_XX_REF and we would see it appear as rarely as INTEGER_XX_REF.

The aim of immutable strings, if I get it right, is to get rid of the devious effect of aliasing. I think it's greater quality is its efficiency and I don't think it is sufficient to support its wide adoption.

An interesting example that shows some problems that exists with strings and that would not disappear with the advent of immutable strings is the case of collections of strings. Systematically, when someone wants to create a collection of strings, he has to develop the reflex of setting the collection with object semantics. In other words, most of the time, the default value for collection of strings is wrong.

To address the efficiency issue, I think it is possible to limit the copy of expanded strings by copying only the reference of the SPECIAL implementing the string whenever a it is copied. Also, a flag can be kept to tell whether or not the string has been copied and to ensure that the SPECIAL will be duplicated before write operations if it is shared.

The downside of this precise implementation is that it is pessimistic and a strings implementation can be duplicated even if it is no longer shared because the `shared' flag is not shared. This can be seen for example if a string is copied and both copies are modified. In that case, the representation will be duplicated twice instead of once. As far as I know, it's the best we can do without changing the runtime to allow reference counters.

In summary, what I reproach the immutable string solution is that it only partially solves the problem caused by mutable and aliasable strings. Furthermore, it splits an abstraction in several classes which would benefit from a consolidation instead.

Simon

Comments

Franck Arnaud (16 years ago 12/11/2008)
expanded classes are neither immutable nor eliminating aliasing

it just pushes the problem one step away:

expanded class PERSON name: STRING set_name (a: STRING) is do name := a end end

a,b: PERSON -- expanded! l_name: STRING l_name := "Bob" a.set_name (l_name) b.set_name (l_name) a.name.append_string (" is a fool!") -- b.name has changed!

The example works as well if you had immutable strings, replace the string attribute with any other mutable attribute.

A better pattern for immutable classes is procedure-less classes -- no functions, could call them functional classes :) -- which are immutable if CQS is respected (and for which expandedness has no impact, and therefore is of no use).

The aliasing issue remains the same: aliasing disappears only for transitively functional classes, as it does for transitively expanded classes.
- Simon Hudon (16 years ago 13/11/2008)
  expanded must be applied to the STRING class
  
  In your example, PERSON is not aliasable. Whatever you do, it is impossible to have to references to the same PERSON instance. In my proposal, I say that STRING should be expanded and, of course, the implementation should make sure that no harmful aliasing can happen.
  
  One important thing to do for this is to make sure the client cannot manipulate directly the implementation. Among other things, this means that there should be no setter for the internal representation of the string.
  
  On the other hand, the copy mechanism of the string should be carefully designed to make sure that, if the representation is aliased, a flag is set so that no side effect is applied to the representation until it has been duplicated.
  
  In your example, expanded has no effect because it is applied to the wrong class. If you apply it to STRING, you will see that the last statement has no effect at all. If I'm not mistaken, a.name will yield a copy of a's name which will be mutated and lost.
  
  Simon
  - Franck Arnaud (16 years ago 14/11/2008)
    No need for expanded
    
    My argument is that you were apparently propagating the common fallacy that expanded classes are free of aliasing issues. They are so only if all their attributes are also expanded, which makes the feature considerably more brittle and less useful. It's particularly misleading if naive people get thinking they can rely on this property for any expanded class.
    
    The aliasing properties of an expanded class are not enforced by any validity constraint, and cannot be known from the flat short of a class (even if it doesn't export any routine it could still be subject to transitive aliasing). So expanded is just a hint that a class might, or might not, have interesting properties re aliasing.
    
    And given that you can do all you're trying to with aliasing with immutable (no-command) classes, with more or less the same balance of advantages and disadvantages, is there any use left for this feature? I don't think so.
    
    If expanded was removed from, or hadn't been in the language in the first place, we'd just have a better, cleaner, less confusing language at no loss of expressiveness.
Simon Hudon (16 years ago 16/11/2008)
No need for string identity

> My argument is that you were apparently propagating the common > fallacy that expanded classes are free of aliasing issues. They are so > only if all their attributes are also expanded, which makes the feature > considerably more brittle and less useful. It's particularly misleading > if naive people get thinking they can rely on this property for any > expanded class. > The aliasing properties of an expanded class are not enforced by any > validity constraint, and cannot be known from the flat short of a class > (even if it doesn't export any routine it could still be subject to transitive > aliasing). So expanded is just a hint that a class might, or might not, > have interesting properties re aliasing. All I say is that an expanded class cannot be aliased although it can cause aliasing of other objects. Regarding the aliasing of a string's representation, it is up to the implementation of STRING to make sure that no side effects are applied to aliased object. It is doable and it is a sufficient statement for the issue I am trying to address. The matter at hand is to determine what use we can make of a string's identity. In most cases, there is no use for it and this is why we should compare its semantics to that of integers and reals. > And given that you can do all you're trying to with aliasing with > immutable (no-command) classes, with more or less the same > balance of advantages and disadvantages, is there any use left > for this feature? I don't think so. I think that, by saying that, you miss the point completely. If we use reference type as the default for strings, and we have two string references s1 and s2, then s1 = s2 Would yield a correct true result really quickly but the negative result cannot be relied upon. We end up having to use equal in this case (which is what would be used in the case of expanded types). Maybe we can think that we save on copy since a reference copy is done in constant time. Well, we can implement an expanded string copy in constant time too so the argument for immutable strings is not very strong here either. What we have with reference type strings is what we could call conceptual pollution. We have two distinct notions of what a particular (reference type) string is: its identity and its value. We know its identity has no use at all since the strings would be immutable so they can only be the source of confusion and should be eliminated altogether (in the case of strings). Hence, let's use expanded classes to model regular strings. > If expanded was removed from, or hadn't been in the language > in the first place, we'd just have a better, cleaner, less confusing > language at no loss of expressiveness. Would you rather have only reference types for integers, characters and reals? I wouldn't. Saying that it is much more efficient to have them as expanded types is merely an interesting addition. The matter is that expanded classes capture the right abstraction for those classes. It is also true for strings. It is all a matter of thinking in terms of the right abstraction. I'm not sure I said it often enough but seeking the right abstraction is what should drive the activity of designing classes and it leads directly to the conclusion that making strings expanded is far superior to making them immutable. Simon
- Colin Adams (16 years ago 16/11/2008)
 And hash tables?
 
 As far as I can follow your argument, then it applies to hash tables too. Or any other container class. Again, the negative result from reference comparison doesn't tell us anything interesting. Do you think they should be expanded too? Colin Adams
 - Simon Hudon (16 years ago 17/11/2008)
 Hash table has not been required to be immutable, yet
 
 An important difference with hash tables is that nobody has suggested to make them immutable yet. It would seem that aliasing hash tables create less problems than aliasing strings. I think for such classes, we are used to be careful and we duplicate them explicitly whenever we want to avoid aliasing.
 
 Object identity is only made useless in the case of immutable objects.
 
 Simon
 - Simon Hudon (16 years ago 17/11/2008)
 Immutable collections
 
 On second thought, there is one library where the notion of immutable collections is necessary: the MML library which provides mathematical models to write specifications with. Consistently, I am strongly in favor of making every exposed class expanded and I don't think I would add classes such as MML_SET_REF since there is no need for them to be mutable (at least as far as I can see). I don't exclude it yet, though.
 
 Simon
- Franck Arnaud (16 years ago 19/11/2008)
 You can have object equality without expanded
 
 I agree that if expanded is removed, including for basic types, you need to adapt the rules so that '=' points to object equality. I had left that as an exercise to the reader. Shouldn't be too taxing (e.g. have an altogether distinct operator or function for reference equality, frankly it would be clearer, it's not as if you actually need ref equality very often). Some exercises are left to the reader re tidying up ref semantics for numbers (one can think of a few angles) though I don't expect show stopping problems to pop up there (in a new language).
 
 Performance is a moot point because whether you live conceptually in a reference-only model or an expanded one, a compiler can optimise one model into the implied 'natural' implementation of the other: that is implement expanded semantics with hidden pointers (what Eiffel compilers have often done until now for custom expandeds) or implement immutable references with hidden copying (trivially when no ref equality use is detected in a program), so I think expanded just pollutes the abstract model for nothing much.