FSet for Common Lisp

Design Rationale

The Common Lisp implementation of FSet posed a number of challenging design choices. I wanted each implementation to be, as much as possible, idiomatic and well integrated into its source language. This went double for the CL implementation, as it's the one I am using personally :) It occurred to me that CLOS could be used to define a single, global, yet extensible ordering/equality relation across all types of interest, and this is probably the most unusual feature of FSet/CL. Because the ordering/equality relation is global, it is not necessary to supply one when creating collections. (In fact, you can't even override the global relation for a particular collection. This is probably a bug, but a pretty minor one, because of the extensibility.)

Specifically, the generic function compare takes two arguments and must return one of these four symbols: :less, :greater, :equal, or :unequal. If you look in order.lisp you will see the orderings I have established for the built-in types I thought were important: primarily numbers, strings, and symbols, but also lists and vectors. Note that I have defined not only the orderings within each of these types, but also the orderings between each pair of types. This allows FSet/CL collections to be completely untyped; you can put any combination of types of objects into one, just as you can with Lisp lists and vectors. I think this is very much in the spirit of Lisp :)

(I defined compare on lists and vectors because I thought these methods might be handy, but be aware that if you make a set containing some of these, or a map using them as keys, you must not modify them while they're in the set or map. See the section on Tuples.)

I faced a quandary, though, trying to decide what to do about some of the existing CL operations. FSet is somewhat at odds with CL both terminologically and semantically, a conflict exacerbated by the fact that CL is somewhat inconsistent and occasionally just screwy. Yet it is clearly very important that FSet coexist well with code that uses lists and CL sequences, because those are ubiquitous in existing code, and even a dedicated FSet user such as myself can't entirely escape using them in new code either. My first thought, as I approached this problem, was to try to integrate FSet sequences into the CL generic sequence system by defining new versions of all the CL operations, that would do CLOS dispatches to the builtin implementations when invoked on lists and vectors, and to the FSet implementations when invoked on FSet types. This sounds like the right thing, but turned out to be problematic, as the CL operations are just too differently designed from the FSet ones, in two ways. First is the obvious: the CL types are mutable, while the FSet ones are not. The best example of why this is a problem concerns the expansion of (setf (elt x n) y). This form would mean something quite different in the two cases: applied to a Lisp sequence, it would modify the sequence in place; but the only reasonable meaning on an FSet sequence would be to update the place x to hold a new sequence. It would be bizarre to choose between these two semantics at runtime based on the class of the sequence! Clearly, true genericity across these two systems is a chimera.

The second divergence of design is that the FSet types are more semantically loaded: they know whether they're sets or sequences, for instance, and they know what equivalence relation is in force. In contrast, the Lisp philosophy is to use bare data structures as the representation, and choose the semantics at the time the operation is applied. I don't know that this divergence produces any outright ridiculous results like the above example, but it was an endless annoyance to me as I tried to write this generic interface, and I suspect it would also have been annoying to users.

And yet, it's silly for FSet not to support familiar operations such as find-if. So the approach of not providing any genericity at all was also clearly wrong (as I quickly discovered when I tried to use FSet that way).

Considering all the competing pressures on the design at some length, I have finally settled on an approach guided by these principles:

The nastiest aspect of this approach — both for me and, I expect, for FSet users — has to do with the Lisp names first, rest, last, and second through seventh. First, let's note the inconsistencies already in CL:

For FSet seqs — as I call the sequence type, to distinguish it from Lisp sequences — I felt it was important to treat first and last symmetrically, since the implementation is symmetric. Also, there is a whole structure of related operations operating on the ends of seqs; I have called these with-first, with-last, less-first, less-last, push-first, push-last, pop-first, and pop-last, the first four being the functional update operations, and the last four being the modify macros that invoke the first four, respectively. Because of the importance of these operations, I felt it would be acceptable to force some compromises on existing code. So FSet exports a generic function first which includes a compatibility method for lists; here the impact on existing code is limited to the cost of the CLOS dispatch. For last, however, I think that a compatibility method that returns the same thing as cl:last when invoked on a list would only exacerbate the existing screwiness; so FSet doesn't supply one, but instead exports lastcons in the hope that users will agree that it's a better name, and won't mind changing their code to use it instead.

FSet does not, however, export any version of rest, or of second through seventh. For rest, I think less-first is better as it participates in the symmetric structure shown above (though, it's a bit long). And second through seventh normally show up when short lists are being used as quick-and-dirty tuples; one could argue that it's poor form to use them at all, instead of using (defstruct (... (:type list)) ...). (From the perspective of FSet, with its symmetric functional sequences, it strikes me as a little unfortunate that the names that caught on as modern replacements for car and cdr were first and rest instead of, say, head and tail. first and rest, it seems to me, are naturally applied to any sequence, where head and tail more specifically suggest the structure of a list. As it turns out, lists still have a role to play even in heavily FSet-ified code (a point I'll return to below); this role is not just as another kind of sequence, I would argue, but specifically takes advantage of their head/tail structure. So FSet exports head and tail as synonyms for car and cdr.)

Another issue. I mentioned above that generic-function versions of find-if etc. are provided. They are, but they are not completely compatible when invoked on FSet types (they are completely compatible, of course, when invoked on Lisp sequences). The primary differences are:

Okay. Here is a complete list of builtin CL symbols for which FSet exports its own versions. I expect that normally users will want to shadowing-import these, though it is certainly possible to keep the Lisp versions instead, and refer to the FSet versions with an explicit package prefix. (Of course, you don't have to import any FSet symbols at all; you can refer to all of them with a package prefix. You may wish to define f: or fs: as an abbreviation for fset: in this case.) But if you do want to make substantial and idiomatic use of FSet, I recommend you shadowing-import the FSet versions of these, or at least most of them:

set
A no-brainer; import fset:set. The CL set is archaic and nearly useless in the modern style, and easily replaced with (setf (symbol-value ...) ...), which is stylistically preferable anyway. fset:set names both a type and a "constructor macro" for that type; more on this below.
map
This one is a little harder, because cl:map has perfectly legitimate uses, though it's rarer than mapcar. Personally, I never use it because I use my GMap macro instead, but others might miss it. (The similar FSet operation is called image.) Again, fset:map names both a type and a constructor macro.
union, intersection, and set-difference
FSet exports generic functions with these names; these have methods for lists that invoke the Lisp builtins. They do not accept mixed operands (one FSet collection and one list).
first and last
See the discussion above. These are now generic functions. first has a compatibility method for lists; I've also included one for CL sequences (once the price of the CLOS dispatch is being paid, there's no reason not to). last, as described above, similarly returns the last element of a seq, or list, or CL sequence.
find, find-if, find-if-not
count, count-if, count-if-not
position, position-if, position-if-not
remove, remove-if, remove-if-not
substitute, substitute-if, substitute-if-not
subseq, reverse, sort, stable-sort
search (not yet implemented), mismatch (not yet implemented)
See the discussion above. These are now generic functions with compatibility methods on CL sequences that simply forward the call to the original CL versions. They also have methods on seqs; find etc., count etc., and remove etc. have methods on sets, bags, and maps as well. When called on an FSet collection, the default for :test is #'equal?; the :test-not keyword is not supported; and the :start, :end, and :from-end keywords are supported only on seqs, not on sets, bags, or maps. On a map, only the domain elements are examined.
some, every, notany, notevery
These are now ordinary functions that work on FSet seqs as well as CL sequences.

Here are some Lisp sequence operations for which FSet does not export a symbol with the same name, with comments on each. I have omitted destructive operations, which FSet of course has no equivalent for.

elt
The FSet operation (which applies to maps as well as seqs) is lookup; see also @, below.
length
The FSet operation (which applies to all FSet collections, not just seqs) is size.
remove-duplicates, merge
Use FSet sets.

Seqs and Strings

Along with using heterogeneous trees to save space, the seq implementation tries to save even more space by using strings, when possible, for the short leaf vectors. This is possible when the contiguous sequence of up to 8 consecutive elements of the seq that happen to wind up in a particular leaf vector are all characters. In Lisp implementations where base-char is distinct from character, FSet will even use base-strings when possible. The upshot is:

Since seqs have log-time concatenation and subsequence operations, they can actually be very useful for string manipulation.

The @ Macro

Common Lisp is, of course, a "Lisp-2", meaning a two-namespace Lisp in which a given name can have both a function binding and a value binding. In a Lisp-1, like Scheme, it would be nice to set the FSet types up as objects that could be called as functions as well as having the various operations invoked on them (some object systems for other Lisps have supported this, such as that of T). Then, instead of (lookup m x) where m is a map, you could write simply (m x). But this doesn't work in a Lisp-2, since the car of a form is evaluated in the function namespace; you'd have to write something like (funcall m x), which isn't any more elegant.

What we can do, however, is give you a shorter name to type than funcall. Thus is born @. It is a macro that expands to code that tests whether its first argument is a Lisp function; if so, it funcalls it (with the supplied arguments); otherwise it calls lookup on it. So you can use it when writing higher-order code to make it accept a map as well as a unary Lisp function. (Also, a seq behaves as a map from index to element, and a set behaves as a map from value to boolean.)

Whether you want to use @ generally, in place of lookup, is a matter of taste. Both are setf-able; though doing (setf (@ m x) y) works only if m is an FSet collection. Remember, this doesn't modify the collection; it's equivalent to (setf m (with m x y)).

Bags

The design of the FSet bag interface is unfortunately not informed by much actual experience, on my part, with bags. I have tried to guess what operations are likely to be useful. One open question, in my mind, is how often one wants to treat a bag in a set-like way (e.g., in which iterating over the bag would return some members multiple times) versus in a map-like way (where iteration would return each member and its multiplicity as a pair). It's possible I've overemphasized the map-like style of use.

The binary bag operations (bag-sum etc.) will accept a set for one operand and convert as necessary.

I have been careful not to add fixnum declarations to those variables and structure slots holding multiplicities. This is out of the thought, probably mistaken, that these bag operations might have some use in cryptography or something where you might want arbitrarily large multiplicities. I may change my mind about this. Feedback invited.

Dynamic Tuples

Along with the more familiar sets, bags, sequences, and maps, FSet/CL includes a "dynamic tuple" type. This was something of an experiment, born of the observation that linear searches of short vectors are actually very fast on modern processors because of cache effects. I had seen this approach taken for storing sparse slots in an object system ("sparse" meaning that any given object may have values for only a few of the possible slots), and wondered whether I could put it to use for functional tuples. I also wanted to try my hand at dynamic reordering. I think the result is interesting.

Tuples are abstractly similar to maps, at least in an untyped language like Lisp, but have a very different usage profile. The big difference is that the key set is bounded, and indeed normally fixed at compile time. Tuple keys are thus analogous to slots of classes; they're not arbitrary values. In fact, there is a class tuple-key; keys must be instances of this class. So tuples are really more like structs than maps, except that (a) update is functional (of course) and (b) there's only one tuple type; any key can be added to any tuple. Despite that, the representation is sparse: any given tuple takes only space proportional to the number of pairs it actually contains.

Here is an example of a situation in which the use of tuples is appropriate — indeed, highly recommended. Suppose you have instances of a mutable class, and you want to create a map mapping those instances to some range type, but rather than the mapping being by identity, you want it to be by some part of the content of the instances. You might be inclined to define a compare method on this class that compares the instances by comparing an appropriate subset of their attributes. But there's a danger here. If any of the values of those attributes ever changes after the instance has been used as a map key, the instance will, in general, no longer compare the same way against the other instances in the map, thus blowing one of the key invariants of the map's internal structure and rendering it more or less useless.

What you should do instead is, for each instance used as a key in this map, make a tuple containing the values of the attributes in question, and then use the tuple as the map key. If you attach the tuple to the instance by a new attribute, then the object can always know what tuple it has been indexed by in the map; and if the values in that tuple differ, at some point, from those in the instance, then you know that the instance values have changed since the instance was last indexed in the map, and you have handy all the information handy to re-index the instance (removing the old mapping, constructing the new tuple, and then adding the new mapping).

The implementation gets its speed by arranging for lookup to be done by scanning consecutive memory locations (thus taking advantage of cache effects) rather than by ordered trees. Thus the performance characteristics of tuples are quite different from those of the other FSet classes; where the other FSet classes do lookups in log time but with a significant constant factor, tuple lookups use linear search but with a very small constant factor — making them faster than even a well-compiled assoc. This is partly because of better use of the cache, and partly because of a clever scheme for dynamic reordering that makes frequently used slots cheaper to access.

(Note to Lisp implementors: I think tuples, with some minor adjustments, might make a great data structure for deep-bound special variables — shallow binding being incompatible with modern SMP implementations, and ordinary alists being potentially quadratic in recursive algorithms.)