The FSet Project

This page is intended for anyone interested in extending FSet.

Design rationale

Equivalent-but-unequal elements
None of the other tree-based implementations of collections that I am aware of support the use of elements that are equivalent, according to the ordering relation, but not equal. That is, they all treat them as equal, so for instance, a set could only contain one of each equivalence class. It surprises me that this is generally considered acceptable. I suppose it comes down to the question of how often it proves difficult to write a comparison function for some type that is as fine-grained as equality, so that every pair of unequal values is ordered one way or the other. (One case in which it's slightly nonobvious involves the different kinds of numbers in Common Lisp. The integer 1 and the floating point number 1.0 represent mathematically equal values, yet are distinct. The natural treatment is to consider them equivalent, though I guess nothing goes wrong if we arbitrarily consider the integer to be less than the float. A better example is when we're arranging for sets of objects to have a desired canonical order, but our true equivalence relation is still object identity — we occasionally have duplicate objects which we want to treat as distinct.)

I don't know how often that happens, but it seemed to me that it might happen occasionally, and so, because my goal was to make FSet as broadly useful as possible, I decided to go to the effort to support equivalent-but-unequal elements. In one case this was critical: some of the FSet/Java classes take their ordering from the hash values, in case the element class doesn't implement Comparable at all; and of course there's no guarantee that hash values won't collide. In the remaining cases, it's not clear how important it is. Well, time will tell.

Heterogeneous trees
The usual binary tree structures are quite space-inefficient. Instead of using one word per element as a simple vector would, they use one node per element; with an object header and allocation overhead, a node could easily run to 6 or 8 words or more. Again, the goal of FSet is to be as broadly useful as possible; that includes situations where a significant fraction of one's process data could be composed of collections. (At the very least, I didn't want users to have space concerns in the backs of their minds.) To address this, I used what I call heterogeneous trees. I take what would have been the bottom two to three levels of the tree and place their elements in leaf vectors, whose size is bounded (normally <= 8). This reduces the space overhead of the tree by roughly a factor of four. It does complicate the algorithms somewhat, and there is probably a small time cost for that (as of this writing, I haven't attempted to measure it), but overall it probably speeds things up — after all, allocating 1/4 the space means writing 1/4 the locations, other things being equal, which of course they aren't :) An insertion, for instance, frequently requires copying the contents of a leaf vector to a new vector with the new element included; that might sound expensive, but since the vector length is limited to 8, it's only about as many write operations as it takes to initialize one tree node.
Weird mix of lowercase and mixed case
Lisp is of course traditionally case-insensitive, and that's the only form in which I've ever used it. Over the years I've gotten into the habit of using mixed case for my own code, figuring that Lisp's case insensitivity keeps my preference in this regard from being imposed on my fellow programmers. However, it is getting more and more popular to use Lisp in a nonstandard mode in which the reader is case-sensitive, and the predefined Common Lisp symbols are lowercase; this mode simplifies foreign-function interfacing to case-sensitive languages. Thinking that I would like FSet to be usable in this mode, again without imposing my mixed-case convention on those using it, I decided to lowercase all exported FSet symbols. However, I have not yet been able to bring myself to lowercase the entire codebase; so there are many internal symbols still written in mixed case. If this becomes a major source of annoyance for FSet developers other than myself (if there ever are any :), I can probably be persuaded to abandon the mixed case entirely. Let me know if it drives you nuts.
Abstraction level and duplication of code
The internal tree manipulation routines have a lot of code that is -- well, not duplicated outright, but very similar. This is particularly evident in FSet/Java, with its three implementations each of sets and maps. Couldn't a lot of this commonality have been abstracted somehow? Probably, but I didn't bother, because (1) it's not obvious how to do this abstraction without significant performance costs, particularly since I wanted to support implementations with unsophisticated compilers; (2) this is code that, once written and debugged, is unlikely to have to be touched again; and (3) it wasn't really that bad doing it this way.

That said, it should be possible to do a pretty good job of this abstraction in Lisp using macros, and it might be interesting to see the result. Anyone who feels inspired to tackle this is welcome to do so.

The really right solution, seems to me, would be to code the algorithms in some more abstract formalism that, when combined with information about the implementation strategy, could be translated into the target language. This would be a very fun project but would have taken me quite far afield — and I really wanted to get this thing working so I could use it.

Development priorities

These fall into several categories.

New types

I'm sure we can think of more...

FSet/Common Lisp

FSet/Java

The last work I've done on FSet/Java was a few years ago, before the release of Java 1.5. I see that since then, some new collection implementations have been added. Of these, the most interesting is ConcurrentSkipListSet. By providing thread-safe update without locking, this class eliminates one of the motivations for using FSet sets. Fair enough. I still think the stylistic benefits of functional collections make them superior in most cases. It will be interesting to see whether this argument gains any traction.

(As of November 2008, it appears unlikely that FSet/Java will ever see the light of day, but let me know if you're interested in it.)

New implementations

I think my priorities for new implementations would be in roughly this order: