The FSet Project

This page is intended for anyone interested in extending FSet.

Design rationale

Equivalent-but-unequal elements

None of the other tree-based implementations of collections that I am aware of support the use of elements that are equivalent, according to the ordering relation, but not equal. That is, they all treat them as equal, so for instance, a set could only contain one of each equivalence class. It surprises me that this is generally considered acceptable. I suppose it comes down to the question of how often it proves difficult to write a comparison function for some type that is as fine-grained as equality, so that every pair of unequal values is ordered one way or the other. (One case in which it's slightly nonobvious involves the different kinds of numbers in Common Lisp. The integer 1 and the floating point number 1.0 represent mathematically equal values, yet are distinct. The natural treatment is to consider them equivalent, though I guess nothing goes wrong if we arbitrarily consider the integer to be less than the float. A better example is when we're arranging for sets of objects to have a desired canonical order, but our true equivalence relation is still object identity — we occasionally have duplicate objects which we want to treat as distinct.)

I don't know how often that happens, but it seemed to me that it might happen occasionally, and so, because my goal was to make FSet as broadly useful as possible, I decided to go to the effort to support equivalent-but-unequal elements. In one case this was critical: some of the FSet/Java classes take their ordering from the hash values, in case the element class doesn't implement Comparable at all; and of course there's no guarantee that hash values won't collide. In the remaining cases, it's not clear how important it is. Well, time will tell.

Heterogeneous trees

The usual binary tree structures are quite space-inefficient. Instead of using one word per element as a simple vector would, they use one node per element; with an object header and allocation overhead, a node could easily run to 6 or 8 words or more. Again, the goal of FSet is to be as broadly useful as possible; that includes situations where a significant fraction of one's process data could be composed of collections. (At the very least, I didn't want users to have space concerns in the backs of their minds.) To address this, I used what I call heterogeneous trees. I take what would have been the bottom two to three levels of the tree and place their elements in leaf vectors, whose size is bounded (normally <= 8). This reduces the space overhead of the tree by roughly a factor of four. It does complicate the algorithms somewhat, and there is probably a small time cost for that (as of this writing, I haven't attempted to measure it), but overall it probably speeds things up — after all, allocating 1/4 the space means writing 1/4 the locations, other things being equal, which of course they aren't :) An insertion, for instance, frequently requires copying the contents of a leaf vector to a new vector with the new element included; that might sound expensive, but since the vector length is limited to 8, it's only about as many write operations as it takes to initialize one tree node.

Weird mix of lowercase and mixed case

Lisp is of course traditionally case-insensitive, and that's the only form in which I've ever used it. Over the years I've gotten into the habit of using mixed case for my own code, figuring that Lisp's case insensitivity keeps my preference in this regard from being imposed on my fellow programmers. However, it is getting more and more popular to use Lisp in a nonstandard mode in which the reader is case-sensitive, and the predefined Common Lisp symbols are lowercase; this mode simplifies foreign-function interfacing to case-sensitive languages. Thinking that I would like FSet to be usable in this mode, again without imposing my mixed-case convention on those using it, I decided to lowercase all exported FSet symbols. However, I have not yet been able to bring myself to lowercase the entire codebase; so there are many internal symbols still written in mixed case. If this becomes a major source of annoyance for FSet developers other than myself (if there ever are any :), I can probably be persuaded to abandon the mixed case entirely. Let me know if it drives you nuts.

Abstraction level and duplication of code

The internal tree manipulation routines have a lot of code that is -- well, not duplicated outright, but very similar. This is particularly evident in FSet/Java, with its three implementations each of sets and maps. Couldn't a lot of this commonality have been abstracted somehow? Probably, but I didn't bother, because (1) it's not obvious how to do this abstraction without significant performance costs, particularly since I wanted to support implementations with unsophisticated compilers; (2) this is code that, once written and debugged, is unlikely to have to be touched again; and (3) it wasn't really that bad doing it this way.

That said, it should be possible to do a pretty good job of this abstraction in Lisp using macros, and it might be interesting to see the result. Anyone who feels inspired to tackle this is welcome to do so.

The really right solution, seems to me, would be to code the algorithms in some more abstract formalism that, when combined with information about the implementation strategy, could be translated into the target language. This would be a very fun project but would have taken me quite far afield — and I really wanted to get this thing working so I could use it.

Development priorities

These fall into several categories.

New types

Binary relations. These could be implemented as pairs of maps. I consider this a pretty high priority, actually. (A first cut at this has been released with 1.2.)
General relations. These I would be inclined to implement as sets of tuples (as vectors), with the index for each position constructed lazily when first needed, and then maintained incrementally.
Graphs. Making a good functional graph implementation is a deliciously hard problem, one which I'm not aware anyone has solved. One's first thought, by analogy with the usual way of implementing mutable graphs, might be to make each node a map with one entry for each out-edge of the node (I'm assuming here labeled, directed graphs), mapping the label of that edge to its target node. But this doesn't begin to work. For starters, there's no way to create a circularity; and if some magic way were provided, the graph would then appear infinite (which would turn into infinite recursion in the equality comparison between nodes). And even if there were a way around that, there's still the problem that any update to the graph would require copying the whole thing.
No, a node is not just its set of out-edges; it has an identity. An alternative design, one suggested by the standard mathematical formulation, would be to use any set of distinct point objects, such as the natural numbers, to represent the nodes; then the graph can be represented as a map from nodes to (maps from edge labels to target nodes). This works pretty well, but still has one problem: as new versions of the graph are constructed, some nodes may become unreachable from the root set of interest; yet mappings for those nodes will, by default, remain indefinitely. Consider that when one is using mutable graphs made of mutable nodes in the usual way, the garbage collector takes care of unreachable nodes; now, because we are no longer representing the graph as language-level objects connected by pointers that the GC understands, we no longer get the benefit of GC for our graph. So, we have to GC it ourselves, probably by occasionally copying it starting from a root set (which we will have to compute explicitly).

While GCing our graphs ourselves could work well enough, there is possibly a better solution involving weak data structures. (I refuse to use the misnomer "weak pointer" despite its prevalence — weakness is not a property of the pointer, but of the location in which the pointer is stored.) If we could use weak functional maps to represent the graph, and use language-level objects to represent the nodes, then the built-in GC would work on our graphs.

But making a usable implementation of a weak functional map seems to be highly nontrivial. First, the primitives provided by languages for constructing weak data structures vary, and in most cases appear to be inadequate to our task. This goes doubly for Common Lisp, where the standard does not provide for them at all; and though most major implementations that do have some support for them, the nature of that support varies. (For a good overview of what's possible, see this paper by Bruno Haible.) Secondly, tree data structures don't like to have their keys disappear; those keys are needed to navigate the tree! There may be a solution to this problem using lazy cleanup of subtrees containing deleted keys, but it will definitely complicate FSet's algorithms.
Sequences with markers

I'm sure we can think of more...

FSet/Common Lisp

I've done some performance tuning, mostly consisting of adding speed-3 safety-0 declarations on routines that seem to deserve them, but lots more could be done along these lines. For instance, in the CMUCL family, it might help to use ext:freeze-type proclamations, at least on the structs declared in wb-trees.lisp. Also, a few spots may deserve dynamic-extent declarations.

FSet/Java

The last work I've done on FSet/Java was a few years ago, before the release of Java 1.5. I see that since then, some new collection implementations have been added. Of these, the most interesting is ConcurrentSkipListSet. By providing thread-safe update without locking, this class eliminates one of the motivations for using FSet sets. Fair enough. I still think the stylistic benefits of functional collections make them superior in most cases. It will be interesting to see whether this argument gains any traction.

Genericization. Obviously, FSet needs to be updated to make use of the generics facility. This is a screamingly high priority. Subtasks:
- Add type-parameter declarations to the classes, change all element-valued parameter types to the type parameter type, and change all collection-valued parameter types to extends wildcard types (? I think).
- Check whether there are any new methods of umbrella classes like Collection that need to be implemented (I think there are).
- Implement the new Java 1.6 NavigableSet etc. interfaces.
It looks like we'll have to maintain separate versions for Java 1.4, 1.5, and 1.6. Yuck.
Bags. I didn't bother to implement them in Java, partly because I'm not sure how useful they'll be; I wanted to get some experience with them in Lisp first. But anyone who needs them should let me know, or feel free to implement them yourself :)
Tuples? Again, let's get some experience with these in Lisp first.

(As of November 2008, it appears unlikely that FSet/Java will ever see the light of day, but let me know if you're interested in it.)

New implementations

I think my priorities for new implementations would be in roughly this order:

Objective Caml (aka OCaml). This member of the ML family has gotten some traction in the academic computer science community, and its use seems to be growing, particularly for program manipulation tools. (I haven't used it.) It has functional collections already, but FSet would add more features.
Python is a tough one. On the one hand, I will venture to guess that the language is the second most widely used semi-functional language, after Java, and its popularity is growing. An FSet implementation for Python would bring the benefits of functional collections to a large user community. On the other hand (and I say this not having used Python much), it seems Python has headed off in another design direction — it treats even lists as mutable objects. (See, e.g., note 2 on this page.) A quick glance at the Python library documentation shows that the language does already have a set datatype, but lacking FSet's generality. Interestingly, the Python designers do understand the undesirability of placing mutable objects in sets or using them as map keys, to the point that the language disallows such actions; but they didn't take what seems to me the obvious next step of declaring that all collections should be immutable. Also problematic is that Python sets are implemented with hash tables, which (although fast at lookup and insertion) don't efficiently support all the FSet operations, notably comparison.
The upshot of all this is that while I would love to see an FSet implementation for Python — which is probably the second most commercially important semi-functional language, after Java — it sounds like persuading the Python powers that be to endorse FSet (e.g., by adding it to the standard library) is going to be very hard.
Ruby, possibly. I don't know much about it.
Scala. This very interesting, very modern language is implemented on top of the JVM. It does already have some functional collections (for example TreeMap), but they are of relatively limited functionality. Anyway, FSet is definitely in the spirit of the language, and I would bet the maintainers would be delighted to add an FSet implementation to the standard library, if someone were to write one.
Nice. Another language, like Scala, implemented on top of the JVM.
Some version of Prolog, possibly Ciao. This is mostly to satisfy my curiosity about what such an implementation would look like.
Dylan. (Okay, I'm an incurable romantic.)