[elephant-devel] Rucksack and Elephant

Ian Eslick eslick at csail.mit.edu
Sun Jun 4 06:26:34 UTC 2006


I distracted myself this afternoon by writing a cached binary file and 
buffer library with serializer as a potential step towards a native 
backend for Elephant.  As I was contemplating some design decisions, I 
was curious how Arthur Lemmens made similar trade-offs in Rucksack, 
motivating me to give his code a good read.  That experience prompted 
the following comparison. 

(Rucksack is described in detail here: 
http://weitz.de/eclm2006/rucksack-eclm2006.txt)

At present Elephant is fully functional and has been tested and used 
extensively in several demanding applications.  Rucksack is not yet 
operational, but has a critical mass of code written for all 
functionality and has some architectural features worth keeping an eye 
on.  The most exciting feature, of course, is that Rucksack is written 
entirely in mostly portable Common Lisp!

Serialization:
  Both systems take a similar approach to binary serialization and 
should perform similarly.

Persistent object storage:
Rucksack and Elephant handle persistent objects very differently.  In 
Elephant, every slot has a serialized descriptor (oid:class:slotname) 
that is used as a key to store all slot values in one large BDB BTree. 
The object oid is stored in class instances and used, along with class 
and slot names to index into the on-disk BTree to retrieve or overwrite 
a value.
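
To make the slot-level layout concrete, here is a minimal sketch in Python (not Lisp, and not Elephant's actual API).  The names SlotStore, write_slot and read_slot are hypothetical; a plain dict stands in for the BDB BTree and pickle stands in for Elephant's serializer.

```python
import pickle

class SlotStore:
    """Toy model of slot-level storage: one record per (oid, class, slot)."""

    def __init__(self):
        # A dict stands in for the on-disk BTree (assumption for illustration).
        self._btree = {}

    def write_slot(self, oid, class_name, slot_name, value):
        # Only the slot being written is serialized; its record is overwritten.
        key = (oid, class_name, slot_name)
        self._btree[key] = pickle.dumps(value)

    def read_slot(self, oid, class_name, slot_name):
        return pickle.loads(self._btree[(oid, class_name, slot_name)])

store = SlotStore()
store.write_slot(1, "person", "name", "Ada")
store.write_slot(1, "person", "age", 30)
store.write_slot(1, "person", "age", 31)   # touches a single record
```

The point of the sketch is that a slot write touches exactly one record, regardless of how many other slots the object has.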

In Rucksack, object OIDs index a large vector which contains the current 
on-disk location of the serialized objects.  On slot-writes, a new 
instance of the object is written to disk.  On transaction commit,  the 
vector pointer is updated.  This requires Rucksack to commit to garbage 
collection in order to reclaim stored objects (something Elephant 
doesn't do as BDB handles transaction logging differently and does 
writes in place).  However, the Rucksack choice provides a convenient 
way to handle transaction logging and rollbacks without a separate 
logging mechanism.
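
The indirection described above can be modelled roughly as follows.  This is a Python sketch under my own assumptions, not Rucksack's code: ObjectTable, allocate, load and commit are hypothetical names, a bytearray stands in for the on-disk heap, and pickle stands in for the serializer.

```python
import pickle

class ObjectTable:
    """Toy model: OIDs index a table of heap locations; writes append new copies."""

    def __init__(self):
        self.heap = bytearray()   # append-only stand-in for the on-disk heap
        self.table = []           # table[oid] -> (offset, length) of current copy

    def _append(self, obj):
        data = pickle.dumps(obj)
        offset = len(self.heap)
        self.heap += data
        return (offset, len(data))

    def allocate(self, obj):
        self.table.append(self._append(obj))
        return len(self.table) - 1

    def load(self, oid):
        offset, length = self.table[oid]
        return pickle.loads(bytes(self.heap[offset:offset + length]))

    def commit(self, dirty):
        """dirty maps oid -> new state.  Write all copies, then swing the pointers."""
        new_locations = {oid: self._append(obj) for oid, obj in dirty.items()}
        for oid, location in new_locations.items():
            self.table[oid] = location   # the old copy is now garbage

table = ObjectTable()
oid = table.allocate({"name": "Ada", "age": 30})
old_location = table.table[oid]
table.commit({oid: {"name": "Ada", "age": 31}})
```

Because the old copy stays in the heap until something reclaims it, aborting before the pointer update leaves the committed state untouched, and the need for a garbage collector follows directly.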

This means that Rucksack has to serialize all dirty objects when it 
commits a transaction.  This involves more writing to disk and more 
total disk access than Elephant, which only writes changed slot values.  
Within a transaction Rucksack provides an in-memory object cache of 
dirty objects, and it maintains a cache of committed objects as well so 
that future transactions don't need to re-serialize objects.

MOP:
The metaobject protocol support for persistent objects is similar, 
although Rucksack's is simpler in part because it makes more commitment 
to object level storage instead of slot-level storage.  Both Elephant 
and Rucksack support schema evolution, the ability to redefine classes 
at runtime and have the persistent instances updated as in 
UPDATE-INSTANCE-FOR-REDEFINED-CLASS.  Rucksack saves prior schemas so 
old instances can be loaded and then updated.  Elephant effectively does 
the same by storing slot names so that the new schema can pick old 
values stored in the same name, then run the loaded instance through the 
update function.  There are some potential pitfalls here in Elephant and 
I was intending to fix them in a similar way to Rucksack as part of a 
serializer enhancement to avoid writing slot names all the time.
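
A rough sketch of the name-matching approach, in Python rather than Lisp.  The function and slot names are hypothetical; the hook's arguments loosely mirror the added-slots and discarded-slots arguments of UPDATE-INSTANCE-FOR-REDEFINED-CLASS, and plain dicts stand in for stored instances.

```python
def load_with_new_schema(stored_slots, new_slot_names, update_fn):
    """Match stored values to the new schema by slot name, then run an update hook."""
    # Slots present in both the stored record and the new schema carry over.
    instance = {name: stored_slots[name]
                for name in new_slot_names if name in stored_slots}
    # Slots that exist only in the new schema have no stored value yet.
    added = [name for name in new_slot_names if name not in stored_slots]
    # Slots that exist only in the stored record are handed to the hook.
    discarded = {name: value for name, value in stored_slots.items()
                 if name not in new_slot_names}
    update_fn(instance, added, discarded)
    return instance

def split_name(instance, added, discarded):
    # Hypothetical update hook: derive two new slots from a discarded one.
    if "first-name" in added and "name" in discarded:
        first, _, last = discarded["name"].partition(" ")
        instance["first-name"] = first
        instance["last-name"] = last

old_record = {"name": "Ada Lovelace", "age": 30}
person = load_with_new_schema(
    old_record, ["first-name", "last-name", "age"], split_name)
```

Note that this only works because the stored record carries its slot names, which is exactly the per-write cost mentioned above.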

Garbage collection:
Rucksack has a full incremental mark-and-sweep collector.  Elephant only 
has a poor-man's stop-and-copy via the repository migration interface 
(support for doing this automatically is not built in and it's 
expensive).  Enough said.
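
For readers unfamiliar with the technique, here is a small Python sketch of mark-and-sweep where the mark phase runs in bounded steps.  It is a generic illustration, not Rucksack's collector: the reference graph, the root set and the budget are made up, and it ignores the problem of objects being modified between steps, which a real incremental collector has to handle.

```python
def mark_step(refs, gray, marked, budget):
    """Mark up to `budget` objects; return True once marking is finished."""
    while gray and budget > 0:
        oid = gray.pop()
        if oid in marked:
            continue
        marked.add(oid)
        # Queue children that have not been marked yet.
        gray.extend(child for child in refs.get(oid, ()) if child not in marked)
        budget -= 1
    return not gray

def sweep(refs, marked):
    """Return the oids whose storage can be reclaimed."""
    return sorted(oid for oid in refs if oid not in marked)

# oid -> oids it references; 0 is the only root, 3 and 4 form an orphaned cycle.
refs = {0: [1, 2], 1: [2], 2: [], 3: [4], 4: [3]}
gray, marked = [0], set()
steps = 1
while not mark_step(refs, gray, marked, budget=2):
    steps += 1   # other work could be interleaved between steps
garbage = sweep(refs, marked)
```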

ACID:
  Rucksack has an elegant solution to ACID properties by copy-on-write 
for persistent objects so that each parallel transaction has its own set 
of live objects.  This avoids conflicts but also delays rollbacks.  When 
a transaction has to abort because of a conflict, it just throws away 
the live objects in memory and restarts.  This does mean that rollbacks 
are caused by object level write conflicts instead of slot conflicts.
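
The following Python sketch illustrates private copies with object-level conflict detection.  It is not Rucksack's implementation: how Rucksack actually detects a conflict is not covered here, so the per-object version counter is my assumption, and Store, Transaction and ConflictError are hypothetical names.

```python
import copy

class ConflictError(Exception):
    """Raised when another transaction committed the same object first."""

class Store:
    def __init__(self):
        self.objects = {}    # oid -> last committed state
        self.versions = {}   # oid -> number of commits to that object

class Transaction:
    def __init__(self, store):
        self.store = store
        self.private = {}    # oid -> this transaction's private copy
        self.seen = {}       # oid -> version observed when the copy was made
        self.dirty = set()   # oids written by this transaction

    def get(self, oid):
        if oid not in self.private:
            self.seen[oid] = self.store.versions.get(oid, 0)
            self.private[oid] = copy.deepcopy(self.store.objects.get(oid, {}))
        return self.private[oid]

    def set_slot(self, oid, slot, value):
        self.get(oid)[slot] = value
        self.dirty.add(oid)

    def _discard(self):
        # Rollback is just dropping the in-memory copies.
        self.private.clear()
        self.seen.clear()
        self.dirty.clear()

    def commit(self):
        # Conflicts are detected per object, not per slot.
        if any(self.store.versions.get(oid, 0) != self.seen[oid]
               for oid in self.dirty):
            self._discard()
            raise ConflictError("object changed by another transaction")
        for oid in self.dirty:
            self.store.objects[oid] = self.private[oid]
            self.store.versions[oid] = self.seen[oid] + 1
        self._discard()

store = Store()
t1, t2 = Transaction(store), Transaction(store)
t1.set_slot(7, "name", "a")
t2.set_slot(7, "age", 1)     # a different slot of the same object
t1.commit()
try:
    t2.commit()
    conflicted = False
except ConflictError:
    conflicted = True
```

In this model the second transaction is rejected even though the two writers touched different slots, which is the object-level behaviour described above.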

Summary:

Rucksack is an elegant approach to persisting objects in Common Lisp.  
Its interface and Elephant's are very similar but they take a number of 
different and incompatible approaches to handling persistent slots, 
transactions, locking, etc.  I don't foresee significant performance 
advantages on either side, but the serializer in Rucksack seems more 
efficient for standard objects at the cost of some robustness on class 
redefinition.  I imagine I will be surprised by real-world benchmarks 
later.  For example, I suspect that transaction performance will vary 
greatly based on workload.  Typical website models should work the same 
on either as there are far fewer possible transaction collisions. 

Unfortunately Rucksack isn't easily re-targeted as a native lisp backend 
for Elephant because of the greatly differing assumptions behind 
persistent objects.  There may be a bit of code and design ideas that 
can be lifted however - such as the heap and btree implementation.  
There are some smart ideas in the serializer and in schema evolution 
that I've considered already so it's nice to have a reference 
implementation to refer to.


Notable differences:

- Rucksack is a reasonably compact, easy-to-understand system written 
entirely in Common Lisp.  Elephant has complex dependencies between 
Lisp, C and the architectural commitments of BDB.  Elephant performs 
poorly on SQL today, so BDB is the high-performance backend.  BDB has 
license issues for even small-scale commercial deployment.

- Rucksack has full support for garbage collection; Elephant has minimal 
off-line support for storage reclamation.

- Elephant will allow multiple lisp processes to use the same persistent 
store concurrently; a Rucksack store is locked to a single lisp 
instance.  Elephant can be configured with BDB replication, allowing for 
larger-scale deployment.

- Elephant is much more mature and its disk storage is much more likely 
to be reliable, so it will be some time until Rucksack is sufficiently 
mature for prime time.

- Rucksack performs object-level collision detection; Elephant performs 
record-based collision detection in a paged storage system.  This has 
different implications for how classes should be designed (in Rucksack, 
slots holding large arrays, for instance, should be wrapped in their own 
persistent class so that writes to other slots do not result in 
multiple copies of that array).
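
To illustrate the large-array point in the last item, this Python sketch compares how many bytes get serialized when a small slot changes.  It assumes whole-object serialization on every write, as described for Rucksack; pickle stands in for the real serializer, and the slot names and the oid 42 are made up.

```python
import pickle

def bytes_written_on_update(obj):
    """Size of the copy written when any slot of `obj` changes (whole-object model)."""
    return len(pickle.dumps(obj))

big_array = list(range(10000))

# Array stored inline: every counter update rewrites the array too.
inline = {"counter": 1, "samples": big_array}

# Array wrapped in its own persistent object: the parent holds only a reference.
wrapped = {"counter": 1, "samples_oid": 42}

inline_cost = bytes_written_on_update(inline)
wrapped_cost = bytes_written_on_update(wrapped)
```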

This review has been somewhat rambling, but I hope it makes people look 
forward to playing with Rucksack, produces some good ideas for Elephant 
and emphasizes that Elephant is ready for real world (although probably 
non-critical) applications today.

Ian



