[Removed doc folder from the repository.
Vladimir Sedach **20090901235701
 Ignore-this: 523ece8f2cfdf0f6fbfbf1c49a37e93b
] {
hunk ./doc/background.html 1

Idle cores to the left of me, race conditions to the right

(or how I learned to stop worrying and embrace the deadlock)

Topics: concurrency, multicore processors, Common Lisp
Author: Marijn Haverbeke
Date: June 5th, 2008

I think that, over the past year, I've read some thirty articles that start with the solemn or anxious announcement that the future of programming will be multicore, and that one way or another, we will have to get used to it. I'll try not to write another one of those. In fact, it appears that multicore processors are also pretty good single-core processors, and most applications are, for the foreseeable future, comfortably single-threaded. And in a lot of cases, they should be. So is there any pressing reason to take notice of this new-fangled hardware?

    [I]t looks more or less like the hardware designers have run out of ideas, and that they're trying to pass the blame for the future demise of Moore's Law to the software writers by giving us machines that work faster only on a few key benchmarks!

That is from none other than Donald Knuth, in a recent interview. Nonetheless, I have (belatedly, I'll admit) started to get excited about incorporating the kind of 'multi-threading for the sake of multi-threading' that multiple cores encourage into my programs. Why? This might not be a very convincing reason for the serious, result-oriented programmer, but parallel programming, it turns out, is very amusing. Sure, there are deadlock death-traps, race conditions, and a general dangerous sense of non-determinism. But there is a logic to all of it, and a working parallel program can be a thing of beauty. That kind of beauty, and the complexity that tends to go with it, is what got me into this business in the first place. (It wasn't my love for hunching over keyboards or a deep dislike of sunlight, in any case.)

This infatuation started ― predictably enough ― with Erlang. Erlang is kind of hip, makes you think about concurrency in a whole new way, and is entirely dissatisfactory as a programming language. I won't go too deeply into that last point, since taking cheap shots at other people's work tends to hurt one's credibility, but the unpolished and limited feel of the language offended my delicate linguistic sensibilities, and prevented me from investing too deeply in it, with the result that I am still (also somewhat grudgingly, perhaps) doing most of my work in Common Lisp.

Common Lisp has no integrated, lightweight, multicore-friendly threads. In fact, it only recently started having widespread support for OS-level threads at all. This support comes, invariably, in the form of shared memory, locks, and semaphores (bordeaux-threads provides a pleasant portable wrapper). While these allow just about any kind of concurrent program to be written, they are fairly tedious and error-prone to work with. CL-MUPROC builds a message-passing system on top of them, which allows a more pleasant programming style. Every CL-MUPROC process is an OS thread though, so you won't want to create thousands of them.
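To make the lock-based style concrete, here is a minimal sketch using the bordeaux-threads API (under its usual bt: nickname); the shared-counter example itself is just an illustration, not from any of the libraries above:

(defvar *counter* 0)
(defvar *counter-lock* (bt:make-lock "counter-lock"))

;; Increment the shared counter N times, taking the lock around each
;; increment so that concurrent updates cannot be lost.
(defun add-many (n)
  (dotimes (i n)
    (bt:with-lock-held (*counter-lock*)
      (incf *counter*))))

;; Run two incrementing threads and wait for both to finish.
(let ((threads (list (bt:make-thread (lambda () (add-many 100000)))
                     (bt:make-thread (lambda () (add-many 100000))))))
  (mapc #'bt:join-thread threads)
  *counter*) ; => 200000, thanks to the lock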

Another approach to getting away from mutexes, one that has been generating a lot of noise recently, is software transactional memory. If that doesn't mean anything to you, I'd recommend watching this video of Simon Peyton Jones explaining the idea. I was surprised to find that there apparently exists a CL-STM (aren't we CLers creative when it comes to naming projects). It seems to have been inactive since early 2007, and as I understand it, STM is rather hard to get right, so I'm not sure I'd risk building something serious on it... but it appears to work. The Clojure people seem to be making good use of STM in any case.

On a related note, SBCL exports a symbol sb-ext:compare-and-swap, which can be used to do small-scale transactional tricks, and is often an order of magnitude faster than the equivalent locking solution. Here is a very simple concurrency-safe stack. The macro is a convenient wrapper that helps use compare-and-swap in a 'transactional' way.

(defmacro concurrent-update (place var &body body)
  `(flet ((action (,var) ,@body))
     (let ((prev ,place))
       (loop :until (eq (sb-ext:compare-and-swap ,place prev (action prev)) prev)
             :do (setf prev ,place))
       prev)))

(defun make-cstack (&rest elements)
  (cons :stack elements))
(defun push-cstack (element stack)
  (concurrent-update (cdr stack) elts
    (cons element elts))
  (values))
(defun pop-cstack (stack)
  (car (concurrent-update (cdr stack) elts
         (cdr elts))))
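Using the stack then looks like this (the values are just an illustration):

(defvar *stack* (make-cstack))
(push-cstack 1 *stack*)
(push-cstack 2 *stack*)
(pop-cstack *stack*) ; => 2, even with other threads pushing and popping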

Stacks are represented as cons cells, since compare-and-swap can only be used on a limited set of 'places' (cars, cdrs, svrefs, symbol-values), and not on things like instance or struct slots. Writing a queue is only slightly more complicated. Making popping threads block when there are no elements available, on the other hand, is decidedly tricky. (I think I have a working implementation... it is much faster than the locking variant, but it is so complicated that I'm not convinced it is correct. If you can recommend any good books or papers on low-level concurrency techniques, please drop me an e-mail.)

In the book How to Write Parallel Programs, which is old (1992) but still relevant, the authors distinguish three models of concurrency: message passing (as in Erlang), shared or distributed data structures (as in Java), and 'live' data structures, where computations turn into values after they finish running. These are all fundamentally equivalent, in that a program using one of them can always be transformed to use another, but some programs are much more natural when expressed in the right model. STM, which had not been invented at the time the book was written, would probably qualify as a special case of shared data structures.

One of these models, live data structures, has never been very popular. This might be related to the fact that they offer none of the control-flow advantages that the other models have ― you spawn a process, and then later you can read the value it produced, but this only buys you something when there is a direct advantage to computing values in parallel. On single-core machines, there isn't, but on multicores, this might be a very pleasant way to speed up some CPU-bound programs. On first seeing Haskell's par operator, which implements something like this, I was duly impressed. You just pass two computations to a little operator and, under the right circumstances, you get your results twice as fast.

Now, without further ado, I'd like to present my new library: PCall (for parallel call). It implements a thread pool and a few simple operators to get behaviour not unlike that of Haskell's par. As a silly example, it can get two seconds' worth of sleep in only a single second!

(time (let ((a (pexec (sleep 1)))
            (b (pexec (sleep 1))))
        (join a)
        (join b)))

The code behind this is rather humble, but it seems to work well. See the project page for more details and examples.

This kind of 'local parallelism' makes it relatively easy to verify that the tasks are using their data in a safe way ― you typically just make sure that no tasks write to the same location, or read anything another task might write.

One issue that keeps coming up with this style of parallelism, one that the book I mentioned above also discusses, is finding the right granularity when splitting up tasks. Wrapping a computation up in a function, putting it into a queue, having some thread find and execute it, and then reading out the results... that is obviously more work than just executing the computation directly. Thus, if the computation is small enough, you are making your program slower by parallelising it.

The solution suggested by the book is to provide a 'granularity knob' ― some setting that controls the size of the computation chunks that get delegated to other processes. When mapping some function over a list, you could for example provide a way to control the number of elements each task takes care of, and vary that based on the number of available processors and the amount of computation the given function requires. In some cases, the granularity adjustment could be done automatically by profiling the running code. I guess that is usually more trouble than it is worth, though.
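As a sketch of what such a knob could look like with the operators shown above (chunked-pmap and its chunk-size parameter are made up for this example, not part of any library):

;; Map F over LIST, handing each task a chunk of CHUNK-SIZE elements
;; instead of a single one, so the per-task overhead is amortised
;; over a larger piece of work. CHUNK-SIZE is the granularity knob.
(defun chunked-pmap (f list &key (chunk-size 64))
  (let ((tasks (loop :while list
                     :collect (let ((chunk (loop :repeat chunk-size
                                                 :while list
                                                 :collect (pop list))))
                                (pexec (mapcar f chunk))))))
    (loop :for task :in tasks :append (join task))))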


There are also situations where different environments call for completely different approaches. Having a central lock on something works well when there are ten threads occasionally accessing it, but becomes a bottleneck when there are a thousand. Optimistic transactions that fail when another thread interferes with their data have the same problem: they fall apart when there are too many processes. On the other hand, complex schemes that minimise the amount of conflict between threads also carry an overhead, and are often counterproductive when there are only a handful of threads. This is a less pleasant side of concurrent programming: there often is no right solution. In sequential programs, you can weigh program complexity against efficiency, and often feel you've hit some sweet spot. In concurrent programs, the sweet spot on my dual-core system might be completely stupid on both an old single-core system and a flashy sixteen-core server.

Despite this, and the claims by various people that threading is 'just too hard', I rather suspect we'll be able to work it out. As in other complicated fields, it is mostly a matter of finding better abstractions. I hope PCall can provide an abstraction suitable for some problems (though, at this point, I'm not even a hundred percent convinced there are no race conditions left in the library... do help me test it).

rmfile ./doc/background.html
hunk ./doc/index.html 1

PCall


PCall, or parallel call, is a Common Lisp library intended to simplify 'result-oriented' parallelism. It uses a thread pool to concurrently run small computations without spawning a new thread for each one. This makes it possible to exploit multiple cores without much extra fuss.

Contents

  1. News
  2. License
  3. Download and installation
  4. Support and mailing lists
  5. Quickstart
  6. Reference

News


26-01-2009: Version 0.2: Since there is suddenly some increased attention for the library, I'm releasing the current code as 0.2. Still beta-ish, but it seems to work. This version adds with-local-thread-pool.

06-06-2008: Version 0.1: The first release. Should be considered beta. Any testing and feedback is appreciated.

05-06-2008: Added a background article with some related thoughts.

License


PCall is released under a zlib-like license, which approximately means you can use the code in whatever way you like, except for passing it off as your own or releasing a modified version without indicating that it is not the original. See the LICENSE file in the distribution.

Download and installation


PCall depends on bordeaux-threads, and on fiveam for the test suite.

The latest release of PCall can be downloaded from http://marijn.haverbeke.nl/pcall/pcall.tgz, or installed with asdf-install.

A darcs repository with the most recent changes can be checked out with:

> darcs get http://marijn.haverbeke.nl/pcall/pcall

Or you can look at the repository through darcsweb.

Support and mailing lists


There is a Google group for any discussion relating to this project. Also feel free to drop me an e-mail directly: Marijn Haverbeke.

Quickstart


PCall is a rather simple library. There are only three basic concepts that you have to come to grips with:

  - Tasks: computations that have been handed to the thread pool, created with pcall or pexec.
  - Joining: waiting for a task to finish and retrieving its result with join.
  - The thread pool: the set of worker threads that pick up and execute tasks.

Imagine that we have this wonderful algorithm for computing (once again) Fibonacci numbers:

(defun fib (n)
  (if (> n 2)
      (+ (fib (- n 1)) (fib (- n 2)))
      1))

(time (fib 40))

Depending on your machine, this might take some 2 to 10 seconds. We don't have that kind of patience. You can see that this algorithm is entirely optimal, so our only option, it seems, is to use more hardware ― or make better use of the hardware we have:

(time (let ((f39 (pexec (fib 39)))
            (f38 (pexec (fib 38))))
        (+ (join f39) (join f38))))

On my 2-core machine, that speeds things up by about a third ― which makes sense, since computing fib(39) is about twice as much work as computing fib(38). A nicer way to write the same thing is:

(time (plet ((f39 (fib 39))
             (f38 (fib 38)))
        (+ f39 f38)))

plet takes care of the wrapping and joining in cases like this. Why do we need the let anyway? You could try this:

(time (+ (join (pexec (fib 39))) (join (pexec (fib 38)))))

... but that won't buy you anything. Both tasks have to be created before you join the first one; otherwise the second task will not exist while the first one runs, and thus won't be able to run concurrently with it.

You might be tempted to write something like this:

(defun pfib (n)
  (if (> n 2)
      (plet ((a (pfib (- n 1)))
             (b (pfib (- n 2))))
        (+ a b))
      1))

... but don't. There is some overhead associated with creating and executing tasks, and for a function like naive Fibonacci, which recurses a zillion times even for small inputs, this will radically slow your algorithm down. A parallel mapping function, as shown below, works great for mapping a relatively heavy function over a list of limited length, but is no use for mapping 1+ over a million elements.

(defun pmapcar (f list)
  (let ((result (mapcar (lambda (n) (pexec (funcall f n))) list)))
    (map-into result #'join result)))

(defvar *numbers* (loop :for i :from 0 :below 30 :collect i))
(time (mapcar #'fib *numbers*))
(time (pmapcar #'fib *numbers*))

Note that joining tasks is not required. When you do not care about the result of a computation, you can just spawn the task and leave it at that.
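For example, assuming some side-effecting save-state function (a hypothetical stand-in for any background job):

;; Fire-and-forget: spawn the task for its side effect, never join it.
(pexec (save-state))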


As a final note, PCall can also be used when a program is not CPU-bound, but needs to do some tasks that are hampered by other bottlenecks (network latency, disk speed). If they can be executed in parallel, you can have the thread pool run them. In the following example, the second version runs three times faster on my machine:

(defvar *urls* '("http://marijn.haverbeke.nl/pcall" "http://common-lisp.net"
                 "http://eloquentjavascript.net" "http://xkcd.com"))

(time (mapc 'drakma:http-request *urls*))
(time (mapc 'join (mapcar (lambda (url) (pexec (drakma:http-request url))) *urls*)))

In some applications, doing multiple database queries at the same time could really help. You might need to increase the size of the thread pool in such a situation, since some threads will be sitting idle, waiting on a socket.
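For example, using thread-pool-size from the reference below (ten workers is an arbitrary illustration):

;; Allow ten workers, since most will be blocked waiting on sockets.
(setf (thread-pool-size) 10)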


Reference


function pcall (thunk) → task

Create a task that will call the given argumentless function.

macro pexec (&body body) → task

A shorthand for (pcall (lambda () ...)).

macro plet ((bindings) &body body)

Follows the behaviour of let, but wraps every bound value into a pexec form, and automatically adds join calls around uses of the bound variables.

function join (task) → result

Waits for the given task to finish, and then returns any values the task produced. If executing the task raised an uncaught error, this error will be raised when joining the task. (Note that this re-raising might cause problems with condition handlers, which might not be active in the worker threads. If you are using handlers, take extra care, and look at set-worker-environment.) When, at the moment join is called, the task has not yet been assigned to any thread, the joining thread will execute the task itself. (Note that this makes the dynamic environment in which the task runs unpredictable.) A task may be joined multiple times. Subsequent joins will return the values again, without re-executing the task.
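A small sketch of the multiple-values and repeated-join behaviour described above (the floor example is just an illustration):

(let ((task (pexec (floor 7 2))))
  (join task)  ; => 3, 1
  (join task)) ; => 3, 1 again, without re-executing the task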


function select-one (&rest tasks)

Waits until at least one of the given tasks is finished, and then returns that task.
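For example, to use whichever of two downloads finishes first (reusing drakma from the quickstart):

(let ((a (pexec (drakma:http-request "http://xkcd.com")))
      (b (pexec (drakma:http-request "http://common-lisp.net"))))
  (join (select-one a b)))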


function done-p (task) → boolean

Tests whether a task has been executed.


function thread-pool-size ()

Returns the current size of the thread pool. Also supports setf to change this size. The default value is 3.

function set-worker-environment (wrapper)

This can be used to make dynamic variables or bound handlers available in the worker threads. wrapper should be either nil, for no wrapping, or a function that, given a function argument, calls its argument in the new environment. Works best with local thread pools.
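A sketch of such a wrapper, rebinding a made-up special variable *log-stream* for every task the workers run:

(defvar *log-stream* nil)
(set-worker-environment
 (lambda (run)
   ;; RUN is the function supplied by the thread pool; call it inside
   ;; the dynamic environment the tasks should see.
   (let ((*log-stream* *standard-output*))
     (funcall run))))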


function finish-tasks ()

Takes the current threads out of the pool, waits for the task queue to empty, and then for the threads to finish executing any tasks they might be busy with. This is intended to be called when shutting down ― killing the threads might cause some tasks to be aborted, which could result in data loss. If you join every task you create, this should not be necessary. Note that if you call this function while other threads are still creating tasks, it might never return.

macro with-local-thread-pool (&key size on-unwind worker-environment)

Run body with a fresh thread pool. If on-unwind is :wait (the default), the form will wait for all local tasks to finish before returning. If it is :leave, it will return while threads are still working. Given :stop or :destroy, the threads will be stopped at the end of the body. With :stop, they will first finish their current task (if any); with :destroy, they will be brutally destroyed and might leak resources, leave things in an inconsistent state, etcetera. worker-environment can be used to give the workers in the local pool a specific dynamic environment.
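A usage sketch, reusing the *urls* list from the quickstart and assuming the options are given as a list before the body:

;; Fetch all URLs in a fresh five-worker pool, waiting (the default
;; :wait behaviour) for every task to finish before returning.
(with-local-thread-pool (:size 5)
  (mapc (lambda (url) (pexec (drakma:http-request url))) *urls*))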

rmfile ./doc/index.html
hunk ./doc/style.css 1
body {
  margin: 0;
  font-family: tahoma, arial, sans-serif;
  margin: 60px 120px;
  color: black;
  max-width: 50em;
}

h1 {
  font-size: 250%;
  border-bottom: 3px solid #C42;
}

h2 {
  font-size: 140%;
  border-bottom: 1px solid #C42;
}

h3 {
  font-size: 110%;
}

code {
  font-size: 1.2em;
}

p.news {
  text-indent: -3em;
  padding-left: 3em;
}

pre.code {
  margin: 0 16px;
  padding: 7px;
  border: 1px solid #DA8;
}

p.def {
  margin-top: 1.5em;
  font-family: courier;
}

p.def span {
  color: #555555;
  font-weight: bold;
  font-family: tahoma, arial, sans-serif;
  font-size: .8em;
}

.desc {
  margin-left: 1em;
}

thead {
  font-weight: bold;
}

table {
  border-collapse: collapse;
}

tr + tr {
  border-top: 1px solid #88BB99;
}

thead tr {
  border-bottom: 2px solid #88BB99;
}

td + td, th + th {
  border-left: 2px solid #88BB99;
}

th {
  text-align: left;
  padding: 2px 5px;
}

td {
  padding: 2px 5px;
  vertical-align: top;
}

a:link {
  color: #3333AA;
  text-decoration: none;
}

a:visited {
  color: #773377;
  text-decoration: none;
}

a:hover {
  text-decoration: underline;
}

ul.symbol-index {
  font-family: monospace;
  font-size: 1.2em;
}

blockquote {
  font-style: italic;
}
rmfile ./doc/style.css
rmdir ./doc
}