[add background document, touch up docs
marijnh@gmail.com**20080605202412] {
addfile ./doc/background.html
hunk ./doc/background.html 1
Topics: concurrency, multicore processors, Common Lisp
Author: Marijn Haverbeke
Date: June 5th 2008
I think that, over the past year, I've read some thirty articles that start with the solemn or anxious announcement that the future of programming will be multicore, and that one way or another, we will have to get used to it. I'll try not to write another one of those. In fact, it appears that multicore processors are also pretty good single-core processors, and most applications are, for the foreseeable future, comfortably single-threaded. And in a lot of cases, they should be. So is there any pressing reason to take notice of this new-fangled hardware?
    [I]t looks more or less like the hardware designers have run out of
    ideas, and that they’re trying to pass the blame for the future demise
    of Moore’s Law to the software writers by giving us machines that work
    faster only on a few key benchmarks!
That is from none other than Donald Knuth, in a recent interview. Nonetheless, I have (belatedly, I'll admit) started to get excited about incorporating the kind of 'multi-threading for the sake of multi-threading' that multiple cores encourage into my programs. Why? This might not be a very convincing reason for the serious, result-oriented programmer, but parallel programming, it turns out, is very amusing. Sure, there are dead-lock death-traps, race-conditions, and a general dangerous sense of non-determinism. But there is a logic to all of it, and a working parallel program can be a thing of beauty. That kind of beauty, and the complexity that tends to go with it, is what got me into this business in the first place. (It wasn't my love for hunching over keyboards or a deep dislike of sunlight, in any case.)
This infatuation started ― predictably enough ― with Erlang. Erlang is kind of hip, makes you think about concurrency in a whole new way, and is entirely dissatisfactory as a programming language. I won't go too deeply into that last point, since taking cheap shots at other people's work tends to hurt one's credibility, but the unpolished and limited feel of the language offended my delicate linguistic sensibilities, and prevented me from investing too deeply in it, with the result that I am still (also somewhat grudgingly, perhaps) doing most of my work in Common Lisp.
Common Lisp has no integrated, light-weight, multicore-friendly threads. In fact, it only recently started having widespread support for OS-level threads at all. This support comes, invariably, in the form of shared memory, locks, and semaphores (bordeaux-threads provides a pleasant portable wrapper). While these allow just about any kind of concurrent program to be written, they are fairly tedious and error-prone to work with. CL-MUPROC builds a message-passing system on top of them, which allows a more pleasant programming style. Every CL-MUPROC process is an OS thread though, so you won't want to create thousands of them.
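To give a feel for that lock-based style, here is a minimal sketch of my own (the names are mine, only the bt: operators are bordeaux-threads' actual API): several threads bump a shared counter, with a lock keeping each increment atomic.

(defvar *counter* 0)
(defvar *counter-lock* (bt:make-lock "counter"))

;; N-THREADS threads each increment *COUNTER* TIMES times. Without the
;; lock, concurrent INCF calls could lose updates.
(defun bump-counter (n-threads times)
  (let ((threads (loop :repeat n-threads
                       :collect (bt:make-thread
                                 (lambda ()
                                   (loop :repeat times
                                         :do (bt:with-lock-held (*counter-lock*)
                                               (incf *counter*))))))))
    (mapc #'bt:join-thread threads)
    *counter*))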
Another approach for getting away from mutexes, one that has been generating a lot of noise recently, is software transactional memory. If that doesn't mean anything to you, I'd recommend watching this video of Simon Peyton Jones explaining the idea. I was surprised to find that there apparently exists a CL-STM (aren't we CLers creative when it comes to naming projects). It seems to have been inactive since early 2007, and as I understand it, STM is rather hard to get right, so I'm not sure I'd risk building something serious on it... but it appears to work. The Clojure people seem to be making good use of STM in any case.
On a related note, SBCL exports a symbol sb-ext:compare-and-swap, which can be used to do small-scale transactional tricks, and is often an order of magnitude faster than the equivalent locking solution. Here is a very simple concurrency-safe stack. The macro is a convenient wrapper that helps use compare-and-swap in a 'transactional' way.
;; Bind the current value of PLACE to VAR, compute a new value with
;; BODY, and try to compare-and-swap it in, retrying from the fresh
;; value of PLACE whenever another thread got there first. Returns the
;; old value of PLACE.
(defmacro concurrent-update (place var &body body)
  (let ((prev (gensym)) (func (gensym)))
    `(let ((,prev ,place)
           (,func (lambda (,var) ,@body)))
       (loop :until (eq (sb-ext:compare-and-swap ,place ,prev (funcall ,func ,prev)) ,prev)
             :do (setf ,prev ,place))
       ,prev)))

(defun make-cstack (&rest elements)
  (cons :stack elements))

(defun push-cstack (element stack)
  (concurrent-update (cdr stack) elts
    (cons element elts))
  (values))

;; CONCURRENT-UPDATE returns the old element list, whose car is the
;; element that was popped.
(defun pop-cstack (stack)
  (car (concurrent-update (cdr stack) elts
         (cdr elts))))
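As a quick (and by no means conclusive) sanity check of my own, you can hammer such a stack from several threads and verify that no pushes are lost; this sketch uses SBCL's own thread interface:

;; Smoke test, not from the article: ten threads push a hundred
;; elements each. If the compare-and-swap loop is sound, the stack
;; ends up holding exactly a thousand elements.
(defun stress-cstack ()
  (let ((stack (make-cstack)))
    (mapc #'sb-thread:join-thread
          (loop :repeat 10
                :collect (sb-thread:make-thread
                          (lambda ()
                            (loop :repeat 100
                                  :do (push-cstack t stack))))))
    (length (cdr stack))))   ; => 1000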
Stacks are represented as cons cells, since compare-and-swap can only be used on a limited set of 'places' (cars, cdrs, svrefs, symbol-values), and not on things like instance or struct slots. Writing a queue is only slightly more complicated. Making popping threads block when there are no elements available, on the other hand, is decidedly tricky. (I think I have a working implementation... it is much faster than the locking variant, but it is so complicated that I'm not convinced it is correct. If you can recommend any good books or papers on low-level concurrency techniques, please drop me an e-mail.)
In the book How to Write Parallel Programs, which is old (1992) but still relevant, the authors distinguish three models of concurrency: message passing (as in Erlang), shared or distributed data structures (as in Java), and 'live' data structures, where computations turn into values after they finish running. These are all fundamentally equivalent, in that a program using one of them can always be transformed to use another, but some programs are much more natural when expressed in the right model. STM, which had not been invented at the time the book was written, would probably qualify as a special case of shared data structures.
One of these models, live data structures, has never been very popular. This might be related to the fact that they offer none of the control-flow advantages that the other models have ― you spawn a process, and then later you can read the value it produced, but this only buys you something when there is a direct advantage to computing values in parallel. On single-core machines, there isn't, but on multicores, this might be a very pleasant way to speed up some CPU-bound programs. On first seeing Haskell's par operator, which implements something like this, I was duly impressed. You just pass two computations to a little operator and, under the right circumstances, you get your results twice as fast.
Now, without further ado, I'd like to present my new library: PCall (for parallel call). It implements a thread pool and a few simple operators to get behaviour not unlike that of Haskell's par. As a silly example, it can get two seconds worth of sleep in only a single second!
(time (let ((a (pexec (sleep 1)))
            (b (pexec (sleep 1))))
        (join a)
        (join b)))
The code behind this is rather humble and simple, but it seems to work well. See the project page for more details and examples.
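To give a slightly more useful flavour, here is a sketch of the parallel Fibonacci computation the project page discusses (mine, so consult the page for the canonical version), using nothing but the same two operators:

;; fib(n-1) runs in the thread pool while the current thread computes
;; fib(n-2); JOIN then waits for the pooled task and adds the results.
(defun fib (n)
  (if (< n 2) 1 (+ (fib (- n 1)) (fib (- n 2)))))

(defun pfib (n)
  (let ((task (pexec (fib (- n 1)))))
    (+ (fib (- n 2)) (join task))))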
This kind of 'local parallelism' makes it relatively easy to verify that the tasks are using their data in a safe way ― you typically just make sure that no tasks write to the same location or read anything another task might write.
One issue that keeps coming up with this style of parallelism, an issue that the book I mentioned above also discusses, is finding the right granularity when splitting up tasks. Wrapping up a computation in a function, putting it into a queue, having some thread find it and execute it, and then reading out the results... that is obviously more work than just directly executing the computation. Thus, if the computation is small enough, you make your program slower by parallelising it.
The solution suggested by the book is to provide a 'granularity knob' ― some setting that controls the size of the computation chunks that get delegated off to other processes. When mapping some function over a list, you could for example provide a way to control the number of elements each task takes care of, and vary that based on the number of available processors and the amount of computation the given function requires. In some cases, the granularity adjustment could be done automatically by profiling the running code. I guess that is usually more trouble than it is worth, though.
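As a concrete sketch of such a knob (the name pmap-chunked is my own; only pexec and join come from PCall), a chunked parallel map could let the caller tune how many elements each pooled task handles:

;; Map FUNCTION over LIST, giving each task CHUNK-SIZE elements
;; (assumed positive). Larger chunks mean less scheduling overhead;
;; smaller chunks balance load better when per-element cost varies.
(defun pmap-chunked (function list chunk-size)
  (let ((tasks (loop :while list
                     :collect (let ((chunk (loop :repeat chunk-size
                                                 :while list
                                                 :collect (pop list))))
                                (pexec (mapcar function chunk))))))
    (loop :for task :in tasks :append (join task))))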
There are also situations where different environments call for completely different approaches. Having a central lock on something works well when there are ten threads occasionally accessing it, but becomes a bottleneck when there are a thousand. Optimistic transactions that fail when another thread interferes with their data have the same problem: they fall apart when there are too many processes. On the other hand, complex schemes that minimise the amount of conflict between threads also carry an overhead, and are often counterproductive when there are only a handful of threads. This is a less pleasant side of concurrent programming: there often is no right solution. In sequential programs, you can weigh program complexity against efficiency, and often feel you've hit some sweet spot. In concurrent programs, the sweet spot on my dual-core system might be completely stupid on both an old single-core system and a flashy sixteen-core server.
Despite this, and the claims by various people that threading is 'just too hard', I rather suspect we'll be able to work it out. As in other complicated fields, it is mostly a matter of finding better abstractions. I hope PCall can provide an abstraction suitable for some problems (though, at this point, I'm not even a hundred percent convinced there are no race conditions left in the library... do help me test it).
hunk ./doc/index.html 33
-18-05-2008:
+06-06-2008:
+05-06-2008: Added a background article with some related thoughts.
hunk ./doc/index.html 74
-I hope to have a common-lisp.net mailing list for
-the project soon. For now, feel free to drop me an e-mail
-directly: Marijn
+There is a Google group for any discussion relating to this
+project. Also feel free to drop me an e-mail directly: Marijn
hunk ./doc/index.html 82
-PCall is a rather simple library. There are three basic
+PCall is a rather simple library. There are only three basic
hunk ./doc/index.html 107
-(why not) Fibonnaci numbers:
-On my 2-core machine, that speeds things up by about a third ―
-which makes sense, since computing fib(39) is about twice as much
-work as computing fib(38). A nicer way to write the same would
-be:
+(why not) Fibonacci numbers:
+On my 2-core machine, that speeds things up by about a third
+― which makes sense, since computing fib(39) is about twice
+as much work as computing fib(38). A nicer way to write the same
+thing is:
hunk ./doc/index.html 148
-concurrently.
+concurrently with it.
hunk ./doc/index.html 181
+As a final note, PCall can also be used when a program is not
+CPU-bound, but needs to do some tasks that are hampered by other
+bottlenecks (network latency, disk speed). If they can be executed
+in parallel, you can have the thread pool run them. In the
+following example, the second version runs three times faster on
+my machine:
+(defvar *urls* '("http://marijn.haverbeke.nl/pcall" "http://common-lisp.net"
+                 "http://eloquentjavascript.net" "http://xkcd.com"))
+
+(time (mapc 'drakma:http-request *urls*))
+(time (mapc 'join (mapcar (lambda (url) (pexec (drakma:http-request url))) *urls*)))
+In some applications, doing multiple database queries at the
+same time could really help. You might need to increase the size
+of the thread pool in such a situation, since some threads will be
+sitting idle, waiting on a socket.
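Enlarging the pool might look like the line below; this is a sketch that assumes PCall exposes a setf-able thread-pool-size accessor, so check the project page for the actual interface:

;; Assumption: PCALL:THREAD-POOL-SIZE is a setf-able accessor. For
;; I/O-bound tasks a pool larger than the number of cores is harmless,
;; since the extra threads mostly sit waiting on sockets.
(setf (pcall:thread-pool-size) 10)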
hunk ./doc/style.css 103
+blockquote {
+  font-style: italic;
+}
}