[add background document, touch up docs marijnh@gmail.com**20080605202412] { addfile ./doc/background.html hunk ./doc/background.html 1 + + + + + Idle cores to the left + + + + + +

+ Idle cores to the left of me, race conditions to the right +

+

+ (or how I learned to stop worrying and embrace the deadlock) +

+ +

+ Topics: concurrency, multicore processors, Common Lisp
+ Author: Marijn Haverbeke
+ Date: June 5th 2008 +

+ +

I think that, over the past year, I've read some thirty + articles that start with the solemn or anxious announcement that + the future of programming will be multicore, and that one way or + another, we will have to get used to it. I'll try not to write + another one of those. In fact, it appears that multicore + processors are also pretty good single-core processors, and most + applications are, for the foreseeable future, comfortably + single-threaded. And in a lot of cases, they should be. So is + there any pressing reason to take notice of this new-fangled + hardware?

+ +
[I]t looks more or less like the hardware designers + have run out of ideas, and that they’re trying to pass the blame + for the future demise of Moore’s Law to the software writers by + giving us machines that work faster only on a few key + benchmarks!
+ +

That is from none other than Donald Knuth, in a recent + interview. Nonetheless, I have (belatedly, I'll admit) started + to get excited about incorporating the kind of 'multi-threading + for the sake of multi-threading' that multiple cores encourage + into my programs. Why? This might not be a very convincing reason + for the serious, result-oriented programmer, but parallel programming, + it turns out, is very amusing. Sure, there are deadlock + death-traps, race conditions, and a general dangerous sense of + non-determinism. But there is a logic to all of it, and a working + parallel program can be a thing of beauty. That kind of beauty, + and the complexity that tends to go with it, is what got me into + this business in the first place. (It wasn't my love for hunching + over keyboards or a deep dislike of sunlight, in any case.)

+ +

This infatuation started ― predictably enough ― + with Erlang. Erlang is kind of hip, makes you think about + concurrency in a whole new way, and is entirely dissatisfactory as + a programming language. I won't go too deeply into that last + point, since taking cheap shots at other people's work tends to + hurt one's credibility, but the unpolished and limited feel of + the language offended my delicate linguistic sensibilities, and + prevented me from investing too deeply in it, with the result + that I am still (also somewhat grudgingly, perhaps) doing most of + my work in Common Lisp.

+ +

Common Lisp has no integrated, light-weight, multicore-friendly + threads. In fact, it only recently started having widespread + support for OS-level threads at all. This support comes, + invariably, in the form of shared memory, locks, and semaphores + (bordeaux-threads + provides a pleasant portable wrapper). While these allow just + about any kind of concurrent program to be written, they are + fairly tedious and error-prone to work with. CL-MUPROC + builds a message-passing system on top of them, which allows a + more pleasant programming style. Every CL-MUPROC process is an OS + thread though, so you won't want to create thousands of them.
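
+ +

To make the tedium concrete, here is a minimal sketch of the locking style that bordeaux-threads supports. The bt nickname is the conventional one; the toy counter and all names here are mine, not taken from any of the libraries mentioned:

+ +
+(defvar *count* 0)
+(defvar *count-lock* (bt:make-lock "count-lock"))
+
+(defun add-to-count (n)
+  ;; every access to the shared counter has to remember to take
+  ;; the lock, or increments may be lost to a race
+  (bt:with-lock-held (*count-lock*)
+    (incf *count* n)))
+
+;; ten threads each add 1000; thanks to the lock, *count*
+;; reliably ends up at 10000
+(let ((threads (loop :repeat 10 :collect
+                 (bt:make-thread
+                   (lambda () (loop :repeat 1000 :do (add-to-count 1)))))))
+  (mapc #'bt:join-thread threads))
+ +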

+ +

Another approach for getting away from mutexes, one that has + been generating a lot of noise recently, is software transactional + memory. If that doesn't mean anything to you, I'd recommend + watching this video of + Simon Peyton Jones explaining the idea. I was surprised to find + that there apparently exists a CL-STM (aren't we + CLers creative when it comes to naming projects?). It seems to have + been inactive since early 2007, and as I understand it, STM is + rather hard to get right, so I'm not sure I'd risk building + something serious on this... but it appears to work. The + Clojure people seem + to be making good use of STM in any case.

+ +

On a related note, SBCL + exports a symbol sb-ext:compare-and-swap, which can + be used to do small-scale transactional tricks, and is often an + order of magnitude faster than the equivalent locking solution. + Here is a very simple concurrency-safe stack. The macro is a + convenient wrapper that helps use compare-and-swap in + a 'transactional' way.

+ +
+;; Bind VAR to the old value of PLACE, compute BODY, and try to
+;; install the result with compare-and-swap, retrying as long as
+;; some other thread got in between. Returns the old value.
+(defmacro concurrent-update (place var &body body)
+  (let ((prev (gensym)) (func (gensym)))
+    `(let ((,prev ,place)
+           (,func (lambda (,var) ,@body)))
+       (loop :until (eq (sb-ext:compare-and-swap ,place ,prev (funcall ,func ,prev)) ,prev)
+             :do (setf ,prev ,place))
+       ,prev)))
+
+(defun make-cstack (&rest elements)
+  (cons :stack elements))
+(defun push-cstack (element stack)
+  (concurrent-update (cdr stack) elts
+    (cons element elts))
+  (values))
+(defun pop-cstack (stack)
+  (car (concurrent-update (cdr stack) elts
+         (cdr elts))))
+ +

Stacks are represented as cons cells, since + compare-and-swap can only be used on a limited set of + 'places' (cars, cdrs, svrefs, symbol-values), and not on things + like instance or struct slots. Writing a queue is only slightly + more complicated. Making popping threads block when there are no + elements available, on the other hand, is decidedly tricky. (I + think I have a working implementation... it is much faster than + the locking variant, but it is so complicated that I'm not + convinced it is correct. If you can recommend any good books or + papers on low-level concurrency techniques, please drop me an e-mail.)
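
+ +

The same macro works on any place that compare-and-swap accepts. For instance, a lock-free hit counter living in a symbol-value cell (a toy of my own, and only safe for fixnum counts, since the macro compares old values with eq):

+ +
+(defvar *hits* 0)
+
+(defun record-hit ()
+  ;; the update is retried automatically whenever another
+  ;; thread increments *hits* between our read and our swap
+  (concurrent-update (symbol-value '*hits*) n
+    (1+ n)))
+ +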

+ +

In the book How to + Write Parallel Programs, which is old (1992) but still + relevant, the authors distinguish three models of concurrency: + Message passing (as in Erlang), shared or distributed data + structures (as in Java), and 'live' data structures, where + computations turn into values after they finish running. These are + all fundamentally equivalent, in that a program using one of them + can always be transformed to use another, but some programs are + much more natural when expressed in the right model. STM, which + had not been invented at the time the book was written, would + probably qualify as a special case of shared data structures.

+ +

One of these models, live data structures, has never been very + popular. This might be related to the fact that they offer none of + the control-flow advantages that the other models have ― + you spawn a process, and then later you can read the value it + produced, but this only buys you something when there is a direct + advantage to computing values in parallel. On single-core + machines, there isn't, but on multicores, this might be a very + pleasant way to speed up some CPU-bound programs. On first seeing + Haskell's par operator, which implements something + like this, I was duly impressed. You just pass two computations to + a little operator and, under the right circumstances, you get your + results twice as fast.

+ +

Now, without further ado, I'd like to present my new library: + PCall (for parallel call). It implements a thread pool and a few + simple operators to get behaviour not unlike that of Haskell's + par. As a silly example, it can get two seconds' worth + of sleep in only a single second!

+ +
+(time (let ((a (pexec (sleep 1)))
+            (b (pexec (sleep 1))))
+        (join a)
+        (join b)))
+ +

The code behind this is rather humble and simple, but it seems + to work well. See the project page for more + details and examples.

+ +

This kind of 'local parallelism' makes it relatively easy to + verify that the tasks are using their data in a safe way ― + you typically just make sure that no tasks write to the same + location or read anything another task might write.
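
+ +

A naive parallel map shows the pattern (a sketch of mine built on top of PCall's pexec and join, not part of the library itself):

+ +
+(defun naive-pmap (function list)
+  ;; each task reads only its own element and writes nothing
+  ;; shared, so no locking is needed anywhere
+  (mapcar #'join
+          (mapcar (lambda (x) (pexec (funcall function x)))
+                  list)))
+ +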

+ +

One issue that keeps coming up with this style of parallelism, + an issue that the book I mentioned above also discusses, is + finding the right granularity when splitting up tasks. Wrapping up + a computation in a function, putting it into a queue, having some + thread find it and execute it, and then reading out the results... + that is obviously more work than just directly executing the + computation. Thus, if the computation is small enough, you end up making + your program slower by parallelising it.

+ +

The solution suggested by the book is to provide a 'granularity + knob' ― some setting that controls the size of the + computation chunks that get delegated off to other processes. When + mapping some function over a list, you could for example provide a + way to control the number of elements each task takes care of, and + vary that based on the number of available processors and the + amount of computation the given function requires. In some cases, + the granularity adjustment could be done automatically by + profiling the running code. I guess that is usually more trouble + than it is worth, though.
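
+ +

Such a knob might look like this ― again my own sketch on top of pexec and join, with an arbitrary default chunk size:

+ +
+(defun chunk-list (list size)
+  ;; split LIST into sublists of at most SIZE elements
+  (loop :while list
+        :collect (loop :repeat size :while list :collect (pop list))))
+
+(defun chunked-pmap (function list &key (chunk-size 64))
+  ;; one task per chunk of elements, rather than one per element,
+  ;; so the queueing overhead is paid far less often
+  (reduce #'append
+          (mapcar #'join
+                  (mapcar (lambda (part) (pexec (mapcar function part)))
+                          (chunk-list list chunk-size)))))
+ +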

+ +

There are also situations where different environments call for + completely different approaches. Having a central lock on + something works well when there are ten threads occasionally + accessing it, but becomes a bottleneck when there are a thousand. + Optimistic transactions that fail when another thread interferes + with their data have the same problem: They fall apart when there + are too many processes. On the other hand, complex schemes that + minimise the amount of conflict between threads also carry an + overhead, and are often counterproductive when there are only a + handful of threads. This is a less pleasant side of concurrent + programming: There often is no right solution. In sequential + programs, you can weigh program complexity against efficiency, + and often feel you've hit some sweet spot. In concurrent programs, + the sweet spot on my dual-core system might be completely stupid + on both an old single-core system and a flashy sixteen-core + server.

+ +

Despite this, and the claims by various people that + threading is 'just too hard', I rather suspect we'll be able to + work it out. As in other complicated fields, it is mostly a matter + of finding better abstractions. I hope PCall can provide an + abstraction suitable for some problems (though, at this point, I'm + not even a hundred percent convinced there are no race conditions + left in the library... do help me test it).

+ + + hunk ./doc/index.html 33 -

18-05-2008: 06-06-2008: 05-06-2008: Added a background article with some related + thoughts.

+ hunk ./doc/index.html 74 -

I hope to have a common-lisp.net mailing list for - the project soon. For now, feel free to drop me an e-mail - directly: Marijn +

There is a Google group for + any discussion relating to this project. Also feel free to drop me an + e-mail directly: Marijn hunk ./doc/index.html 82 -

PCall is a rather simple library. There are three basic +

PCall is a rather simple library. There are only three basic hunk ./doc/index.html 107 - (why not) Fibonnaci numbers:

+ (once again) Fibonacci numbers:

hunk ./doc/index.html 118 - seconds. We don't have that kind of patience, and you can see that - this algorithm is entirely optimal. Our only option, it seems, is - to use more hardware ― or make better use of the hardware we - have:

+ seconds. We don't have that kind of patience. You can see that + this algorithm is entirely optimal, so our only option, it seems, + is to use more hardware ― or make better use of the + hardware we have:

hunk ./doc/index.html 128 -

On my 2-core machine, that speeds things up by about a third ― - which makes sense, since computing fib(39) is about twice as much - work as computing fib(38). A nicer way to write the same would - be:

+

On my 2-core machine, that speeds things up by about a third + ― which makes sense, since computing fib(39) is about twice + as much work as computing fib(38). A nicer way to write the same + thing is:

hunk ./doc/index.html 148 - concurrently.

+ concurrently with it.

hunk ./doc/index.html 181 +

As a final note, PCall can also be used when a program is not + CPU-bound, but needs to do some tasks that are hampered by other + bottlenecks (network latency, disk speed). If they can be executed + in parallel, you can have the thread pool run them. In the + following example, the second version runs three times faster on + my machine:

+ +
+(defvar *urls* '("http://marijn.haverbeke.nl/pcall" "http://common-lisp.net"
+                 "http://eloquentjavascript.net" "http://xkcd.com"))
+
+(time (mapc 'drakma:http-request *urls*))
+(time (mapc 'join (mapcar (lambda (url) (pexec (drakma:http-request url))) *urls*)))
+ +

In some applications, doing multiple database queries at the + same time could really help. You might need to increase the size + of the thread pool in such a situation, since some threads will be + sitting idle, waiting on a socket.
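
+ +

For instance, something along these lines (I am assuming a setf-able thread-pool-size accessor here; treat the name as an assumption and adjust to the actual interface):

+ +
+;; more threads than cores is reasonable for I/O-bound work,
+;; since most of them will be blocked on sockets, not computing
+(setf (thread-pool-size) 10)
+ +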

+ hunk ./doc/style.css 103 +blockquote { + font-style: italic; +} }